Log Codes are finally here :)

Log codes

I enforce them in all the systems I have helped create in my professional work… but I completely forgot about them when starting Spider. And so, technical debt…

What? Error codes! Or more specifically: log codes.

And missing them was painful:

  • Log codes make it quick to group logs that share the same meaning or cause
  • This is very useful:
    • To discard them
    • To analyse a couple of them and check that no other errors are hiding among them
    • To group common issues from several services into one (e.g. an infra issue)

Spider implementation

So, during code refactoring,

  • I added log codes to all Spider logs at the Info, Warn, Error and Fatal levels.
  • Pattern: [Service code]-[Feature]-001+
  • The service code is replaced by XXX when the log comes from common code.
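
To give an idea, here is an illustrative helper (not Spider's actual implementation) that prefixes every log with its code:

```js
// Illustrative only: a tiny logging wrapper that enforces the code convention.
// Pattern: [Service code]-[Feature]-[Number], e.g. PAR-HTTP-001; XXX-REDIS-001 for common code.
const log = (level, code, message, meta = {}) =>
  console.log(JSON.stringify({ time: new Date().toISOString(), level, code, message, ...meta }));

log('error', 'PAR-HTTP-001', 'Parsing failed: missing packets', { sessionId: 'abc' });
log('warn', 'XXX-REDIS-001', 'Reconnected to Redis after connection loss');
```

Searching on the code then returns every occurrence of that exact situation, whatever the wording of the message.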

I then added a new grid in the monitoring that shows a quick, small analysis report on those logs.

And at a glance, I’m able to see what happened and whether anything serious from the previous day needs to be looked at 😉
Handy, isn’t it? I like it a lot!

Demo

For instance, today at 11:21:

  • Many errors were generated during the parsing process
  • At the same time, 25 service instances reconnected to Elasticsearch and Redis at once…
  • This looks like a server restart

Indeed, looking at the servers’ stats,

  • Spider3 server restarted, and its services were reallocated to other nodes:
    • Huge CPU decrease and free RAM increase on Spider 3
    • CPU increase and memory decrease on other nodes

This is confirmed on the grids.

Before

After

Why this restart? Who knows? Some AWS issue maybe…
Anyway, who cares 😉 !
The system handled it gracefully.

  • I checked on the server itself… but its uptime went back several days. So the server itself did not restart…
  • I then checked the Docker daemon logs (journalctl -u docker.service):

  • msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
  • msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent
  • msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=75fnw7bp17gcytn91d0apaq5t
  • msg="manager selected by agent for new session: { }" module=node/agent node.id=75fnw7bp17gcytn91d0apaq5t
  • level=info msg="waiting 50.073299ms before registering session"

Looks like this node had an issue communicating with the Swarm manager for some time, and the services were reallocated… Sounds like a network partition on AWS.

They say you have to make sure your architecture is resilient in the Cloud… Done 😉

However, I still need to rebalance the services manually… for now 🙁

Back Office code refactoring and configuration loading

Backoffice code refactoring

I recently spent a couple of weeks refactoring the Back Office code.

Nothing is visible on the UI, but maintenance and evolution of the microservices will be much easier:

  • Redis and Elasticsearch DAO
  • Configuration loading process
  • API exposure code
  • Circuit breakers initialization (a sketch follows this list)
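
For the circuit breaker part, here is an illustrative initialization sketch. Spider may well use a different library; this one uses opossum and ioredis, with made-up thresholds:

```js
// Illustrative circuit breaker around a Redis call (not Spider's actual code).
const CircuitBreaker = require('opossum');
const Redis = require('ioredis');

const redis = new Redis();
const getFromRedis = (key) => redis.get(key);

const breaker = new CircuitBreaker(getFromRedis, {
  timeout: 500,                 // consider the call failed after 500 ms
  errorThresholdPercentage: 50, // open when half of the recent calls fail
  resetTimeout: 10000,          // go half-open and retry after 10 s
});

breaker.fallback(() => null);   // degrade gracefully instead of piling up errors
breaker.on('open', () => console.warn('XXX-REDIS-001 circuit breaker to Redis opened'));

breaker.fire('some-key').then(console.log);
```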

All these parts remain very flexible and can be adapted to each service's needs, but the common code has been extracted into shared files used across the services 🙂

No regression noticed for now =)

Configuration reloading

There was an issue with the current setup procedure:

  • The whole upgrade process was automated at the beginning of the year.
  • But you had to look at the updated files after an update or configuration change to know which services needed restarting.
  • Or you had to stop and restart the full cluster.

Now, each service monitors its own configuration for changes every 2 minutes (configurable). If a change is detected, the service restarts gracefully 🙂
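
Here is a minimal sketch of the idea (not Spider's actual code), assuming the configuration lives in a JSON file and a content hash is enough to detect changes:

```js
// Poll the configuration file and restart gracefully when it changes.
const fs = require('fs');
const http = require('http');
const crypto = require('crypto');

const CONFIG_PATH = process.env.CONFIG_PATH || './config.json'; // hypothetical location
const CHECK_INTERVAL_MS = 2 * 60 * 1000;                        // every 2 minutes, as in Spider

const hashConfig = () =>
  crypto.createHash('sha256').update(fs.readFileSync(CONFIG_PATH)).digest('hex');

const server = http.createServer((req, res) => res.end('ok')).listen(3000);
let currentHash = hashConfig();

setInterval(() => {
  if (hashConfig() !== currentHash) {
    console.log('Configuration changed, restarting gracefully…');
    // Stop accepting new connections, let in-flight requests finish, then exit:
    // the orchestrator (Docker Swarm here) starts a fresh instance with the new configuration.
    server.close(() => process.exit(0));
  }
}, CHECK_INTERVAL_MS);
```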

Sadly, it does not prevent errors from happening, since the restart should first be handled by the gateway: the latter keeps trying to call the removed service for a short while.

Nevertheless, upgrading the platform is easier than ever 😀 !

Timeline now shows communications over their whole duration

Long-standing issue

There had been an issue from the beginning with the Timeline component and its representation:
  • Items that have a duration, like statuses, HTTP communications, TCP sessions and so on, were only represented on the timeline at their starting point.
  • Doing better with a single query was highly costly, if possible at all.
This meant that you could see a hole in the Timeline when, in fact, there were ongoing communications during that time.

Things have changed: Elasticsearch improved!

  • With ES 5.x, Elastic introduced the ‘range’ data type, which allows defining a range of numeric values or a range of time (a duration, in other words).
  • With ES 7.4, Elastic upgraded their histogram aggregations to handle ranges 🙂 Yeah!
As the Elasticsearch documentation explains:
For range values, a document can fall into multiple buckets. The first bucket is computed from the lower bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
What it means is that, when performing an aggregation over a range field, ES collects in each histogram bucket all the items whose range intersects with that bucket.
This way, a long-open status appears in every minute / hour / day during which it is open, which is much better for the graphical representation 🙂
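
Here is a minimal sketch of the idea with the v7 Node.js Elasticsearch client; the index and field names are illustrative, not Spider's actual mapping:

```js
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

async function demo() {
  // 1. Map the communication period as a date_range field (available since ES 5.x).
  await es.indices.putMapping({
    index: 'http-communications',
    body: {
      properties: {
        period: { type: 'date_range', format: 'strict_date_optional_time||epoch_millis' },
      },
    },
  });

  // 2. Aggregate on it (ES >= 7.4): a document is counted in every bucket its range
  //    intersects, so a long request shows up in all the minutes it spans.
  const { body } = await es.search({
    index: 'http-communications',
    body: {
      size: 0,
      aggs: { timeline: { date_histogram: { field: 'period', fixed_interval: '1m' } } },
    },
  });
  console.log(body.aggregations.timeline.buckets.length, 'buckets');
}

demo().catch(console.error);
```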

Implementation in Spider

  • Nothing changed on the Timeline component, nor in the UI queries, but I introduced date range fields in the resources, and now all search queries use them.
  • And… this is even a bit faster 😉

Demo

We can see a long POST /create_session request, lasting 2.7s.

GUI now uses Dates not Timestamps

The service APIs accept both dates and timestamps. See the OpenAPI specification.
I recently updated the GUI to use dates instead of timestamps when calling the APIs.

Why?

Because dates are much easier to read in the logs =)
And the GUI does not need the microsecond precision that timestamps allow and dates do not.
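
For illustration only (the endpoint and parameter names are hypothetical, not Spider's actual API), the same time filter expressed both ways:

```js
// Timestamp style: microsecond precision, but painful to read in logs.
const byTimestamp = '/httpcoms?from=1581325260000000&to=1581328860000000';

// Date style: what the GUI now sends; easy to read, and precise enough for its needs.
const byDate = '/httpcoms?from=2020-02-10T09:01:00Z&to=2020-02-10T10:01:00Z';
```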

Small but useful change 😉

Upgrade to Node 12 – and new metrics

Node 12 upgrade

I recently upgraded all Spider backend services and the Whisperer client to Node 12, and all related Node.js dependencies to their latest versions, thus reducing the technical debt 🙂

Node 12 is supposed to be faster at async/await than Node 10.
What does it give us?

In general Node 12 is indeed faster than Node 10 🙂 !
The improvement is quite interesting.

Impact of event loop on perceived performance

I also noticed that the response time tracked by the parsing service (Web-Write) when calling the Packets service increases way too much when the parsing load increases.

After some consideration, I figured out that this was due to the Node.js event loop:

  • Since WebWrite sends many requests in parallel to PackRead, even if PackRead is fast, the event loop and single-threaded architecture of Node.js mean that Node can take a long time before running the code responsible for tracking the response time of dependencies.
  • I then developed new metrics capture at the API level, on the server side. And thus I now have both:
    • The real performance of Spider APIs – the time to generate the responses
    • The performance of processing the exchanges from the client's perspective

The result is quite interesting:

Spider performance is great 🙂
The event loop has a big impact!
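
As an illustration of the server-side capture, here is a minimal sketch assuming an Express-style service (this is not Spider's actual code):

```js
const express = require('express');
const app = express();

// Measure the time to produce the response on the server itself, so the metric
// is not inflated by the caller's busy event loop.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${req.method} ${req.path} served in ${ms.toFixed(1)}ms`);
  });
  next();
});

app.get('/health', (req, res) => res.json({ ok: true }));
app.listen(3000);
```

Comparing this server-side figure with the one measured in the caller makes the event-loop overhead visible.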

Cheers,
Thibaut

Upgrading Traefik v1.7 -> v2.1

This weekend, I upgraded Traefik from v1.7 to v2.1.

Traefik is used in Spider as the edge router as well as the intra-cluster load balancer.
When I switched from NGINX to Traefik one year ago, there was already a huge improvement in performance and response time stability.

Now, the new version is even better! 🙂

Here is the comparison before and after the migration, over a 4-hour run:

In short, there is an average improvement of 30% in intra-cluster response times, with a slight improvement in edge router response times.

Do you still wonder if you should change?

Resiliency of the Spider system under load :)

Spider endured an unwanted Stress test this morning!
Thankfully, it proved that the system behaves well, and it allowed me to confirm one bug =)
Look what I got from the monitoring:
Load went from 290k/min to 380k/min between 9:00am and 9:20am:
There were so many sessions to parse at once that the parsing queue was hard to empty:
This generated many parsing errors, due to missing packets (they have a short lifetime).
But all went back to normal afterwards 🙂
CPU and RAM were OK, so I bet the bottleneck was the configuration: a limited number of parsers to absorb the load.
But there is a bug: the Redis store does not get completely back to normal; some elements stay in the store and are not deleted:
That would prevent the system from absorbing too many spikes like this in a row.
Also, some circuit breakers to Redis opened during the spike (only 4…):
Under this load, the two main Redis instances were each enduring 13k requests per second (26k/s combined), with a spike at 31k/s. For only… 13% CPU each. Impressive!
And all this was due to a load test in SPT1 (the 4 Whisperers at the bottom).
Conclusion:
  • System is resilient enough 🙂
  • Observability is great and important !!
Bonus: the summary dashboard:
  • A spike at 600k packets/min.
  • Many warnings in the logs showing something went bad.
  • CPU on an applicative node went red for some time, and on an ES node as well.
  • Errors on 3 services that may not be scaled enough for this load.
  • And we see the circuit breaker errors due to Redis suffering.
Really helpful!

Replica reduction after performance optimisations

I had a task that had been hanging around for months:

Reducing the replica and server counts as a consequence of all the performance improvements.

Indeed, the Spider cluster was still using a replica configuration built when the services were implemented in Node 8, with generators, and with NGINX as the gateway.

At that time, I needed many replicas to overcome some NGINX flaws under high load within Docker.
Except that now… the cluster is handling between 2x and 3x more load, and constantly!

So I did the work of computing a theoretical replica count, and reduced the replicas until it felt reasonable.

Overall results

And the result is great:

  • From 221 cumulative replicas across the operational services,
  • It got reduced to… 36! 🙂 I removed 185 replicas from the cluster!

Globally, it is interesting to note that removing replicas:

  • Drastically reduced RAM usage, which was the goal: -14 GB
  • Also decreased the total CPU usage.
    • I think this is due to less CPU overhead from process context switching.

Thanks to the memory reduction, I moved from 7 M5 AWS instances to 5 C5 instances, which are cheaper with better CPU (but only 4 GB of RAM).
I may still remove one more server, because the average load is 90% CPU and 2 GB of RAM. But I’ll wait some time to be sure.

Detailed step by step

The replica reduction was done step by step, with metrics saved at each step.

  • You may notice that reducing replicas made the services faster!! This could be linked to the reduction in context switching…
  • And then, moving from 7 servers to 5 servers increased the latency again. For sure… when a server does almost nothing, it is at its fastest 😉
    • There was also an intermediate step, not shown in this table, with 7 C5 instances. But latency barely changed.

Traefik issue

After reducing replicas, I encountered a strange issue:

There were some sporadic 502 error messages from one service calling another. But the error message was captured in the caller; the callee never received the calls!

And indeed, the issue was in the Traefik gateway. The case is not so frequent, but it is due to the big difference in keep-alive socket timeouts between Traefik and Node.
Traefik's is 90s, Node's… 5s. Once the Node timeout was configured to be longer than Traefik's, the 502s disappeared.
Reference: https://github.com/containous/traefik/issues/3237
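
For reference, a minimal sketch of the fix on the Node side; the 95s value is an assumption, anything comfortably above Traefik's 90s idle timeout should do:

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Keep idle sockets open longer than the gateway does (Traefik idles them at 90s),
// so Traefik never reuses a connection that Node has already closed.
server.keepAliveTimeout = 95 * 1000;
// Keep headersTimeout above keepAliveTimeout, otherwise Node may still drop the socket first.
server.headersTimeout = 100 * 1000;

server.listen(3000);
```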

Automated setup ! :-)

Spider configuration craziness

I figured out that if I wanted easy adoption of Spider, I had to improve the setup.

Indeed, the way to set up Spider was cumbersome! While the infra (Elasticsearch cluster, applicative Swarm cluster) can be automated with Terraform and Ansible, the cluster configuration involves many assets:

  • Microservices configuration: 34 files
  • Filebeat & Metricbeat configuration: 2 files
  • JWT keys
  • SMTP configuration
  • Traefik configuration
  • Elasticsearch indices provisioning (and maintenance)
  • Swarm cluster configuration: 10 files

All those files have parameters whose values and links must stay consistent, with a lot of potential for failure in case of wrong configuration.
Moreover, updating the configuration after each evolution was a real pain, involving Git history, diffs and so on.

So I decided to automate Spider setup and update.

Result

The result is… WoW 🙂 (not Blizzard's, my own :-p)
Here is the ‘minimal’ configuration file to have a running Spider installation, from scratch:

22 lines 🙂
Neat !

How it works

Initial setup

The setup instructions are summarized in 4 steps/commands, and all the configuration is now in a single file. From this file, a bunch of scripts:

  • Validate the configuration
  • Take the configuration templates and create the environment's own configuration from them
  • Provision or update the indices
  • Create the original admin account
  • And start the cluster 🙂

Here are the setup instructions:

Configuration templates are available in a Git repo that will receive updates for all new features.

Upgrades

  • ‘make update’ rebases the local repo on the trunk
  • After which a ‘make config db cluster’ is enough to update the current installation.
  • And of course, if new required setup values appear in setup.yml, the user is told and the update stops.

The indices update is automatic and based on metadata added to the indices. It handles parent templates, ILM policies, rolling over indices, rolling up indices and static indices.

  • Rolling over, reindexing and cache purging are all handled programmatically and automatically. This is so good =)
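
As an illustration of the metadata-driven approach (the template name, write alias and _meta convention below are assumptions, not Spider's actual implementation), with the v7 Node.js Elasticsearch client:

```js
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

// Desired state of an index template, versioned through mappings._meta.
const desiredTemplate = {
  index_patterns: ['spider-logs-*'],
  settings: { number_of_shards: 1 },
  mappings: { _meta: { version: 7 }, properties: { code: { type: 'keyword' } } },
};

async function upgradeTemplate(name) {
  // Read the installed template, if any, and compare versions.
  let installedVersion = 0;
  try {
    const { body } = await es.indices.getTemplate({ name });
    installedVersion = (body[name].mappings._meta || {}).version || 0;
  } catch (e) { /* template not installed yet */ }

  if (installedVersion < desiredTemplate.mappings._meta.version) {
    await es.indices.putTemplate({ name, body: desiredTemplate });
    // Roll over so new writes land in an index built from the new template
    // (assuming 'spider-logs-write' is the write alias of the rolling index).
    await es.indices.rollover({ alias: 'spider-logs-write' });
  }
}

upgradeTemplate('spider-logs').catch(console.error);
```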

 

What do you think? Do you like it?
I did a demo of the ideas and their execution to the Streetsmart DevOps team today… and I think we’ll target the same kind of improvement for Streetsmart! Yeah 🙂

New saving options – TcpSessions and Http parsing resources – leading to a huge optimisation !

New saving options

Thanks to the upgrade to Redis 5.0.7, a bug that I’d been tracking for months has been solved (it was in Redis 😉 ), and parsing quality is now at 100% 🙂 apart from client-side issues.
So I moved on to the next phase: it is now possible not to save parsing resources any more, leading to a huge optimisation in Elasticsearch usage…

For this, I recently added two new options to the Whisperer configuration:

  • Save Tcpsessions
  • Save Http parsing resources

No need to save those?

Since in most cases, at least on Streetsmart, we do not work with TCP sessions and HTTP parsing resources, there is no point in saving them to Elasticsearch.

  • These resources are indeed only used to track parsing quality and to troubleshoot specific errors.
  • They are also saved to Elasticsearch for error recovery… but as there are no more errors… it is entirely possible to skip their storage.

I had to change the serialization logic in different places, but after a couple of days, it is now stable and working perfectly 🙂

The whole packet parsing process is thus done in memory, in real time, distributed across many web services in the cluster, using only Redis as a temporary store, moving around 300,000 packets per minute 24/7, amounting to almost 100 MB per minute!
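
To give an idea of the pattern (key names, TTL and client library are assumptions, not Spider's actual code), here is a sketch of Redis used as a short-lived packet store with ioredis:

```js
const Redis = require('ioredis');
const redis = new Redis();

// Writer side: packets are kept only for a short time; Elasticsearch is no longer involved.
async function storePacket(sessionId, seq, payload) {
  await redis.set(`packets:${sessionId}:${seq}`, payload, 'EX', 60); // short lifetime: parsed or lost
}

// Parser side: fetch the packets of a session, hand them to the parser, then clean up.
async function consumeSession(sessionId) {
  const keys = await redis.keys(`packets:${sessionId}:*`); // fine for a sketch; use SCAN in production
  if (keys.length === 0) return null;
  const packets = await redis.mget(keys);
  await redis.del(keys);
  return packets;
}
```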

That’s beautiful =)

What’s the result?

  • Less storage cost
  • Less CPU cost for Elasticsearch indexing
  • More availability of Elasticsearch for UI searches 🙂 You must have noticed fewer timeouts these days?

On the UI, when these resources are not available, their hypermedia links are not shown, and a new warning is displayed when moving to the TCP / Packets view:

Currently, TCP sessions and packets are only saved on the SIT1 and PRM Integration platforms. For… non-regression 😉

One drawback though

Those resources were rolled up in Elasticsearch to provide the data for the parsing quality line at the bottom of the Timeline. :-/

As I got this evolution idea AFTER the quality line implementation, I have nothing to replace it yet.
So I designed a new way to save parsing quality information during the parsing, in a pre-aggregated format. Development will start soon!

It is important for transparency as well as for non-regression checks.