
Log Codes are finally here :)

· 3 min read

Log codes​

I enforce them in all the systems I helped create in my professional work... but I completely forgot about them when starting Spider. And so, technical debt...

What? Error codes! Or more specifically Log codes.

And not having them was painful:

  • Log codes help to quickly group logs that share the same meaning or cause
  • This is very useful:
    • To discard them
    • To analyse a couple of them and check that no other errors are hiding behind
    • To regroup common issues from several services into a single one (e.g. an infrastructure issue)

Spider implementation​

So, during code refactoring,

  • I added a log code to all Spider logs at Info, Warn, Error and Fatal levels.
  • Pattern: [Service code]-[Feature]-001+
  • The service code is replaced by XXX when the log comes from common code (a quick sketch of the idea is shown below).
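
Here is a minimal sketch of how such coded logging could look in a Node.js service. It is only an illustration under assumptions: the logCode helper, the WWR service code, the feature names and the console-based logger are hypothetical, not Spider's actual implementation.

```js
// Hypothetical sketch: build log codes following the [Service code]-[Feature]-NNN pattern.
const SERVICE_CODE = 'WWR'; // assumed code for a hypothetical parsing service

function logCode(feature, number) {
  // e.g. logCode('PARSE', 1) -> "WWR-PARSE-001"
  return `${SERVICE_CODE}-${feature}-${String(number).padStart(3, '0')}`;
}

function logError(code, message, details) {
  // A real implementation would go through the service's structured logger.
  console.error(JSON.stringify({ level: 'error', code, message, details }));
}

// With the code attached, all "missing packets" errors can be grouped with one query.
logError(logCode('PARSE', 1), 'Parsing failed: missing packets', { sessionId: 'abc123' });
```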

I then added a new grid in monitoring that shows a quick, compact analysis report on those logs.

And at a glance, I'm able to see what happened and whether anything serious from the day before needs to be looked at ;) Handy, isn't it? I like it a lot!

Demo​

For instance, today, at 11:21:

  • Many errors were generated during the parsing process
  • At the same time, 25 service instances reconnected to Elasticsearch and Redis at once...
  • This looks like a server restart

Indeed, looking at the server stats,

  • Spider3 server restarted, and its services were reallocated to other nodes:
    • Huge CPU decrease and free RAM increase on Spider 3
    • CPU increase and memory decrease on other nodes

This is confirmed on the grids.

Before

After

Why this restart? Who knows? Some AWS issue maybe... Anyway, who cares ;-) ! The system handled it gracefully.

  • I checked on the server itself... but the uptime went back days. So the server itself did not restart...
  • I then checked the Docker daemon logs (journalctl -u docker.service):

  • msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
  • msg="heartbeat to manager failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/ag
  • msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=75fnw7bp17gcyt
  • msg="manager selected by agent for new session: " module=node/agent node.id=75fnw7bp17gcytn91d0apaq5t vel=info msg="waiting 50.073299ms before registering session"

Looks like this node had an issue communicating with the Swarm manager for some time, and the services were reallocated... Sounds like a network partition on AWS.

They say you have to make sure your architecture is resilient in the Cloud... Done ;-)

However, I still need to rebalance the services manually... for now :(

Timeline now shows communications over their whole duration

· 2 min read

Long going issue​

There had been an issue with the Timeline component and its representation from the beginning:

  • Items that have a duration, like Statuses, Http coms, TCP sessions and so on, were only represented on the timeline at their starting point.
  • Doing better with a single query was highly costly, if possible at all.

This meant that you could see a hole in the Timeline when, in fact, there were ongoing communications during that time.

Things have changed: Elasticsearch improved!​

  • With ES 5.x, Elastic introduced the 'range' data type, which allows defining a range of numeric values or a range of time (in other words, a duration).
  • With ES 7.4, Elastic upgraded their histogram aggregations to handle ranges :) Yeah!

For range values, a document can fall into multiple buckets. The first bucket is computed from the lower bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same way from the upper bound of the range, and the range is counted in all buckets in between and including those two.

Reference: search-aggregations-bucket-histogram-aggregation

What it means is that, when performing an aggregation over a range field, ES collects, in each histogram bucket, all the items whose range intersects that bucket.

In doing so, a long-open status will appear in all the minutes / hours / days during which it is open, which is much better for the graphical representation :)
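
To make this concrete, here is a minimal sketch of such a mapping and aggregation using the Node.js Elasticsearch client; the http_coms index and the period field are hypothetical names, not Spider's actual resources.

```js
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function demo() {
  // A date_range field holding the start and end of each communication.
  await client.indices.create({
    index: 'http_coms',
    body: {
      mappings: {
        properties: {
          period: { type: 'date_range', format: 'epoch_millis' },
        },
      },
    },
  });

  // A date_histogram over the range field: each communication is counted in
  // every minute bucket its period intersects, not only in the bucket of its start.
  const { body } = await client.search({
    index: 'http_coms',
    body: {
      size: 0,
      aggs: {
        per_minute: {
          date_histogram: { field: 'period', fixed_interval: '1m' },
        },
      },
    },
  });
  console.log(body.aggregations.per_minute.buckets);
}

demo().catch(console.error);
```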

Implementation in Spider​

  • Nothing changed on the Timeline component, nor in the UI queries, but I introduced date range fields in the resources, and now all search queries use them.
  • And... this is even a bit faster ;-)

Demo​

We can see a long POST /create_session request, lasting 2.7s.

Setup improvement: Manage system versions in installation and update.

· One min read

System versions​

In order to facilitate Spider setup and updates, I introduced system versions. They allow external installations to stay on a stable version for some time.

Generation​

On request, a script generates a system version by tagging together (a sketch of the idea follows the list):

  • Docker images of Services, UI and Setup scripts
  • Configuration templates
  • Indices templates
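
As an illustration of the idea (not the actual Spider script), here is a minimal Node.js sketch of the tagging step; the image names and the SYSTEM_VERSION variable are assumptions.

```js
// Hypothetical sketch: tag a set of service images with a common system version.
const { execSync } = require('child_process');

const systemVersion = process.env.SYSTEM_VERSION || '1.0.0'; // assumed version scheme
const images = ['spider/services', 'spider/ui', 'spider/setup']; // hypothetical image names

for (const image of images) {
  // Re-tag the latest build with the system version and push it to the registry.
  execSync(`docker tag ${image}:latest ${image}:${systemVersion}`, { stdio: 'inherit' });
  execSync(`docker push ${image}:${systemVersion}`, { stdio: 'inherit' });
}
console.log(`Tagged ${images.length} images with system version ${systemVersion}`);
```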

Using them​

Then, the Makefile scripts of the infrastructure repository allow you to:

  • List available system versions
  • Install or upgrade to a particular version

This is only the beginning for now, but setup is getting easier and easier :)

Upgrade to Node 12 - and new metrics

· 2 min read

Node 12 upgrade​

I recently upgraded all Spider backend services and the Whisperer client to Node 12, and all linked Node.js dependencies to their latest versions, thus reducing the technical debt :-)

Node 12 is supposed to be faster at async/await than Node 10. What does that give us?

In general, Node 12 is indeed faster than Node 10 :)! The improvement is quite interesting.

Impact of event loop on perceived performance​

I also noticed that the response time tracked by the parsing service (Web-Write) when calling the Packets service increases far too much when the parsing load increases.

After some consideration, I figured out that this was due to the Node.js event loop:

  • Since WebWrite sends many requests in parallel to PackRead, even if PackRead is fast, the event loop and single-threaded architecture of Node.js mean that Node can take a long time before running the code responsible for tracking the response time of dependencies.
  • I then developed new metrics capture at the API level, on the server side (see the sketch after this list). And thus I now have both:
    • The real performance of Spider APIs - the time to generate the responses
    • The performance of processing the exchanges from the client's perspective
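
Here is a minimal sketch of what server-side response time capture can look like with an Express-style middleware; the reportMetric function and the metric name are placeholders, not Spider's actual metrics pipeline.

```js
// Hypothetical sketch: measure the time the server takes to produce a response,
// independently of how long the caller's event loop takes to process it.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    reportMetric('api_response_time_ms', durationMs, { route: req.path });
  });
  next();
});

// Stand-in for a real metrics backend.
function reportMetric(name, value, tags) {
  console.log(name, value.toFixed(2), tags);
}

app.get('/packets', (req, res) => res.json({ ok: true }));
app.listen(3000);
```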

The result is quite interesting:

Spider performance is great :-) The event loop has a big impact!

Cheers, Thibaut

Upgrading Traefik v1.7 -> v2.1

· One min read

This weekend, I upgraded Traefik from v1.7 to v2.1.

Traefik is used in Spider as the edge router as well as the intra-cluster load balancer. When I switched from NGINX to Traefik one year ago, there was a huge improvement in performance and response time stability.

Now, the new version is even better! :)

Here is the comparison before and after the migration, over a 4h run:

In short, there is an average improvement of 30% in intra-cluster response times, with a slight improvement in edge router response times.

Do you still wonder whether you should switch?

Resiliency of Spider system under load :)

· 2 min read

Spider endured an unwanted Stress test this morning!

Thankfully, it proved that the system behaves well, and it allowed me to confirm one bug =)

Look what I got from the monitoring:

Load went from 290k/min to 380k/min between 9:00am and 9:20am:

There were so many sessions to parse at once that the parsing queue was hard to empty:

This generated many parsing errors, due to missing packets (short lifetime):

But all went back to normal afterwards :)

CPU and RAM were OK, so I bet the bottleneck was the configuration: a limited number of parsers to absorb the load.

But there is a bug: the Redis store does not get back to normal completely; some elements stay in the store and are not deleted:

That will prevent the system from absorbing too many spikes like this in a row.

Also, some Circuit breakers to Redis opened during the spike (only 4...):

Under this load, the two main Redis instances were each enduring 13k requests per second (26k/s combined), with a spike at 31k/s. For only... 13% CPU each. Impressive!

And all this was due to a load test in SPT1 (the 4 Whisperers at the bottom).

Conclusion:

  • The system is resilient enough :)
  • Observability is great and important!!

Bonus: the summary dashboard:

  • A spike at 600k packets/min.
  • Many warnings in the logs showing that something went bad.
  • CPU on an applicative node went red for some time, and on an ES node as well.
  • Errors on 3 services that may not be scaled enough for this load.
  • And we can see the circuit breaker errors due to Redis suffering.

Really helpful!

Replica reduction after performance optimisations

· 2 min read

I had a task that had been hanging around for months:

Reducing the replica and server count as a consequence of all the performance improvements.

Indeed, the Spider cluster was still using a replica configuration built back when it was implemented on Node 8, with generators and NGINX as the gateway.

At that time, I needed many replicas to overcome some NGINX flaws under high load within Docker. Except that now... the cluster is handling between 2x and 3x more load, and constantly!

So I did the work of computing a theoretical replica count, and reduced the replicas until it felt reasonable.

Overall results​

And the result is great:

  • From a cumulated total of 221 replicas on operational services,
  • it got reduced to... 36! :) I removed 185 replicas from the cluster!

Globally, it is interesting to note that removing replicas:

  • Drastically reduced the RAM usage, which was the goal: -14GB
  • Also decreased the total CPU usage.
    • I think this is due to less CPU overhead from context switching between processes

Thanks to the memory reduction, I moved from 7x M5 AWS instances to 5x C5 instances, which are cheaper with better CPU (but only 4 GB of RAM). I may still remove one server, because the average load is 90% CPU and 2 GB of RAM. But I'll wait some time to be sure.

Detailed step by step​

The replica reduction was done step by step, and metrics were saved for each step.

  • You may notice that reducing replicas made the services faster!! This could be linked to the reduction of context switches...
  • And then, moving from 7 servers to 5 servers increased the latency again. For sure... when a server does almost nothing, it is at its fastest ;-)
    • There was also an intermediate step, not in this table, with 7x C5. But the latency barely changed.

Traefik issue​

After reducing replicas, I encountered a strange issue:

There were some sporadic 502 error messages from one service calling another. But the error message was captured in the caller; the callee never received the communications!

And indeed, the issue was in the Traefik gateway. The case is not that frequent, but it is due to the big difference in socket keep-alive timeouts between Traefik and Node: Traefik is at 90s, Node... at 5s. Once Node's timeout was configured to be longer than Traefik's, the 502s disappeared. Reference: https://github.com/containous/traefik/issues/3237
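
A minimal sketch of the kind of fix involved, assuming a plain Node.js HTTP server; the exact values are illustrative, the point being to keep Node's keep-alive timeout above Traefik's 90s.

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

server.listen(3000);

// Node's default keepAliveTimeout is 5s, while Traefik keeps idle upstream
// connections for 90s, so Node may close a socket Traefik is about to reuse,
// producing sporadic 502s. Keeping Node's timeout above Traefik's avoids that.
server.keepAliveTimeout = 95 * 1000;
// Keep headersTimeout above keepAliveTimeout as well.
server.headersTimeout = 96 * 1000;
```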

Automated setup ! :-)

· 2 min read

Spider configuration craziness

I figured out that if I wanted easy adoption of Spider, I had to improve the setup.

Indeed, the way to set up Spider was cumbersome! While the infra (Elasticsearch cluster, Swarm applicative cluster) can be automated with Terraform and Ansible, the cluster configuration involves many assets:

  • Microservices configuration: 34 files
  • File beat & metric beats configuration: 2 files
  • JWT keys
  • SMTP configuration
  • Traefik configuration
  • Elasticsearch indices provisioning (and maintenance)
  • Swarm cluster configuration: 10 files

All those files have parameters whose values and links must stay consistent, with a lot of potential for failure in case of wrong configuration. Moreover, updating the configuration on evolutions was a real pain, involving Git history, diffs and so on.

So I decided to automate Spider setup and update.

Result​

The result is... WoW :) (not the Blizzard one, my own :-p) Here is the 'minimal' configuration file needed to get a running Spider installation, from scratch:

22 lines :-) Neat!

How it works​

Original setup​

Setup instructions are summarized in 4 steps/commands, and all configuration is now in one file. From this file, a bunch of scripts:

  • Validate the configuration
  • Take the configuration templates and create the environment's own configuration from them
  • Provision or update the indices
  • Create the original admin account
  • And start the cluster :)

Here are the setup instructions:

Configuration templates are available in a Git repo that will receive updates for all new features.

Upgrades​

  • 'make update' rebases the local repo on the trunk
  • After which, a 'make config db cluster' is enough to update the current installation.
  • And of course, in case new setup values are required in setup.yml, the user is told and the update is stopped.

The indices update is automatic and based on metadata added to the indices. It handles parent templates, ILM policies, rolling over indices, rolling up indices and static indices.

  • Rolling over, reindexing and cache purging are all handled programmatically and automatically. This is so good =) (a sketch of the idea is shown below)
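
As an illustration of the idea (a sketch under assumptions, not Spider's actual code), here is how a version check based on template metadata could look with the Node.js Elasticsearch client; the template name, the _meta.version convention and the write alias are hypothetical.

```js
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical template shipped with the configuration repo.
const desired = {
  name: 'spider-logs', // assumed template name
  body: {
    index_patterns: ['spider-logs-*'],
    settings: { number_of_shards: 1 },
    mappings: { _meta: { version: 3 } }, // assumed versioning convention
  },
};

async function upgradeTemplateIfNeeded() {
  let currentVersion = 0;
  try {
    const { body } = await client.indices.getTemplate({ name: desired.name });
    const mappings = body[desired.name].mappings || {};
    currentVersion = (mappings._meta && mappings._meta.version) || 0;
  } catch (e) {
    // Template does not exist yet: it will be created below.
  }

  if (currentVersion < desired.body.mappings._meta.version) {
    await client.indices.putTemplate({ name: desired.name, body: desired.body });
    // Roll over the write alias so new data lands in an index built from the new template.
    await client.indices.rollover({ alias: 'spider-logs-write' }); // assumed write alias
  }
}

upgradeTemplateIfNeeded().catch(console.error);
```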

 

What do you think? Do you like it? I did a demo of the ideas and execution to the Streetsmart DevOps team today... and I think we'll target the same kind of improvement for Streetsmart! Yeah :-)

New saving options - TcpSessions and Http parsing resources - leading to a huge optimisation !

· 2 min read

New saving options​

Thanks to the upgrade to Redis 5.0.7, a bug that I'd been tracking for months has been solved (on Redis's side ;) ), and parsing quality is now at 100% :-) apart from client issues.

I then moved to the next phase: it is now possible not to save parsing resources anymore, leading to a huge optimisation in Elasticsearch usage...

For this, I recently added two new options to the Whisperer configuration:

  • Save Tcpsessions
  • Save Http parsing resources

No need to save those?​

Since in most cases, at least on Streetsmart, we do not work with Tcp sessions and Http parsing resources, there is no point in saving them to Elasticsearch.

  • These resources are indeed only used to track parsing quality and to troubleshoot specific errors.
  • These resources were also saved to Elasticsearch for recovery on errors... but as there are no more errors... it is entirely possible to avoid storing them.

I had to change the serialization logic in different places, but after a couple of days, it is now stable and working perfectly :)
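
To give an idea of the kind of change involved, here is a minimal sketch under assumptions (the flag names on the Whisperer configuration, the index names and the persistence call are hypothetical, not Spider's actual code):

```js
// Hypothetical sketch: persist parsing resources only when the Whisperer
// configuration asks for it; otherwise they live only in memory and Redis.
async function persistParsingResults(whispererConfig, tcpSession, httpParsing, esClient) {
  if (whispererConfig.saveTcpSessions) {
    await esClient.index({ index: 'tcp-sessions', body: tcpSession });
  }
  if (whispererConfig.saveHttpParsingResources) {
    await esClient.index({ index: 'http-parsing', body: httpParsing });
  }
  // When both flags are off, nothing is written to Elasticsearch:
  // the whole parsing pipeline stays in memory, with Redis as temporary store.
}
```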

The whole packet parsing process is then done in memory, in real time, distributed across many web services in the cluster, using only Redis as a temporary store, moving around 300,000 packets per minute, 24/7, amounting to almost 100MB per minute!

That's beautiful =)

What's the result?​

  • Less storage cost
  • Less CPU cost for Elasticsearch indexing
  • More availability of Elasticsearch for UI searches :-) You must have noticed fewer timeouts these days?

On the UI, when these resources are not available, their hypermedia links are not shown, and a new warning is displayed when moving to the TCP / Packet view:

Currently, Tcp sessions and Packets are only saved on the SIT1 and PRM Integration platforms. For... non-regression ;)

One drawback though​

Those resources were rolled up in Elasticsearch to provide the information for the parsing quality line at the bottom of the timeline. :-/

As I got this evolution idea AFTER the quality line implementation, I have nothing to replace it yet. So I designed a new way to save parsing quality information during the parsing, in a pre-aggregated format. Development will start soon!

It is important for transparency as well as for non-regression checks.

Upload Whisperers now have their own storage rules, and purge is back :)

· One min read

Thanks to Leo, we just found out that a recent evolution around Index Lifecycle Management had removed the possibility to purge data manually uploaded to Spider.

I then changed the serialization and storage architecture a bit so that Upload Whisperers have dedicated indices and lifecycles.

Thanks to index template inheritance and a clever move on the pollers' configuration, the changes were pretty limited :)

So now:

  • Uploaded data are kept for several weeks.
  • Uploaded data can be purged whenever you don't need them anymore.