
65 posts tagged with "architecture"


Switching deletes in Redis from DEL to UNLINK. Yet another quick win :)

· One min read

I discovered yesterday, while reading around the web, that Redis UNLINK returns much faster than DEL: it unlinks the key right away and reclaims the memory asynchronously.

Using Spider's bulk edit tools, I updated all services using DEL, whether in their DAO layer or in Lua scripts, in no time, committed, built and pushed the images, redeployed, and immediately saw the benefits!
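
For illustration, here is roughly what the switch looks like in a Node.js DAO and in a Lua script, assuming the ioredis client; the function and key names are invented for the example, not Spider's actual code:

```js
// Hypothetical DAO cleanup, assuming the ioredis client.
// UNLINK returns immediately and reclaims the memory in a background thread,
// whereas DEL blocks until the whole value is freed.
const Redis = require('ioredis');
const redis = new Redis();

async function deleteSession(sessionId) {
  // before: await redis.del(`session:${sessionId}`);
  await redis.unlink(`session:${sessionId}`);
}

// Same idea inside a Lua script: replace DEL calls with UNLINK.
const cleanupScript = `
  local keys = redis.call('SMEMBERS', KEYS[1])
  for _, key in ipairs(keys) do
    redis.call('UNLINK', key)
  end
  return redis.call('UNLINK', KEYS[1])
`;
```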

I had a couple of 'slow' bulk requests that were taking more than 15 ms; now all of them are below 6 ms :-)

Speedy Redis!

Upgrade to ES 7.7

· One min read

Elastic stack 7.7 is out!

I already upgraded Spider Beats, Kibana and Elasticsearch ;)

Speed is there as usual; it even seems faster than before, as foretold by the Elastic team :). The most interesting part is the new Async Search! I'll try it ASAP and keep you posted!

My goal is to use it on the big aggregations behind the UI (timeline and map) to get faster answers and progressive loading!
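
To give an idea, here is a minimal sketch of how Async Search could be used with the @elastic/elasticsearch client (7.7+); the index name and aggregation are placeholders, not Spider's actual queries:

```js
// Sketch only: submit an async search and poll it for progressive results.
// Index, field and aggregation names are assumptions for the example.
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function timelineHistogram() {
  // Return quickly with partial results, even if the search is not finished.
  let { body } = await client.asyncSearch.submit({
    index: 'http-communications-*',
    wait_for_completion_timeout: '200ms',
    body: {
      size: 0,
      aggs: { perMinute: { date_histogram: { field: '@timestamp', fixed_interval: '1m' } } }
    }
  });

  while (body.is_running) {
    // Keep polling until the search completes; partial aggregations could
    // already be pushed to the UI here for progressive loading.
    ({ body } = await client.asyncSearch.get({ id: body.id, wait_for_completion_timeout: '500ms' }));
  }
  return body.response.aggregations.perMinute.buckets;
}
```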

Using Docker logs

· One min read

I recently switched to using Docker volumes for logs, to reduce installation steps and coupling to infrastructure. I created a single shared volume per stack.

However, this implied that logs from stopped containers were never removed... because log rotation was not doing its work anymore.

I then understood better why the 12-factor app practices recommend logging only to STDOUT, and letting Docker handle log storage and removal. So I decided to adapt:

  • Stop JSON logging to files
  • Replace the human-readable logging on STDOUT with JSON logging (a quick sketch follows this list)
  • Switch the Filebeat configuration to container mode
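
As an illustration of the second point, here is a minimal sketch of JSON logging to STDOUT, assuming a logger like pino; the field names are placeholders:

```js
// Minimal sketch: structured JSON logging to STDOUT with pino.
// Docker captures STDOUT, and Filebeat in container mode ships the lines,
// enriched with the container metadata.
const pino = require('pino');
const logger = pino({ base: { service: 'web-write' } }); // "service" field is illustrative

logger.info({ requestId: 'abc-123', durationMs: 12 }, 'request processed');
// => {"level":30,"time":...,"service":"web-write","requestId":"abc-123","durationMs":12,"msg":"request processed"}
```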

At the same time, I benefited from this change in many ways:

  • Traefik, Metricbeat and Filebeat logs are also captured
  • Elasticsearch and Kibana logs are captured on dev
  • All logs have been switched to JSON
  • Filebeat enriches the logs with Docker metadata, which makes it easy to know the origin of each log
  • Filebeat processors and scripting allow reshaping the logs into a common format for all sources :-) Thanks Elastic devs!

It is all in place and deployed! More observability than ever.

Improving GUI loading U/X

· 2 min read

Hi,

I recently spent a couple of hours improving the GUI loading U/X. The first queries were long when no zoom was active on the timeline.

I thought this was mostly related to the amount of data Elasticsearch had to load to do its queries, because the HTTP content is stored in the same index as all the metadata.

So I split this index: metadata and content are now stored separately, and merged only when requesting a single HTTP communication, or when exporting.

This divided the size of the index by almost 2, and cost only one new poller and 2 new indices in the architecture.
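
To give an idea of the resulting pattern, here is a rough sketch of the "merge on demand" read, assuming the @elastic/elasticsearch client; index and field names are invented for the example:

```js
// Sketch of merging metadata and content after the index split.
// Index names and the document layout are assumptions, not Spider's actual schema.
async function getHttpCommunication(client, id) {
  const [meta, content] = await Promise.all([
    client.get({ index: 'http-metadata', id }),
    client.get({ index: 'http-content', id }, { ignore: [404] }) // content may be absent
  ]);
  return {
    ...meta.body._source,
    content: content.statusCode === 404 ? null : content.body._source
  };
}
```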


The changes were good: no more timeout on page loading, but still, it took 11s for 7 days of SIT1...

So, I looked closer.

Step 2 - Avoiding the duplicated load

In fact, when loading the UI without zoom, the search queries for the timeline were executed twice!

It took me a while to figure out. I did some refactoring and fixed 3 places that could cause the issue, one being in the timeline itself. And it is fixed :) !!

The number of queries to load the page has been reduced, and the speed is much better! From 11s, we are down to 4-6s. It is still significant, but considering that it is running its aggregations in parallel on more than 350 million records for 7 days of SIT1... it is great :oO Especially on a 2-core AWS M5 server.
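
For what it's worth, a common guard against this kind of duplicated query is to share the in-flight promise; this is only an illustration of the idea, not necessarily how the GUI fix is implemented:

```js
// Share the in-flight search promise so identical queries fired at the same
// time result in a single request. Index name and params shape are placeholders.
const inFlight = new Map();

function searchTimeline(client, params) {
  const key = JSON.stringify(params);
  if (!inFlight.has(key)) {
    const promise = client
      .search({ index: 'http-metadata', body: params })
      .finally(() => inFlight.delete(key));
    inFlight.set(key, promise);
  }
  return inFlight.get(key);
}
```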

Replacing ElasticSearch Rollups

· 3 min read

Elasticsearch Rollups allow you to build constant aggregates of part of your index and save them for later searching. Searching on rollups can be combined with searching on the original index at the same time. This offers great flexibility... but comes at a cost.

Rollup searches are slow :(

Context

I had already removed two rollups from Spider because I needed more flexibility in the way they were built: I needed the ability to update a previously rolled-up result, which is not a feature of rollups.

So I implemented live aggregation of two kinds of data: TCP parsing status and HTTP parsing status. Both statuses are pre-aggregated at a one-minute interval and stored in their own rotating index. It required extra development, extra pollers, and new indices, but it works great, with great speed.

At that time, I decided to let the last rollup be, because it was aggregating Whisperer capture status information, which cannot be updated.

However, I noticed through monitoring that searches on this rollup were slow: around 300ms! For an index of a few megabytes!! Really not what is expected from Elasticsearch.

This rollup search was used to gather part of the timeline quality line information, and it was generating some timeouts in the worst cases.

Rework

I decided to do as for the previous rollups:

  • A new script in Redis, run when saving whisperers' status, that aggregates the capture status on a minute-based interval (sketched after this list)
  • A new poller configuration to poll the minute-based aggregated data and store it in Elasticsearch
  • A new index to store the aggregated resource, with its own ILM and rotated indices
  • And all of this integrated in the monitoring before release. Which helps a lot!
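
A rough sketch of the first point, assuming ioredis; the key layout and fields are invented for the example, and the real script is more involved:

```js
// Hypothetical minute-based pre-aggregation of capture status in Redis.
const Redis = require('ioredis');
const redis = new Redis();

redis.defineCommand('aggregateCaptureStatus', {
  numberOfKeys: 1,
  lua: `
    -- KEYS[1]: whisperer id, ARGV[1]: epoch millis, ARGV[2]: captured packets count
    local minute = math.floor(tonumber(ARGV[1]) / 60000) * 60000
    local key = 'capture-status:' .. KEYS[1] .. ':' .. minute
    redis.call('HINCRBY', key, 'packets', ARGV[2])
    redis.call('HINCRBY', key, 'samples', 1)
    redis.call('EXPIRE', key, 3600)
    return key
  `
});

// Called when saving a whisperer's status; the poller later reads these hashes
// and bulk-indexes them into the rotated, ILM-managed Elasticsearch index.
async function saveCaptureStatus(whispererId, packets) {
  await redis.aggregateCaptureStatus(whispererId, Date.now(), packets);
}
```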

All this was deployed seamlessly with the automated setup :-) And the result is up to expectations!!

  • Removal of the rollup job that was searching the index every second (still good to take)
  • An optimised process with Redis
  • And much faster searches: 70ms on average, an improvement of more than 75% over a week of measurements

Conclusion

Elasticsearch Rollups are great for verifying that pre-aggregating data speeds up queries, and for working out how to aggregate it. But the current implementation of rollup search is slow, and you'd better:

  • Either use a plain search on the rollup index (and not a rollup search)
    • But then you cannot benefit from searching both the rolled-up index and the source index
  • Or implement your own aggregation process and optimise it for your use case. Which I did (see the sketch below).
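
In client terms, the difference looks roughly like this (index names are placeholders and the query is left abstract; this is not Spider's actual code):

```js
// Illustration with the @elastic/elasticsearch client.
async function captureStatusAggregations(client, query) {
  // Plain search on the rollup index: fast, but cannot combine rolled-up and live data.
  const direct = await client.search({ index: 'rollup-capture-status', body: query });

  // Rollup search: can target rolled-up and live indices together, but is the slow path.
  const rolled = await client.rollup.rollupSearch({
    index: 'rollup-capture-status,whisperers-status-*',
    body: query
  });

  // What Spider does now: a plain search on its own minute-aggregated index.
  const custom = await client.search({ index: 'capture-status-*', body: query });

  return { direct, rolled, custom };
}
```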

Cheers Thibaut

Back Office code refactoring and configuration loading

· 2 min read

Backoffice code refactoring

I recently spent a couple of weeks refactoring the Back Office code.

Nothing is visible on the UI, but maintenance and evolution of the microservices will be much easier:

  • Redis and Elasticsearch DAO
  • Configuration loading process
  • API exposure code
  • Circuit breakers initialization

All these parts are still very flexible and can be adapted to the various services' needs, but the common code has been extracted into shared files across the services :)

No regression noticed for now =)

Configuration reloading

There was an issue with the current setup procedure:

  • The whole upgrade process was automated at the beginning of the year
  • But you had to look at the updated files after an update or configuration change to know which services needed restarting.
  • Or you had to stop and restart the full cluster.

Now, each service monitors its own configuration for changes every 2 minutes (configurable). If a change is noticed, the service restarts gracefully :)
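
Conceptually, the watcher looks like this; the file path, interval handling and shutdown sequence are illustrative, not the actual implementation:

```js
// Illustrative configuration watcher: hash the configuration file on a timer
// and exit gracefully when it changes, letting the orchestrator restart the
// service with the new configuration.
const http = require('http');
const fs = require('fs');
const crypto = require('crypto');

const CONFIG_PATH = process.env.CONFIG_PATH || './config.yml'; // placeholder
const CHECK_INTERVAL_MS = 2 * 60 * 1000; // every 2 minutes, configurable

const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

const hashConfig = () =>
  crypto.createHash('sha256').update(fs.readFileSync(CONFIG_PATH)).digest('hex');

let lastHash = hashConfig();

setInterval(() => {
  const current = hashConfig();
  if (current !== lastHash) {
    console.log('Configuration changed, restarting gracefully');
    // Stop accepting new connections, finish in-flight requests, then exit.
    server.close(() => process.exit(0));
  }
}, CHECK_INTERVAL_MS);
```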

Sadly, it does not prevent errors from happening, since the restart should be managed in the gateway first. Indeed, the latter keeps trying to call the removed service a couple of times.

Nevertheless, upgrading the platform is easier than ever :-D!

GUI now uses Dates not Timestamps

· One min read

The service APIs accept both dates and timestamps; see the Open API specification. I recently updated the GUI to use dates instead of timestamps when calling the APIs.

Why?

Because this is much easier when reading the logs =) And the GUI does not need the microsecond precision that timestamps allow and dates do not.
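
For example (the endpoint and parameter names are invented for the illustration):

```js
// Before: microsecond timestamps, precise but unreadable when scanning logs.
const before = '/ws/packets?from=1586252245123456&to=1586255845123456';

// After: ISO 8601 dates, immediately readable in the logs.
const from = new Date('2020-04-07T09:37:25.123Z').toISOString();
const to = new Date('2020-04-07T10:37:25.123Z').toISOString();
const after = `/ws/packets?from=${from}&to=${to}`;
```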

Small but useful change ;)

Setup improvement: Manage system versions in installation and update.

· One min read

System versions

In order to facilitate Spider setup and updates, I introduced system versions. They allow external installations to stay on a stable version for some time.

Generation

On request, a script generates a system version by tagging together:

  • Docker images of Services, UI and Setup scripts
  • Configuration templates
  • Indices templates

Using them

Then, the Makefile scripts of the infrastructure repository allow you to:

  • List available system versions
  • Install or upgrade to a particular version

This is only the beginning for now, but setups are getting easier and easier :)

Upgrade to Node 12 - and new metrics

· 2 min read

Node 12 upgrade

I recently upgraded all Spider backend services and the Whisperer client to Node 12, and all linked Node.js dependencies to their latest version, thus reducing the technical debt :-)

Node 12 is supposed to be faster on async-await than Node 10. What does it give us?

In general Node 12 is indeed faster than Node 10 :) ! The improvement is quite interesting.

Impact of event loop on perceived performance

I also noticed that the response time tracked by the parsing service (Web-Write) when calling the Packets service increases far too much when the parsing load increases.

After some consideration, I figured out that this was due to the Node.js event loop:

  • Since WebWrite is sending many requests in parallel to PackRead, even if PackRead is fast, the event loop and single-threaded architecture of Node.js mean that Node takes a long time before running the code responsible for tracking the response time of dependencies.
  • I therefore developed new metrics captured at the API level, on the server side (see the sketch after this list). And thus I now have both:
    • The real performance of the Spider APIs - the time to generate the responses
    • The performance of processing the exchanges from the client's perspective
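
As a minimal sketch of what server-side timing at the API level can look like, assuming an Express-style middleware (Spider's actual metrics pipeline differs):

```js
// Measure only the time to produce the response on the server, independently
// of how long the client's event loop takes before processing the answer.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    // In a real service this would be shipped as a metric; here it is just logged as JSON.
    console.log(JSON.stringify({ route: req.path, durationMs }));
  });
  next();
});

app.get('/healthcheck', (req, res) => res.json({ status: 'ok' }));
app.listen(3000);
```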

The result is quite interesting:

Spider's performance is great :-) The event loop has a big impact!

Cheers, Thibaut

Upgrading Traefik v1.7 -> v2.1

· One min read

This weekend, I upgraded Traefik from v1.7 to v2.1.

Traefik is used in Spider as the edge router as well as the intra-cluster load balancer. When I switched from NGINX to Traefik one year ago, there was already a huge improvement in performance and response time stability.

Now, the new version is even better! :)

Here is the comparison before and after the migration, over a 4h run:

In short, there is an average improvement of 30% in intra-cluster response times, with a slight improvement in edge router response times.

Do you still wonder if you should change?