
Improving GUI loading UX

· 2 min read

Hi,

I recently spent a couple of hours improving the GUI loading UX. The first queries were slow when no zoom was active on the timeline.

I thought this was mostly related to the amount of data Elasticsearch had to load for its queries, because the HTTP content is stored in the same index as all the metadata.

So I split this index: metadata and content are now stored separately, and merged only when requesting a single HTTP communication, or when exporting.

This almost halved the size of the index, and only cost one new poller and two new indices in the architecture.
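To illustrate the split (a minimal sketch only: the index names httpcom-metadata / httpcom-content and the ts.beg field are hypothetical, not Spider's actual ones), list views query the metadata index alone, and the content index is only read when a single communication is opened or exported:

```typescript
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

// Timeline and grid views only need metadata, so the heavy content index is never touched.
async function listHttpComs(from: string, to: string) {
  const result = await client.search({
    index: "httpcom-metadata",                              // hypothetical index name
    query: { range: { "ts.beg": { gte: from, lte: to } } }, // hypothetical field name
    size: 100,
  });
  return result.hits.hits;
}

// Only when a single communication is opened (or exported) are the two parts merged.
async function getHttpCom(id: string) {
  const [meta, content] = await Promise.all([
    client.get<Record<string, unknown>>({ index: "httpcom-metadata", id }),
    client.get<Record<string, unknown>>({ index: "httpcom-content", id }),
  ]);
  return { ...(meta._source ?? {}), ...(content._source ?? {}) };
}
```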

 

The changes were good: no more timeouts on page loading. But still, it took 11s for 7 days of SIT1...

So, I looked closer.

Step 2 - Avoiding the duplicated load

In fact, when loading the UI without zoom, the search queries for the timeline were executed twice!

It took me a while to figure out. I did some refactoring and fixed three places that could cause the issue, one being in the timeline itself. And it is fixed :)!

The number of queries to load the page has been reduced, and the speed is much better! From 11s, we are down to 4-6s. That is still a lot, but considering that it runs its aggregations in parallel over more than 350 million records for 7 days of SIT1... it is great :oO Especially on a 2-core AWS M5 server.

Replacing Elasticsearch Rollups

· 3 min read

Rollups allow you to build a continuous aggregate of part of your index and save it for later searches. Searching on rollups can be combined with searching on the original index at the same time. This offers great flexibility... but comes at a cost.

Rollup searches are slow :(

Context

I had already removed two rollups from Spider because I needed more flexibility in the way they were built: I needed the ability to update a previously rolled-up result, which is not a feature of rollups.

So, I implemented live aggregation of two resources: TCP parsing status and HTTP parsing status. Both statuses are pre-aggregated at a one-minute interval and stored in their own rotating index. It required extra development, extra pollers, and new indices, but it works great, with great speed.

At that time, I decided to leave the last rollup alone, because it was aggregating Whisperer capture status information, which cannot be updated.

However, I noticed through monitoring that searches on this rollup were slow: around 300ms! For an index of a few megabytes! Really not what is expected from Elasticsearch.

This rollup search was used to gather part of the timeline quality line information, and was generating some timeouts in the worst cases.

Rework

I decided to proceed as for the previous rollups:

  • A new script in Redis, run when saving Whisperers' statuses, that aggregates the capture status on a minute-based interval (a minimal sketch follows this list)
  • A new poller configuration to poll the minute-based aggregated data and store it in Elasticsearch
  • A new index to store the aggregated resource, with its own ILM and rotated indices
  • And all of this integrated into the monitoring before release, which helps a lot!
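As an illustration of the minute-based aggregation idea (a TypeScript sketch with ioredis rather than the actual Lua script running in Redis; the key layout and field names are assumptions):

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical shape of a Whisperer capture status update.
interface CaptureStatus {
  whispererId: string;
  packetsCaptured: number;
  timestamp: number; // ms since epoch
}

// Bucket each status update into a per-minute hash; a poller can later
// flush every closed minute into its own rotating Elasticsearch index.
async function aggregateCaptureStatus(status: CaptureStatus): Promise<void> {
  const minute = Math.floor(status.timestamp / 60_000) * 60_000;
  const key = `capture-status:${status.whispererId}:${minute}`;

  await redis
    .multi()
    .hincrby(key, "packetsCaptured", status.packetsCaptured)
    .hincrby(key, "updates", 1)
    .expire(key, 15 * 60) // keep the bucket a bit longer than the polling period
    .exec();
}
```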

All this was deployed seamlessly with the automated Setup :-) And the result is up to expectations!

  • Removal of the rollup job that searched the index every second (still good to take)
  • An optimised process with Redis
  • And much faster searches: 70ms on average, a gain of more than 75%, measured over a week.

Conclusion

Elasticsearch Rollups are great to verify that pre-aggregating data speeds up queries, and to work out how to aggregate it. But the current implementation of rollup search is slow, and you'd better:

  • Either use a regular search on the rollup index (and not the rollup search API)
    • But then you cannot benefit from searching both the rolled-up index and the source index at the same time
  • Or implement your own aggregation process and optimise it for your use case. Which I did.

Cheers, Thibaut

Back Office code refactoring and configuration loading

· 2 min read

Back Office code refactoring

I recently spent a couple of weeks refactoring the Back Office code.

Nothing is visible on the UI, but maintenance and evolution of the microservices will be much easier:

  • Redis and Elasticsearch DAO
  • Configuration loading process
  • API exposure code
  • Circuit breakers initialization

All these parts remain very flexible and can be adapted to each service's needs, but the common code has been extracted into shared files used across the services :)

No regression noticed for now =)

Configuration reloading

There was an issue with the current setup procedure:

  • The whole upgrade process was automated at the beginning of the year
  • But you had to look at the updated files after an update or configuration change to know which services needed restarting.
  • Or you had to stop and restart the full cluster.

Now, each service monitors its own configuration for changes every 2 minutes (configurable). If a configuration change is noticed, the service restarts gracefully :)
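The principle is simple. Here is a minimal sketch in TypeScript, assuming a file-based configuration at a hypothetical path; the real services are certainly wired differently:

```typescript
import { createHash } from "crypto";
import { readFile } from "fs/promises";

const CONFIG_PATH = "/conf/service.json";  // hypothetical location
const CHECK_INTERVAL_MS = 2 * 60 * 1000;   // every 2 minutes, configurable

async function configDigest(): Promise<string> {
  const content = await readFile(CONFIG_PATH);
  return createHash("sha256").update(content).digest("hex");
}

// Poll the configuration and trigger a graceful restart when it changes.
export async function watchConfig(onChange: () => Promise<void>): Promise<void> {
  let lastDigest = await configDigest();
  setInterval(async () => {
    try {
      const digest = await configDigest();
      if (digest !== lastDigest) {
        lastDigest = digest;
        await onChange(); // e.g. stop accepting requests, then exit so the orchestrator restarts the container
      }
    } catch (err) {
      console.warn("config check failed", err);
    }
  }, CHECK_INTERVAL_MS);
}
```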

Sadly, it does not prevent errors from happening, since the restart should be handled by the gateway first. Indeed, the latter still tries to call the removed service a couple of times.

Nevertheless, upgrading the platform is easier than ever :-D!

GUI now uses Dates not Timestamps

· One min read

The service APIs accept both dates and timestamps (see the Open API specification). I recently updated the GUI to use dates instead of timestamps when calling the APIs.

Why?

Because this is much easier when reading the logs =) And the GUI does not need the microsecond precision that timestamps allow and dates do not.
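As a small illustration (the /items endpoint and parameter names are made up; the real ones are in the Open API specification), switching from epoch timestamps to ISO dates is a one-liner on the GUI side:

```typescript
// Before: /items?from=1580551200000&to=1580554800000
// After:  /items?from=2020-02-01T10:00:00.000Z&to=2020-02-01T11:00:00.000Z
const from = new Date(1580551200000).toISOString();
const to = new Date(1580554800000).toISOString();
const url = `/items?from=${encodeURIComponent(from)}&to=${encodeURIComponent(to)}`;
```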

Small but useful change ;)

Log Codes are finally there :)

· 3 min read

Log codes

I enforce them in all the systems I have helped build in my professional work... but I completely forgot them when starting Spider. And then, technical debt...

What? Error codes! Or more specifically, log codes.

And not having them was painful:

  • Log codes help to quickly group logs with the same meaning or cause
  • This is very useful:
    • To discard them
    • To analyse a couple of them and check whether there are other errors
    • To group common issues from several services into one (e.g. an infra issue)

Spider implementation

So, during code refactoring,

  • I added log codes to all Spider logs at the Info, Warn, Error and Fatal levels.
  • Pattern: [Service code]-[Feature]-001+
  • The service code is replaced by XXX when the log comes from common code (see the sketch below)
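A minimal sketch of such a logging wrapper, with invented codes and feature names for illustration:

```typescript
type Level = "info" | "warn" | "error" | "fatal";

// Codes follow [Service code]-[Feature]-NNN, e.g. WEB-PARS-001,
// or XXX-CONF-001 when the log line lives in common code (codes here are invented).
function log(level: Level, code: string, message: string, extra: object = {}): void {
  const entry = { time: new Date().toISOString(), level, code, message, ...extra };
  console[level === "fatal" ? "error" : level](JSON.stringify(entry));
}

// Usage: grouping by `code` in the monitoring grid becomes trivial.
log("warn", "WEB-PARS-001", "Parsing retried after missing packets", { tcpSession: "abc123" });
```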

I then added a new grid in monitoring that shows a quick analysis report on those logs.

And at a glance, I can see what happened and whether anything serious from the day before needs to be looked at ;) Handy, isn't it? I like it a lot!

Demo

For instance, today, at 11:21:

  • Many errors were generated during the parsing process
  • At the same time, 25 service instances reconnected to Elasticsearch and Redis at once...
  • This looks like a server restart

Indeed, looking at the server stats,

  • Spider3 server restarted, and its services were reallocated to other nodes:
    • Huge CPU decrease and free RAM increase on Spider 3
    • CPU increase and memory decrease on other nodes

This is confirmed on the grids.

Before

After

Why this restart? Who knows? Some AWS issue, maybe... Anyway, who cares ;-)! The system handled it gracefully.

  • I checked on the server itself... but the uptime was several days. So the server did not restart...
  • I then checked the Docker daemon logs (journalctl -u docker.service)

  • msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
  • msg="heartbeat to manager failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/ag
  • msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=75fnw7bp17gcyt
  • msg="manager selected by agent for new session: " module=node/agent node.id=75fnw7bp17gcytn91d0apaq5t level=info msg="waiting 50.073299ms before registering session"

Looks like this node had an issue communicating with the Swarm manager for some time, and the services were reallocated... Sounds like a network partition on AWS.

They say you have to make sure your architecture is resilient in the Cloud... Done ;-)

However, I still need to rebalance the services manually... for now :(

Setup improvement: Manage system versions in installation and update.

· One min read

System versions

In order to facilitate Spider setup and updates, I introduced system versions. They allow external installations to stay on a stable version for some time.

Generation

On request, a script generates a system version by tagging together:

  • Docker images of the services, the UI and the Setup scripts
  • Configuration templates
  • Index templates

Using them

Then, the Makefile scripts of the infrastructure repository allow you to:

  • List available system versions
  • Install or upgrade to a particular version

This is only the beginning for now, but setup is getting easier and easier :)

Timeline now shows communications over all their periods

· 2 min read

Long-standing issue

There had been an issue from the beginning with the Timeline component and its representation:

  • Items that have a duration, like statuses, HTTP coms, TCP sessions and so on, were only represented on the timeline at their starting point.
  • Doing better with a single query was very costly, if even possible.

This meant that you could see a hole in the Timeline when, in fact, there were ongoing communications during that time.

Things have changed: Elasticsearch improved!

- With ES 5.x, Elastic introduced the 'range' data type, which allows defining a range of numeric values or a range of time; in other words, a duration.

- With ES 7.4, Elastic upgraded their histogram aggregation to handle ranges :) Yeah!

For range values, a document can fall into multiple buckets. The first bucket is computed from the lower bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same way from the upper bound of the range, and the range is counted in all buckets in between and including those two.

Reference: search-aggregations-bucket-histogram-aggregation

What this means is that, when performing an aggregation over a range field, ES collects in each histogram bar all the items whose range intersects with the bar.

As a result, a long-open status will appear in all the minutes / hours / days it spans. Which is much better for the graphical representation :)
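Concretely, the idea looks like this. A sketch with the Elasticsearch Node client; the tcp-sessions index and the period field are hypothetical names, not Spider's actual mapping:

```typescript
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function main() {
  // Mapping: the item stores its whole lifespan as a date_range.
  await client.indices.create({
    index: "tcp-sessions",
    mappings: {
      properties: {
        period: { type: "date_range", format: "epoch_millis" },
      },
    },
  });

  // Since ES 7.4, a range value is counted in every bucket it intersects,
  // so a long-lived session appears in every minute it spans.
  const result = await client.search({
    index: "tcp-sessions",
    size: 0,
    aggs: {
      timeline: {
        date_histogram: { field: "period", fixed_interval: "1m" },
      },
    },
  });
  console.log(result.aggregations);
}

main().catch(console.error);
```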

Implementation in Spider

  • Nothing changed in the Timeline component, nor in the UI queries, but I introduced date range fields in the resources, and now all search queries use them.
  • And... this is even a bit faster ;-)

Demo

We can see a long POST /create_session request, lasting 2.7s.

Upgrade to Node 12 - and new metrics

· 2 min read

Node 12 upgrade

I recently upgraded all Spider backend services and the Whisperer client to Node 12, and all related Node.js dependencies to their latest versions, thus reducing technical debt :-)

Node 12 is supposed to be faster at async/await than Node 10. What does that give us?

In general, Node 12 is indeed faster than Node 10 :)! The improvement is quite interesting.

Impact of event loop on perceived performance

I also noticed that the response time tracked by the parsing service (Web-Write) when calling the Packets service increases far too much when the parsing load increases.

After some consideration, I figured out that this was due to the Node.js event loop:

  • Since WebWrite sends many requests in parallel to PackRead, even if PackRead is fast, the event loop and single-threaded architecture of Node.js mean that Node takes a long time before running the code that tracks the response time of dependencies.
  • I then developed new metrics capture at the API level, on the server side (see the sketch after this list). And thus I now have both:
    • The real performance of the Spider APIs - the time to generate the responses
    • The performance of processing the exchanges from the client's perspective
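A minimal sketch of the server-side timing idea, assuming an Express-style service; the real Spider middleware certainly differs:

```typescript
import express from "express";

const app = express();

// Measure the time the API takes to produce its response, independently of
// how long the caller's event loop takes to get around to reading it.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on("finish", () => {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.info(JSON.stringify({ path: req.path, elapsedMs }));
  });
  next();
});

app.get("/packets", (_req, res) => res.json({ ok: true }));
app.listen(3000);
```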

The result is quite interesting:

Spider's performance is great :-) The event loop has a big impact!

Cheers, Thibaut

Upgrading Traefik v1.7 -> v2.1

· One min read

This weekend, I upgraded Traefik from v1.7 to v2.1.

Traefik is used in Spider as the edge router as well as the intra-cluster load balancer. When I switched from NGINX to Traefik a year ago, there was a huge improvement in performance and response time stability.

Now, the new version is even better! :)

Here is the comparison before and after the migration, over a 4h run:

In short, there is an average improvement of 30% in intra-cluster response times, with a slight improvement in edge router response times.

Do you still wonder whether you should switch?

Resiliency of the Spider system under load :)

· 2 min read

Spider endured an unplanned stress test this morning!

Thankfully, it proved that the system behaves well, and it allowed me to confirm one bug =)

Look what I got from the monitoring:

Load went from 290k/min to 380k/min between 9:00am and 9:20am:

There were so many sessions to parse at once that the parsing queue was hard to empty:

This generated many parsing errors, due to missing packets (short lifetime).

But all went back to normal afterwards :)

CPU and RAM were OK, so I bet the bottleneck was the configuration: a limited number of parsers to absorb the load.

But there is a bug: the Redis store does not fully return to normal; some elements stay in the store and are not deleted:

This would prevent the system from absorbing too many spikes like this in a row.

Also, some circuit breakers to Redis opened during the spike (only 4...):

Under this load, the two main Redis instances were each handling 13k requests per second (26k/s combined), with a spike at 31k/s. For only... 13% CPU each. Impressive!

And all this was due to a load test in SPT1 (the 4 Whisperers at the bottom).

Conclusion:

  • The system is resilient enough :)
  • Observability is great and important!

Bonus: the summary dashboard:

  • A spike at 600k packets/min.
  • Many warnings in the logs showing that something went wrong.
  • CPU on an application node that went red for some time, and on an ES node as well.
  • Errors on 3 services that may not be scaled enough for this load.
  • And we can see the circuit breaker errors due to Redis suffering.

Really helpful!