
Improved free time selection

· One min read

Playing with Spider during non-regression testing with very old pcap capture files, I kept fighting with the free time selection inputs on the right of the timeline.

It was difficult to move back to 2018, for instance!

I figured out that validation and change acceptance needed to be handled on both inputs together. So I redesigned the UX there (sketched below), and it is much better now, IMO :)

Tell me what you think!

  • You may validate a change to a single input by pressing Enter (when there is no error)

  • You may validate a change to both inputs at once with the validation button

    • This allows moving far and fast in time by changing both inputs and validating only when finished.
  • When there is an error, the error text shows up with the possibility to cancel the change.
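
For the curious, here is a minimal sketch of that interaction logic in TypeScript. It is illustrative only: the names (TimeRange, onEnter, onValidate) are mine, not the actual Spider UI code.

```typescript
// Illustrative sketch of the new free time selection behaviour.
// Names and structure are hypothetical, not the actual Spider UI code.
interface TimeRange { from: Date; to: Date; }
interface PendingRange { from: string; to: string; } // raw text of the two inputs

function parseRange(pending: PendingRange): TimeRange | Error {
  const from = new Date(pending.from);
  const to = new Date(pending.to);
  if (isNaN(from.getTime()) || isNaN(to.getTime())) return new Error('Invalid date');
  if (from.getTime() >= to.getTime()) return new Error('"From" must be before "To"');
  return { from, to };
}

// Pressing Enter in one input: commit immediately, but only when the whole
// range is still valid (no error).
function onEnter(pending: PendingRange, commit: (r: TimeRange) => void): void {
  const range = parseRange(pending);
  if (!(range instanceof Error)) commit(range);
}

// The validation button: commit both inputs at once, or show the error
// with the possibility to cancel and restore the last committed range.
function onValidate(
  pending: PendingRange,
  commit: (r: TimeRange) => void,
  showError: (message: string, cancel: () => void) => void,
  restoreLastCommitted: () => void,
): void {
  const range = parseRange(pending);
  if (range instanceof Error) showError(range.message, restoreLastCommitted);
  else commit(range);
}
```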

How does Spider cope with a 2x load for 15 min?

· 3 min read

Today, checking monitoring at the end of the day, I found a spike of 'parsing errors' in the morning. The monitoring helped me find out why. Take the path with me:

1 - Looking at the logs dashboard

We can see a spike in logs - nearly 6000!! - around 10:13. The aggregation by codes shows us very easily that there have been parsing issues, and opening the log detail tells us why: there were missing packets.

Let's find the root cause.

2 - Looking at the parsing dashboard

We can see an increase of Tcp sessions waiting to be parsed in the queue, and the parsing duration and delay increasing.

Many HTTP coms were still created, so there were no real errors, only an increase in demand.

There is a small red part in the Parsing status histogram, with 5603 sessions in error out of 56000.

3 - Further on, in the services dashboard

There is definitely an increase in input load, and an even bigger increase in created Http Coms. The input load almost doubled!

CPU is still good, with a clear increase for the parsing service.

4 - Looking at the DB status

Redis doubled its load, with a high increase in RAM, but it came back to normal straight after :) Works like a charm!

Redis response time and content increased significantly, but nothing worrying. The spike has been absorbed.

Elasticsearch shows a clear increase in new communications being indexed.

5 - Then the whisperers dashboard gives us the answer

In fact, all was normal: it's only the performance team (SPT1 whisperer) that decided to capture one of their tests :-)

 

Those are good observability capabilities, don't you think? All in all, everything went well.

  • The spike was absorbed for almost 15 minutes,
  • But the parsing replicas were not enough to cope with the input load, and the parsing delay increased steadily
  • So much that Redis started removing data before it got parsed (when the parsing delay reached 45s, the TTL of packets)
    • Watch again the second set of diagrams to check this.
  • Then the parsers started complaining about missing packets when parsing the Tcp sessions. The system was in 'safety' mode, avoiding a crash and containing the load increase.
  • All went back to normal after SPT1 stopped testing.

The system works well :) Yeah! Thank you for the improvised test, performance team !

We may also deduce from this event that the parsing service replicas may be safely increased to absorb such a spike, as CPU usage still offered room for it. Auto scaling would be the best option in this case.

Cheers, Thibaut

Enhanced monitoring for parsing status

· 2 min read

When playing with chaos testing, I noticed that I had no metric telling me whether the parsing speed was fine or close to the limit. I knew when parsing was failing, but not if it was about to fail.

I then designed and added new metrics for parsing speed:

  • Delay before parsing
  • Duration of parsing
  • Speed of parsing

The first KPI indicates whether the parsing 'power' is enough, as it must stay between 10s (the configured delay before parsing) and 45s (the TTL of packets in Redis).

The other KPIs indicate the speed of the parsers under the current load and will allow comparing performance improvements.
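
As an illustration, the three metrics could be derived roughly like this (a sketch with hypothetical field names, not the actual parsing service code):

```typescript
// Sketch of how the three parsing speed metrics could be computed.
// Field names (readyAt, startedAt, finishedAt, bytes) are hypothetical.
interface ParsedPage {
  readyAt: number;    // when the Tcp sessions became ready to parse (ms since epoch)
  startedAt: number;  // when the parser polled them
  finishedAt: number; // when parsing completed
  bytes: number;      // total payload size of the page
}

const CONFIGURED_DELAY_S = 10; // delay before parsing, from the configuration
const PACKET_TTL_S = 45;       // TTL of packets in Redis

function parsingMetrics(page: ParsedPage) {
  const delayS = (page.startedAt - page.readyAt) / 1000;        // delay before parsing
  const durationS = (page.finishedAt - page.startedAt) / 1000;  // duration of parsing
  const bytesPerS = durationS > 0 ? page.bytes / durationS : 0; // speed of parsing

  // Parsing 'power' is enough while the delay stays between the configured
  // delay and the packet TTL; close to the TTL means packets may expire unparsed.
  const healthy = delayS >= CONFIGURED_DELAY_S && delayS < PACKET_TTL_S;
  return { delayS, durationS, bytesPerS, healthy };
}
```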

In the main dashboard

As a new parsing page

I regrouped the previous parsing KPIs together:

  • Tcp sessions to parse in the queue - to check it is not increasing
  • Tcp parsing status - to check quality of parsing
  • Maximum parsing delay - to check it stays way below 45s
  • Parsing duration of a polled page of Tcp sessions (max 20) - to check speed
  • Amount of communications created from the parsing - to check we indeed created something :)

All in all...  1 day of work :)

Avoiding duplicates

· One min read

When capturing both sides of the same communication - for instance, when capturing from both the gateway and the service itself - Spider captures the same communication twice, with slightly different dates.

It is now possible to ask Spider to avoid duplicates.

Avoiding duplicated communications

With this option, Spider will generate the same id for the object on both sides of the communication, and only one will then be saved (and parsed).

For this, select 'Avoid duplicated communications' on the Capture Config tab.

Then, only one Tcp session will be created, and thus only one copy of the Http Communications.
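
To give an idea of the principle (an illustrative sketch, not Spider's actual id computation): a direction-independent id can be obtained by ordering the endpoints of the connection before hashing them, so both capture points produce the same value.

```typescript
import { createHash } from 'crypto';

// Illustrative sketch only, not Spider's actual id computation: ordering the
// two endpoints before hashing makes the id independent of the capture side.
interface Endpoint { ip: string; port: number; }

function sessionId(a: Endpoint, b: Endpoint): string {
  const key = (e: Endpoint) => `${e.ip}:${e.port}`;
  // Sort so that client->server and server->client yield the same input.
  const [first, second] = [key(a), key(b)].sort();
  return createHash('sha1').update(`${first}|${second}`).digest('hex');
}

// Both capture points compute the same id, so only one Tcp session is saved:
// sessionId({ ip: '10.0.0.1', port: 52000 }, { ip: '10.0.0.2', port: 443 })
//   === sessionId({ ip: '10.0.0.2', port: 443 }, { ip: '10.0.0.1', port: 52000 })
```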

Avoiding duplicated packets

You may also choose to avoid duplicated packets, in the advanced options of the Packets saving part of the Parsing Config tab. The option is visible only when saving packets.

Note that this asks more resources of the system, and should only be considered when doing statistics at the packet level (not often).

Changelog since May 2021

· One min read

It's been a while since I last wrote here.

Spider is progressing, but I spent much of my Spider time doing administrative and legal stuff. Its official public release is approaching :)

I nevertheless did some stuff:

  • Upgraded all services and UIs to Node 16 in August and September, with an upgrade of all libraries
  • Improved the UI so that it checks for a new version every time it receives the focus, with an integrated changelog of UI versions displayed in the details panel by rendering the service CHANGELOG.md file. You might have seen it already (a small sketch of the focus check follows below).
  • Improved teams configuration to allow copying a team's settings to the user's, in order to troubleshoot and improve them (the opposite already existed)
  • Import/export of Whisperer configuration (decoding and parsing) to/from a file. This would have proven useful before, so it will again!
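
The version check on focus is conceptually as simple as this sketch (the /version endpoint and the notification are placeholders, not the real UI code):

```typescript
// Sketch: check for a new UI version whenever the window regains focus.
// The /version endpoint and the notification are hypothetical.
let knownVersion: string | null = null;

async function checkForNewVersion(): Promise<void> {
  const res = await fetch('/version');
  const { version } = (await res.json()) as { version: string };
  if (knownVersion === null) {
    knownVersion = version;
  } else if (version !== knownVersion) {
    // A newer UI has been deployed: tell the user and show the changelog.
    console.info(`New UI version available: ${version}`);
  }
}

window.addEventListener('focus', () => { void checkForNewVersion(); });
```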

And I've spent some time solving my 'last' parsing issues to support long communications and optimise the parsing again.

That's for the next post! :)

My 1st customer satisfaction survey

· One min read

In October last year, I performed my first customer satisfaction survey... And the results are great!

I read some website advice and took some templates as examples. The first version was too long, with too many choices and questions. I reduced it to get more feedback :)

Thanks to all of you that participated!

The summary in a picture:

Contact me if you'd like to know more!

Parsing engine rework! 

· 4 min read

Existing issue

The existing parsing engine of Spider had two major issues:

  • Tcp session resources included the list of packets they were built from. This limited the count of packets a Tcp session could hold, because the resource was ever increasing. Long persistent Tcp sessions were causing issues and thus were limited in terms of packets.
  • Http parsing logs also included the list of packets and of HTTP communications found.

I studied how to remove these limitations, and how to improve the parsing speed and its footprint at the same time. While keeping the same quality, of course!

And I managed :) !!

I had to change part of the level 1 architecture decisions I took at the beginning, and it had impacts on the Whisperers code and on 7 other microservices, but it seemed sound and the right decision!

Work

4 weeks later, it is all done, fully regression tested and deployed on Streetsmart! And the result is AWESOME :-)

Spider now parses Tcp sessions in streaming, with a minimal footprint and a reduced CPU usage of the servers for the same load! :)
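
To picture the difference, here is a simplified sketch of the streaming approach (types and names are illustrative, not the actual parser): packets are processed as they are polled and dropped right after, instead of being kept on the session resource.

```typescript
// Simplified sketch of the idea behind streaming parsing: packets are polled
// page by page, fed to an incremental parser and dropped once processed,
// instead of being accumulated on the Tcp session resource.
// Types and function names are illustrative only.
interface Packet { seq: number; payload: Buffer; }
interface HttpCommunication { /* parsed request/response pair */ }

interface IncrementalParser {
  push(p: Packet): HttpCommunication[]; // communications completed so far
  end(): HttpCommunication[];           // flush whatever remains
}

async function parseSession(
  pages: AsyncIterable<Packet[]>, // e.g. pages of packets polled from Redis
  parser: IncrementalParser,
  save: (coms: HttpCommunication[]) => Promise<void>,
): Promise<void> {
  for await (const page of pages) {
    for (const packet of page) {
      const coms = parser.push(packet); // parse as soon as data is available
      if (coms.length) await save(coms);
    }
    // Packets of this page are no longer referenced: minimal memory footprint.
  }
  const remaining = parser.end();
  if (remaining.length) await save(remaining);
}
```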

I also took the time to improve the 'understandability' of the process and the code quality. I will document the former soon.

Results

Users... did not see any improvements (nor any issue), except that 'it seems faster', but the figures are here to tell the story!

On the first day, there were only 65 parsing errors out of 43 million communications! The 2 bugs I had missed were fixed straight away thanks to good observability! :)

Effects on back end

3 /v2 APIs have been added to the API. The corresponding /v1 APIs will be deprecated soon.

But Spider is compatible with both, which allowed easy non-regression testing by comparing the parsing results of the same network communications... by both engines ;-)!

Effects on UI

  1. Grids and details views of packets and Tcp sessions have been updated.
  2. The pcap upload feature has been updated to match the new APIs.
  3. Downloading pcap packets has been fixed to match the new APIs.
  4. All this also implied changes in the Tcp sessions display:
  • Details and content details pages now use infinite scroll, as there may be tens or hundreds of thousands of packets.
    • This deserves another improvement for later: being able to select the time in the timeline!
  • Getting all packets of a Tcp session is only a single filter away.

Performance

Last but not least... performance!! Give me the figures :)

Statistics over

  • 9h of run
  • 123 MB /min of parsed data
    • 318 000 packets /min
    • 31 000 tcp sessions /min
  • For a total of
    • 171 Million packets
    • 66.4 GB
    • 16 Million Tcp sessions
    • 0 error :-)

CPU usage dropped!

(Before / after CPU usage graphs)

Redis footprint was divided by more than 2!

  • From 80 000 items in working memory to 30 000
  • From 500 MB to 200 MB memory footprint :)

(Before / after Redis memory graphs)

Resource usage

CPU usage for the parsing service dropped by 6%. But the most impressive is the CPU drop of 31% and 43% for the inbound services: pack-write and tcp-write.

Whisperers --> Spider

Confirming the above figures, the response times of pack-write and tcp-write have improved by 40% and 10%!

API stats

API statistics confirm the trend, with service-side improvements of up to 50%! Geez!!

Circuit breakers stats

When seen from the circuit breakers' perspective, the difference is smaller, due to the delay in the client services' internal processing.

Conclusion

That was big work! Many changes in many places. But Spider is now faster and better than ever :)

Excel-driven refactoring! My first ever ;)

· 2 min read

One of the oldest sagas of the Network UI needed a huge refactor. Well... not a refactor: the goal was to remove it completely.

I wrote it before I built my best practices with sagas. And this method was one that helped me understand... not to do it like this ;) The method was called on various user and automatic actions to update various elements of the UI:

  • Timeline
  • Map
  • Grid
  • Stats
  • Nodes names
  • ...

As long as it updated everything, it was quite simple. But for performance improvements, and to limit the queries on the servers, many parameters were added to restrict some refreshes to some situations.

However, this is not the right pattern. It is better to have each component own its saga, watching the actions it needs to refresh on. This is the pattern I implemented almost everywhere else; it scales well, while keeping the responsibility in one place.
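
In redux-saga terms, the target pattern looks roughly like this (component and action names are made up for the example):

```typescript
import { all, call, takeLatest } from 'redux-saga/effects';

// Sketch of the target pattern: each component owns a saga that watches only
// the actions it needs to refresh on, instead of one big shared refresh saga.
// Action types and fetch functions are made up for the example.
declare function fetchTimelineData(): Promise<unknown>;
declare function fetchMapData(): Promise<unknown>;

function* refreshTimeline() {
  yield call(fetchTimelineData);
}

function* refreshMap() {
  yield call(fetchMapData);
}

function* timelineSaga() {
  yield takeLatest(['FILTER_CHANGED', 'TIME_RANGE_CHANGED'], refreshTimeline);
}

function* mapSaga() {
  yield takeLatest(['FILTER_CHANGED', 'NODE_SELECTED'], refreshMap);
}

// The root saga simply composes the per-component watchers.
export function* rootSaga() {
  yield all([timelineSaga(), mapSaga()]);
}
```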

Performing this refactor was risky, as the function was called from many places with various arguments.

So I used ... Excel ! ;)

1. List the calls

2. List the behaviors from the params

3. List the needs for each call

4. Find the actions behind each need, and subscribe the update sagas to those actions

5. Tests!!

All in all... 5h of preparation, 5h of refactor + fix and... it rocks :) !

So much more understandable and easy to maintain. What a relief to remove this old code.

Code ages badly. Really ;)

Monitoring - New Performance view

· One min read

I added a new view to monitoring. And thanks to the big refactors of last year... this was bloody easy :)

This view adds several grids to get performance statistics over the period:

  • Services performance
    • Replicas, CPU, RAM, Errors
  • Whisperers -> Services communications
  • Services API
  • Services -> Services communications
  • Services -> Elasticsearch
  • Services -> Redis
    • For all: Load, Latency, Errors

Setup now manages Docker config upgrades

· One min read

I wanted to remove coupling between Spider setup and infrastructure configuration.

There was still one sticky bit: the configuration service was using a volume from which all the applications' configuration files were mounted.

I moved it all into Docker configs, so that you may have many replicas of the configuration service, and also so that High Availability is managed by Docker. To get there, I upgraded the Spider setup script to:

  • Create Docker configs for each application configuration file
  • Inject them in the Configuration service Docker stack definition
  • And also... manage updates of those configurations, to transparently change the Docker configs on the next deploy (the idea is sketched below).
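
The 'update' part relies on the fact that Docker configs are immutable: a changed file becomes a new config that the stack then references. A minimal sketch of the idea, with an illustrative naming convention (not the actual setup script):

```typescript
import { createHash } from 'crypto';
import { execSync } from 'child_process';
import { readFileSync } from 'fs';

// Sketch of the rotation idea: Docker configs are immutable, so a changed file
// gets a new config whose name embeds a content hash, and the stack definition
// is updated to reference it on the next deploy.
// The naming convention and this script are illustrative, not the actual setup.
function ensureDockerConfig(appName: string, filePath: string): string {
  const hash = createHash('sha256').update(readFileSync(filePath)).digest('hex').slice(0, 12);
  const configName = `${appName}_cfg_${hash}`;

  const existing = execSync('docker config ls --format "{{.Name}}"').toString().split('\n');
  if (!existing.includes(configName)) {
    execSync(`docker config create ${configName} ${filePath}`);
  }
  return configName; // injected into the Configuration service stack definition
}
```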

Now, more than ever, setup and upgrades of Spider are simple:

  • Setup your ES cluster
  • Setup your Docker Swarm cluster
  • Pull the Setup repo and configure setup.yml
  • To install:
    • Run make new-install config db keypair admin crons cluster
  • To update:
    • Run make update config db cluster

I could also manage Docker secret upgrades... but since only the signing key is stored as a secret, there is not much value in it :)