
Monitoring GUI upgrade

· One min read

I started a BIG and LONG task: removing the technical debt on the UI side:

  • Libraries updates
  • Moving React components to function components and hooks whenever possible
  • Refactoring
  • Material UI upgrade + CSS-in-JS + theming approach

First application to be reworked: the Monitoring UI. It let me start with a full application and build components shared with NetworkView, without struggling with the complexity of the latter.

The Timeline component was refactored too (https://www.npmjs.com/package/@spider-analyzer/timeline), which makes maintenance much easier :)

The result not being very visible (apart from in the code), I took the opportunity to introduce a new feature while doing it: dark mode. Often requested by users, and made much easier by MUI theming :)
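For the curious, enabling a dark theme with Material-UI theming mostly boils down to switching the palette. This is only a minimal sketch using the Material-UI v4 API; Spider's actual theme setup is richer (colors, overrides, etc.):

```jsx
import React from 'react';
import { createMuiTheme, ThemeProvider } from '@material-ui/core/styles';
import CssBaseline from '@material-ui/core/CssBaseline';

// Build a light or dark theme from the user's setting.
const buildTheme = (darkMode) =>
  createMuiTheme({
    palette: { type: darkMode ? 'dark' : 'light' },
  });

export function ThemedApp({ darkMode, children }) {
  return (
    <ThemeProvider theme={buildTheme(darkMode)}>
      <CssBaseline /> {/* applies the theme's background and text colors globally */}
      {children}
    </ThemeProvider>
  );
}
```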

Here it is:

Dark mode can be activated in the settings.

... Now, let's bring this work to the NetworkView UI!

Using Elasticsearch asynchronous searches

· 4 min read

On May 15th, the Elastic team released Elasticsearch 7.7, introducing asynchronous searches: a way to get partial results of a search request and avoid waiting indefinitely for the complete result.

I saw in it a way to improve the user experience when loading Spider, and to remove the last timeouts that still sometimes occur when generating the Timeline or the Network map over many days.

So, here it is: 9 days after the ES 7.7 release, the implementation of async searches in Spider is stable (I hope ;) ) and efficient!

Normal search

Async search

I stumbled a bit at the beginning to find the right way to use this:

When to use partial results

Loading partial results while data was already present meant resetting the existing map or timeline. The result was ugly and disturbing. So I decided to limit partial loads to the initial load, whisperer switches, view switches... In other words... to when the result set is empty before searching.

API integration

Although ES does not require clients to resend the query parameters to get the async search follow-up, the Spider API does.

Indeed, the final async result may include a 'next' link to get the next page. This link is built as hypermedia and includes everything needed to proceed easily to the next page.

As Spider is stateless, the client is required to send all request parameters on every async follow-up, so that Spider can build this 'next' hypermedia link. Spider makes this easy to comply with by providing another hypermedia link, with all parameters, to get the async call follow-up.
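To illustrate, a partial response could look something like this (the field names below are made up for the example, not the actual Spider API contract):

```js
// Hypothetical shape of a partial async search response from the Spider API.
const partialResponse = {
  isPartial: true, // the search is still running on Elasticsearch
  items: [], // partial results, possibly empty
  links: {
    // Follow-up link: carries the async search id AND all original query parameters,
    // so the stateless API can rebuild the 'next' pagination link on the final result.
    followUp: '/communications?asyncId=...&from=...&filters=...',
    next: null, // only present on the final result
  },
};
```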

Client usage

I also tested several solutions to chain the calls in the client (UI), to finally find that the Elastic team made it really easy:

  • You may define a timeout (wait_for_completion_timeout) to get the first partial results in the first query.
    • If the results are available before the timeout, you get them straight away, as with a normal search.
    • Otherwise, you get a result with partial (or no) data.
    • On further calls, you may get the progress of the search straight away... or provide a timeout again.

The beauty of this is that there is no drawback to using async or not. If you use timeouts, you always get the results as soon as they are available. :)

At first, I implemented it like this:

  1. Async search + timeout
  2. Wait 250ms after partial results
  3. Call follow-up
  4. Wait 250ms
  5. ...

But this method may deliver results later than they are actually available. Which is bad for a real-time UI like Spider.

Using timeouts in a clever way, you combine partial results and an ASAP response:

  1. Async search + timeout
  2. Call follow-up + timeout (no wait)

With this usage, ES gives you the results as soon as the query is over, and you can offer an incremental loading experience to your users.
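Here is a sketch of the pattern against the raw Elasticsearch endpoints (not the Spider API; the index name and query body are placeholders):

```js
// Chained async search: every call waits up to 1s, so results arrive as soon as
// ES has them, while partial aggregations can already be pushed to the UI.
const ES = 'http://localhost:9200';

async function asyncSearch(index, body, onPartial) {
  // 1. Submit the async search with a completion timeout.
  let res = await fetch(`${ES}/${index}/_async_search?wait_for_completion_timeout=1s`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  }).then((r) => r.json());

  // 2. Follow up with the same timeout, without any artificial wait in between.
  while (res.is_running) {
    onPartial(res.response); // partial results / aggregations, if any
    res = await fetch(
      `${ES}/_async_search/${encodeURIComponent(res.id)}?wait_for_completion_timeout=1s`
    ).then((r) => r.json());
  }
  return res.response; // final, complete result
}
```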

Implementation results

I implemented async search in the UI for the following 'long' queries:

  • Timeline loading
  • Timeline quality loading
  • Network map loading
  • DNS records loading

On all 3 views (HTTP, TCP, Packet), with 1s timeouts.

The effect is only visible when loading the full period with no filters. Indeed, other queries run way below 1s ;-) On automatic refresh, you won't see the change. The queries are not 'magically' faster: a time-based aggregation over 30 million communications still takes... 4s :-O

As it is still new, I may have missed some stuff, so you may deactivate it in the Settings panel, to get back to normal search:

 

Share your feedback! :)


Switching deletes in Redis from DEL to UNLINK. Yet another quick win :)

· One min read

I discovered yesterday on the web that Redis UNLINK gives much faster responses than DEL.

Using Spider bulk edit tools, I updated all services using DEL in their DAO layer or in Lua in no time, committed, built and pushed the images, redeployed, and immediately saw the benefits!
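The change itself is a one-word swap. Shown here with ioredis in the Node DAO and as a Lua comment (the key names are made up):

```js
const Redis = require('ioredis');
const redis = new Redis();

async function purgeCommunication(id) {
  // DEL frees the memory synchronously and can block Redis on big keys.
  // UNLINK removes the keys from the keyspace immediately and reclaims memory in a background thread.
  await redis.unlink(`http:comm:${id}`, `http:content:${id}`); // hypothetical key names
}

// Same swap inside Lua scripts:
//   redis.call('UNLINK', KEYS[1])  -- instead of redis.call('DEL', KEYS[1])
```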

I had a couple of 'slow' bulk requests that were taking more than 15 ms, and now, all are below 6ms :-)

Speedy Redis!

Upgrade to ES 7.7

· One min read

Elastic stack 7.7 is out!

I already upgraded Spider Beats, Kibana and Elasticsearch ;)

Speed is there as usual; it even seems faster than before, as promised by the Elastic team :). What is most interesting is the new async search! I'll try it ASAP and keep you posted!

My goal is to use it on the big aggregations on the UI (Timeline and map) to get faster answers and progressive loading!

Using Docker logs

· One min read

I recently switched to using Docker volumes for logs, to reduce installation steps and coupling to the infrastructure. I created a single shared volume per stack.

However, this meant that logs from stopped containers were never removed... because log rotation was not doing its job anymore.

I then understood better why the 12-factor app practices recommend logging only to STDOUT, and letting Docker handle log storage and removal. And I decided to adapt:

  • Stop JSON logging to files
  • Replace human-readable logging on STDOUT with JSON logging (see the sketch after this list)
  • Change the Filebeat configuration to container mode
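The JSON-on-STDOUT part is the simple bit; here is a minimal sketch with pino (Spider's services may well use another logger, and the bindings are illustrative):

```js
// Log structured JSON to STDOUT and let Docker's logging driver handle storage and rotation.
const pino = require('pino');

const log = pino({ base: { service: 'web-write' } }); // 'service' binding is just an example

log.info({ whispererId: 'abc123' }, 'parsing session started');
// -> {"level":30,"time":1590307200000,"service":"web-write","whispererId":"abc123","msg":"parsing session started"}
```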

At the same time, I benefited from this change in many ways:

  • Traefik, metricbeat and filebeat logs are also captured
  • Elasticsearch and kibana logs are captured on dev
  • All logs have been switched to JSON
  • Filebeat allows enriching logs with Docker metadata, making it easy to know the origin of each log
  • Filebeat processors and scripting allow reshaping the logs into a common format for all sources :-) Thanks Elastic devs!

It is all in place and deployed! More observability than ever.

Filters as tags

· 2 min read

I got the idea when working on U/X design for Flowbird products:

  • How to be more 'user friendly' with filters and queries
  • How to show all active filters in a 'nice' way

I figured out that, instead of making every filter visible in the UI query component, I could display each filter as a tag in the UI. This even brought the possibility to add modification options (sketched after the list below):

  • Deactivate the filter
  • Invert it
  • And of course, edit the filter in the query
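Conceptually, each tag wraps a filter with its own state. A hypothetical model (not the actual Spider code), just to fix ideas:

```js
// Hypothetical filter-tag model: the query is rebuilt from each tag's current state.
const filterTag = {
  field: 'stats.status', // illustrative field name
  label: 'HTTP status',  // 'functional' label displayed on the tag
  value: '5xx',
  active: true,          // can be toggled off without losing the filter
  inverted: false,       // include vs exclude
};

// Only active tags contribute to the query; inverted ones become negative clauses.
const toQueryClause = ({ field, value, inverted }) =>
  inverted ? { not: { [field]: value } } : { [field]: value };
```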

Outcome

After much refactoring and UI development, here is the result:

  • Filters are displayed as tags at the top left of the screen.
  • Filters are displayed with 'functional' labels and values as much as possible

  • Filters can be removed from the set with their close button

  • Filters can be deactivated, inverted (include or exclude), or manually edited
  • When saving a query, all filters and tags are saved with the query

  • A loaded query can be broken down again into its original filters for editing

The feeling is really great. It is easy to load a query as default filters, then search and drill down to the selected items, then cancel some filters and search for something else.

Play with it, tell me what you think! :-)

Menu disposition change

Since the free-text search is now less prominent (it is a filter like any other), the save/load query, undo and clear filters (new icon) buttons have been moved to a dedicated toolbar.

Improving GUI loading U/X

· 2 min read

Hi,

I recently spent a couple of hours improving the GUI loading U/X. The first queries were long when no zoom was active on the timeline.

I suspected this was mostly related to the amount of data Elasticsearch had to load for its queries, because the HTTP content was stored in the same index as all the metadata.

So I split this index: metadata and content are now stored separately, and merged only when requesting a single HTTP communication, or when exporting.

This divided the size of the index by almost 2, and cost only one new poller and 2 new indices in the architecture.
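A rough sketch of what the merge looks like now (the index names and client calls are illustrative, not the actual Spider code):

```js
// Metadata and content live in two separate indices; they are only joined
// when a single HTTP communication is requested (or when exporting).
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

async function getHttpCommunication(id) {
  const [meta, content] = await Promise.all([
    es.get({ index: 'http-communications', id }), // hypothetical index names
    es.get({ index: 'http-contents', id }),
  ]);
  return { ...meta.body._source, content: content.body._source };
}
```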

 

The change was good: no more timeouts on page loading. But still, it took 11s for 7 days of SIT1...

So, I looked closer

Step 2 - Avoiding the duplicated load

In fact, when loading the UI without zoom, the search queries for the timeline were executed twice!

It took me a while to figure out. I did some refactoring and fixed 3 places that could cause the issue, one being in the timeline itself. And it is fixed :) !!

The number of queries to load the page has been reduced, and the speed is much better! From 11s, we are down to 4-6s. It is still significant, but considering that it is running its aggregations in parallel over more than 350 million records for 7 days of SIT1... it is great :oO Especially on a 2-core AWS M5 server.

Replacing Elasticsearch Rollups

· 3 min read

Rollups allow you to continuously build aggregates of some of your indices and save them for later searches. Searching on rollups can be combined with searching on the original index at the same time. This offers great flexibility... but comes at a cost.

Rollup searches are slow :(

Context

I had already removed two rollups from Spider because I needed more flexibility in the way they were built: I needed the ability to update a previously rolled up result. Which is not a feature of rollups.

So, I implemented live aggregation of two datasets: TCP parsing status and HTTP parsing status. Both statuses are preaggregated at a one-minute interval and stored in their own rotating index. It required extra development, extra pollers, and new indices, but it works great, with great speed.

At that time, I decided to let the last rollup be, because it was aggregating Whisperer capture status information, which cannot be updated.

However, I noticed through monitoring that searches on this rollup were slow: around 300ms! For an index of a few megabytes!! Really not what is expected from Elasticsearch.

This rollup search was used to gather part of the timeline quality line information, and was generating some timeouts in the worst cases.

Rework

I decided to do the same as for the previous rollups:

  • A new script in Redis, run when saving whisperers' status, that aggregates the capture status on a minute-based interval (see the sketch after this list)
  • A new poller configuration to poll the minute-based aggregated data and store it in Elasticsearch
  • A new index to store the aggregated resource, with its own ILM and rotated indices
  • And all of this integrated into the monitoring before release. Which helps a lot!
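The Redis side is essentially a minute-bucketed counter. A sketch with ioredis (the key layout and fields are made up):

```js
// When saving a whisperer's capture status, also increment per-minute counters.
// A poller later reads these buckets and indexes them into their own Elasticsearch index.
const Redis = require('ioredis');
const redis = new Redis();

async function aggregateCaptureStatus(whispererId, status) {
  const minute = Math.floor(Date.now() / 60000) * 60000; // minute-based bucket
  const key = `capture:status:${whispererId}:${minute}`; // hypothetical key layout

  await redis
    .multi()
    .hincrby(key, 'packets', status.packets)
    .hincrby(key, 'bytes', status.bytes)
    .expire(key, 3600) // keep the bucket long enough for the poller to pick it up
    .exec();
}
```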

All this was deployed seamlessly with the automated setup :-) And the result is up to expectations!!

  • No more rollup job searching the index every second (still good to take)
  • Optimised process with Redis
  • And much faster searches: 70ms on average, a reduction of more than 75%, measured over a week.

Conclusion

Elasticsearch Rollups are great for testing that preaggregating data speeds up queries, and for working out how to aggregate it. But the current implementation of rollup search is slow, and you'd better:

  • Either use a regular search on the rollup index (and not the rollup search API)
    • But then you cannot benefit from searching both the rolled up index and the source index at once
  • Or implement your own aggregation process, and optimise it for your use case. Which I did.

Cheers Thibaut

Back Office code refactoring and configuration loading

· 2 min read

Back Office code refactoring

I recently spent a couple of weeks refactoring the Back Office code.

Nothing is visible on the UI, but maintenance and evolution of the microservices will be much easier:

  • Redis and Elasticsearch DAO
  • Configuration loading process
  • API exposure code
  • Circuit breakers initialization

All these parts remain very flexible and can be adapted to the various services' needs, but the common code has been extracted into shared files across the services :)

No regression noticed for now =)

Configuration reloading

There was an issue with the current setup procedure:

  • The whole upgrade process has been automated at the beginning of the year
  • But you had to look at the updated files after an update or configuration change to know which services needed restarting.
  • Or you had to stop and restart the full cluster.

Now, each service monitors its own configuration for changes every 2 minutes (configurable). And if a configuration change is noticed, the service restarts gracefully :)
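A minimal sketch of the idea (the file path, hashing and restart details are illustrative; the real implementation differs):

```js
// Poll the configuration file every 2 minutes; on change, exit gracefully
// so the orchestrator restarts the service with the new configuration.
const fs = require('fs');
const crypto = require('crypto');

const CONFIG_PATH = process.env.CONFIG_PATH || '/conf/service.json'; // illustrative path
const CHECK_INTERVAL_MS = 2 * 60 * 1000; // every 2 minutes, as in the post

const hashConfig = () =>
  crypto.createHash('sha256').update(fs.readFileSync(CONFIG_PATH)).digest('hex');

function watchConfig(onChange) {
  let lastHash = hashConfig();
  setInterval(() => {
    const current = hashConfig();
    if (current !== lastHash) onChange();
  }, CHECK_INTERVAL_MS);
}

watchConfig(() => {
  console.log(JSON.stringify({ msg: 'configuration changed, restarting gracefully' }));
  process.exit(0); // a real service would first stop accepting requests and drain in-flight ones
});
```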

Sadly, it does not prevent errors from happening, since the restart should be managed in the gateway first. Indeed, the latter keeps trying to call the removed service for a little while.

Nevertheless, upgrading the platform is easier than ever :-D!

GUI now uses Dates not Timestamps

· One min read

The service APIs accept both dates and timestamps. See the OpenAPI specification. I recently updated the GUI to use dates instead of timestamps when calling the APIs.

Why?

Because it is much easier when reading the logs =) And the GUI does not need the microsecond precision that timestamps allow and dates do not.
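To illustrate, the same request before and after (the endpoint and parameter names are made up):

```js
// Before: microsecond timestamps, precise but unreadable when scanning the logs.
fetch('/api/communications?from=1590314400000000&to=1590400800000000');

// After: ISO dates, immediately readable in the access logs.
fetch('/api/communications?from=2020-05-24T10:00:00.000Z&to=2020-05-25T10:00:00.000Z');
```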

Small but useful change ;)