Using Elasticsearch asynchronous searches

On May 15th, the Elastic team released Elasticsearch 7.7, introducing asynchronous search: a way to get partial results of a search request and to avoid waiting indefinitely for the complete result.

I saw in it a way to improve the user experience when loading Spider, and to avoid the last timeouts that still sometimes occur when generating the Timeline or the Network map over many days.

So, here it is, 9 days after the ES 7.7 release: the implementation of async searches in Spider is stable (I hope 😉 ) and efficient!

Normal search

Async search

Trials and failures

I stumbled a bit at the beginning before finding the right way to use it:

When to use partial results

Loading partial results while data was already present meant resetting the existing map or timeline. The result was ugly and disturbing. I decided to limit partial loads to the initial load, Whisperer switches, view switches… In other words… to cases when the result set is empty before searching.

API integration

Although ES does not require clients to resend the query parameters to get the async search follow-up, the Spider API does.

Indeed, the final async result may include a ‘next’ link to fetch the next page.
This link is built as a hypermedia link and includes everything necessary to proceed easily to the next page.

As Spider is stateless, the client is required to send all request parameters on every async follow-up, so that Spider can build this ‘next’ hypermedia link.
Spider makes this easy to comply with by providing another hypermedia link, carrying all parameters, to get the async call follow-up.
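To illustrate, here is a minimal sketch of what such a response could look like in TypeScript; the field names are hypothetical, not the actual Spider payload.

```typescript
// Hypothetical shape of a Spider async search response; field names are
// illustrative only, not the real API contract.
interface AsyncSearchResponse<T> {
  isRunning: boolean;   // Elasticsearch is still computing the results
  isPartial: boolean;   // the results below may be incomplete
  results: T[];         // partial or final items
  links: {
    // Follow-up link, carrying all the original request parameters so the
    // stateless API can rebuild the query (and the 'next' link) on each call.
    asyncFollowUp?: string;
    // Present on the final result only: pagination to the next page.
    next?: string;
  };
}
```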

Client usage

I also tested several solutions to chain the calls in the client (UI), and finally found that the Elastic team made it really easy:

  • You may define a timeout (wait_for_completion_timeout) to get the first partial results in the first query.
    • If the results are available before the timeout, you get them straight away, as with a normal search.
    • Otherwise, you get a result with partial (or no) data.
    • On further calls, you may get the progress of the search straight away… or provide a timeout again.

The beauty of this is that there is no drawback to using async or not: if you use timeouts, you always get the results as soon as they are available. 🙂
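Here is a minimal sketch of that first call in TypeScript, hitting the ES REST API directly (it assumes a global fetch, Node 18+; the ES URL, index and query are placeholders):

```typescript
const ES = "http://localhost:9200"; // placeholder ES endpoint

// Submit an async search, waiting up to 1s for complete results.
// If ES finishes within the timeout, the response is final, exactly like a
// normal search; otherwise it contains an id to follow up with, plus any
// partial results already computed.
async function submitAsyncSearch(index: string, query: object) {
  const res = await fetch(
    `${ES}/${index}/_async_search?wait_for_completion_timeout=1s`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(query),
    }
  );
  const body = await res.json();
  // body.id is only present while the search is not complete,
  // body.is_running / body.is_partial tell you what you got,
  // body.response holds the (partial or final) search results.
  return body;
}
```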

At first, I implemented it so:

  1. Async search + timeout
  2. Wait 250ms after partial results
  3. Call follow-up
  4. Wait 250ms

But this method may deliver the results later than they are actually available, which is bad for a real-time UI like Spider.

By using timeouts in a clever way, you combine partial results and an ASAP response:

  1. Async search + timeout
  2. Call follow-up + timeout (no wait)

With this usage, ES gives you the results as soon as the query is over, and you can offer an incremental loading experience to your users.
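A sketch of that chaining, building on the submitAsyncSearch function above (the 1s timeouts match what Spider uses; the rendering callback is a placeholder):

```typescript
// Chain follow-up calls, each with its own wait_for_completion_timeout, so the
// UI renders partial results as they come in and receives the final result as
// soon as it is ready, with no artificial polling delay in between.
async function asyncSearchWithProgress(
  index: string,
  query: object,
  onResults: (response: unknown, isPartial: boolean) => void
) {
  let body = await submitAsyncSearch(index, query);
  onResults(body.response, body.is_partial);

  while (body.is_running) {
    const res = await fetch(
      `${ES}/_async_search/${body.id}?wait_for_completion_timeout=1s`
    );
    body = await res.json();
    onResults(body.response, body.is_partial);
  }
  return body.response;
}
```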

Implementation results

I implemented async search in the UI for the following ‘long’ queries:

  • Timeline loading
  • Timeline quality loading
  • Network map loading
  • DNS records loading

On all 3 views (HTTP, TCP, Packet), with 1s timeouts.

The effect is visible only when loading the full period with no filters. Indeed, other queries run way below 1s 😉
On automatic refresh, you won’t see the change. The queries are not ‘magically’ faster: a time-based aggregation over 30 million communications still takes… 4s :-O

As it is still new, I may have missed some stuff, so you may deactivate it in the Settings panel, to get back to normal search:

 

Share your feelings with me! 🙂


Switching deletes in Redis from DEL to UNLINK. Yet another quick win :)

I discovered yesterday on the web that Redis UNLINK gives much faster responses than DEL: it removes the keys immediately but reclaims their memory asynchronously in a background thread, whereas DEL blocks until the memory is freed.

Using Spider bulk edit tools, I updated all services using DEL in their DAO layer or in Lua in no time, committed, built and pushed the images, redeployed, and immediately saw the benefits!
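For illustration, the change in a DAO is as small as this (a sketch with ioredis; the key names and function are made up):

```typescript
import Redis from "ioredis";

const redis = new Redis();

async function deleteSession(sessionId: string) {
  // Before: synchronous delete.
  // await redis.del(`parsing:tcp:${sessionId}`, `parsing:http:${sessionId}`);

  // After: UNLINK is a drop-in replacement (Redis >= 4.0); the keys disappear
  // immediately and the memory is reclaimed lazily. Inside Lua scripts, the
  // equivalent change is redis.call('UNLINK', key).
  await redis.unlink(`parsing:tcp:${sessionId}`, `parsing:http:${sessionId}`);
}
```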

I had a couple of ‘slow’ bulk requests that were taking more than 15 ms, and now, all are below 6ms 🙂

Speedy Redis!

 

Upgrade to ES 7.7

Elastic stack 7.7 is out!

I already upgraded Spider Beats, Kibana and Elasticsearch 😉

Speed is there as usual; it even seems faster than before, as promised by the Elastic team :). What is most interesting is the new async search! I’ll try it ASAP and keep you posted!

My goal is to use it on the big aggregations on the UI (Timeline and map) to get faster answers and progressive loading!

Using Docker logs

I recently switched to using Docker volumes for logs, to reduce installation steps and coupling to the infrastructure. I created a single shared volume per stack.

However, this meant that logs from stopped containers were never removed… because log rotation was no longer doing its work.

I then understood better why the 12-factor app practices recommend logging only to STDOUT, letting Docker handle log storage and removal. So I decided to adapt:

  • Stop JSON logging to files
  • Replace human-readable logging on STDOUT with JSON logging (sketched after this list)
  • Switch the Filebeat configuration to container mode
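Here is a minimal sketch of the logging side, assuming a pino-style JSON logger (the fields are illustrative):

```typescript
import pino from "pino";

// pino writes one JSON object per line to STDOUT by default, which is exactly
// what the Docker logging driver and Filebeat's container input expect.
const logger = pino({ base: { service: "parsing" } }); // service name is illustrative

logger.info({ whispererId: "abc", sessions: 42 }, "Tcp sessions parsed");
// => {"level":30,"time":...,"service":"parsing","whispererId":"abc","sessions":42,"msg":"Tcp sessions parsed"}
```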

At the same time, I benefited from this change in many ways:

  • Traefik, Metricbeat and Filebeat logs are also captured
  • Elasticsearch and Kibana logs are captured on dev
  • All logs have been switched to JSON
  • Filebeat allows enriching logs with Docker metadata, making it easy to know the origin of each log
  • Filebeat processors and scripting allow reshaping the logs into a common format for all sources 🙂 Thanks Elastic devs!

It is all in place and deployed! More observability than ever.

Filters as tags

Idea

I got the idea when working on U/X design for Flowbird products:

  • How to be more ‘user friendly’ with filters and queries
  • How to show all active filters in a ‘nice’ way

I figured out that I could avoid putting all the filters in the UI query component, and instead display each filter as a tag in the UI.
This even brought the possibility of adding modification options:

  • Deactivate the filter
  • Invert it
  • And of course, edit the filter in the query
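Conceptually, each tag wraps a filter together with its display and state. A hypothetical sketch of the structure (not Spider’s actual model):

```typescript
// Illustrative structure behind a filter tag.
interface FilterTag {
  field: string;     // e.g. "status"
  label: string;     // functional label displayed on the tag, e.g. "Status"
  value: string;     // displayed value, e.g. "404"
  active: boolean;   // deactivated tags stay visible but are not applied
  inverted: boolean; // include vs exclude
}

// The query sent to the API is rebuilt from the active tags only
// (the query syntax here is illustrative).
function buildQuery(tags: FilterTag[]): string {
  return tags
    .filter((t) => t.active)
    .map((t) => `${t.inverted ? "NOT " : ""}${t.field}:${t.value}`)
    .join(" AND ");
}
```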

Outcome

After much refactoring and UI development, here is the result:

  • Filters are displayed as tags at the top left of the screen.
  • Filters are displayed with labels and values that are as ‘functional’ as possible

  • Filters can be removed from the set with their close button

  • Filters can be deactivated, inverted (included or excluded), or manually edited
  • When saving a query, all filters and tags are saved with the query

  • A loaded query can be broken down again into its original filters for editing

The feeling is really great.
It is easy to load a query as default filters, then search, drill down to the selected items. Then cancel some filters and search for something else.

Play with it, tell me what you think! 🙂

Menu disposition change

Since the free search is now less important – it is a filter like any other – the save/load query, undo and clear filters (new icon) buttons have been moved to a dedicated toolbar.

Improving GUI loading U/X

Hi,

I recently spent a couple of hours improving the GUI loading U/X.
The first queries were long when no zoom was active on the timeline.

Step 1 – Splitting ES content

I thought this was mostly related to the amount of data Elasticsearch had to load to do its queries, because the HTTP content is stored in the same index as all the metadata.

So I split this index: metadata and content are now stored separately, and merged only when requesting a single HTTP communication, or when exporting.
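A sketch of that on-demand merge, with hypothetical index names (http-meta and http-content) and direct REST calls for brevity:

```typescript
const ES = "http://localhost:9200"; // placeholder ES endpoint

// List views and aggregations only ever touch the metadata index; the heavy
// request/response bodies stay in the content index and are merged here,
// only when a single HTTP communication is opened or exported.
async function getHttpCommunication(id: string) {
  const [meta, content] = await Promise.all([
    fetch(`${ES}/http-meta/_doc/${id}`).then((r) => r.json()),
    fetch(`${ES}/http-content/_doc/${id}`).then((r) => r.json()),
  ]);
  return { ...meta._source, ...content._source };
}
```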


This divided the size of the index by almost 2, and cost only one new poller and 2 new indices in the architecture.

 

The changes were good: no more timeouts on page loading. But still, it took 11s for 7 days of SIT1…

So, I looked closer

Step 2 – Avoiding the duplicated load

In fact, when loading the UI without zoom, the search queries for the timeline were executed twice!

It took me a while to figure out. I did some refactoring and fixed 3 places that could cause the issue, one of them being in the timeline itself.
And it is fixed 🙂 !!

The number of queries needed to load the page has been reduced, and the speed is much better! From 11s, we are down to 4-6s.
That is still significant, but considering that it is running its aggregations in parallel over more than 350 million records for 7 days of SIT1… it is great :oO
Especially on a 2-core AWS M5 server.

Replacing ElasticSearch Rollups

ES rollups are great … but slow!

They allow you to build constant aggregates of parts of your index and save them for later search.
Searching on rollups can be combined with searching on the original index at the same time. This offers great flexibility… but comes at a cost.

Rollup searches are slow 🙁

Context

I had already removed two rollups from Spider because I needed more flexibility in the way they were built: I needed the ability to update a previously rolled-up result, which is not a feature of rollups.

So, I implemented live aggregation of two kinds of data: TCP parsing status and HTTP parsing status.
Both statuses are pre-aggregated at one-minute intervals and stored in their own rotating indices.
It required extra development, extra pollers and new indices, but it works great, with great speed.

At that time, I decided to let the last rollup be, because it was aggregating Whisperer capture status information, which cannot be updated.

However, I noticed through monitoring that searches on this rollup were slow: around 300ms! For an index of a few megabytes!!
Really not what you expect from Elasticsearch.

This rollup search was used to gather part of the timeline quality line information, and was generating some timeouts in the worst cases.

Rework

I decided to do the same as for the previous rollups:

  • A new script in Redis, run when saving whisperers’ status, that aggregates the capture status at one-minute intervals (sketched after this list)
  • A new poller configuration to poll the minute-based aggregated data and store it in Elasticsearch
  • A new index to store the aggregated resource, with its own ILM and rotated indices
  • And all of this integrated into the monitoring before release. Which helps a lot!
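Here is a minimal sketch of the aggregation step, in TypeScript with an embedded Lua script; key names and fields are illustrative, not Spider’s actual schema:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// The Lua script keeps the read-modify-write atomic on the Redis side:
// each capture status update increments counters in a one-minute bucket,
// which a poller later drains into its own Elasticsearch index.
const aggregateScript = `
  local bucket = KEYS[1]
  redis.call('HINCRBY', bucket, 'packets', ARGV[1])
  redis.call('HINCRBY', bucket, 'bytes', ARGV[2])
  redis.call('EXPIRE', bucket, 86400) -- safety TTL; the poller reads it long before
  return true
`;

async function saveCaptureStatus(whispererId: string, packets: number, bytes: number) {
  const minute = new Date().toISOString().slice(0, 16); // e.g. "2020-05-24T10:21"
  const bucket = `capture-status:agg:${whispererId}:${minute}`;
  await redis.eval(aggregateScript, 1, bucket, packets, bytes);
}
```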

All this was deployed seamlessly with the automated Setup 🙂
And the result lives up to expectations!!

  • No more rollup job searching the index every second (still good to have)
  • An optimised process with Redis
  • And much faster searches: 70ms on average, an improvement of more than 75%, over a week of measurements.

Conclusion

Elasticsearch rollups are great to validate that pre-aggregating data speeds up queries, and to work out how to aggregate it.
But the current implementation of rollup search is slow, and you had better:

  • Either use a plain search on the rollup index (and not a rollup search)
    • But then you cannot benefit from searching both the rolled-up index and the source index at once
  • Or implement your own aggregation process, and optimise it for your use case. Which I did.

Cheers
Thibaut

Setup improvement: Manage system versions in installation and update.

System versions

In order to facilitate Spider setup and updates, I introduced system versions. They allow external installations to stay on a stable version for some time.

Generation

On request, a script generates a system version by tagging together:

  • Docker images of Services, UI and Setup scripts
  • Configuration templates
  • Indices templates

Using them

Then, the Makefile scripts of the infrastructure repository allow you to:

  • List available system versions
  • Install or upgrade to a particular version

This is only the beginning for now, but setups are getting easier and easier 🙂

Log Codes are finally there :)

Log codes

I enforce them in all systems I have helped create in my professional work… but completely forgot them when starting Spider. And technical debt crept in…

What? Error codes! Or more specifically Log codes.

And not having them was painful:

  • Log codes help to quickly group logs with the same meaning or cause
  • This is very useful:
    • To discard them
    • To analyse a couple of them and check whether there are other errors
    • To group common issues from several services as one (e.g. an infrastructure issue)

Spider implementation

So, during code refactoring,

  • I added a log code to all Spider logs at Info, Warn, Error and Fatal levels (see the sketch after this list).
  • Pattern: [Service code]-[Feature]-001+
  • The service code is replaced by XXX when the log is in common code
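As an illustration of a call site, with example service and feature codes (not Spider’s actual ones):

```typescript
import pino from "pino";

const logger = pino();

// Log with a structured code following the [Service code]-[Feature]-NNN pattern,
// so monitoring can group and count logs by the "code" field.
function logWithCode(
  level: "info" | "warn" | "error" | "fatal",
  code: string,
  msg: string,
  extra: object = {}
) {
  logger[level]({ code, ...extra }, msg);
}

logWithCode("warn", "PARS-TCP-003", "Tcp session dropped: missing packets", { sessionId: "abc" });
```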

I then added a new grid in monitoring that shows a quick, compact analysis report on those logs.

And at a glance, I’m able to see what happened and whether anything serious from the day before needs to be looked at 😉
Handy, isn’t it? I like it much!

Demo

For instance, today, at 11h21:

  • Many errors were generated during the parsing process
  • At the same time, 25 service instances reconnected to Elasticsearch and Redis at once…
  • This looks like a server restart

Indeed, looking at the servers’ stats,

  • Spider3 server restarted, and its services were reallocated to other nodes:
    • Huge CPU decrease and free RAM increase on Spider 3
    • CPU increase and memory decrease on other nodes

This is confirmed on the grids.

Before

After

Why this restart? Who knows? Some AWS issue maybe…
Anyway, who cares 😉 !
The system handled it gracefully.

  • I checked on the server itself… but the uptime went back days. So the server did not restart…
  • I then checked the Docker daemon logs (journalctl -u docker.service):

  • msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
  • msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/ag
  • msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=75fnw7bp17gcyt
  • msg="manager selected by agent for new session: { }" module=node/agent node.id=75fnw7bp17gcytn91d0apaq5t
  • level=info msg="waiting 50.073299ms before registering session"

Looks like this node had an issue communicating with the Swarm manager for some time, and the services were reallocated… Sounds like a network partition on AWS.

They say you have to make sure your architecture is resilient in the cloud… Done 😉

However, I still need to rebalance the services manually… for now 🙁

Back Office code refactoring and configuration loading

Backoffice code refactoring

I recently spent a couple of weeks refactoring the Back Office code.

Nothing is visible on the UI, but maintenance and evolution of the microservices will be much easier:

  • Redis and Elasticsearch DAO
  • Configuration loading process
  • API exposure code
  • Circuit breakers initialization

All these parts remain very flexible and can be adapted to the needs of the various services, but the common code has been extracted into shared files used across the services 🙂

No regression noticed for now =)

Configuration reloading

There was an issue with the current setup procedure:

  • The whole upgrade process has been automated at the beginning of the year
  • But you had to look at the updated files after an update or configuration change to know which services needed restarting.
  • Or you had to stop and restart the full cluster.

Now, each service monitors its own configuration for changes every 2 minutes (configurable). And if a configuration change is detected, the service restarts gracefully 🙂
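A minimal sketch of that watch loop, assuming the configuration is a file mounted into the container (the path and the restart handling are illustrative):

```typescript
import { createHash } from "crypto";
import { readFile } from "fs/promises";

const CONFIG_PATH = "/run/config/service.json"; // illustrative path
const CHECK_INTERVAL_MS = 2 * 60 * 1000;        // every 2 minutes

let lastHash: string | undefined;

// Hash the configuration file periodically; when it changes, exit so the
// orchestrator restarts the service with the new configuration. The real
// service would also close its connections cleanly before exiting.
async function checkConfig() {
  const content = await readFile(CONFIG_PATH);
  const hash = createHash("sha256").update(content).digest("hex");
  if (lastHash && hash !== lastHash) {
    console.log(JSON.stringify({ level: "info", msg: "Configuration changed, restarting" }));
    process.exit(0);
  }
  lastHash = hash;
}

setInterval(checkConfig, CHECK_INTERVAL_MS);
```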

Sadly, it does not prevent errors from happening, since the restart should be handled in the gateway first: the latter keeps trying to call the removed service for a little while.

Nevertheless, upgrading the platform is easier than ever 😀 !