
Monitoring - Status summary dashboard

· 3 min read

Description

This dashboard provides a visual summary of the state of the full cluster at any time.

Screenshot

Content

  • Summary of applicative status (top left-hand corner):
    • Processing speed indicator: number of packets and TCP sessions received per minute
    • Parsing errors: percentage of parsing errors
  • Summary of servers status (bottom left-hand corner):
    • ES status: short status of the ES servers - CPU, RAM and heap usage
    • Nodes status: short status of the cluster servers - CPU and RAM
  • A network map of all Spider microservices and the communications between them
    • The map shows a summary of Whisperers status, UI usage status, Datastores status and Applicative cluster status:
      • Number of connected Whisperers is shown on the left Whisperers node
      • Number of connected Users is shown on the right UI node
      • Circuit breakers status is represented by the arrows
        • Green arrows when no errors, orange when errors
        • When hovered, the arrows display the speed and average response time
      • Services status is represented by the nodes (see the styling sketch after this list)
        • Blue nodes when no errors, red when errors
        • Node size depends on the CPU usage
        • Node color depends on whether the node is visible or not
  • 4 different views to avoid seeing too many arrows at once:
    • Query path
    • Command path
    • Upload path
    • Monitoring path
  • Tooltips detailing the status of each Node / Link
    • Can be pinned to compare different periods easily
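As a rough illustration of the node styling rules above, here is a minimal sketch in TypeScript of how a node's visual attributes could be derived from its status. The `NodeStatus` shape and the `styleNode` function are assumptions for illustration, not Spider's actual code.

```typescript
// Hypothetical sketch: NodeStatus and styleNode are illustrative, not Spider's code.
interface NodeStatus {
  errors: number;      // errors logged by the service over the period
  cpuPercent: number;  // CPU usage of the service over the period
  visible: boolean;    // whether the node reported data over the period
}

function styleNode(status: NodeStatus): { fill: string; radius: number; opacity: number } {
  return {
    fill: status.errors > 0 ? 'red' : 'blue',  // blue when no errors, red when errors
    radius: 10 + status.cpuPercent * 0.5,      // node size grows with CPU usage
    opacity: status.visible ? 1 : 0.3,         // dim nodes that are not visible
  };
}
```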

Example of tooltips:

For Whisperers:

  • Count of connected Whisperers / total
  • Count of Whisperers with > 10% CPU: not normal
  • Count of Whisperers with > 150 MB RAM: not normal
  • Count of Whisperers with PCAP overflow: means that the network is too fast for them; they need to be better configured
  • Count of Whisperers with queues overflow: means that the Spider servers are not scaled enough
  • Total count of requests / min from Whisperers to the Spider servers (see the sketch below)
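As an illustration, here is a minimal sketch of how these counters could be computed from individual Whisperer statuses. The `WhispererStatus` shape, its field names and the thresholds are assumptions derived from the list above, not Spider's actual API.

```typescript
// Hypothetical sketch: the WhispererStatus shape and field names are assumptions.
interface WhispererStatus {
  connected: boolean;
  cpuPercent: number;       // CPU used by the Whisperer process
  ramMb: number;            // RAM used by the Whisperer process, in MB
  pcapOverflow: boolean;    // the network is too fast for the Whisperer
  queueOverflow: boolean;   // the Spider servers are not scaled enough
  requestsPerMin: number;   // requests sent to the Spider servers per minute
}

function summarizeWhisperers(statuses: WhispererStatus[]) {
  const connected = statuses.filter(s => s.connected);
  return {
    connected: `${connected.length} / ${statuses.length}`,
    highCpu: connected.filter(s => s.cpuPercent > 10).length,   // not normal
    highRam: connected.filter(s => s.ramMb > 150).length,       // not normal
    pcapOverflows: connected.filter(s => s.pcapOverflow).length,
    queueOverflows: connected.filter(s => s.queueOverflow).length,
    requestsPerMin: connected.reduce((sum, s) => sum + s.requestsPerMin, 0),
  };
}
```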

For Services:

  • Count of replicas
  • Total CPU usage on the cluster
  • Average RAM of a replica
  • Count of Errors and Warnings in the logs
  • Count of requests In and Out

For Pollers:

  • Count of replicas
  • Total CPU usage on the cluster
  • Average RAM of a replica
  • Count of Errors and Warnings in the logs
  • Average count of items waiting to be polled
  • Count of requests In and Out

For Redis DB:

  • Average CPU of the Redis instance (may be used for several DB)
  • Average RAM of the Redis instance
  • Average count of items in this Redis DB
  • Total count of requests In

For ES index:

  • Average CPU used by this index over the period
  • Size of the index
  • Variation of size, count of items and count of deleted documents over the period
  • Speed of indexing, getting and searching
  • Total count of requests In

For Browsers:

  • Number of connected Users
  • Number of Users having clicked during the period ;)
  • Average duration of a user session

For each link between nodes:

  • List of requests made on the link with:
    • Count of requests / min
    • Average latency
    • Max 90% latency
    • Count of errors
  • On hover, summary info is shown:

 

Last but not least, UI services... are not monitored... yet ;-)

Spider self monitoring

· 2 min read

Spider being built with microservices, I needed good monitoring dashboards to understand what was going on, and to troubleshoot issues.

I first started with custom Kibana dashboards, but I couldn't get all the information or representations I needed. And it was tough to get the cluster status at a glance.

So I designed and implemented my own monitoring dashboards. There are 6 dashboards to monitor Spider:

  • Status summary - (link) - Summary visual picture of the full cluster at any time
  • Applicative cluster status - (link) - Status of the applicative processing of Spider: speed, quality of parsing, circuit breakers, logs
  • Servers status - (link) - Status of the servers hosting the cluster and its datastores: CPU, RAM...
  • Datastores status - (link) - Status of the datastores: response time, speed, size, and circuit breakers
  • Whisperers status - (link) - Status of the whisperers: state, uploaded data, quality of parsing, cpu, ram, queues, circuit breakers...
  • UI usage status - (link) - History and statistics of UI connections...

Rights

To access Spider monitoring, you need specific rights:

  • Administrative monitoring: access to all dashboards, for people managing the platform.
  • Whisperers monitoring: access to Whisperers status dashboards, for people managing a set of Whisperers.
    • It allows checking that Whisperers are running fine when we think we're missing data.
    • In particular, through the KPI about parsing quality.

Timeline

All dashboards include the same Spider timeline as for the analysis UI, with zoom, pan, selection... and different flavors. By default, the TCP parsing status flavor is displayed, as it is the main quality indicator of Spider health.

A specific timeline, with a different lifespan and flavors, exists for the UI usage status, as the data displayed there dates from the beginning of monitoring.

Charts visual synchronization

The different charts of each dashboard are synchronized and rendered with the same x axis so that it is easy to correlate the information of each graph.

When moving the mouse over the charts, a tooltip is displayed on each chart with its data at that point in time.

Details

The content of each dashboard is explained in the linked pages.

Timescale on sequence diagrams

· One min read

Sequence diagrams are useful to understand requests fan-out and to see where time is taken and where to optimize the process.

BUT they had one flaw: requests and responses were drawn sequentially, so you couldn't easily see where time was lost.

So I made an improvement: you can now choose different time scales to highlight the gaps between calls (see the sketch after the list):

  • Linear time scale: the visual gap increases linearly with the time difference between calls
  • Logarithmic time scale: differences between short calls are emphasized
  • Squared time scale: long delays are emphasized
  • Sequential time scale: as before
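To make the idea concrete, here is a minimal sketch of how the visual gap between two consecutive calls could be derived from the elapsed time for each scale. Function names and constants are illustrative assumptions, not Spider's actual implementation.

```typescript
// Illustrative sketch only: names and constants are assumptions, not Spider's code.
type TimeScale = 'linear' | 'logarithmic' | 'squared' | 'sequential';

// Converts the elapsed time between two consecutive calls into a visual gap (in pixels).
function visualGap(deltaMs: number, scale: TimeScale, pxPerMs = 0.1): number {
  switch (scale) {
    case 'linear':      return deltaMs * pxPerMs;        // gap grows linearly with elapsed time
    case 'logarithmic': return Math.log1p(deltaMs) * 10; // stresses differences between short calls
    case 'squared':     return (deltaMs * pxPerMs) ** 2; // stresses long delays
    default:            return 20;                       // 'sequential': constant gap, as before
  }
}
```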

Example:

Architecture upgrade: splitting a monolith

· 2 min read

One service in the Spider back-end had grown too much. It included:

  • Whisperer configurations
  • Users rights on this Whisperer
  • Whisperer current status
  • Whisperer status history
  • Whisperer hosts resolving

The last two were in different indices, but the first three 'data aggregates' were inside the same resource/document.

This resulted in a complex service to update, in conflicts in Optimistic Concurrency management, and in slow response times due to the size of the resources.

It needed splitting.

I first tried to split it logically from the resource perspective, extracting the configuration as it is the most stable data... But this was a bad idea: splitting configuration and rights complicated a lot the access and usage of the resources from the UI and the other services that needed the information!

So I figured out I had to split the monolith from the client perspective.

As a result, I extracted from the first module:

  • An operating service to process status input and to store both the status history and the current status
  • An operating service to process hosts input and to store them
  • A configuration service to manage configuration and rights

This was much better. But I still had slowness because all those modules were accessing and storing to ES directly. So I switched to saving in Redis and configured pollers to serialize the data to ES. Everything was already available to do this easily from the saving processes of Packets, Sessions and Http communications. I also added a pure cache for Whisperer configs resources (see the sketch after the list):

  • On save, save in Redis and in ES
  • On read, read from Redis; if not found, read from ES and save in Redis
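Here is a minimal sketch of that cache pattern, assuming an ioredis-like client and a hypothetical `esStore` wrapper around Elasticsearch; the key names, the TTL and all identifiers are assumptions, not Spider's actual code.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Hypothetical durable store backed by Elasticsearch; the interface is illustrative.
interface DurableStore {
  save(id: string, doc: object): Promise<void>;
  load(id: string): Promise<object | null>;
}
declare const esStore: DurableStore;

const TTL_SECONDS = 3600; // assumed cache lifetime, not stated in the post

// On save: write to Redis (cache) and to ES (durable store)
async function saveConfig(id: string, config: object): Promise<void> {
  await Promise.all([
    redis.set(`whispererConfig:${id}`, JSON.stringify(config), 'EX', TTL_SECONDS),
    esStore.save(id, config),
  ]);
}

// On read: try Redis first; on a miss, fall back to ES and repopulate the cache
async function getConfig(id: string): Promise<object | null> {
  const cached = await redis.get(`whispererConfig:${id}`);
  if (cached) return JSON.parse(cached);

  const config = await esStore.load(id);
  if (config) {
    await redis.set(`whispererConfig:${id}`, JSON.stringify(config), 'EX', TTL_SECONDS);
  }
  return config;
}
```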

All in all, requests from Whisperer clients to save Status or Hosts went from 200 ms+ to... 50 and 15 ms ;-) Yeah !!

Capture improvement

· One min read

New options have been added to capture settings:

  • Wait for resolving:
    • Don't capture packets to/from a host until its name has been resolved by the DNS.
    • This allows ignoring ALL packets from the 'Hosts to ignore' list. And it avoids, for instance, spikes in capture when the first calls to a UI are made.
  • Track unresolved IP
    • Capture (or not) packets from hosts whose names could not be resolved by the DNS (see the sketch below).
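For illustration, here is a minimal sketch of how these two options could drive the capture decision for a packet; option and field names are assumptions, not the Whisperer's actual configuration keys.

```typescript
// Hypothetical sketch: the settings and resolution shapes are assumptions.
interface CaptureSettings {
  waitForResolving: boolean;   // skip packets until the host name is resolved
  trackUnresolvedIp: boolean;  // capture packets from hosts the DNS cannot resolve
  hostsToIgnore: string[];     // host names that must never be captured
}

type Resolution =
  | { state: 'pending' }                     // DNS lookup not finished yet
  | { state: 'resolved'; hostname: string }  // DNS returned a name
  | { state: 'unresolved' };                 // DNS could not resolve the IP

function shouldCapture(resolution: Resolution, settings: CaptureSettings): boolean {
  if (resolution.state === 'pending') {
    // 'Wait for resolving': hold packets until the name is known, so traffic from
    // ignored hosts is never captured, even transiently
    return !settings.waitForResolving;
  }
  if (resolution.state === 'unresolved') {
    // 'Track unresolved IP': capture (or not) hosts the DNS cannot resolve
    return settings.trackUnresolvedIp;
  }
  // Resolved: capture unless the host is in the ignore list
  return !settings.hostsToIgnore.includes(resolution.hostname);
}
```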

New Circuit Breakers on Whisperer

· One min read

Whisperers have been tracking their CPU and RAM usage for a long time. Now they check these metrics, and Whisperers can be configured to stop capturing when usage is above a defined threshold.

This limits the impact of the Whisperers on the hosts they capture on during spikes of traffic. Of course, you lose monitoring data... but you allow your system to cope with the surge in traffic.

By default, Whisperers check their CPU and RAM usage every 20s. Once opened, the circuit breaker will stop capture for the next 20s and check again afterwards.
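Here is a minimal sketch of such a resource-based circuit breaker; the thresholds and all names are illustrative assumptions, only the 20 s default period comes from the description above.

```typescript
// Illustrative sketch: thresholds and names are assumptions, not the Whisperer's code.
interface BreakerConfig {
  cpuThresholdPercent: number; // e.g. 80
  ramThresholdMb: number;      // e.g. 200
  checkIntervalMs: number;     // 20 000 ms by default
}

let captureEnabled = true;

// Open the breaker (stop capture) when usage exceeds a threshold;
// close it again (resume capture) once usage is back under the thresholds.
function checkCircuitBreaker(cpuPercent: number, ramMb: number, cfg: BreakerConfig): void {
  captureEnabled = cpuPercent <= cfg.cpuThresholdPercent && ramMb <= cfg.ramThresholdMb;
}

// Hypothetical samplers for the Whisperer's own resource usage.
declare function sampleCpuPercent(): number;
declare function sampleRamMb(): number;

const cfg: BreakerConfig = { cpuThresholdPercent: 80, ramThresholdMb: 200, checkIntervalMs: 20_000 };
// At every check interval, re-evaluate the breaker; while it is open, no packets are captured.
setInterval(() => checkCircuitBreaker(sampleCpuPercent(), sampleRamMb(), cfg), cfg.checkIntervalMs);
```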

The circuit breakers are configured in Capture settings:

Select all on Grid

· One min read

Small improvement, but that can save minutes :)

As Remi asked for it, it is now possible to 'select all' records in the loaded grid. It works like this (a sketch of the behaviour follows at the end of this post):

  • It toggles all checkboxes of the grid:
  • If a record was selected, it is not any more
  • If a record was not selected, it is selected now

This allows these use cases:

  • Select all but these two (invert selection)
  • Select all
  • Unselect all

Attention: this does not affect selected records that are not in the current grid:

  • If you had 5 records in selection, change the time window, and click 'select all' to select the 20 records displayed... you'll end up with 25 records in selection.

Note that selection is limited to 100 records, to limit the size (and time) of the export.
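As a recap, here is a minimal sketch of the behaviour described in this post: toggle every record currently loaded in the grid, keep records selected outside the grid, and respect the 100-record cap. Names are illustrative, not Spider's actual code.

```typescript
// Illustrative sketch of the 'select all' toggle; names are assumptions.
const MAX_SELECTION = 100; // cap to limit the size (and time) of the export

function selectAllOnGrid(selection: Set<string>, gridRecordIds: string[]): Set<string> {
  const next = new Set(selection);
  for (const id of gridRecordIds) {
    if (next.has(id)) {
      next.delete(id);              // was selected: unselect it
    } else if (next.size < MAX_SELECTION) {
      next.add(id);                 // was not selected: select it, up to the cap
    }
  }
  return next;                      // records selected outside the grid are kept
}
```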

Upgrade to ES 6.4.1

· One min read

Spider has been upgraded to ElasticSearch 6.4.1 !!

It now benefits from APM access, SQL queries and so on :-) I'll tell you more later.

Import / Export Http Communications

· 2 min read

On request from Remi L., I increased the priority of this feature last week, and it is released today :-)

It is now possible to:

Export Http communications

  • Export a selection of Http communications, including:

Beware though: if the request or response body was encoded (gzipped or chunked for instance), it is still encoded, as transmitted on the wire.

Import them back

  • Import back this export to another Whisperer, of UPLOAD type.
    • It is like magic: you get back your saved communications and can analyze them in peace.
    • You may import many exports at once by selecting many files or by drag-and-dropping them on the upload icon.
    • Beware though: if you import from different environments over the same time window to the same Whisperer... you may have some IP clashes, and some strange results ;)

The first identified use cases

  • Being able to save a selection of requests performed by an integrated client to check non regression later on.
  • Being able to compare different clients integration.
  • Being able to export client integrations from production to one's own environment, in order to create automated tests from them.

This feature, linked to the 'Diff' feature previously released, adds even more power to Spider as a killer tool for integration :-)

Interested?

Anybody can export. However, you need to have your own Whisperer of UPLOAD type to be able to import back.

  • I created one for Remi for tests; ask me for one if you want.
  • For now, I'd rather not give everybody the right to create Whisperers.
  • Once created, you'll have all the configuration options for them, and will be able to share them with others (your team).
    • But only the owner of the Whisperer can upload to it.

Cheers, Thibaut

Merging clients 'replicas' on map

· One min read

Network map got a small improvement:

  • Now, clients with similar (or the same) identification are merged on the map, as for server replicas.
    • This reduces the amount of 'noise' on the map, and shows a single client connecting with several IPs (many stations, many devices, or one moving device) as one single client.

This behavior is active when the Merge option is active.

NB: This feature is not present on the sequence diagram.