Monitoring - Whisperers status dashboard

· 5 min read

Description

This dashboard provides the status of Whisperer clients: state, uploaded data, parsing quality, CPU, RAM, queues, circuit breakers…

Screenshot

Content

Whisperer status (chart)

  • Tracks the status of all Whisperers connected to the server:
    • Starting
    • Recording
    • Stopped
    • Invalid_Config
    • Internal_Error
    • Server_Down (when they cannot fetch their configuration)

Whisperer uploads to server (chart)

  • Tracks the data uploaded from the Whisperers to the server, in MB

Whisperers current status (grid)

  • Lists the current session status sent by all Whisperers:
    • Whisperer start, host monitored and uptime
    • Session start and duration
    • CPU, RAM
    • Payload sent and errors
  • Common Spider features on grids:
    • Allows opening the status record in the detail panel
    • Allows comparing items
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring (see the example below)
    • Many fields to display / hide
    • Sorting on columns
    • Infinite scroll
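For illustration, the search field accepts Elasticsearch query string expressions. A couple of hypothetical queries (the field names here are made up; the real ones depend on the status mapping):

```
status:Recording AND cpu:>10
host.name:web-* AND NOT errors:0
```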

Whisperers config and parsing status (grid)

  • Lists Whisperers and their parsing status over the selected period:
    • Sent sessions, amount and percentage of parsing errors
    • Parsed HTTP communications and missing parts
  • Common Spider features on grids:
    • Allows comparing items (config and stats merged)
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring
      • Only on the Whisperer config
    • Many fields to display / hide
    • Sorting on columns (from config)

Whisperer CPU usage (chart)

  • Tracks the CPU usage of all connected Whisperers
  • Should be low ;)
  • The more packets are captured and parsed, the higher the CPU usage:
    • Captured packets can be limited by a PCAP filter (see the example below)
    • Parsed packets can be limited by hostname blacklisting in the configuration
    • A circuit breaker on CPU usage can be set to pause Whisperers under too high a load
  • Typical usage: between 3% and 10%
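For instance, PCAP filters use the standard BPF syntax. A hypothetical filter that keeps only HTTP traffic on two ports and excludes a chatty host could be:

```
tcp and (port 80 or port 8080) and not host 10.0.0.42
```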

Whisperer used RAM (chart)

  • Tracks the RAM usage of all connected Whisperers
  • Typical usage:
    • 115 MB when capturing and the server is responding
    • 50 MB when stopped

Whisperer queue length (chart)

  • Tracks the size of the sending queues of the Whisperers:
    • Packets and Tcpsessions
  • When a Whisperer has too many requests to send to the server, they are pushed to a queue, waiting for the next sending slot.
  • When items sit in the queue, it means either:
    • The server is getting slow and has issues
    • The Whisperer is under high pressure from packets to capture

Queues overflow (chart)

  • Tracks the size of the queue overflows:
    • Packets and Tcpsessions
  • When a Whisperer has too many requests to send to the server, they are pushed to a queue, waiting for the next sending slot.
  • When the queue is full, the oldest items in the queue are discarded and never sent (sketched below).
    • This causes parsing issues and missing data (not sent)
  • It shouldn't happen if the Whisperers and servers are correctly scaled ;)
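As an illustration of this drop-oldest behavior, here is a minimal sketch of a bounded sending queue (hypothetical class, not the actual Whisperer code):

```ts
// Bounded sending queue that drops the oldest items on overflow.
class BoundedQueue<T> {
  private items: T[] = [];
  overflowCount = 0; // what the "Queues overflow" chart tracks

  constructor(private readonly maxSize: number) {}

  push(item: T): void {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // discard the oldest item: it will never be sent
      this.overflowCount++;
    }
    this.items.push(item);
  }

  // Called when a sending slot to the server becomes available
  drain(batchSize: number): T[] {
    return this.items.splice(0, batchSize);
  }
}
```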

Active circuit breakers (chart)

  • Tracks when Whisperers have active circuit breakers
  • When a Whisperer cannot connect to the server, or fails sending data (timeouts, mostly), a circuit breaker opens and the Whisperer stops trying for some time (sketched below).
    • Data is lost
  • This can happen when:
    • The host the Whisperer runs on is under heavy CPU load
    • The server is not scaled enough
    • The server is partially down
      • When the server is completely down, the Whisperer stops its capture and waits for it to come back up
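A minimal sketch of this open/retry cycle (timings and names are illustrative, not the actual implementation):

```ts
// After a failed upload, the breaker opens and uploads are skipped
// until the cool-down elapses; then one new attempt is made.
class UploadCircuitBreaker {
  private openedAt: number | null = null;

  constructor(private readonly coolDownMs: number) {}

  async send(upload: () => Promise<void>): Promise<void> {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.coolDownMs) {
        return; // breaker is open: the data for this slot is lost
      }
      this.openedAt = null; // cool-down elapsed: try again
    }
    try {
      await upload();
    } catch {
      this.openedAt = Date.now(); // open the breaker on timeout/error
    }
  }
}
```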

Whisperers status items (grid)

  • Lists all the statuses sent by Whisperers
  • Items are pre-filtered to those having errors
  • Common Spider features on grids:
    • Allows opening the status record in the detail panel
    • Allows comparing items
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring
    • Many fields to display / hide
    • Sorting on columns
    • Infinite scroll

Hosts items (grid)

  • Lists the host resources of Whisperers
  • Host resources track the name resolution of the hosts seen by Whisperers:
    • Start and stop of capture for each host
    • DNS names
    • Custom names set by users in the UI or by the parsing configuration
    • Position on the map (if fixed)
  • A host resource is updated at regular intervals, and a new one is created only when a host changes IP or DNS name (see the sketch below)
  • Common Spider features on grids:
    • Allows opening the host record in the detail panel
    • Allows comparing items
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring
    • Many fields to display / hide
    • Sorting on columns
    • Infinite scroll
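A minimal sketch of that update-or-create rule (hypothetical types, for illustration only):

```ts
interface HostResource {
  ip: string;
  dnsName: string;
  firstSeen: Date;
  lastSeen: Date;
}

// Refresh the current resource while IP and DNS name are stable;
// create a new resource when either changes.
function upsertHost(
  current: HostResource | undefined,
  seen: { ip: string; dnsName: string },
): HostResource {
  if (current && current.ip === seen.ip && current.dnsName === seen.dnsName) {
    return { ...current, lastSeen: new Date() }; // regular-interval update
  }
  return { ...seen, firstSeen: new Date(), lastSeen: new Date() }; // new record
}
```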

Hosts stats (grid)

  • Performs statistics on the host resources of each Whisperer over the period
  • If, over a couple of hours, a Whisperer has too many host records with a very short average duration, it means that:
    • Host names are not stable
      • For instance, Docker Swarm has a bug in the reverse DNS of hosts: the ID of the Docker container is often returned instead of the name of the service replica.
      • This can be worked around with Whisperer settings
    • Name resolution of IPs in the UI may fail
      • The UI limits its load to 99 host resources at once.
  • The grid has limited features: display only.

Monitoring - Applicative cluster status dashboard

· 3 min read

Description

This dashboard provides the status of the applicative processing of Spider: speed, quality of parsing, circuit breakers, logs...

Screenshot

Content

Processing speed (chart)

  • Evolution of the input speed from Whisperers, per minute
  • Packet and TCP inputs
  • HTTP output

Tcp parsing status (chart)

  • Quality of parsing, listing the different statuses: Waiting, Pending, Ok, Warning, Error
  • The less red, the better! :)
  • Errors can have many causes, but mainly CPU contention on clients or servers

Service Speed from Whisperers (chart)

  • Response time of the server endpoints, as seen from the Whisperer clients
  • The lower, the better
  • If too high, more server nodes and service replicas are needed

Service Speed between apps (chart)

  • Response time between nodes in the cluster
  • The lower, the better
  • The more stable, the better
  • If too high, more server nodes and service replicas are needed

Polling queued (chart)

  • Count of items in the Redis queues, waiting for serialization into ES (see the sketch below)
  • The more stable, the better
  • If increasing, more poller replicas are needed, or more ES indexing power
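For illustration, a poller essentially drains a Redis queue and bulk-indexes the batch into Elasticsearch. A minimal sketch with hypothetical queue and index names, using the ioredis and @elastic/elasticsearch clients:

```ts
import Redis from "ioredis";
import { Client } from "@elastic/elasticsearch";

const redis = new Redis();
const es = new Client({ node: "http://localhost:9200" });

// Pop a batch from a Redis list (LPOP with a count needs Redis >= 6.2)
// and bulk-index it into ES. Queue and index names are hypothetical.
async function pollOnce(queue = "queue:httpcoms", index = "httpcoms") {
  const batch = await redis.lpop(queue, 500);
  if (!batch || batch.length === 0) return;

  const operations = batch.flatMap((doc) => [
    { index: { _index: index } },
    JSON.parse(doc),
  ]);
  await es.bulk({ operations });
}
```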

Active circuit breakers (chart)

  • Count of opened circuit breakers on communications between services (not to the stores)
  • Nothing on the graph is the target

Circuit breakers items (grid)

  • Lists circuit breaker statuses over the period
  • Preconfigured to display only those between applications, and with errors
  • Common Spider features on grids:
    • Allows opening the status record in the detail panel
    • Allows comparing items
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring
    • Many fields to display / hide
    • Sorting on columns
    • Infinite scroll

Errors and Warnings in logs (charts)

  • Count of errors in logs, grouped by service
  • Count of warnings in logs, grouped by service
  • No items is the target

Log items (grid)

  • Lists the logs over the period
  • Preconfigured to display only those of Warning or Error levels
  • Common Spider features on grids:
    • Allows opening the logs in the detail panel
    • Allows comparing items
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring
    • Many fields to display / hide
    • Sorting on columns
    • Infinite scroll
  • Hyperlinks to Spider objects, depending on what is referenced in the logs:
    • Whisperers configurations
    • TcpSessions
    • HttpCommunications
    • Customers
    • Packets
    • and so on...
  • Opens logs in a specific detail panel with hyperlinks.

Monitoring - Datastores status dashboard

· 3 min read

Description

This dashboard provides the status of the Elasticsearch and Redis datastores: response time, speed, size, and circuit breakers.

Screenshot

Content

Redis response times (chart)

  • Tracks the response times of services and pollers calling Redis
  • ... Redis is sooo fast!

Elasticsearch response times (chart)

  • Tracks the response times of services and pollers calling Elasticsearch
  • Times are much longer than with Redis, but should stay below 1 s
  • Most of the long times are pollers saving data

Redis content (chart)

  • Tracks the evolution of data content in Redis memory
  • Must be stable

Elasticsearch content (chart)

  • Tracks the evolution of the data in Elasticsearch
  • Always goes up, except when purging ;-)

Redis size (chart)

  • Tracks the memory usage of Redis
  • See the servers status dashboard for more info

Elasticsearch size (chart)

  • Tracks the size of the Elasticsearch indices (aggregated for time-based indices)

Elasticsearch CPU load spread over indices (chart)

  • Tracks the CPU usage of Elasticsearch, spread across its indices
  • Strange that indexing seems low on CPU... to be checked
  • Allows spotting unexpected patterns

Elasticsearch index speed (chart)

  • Tracks the indexing speed on each index
  • Confirms the processing speed of the applicative cluster
    • NB: many packets are not saved in ES, to save space

Elasticsearch get speed (chart)

  • Tracks Elasticsearch direct document accesses (get)
  • Almost absent in Spider thanks to cache optimisations
  • This chart made it possible to track those usages and effectively optimise them ;)

Elasticsearch search speed (chart)

  • Tracks Elasticsearch searches
  • Almost absent in Spider thanks to cache optimisations
  • This chart made it possible to track those usages and effectively optimise them ;)
  • Searches are almost only used by the UIs

Active circuit breakers on Redis and ES (chart)

  • Tracks opened circuit breakers between services or pollers and Elasticsearch/Redis
  • CBs often open on pollers when the ES cluster is sized too small...
    • But pollers retry, so this is not an issue.
    • This can be checked in the logs, which list all the items that were not saved... items that you can nevertheless open in the Monitoring UI, because they were eventually saved.

Circuit breakers items (grid)

  • Lists circuit breaker statuses over the period
  • Preconfigured to display only those between applications and datastores, and with errors
  • Common Spider features on grids:
    • Allows opening the status record in the detail panel
    • Allows comparing items
    • Fully integrated search using the ES query string syntax, with autocompletion and syntax coloring
    • Many fields to display / hide
    • Sorting on columns
    • Infinite scroll

Monitoring - Servers status dashboard

· 3 min read

Description

This dashboard provides the status of the servers hosting the cluster and its datastores: CPU, RAM…

Screenshot

Content

Applicative nodes CPU usage (chart)

  • CPU usage of each node involved in the applicative cluster
  • Can be above 100% on multi-core nodes
  • Stability is key
  • The same usage on each node is preferred
  • Target is below 75% × number of cores (e.g., 300% on a 4-core node)

Applicative nodes free RAM (chart)

  • Free RAM of each node involved in the applicative cluster
  • This includes the caching, so it can be rather low
  • Stability is better
  • The same usage on each node is preferred

Services CPU usage (chart)

  • Sum of the CPU usage of all replicas, for each service
  • Allows finding the most demanding services easily, and scaling them
  • Allows tracking weird behaviors
  • We can see that the most demanding ones are:
    • PackWrite, which receives and parses Packets from Whisperers
    • WebWrite, which aggregates the packets of a TCP session to parse it
    • PackRead, which gives packets to WebWrite
    • TcpUpdate, which updates TCP sessions
    • TcpWrite, which receives TCP sessions from Whisperers

Services average RAM usage (chart)

  • Tracks the average RAM usage of all replicas, for each service
  • Stability is the target
  • There is currently an issue with MonitorWrite memory. Yet to be fixed.

Redis CPU usage (chart)

  • Tracks the CPU usage of the Redis database instances
  • Nothing special to say... it is so small!
  • The number of instances and what they host is configurable
  • Here:
    • Main: Tcp sessions, Http coms and Http pers
    • Pack: Packets, Status, Whisp status, Customers and Whisperers

Redis used RAM (chart)

  • Tracks the memory usage of Redis
  • When the processing of pollers and parsing is too slow, Redis accumulates data and can reach its maximum (1 GB by default)

Elasticsearch CPU usage (chart)

  • Tracks the CPU usage of Elasticsearch on each node of the Elasticsearch cluster
  • Maximum at 100%
  • Stability is key

Elasticsearch heap used (chart)

  • Tracks the JVM heap used by Elasticsearch on each node
  • Should stay below the limit (here, 4 GB each: half the node memory)

Elasticsearch disk used (chart)

  • Tracks the disk used on each ES node
  • Should not reach the limit (here, 400 GB)

Monitoring - Status summary dashboard

· 3 min read

Description

This dashboard provides a visual picture summarizing the state of the full cluster at any time.

Screenshot

Content

  • Summary of the applicative status (top left-hand corner):
    • Processing speed indicator: number of packets and TCP sessions received per minute
    • Parsing errors: % of parsing errors
  • Summary of the servers status (bottom left-hand corner):
    • ES status: short status of the ES servers - CPU, RAM and heap used
    • Nodes status: short status of the cluster servers - CPU and RAM
  • A network map of all Spider microservices with their inter-communications
    • The map shows a summary of the Whisperers status, UI usage status, datastores status and applicative cluster status:
      • The number of connected Whisperers is shown on the left Whisperers node
      • The number of connected users is shown on the right UI node
      • Circuit breaker status is represented by the arrows:
        • Green arrows when no errors, orange when errors
        • When hovered, the arrows display the speed and average response time
      • Services status is represented by the nodes:
        • Blue nodes when no errors, red when errors
        • The size of a node depends on its CPU usage
        • The color of a node also depends on whether the node is visible or not
  • 4 different views, to avoid seeing too many arrows at once:
    • Query path
    • Command path
    • Upload path
    • Monitoring path
  • Tooltips detailing the status of each node / link
    • They can be pinned to compare different periods easily

Example of tooltips:

For Whisperers:

  • Count of connected Whisperers / total
  • Count of Whisperers with > 10% CPU: not normal
  • Count of Whisperers with > 150 MB RAM: not normal
  • Count of Whisperers with PCAP overflow: means that the network is too fast for them; they need to be better configured
  • Count of Whisperers with queue overflows: means that the Spider servers are not scaled enough
  • Total count of requests / min from the Whisperers to the Spider servers

For Services:

  • Count of replicas
  • Total CPU usage on the cluster
  • Average RAM of a replica
  • Count of Errors and Warnings in the logs
  • Count of requests In and Out

For Pollers:

  • Count of replicas
  • Total CPU usage on the cluster
  • Average RAM of a replica
  • Count of Errors and Warnings in the logs
  • Average count of items waiting to be polled
  • Count of requests In and Out

For Redis DB:

  • Average CPU of the Redis instance (one instance may serve several DBs)
  • Average RAM of the Redis instance
  • Average count of items in this Redis DB
  • Total count of requests In

For ES index:

  • Average CPU used by this index over the period
  • Size of the index
  • Variation of the size, count of items and count of deleted documents over the period
  • Speed of indexing, getting and searching
  • Total count of requests In

For Browsers:

  • Number of connected Users
  • Number of Users having clicked during the period ;)
  • Average duration of a user session

For each link between nodes:

  • List of requests made on the link, with:
    • Count of requests / min
    • Average latency
    • Max 90th percentile latency
    • Count of errors
  • On hover, summary info is shown:

Last but not least, UI services... are not monitored... yet ;-)

Spider self monitoring

· 2 min read

Spider being built with microservices, I needed good monitoring dashboards to understand what was going on and to troubleshoot issues.

I first started with custom Kibana dashboards, but I couldn't get all the information or representations that I needed. And it was tough to get the cluster status at a glance.

So I designed my own monitoring dashboards, and implemented them. There are 6 dashboards to monitor Spider:

  • Status summary - (link) - Summary visual picture of the full cluster at any time
  • Applicative cluster status - (link) - Status of the applicative processing of Spider: speed, quality of parsing, circuit breakers, logs
  • Servers status - (link) - Status of the servers hosting the cluster and its datastores: CPU, RAM...
  • Datastores status - (link) - Status of the datastores: response time, speed, size, and circuit breakers
  • Whisperers status - (link) - Status of the Whisperers: state, uploaded data, quality of parsing, CPU, RAM, queues, circuit breakers...
  • UI usage status - (link) - History and statistics of UI connections

Rights

To access Spider monitoring, you need specific rights:

  • Administrative monitoring: access to all dashboards, for people managing the platform.
  • Whisperers monitoring: access to the Whisperers status dashboard, for people managing a set of Whisperers.
    • It allows checking that Whisperers are running fine when we think we're missing data.
    • In particular, the KPI about parsing quality.

Timeline

All dashboards include the same Spider timeline as the analysis UI, with zoom, pan, selection... and different flavors. By default, the TCP parsing status flavor is displayed, as it is the main quality indicator of Spider health.

A specific timeline, with a different lifespan and flavors, exists for the UI usage status, as the data displayed there dates from the beginning of monitoring.

Charts visual synchronization

The different charts of each dashboard are synchronized and rendered with the same x axis, so that it is easy to correlate the information across graphs.

When moving the mouse over the charts, a tooltip is displayed on each chart with its related data at that point in time.

Details

The content of each dashboard is explained in the linked pages.

Timescale on sequence diagrams

· One min read

Sequence diagrams are useful to understand request fan-out and to see where time is spent and where to optimize the process.

BUT, they had one flaw: requests and responses were drawn sequentially, so you couldn't easily see where time was lost.

So I made an improvement: you can now choose between different time scales to stress the gaps between calls (a sketch of the mapping follows the list):

  • Linear time scale: the visual gap increases linearly with the time difference between calls
  • Logarithmic time scale: differences between short calls are stressed
  • Squared time scale: long delays are stressed
  • Sequential time scale: as before
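A minimal sketch of how such scales can map a time gap to a visual gap (the function and factors are illustrative, not the actual UI code):

```ts
type TimeScale = "linear" | "logarithmic" | "squared" | "sequential";

// Maps the time gap between two consecutive calls (ms) to a visual gap (px).
function visualGap(deltaMs: number, scale: TimeScale): number {
  switch (scale) {
    case "sequential":
      return 20; // constant spacing, as before
    case "linear":
      return deltaMs * 0.1; // grows linearly with elapsed time
    case "logarithmic":
      return Math.log1p(deltaMs) * 10; // stresses differences between short calls
    case "squared":
      return (deltaMs / 100) ** 2; // stresses long delays
  }
}
```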

Example:

Architecture upgrade: splitting a monolith

· 2 min read

One service in the Spider back-end had been growing too much. It included:

  • Whisperer configurations
  • Users rights on this Whisperer
  • Whisperer current status
  • Whisperer status history
  • Whisperer hosts resolving

The last two were in different indices, but the first three 'data aggregates' were inside the same resource/document.

This resulted in a service that was complex to update, in conflicts in optimistic concurrency management, and in slow response times due to the size of the resources.

It needed splitting.

I first tried to split it logically, from the resource perspective, extracting the configuration, as it is the most stable data... But this was a bad idea: splitting configuration and rights complicated a lot the access and usage of the resources from the UI and from the other services that needed the information!

So I figured out I had to split the monolith from the client perspective.

As a result, I extracted from the first module:

  • An operating service to process status input and to store both the status history and the current status
  • An operating service to process hosts input and to store it
  • A configuration service to manage configuration and rights

This was much better, but I still had slowness due to the fact that all those modules were accessing and storing to ES directly. So I switched to saving in Redis and configured pollers to serialize the data to ES. Everything was already available to do this easily, from the saving processes of Packets, Sessions and Http communications. I also added a pure cache for Whisperer config resources (sketched below):

  • On save, save in Redis and in ES
  • On read, read from Redis; on a cache miss, read from ES and save the result in Redis
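A minimal sketch of this cache pattern (hypothetical keys and index names, using the ioredis and @elastic/elasticsearch clients for illustration):

```ts
import Redis from "ioredis";
import { Client } from "@elastic/elasticsearch";

const redis = new Redis();
const es = new Client({ node: "http://localhost:9200" });

// On save: write to both the Redis cache and ES.
async function saveConfig(id: string, config: object): Promise<void> {
  await Promise.all([
    redis.set(`whispererConfig:${id}`, JSON.stringify(config)),
    es.index({ index: "whisperers", id, document: config }),
  ]);
}

// On read: try Redis first; on a miss, fall back to ES and repopulate the cache.
async function readConfig(id: string): Promise<object> {
  const cached = await redis.get(`whispererConfig:${id}`);
  if (cached) return JSON.parse(cached);

  const { _source } = await es.get<object>({ index: "whisperers", id });
  await redis.set(`whispererConfig:${id}`, JSON.stringify(_source));
  return _source as object;
}
```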

All in all, requests from Whisperer clients to save Status or Hosts went from 200 ms+ to... 50 and 15 ms ;-) Yeah!!

Capture improvement

· One min read

New options have been added to the capture settings (a hypothetical snippet follows the list):

  • Wait for resolving:
    • Don't capture packets to/from a host until its name has been resolved by the DNS.
    • This allows ignoring ALL packets from the 'Hosts to ignore' list, and avoids, for instance, spikes in capture when the first calls to a UI are made
  • Track unresolved IPs:
    • Capture (or not) packets from hosts whose names could not be resolved by the DNS.
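A hypothetical view of these options in a capture settings object (field names are illustrative, not the actual configuration schema):

```ts
// Illustrative capture settings; the real option names may differ.
const captureSettings = {
  waitForResolving: true,    // skip packets until the host name is resolved
  trackUnresolvedIps: false, // drop packets from hosts the DNS cannot resolve
  hostsToIgnore: ["monitoring.local"], // resolved names matched against this list
};
```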

New Circuit Breakers on Whisperer

· One min read

Whisperers have been tracking their CPU and RAM usage for a long time. Now they check these metrics, and Whisperers can be configured to stop capturing when the metrics are above a defined threshold.

This allows limiting the impact of the Whisperers on the hosts they capture when there are spikes of traffic. Of course, you lose monitoring data... but you allow your system to cope with the surge in traffic.

By default, Whisperers check their CPU and RAM usage every 20s. Once opened, the circuit breaker stops the capture for the next 20s, and the check runs again after.
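As a sketch, the check loop could look like this (thresholds and APIs are illustrative, not the actual Whisperer code):

```ts
// Every 20 s, compare measured CPU/RAM usage with the configured
// thresholds, and pause capture while the host is overloaded.
interface ResourceThresholds {
  maxCpuPercent: number;
  maxRamMb: number;
}

function startResourceBreaker(
  measure: () => { cpuPercent: number; ramMb: number },
  setCapture: (enabled: boolean) => void,
  limits: ResourceThresholds,
  intervalMs = 20_000,
): void {
  setInterval(() => {
    const { cpuPercent, ramMb } = measure();
    const overloaded =
      cpuPercent > limits.maxCpuPercent || ramMb > limits.maxRamMb;
    setCapture(!overloaded); // open the breaker (stop capture) while overloaded
  }, intervalMs);
}
```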

The circuit breakers are configured in Capture settings: