
Monitoring - UI usage dashboard

October 24, 2018 · 3 min read

Description

This dashboard provides statistics on Spider UI usage: connected clients over time, usage statistics, user statistics, session history and action usage.

This dashboard has a different timescale from the others, because the UI tracking records are not purged at the same time as the other operational data. You can choose what to display on the timescale:

Actions
Sessions
Jobs (purges & uploads)

Screenshot

Content

Simultaneous sessions over time (chart)

Tracks the number of users simultaneously connected to the Spider UI over time
X axis: days
Y axis: hours

Actions per session distribution (chart)

Shows the distribution of the number of actions per session

Network view usage stats (chart)

Show usage time statistics of Spider UI Views (HTTP, TCP, Packet) and Mode (Stats, Sequence diagram, Stats)

Self Monitoring usage stats (chart)

Show usage time statistics of monitoring views.

NetworkView options stats (chart)

Show options usage of NetworkView

Most used actions (chart)

Show most used actions in the UI

Most used Whisperers groups (chart)

Show most used Whisperers groups (by time)
- By grouping them on their first 4 characters
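
As a rough illustration of this grouping, the sketch below sums usage time per group, a group being the first 4 characters of the Whisperer name. The names and durations are made-up sample data and `group_usage` is a hypothetical helper, not Spider code.

```python
from collections import defaultdict

def group_usage(usage_seconds, prefix_len=4):
    """Sum usage time per group, a group being the first `prefix_len` characters of the name."""
    totals = defaultdict(float)
    for name, seconds in usage_seconds:
        totals[name[:prefix_len]] += seconds
    # Most used groups first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical Whisperer names with their usage time in seconds
sample = [("PROD-front", 1200), ("PROD-back", 600), ("TEST-front", 300)]
print(group_usage(sample))  # [('PROD', 1800.0), ('TEST', 300.0)]
```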

User stats (grid)

Lists each user's usage statistics of the UI over the selected period
- Total actions
- Total duration of sessions
- Total hours of usage (each hour started counts)
- Total count of days of usage
- % of working days with usage (Saturdays and Sundays are excluded from the reference)
- Last usage date and time
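
To make two of these figures concrete, here is a minimal sketch of one possible reading of "each hour started counts" and of the working day percentage. The function names and the sample sessions are assumptions made for illustration; the actual Spider computation may differ.

```python
from datetime import date, datetime, timedelta

def started_hours(sessions):
    """Count the distinct clock hours touched by at least one session."""
    hours = set()
    for start, end in sessions:
        cursor = start.replace(minute=0, second=0, microsecond=0)
        while cursor < end:
            hours.add(cursor)
            cursor += timedelta(hours=1)
    return len(hours)

def working_day_usage(sessions, period_start, period_end):
    """Percentage of working days (Monday to Friday) of the period with at least one session."""
    used_days = {
        start.date()
        for start, _ in sessions
        if start.weekday() < 5 and period_start <= start.date() <= period_end
    }
    working_days = 0
    day = period_start
    while day <= period_end:
        if day.weekday() < 5:  # Saturdays and Sundays are excluded
            working_days += 1
        day += timedelta(days=1)
    return 100.0 * len(used_days) / working_days if working_days else 0.0

# Two hypothetical sessions, on a Monday and a Tuesday
sessions = [
    (datetime(2018, 10, 22, 9, 30), datetime(2018, 10, 22, 11, 10)),
    (datetime(2018, 10, 23, 14, 0), datetime(2018, 10, 23, 14, 20)),
]
print(started_hours(sessions))  # 4
print(working_day_usage(sessions, date(2018, 10, 22), date(2018, 10, 28)))  # 40.0
```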

Session items (grid)

Lists user session items
Preconfigured to display
- Start time of session
- Application
- User
- Duration of session
- Active hours
- Browser reloads count
- Actions count
- % spent on each view
- % spent on each mode/subview
- % spent on each option
Common Spider features on grid:
- Allows opening the status record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Upload and purge items (grid)

Lists frontend / backend jobs triggered by users
- Purges
- Uploads (Pcap and Json export)
Preconfigured to display
- Date of creation
- Job type
- Whisperer used
- Status of action
- Progress
- Duration of job
Common Spider features on grid:
- Allows opening the status record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Monitoring - Whisperers status dashboard

October 5, 2018 · 5 min read

Description

This dashboard provides a status of Whisperers clients: state, uploaded data, quality of parsing, cpu, ram, queues, circuit breakers…

Screenshot

Content

Whisperer status (chart)

Tracks status of all Whisperers connected to the server:
- Starting
- Recording
- Stopped
- Invalid_Config
- Internal_Error
- Server_Down (when they can't get configuration)

Whisperer uploads to server (chart)

Tracks data uploaded from the Whisperer to the server, in MB

Whisperers current status (grid)

Lists current session status sent by all Whisperers
- Whisperer start, host monitored and uptime
- Session start and duration
- CPU, RAM
- Payload sent and errors
Common Spider features on grid:
- Allows opening the status record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Whisperers config and parsing status (grid)

Lists Whisperers and their parsing status over the selected period
- Sent sessions, amount and percentage of parsing errors
- Parsed Http communications and missing part
Common Spider features on grid:
- Allows comparing items (config and stats merged)
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
  - Only on Whisperer config
- Many fields to display / hide
- Sorting on columns (from config)

Whisperer CPU usage (chart)

Tracks the CPU usage of all connected Whisperers
Should be low ;)
The more packets are captured and parsed, the higher the CPU usage.
- Captured packets can be limited by a PCAP filter
- Parsed packets can be limited by hostname blacklisting in the configuration
- A circuit breaker on CPU usage can be set to pause Whisperers when the load is too high
Typical usage: between 3 and 10%

Whisperer used RAM (chart)

Tracks the RAM usage of all connected Whisperers
Typical usage:
- 115 MB when capturing and server responding
- 50 MB when stopped

Whisperer queue length (chart)

Tracks size of sending queue of Whisperers
- Packets and Tcpsessions
When a Whisperer has too many requests to send to the server, they are pushed to a queue, waiting for the next slot to be sent.
When items are in the queue, it means either:
- The server is getting slow and has issues
- The Whisperer is under high pressure of packets to capture

Queues overflow (chart)

Tracks size of queues overflow
- Packets and Tcpsessions
When a Whisperer has too many requests to send to the server, they are pushed to a queue, waiting for the next slot to be sent.
When the queue is full, the oldest items in the queue are discarded and never sent.
- This causes parsing issues and missing data (not sent)
It shouldn't happen if the Whisperers and servers are correctly scaled ;)
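
The behaviour described above can be sketched as a bounded queue that drops its oldest item when full and counts the drops. This is only an illustration: the class name, the `max_size` parameter and the counter are assumptions, not the Whisperer's actual code.

```python
from collections import deque

class BoundedSendQueue:
    """Sending queue that discards its oldest item when full and counts the overflow."""

    def __init__(self, max_size):
        self.items = deque()
        self.max_size = max_size
        self.overflow = 0  # number of items discarded and never sent

    def push(self, item):
        if len(self.items) >= self.max_size:
            self.items.popleft()  # the oldest item is lost
            self.overflow += 1
        self.items.append(item)

    def pop_next(self):
        """Return the next item to send, or None when the queue is empty."""
        return self.items.popleft() if self.items else None

queue = BoundedSendQueue(max_size=3)
for packet_id in range(5):
    queue.push(packet_id)
print(list(queue.items), queue.overflow)  # [2, 3, 4] 2
```

The overflow counter is the kind of value this chart would plot: it stays at zero as long as the queue never fills up.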

Active circuit breakers (chart)

Tracks when Whisperers have active circuit breakers
When a Whisperer cannot connect to the server, or fails to send data (mostly because of timeouts), a circuit breaker opens and the Whisperer stops trying for some time.
- Data is lost
This can happen when:
- The CPU of the host running the Whisperer is heavily loaded
- The server is not scaled big enough
- Server is partially down
  - When server is completely down, the Whisperer stops its capture and waits for it to get back up again
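
A minimal sketch of such a circuit breaker is shown below, assuming a fixed pause after each failure. The class, the `open_seconds` value and the injected `clock` are hypothetical and only serve the illustration; the Whisperer's real implementation is not shown in this post.

```python
import time

class SendCircuitBreaker:
    """Stops sending for `open_seconds` after a connection failure or a timeout."""

    def __init__(self, open_seconds=10.0, clock=time.monotonic):
        self.open_seconds = open_seconds  # assumed pause duration
        self.clock = clock
        self.opened_at = None
        self.lost = 0  # payloads that were never sent

    def is_open(self):
        return self.opened_at is not None and self.clock() - self.opened_at < self.open_seconds

    def send(self, payload, transport):
        if self.is_open():
            self.lost += 1  # data is lost while the breaker is open
            return False
        try:
            transport(payload)
            self.opened_at = None
            return True
        except (ConnectionError, TimeoutError):
            self.opened_at = self.clock()
            self.lost += 1
            return False

# Demo with a fake clock and fake transports
now = [0.0]
breaker = SendCircuitBreaker(open_seconds=10.0, clock=lambda: now[0])

def failing(_payload):
    raise TimeoutError("server too slow")

def working(_payload):
    return None

first = breaker.send("a", failing)   # fails and opens the breaker
second = breaker.send("b", working)  # skipped, the breaker is still open
now[0] = 11.0
third = breaker.send("c", working)   # the pause is over, sending resumes
print(first, second, third, breaker.lost)  # False False True 2
```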

Whisperers status items (grid)

Lists all statuses sent by Whisperers
Items are prefiltered on those having errors
Common Spider features on grid:
- Allows opening the status record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Hosts items (grid)

Lists the Hosts resources of Whisperers
Hosts resources track the name resolution of the hosts seen by Whisperers
- Start and stop of capture for each host
- Dns names
- Custom names set by users on UI or by parsing configuration
- Position on map (if fixed)
A host resource is updated at regular intervals, and a new one is created only when a host changes IP or DNS name
Common Spider features on grid:
- Allows opening the host record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Hosts stats (grid)

Computes statistics on Hosts resources for each Whisperer over the period
If, over a couple of hours, a Whisperer has too many Hosts records with a very short average duration, it means that:
- Host names are not stable
  - For instance, Docker Swarm has a bug in the reverse DNS of hosts: often, the Docker id is returned instead of the name of the service replica.
  - This can be worked around with Whisperers settings
- Name resolution of IPs on the UI may fail
  - The UI limits its load to 99 Hosts resources at once.
Grid has limited features: only display.
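
The kind of statistic shown in this grid can be sketched as a count and an average duration of Hosts records per Whisperer. The record layout, the function names and the thresholds below are assumptions for illustration only.

```python
from collections import defaultdict

def hosts_stats(records):
    """records: iterable of (whisperer, start_ts, stop_ts), timestamps in seconds."""
    durations = defaultdict(list)
    for whisperer, start, stop in records:
        durations[whisperer].append(stop - start)
    return {
        whisperer: {"count": len(values), "avg_duration": sum(values) / len(values)}
        for whisperer, values in durations.items()
    }

def suspicious(stats, max_count, min_avg_duration):
    """Whisperers with many Hosts records that each lasted only a short time."""
    return sorted(
        whisperer
        for whisperer, stat in stats.items()
        if stat["count"] > max_count and stat["avg_duration"] < min_avg_duration
    )

# Hypothetical sample: whisperer-a keeps creating short-lived Hosts records
records = [
    ("whisperer-a", 0, 60), ("whisperer-a", 60, 120), ("whisperer-a", 120, 180),
    ("whisperer-b", 0, 7200),
]
stats = hosts_stats(records)
print(suspicious(stats, max_count=2, min_avg_duration=300))  # ['whisperer-a']
```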

Monitoring - Applicative cluster status dashboard

October 1, 2018 · 3 min read

Description

This dashboard provides a status of the applicative processing of Spider: speed, quality of parsing, circuit breakers, logs...

Screenshot

Content

Processing speed (chart)

Evolution of input speed from Whisperers per min.
Packets and Tcp inputs
Http output

Tcp parsing status (chart)

Quality of parsing, listing the different status: Waiting, Pending, Ok, Warning, Error
The less red, the better! :)
Errors can have many causes, but the main one is CPU contention on clients or servers

Service Speed from Whisperers (chart)

Response time of the server endpoints, as seen from the Whisperer clients
The lower the better
If too high, more server nodes and service replicas are needed

Service Speed between apps (chart)

Response time between nodes in the cluster
The lower the better
The more stable the better
If too high, more server nodes and service replicas are needed

Polling queued (chart)

Count of items in Redis queues, waiting for serialization in ES
The more stable the better
If increasing, more poller replicas are needed, or more ES indexing power

Active circuit breakers (chart)

Count of opened circuit breakers on communications between services (not to the stores)
Nothing on the graph is the target

Circuit breakers items (grid)

List circuit breakers status over the period
Preconfigured to display only those between applications, and with errors
Common Spider features on grid:
- Allows opening the status record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Errors and Warnings in logs (charts)

Count of errors in logs, grouped by service
Count of warnings in logs, grouped by service
No items is the target

Log items (grid)

List logs over the period
Preconfigured to display only those of Warning or Error levels
Common Spider features on grid:
- Allows opening the logs in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll
Hyperlinks to Spider objects, depending on what is linked in the logs:
- Whisperers configurations
- TcpSessions
- HttpCommunications
- Customers
- Packets
- and so on...
Open logs in a specific detail panel with hyperlinks.

Monitoring - Datastores status dashboard

October 1, 2018 · 3 min read

Description

This dashboard provides a status of Elasticsearch and Redis datastores: response time, speed, size, and circuit breakers

Screenshot

Content

Redis response times (chart)

Tracks the response times of services and pollers calling Redis
... Redis is sooo fast!

Elasticsearch response times (chart)

Tracks the response time of services and pollers calling Elasticsearch
Times are much longer than with Redis but should stay below 1s
The longest times are mostly Pollers saving data

Redis content (chart)

Tracks the evolution of data content in Redis memory
Must be stable

Elasticsearch content (chart)

Track the evolution of data in Elasticsearch
Always goes up, except when purging ;-)

Redis size (chart)

Tracks the memory usage of Redis
See servers status for more info

Elasticsearch size (chart)

Tracks the size of Elasticsearch indices (aggregated for time based indices)

Elasticsearch CPU load spread over indices (chart)

Tracks the CPU usage of Elasticsearch spread over its indices
Strange that indexing seems to use little CPU... to be checked
Helps spot unexpected patterns

Elasticsearch index speed (chart)

Track indexing speed on each index
Confirms processing speed of applicative cluster
- Nb: many packets are not saved in ES to save space

Elasticsearch get speed (chart)

Tracks Elasticsearch direct document access (get)
Almost absent in Spider thanks to cache optimisations
This chart made it possible to track those usages and effectively optimise them ;)

Elasticsearch search speed (chart)

Tracks Elasticsearch searches
Almost absent in Spider thanks to cache optimisations
This chart made it possible to track those usages and effectively optimise them ;)
Searches are almost only used by UIs

Active circuit breakers on Redis and ES (chart)

Tracks opened circuit breakers between services, pollers and Elasticsearch/Redis
Circuit breakers very often open on pollers when the ES cluster is undersized...
- But pollers retry, so this is not an issue.
- It can be checked in the logs, which list all items not saved... that you can nevertheless open in the Monitoring UI, because they were eventually saved.

Circuit breakers items (grid)

Lists circuit breakers status over the period
Preconfigured to display only those between applications and datastores, and with errors
Common Spider features on grid:
- Allows opening the status record in the detail panel
- Allows comparing items
- Fully integrated search using ES querystring with autocompletion and syntax highlighting
- Many fields to display / hide
- Sorting on columns
- Infinite scroll

Monitoring - Servers status dashboard

October 1, 2018 · 3 min read

Description

This dashboard provides a status of the servers hosting the cluster and its datastores: CPU, RAM…

Screenshot

Content

Applicative nodes CPU usage (chart)

CPU usage of each node involved in the applicative cluster
Can be above 100% on nodes with multiple cores
Stability is key
Same usage on each node is preferred
Target is below 75% * number of cores
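
As a quick worked example of this target, assuming that 100% stands for one fully used core (the helper names are hypothetical):

```python
def cpu_target(cores, ratio=0.75):
    """Upper CPU target in percent, where 100% means one fully used core."""
    return ratio * 100 * cores

def is_within_target(cpu_percent, cores):
    return cpu_percent < cpu_target(cores)

print(cpu_target(4))             # 300.0
print(is_within_target(250, 4))  # True
print(is_within_target(320, 4))  # False
```

So a 4-core node would be expected to stay below 300% on this chart.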

Applicative nodes free RAM (chart)

Free RAM of each node involved in the applicative cluster
This includes caching, so it could be rather low
Stability is better
Same usage on each node is preferred

Services CPU usage (chart)

Sum of all CPU usage of all replicas for each service
Makes it easy to find the most demanding services and scale them
Helps track weird behaviors
We can see that the most used ones are:
- PackWrite, which receives and parses Packets from Whisperers
- WebWrite, which aggregates the packets of a TCP session to parse it
- PackRead, which gives packets to WebWrite
- TcpUpdate, which updates TCP sessions
- TcpWrite, which receives TCP sessions from Whisperers

Services average RAM usage (chart)

Track the average RAM usage of all replicas for each service
Stability is the target
There is currently an issue with MonitorWrite memory. Yet to be fixed.

Redis CPU usage (chart)

Tracks the CPU usage of the Redis database instances
Nothing special to say... it is so small!
The number of instances and what they host is configurable
Here:
- Main: Tcp sessions, Http coms and Http pers
- Pack: Packets, Status, Whisp status, Customers and Whisperers

Redis used RAM (chart)

Tracks the memory usage of Redis
When the processing of pollers and parsing is too slow, Redis accumulates data and can reach its maximum (1 GB by default)

Elasticsearch CPU usage (chart)

Tracks the CPU usage of Elasticsearch inside each Node of Elasticsearch cluster
Maximum at 100%
Stability is key

Elasticsearch heap used (chart)

Tracks the JVM heap used by Elasticsearch on each node
Should stay below the limit (4 GB on each node, half of the node memory)

Elasticsearch disk used (chart)

Track the disk used on each ES node
Should not reach the limit (here, 400 GB)

Monitoring - Status summary dashboard

October 1, 2018 · 4 min read

Description

This dashboard provides a visual picture summarizing the state of the full cluster at any time.

Screenshot

Content

Summary of applicative status (top left hand corner):
- Processing speed indicator: number of packets and TCP sessions received per minute
- Parsing errors: % of parsing errors
Summary of servers status (bottom left hand corner):
- ES status: short status of the ES servers - CPU, RAM and HEAP used
- Nodes status: short status of the Cluster servers - CPU and RAM
A network map of all Spider microservices with their inter communications
- The map shows the summary of Whisperers status, UI usage status, Datastores status and Applicative cluster status:
  - Number of connected Whisperers is shown on the left Whisperers node
  - Number of connected Users is shown on the right UI node
  - Circuit breakers status is represented by the arrows
    - Green arrows when no errors, orange when errors
    - When hovered, the arrows display the speed and average response time
  - Services status is represented by the nodes
    - Blue nodes when no errors, red when errors
    - Size of nodes depending on the CPU usage
    - Color of nodes depending on the visibility or not of the nodes
4 different views to avoid seeing too many arrows at once:
- Query path
- Command path
- Upload path
- Monitoring path
Tooltips detailing the status of each Node / Link
- Can be pinned to compare different periods easily

Example of tooltips:

For Whisperers:

Count of connected Whisperers / total
Count of Whisperers with > 10% CPU: not normal
Count of Whisperers with > 150 MB RAM: not normal
Count of Whisperers with PCAP overflow: means that network is too fast for them, they need to be better configured
Count of Whisperers with queues overflow: means that Spider servers are not scaled enough
Total count of requests / min from Whisperers to the Spider servers
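
A sketch of how such counters could be derived from a list of Whisperer statuses, using the 10% CPU and 150 MB RAM thresholds quoted above. The `WhispererStatus` fields are hypothetical names, not the real status record.

```python
from dataclasses import dataclass

@dataclass
class WhispererStatus:
    name: str
    connected: bool
    cpu_percent: float
    ram_mb: float
    pcap_overflow: int
    queue_overflow: int

def summarize(statuses):
    """Build the tooltip counters from the statuses of all Whisperers."""
    connected = [status for status in statuses if status.connected]
    return {
        "connected": len(connected),
        "total": len(statuses),
        "high_cpu": sum(status.cpu_percent > 10 for status in connected),
        "high_ram": sum(status.ram_mb > 150 for status in connected),
        "pcap_overflow": sum(status.pcap_overflow > 0 for status in connected),
        "queue_overflow": sum(status.queue_overflow > 0 for status in connected),
    }

# Made-up sample statuses
statuses = [
    WhispererStatus("w1", True, 5.0, 115.0, 0, 0),
    WhispererStatus("w2", True, 22.0, 180.0, 3, 0),
    WhispererStatus("w3", False, 0.0, 50.0, 0, 0),
]
print(summarize(statuses))
```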

For Services:

Count of replicas
Total CPU usage on the cluster
Average RAM of a replica
Count of Errors and Warnings in the logs
Count of requests In and Out

For Pollers:

Count of replicas
Total CPU usage on the cluster
Average RAM of a replica
Count of Errors and Warnings in the logs
Average count of items waiting to be polled
Count of requests In and Out

For Redis DB:

Average CPU of the Redis instance (may be used for several DB)
Average RAM of the Redis instance
Average count of items in this Redis DB
Total count of requests In

For ES index:

Average CPU used by this index over the period
Size of the index
Variation of size, count of items and count of deleted over the period
Speed of indexing, getting and searching
Total count of requests In

For Browsers:

Number of connected Users
Number of Users having clicked during the period ;)
Average duration of a user session

For each link between nodes:

List of requests made on the link with:
- Count of requests / min
- Average latency
- Max 90% latency
- Count of errors
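
For illustration, these link figures could be computed as below. The sketch reads "Max 90% latency" as the 90th percentile of latencies, computed with the nearest-rank method; this reading, the function name and the sample values are assumptions.

```python
import math

def link_stats(latencies_ms, errors, period_minutes):
    """Summarize the requests made on one link over a period."""
    if not latencies_ms:
        raise ValueError("at least one request is needed")
    ordered = sorted(latencies_ms)
    # Nearest-rank 90th percentile: smallest value with 90% of requests at or below it
    rank = math.ceil(len(ordered) * 9 / 10)
    return {
        "requests_per_min": len(ordered) / period_minutes,
        "avg_latency_ms": sum(ordered) / len(ordered),
        "p90_latency_ms": ordered[rank - 1],
        "errors": errors,
    }

# Made-up latencies for 10 requests observed over 5 minutes
sample_latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(link_stats(sample_latencies, errors=1, period_minutes=5))
```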
On hover, show summary info:

Last but not least, UI services... are not monitored... yet ;-)

Spider self monitoring

September 30, 2018 · 3 min read

Spider being built with microservices, I needed good monitoring dashboards to understand what was going on, and to troubleshoot issues.

I first started with custom Kibana dashboards, but I couldn't get all the information or representations that I needed. And it was tough to get the cluster status in one place.

So I designed my own monitoring dashboards. And implemented them. There are 6 dashboards to monitor Spider:

Status summary - (link) - Summary visual picture of the full cluster at any time
Applicative cluster status - (link) - Status of the applicative processing of Spider: speed, quality of parsing, circuit breakers, logs
Servers status - (link) - Status of the servers hosting the cluster and its datastores: CPU, RAM...
Datastores status - (link) - Status of the datastores: response time, speed, size, and circuit breakers
Whisperers status - (link) - Status of the whisperers: state, uploaded data, quality of parsing, cpu, ram, queues, circuit breakers...
UI usage status - (link) - History and statistics of UI connections...

Rights

To access Spider monitoring, you need specific rights:

Administrative monitoring: access to all dashboards, for people managing the platform.
Whisperers monitoring: access to Whisperers status dashboards, for people managing a set of Whisperers.
- It allows checking whether Whisperers are running fine when we think we're missing data.
- Especially through the KPI about quality of parsing

Timeline

All dashboards include the same Spider timeline as the analysis UI, with zoom, pan, selection... and different flavors. By default, the TCP parsing status flavor is displayed, as it is the main quality indicator of Spider health.

A specific timeline, with different lifespan and flavors exists for UI usage status, as the data displayed there dates from the beginning of monitoring.

Charts visual synchronization

The different charts of each dashboard are synchronized and rendered with the same x axis so that it is easy to correlate the information of each graph.

When moving the mouse over the charts, a tooltip is displayed on each chart with the related data of each chart at this time.

Details

The content of each dashboard is explained in the linked pages.

Timescale on sequence diagrams

September 27, 2018 · One min read

Sequence diagrams are useful to understand requests fan-out and to see where time is taken and where to optimize the process.

BUT, they had one flaw: requests and responses are drawn sequentially: you don't see easily where time is lost.

So I made an improvement: you can now choose between different time scales to emphasize the gaps between calls:

Linear time scale: the visual gap increases linearly with the time difference between calls
Logarithmic time scale: differences between short calls are emphasized
Squared time scale: long delays are emphasized
Sequential time scale: as before
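
To give an idea of how the four scales differ, here is a minimal sketch that maps the time elapsed between two calls to a visual gap. The formulas and the `base` and `factor` constants are assumptions chosen for illustration; they are not the ones used by Spider.

```python
import math

def visual_gap(delta_ms, scale, base=20.0, factor=1.0):
    """Return the vertical gap, in pixels, drawn between two consecutive calls.

    delta_ms is the time elapsed between the two calls. `base` and `factor`
    are illustrative constants, not values used by Spider.
    """
    if delta_ms < 0:
        raise ValueError("delta_ms must not be negative")
    if scale == "sequential":
        return base  # constant spacing, time is ignored
    if scale == "linear":
        return base + factor * delta_ms
    if scale == "logarithmic":
        return base + factor * math.log1p(delta_ms)  # grows fast for short gaps, slowly for long ones
    if scale == "squared":
        return base + factor * delta_ms ** 2  # long delays dominate
    raise ValueError(f"unknown scale: {scale}")

for scale in ("sequential", "linear", "logarithmic", "squared"):
    print(scale, [round(visual_gap(delta, scale), 1) for delta in (1, 10, 100)])
```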

Example:

Capture improvement

September 23, 2018 · One min read

New options have been added to capture settings:

Wait for resolving:
- Don't capture packets to/from a host until its name has been resolved by the DNS.
- This allows ignoring ALL packets from the 'Hosts to ignore' list, and for instance avoids spikes in capture when the first calls to a UI are made
Track unresolved IP:
- Capture, or not, packets from hosts that could not be resolved by the DNS.
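
One way to read these two options together is the decision function sketched below. It is only an interpretation of the description above: the function, its parameters and the host names are hypothetical.

```python
from typing import Optional, Set

def should_capture(
    resolution_done: bool,
    resolved_name: Optional[str],
    hosts_to_ignore: Set[str],
    wait_for_resolving: bool,
    track_unresolved_ip: bool,
) -> bool:
    """Decide whether packets to/from a host should be captured."""
    if not resolution_done:
        # Name not known yet: capture only if we are not asked to wait
        return not wait_for_resolving
    if resolved_name is None:
        # The DNS could not resolve this IP
        return track_unresolved_ip
    return resolved_name not in hosts_to_ignore

ignored = {"spider-ui"}
# Resolution still pending with 'Wait for resolving' enabled: nothing is captured yet
print(should_capture(False, None, ignored, True, True))
# Host resolved to an ignored name: its packets are skipped
print(should_capture(True, "spider-ui", ignored, True, True))
```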

New Circuit Breakers on Whisperer

September 23, 2018 · One min read

Whisperers have been tracking their CPU and RAM usage for a long time. Now they also check these metrics, and can be configured to stop capture when usage is above a defined threshold.

This limits the impact of the Whisperers on the hosts they capture from when there are spikes of traffic. Of course, you lose monitoring data... but you allow your system to cope with the surge in traffic.

By default, Whisperers check their CPU and RAM usage every 20s. Once opened, the circuit breaker will stop capture for the next 20s and check again after.
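
The check cycle described above can be sketched as follows. The class and the threshold values are assumptions for illustration; only the 20s interval comes from this post.

```python
CHECK_INTERVAL_SECONDS = 20  # default interval quoted above

class ResourceCircuitBreaker:
    """Pauses capture until the next check when CPU or RAM exceeds its threshold."""

    def __init__(self, max_cpu_percent, max_ram_mb):
        self.max_cpu_percent = max_cpu_percent
        self.max_ram_mb = max_ram_mb
        self.open = False

    def check(self, cpu_percent, ram_mb):
        """Run every CHECK_INTERVAL_SECONDS; return True if capture may run until the next check."""
        self.open = cpu_percent > self.max_cpu_percent or ram_mb > self.max_ram_mb
        return not self.open

# Hypothetical thresholds and (CPU %, RAM MB) samples, one per check
breaker = ResourceCircuitBreaker(max_cpu_percent=50, max_ram_mb=200)
samples = [(5, 115), (80, 120), (90, 130), (8, 115)]
decisions = [breaker.check(cpu, ram) for cpu, ram in samples]
print(decisions)  # [True, False, False, True]
```

In this sketch the breaker stays open for as many intervals as the usage remains above a threshold, and capture resumes at the first check that finds both metrics back under their limits.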

The circuit breakers are configured in Capture settings:
