Monitoring – Status summary dashboard

Description

This dashboard gives a visual summary of the state of the whole cluster at any point in time.

Screenshot

Content

  • Summary of application status (top left-hand corner):
    • Processing speed indicator: number of packets and TCP sessions received per minute
    • Parsing errors: percentage of parsing errors
  • Summary of server status (bottom left-hand corner):
    • ES status: short status of the ES servers – CPU, RAM and heap used
    • Nodes status: short status of the cluster servers – CPU and RAM
  • A network map of all Spider microservices and the communications between them
    • The map summarizes the status of the Whisperers, UI usage, datastores and the application cluster:
      • Number of connected Whisperers is shown on the left Whisperers node
      • Number of connected Users is shown on the right UI node
      • Circuit breaker status is represented by the arrows
        • Arrows are green when there are no errors, orange when there are errors
        • When hovered, the arrows display the speed and average response time
      • Service status is represented by the nodes
        • Nodes are blue when there are no errors, red when there are errors
        • Node size depends on CPU usage
        • Node color also depends on whether the node is visible or not
  • 4 different views to avoid seeing too many arrows at once:
    • Query path
    • Command path
    • Upload path
    • Monitoring path
  • Tooltips detailing the status of each node / link
    • Can be pinned to compare different periods easily
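
The color and size rules of the map can be sketched in a few lines. This is only an illustration, not Spider's actual code: the `ServiceNode` class, the function names and the radius bounds are all hypothetical, and the extra color rule tied to node visibility is left out.

```python
from dataclasses import dataclass


@dataclass
class ServiceNode:
    """Hypothetical record for one service shown on the map."""
    name: str
    error_count: int    # errors seen over the selected period
    cpu_percent: float  # CPU usage, assumed to be on a 0-100 scale


def arrow_color(error_count: int) -> str:
    """Green arrow when the link has no errors, orange otherwise."""
    return "green" if error_count == 0 else "orange"


def node_color(node: ServiceNode) -> str:
    """Blue node when the service has no errors, red otherwise."""
    return "blue" if node.error_count == 0 else "red"


def node_radius(node: ServiceNode, min_r: float = 10.0, max_r: float = 40.0) -> float:
    """Scale the radius linearly with CPU usage (the bounds are made up)."""
    cpu = max(0.0, min(node.cpu_percent, 100.0))  # clamp to 0-100
    return min_r + (max_r - min_r) * cpu / 100.0
```

A linear scale is just the simplest choice here; the real dashboard may well use another mapping.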

Example of tooltips:

For Whisperers:

  • Count of connected Whisperers / total
  • Count of Whisperers with > 10% CPU: not normal
  • Count of Whisperers with > 150 MB RAM: not normal
  • Count of Whisperers with PCAP overflow: means the network traffic is too fast for them; they need to be configured better
  • Count of Whisperers with queues overflow: means the Spider servers are not scaled up enough
  • Total count of requests / min from Whisperers to the Spider servers
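
As a rough sketch, the Whisperers tooltip could be computed from per-Whisperer status records as below. The two thresholds come from the list above; everything else (the `WhispererStatus` fields, the function name, and the choice to count anomalies among connected Whisperers only) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

CPU_LIMIT_PERCENT = 10.0  # from the tooltip: > 10% CPU is not normal
RAM_LIMIT_MB = 150.0      # from the tooltip: > 150 MB RAM is not normal


@dataclass
class WhispererStatus:
    """Hypothetical status record reported by one Whisperer."""
    connected: bool
    cpu_percent: float
    ram_mb: float
    pcap_overflow: bool
    queue_overflow: bool
    requests_per_min: int


def whisperers_tooltip(statuses: List[WhispererStatus]) -> Dict[str, int]:
    """Aggregate the counters shown in the tooltip.

    Assumption: anomalies are counted among connected Whisperers only.
    """
    connected = [s for s in statuses if s.connected]
    return {
        "connected": len(connected),
        "total": len(statuses),
        "high_cpu": sum(s.cpu_percent > CPU_LIMIT_PERCENT for s in connected),
        "high_ram": sum(s.ram_mb > RAM_LIMIT_MB for s in connected),
        "pcap_overflow": sum(s.pcap_overflow for s in connected),
        "queue_overflow": sum(s.queue_overflow for s in connected),
        "requests_per_min": sum(s.requests_per_min for s in connected),
    }
```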

For Services:

  • Count of replicas
  • Total CPU usage on the cluster
  • Average RAM of a replica
  • Count of Errors and Warnings in the logs
  • Count of requests In and Out

For Pollers:

  • Count of replicas
  • Total CPU usage on the cluster
  • Average RAM of a replica
  • Count of Errors and Warnings in the logs
  • Average count of items waiting to be polled
  • Count of requests In and Out

For Redis DB:

  • Average CPU of the Redis instance (one instance may host several DBs)
  • Average RAM of the Redis instance
  • Average count of items in this Redis DB
  • Total count of requests In

For ES index:

  • Average CPU used by this index over the period
  • Size of the index
  • Variation in size, item count and deleted item count over the period
  • Speed of indexing, getting and searching
  • Total count of requests In

For Browsers:

  • Number of connected Users
  • Number of Users having clicked during the period 😉
  • Average duration of a user session

For each link between nodes:

  • List of requests made on the link with:
    • Count of requests / min
    • Average latency
    • Max latency for 90% of requests (90th percentile)
    • Count of errors
  • On hover, the link shows summary info
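
To make the per-link figures concrete, here is a minimal sketch of how they could be derived from raw request samples. It reads the "90%" latency as a 90th percentile computed with the nearest-rank method, which is an assumption; `RequestSample`, `percentile` and `link_summary` are hypothetical names, not part of Spider.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSample:
    """Hypothetical record of one request observed on a link."""
    latency_ms: float
    is_error: bool


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def link_summary(samples: List[RequestSample], period_minutes: float) -> Dict[str, float]:
    """Compute the four figures listed for a link over one period."""
    if not samples:
        return {"requests_per_min": 0.0, "avg_latency_ms": 0.0,
                "p90_latency_ms": 0.0, "errors": 0}
    latencies = [s.latency_ms for s in samples]
    return {
        "requests_per_min": len(samples) / period_minutes,
        "avg_latency_ms": sum(latencies) / len(latencies),
        "p90_latency_ms": percentile(latencies, 90),
        "errors": sum(s.is_error for s in samples),
    }
```

Showing a high percentile next to the average is useful because a few slow requests can hide behind a healthy-looking mean.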

 

Last but not least, UI services… are not monitored… yet 😉
