Use cases
Proven use cases
Analysis / learning
- System discovery
- Understanding system behavior and interactions
- Checking internal integrations between services and systems
- Checking internal communications and patterns between services and datastores
- Parallel processing and concurrency issues analysis
Support
- Root cause analysis of issues
- Checking and following up on third-party integrations
- Troubleshooting network communications
- Cluster misconfiguration visible at a glance
Tests
- Checking developments
- Post deployment sanity checks
- Access rights debugging through the global fan-out of requests
Performance
- Checking and troubleshooting system performance
- Collecting in-cluster performance metrics without instrumentation
Real life stories
These are real-life stories. They are examples of the unique power Spider brings to troubleshooting distributed systems!
Many other cases happened over the years that I have since forgotten!
Parallel calls
Once, in 2018, in Flowbird production, one microservice showed unexpected behavior. We could neither explain nor reproduce the answers it was giving.
With Spider's sequence diagram, we were able to see which other calls were made in parallel to the same service replica, and we found out that for every bad answer, a call to another API of the same service had been made in parallel.
A few minutes later, we found that a parameter of the second API was declared as a global variable and was affecting the first API!
It took us less than one hour to troubleshoot a case that would have taken weeks otherwise!
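The kind of bug involved can be sketched as follows. This is a minimal illustration only: the handler names, the parameter, and the language are hypothetical and are not taken from the actual Flowbird service. It shows how a per-request parameter stored in a global variable leaks from one API into another on the same replica.

```python
# Hypothetical handlers: names and the parameter are invented for illustration.
_verbose = False  # module-level global, shared by every request on this replica


def api_b(verbose: bool) -> str:
    """Second API: wrongly stores a per-request parameter in shared state."""
    global _verbose
    _verbose = verbose  # bug: this outlives the request that set it
    return "b done"


def api_a(value: int) -> str:
    """First API: its answer silently depends on the global set by api_b."""
    return f"{value} (debug)" if _verbose else str(value)


alone = api_a(42)      # answer when api_b has not interfered
api_b(verbose=True)    # a call to the other API on the same replica
after_b = api_a(42)    # same input, different answer

print(alone)
print(after_b)
```

The sketch runs the calls one after the other for simplicity. With parallel requests, the same leak occurs whenever the second API writes the global before the first API reads it, which is why the bad answers only showed up alongside a parallel call.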
Time synchronisation delay
In 2021, in Flowbird production, we noticed that a random number of calls were answered with 403 Not authorized responses, while the calls just before or after, made with the same credentials, succeeded.
The analysis had been in progress for weeks without success, causing much annoyance to customers. Spider was switched off at the time.
Out of options, we reinstalled Spider and found the culprit less than 1 hour after installation!
- The IAM solution was generating a JWT with the `nbf` ("not before") field.
- This field contains a date (with a resolution in seconds) before which the token is invalid.
- The issue was that the clock of the IAM server was drifting faster than those of the other servers.
- Even with the `ntp` time synchronization updates, the IAM server ended up, every hour, 1 or 2 ms ahead of the application server.
- Tokens generated close to the start of a second could then be received by the application server while it was still in the previous second, making the token invalid!
Spider helped us see quickly that those rejected calls had a token not valid at the time of the capture!
We would never have found it without it!
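The race can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names and the timestamps are invented, network latency is ignored, and the IAM clock is assumed to run 2 ms ahead of the application server. The leeway shown at the end is one common mitigation (RFC 7519 allows implementers to apply a small leeway for clock skew), not necessarily the fix that was applied here.

```python
import math


def issue_nbf(iam_clock: float) -> int:
    """nbf as issued by the IAM server: its own clock, truncated to whole seconds."""
    return math.floor(iam_clock)


def is_accepted(nbf: int, app_clock: float, leeway: float = 0.0) -> bool:
    """Accept the token only once the receiver's clock has reached nbf (minus leeway)."""
    return app_clock + leeway >= nbf


SKEW = 0.002  # assumed: IAM clock is 2 ms ahead of the application server

# Token issued 1 ms after the IAM clock ticked over to second 1000.
iam_clock = 1000.001
app_clock = iam_clock - SKEW  # the application server is still in second 999

nbf = issue_nbf(iam_clock)
rejected = not is_accepted(nbf, app_clock)

# A token issued in the middle of a second is unaffected by the same skew.
mid_second_ok = is_accepted(issue_nbf(1000.500), 1000.500 - SKEW)

# Tolerating a small clock skew on the receiving side avoids the rejection.
accepted_with_leeway = is_accepted(nbf, app_clock, leeway=1.0)

print(nbf, rejected, mid_second_ok, accepted_with_leeway)
```

Only tokens issued within the skew window after a second boundary are rejected, which matches the seemingly random failures between otherwise identical calls.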
NGINX forward auth cache issue
In 2022, on the Flowbird test platform, we found that some requests were made by a process spawned by a service using a token this process could not have known about!
Using Spider we found out that:
- The token had been served by NGINX cache based on the result of a previous request
- We found the previous request and noticed that it was calling NGINX with two different authentications: a certificate and a token, each for a different user!
- So there was a bug in the code (to fix)
- But NGINX was also serving, from its cache, the result of a request made with two authentications to a new request made with only one
- So there was a bug in the NGINX configuration (or in NGINX itself)
Seeing the unseen to understand the unexpected!
That's a huge strength!!