Shutdown delay
I introduced a shutdown delay on all microservices to postpone their shutdown.
The Kubernetes control plane needs this time to update the load balancer and tell it that the services are becoming unreachable.
Context
On Kubernetes, when Pods are scaled down or stopped, a SIGTERM signal is sent to them. The control plane is then updated, and the load balancer used by Kube Services is updated to remove the Pods from its list.
According to the Kube spec, the load balancer should be updated immediately. But nothing is immediate...
In fact, it may take several seconds to update the load balancer.
In EKS, it takes < 1s, but on Karbon, it is more like 7 to 10s!
Linked external issues:
- https://github.com/kubernetes/ingress-nginx/issues/3639
- https://gitlab.com/gitlab-org/charts/gitlab/-/issues/2943
Current shutdown process
On Spider microservices, the shutdown process is clean:
- When SIGTERM is received, a shutdown is requested on the server
- The HTTP server is stopped
- Then, the server waits for all ongoing connections to complete
- Then the service is stopped
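
The steps above can be sketched roughly as follows in Node.js. This is a minimal illustration, not Spider's actual code: `stopServer` and `registerShutdown` are hypothetical names, and a plain `http.Server` is assumed.

```javascript
// Hypothetical helper: stop accepting new connections, then wait for
// the ongoing ones to complete.
function stopServer(server) {
  return new Promise((resolve, reject) => {
    // server.close() stops the listener right away; its callback runs
    // once the remaining connections have ended.
    server.close((err) => (err ? reject(err) : resolve()));
  });
}

// Hypothetical wiring: run the clean shutdown when SIGTERM arrives.
// `exit` is injectable only to make the sketch easy to test.
function registerShutdown(server, exit = (code) => process.exit(code)) {
  process.once('SIGTERM', async () => {
    await stopServer(server); // HTTP server stopped, connections drained
    exit(0);                  // then the service is stopped
  });
}

// Usage (assumed):
//   const server = require('node:http').createServer(handler).listen(8080);
//   registerShutdown(server);
```

Note that nothing in this sequence waits before the listener is closed, which is the point discussed next.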
The issue is that almost no time passes between receiving SIGTERM and closing the HTTP server to inbound traffic.
But the Kube Service virtual IP may still send requests to this HTTP server for a (short) time, which causes errors. Errors that can be hard to troubleshoot!
We found those errors with Spider running in Karbon, as it was generating errors for 8 to 17 seconds,
all for unreachable replicas. Which was strange...
But thanks to Spider's self-monitoring UI, it took me only a couple of minutes to correlate these errors with the downscaling of replicas by the Kubernetes HPA!
Solutions
After some research, I found two options:
- Add a preStop lifecycle hook that runs a sleep(n) in a shell or script, so that the Pod waits after receiving the stop command and before SIGTERM is sent.
- Add a delay in the application layer of the apps, which forces a wait before the server is shut down.
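
As an illustration of the second option, here is a minimal Node.js sketch, under stated assumptions: the function names and the `shutdownDelayMs` option are hypothetical, not Spider's real setup option, and a plain `http.Server` is assumed.

```javascript
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical option name and default; tune it to how long the load
// balancer takes to be updated on the target cluster.
async function shutdownWithDelay(server, { shutdownDelayMs = 10000 } = {}) {
  // 1. Keep serving requests while the Service endpoints are updated.
  await sleep(shutdownDelayMs);
  // 2. Only now stop accepting new connections, and wait for the
  //    ongoing ones to complete.
  await new Promise((resolve, reject) => {
    server.close((err) => (err ? reject(err) : resolve()));
  });
}

// Hypothetical wiring: delay, then clean shutdown, on SIGTERM.
// `exit` is injectable only to make the sketch easy to test.
function registerDelayedShutdown(server, options, exit = (code) => process.exit(code)) {
  process.once('SIGTERM', () => {
    shutdownWithDelay(server, options).then(() => exit(0), () => exit(1));
  });
}

// Usage (assumed):
//   registerDelayedShutdown(server, { shutdownDelayMs: 10000 });
```

The delay plus the time needed to drain connections has to stay below the Pod's termination grace period (30 seconds by default in Kubernetes), otherwise the container is killed before the clean shutdown completes.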
Pros & Cons
| Solution | Pros | Cons |
|---|---|---|
| PreStop hook | Fast to develop | A shell would add some attack surface, but I could develop a specific Node.js script to do it. |
| Applicative change | In the framework; may be used outside Kube | Less attack surface |
Conclusion
In the end, I decided to add the shutdown delay inside the code. It took me an evening, and the solution worked :)
It is now part of the setup options!