Shutdown delay

April 11, 2023 · 3 min read

Creator of Spider

I introduced a shutdown delay on all microservices to postpone their shutdown.
The Kubernetes control plane needs this time to update the Load Balancer and tell it that the services are becoming unreachable.

Context

On Kubernetes, when pods are scaled down or stopped, a SIGTERM signal is sent to them. The control plane is updated, and the load balancer used in Kube Services is updated to remove the pods from its list.

According to the Kube spec, the Load Balancer should be updated immediately. But nothing is immediate...
In practice, it may take several seconds to update the Load Balancer.

In EKS, it takes less than 1 s, but on Karbon, it is more like 7 to 10 s!

Linked external issues:

Current way of shutdown

On Spider microservices, the shutdown process is clean:

When SIGTERM is received, a shutdown is requested on the server
The HTTP server stops accepting new connections
Then, all ongoing connections are awaited until completion
Then the service is stopped
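
The steps above can be sketched as follows. This is a minimal illustration, not the actual Spider code: `Closable`, `shutdownNow` and `stopService` are hypothetical names, and the only assumption about the server is that it exposes a `close(callback)` method like Node's `http.Server`.

```typescript
// Minimal shape needed from the server; Node's http.Server matches it.
interface Closable {
  close(callback?: (err?: Error) => void): unknown;
}

// Hypothetical helper mirroring the current shutdown sequence.
function shutdownNow(server: Closable, stopService: () => void): void {
  // close() stops accepting new connections right away and calls back
  // once the existing connections have ended.
  server.close(() => stopService());
}

// Illustrative wiring in a real service (httpServer is assumed to exist):
// process.once("SIGTERM", () => shutdownNow(httpServer, () => process.exit(0)));
```

Note that nothing separates the signal from the call to `close()`: the server stops listening as soon as SIGTERM arrives.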

The issue is that almost no time passes between receiving SIGTERM and closing the HTTP server to inbound connections.

But the Kube Service Virtual IP may still send some requests to this HTTP server for some (short) time, thus causing errors. Errors that could be hard to troubleshoot!

We found those errors with Spider running in Karbon, as it was generating errors for 8 to 17 seconds!
All of them were about unreachable replicas. Which was strange...
But thanks to Spider's self-monitoring UI, it took me only a couple of minutes to correlate these errors with the downscaling of replicas by the Kubernetes HPA!

Solutions

After some searches, I found 2 options:

Add a preStop lifecycle hook doing a sleep(n) in a shell / script, which waits when the stop command is received, before SIGTERM is sent.
- Reference: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
Add a delay in the applicative layer of the apps, which forces a wait before shutting down the server.
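
As a rough sketch of the second option, the delay can sit between the signal and the closing of the HTTP server. The names below (`Closable`, `shutdownWithDelay`, `delayMs`, `schedule`) are hypothetical and do not reflect the actual Spider setup options; `schedule` defaults to `setTimeout` and is only injectable to make the helper easy to test.

```typescript
// Minimal shape needed from the server; Node's http.Server matches it.
interface Closable {
  close(callback?: (err?: Error) => void): unknown;
}

// Hypothetical helper: keep serving for delayMs after the stop request,
// then close the server and stop the service once connections have ended.
function shutdownWithDelay(
  server: Closable,
  delayMs: number,
  stopService: () => void,
  schedule: (fn: () => void, ms: number) => unknown = setTimeout,
): void {
  schedule(() => {
    // Only now does the server stop accepting new connections.
    server.close(() => stopService());
  }, delayMs);
}

// Illustrative wiring (httpServer and the 10 s value are assumptions):
// process.once("SIGTERM", () =>
//   shutdownWithDelay(httpServer, 10_000, () => process.exit(0)));
```

During the delay, the pod keeps answering the requests that the Service still routes to it. The delay plus the time needed to drain connections has to fit within the pod's terminationGracePeriodSeconds (30 seconds by default), after which Kubernetes sends SIGKILL.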

Pros & Cons

Solution	Pros	Cons
PreStop hook	Fast to develop	A shell would add some attack surface, but I could develop a specific node.js script to do it.
Applicative change	* In the framework * May be used outside Kube	Less attack surface

Conclusion

Finally, I decided to add the shutdown delay inside the code. It took me an evening, and the solution worked :)

It is now part of the setup options!
