
Shutdown delay

Creator of Spider

I introduced a shutdown delay on all microservices to postpone their shutdown.
The Kubernetes control plane needs this time to update the Load Balancer and tell it that the services are becoming unreachable.

Context

On Kubernetes, when pods are downscaled or stopped, a SIGTERM signal is sent to them. The control plane is updated, and the load balancer used by Kube Services is updated to remove the pods from its list.

According to the Kube spec, the Load Balancer should be updated immediately. But nothing is immediate...
In practice, it may take several seconds to update the Load Balancer.

On EKS, it takes less than 1 s, but on Karbon it is more like 7 to 10 s!

Linked external issues:

Current way of shutdown

On Spider microservices, the shutdown process is clean:

  • When SIGTERM is received, a shutdown is requested on the server
  • The HTTP server is stopped
  • Then, the service waits for all ongoing connections to complete
  • Then the service is stopped
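
The sequence above can be sketched with the built-in Node.js `http` module. This is only an illustration, not Spider's actual code: the helper names `stopHttpServer` and `installImmediateShutdown` are hypothetical, and the sketch assumes a plain `http.Server`.

```javascript
const http = require("node:http");

// Hypothetical helper: stop accepting new connections, then resolve once
// every in-flight request has completed.
function stopHttpServer(server) {
  return new Promise((resolve, reject) => {
    // close() stops listening right away; its callback fires only after
    // all open connections have ended.
    server.close((err) => (err ? reject(err) : resolve()));
    // Drop idle keep-alive sockets so they do not hold the server open
    // (method available in recent Node.js versions, hence the guard).
    if (typeof server.closeIdleConnections === "function") {
      server.closeIdleConnections();
    }
  });
}

// Hypothetical wiring: on SIGTERM, stop the server, then stop the service.
function installImmediateShutdown(server, exit = process.exit) {
  process.once("SIGTERM", async () => {
    await stopHttpServer(server);
    exit(0);
  });
}
```

Nothing in this sketch waits between receiving the signal and calling `server.close()`.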

The issue is that almost no time passes between receiving SIGTERM and closing the HTTP server to inbound connections.

But the Kube Service Virtual IP may still send some requests to this HTTP server for some (short) time, thus causing errors. Errors that could be hard to troubleshoot!

We found those errors with Spider running on Karbon, as it was generating errors for 8 to 17 seconds!
For unreachable replicas. Which was strange...
But thanks to Spider's self-monitoring UI, it took me only a couple of minutes to correlate these errors with the downscaling of replicas by the Kubernetes HPA!

Solutions

After some research, I found 2 options: a PreStop hook or an applicative change.

Pros & Cons

| Solution | Pros | Cons |
| --- | --- | --- |
| PreStop hook | Fast to develop | A shell would add some attack surface, but I could develop a specific node.js script to do it. |
| Applicative change | In the framework; may be used outside Kube; less attack surface | |

Conclusion

In the end, I decided to add the shutdown delay inside the code. It took me an evening, and the solution worked :)

It is now part of the setup options!
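
As an illustration of the applicative option, here is a minimal sketch of a delayed shutdown in Node.js. It is not Spider's actual implementation: `createDelayedShutdown`, `delayMs`, `stop`, and `log` are hypothetical names, and the delay has to be tuned to how long the cluster takes to update its Load Balancer.

```javascript
const { setTimeout: sleep } = require("node:timers/promises");

// Hypothetical factory: returns a shutdown function that keeps the service
// running for `delayMs` before calling `stop`.
function createDelayedShutdown({ delayMs, stop, log = () => {} }) {
  let started = false;
  return async function shutdown() {
    if (started) return; // ignore repeated signals
    started = true;
    log("shutdown requested, still serving for " + delayMs + " ms");
    await sleep(delayMs); // requests still routed to this pod are served
    log("stopping the HTTP server");
    await stop(); // e.g. close the server and wait for ongoing connections
    log("stopped");
  };
}

// Possible wiring (assumes an existing http.Server named `server`):
// const shutdown = createDelayedShutdown({
//   delayMs: 10000,
//   stop: () => new Promise((resolve) => server.close(resolve)),
// });
// process.once("SIGTERM", () => shutdown().then(() => process.exit(0)));
```

Whatever delay is chosen, the delay plus the time needed to drain connections has to fit within the pod's termination grace period (30 s by default in Kubernetes), otherwise the container is killed before the clean shutdown completes.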