Shutdown delay

· 3 min read
Creator of Spider

I introduced a shutdown delay on all microservices: each one now waits a little after being asked to stop before it actually shuts down.
The Kubernetes control plane needs this time to update the load balancer and tell it that the service instances are becoming unreachable.

Context

On Kubernetes, when pods are downscaled or stopped, a SIGTERM signal is sent to them. The control plane is updated, and the load balancer behind Kube Services is updated to remove those pods from its list.

According to the Kube spec, the load balancer should be updated immediately. But nothing is immediate...
In practice, updating the load balancer may take several seconds.

On EKS, it takes less than 1 s, but on Karbon it is more like 7 to 10 s!

Linked external issues:

Current way of shutdown

On Spider microservices, the shutdown process is clean:

  • When SIGTERM is received, a shutdown is requested on the server
  • The HTTP server stops accepting new connections
  • Then the service waits for all ongoing connections to complete
  • Then the service is stopped
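
As an illustration, the sequence above could be sketched like this in TypeScript. This is not Spider's actual code: `ClosableServer` and `createShutdownHandler` are hypothetical names, and the sketch assumes a Node.js-style `close(callback)` that stops accepting new connections and calls back once the open ones have ended, which is how `http.Server.close` behaves.

```typescript
// Illustrative sketch only; the names below are hypothetical, not Spider's code.

// Anything exposing a Node-style close(callback). Node's http.Server fits this shape.
interface ClosableServer {
  close(callback?: (err?: Error) => void): unknown;
}

// Builds the function to run when SIGTERM is received.
function createShutdownHandler(
  server: ClosableServer,
  exit: (code: number) => void,
): () => void {
  let shuttingDown = false;
  return () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    // close() stops accepting new connections right away, then calls back
    // once the connections already open have ended.
    server.close((err) => exit(err ? 1 : 0));
  };
}

// Possible wiring, assuming `server` is an http.Server created elsewhere:
// process.once("SIGTERM", createShutdownHandler(server, (code) => process.exit(code)));
```

Note that nothing in this flow waits before `close()` is called, which is exactly where the problem described next comes from.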

The issue is that there is (almost) no time between receiving SIGTERM and closing the HTTP server to inbound requests.

But the Kube Service Virtual IP may still send some requests to this HTTP server for some (short) time, thus causing errors. Errors that could be hard to troubleshoot!

We found those errors with Spider running on Karbon, as it was generating errors for 8 to 17 seconds,
all for an unreachable replica. Which was strange...
But thanks to Spider's self-monitoring UI, it took me only a couple of minutes to correlate these errors with the downscaling of replicas by the Kubernetes HPA!

Solutions

After some searching, I found 2 options: a PreStop hook, or an applicative change.

Pros & Cons

| Solution | Pros | Cons |
| --- | --- | --- |
| PreStop hook | Fast to develop | A shell would add some attack surface, but I could develop a specific Node.js script to do it |
| Applicative change | In the framework; may be used outside Kube; less attack surface | |

Conclusion

In the end, I decided to add the shutdown delay inside the code. It took me an evening, and the solution worked :)

It is now part of the setup options!
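
For readers who want to try the same idea, a shutdown delay in application code might look roughly like the sketch below. It is not Spider's implementation: `createDelayedShutdownHandler`, the option names and the injectable `schedule` timer are all assumptions made for the example. The only difference from an immediate shutdown is that the server keeps serving for `delayMs` after SIGTERM before it is closed.

```typescript
// Illustrative sketch only; names and option shapes are assumptions, not Spider's API.

// Anything exposing a Node-style close(callback). Node's http.Server fits this shape.
interface ClosableServer {
  close(callback?: (err?: Error) => void): unknown;
}

interface DelayedShutdownOptions {
  // How long to keep serving after SIGTERM, in milliseconds.
  delayMs: number;
  // Called with the exit code once the server has fully stopped.
  exit: (code: number) => void;
  // Timer function; defaults to setTimeout, injectable for tests.
  schedule?: (fn: () => void, ms: number) => unknown;
}

function createDelayedShutdownHandler(
  server: ClosableServer,
  options: DelayedShutdownOptions,
): () => void {
  const schedule =
    options.schedule ?? ((fn: () => void, ms: number): unknown => setTimeout(fn, ms));
  let shuttingDown = false;
  return () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    // During the delay the server keeps accepting requests, so traffic that
    // Kubernetes still routes to this pod is answered normally.
    schedule(() => {
      // Only now stop accepting connections and wait for open ones to end.
      server.close((err) => options.exit(err ? 1 : 0));
    }, options.delayMs);
  };
}

// Possible wiring, assuming `server` is an http.Server and 10 s suits the cluster:
// process.once("SIGTERM", createDelayedShutdownHandler(server, {
//   delayMs: 10_000,
//   exit: (code) => process.exit(code),
// }));
```

Whatever delay is chosen, the delay plus the time needed to drain connections has to fit within the pod's `terminationGracePeriodSeconds` (30 seconds by default), after which Kubernetes sends SIGKILL.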