
Rewriting the Controller in Go, with Claude

· 9 min read
Creator of Spider

The Spider Controller is the Kubernetes-native agent that watches your cluster in real time: it tracks pods, services, deployments, and ownership chains, attaches capture agents to pods, answers DNS queries, and maintains a live WebSocket connection to the Spider backend. For the past few years it has been written in Node.js. That changes now.

Why rewrite?

Two pain points made this rewrite inevitable.

Scalability. Node.js is single-threaded. The Controller needs to simultaneously watch eight types of Kubernetes objects, process attachment decisions, serve DNS queries, maintain a WebSocket connection, and run periodic jobs. On small-to-medium clusters this is perfectly fine. On large clusters — 50+ nodes, 4,000+ pods, 3,000+ services — the event loop begins to feel the pressure.

State drift. In high-churn environments like development clusters that get wiped and redeployed frequently, the in-memory state built up by the Controller can drift from reality. The Node.js implementation relied on a hand-rolled watch loop; a reconnection or a missed event could leave orphaned entries in memory. The Go ecosystem's client-go informer framework, by contrast, was purpose-built to handle exactly this — automatic reconnection, resource-version tracking, cache resync, and retry.

Go was the natural choice: it is already the language of the other Spider capture agents (Gossiper and Gocipher), it produces a single static binary with no runtime dependencies, and its goroutine model maps well to the concurrent nature of the work.

The approach: Claude as co-developer

Rather than starting from a blank file, I handed the task to Claude (via Claude Code) with a clear mandate:

Write a drop-in replacement for the Node.js Controller in Go. Same HTTP API. Same WebSocket protocol. Same DNS proxy behaviour. Same log format. Lower footprint.

The workflow unfolded in structured phases:

| Phase | What happened |
| --- | --- |
| Discovery | Claude read the entire Node.js codebase and produced a detailed architecture document covering every component: config polling, JWT lifecycle, Kubernetes watchers, attachment logic, DNS proxy, WebSocket handlers, HTTP routes, periodic jobs |
| Planning | An implementation plan was written, broken into six sequential phases with an explicit feature checklist to track against |
| Implementation | Each phase was implemented in order, with Claude writing the Go code against the established patterns from Gossiper and Gocipher |
| Iteration | Build, deploy, test on a local cluster; then on a real production cluster |

A few practical observations about working with Claude on a project of this size:

  • Context is everything. Claude regularly needs to be reminded of the goal — "this must be a drop-in replacement, same external behaviour" — especially as the conversation accumulates. Keeping a living progress document that Claude updates after each phase makes resuming sessions much easier.
  • Clearing context between phases helps. Starting a fresh conversation for each phase, with the progress document as input, prevents the model from carrying stale assumptions from earlier work.
  • Give it the ability to test. Providing Claude with the build, deploy, and test commands — including how to deploy to a real cluster and read logs — made a huge difference. An AI that can run, observe, and iterate is far more effective than one that only writes code.
  • Ask it to update its own memory. Telling Claude to keep the progress document and its memory files current between sessions made long-running work across multiple evenings practical.

The implementation itself covered a lot of ground:

  • Kubernetes informers for 8 resource types with namespace filtering, ownership chain tracking, and service-to-pod label matching
  • A complete whisperer attachment system with 4 trigger mechanisms, duplicate detection, and failure throttling
  • WebSocket client with exponential-backoff reconnection and request/response correlation
  • UDP DNS server (A, PTR, AAAA records) with host aliases and system fallback
  • HTTP REST API with RS256 JWT authentication
  • Per-minute network usage buffer with Kubernetes metadata enrichment
  • Structured JSON logging with integer log levels matching the existing agents

Full feature parity with the Node.js version was reached across the six phases.

First deployment: a rude awakening

The first production deployment was on a large cluster:

| Cluster | Figures |
| --- | --- |
| Nodes | 53 |
| Pods | 4,092 |
| Services | 3,323 |
| Deployments | 1,795 |
| Active whisperers | 224 |
| Gociphers | 52 |

The results were not encouraging:

| Metric | Node.js | Go v1 |
| --- | --- | --- |
| CPU | ~7% | 153% |
| Memory | 189 MB | 1,616 MB |

21× the CPU. 8× the memory. Not exactly the "lower footprint" we were aiming for.

This is the part where you resist the urge to declare the rewrite a failure. Instead: profile.

Performance improvements — round by round

What followed was five rounds of optimisation, Rounds 0 through 4, with the last two driven by pprof profiling on the live cluster.

Round 0 — The obvious fixes

Before profiling, a few structural issues were visible from code review:

  • Pod field selector at the API server. Terminal pods (Succeeded, Failed) were being loaded into the informer cache. Filtering them out at source reduced the object count significantly.
  • Metadata stripping. Kubernetes objects carry annotations and managed fields that the Controller never reads. A transform function stripped these before objects entered the informer store.
  • Resync interval. Reduced from 1 minute to 10 minutes — the informer's periodic full re-list was happening far too often.
  • Resync skip. client-go calls UpdateFunc with the same object pointer for old and new on resync events. Adding if oldObj == newObj { return } across all eight informers eliminated a flood of redundant work every 10 minutes.

Result after Round 0: 59% CPU / 884 MB

Round 1 — Cache and evaluation optimisations

Six targeted changes driven by profiling analysis:

  • Extended the resync skip to all informer UpdateFunc handlers (not just pods). On each 10-minute resync: 3,323 services were re-linking their pods, 1,795 deployments were scanning for orphans. All now skipped when nothing actually changed.
  • Replaced global map clones with targeted lookups. Checking 0–3 entries on a specific pod is orders of magnitude cheaper than cloning a 168-entry global map, especially when called on every Kubernetes event.
  • Incremental object counts. Instead of scanning all 13,000+ cached objects every 20 seconds to compute statistics, a counter is now incremented/decremented on each add/delete. O(1) instead of O(n).
  • Cached host IP set. Similarly, the set of node IPs was previously derived by scanning 4,092 pods every 20 seconds. Now maintained incrementally.
  • Namespace pre-filter in attachment evaluation. The first check before any owner-chain walk: if the attachment targets a different namespace, skip immediately.
  • Bounded goroutine pool for attachments. A semaphore of 10 caps the number of concurrent attachment goroutines, preventing memory spikes during mass evaluation bursts.

Result after Round 1: ~24% CPU / 336 MB

Round 2 — Signal coalescing

The attachment evaluator was re-running a full evaluation on every Kubernetes workload event. With 47 active attachments and thousands of pods, that multiplies quickly. Two changes:

  • Removed the 30-second periodic evaluation ticker. The periodic evaluation was redundant given the state-check job that runs as a safety belt every 30 seconds anyway.
  • Channel drain before evaluation. Multiple TriggerEvaluation signals arriving in a burst are now merged into a single evaluation pass. N workload events → 1 evaluation, not N.

Result after Round 2: ~11% CPU / 279 MB

Round 3 — DNS was the real problem (in the profile)

At ~11% total CPU, pprof showed the DNS server accounting for 53% of the sampled time. The cause: every DNS query was scanning all 4,092 pods and all 3,323 services using fmt.Sprintf allocations for hostname matching — roughly 10,000 string allocations per query.

The Kubernetes cache already maintained forward and reverse DNS maps for exactly this purpose. Switching to direct O(1) map lookups and eliminating all fmt.Sprintf calls from the hot path fixed this almost entirely.

Result after Round 3: 7–9% CPU / 279 MB

Round 4 — DNS fallback cache

With the in-process lookups now O(1), a new pprof run showed the system DNS fallback (net.Resolver) at the top of the profile — queries not found in the Kubernetes cache still triggered real UDP round-trips to the upstream resolver. A simple TTL cache (30 seconds positive, 5 seconds negative) absorbed the repeated lookups.

Result after Round 4: 5–6% CPU / 287 MB

Final comparison

| Metric | Node.js | Go (final) |
| --- | --- | --- |
| CPU | ~7% | 5–6% |
| Memory | 189 MB | 287 MB |

CPU is now slightly better than Node.js. Memory is higher by roughly 100 MB, a structural gap explained by client-go informers maintaining their own internal object store alongside the application cache (every object lives twice in memory), plus Go runtime overhead. Closing that gap would require replacing informers with raw Kubernetes watches, which would trade away the reliability features that motivated the rewrite in the first place. Keeping the informers and paying the extra memory is a worthwhile trade-off.

Lessons learned

A few things that stand out from this project:

Profile before optimising. Once the obvious inefficiencies were cleared, the DNS server was accounting for over half of the remaining sampled time — not the Kubernetes watchers, not the attachment logic. That was not predictable from code review. pprof pointed directly at it.

O(n) in hot paths is invisible until scale. Scanning 7,000+ objects per DNS query, or cloning 168-entry maps on every pod event, is perfectly harmless on a small cluster and catastrophic on a large one.

client-go resync is expensive without the skip. The oldObj == newObj check is a single line. Without it, every 10-minute resync triggers a full cascade of re-evaluation across all cached objects.

Signal coalescing matters for bursty systems. Kubernetes generates events in bursts — a deployment rollout touches hundreds of pods rapidly. Without draining the evaluation channel, each of those events triggers a full evaluation independently.

fmt.Sprintf in hot paths allocates. In a DNS server handling hundreds of queries per second, each of those allocations immediately becomes GC pressure. Allocation-free O(1) lookups change the picture entirely.

Conclusion

Four evenings of work, versus a human estimate of one month or more. The result is a Go binary that matches the Node.js Controller's external behaviour exactly — same HTTP API, same WebSocket protocol, same DNS proxy, same log format — while running at slightly lower CPU on a large production cluster and with better scaling headroom as cluster size grows.

The rewrite also uncovered and fixed the state drift issue: client-go informers handle reconnection, resource version tracking, and periodic resync automatically, eliminating the category of bugs that motivated the migration in the first place.

Claude did the heavy lifting. The human contribution was context management, goal reminders, local and remote test infrastructure, and the performance debugging lead — including instrumenting the code for remote pprof profiling. A genuinely productive collaboration.