3× More Connections. Same Core. Inside the Performance Research Behind FlashProxy's Accept Engine

How FlashProxy used an autonomous AI optimizer to discover a TCP accept engine that uses 3× fewer CPU instructions than Go. Open-source and MIT licensed.

There is a class of performance problems that only shows up at scale, and it is deeply humbling when it does. You are not bottlenecked on your routing logic. Not on encryption. Not on anything your application is actually supposed to be doing. You are bottlenecked on the infrastructure around it. On the plumbing.

That is exactly what happened to us.

At around 40,000 connections per second, our production Go proxy began hitting CPU limits in the accept path. The culprit was Go's standard model for network services: one goroutine per connection, one syscall per operation. Every short-lived connection (a health check, a load balancer probe, a small redirect) touches the kernel four times: accept, read, write, close. At high connection churn, those four round-trips stop being overhead and start being the entire cost. The application code is nearly free. Getting bytes in and out of the kernel is not.
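To make the four crossings concrete, here is a minimal C sketch of the blocking request-reply-close cycle. It illustrates the syscall pattern only and is not the Go proxy itself; the function names `handle_conn` and `serve_one` and the exact reply bytes are assumptions made for the example.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fixed reply. The exact bytes are an assumption for illustration. */
static const char REPLY[] = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n";

/* Read the request, write the fixed reply, close the socket.
 * Each call below is one user/kernel crossing. Returns 0 on success. */
int handle_conn(int conn) {
    assert(conn >= 0);
    char buf[1024];
    ssize_t want = (ssize_t)(sizeof REPLY - 1);
    ssize_t n = read(conn, buf, sizeof buf);                        /* crossing 2 */
    ssize_t w = n > 0 ? write(conn, REPLY, sizeof REPLY - 1) : -1;  /* crossing 3 */
    int c = close(conn);                                            /* crossing 4 */
    return (n > 0 && w == want && c == 0) ? 0 : -1;
}

/* accept is the first crossing; the other three follow in handle_conn. */
int serve_one(int listen_fd) {
    int conn = accept(listen_fd, NULL, NULL);                       /* crossing 1 */
    return conn < 0 ? -1 : handle_conn(conn);
}
```

A server built this way calls `serve_one` in a loop on a listening socket. Nothing in the loop is expensive on its own; the cost is that every connection pays all four transitions, however small the request.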

The fix everyone reaches for is io_uring, Linux's asynchronous I/O interface that batches kernel operations and dramatically reduces the number of times you have to cross the user/kernel boundary. We knew that. The harder question was how to tune it.

Why We Didn't Just Tune It Ourselves

io_uring has a long list of documented optimizations: multishot accept, registered file descriptors, DEFER_TASKRUN, completion chains, MSG_MORE coalescing. The problem is that these techniques interact with each other in ways that are genuinely difficult to predict. Some combinations compound. Some cancel each other out. Some regressions are completely invisible until you benchmark under realistic load on real hardware.

A developer sitting with perf stat can test a handful of combinations in a day. But the real search space, covering combinations, orderings, parameter values, and kernel version interactions, is far larger than that. More importantly, human intuition is a liability here. We tend to hold onto theories that look right on paper, even when the data says otherwise.

So we built a closed measurement loop and gave the search problem to an AI agent.

How the Research Loop Worked

We set up two server implementations running side by side. The control was a Go server modelling our production proxy's accept path exactly: goroutine per connection, SO_REUSEPORT fan-out. It was frozen for the entire experiment. The treatment was a C server using liburing, starting from a basic single-shot io_uring accept loop. That was the only code that could be changed.

Both servers served the same contract: accept a connection, read request bytes, write a fixed HTTP 200 response, close. Same hardware, same kernel, same load. The only variable was how they handled it.

Claude Code ran headless as the optimizer. Each iteration it read the current champion, the full history of previous mutations stored as git commits, a knowledge base of lessons accumulated across runs, and a database of profiling data. It formed a single hypothesis and made its edit. Then it handed off entirely.

A separate bash harness, which the AI could not touch, built the mutation, pinned it to an isolated CPU core, ran the load generator, and scored the result. The scoring formula was:

score = 1,000,000,000/mean(instructions per connection)

CPU instruction counts are an exact hardware measurement. Unlike throughput numbers, they are immune to thermal throttling and clock frequency variation. They tell you precisely how much work the CPU is doing per connection, which is what actually determines capacity at scale.

If a mutation improved the score by more than 3% over the current champion, it was promoted. If not, it was deleted with git reset --hard and the loop continued. The AI wrote the hypotheses. The harness made every keep or revert decision. Neither interfered with the other's job.

To prevent the optimizer from gaming the benchmark, every scored run was validated: the reply bytes had to be exactly correct, connections had to complete end-to-end, and the failure rate had to stay below 0.01%. Any violation scored zero.
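The scoring, validation, and promotion rules described above fit in a few lines. The following is a sketch in C under stated assumptions, not the harness itself (which was a bash script): the struct and its field names are hypothetical, and only the formula, the 3% threshold, and the 0.01% failure ceiling come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One benchmark run as a harness might record it (field names are hypothetical). */
struct run_result {
    double mean_instr_per_conn; /* mean CPU instructions per connection */
    size_t attempted;           /* connections attempted */
    size_t failed;              /* connections that did not complete end-to-end */
    bool   reply_bytes_ok;      /* every reply matched the expected bytes exactly */
};

/* score = 1,000,000,000 / mean(instructions per connection).
 * Any validation failure scores zero. */
double score(const struct run_result *r) {
    assert(r != NULL);
    if (!r->reply_bytes_ok || r->attempted == 0 || r->mean_instr_per_conn <= 0.0)
        return 0.0;
    if ((double)r->failed / (double)r->attempted >= 0.0001) /* 0.01% ceiling */
        return 0.0;
    return 1e9 / r->mean_instr_per_conn;
}

/* Promote only when the candidate beats the champion by more than 3%. */
bool should_promote(double champion_score, double candidate_score) {
    return candidate_score > champion_score * 1.03;
}
```

Because an invalid run scores zero, it can never clear the promotion threshold, so a mutation that gets faster by dropping connections or corrupting replies is reverted like any other regression.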

What We Found

The optimizer ran for two days and converged on six changes that together account for the full performance gap.

DEFER_TASKRUN and SINGLE_ISSUER ring flags. These move completion task-work into the worker thread's own event loop, eliminating cross-CPU wakeups. This was the single largest jump in the entire run, a direct reduction in kernel CPU cost per operation, not a throughput trick.
Registered file descriptors. With direct descriptors, accepted connections live in the ring's own table rather than the process file descriptor table. This skips the fd-table install on accept and the lookup on every subsequent operation. The optimizer tested multiple table sizes and found that 4,096 entries was the optimal configuration.
Multishot accept. Instead of re-arming the accept operation after every connection, you arm it once and the kernel posts a completion for each new connection automatically. This cut io_uring_enter calls per connection to 0.34.
Per-worker connection freelist. Pre-allocating 128 connection objects per worker eliminated malloc on the hot path entirely. The measured share of CPU instructions going to libc dropped from 1.32% to 0.92%.
MSG_MORE reply and FIN fusion. Sending the reply with MSG_MORE holds it in the TCP write queue so the connection's FIN can piggyback onto it, shipping both as a single TCP segment. One NIC doorbell instead of two. The optimizer found this by noticing that a low-level kernel write function was consuming twice its expected share of instructions and tracing it back to unnecessary segment splitting.
Batched completions with CQE_SKIP_SUCCESS. Tagging send and close operations so they do not generate completion events on success means only accept and receive produce completions, roughly two per connection instead of four. One submit-and-wait call drives many connections simultaneously.
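Of the six changes, the per-worker freelist is the easiest to show in isolation. The sketch below is one plausible shape for it in C, not flashaccept's actual data structure: the struct layout, the buffer size, and the choice to return NULL on exhaustion are assumptions, and only the pool size of 128 comes from the text.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 128 /* per-worker pool size reported above */

struct conn {
    int fd;
    char buf[512];          /* hypothetical request buffer size */
    struct conn *next_free; /* intrusive link, meaningful only while free */
};

struct conn_pool {
    struct conn slots[POOL_SIZE]; /* storage allocated once, with the worker */
    struct conn *free_head;
};

/* Thread every slot onto the free list. */
void pool_init(struct conn_pool *p) {
    p->free_head = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->slots[i].next_free = p->free_head;
        p->free_head = &p->slots[i];
    }
}

/* O(1) pop with no malloc; returns NULL when every object is in flight. */
struct conn *pool_get(struct conn_pool *p) {
    struct conn *c = p->free_head;
    if (c)
        p->free_head = c->next_free;
    return c;
}

/* O(1) push with no free; c must have come from this pool. */
void pool_put(struct conn_pool *p, struct conn *c) {
    assert(c >= p->slots && c < p->slots + POOL_SIZE);
    c->next_free = p->free_head;
    p->free_head = c;
}
```

Each worker owns its own pool, so no locking is needed: a worker calls `pool_get` when an accept completes and `pool_put` when the connection closes.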

The Failure That Taught the Most

At one point, the optimizer tried linking the receive, send, and close operations into a chain, a technique that causes each to fire automatically when the previous completes. The goal was to reduce kernel round-trips, and it worked: enter calls per connection dropped from 1.90 to 1.40.

The score fell 34%. Reverted immediately.

The lesson that went into the knowledge base: minimizing kernel entries is not the lever. Chain serialization costs more than the entries it saves.

This is exactly the kind of result that breaks human intuition. The metric that looked like the bottleneck was not the actual bottleneck. An engineer would probably have defended that optimization longer. The loop measured it, rejected it, and moved on.

Where the Search Stopped

After the six wins, the optimizer explored roughly a dozen more candidates. All were reverted, not because they regressed, but because their gains were smaller than measurement noise and fell short of the 3% promotion threshold. There was nothing left to find.

At the champion configuration, approximately 94% of the remaining CPU belongs to the Linux kernel's TCP stack. Around 1% is application code. Around 4% is io_uring internals. There is no user-space code left to optimize meaningfully. The research correctly identified the floor and stopped at it.

The Results

On one pinned core, CPU-bound, with 512 fixed in-flight connections on loopback:

Go goroutine-per-connection: 83,250 instructions per connection
Vanilla io_uring starting baseline: 59,931 instructions per connection
FlashProxy's accept engine: 27,363 instructions per connection

That is 3.04× fewer CPU instructions per connection than Go (83,250 / 27,363), and 2.19× fewer than the io_uring baseline (59,931 / 27,363). On a single saturated core, that translates to roughly three times the connection throughput compared to the goroutine model.

These are loopback benchmarks designed to isolate CPU cost cleanly. The ratios are what matter, not the absolute numbers. The underlying techniques (multishot accept, DEFER_TASKRUN, registered descriptors) are documented io_uring idioms. What the research produced was proof of which combinations actually work together, and which ones look good on paper but cost you in practice.

The Open Source Library

The winning design is now flashaccept, an open-source C library that packages all six optimizations behind a simple API. You give it a port and a request handler. It runs an optimized io_uring accept loop per core automatically.

It requires Linux and liburing 2.3 or later. On older kernels it degrades gracefully. The intended use case is high-churn, short-lived connections: health checks, redirectors, load balancer probes, small RPC responses. The complete research rig, including the baseline, the harness, the optimizer configuration, and the full accumulated knowledge base, is included and fully reproducible.

MIT licensed. Available now at github.com/thealonlevi/flashaccept.

This research came directly out of building FlashProxy's proxy infrastructure. If you are running a high-churn Linux service and the accept path is your bottleneck, we built this for exactly that problem. If you want to extend it or run the research rig yourself, the repository has everything you need.

FAQ

What is flashaccept?

flashaccept is an open-source C library built by FlashProxy that provides a high-performance TCP accept engine for Linux. It uses io_uring under the hood and accepts connections with 3.04× fewer CPU instructions than a standard Go goroutine-per-connection server. It is MIT licensed and available on GitHub.

What workloads is it designed for?

flashaccept is built for high-churn, short-lived connections where the request-reply-close cycle happens at high volume: health check endpoints, load balancer probes, HTTP redirectors, and small RPC responses. It is not designed for keep-alive connections or multi-exchange sessions in v1.

What Linux version and liburing version do I need?

You need Linux with liburing version 2.3 or later, which ships with Ubuntu 24.04 and later. The fast path (multishot accept, direct descriptors) requires kernel 5.19 or newer. On older kernels the library falls back gracefully to single-shot accept and regular file descriptors.
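One common way to structure this kind of fallback is to attempt the newer configuration and retry with a plainer one when the kernel rejects it. The sketch below applies that idea to the DEFER_TASKRUN and SINGLE_ISSUER ring flags, which need a 6.x kernel. It is an illustration only, not flashaccept's detection logic: it calls the raw `io_uring_setup` syscall instead of liburing so that it compiles without the library, and the function names and the retry policy are assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Older kernel headers may lack these; the values match the upstream uapi header. */
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN (1U << 13)
#endif

/* Thin wrapper over io_uring_setup(2): returns a ring fd, or -errno. */
static int ring_setup(unsigned entries, unsigned flags) {
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    p.flags = flags;
    long fd = syscall(__NR_io_uring_setup, entries, &p);
    return fd < 0 ? -errno : (int)fd;
}

/* Try the fast-path flags first. A kernel that does not know them answers
 * EINVAL, in which case retry with no flags. *fast_path is set to 1 only
 * when the first attempt succeeded. Returns a ring fd, or -errno. */
int open_ring_with_fallback(unsigned entries, int *fast_path) {
    assert(fast_path != NULL);
    int fd = ring_setup(entries,
                        IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN);
    *fast_path = (fd >= 0);
    if (fd == -EINVAL)
        fd = ring_setup(entries, 0);
    return fd;
}
```

The returned descriptor is not yet a usable ring; a real program would go on to map the submission and completion queues, which liburing's `io_uring_queue_init_params` does in one call. Errors other than EINVAL (for example when io_uring is disabled by policy) are passed back to the caller unchanged.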

What is io_uring and why does it matter for proxy performance?

io_uring is a Linux kernel interface introduced in kernel 5.1 that allows applications to submit and receive I/O operations asynchronously using shared memory ring buffers, dramatically reducing the number of system calls required. For a proxy infrastructure handling tens of thousands of short-lived connections per second, the cost of crossing the user/kernel boundary repeatedly becomes the dominant CPU expense. io_uring collapses that cost significantly.

Is flashaccept what FlashProxy runs in production?

The research came directly out of our production scaling work. The proxy infrastructure at FlashProxy operates across 190+ countries and handles substantial connection volume. The accept path optimization was a real engineering requirement, not a research exercise. flashaccept is the distilled result of that work, open-sourced so others can use it.

Can I use flashaccept today?

Yes. It is version 1.0.1, MIT licensed, and available via GitHub, vcpkg, Conan, and the Arch AUR. The library has been tested under AddressSanitizer and UBSan with 400,000+ connections across all four configuration paths.

Where can I learn more about FlashProxy's infrastructure?

You can read more about how we build and scale proxy infrastructure on the FlashProxy blog or explore our proxy network directly.
