Multi-Core Packet Distribution: Scaling Network Load in Linux

Modern servers ship with 16, 32, or even 128 CPU cores, yet a Linux host with a single-queue NIC or an untuned configuration can still funnel all incoming traffic through one core. The result is a predictable bottleneck: one core pegged at 100% while the rest sit idle, and throughput plateaus well below what the NIC can deliver. Distributing packet processing across the available cores removes that bottleneck.

Why Single-Core Packet Processing Fails at Scale

When a packet arrives, the NIC raises a hardware interrupt. By default that interrupt is handled by one CPU, which then processes the packet in softirq context. At 10 Gbps, minimum-size 64-byte frames arrive at up to about 14.88 million packets per second, which leaves roughly 67 ns per packet. Even a stripped-down kernel path costs hundreds of nanoseconds per packet, so a single core saturates long before the link does and the NIC starts dropping frames.
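The per-packet time budget follows from simple arithmetic, sketched below in POSIX shell. The function names are made up for this example. The only assumption is standard Ethernet framing, where each frame occupies 20 extra bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap).

```shell
# Packets per second at line rate for a given link speed and frame size.
# Each Ethernet frame carries 20 extra bytes on the wire:
# 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap.
line_rate_pps() {
    # $1 = link speed in Gbit/s, $2 = frame size in bytes (including FCS)
    echo $(( $1 * 1000000000 / (($2 + 20) * 8) ))
}

# Time available per packet, in whole nanoseconds, if one core handles everything.
per_packet_budget_ns() {
    echo $(( 1000000000 / $(line_rate_pps "$1" "$2") ))
}

line_rate_pps 10 64          # minimum-size frames: 14880952
per_packet_budget_ns 10 64   # 67
line_rate_pps 10 1518        # full-size frames: 812743
```

With full-size frames the budget is more than a microsecond per packet, which is why small-packet workloads expose the single-core limit first.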

The problem compounds with stateful workloads. Firewalls, NAT, and connection tracking all keep per-flow state. Concentrating that work on one core serializes every lookup and forces the entire state table through a single core's caches.

Receive Side Scaling (RSS): Hardware-Level Distribution

RSS is the first line of defense and the most efficient mechanism for packet distribution, because the NIC does the work in hardware. A capable NIC maintains multiple hardware receive queues, each with its own MSI-X interrupt vector that can be pinned to a separate CPU core. The NIC hashes each incoming packet's headers (typically a Toeplitz hash over the 4-tuple: src IP, dst IP, src port, dst port), uses the low-order bits of the hash to index an indirection table, and places the packet into the queue that table entry names.
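Most RSS implementations do not map the hash to a queue directly. They go through an indirection table instead. The sketch below shows only that lookup step and does not compute a Toeplitz hash. It assumes a 128-entry table filled round-robin (entry i holds queue i modulo the queue count), which is a common default but not universal. The function name and sample hash values are illustrative.

```shell
# Map a 32-bit flow hash to a receive queue through an indirection table.
# Assumptions: 128-entry table, filled round-robin (entry i -> queue i % N).
rss_queue() {
    # $1 = flow hash (decimal or 0x-prefixed hex), $2 = number of RX queues
    rss_entry=$(( $1 & 127 ))      # low 7 bits select one of 128 entries
    echo $(( rss_entry % $2 ))     # round-robin table contents
}

rss_queue 0x323e8fc2 8   # entry 66 -> queue 2
rss_queue 0x323e8fc2 8   # same hash, same queue: 2
rss_queue 0x51ccc178 8   # entry 120 -> queue 0
```

On a real system, ethtool -x eth0 prints the actual table and hash key, and ethtool -X eth0 equal 8 rewrites the table to spread entries evenly over 8 queues.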

To verify RSS queue count and configure it:

ethtool -l eth0          # show current and max queue counts
ethtool -L eth0 combined 8   # set 8 combined RX/TX queues

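Pair RSS with IRQ affinity so each queue's interrupt is handled by its dedicated core. If irqbalance is running, stop it or exclude these IRQs first, otherwise it may rewrite the affinity you set: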

set_irq_affinity.sh eth0   # vendor script, or set /proc/irq/N/smp_affinity manually
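For the manual route, smp_affinity expects a hexadecimal CPU bitmask. The sketch below builds the mask for a single CPU and prints the resulting command rather than running it. The function names and IRQ number 42 are hypothetical. The real IRQ numbers for a NIC's queues are listed in /proc/interrupts. The mask arithmetic here only covers CPUs 0 to 31, since larger systems use comma-separated 32-bit groups.

```shell
# Hex affinity mask with only one CPU's bit set (valid here for CPUs 0-31).
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Print (not run) the command that pins an IRQ to a CPU.
pin_irq_cmd() {
    # $1 = IRQ number, $2 = CPU number
    echo "echo $(cpu_mask "$2") > /proc/irq/$1/smp_affinity"
}

cpu_mask 0         # 1
cpu_mask 5         # 20
pin_irq_cmd 42 3   # echo 8 > /proc/irq/42/smp_affinity
```

Writing the CPU number to /proc/irq/N/smp_affinity_list instead avoids the mask arithmetic altogether.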

With RSS properly configured, the NIC distributes flows deterministically: packets belonging to the same TCP connection always hash to the same queue, and therefore the same core, which preserves per-flow ordering without cross-core locking.

Receive Packet Steering (RPS): Software RSS for Older Hardware

Not every NIC supports multiple hardware queues. RPS replicates RSS behavior in software, in the kernel's softirq layer. The hardware interrupt still lands on one CPU, which computes a flow hash (or reuses one supplied by the NIC) and places the packet on the backlog queue of a target CPU chosen from a configurable bitmask.

# Enable RPS on CPUs 0-7 (hex mask ff) for eth0 receive queue 0; run as root
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

RPS adds inter-processor interrupt (IPI) overhead that RSS does not have, but it enables multi-core packet distribution on NICs with a single receive queue, or with fewer queues than cores. The bitmask also lets you exclude certain cores, for example to keep isolated cores running real-time threads free of softirq load.
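The value written to rps_cpus is a hexadecimal bitmask with one bit per CPU. A small helper can build it from a list of CPU numbers, which makes excluding cores less error-prone than hand-computing hex. This is a sketch with a made-up function name, limited to CPUs 0 to 31.

```shell
# Build a hex CPU mask from a list of CPU numbers (CPUs 0-31 in this sketch).
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0 1 2 3 4 5 6 7   # ff
cpus_to_mask 0 1 2 3 4 5       # 3f, leaves CPUs 6 and 7 out
```

On an 8-core host, writing 3f instead of ff to rps_cpus would keep CPUs 6 and 7 free of RPS softirq work.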

Receive Flow Steering (RFS) and XPS

RPS alone can cause cache misses: the softirq runs on core A, but the application reading the socket runs on core B. Receive Flow Steering (RFS) builds on the RPS mechanism and steers each flow's packets to the CPU where the application thread consuming that flow is running, so packet data and socket state stay warm in that core's caches.

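# Global socket flow table size (system-wide)
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# Flows tracked for eth0 receive queue 0; set this on every RX queue
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt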
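The two values are related. A common sizing rule is to give each receive queue an equal share of the global table, so 32768 entries over 16 queues yields 2048 per queue. The sketch below prints the full set of commands for an interface rather than running them. The function name is made up, and the even split is a rule of thumb, not a requirement.

```shell
# Print (not run) RFS settings: a global flow table plus an equal share per RX queue.
rfs_cmds() {
    # $1 = interface, $2 = rps_sock_flow_entries, $3 = number of RX queues
    echo "echo $2 > /proc/sys/net/core/rps_sock_flow_entries"
    rfs_q=0
    while [ "$rfs_q" -lt "$3" ]; do
        echo "echo $(( $2 / $3 )) > /sys/class/net/$1/queues/rx-$rfs_q/rps_flow_cnt"
        rfs_q=$(( rfs_q + 1 ))
    done
}

rfs_cmds eth0 32768 16
```

The output can be reviewed and then piped to a root shell. The kernel rounds both values up to a power of two.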

On the transmit side, Transmit Packet Steering (XPS) maps CPU cores to TX queues through each queue's xps_cpus mask. With a one-to-one mapping, each core enqueues to its own transmit queue, which reduces contention on the queue lock and keeps transmit completions on the sending core.
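XPS is configured per transmit queue through /sys/class/net/<dev>/queues/tx-N/xps_cpus, which takes the same kind of hex CPU mask as rps_cpus. The sketch below prints the commands for a one-to-one mapping (CPU n to queue tx-n). It assumes there are at least as many CPUs as TX queues and stays within CPUs 0 to 31. The function name is illustrative.

```shell
# Print (not run) a one-to-one XPS mapping: CPU n transmits on queue tx-n.
xps_cmds() {
    # $1 = interface, $2 = number of TX queues (one CPU per queue, CPUs 0-31)
    xps_q=0
    while [ "$xps_q" -lt "$2" ]; do
        printf 'echo %x > /sys/class/net/%s/queues/tx-%d/xps_cpus\n' \
            $(( 1 << xps_q )) "$1" "$xps_q"
        xps_q=$(( xps_q + 1 ))
    done
}

xps_cmds eth0 4
# echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
# echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
# echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
# echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus
```

When there are more cores than queues, each mask would carry several bits so that a group of cores shares one queue.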

Flow Hashing and ECMP for Load Balancer Deployments

At the load balancer tier, packet distribution extends beyond a single host. Equal-Cost Multi-Path (ECMP) routing hashes each flow at the router and uses the result to pick one of several equal-cost next hops, which spreads connections across multiple backend servers. The Linux kernel supports ECMP natively; ip route configures it with multiple nexthop entries on a single route.
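A multipath route is a single ip route entry with one nexthop clause per gateway. The sketch below assembles such a command from a list of gateways and prints it. The helper name is made up, and the addresses come from the documentation ranges and stand in for real backends.

```shell
# Build (not run) an "ip route" command with one equal-weight nexthop per gateway.
ecmp_route_cmd() {
    # $1 = destination prefix, remaining args = gateway addresses
    ecmp_cmd="ip route add $1"
    shift
    for gw in "$@"; do
        ecmp_cmd="$ecmp_cmd nexthop via $gw weight 1"
    done
    echo "$ecmp_cmd"
}

ecmp_route_cmd 203.0.113.0/24 192.0.2.1 192.0.2.2
# ip route add 203.0.113.0/24 nexthop via 192.0.2.1 weight 1 nexthop via 192.0.2.2 weight 1
```

By default the kernel hashes IPv4 multipath traffic on layer 3 addresses only. Setting net.ipv4.fib_multipath_hash_policy=1 switches to a layer 4 hash that includes ports, which spreads many connections between the same pair of hosts more evenly.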

For software load balancers running on multi-core hosts, XDP (eXpress Data Path) provides the fastest in-kernel path. An XDP program attached at the NIC driver hook can redirect packets to other CPUs or directly to user-space sockets using AF_XDP, all before the packet reaches the kernel's main networking stack. Frameworks such as PFQ occupy similar territory, offering programmable packet filtering and steering at high packet rates.

Tuning the Network Queue and Kernel Parameters

Hardware and steering configuration alone won't reach maximum throughput unless queue depths and kernel buffers are sized to match. Key parameters for high-throughput Linux networking:

sysctl -w net.core.netdev_max_backlog=10000
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"

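Increasing netdev_max_backlog prevents packet drops when a CPU cannot drain its backlog queue fast enough during traffic bursts. Socket buffer sizes bound TCP throughput on high-BDP (bandwidth-delay product) paths: a connection can keep at most one window of data in flight, so the buffer must hold at least bandwidth times round-trip time to fill the link. Note that rmem_max and wmem_max cap buffers requested explicitly with SO_RCVBUF and SO_SNDBUF, while tcp_rmem governs the kernel's automatic receive buffer tuning.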
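The bandwidth-delay product gives a quick lower bound for the maximum buffer size. The sketch below computes it in bytes. The function name and the sample link speeds and round-trip times are illustrative.

```shell
# Bandwidth-delay product in bytes: the data in flight needed to fill the path.
bdp_bytes() {
    # $1 = bandwidth in Gbit/s, $2 = round-trip time in ms
    echo $(( $1 * 1000000000 / 8 * $2 / 1000 ))
}

bdp_bytes 10 100   # 125000000
bdp_bytes 1 20     # 2500000
```

A 10 Gbps path with a 100 ms round trip needs about 125 MB in flight, which the 134217728-byte (128 MiB) maximum used above just covers. A 1 Gbps path at 20 ms needs only 2.5 MB.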

Putting It Together: A Practical Checklist

Effective multi-core packet distribution in Linux requires coordinating the hardware, kernel, and application layers. Start with RSS if your NIC supports it and fall back to RPS if it does not, pin IRQs to dedicated cores, enable RFS to keep packet processing close to the consuming socket, map transmit queues with XPS, and tune queue depths to absorb bursts. For kernel-bypass scenarios, evaluate XDP or AF_XDP to push throughput past what the regular kernel stack can sustain. Profiling with perf, sar -n DEV, and ethtool -S statistics will confirm where bottlenecks remain and guide further iteration.
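One more counter worth checking, though not listed above, is /proc/net/softnet_stat. It has one row per online CPU, and the second column is the hexadecimal count of packets dropped because that CPU's backlog queue was full. The sketch below decodes that column from standard input. The function name and sample rows are made up, and the row index is used as the CPU label, which assumes all CPUs are online.

```shell
# Print per-CPU backlog drops from /proc/net/softnet_stat-style input on stdin.
# Column 2 of each row is the hex count of packets dropped because the
# backlog queue (sized by netdev_max_backlog) was full.
softnet_drops() {
    sd_cpu=0
    while read -r sd_processed sd_dropped sd_rest; do
        echo "cpu$sd_cpu dropped=$(( 0x$sd_dropped ))"
        sd_cpu=$(( sd_cpu + 1 ))
    done
}

# Live use: softnet_drops < /proc/net/softnet_stat
printf '%s\n' '0001a2b3 00000000 00000005' '00002000 0000001f 00000000' | softnet_drops
# cpu0 dropped=0
# cpu1 dropped=31
```

A drop count that keeps growing on one CPU while others stay at zero suggests that flows are still concentrated on that core or that its backlog is too short for the bursts it sees.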