Zero Copy Packet Processing: Faster Networking in User Space

Why Traditional Packet Processing Becomes a Bottleneck

In conventional Linux networking, every packet received by a network interface card (NIC) travels through a well-worn path: the NIC writes the frame into a kernel buffer via DMA, the kernel wraps it in a socket buffer and runs it through the network stack, and the data is then copied into user-space memory, usually one system call at a time, before any application can inspect it. At 1 Gbps, this per-packet copy and system-call overhead is tolerable. At 10, 25, or 100 Gbps, it becomes a serious bottleneck: it consumes CPU cycles, pollutes caches, and adds microseconds of latency that compound at scale.

This is precisely the problem that zero copy networking solves. By eliminating redundant memory copies and reducing kernel-to-user transitions, modern frameworks allow applications to process millions of packets per second on commodity hardware without specialized ASICs.

What Zero Copy Networking Actually Means

The term "zero copy" is sometimes used loosely, but in the context of packet processing it has a precise meaning: packet data written by the NIC via DMA is made directly accessible to user-space code without any intermediate kernel copy. The application reads packet payloads from a shared memory region that both the kernel driver and the user-space process can access simultaneously.

This shared memory model requires careful synchronization, typically via ring buffers and memory-mapped I/O, but the payoff can be dramatic. A CPU core that might previously have spent 40–60% of its cycles moving bytes between memory regions can now dedicate most of that capacity to inspecting, filtering, and forwarding packets. Zero copy networking is therefore not just a micro-optimization; it is an architectural shift in how packet data flows through the system.
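
To make the synchronization concrete, here is a minimal sketch of a single-producer, single-consumer ring in C11. It is not taken from any particular driver: the names (ring_reserve, ring_commit, ring_peek, ring_release) and the slot sizes are assumptions chosen for illustration. The producer stands in for the kernel side and the consumer for the application. The point to notice is that the consumer receives a pointer into the shared slot rather than a copy of the frame.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 8u      /* power of two, so wrapping an index is a mask */
#define SLOT_BYTES 2048u   /* room for one Ethernet frame */

static_assert((SLOT_COUNT & (SLOT_COUNT - 1u)) == 0, "SLOT_COUNT must be a power of two");

struct slot {
    uint32_t len;               /* valid bytes in data[] */
    uint8_t  data[SLOT_BYTES];  /* frame bytes, written once by the producer */
};

struct ring {
    _Atomic uint32_t head;      /* count of slots published by the producer */
    _Atomic uint32_t tail;      /* count of slots released by the consumer */
    struct slot slots[SLOT_COUNT];
};

/* Producer: return the next free slot to fill in place, or NULL if the ring is full. */
static struct slot *ring_reserve(struct ring *r)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == SLOT_COUNT)
        return NULL;
    return &r->slots[head & (SLOT_COUNT - 1u)];
}

/* Producer: publish the slot obtained from ring_reserve(). */
static void ring_commit(struct ring *r)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
}

/* Consumer: borrow the oldest published slot in place, or NULL if the ring is empty. */
static const struct slot *ring_peek(struct ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return NULL;
    return &r->slots[tail & (SLOT_COUNT - 1u)];
}

/* Consumer: hand the borrowed slot back so the producer can reuse it. */
static void ring_release(struct ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
}
```

The release/acquire pairing on head and tail is what lets the two sides share memory without a lock: the consumer only sees a slot after the producer has finished writing it, and the producer only reuses a slot after the consumer has released it. A real NIC ring uses per-descriptor status bits rather than two counters, but the ownership hand-off follows the same idea.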

Kernel Bypass and MMAP-Based Ring Buffers

Two primary mechanisms underpin zero copy packet processing in Linux. The first is kernel bypass, where the NIC driver exposes packet data directly to user space, completely skipping the kernel network stack. Frameworks like DPDK (Data Plane Development Kit) and PF_RING (in its ZC variant) use this approach, mapping NIC descriptor rings into the application's virtual address space.

The second mechanism is the PACKET_MMAP interface built into the Linux kernel itself. By opening a raw socket and enabling the PACKET_RX_RING option, an application can memory-map a circular buffer that the kernel fills with incoming frames. The kernel still copies each frame into the ring once, but the application polls the ring without issuing a system call per packet, achieving near-zero copy semantics without leaving the kernel's networking model entirely.

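/* Ring geometry (example values): 64 blocks of 4096 B, two 2048 B frames per block. */
struct tpacket_req req = { .tp_block_size = 4096, .tp_block_nr = 64, .tp_frame_size = 2048, .tp_frame_nr = 128 };
int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));       /* requires CAP_NET_RAW */
setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
size_t size = (size_t)req.tp_block_size * req.tp_block_nr;    /* total ring size in bytes */
void *ring = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);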
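
Once the ring is mapped, the application walks it frame by frame. The following is a sketch of that polling step for the default TPACKET_V1 layout, under two assumptions: the block size is a multiple of the frame size (so frames sit back to back in the mapping), and frame_size and frame_nr match the tp_frame_size and tp_frame_nr values passed in req. The helper names drain_rx_ring, frame_fn, and count_bytes are hypothetical and not part of any kernel API; struct tpacket_hdr and the TP_STATUS_* flags come from <linux/if_packet.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>   /* struct tpacket_hdr, TP_STATUS_USER, TP_STATUS_KERNEL */

/* Hypothetical callback type: receives a pointer into the ring, not a copy. */
typedef void (*frame_fn)(const uint8_t *frame, unsigned int len, void *ctx);

/* Example callback: add up captured bytes; ctx points at a uint64_t total. */
static void count_bytes(const uint8_t *frame, unsigned int len, void *ctx)
{
    (void)frame;
    *(uint64_t *)ctx += len;
}

/*
 * Walk a TPACKET_V1 RX ring starting at *next and pass every ready frame to fn.
 * Stops at the first frame the kernel still owns. Returns the number of frames
 * consumed and leaves *next at the frame to check on the following call.
 */
static unsigned int drain_rx_ring(void *ring, unsigned int frame_size,
                                  unsigned int frame_nr, unsigned int *next,
                                  frame_fn fn, void *ctx)
{
    unsigned int seen = 0;
    assert(frame_size > 0 && frame_nr > 0 && *next < frame_nr);

    while (seen < frame_nr) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)((uint8_t *)ring + (size_t)*next * frame_size);

        /* The kernel sets TP_STATUS_USER once the frame is ready to read. */
        if (!(__atomic_load_n(&hdr->tp_status, __ATOMIC_ACQUIRE) & TP_STATUS_USER))
            break;

        /* tp_mac is the offset from the frame header to the link-layer header. */
        fn((const uint8_t *)hdr + hdr->tp_mac, hdr->tp_snaplen, ctx);

        /* Give the slot back to the kernel so it can be refilled. */
        __atomic_store_n(&hdr->tp_status, TP_STATUS_KERNEL, __ATOMIC_RELEASE);
        *next = (*next + 1u) % frame_nr;
        seen++;
    }
    return seen;
}
```

In a capture loop, a return value of zero would normally be followed by a poll() on the socket descriptor, so the process sleeps until the kernel marks another frame ready instead of spinning.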

This pattern is the foundation of tools like PFQ — pfq.io's own packet filtering and network queue framework — which extends PACKET_MMAP semantics with multi-queue fan-out, in-kernel functional composition, and hardware timestamping support.

PFQ: Zero Copy Networking for Linux at Scale

PFQ is a Linux kernel module designed specifically for high-speed packet capture and in-kernel processing. It exposes a socket API that user-space applications use to receive packets from multiple hardware queues simultaneously, with each packet appearing exactly once in a per-socket memory-mapped ring — a true zero copy networking implementation.

What distinguishes PFQ from raw PACKET_MMAP is its functional engine: a small domain-specific language (pfq-lang) that lets operators compose packet filtering, steering, and transformation logic inside the kernel. Decisions about which packets reach which user-space consumer are made before data ever crosses the kernel-user boundary, minimizing unnecessary copies even further.
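
The idea of composing filtering and steering stages can be illustrated in plain C. To be clear, this is not pfq-lang and does not use any PFQ API: the struct pkt fields, the stage_fn type, and the stage names are all invented for this sketch. It only shows the general shape of a chain in which each stage may reject a packet or pick a consumer before any data is delivered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal parsed-packet view; the fields are illustrative only. */
struct pkt {
    uint8_t  ip_proto;   /* 6 = TCP, 17 = UDP */
    uint16_t dst_port;   /* host byte order */
    uint32_t src_ip;     /* host byte order */
};

/* A stage returns true to keep the packet and may set *consumer to steer it. */
typedef bool (*stage_fn)(const struct pkt *p, unsigned int *consumer);

static bool is_udp(const struct pkt *p, unsigned int *consumer)
{
    (void)consumer;
    return p->ip_proto == 17;
}

static bool is_dns(const struct pkt *p, unsigned int *consumer)
{
    (void)consumer;
    return p->dst_port == 53;
}

/* Steer by source address so one client always reaches the same consumer. */
static bool steer_by_src(const struct pkt *p, unsigned int *consumer)
{
    *consumer = p->src_ip % 4u;   /* 4 consumers, an arbitrary choice */
    return true;
}

/* Run the stages left to right; the first stage that rejects the packet ends the chain. */
static bool run_chain(const stage_fn *stages, size_t n,
                      const struct pkt *p, unsigned int *consumer)
{
    assert(stages != NULL && p != NULL && consumer != NULL);
    for (size_t i = 0; i < n; i++)
        if (!stages[i](p, consumer))
            return false;
    return true;
}
```

With a chain of { is_udp, is_dns, steer_by_src }, a DNS query is kept and assigned a consumer, while a TCP packet is rejected at the first stage and never touches a consumer's ring. Ordering cheap predicates first keeps the cost per dropped packet low.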

PFQ also integrates with network queues exposed by modern multi-queue NICs, distributing packet streams across CPU cores via RSS (Receive Side Scaling) while preserving flow affinity. This makes it well suited to intrusion detection systems, traffic analyzers, and network monitoring probes that must sustain line-rate capture without dropping frames.
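
Flow affinity comes from hashing the fields that identify a flow and using the result to pick a queue. Hardware RSS typically uses a keyed Toeplitz hash and an indirection table; the sketch below swaps in a much simpler FNV-1a hash and a plain modulo, purely to show why packets of one flow always land on the same queue. The struct flow layout and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* The classic 5-tuple that identifies a flow. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Feed the low nbytes bytes of v into an FNV-1a hash, least significant byte first. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, unsigned int nbytes)
{
    for (unsigned int i = 0; i < nbytes; i++) {
        h ^= (v >> (8u * i)) & 0xffu;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Hash the 5-tuple field by field so struct padding never enters the hash. */
static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = 2166136261u;       /* FNV offset basis */
    h = fnv1a_mix(h, f->src_ip, 4);
    h = fnv1a_mix(h, f->dst_ip, 4);
    h = fnv1a_mix(h, f->src_port, 2);
    h = fnv1a_mix(h, f->dst_port, 2);
    h = fnv1a_mix(h, f->proto, 1);
    return h;
}

/* Map a flow to one of nqueues receive queues. */
static unsigned int flow_to_queue(const struct flow *f, unsigned int nqueues)
{
    assert(nqueues > 0);
    return flow_hash(f) % nqueues;
}
```

Because the queue depends only on the 5-tuple, every packet of a connection is handled by the same core, which keeps per-flow state local to that core's cache. Note that this simple hash is not symmetric: the two directions of a connection may map to different queues unless the endpoints are put into a canonical order first.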

Measuring the Performance Gains

Published benchmarks generally show zero copy networking frameworks outperforming traditional recv()-based capture by roughly an order of magnitude. On a 10 GbE interface with 64-byte packets (the hardest case, at ~14.88 Mpps), a standard AF_PACKET socket without MMAP typically saturates a CPU core at around 1–2 Mpps. PFQ and DPDK-based implementations routinely sustain 10–14 Mpps per core under the same conditions.
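
The 64-byte line-rate figure follows from Ethernet framing: each frame occupies an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), so a minimum-size frame costs 84 bytes, or 672 bits. A small sketch of the arithmetic, with helper names chosen for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire: 7 B preamble + 1 B SFD + 12 B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Maximum whole frames per second for a link speed and frame size (FCS included). */
static uint64_t max_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    assert(frame_bytes >= 64);   /* minimum Ethernet frame size */
    return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u);
}

/* Time available to process one frame, in nanoseconds, at line rate. */
static double ns_per_frame(uint64_t link_bps, uint32_t frame_bytes)
{
    return (double)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8.0 * 1e9 / (double)link_bps;
}
```

For a 10 Gbps link, max_pps(10000000000ULL, 64) gives 14,880,952 frames per second, which leaves about 67.2 ns per frame. At 1518-byte frames the same link carries only 812,743 frames per second, which is why small-packet tests are the demanding case.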

Latency improvements are equally significant. Eliminating the copy path reduces per-packet processing latency from tens of microseconds to low single-digit microseconds, which matters enormously for latency-sensitive applications like financial trading infrastructure, real-time DDoS mitigation, and 5G user-plane functions.

Choosing the Right Zero Copy Approach

The right framework depends on your constraints. If you need portability and are already using the Linux kernel network stack for some traffic, PACKET_MMAP with a library like libpcap (in memory-mapped mode) is the lowest-friction entry point. If you need maximum throughput and can dedicate CPU cores and NICs to data-plane work, DPDK or PFQ offer superior performance at the cost of higher integration complexity.

For monitoring and packet filtering workloads specifically, PFQ's in-kernel composition model offers an elegant middle ground: the kernel does the heavy lifting of steering and filtering, and user-space receives only the packets it actually needs — a natural extension of the zero copy networking philosophy to the filtering layer itself.

Getting Started with Zero Copy Packet Capture

To experiment with zero copy networking on Linux, start by enabling PACKET_MMAP on a raw socket and measuring your baseline capture rate with a tool like pktgen as a traffic source. Then explore PFQ by cloning the repository, building the kernel module, and running the included pfq-capture example against a live interface. The performance difference will be immediately apparent — and the architectural clarity of shared-ring packet delivery makes it an excellent foundation for building production-grade network tools.
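
For the baseline measurement, a capture loop only needs a packet counter and two clock readings. The helpers below are a sketch with hypothetical names (elapsed_sec, rate_pps); they assume the timestamps were taken with clock_gettime(CLOCK_MONOTONIC, ...) so that wall-clock adjustments do not skew the result.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Elapsed seconds between two monotonic clock readings. */
static double elapsed_sec(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Average packets per second over the interval; returns 0 if no time has passed. */
static double rate_pps(uint64_t packets, struct timespec start, struct timespec end)
{
    assert(start.tv_nsec >= 0 && end.tv_nsec >= 0);
    double secs = elapsed_sec(start, end);
    return secs > 0.0 ? (double)packets / secs : 0.0;
}
```

Taking one reading before the capture loop and one after, then comparing the result against the sender's reported rate, gives a rough view of how many frames the receiver dropped. Measuring over several seconds smooths out start-up effects.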
