What is Precision I/O doing?

Constantine Sapuntzakis
June 13, 2004

Disclaimer: I have not talked with Precision I/O folks at all. I have just read their web site and various trade articles about them.

Precision I/O promises to improve the performance of the TCP/IP/Ethernet stack in servers, without resorting to new wire protocols or complicated offload engines. The company has some people with excellent achievements, including Chief Science Officer Van Jacobson, of TCP congestion control and header prediction fame, Chief Technology Officer Bob Felderman, one of the developers of Myrinet, and Judy Estrin, founder of several successful start-ups.

Why decrease the overhead of the TCP/IP/Ethernet stack? Intelligently reducing the overhead increases the number of cycles available for the application, presumably increasing the number of clients that can be served. Precision I/O says as much on their home page.

From Precision I/O's web page and interviews (e.g. eWeek, April 5, 2004, "Startups Aim to Free Ethernet Packet Jams"), it looks like they see operating system overheads as the primary problem in TCP/IP/Ethernet. So, it seems that Precision I/O is going to do TCP/IP/Ethernet without getting the OS involved in the fast path.

== Packet processing today

This section lays out how receiving data from the network works with today's operating systems and applications. The essential insight is this: there are potentially many context switches between the application and kernel.

The receive path today looks something like this:

* Application calls read or recv. A system call requires a context switch into the kernel. If there is some data on the queue, return it to the application. Otherwise, sleep waiting for data. Busy network servers should always have some data in the queue and, as such, should not be sleeping too much waiting for data.
* Network interface card (NIC) receives a packet. It appends the packet to a queue in main memory that the NIC's device driver set up.
After crossing a high watermark or holding packets in the queue until a certain timeout, it asserts an interrupt to tell the device driver it might be a good time to check the queue.
* The processor switches from whatever it was doing (potentially running userland applications) to kernel interrupt context. It runs the NIC device driver's interrupt handler. The NIC's handler takes the packets and puts them on some kernel Ethernet packet queues. It then schedules a soft interrupt (or deferred procedure call) for further protocol processing.
* As the system is returning to userland, the soft interrupt runs, doing IP and TCP protocol processing. Eventually, data gets appended to some higher-level protocol queue. Processes waiting on that queue are woken up.
* The soft interrupt terminates. Before the kernel returns to userland, the scheduler is run to see whether any woken-up processes deserve to run ahead of the current process.
* The process that was sleeping on the queue gets to run. The process wakes up, looks for data in the queue, and, if there is some, copies it into the application buffers passed to read/recv.

== Speeding Ethernet/IP/TCP up

There are some half-measures that can be taken to improve performance. A new system call could be introduced to enable the application to read or write multiple packets at once. The TCP/IP code could be streamlined for the common case.

A more radical way of decreasing overhead is to get rid of the need to context switch into the kernel, by processing network packets entirely in userland and disabling interrupts.

So, how would one move TCP/IP network processing to userland? One would need:

* a packet queue for each application, or at least each accelerated application, mapped into the application's address space
* a network adapter which can do matches on TCP/IP packet headers and route packets to the appropriate application's queues.
Note that this offloads the packet matching from the processor.
* a device driver that allocates queues to userland processes, allows them to map and lock the queues in memory, and allows clients to register interest in types of packets
* a TCP/IP library linked into the application that does the TCP/IP processing the kernel used to do. This can even be done without changing the application by 1) having the library implement the libc socket calls, and 2) telling the dynamic linker to load the TCP/IP library between the application and libc.
* the network adapter should also filter outgoing packets to ensure that the process is not sending arbitrary Ethernet packets, and to shape and police the traffic coming from the application. This would help maintain the current OS security model.
* the packet receive and transmit queues could be designed to support multiple threads or processes reading or writing the queue. Using non-blocking synchronization techniques would help ensure progress even in cases of sudden process failure. This probably only really makes sense for servers where requests are a single packet (like DNS).

What other techniques could Precision I/O be using?

* Interrupts are expensive, requiring context switches to kernel mode and, often, interrupt enable/disable code. But if your server's main job is talking to the network, you can disable interrupts. By the time the server is done processing one packet, chances are another one has arrived. With userland queues, the application can just poll the queue. The application notices as soon as the packet arrives, which means low latency.
* Replace the sockets interface to the application with a lower-level interface. Some applications (though not kernel NFS and iSCSI) read into circular buffers and then do copies of bulk data from there into application data structures.

Potential benefits:

* Better instruction cache performance. No need to run interrupt handlers and demuxing code.
No code needed in the TCP/IP stack to defend against malicious or badly written local applications.
* Different TCP stacks optimized for different applications. You probably don't need SACK and timestamps for the datacenter.
* More easily deploy new protocols and improvements to current protocols. It's easier to update a library than the kernel. And you don't necessarily need to upgrade all your applications at once.

== No need for direct data placement, RDMA, or offload engines?

Even with the techniques described above, there is still a copy from packet buffers to application buffers (e.g. Ethernet packets into buffer cache pages for NFS). But if your application accesses storage using read or write instead of mmap or direct I/O, you already have a copy in your path to storage. If you put the NFS/iSCSI client at user level too, you can potentially do a single copy from the NIC queue into the application buffer, avoiding the kernel buffer cache. As a result, IP/Ethernet storage should be no slower for that application than Fibre Channel storage.

For applications that receive large amounts of data using direct I/O or mmap, the copy may impact performance. Suppression of that copy using direct data placement, RDMA, or offload engines should help performance.

== Some other notes

* Can't easily do zero-copy on transmit from userland. Userland applications can't really do scatter-gather because they don't know physical addresses and can't pin pages. So, you still have to go into the kernel to do zero-copy transmits. However, by providing a userland transmit buffer, Precision I/O can potentially provide lower latencies for applications that pass small messages. They just copy their data into the transmit buffer, avoiding the need to context switch and run the kernel stack.
* Do applications require you to implement select/poll/kqueue in its full generality?
* If an application pauses for a long time before servicing the packet queue, what effect does that have on TCP performance? Do you have timers to try to ensure that the TCP queue is periodically checked? Even with timers, the thread could be suspended due to a page fault. But if your high-performance server has lots of slow page faults, you may have other problems.
* Apache-style applications have potentially hundreds of worker threads or processes, all sleeping in accept, waiting for incoming TCP connections. Do you give each worker its own queue? If so, how does the system determine which queue gets which connections? If the adapter randomly chooses a queue and associates flows with it, how does it deal with SYN attacks? If a software thread is in charge of deciding which thread gets which TCP flows with the adapter, how does it do so with low overhead for small flows, and how does it forward the SYN packet/information to the relevant worker thread?

Revised: June 20, 2004: Elucidate. Add direct I/O. Mention easier to deploy new protocols.
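
Illustration: the userland receive path described under "Speeding Ethernet/IP/TCP up" can be modeled in a few lines of Python. This is only a toy sketch of the idea, not anything Precision I/O has published. Every name in it (FlowKey, RxRing, Adapter, drain) is made up for illustration, the "adapter" is ordinary software standing in for header matching done on the NIC, and sending unmatched packets to a kernel queue is my assumption about how non-accelerated traffic would be handled.

```python
from collections import namedtuple

# Hypothetical names throughout; none come from Precision I/O.
FlowKey = namedtuple("FlowKey", "src_ip src_port dst_ip dst_port")
Packet = namedtuple("Packet", "key payload")


class RxRing:
    """Fixed-size single-producer/single-consumer ring.

    Stands in for a receive queue mapped into the application's
    address space. The adapter only advances `head` and the
    application only advances `tail`, so neither side takes a lock.
    """

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # count of packets written by the adapter
        self.tail = 0  # count of packets read by the application

    def push(self, pkt):
        """Adapter side: returns False (packet dropped) if the ring is full."""
        if self.head - self.tail == len(self.slots):
            return False
        self.slots[self.head % len(self.slots)] = pkt
        self.head += 1
        return True

    def poll(self):
        """Application side: returns a packet or None; never blocks."""
        if self.tail == self.head:
            return None
        pkt = self.slots[self.tail % len(self.slots)]
        self.tail += 1
        return pkt


class Adapter:
    """Models the header-matching step that the text puts on the NIC."""

    def __init__(self):
        self.flows = {}         # FlowKey -> RxRing, filled in via register()
        self.kernel_queue = []  # assumption: unmatched packets take the normal path

    def register(self, key, ring):
        """Driver side: an application registers interest in a flow."""
        self.flows[key] = ring

    def receive(self, pkt):
        ring = self.flows.get(pkt.key)
        if ring is None:
            self.kernel_queue.append(pkt)
        else:
            ring.push(pkt)  # if the ring is full the packet is simply dropped


def drain(ring):
    """Application poll loop: consume everything currently queued."""
    payloads = []
    while (pkt := ring.poll()) is not None:
        payloads.append(pkt.payload)
    return payloads
```

An application would register a flow with the adapter once, then call drain (or poll) from its main loop instead of calling recv. No interrupt or system call appears anywhere on that path, which is the whole point. A real design would also need memory barriers around the head and tail updates, which this model ignores.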