.\"
.\" Copyright (c) 2002 Kenneth D. Merry.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions, and the following disclaimer,
.\"    without modification, immediately at the beginning of the file.
.\" 2. The name of the author may not be used to endorse or promote products
.\"    derived from this software without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
.\" ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd December 5, 2004
.Dt ZERO_COPY 9
.Os
.Sh NAME
.Nm zero_copy ,
.Nm zero_copy_sockets
.Nd "zero copy sockets code"
.Sh SYNOPSIS
.Cd "options ZERO_COPY_SOCKETS"
.Sh DESCRIPTION
The
.Fx
kernel includes a facility for eliminating data copies on
socket reads and writes.
.Pp
This code is collectively known as the zero copy sockets code, because during
normal network I/O, data will not be copied by the CPU at all.
Rather it
will be DMAed from the user's buffer to the NIC (for sends), or DMAed from
the NIC to a buffer that will then be given to the user (receives).
.Pp
The zero copy sockets code uses the standard socket read and write
semantics, and therefore has some limitations and restrictions that
programmers should be aware of when trying to take advantage of this
functionality.
.Pp
For sending data, there are no special requirements or capabilities that
the sending NIC must have.
The data written to the socket, though, must be
at least a page in size and page aligned in order to be mapped into the
kernel.
If it does not meet the page size and alignment constraints, it
will be copied into the kernel, as is normally the case with socket I/O.
.Pp
The user should be careful not to overwrite buffers that have been written
to the socket before the data has been freed by the kernel, and the
copy-on-write mapping cleared.
If a buffer is overwritten before it has
been given up by the kernel, the data will be copied, and no savings in CPU
utilization and memory bandwidth utilization will be realized.
.Pp
The
.Xr socket 2
API gives the user no indication of when the data has
actually been sent over the wire, or when it has been freed from
kernel buffers.
For protocols like TCP, the data will be kept around in
the kernel until it has been acknowledged by the other side; it must be
kept until the acknowledgement is received in case retransmission is required.
.Pp
From an application standpoint, the best way to guarantee that the data has
been sent out over the wire and freed by the kernel (for TCP-based sockets)
is to set a socket buffer size (see the
.Dv SO_SNDBUF
socket option in the
.Xr setsockopt 2
manual page) appropriate for the application and network environment and then
make sure you have sent out twice as much data as the socket buffer size
before reusing a buffer.
For TCP, the send and receive socket buffer sizes
generally correspond directly to the TCP window size.
.Pp
For receiving data, in order to take advantage of the zero copy receive
code, the user must have a NIC that is configured for an MTU greater than
the architecture page size.
(For example, the page size on i386 is 4KB.)
Additionally, in order for zero copy receive to work,
packet payloads must be at least a page in size and page aligned.
.Pp
Achieving page aligned payloads requires a NIC that can split an incoming
packet into multiple buffers.
It also generally requires some sort of
intelligence on the NIC to make sure that the payload starts in its own
buffer.
This is called
.Dq "header splitting" .
Currently the only NICs with
support for header splitting are Alteon Tigon 2 based boards running
slightly modified firmware.
The
.Fx
.Xr ti 4
driver includes modified firmware for Tigon 2 boards only.
Header
splitting code can be written, however, for any NIC that allows putting
received packets into multiple buffers and that has enough programmability
to determine that the header should go into one buffer and the payload into
another.
.Pp
You can also do a form of header splitting that does not require any NIC
modifications if your NIC is at least capable of splitting packets into
multiple buffers.
This requires that you optimize the NIC driver for your
most common packet header size.
If that size (Ethernet + IP + TCP headers)
is generally 66 bytes, for instance, you would set the first buffer in a
set for a particular packet to be 66 bytes long, and then subsequent
buffers would be a page in size.
For packets that have headers that are
exactly 66 bytes long, your payload will be page aligned.
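.Pp
As a rough illustration of the fixed-size header buffer scheme described
above, the following C sketch checks whether the payload of a packet would
begin in its own page-sized buffer.
The helper name, the macro names, and the breakdown of the 66 bytes
(a 14-byte Ethernet header, a 20-byte IPv4 header without options, and a
32-byte TCP header carrying 12 bytes of options) are assumptions made for
this example and are not taken from any driver.
.Bd -literal -offset indent
```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumed header sizes, for illustration only: Ethernet (14), IPv4 without
 * options (20), and TCP with 12 bytes of options (32).
 */
#define EX_ETHER_HDR_LEN	14
#define EX_IPV4_HDR_LEN		20
#define EX_TCP_HDR_LEN		32
#define EX_FIRST_BUF_LEN \
	(EX_ETHER_HDR_LEN + EX_IPV4_HDR_LEN + EX_TCP_HDR_LEN)

/*
 * Hypothetical helper: the first receive buffer holds first_buf_len bytes
 * and every later buffer is one page.  The payload starts at the beginning
 * of the second buffer, and is therefore page aligned, only when the
 * headers fill the first buffer exactly.
 */
bool
payload_is_page_aligned(size_t hdr_len, size_t first_buf_len)
{
	return (hdr_len == first_buf_len);
}
```
.Ed
.Pp
Under these assumptions, a packet whose TCP header carries no options has
54 bytes of headers, so the first 12 bytes of its payload would land in the
66-byte header buffer and the payload would not be page aligned.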
.Pp
The other requirement for zero copy receive to work is that the buffer that
is the destination for the data read from a socket must be at least a page
in size and page aligned.
.Pp
Obviously the requirements for receive side zero copy are impossible to
meet without NIC hardware that is programmable enough to do header
splitting of some sort.
Since most NICs are not that programmable, or their
manufacturers will not share the source code to their firmware, this approach
to zero copy receive is not widely useful.
.Pp
There are other approaches, such as RDMA and TCP Offload, that may
potentially help alleviate the CPU overhead associated with copying data
out of the kernel.
Most known techniques require some sort of support at
the NIC level to work, and describing such techniques is beyond the scope
of this manual page.
.Pp
The zero copy send and zero copy receive code can be individually turned
off via the
.Va kern.ipc.zero_copy.send
and
.Va kern.ipc.zero_copy.receive
.Xr sysctl 8
variables respectively.
.Sh SEE ALSO
.Xr sendfile 2 ,
.Xr socket 2 ,
.Xr ti 4
.Sh HISTORY
The zero copy sockets code first appeared in
.Fx 5.0 ,
although it has
been in existence in patch form since at least mid-1999.
.Sh AUTHORS
.An -nosplit
The zero copy sockets code was originally written by
.An Andrew Gallatin Aq gallatin@FreeBSD.org
and substantially modified and updated by
.An Kenneth Merry Aq ken@FreeBSD.org .