/share/man/man4/bpf.4
- .\" Copyright (c) 2007 Seccuris Inc.
- .\" All rights reserved.
- .\"
- .\" This software was developed by Robert N. M. Watson under contract to
- .\" Seccuris Inc.
- .\"
- .\" Redistribution and use in source and binary forms, with or without
- .\" modification, are permitted provided that the following conditions
- .\" are met:
- .\" 1. Redistributions of source code must retain the above copyright
- .\" notice, this list of conditions and the following disclaimer.
- .\" 2. Redistributions in binary form must reproduce the above copyright
- .\" notice, this list of conditions and the following disclaimer in the
- .\" documentation and/or other materials provided with the distribution.
- .\"
- .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
- .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- .\" SUCH DAMAGE.
- .\"
- .\" Copyright (c) 1990 The Regents of the University of California.
- .\" All rights reserved.
- .\"
- .\" Redistribution and use in source and binary forms, with or without
- .\" modification, are permitted provided that: (1) source code distributions
- .\" retain the above copyright notice and this paragraph in its entirety, (2)
- .\" distributions including binary code include the above copyright notice and
- .\" this paragraph in its entirety in the documentation or other materials
- .\" provided with the distribution, and (3) all advertising materials mentioning
- .\" features or use of this software display the following acknowledgement:
- .\" ``This product includes software developed by the University of California,
- .\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
- .\" the University nor the names of its contributors may be used to endorse
- .\" or promote products derived from this software without specific prior
- .\" written permission.
- .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
- .\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
- .\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
- .\"
- .\" This document is derived in part from the enet man page (enet.4)
- .\" distributed with 4.3BSD Unix.
- .\"
- .\" $FreeBSD$
- .\"
- .Dd June 15, 2010
- .Dt BPF 4
- .Os
- .Sh NAME
- .Nm bpf
- .Nd Berkeley Packet Filter
- .Sh SYNOPSIS
- .Cd device bpf
- .Sh DESCRIPTION
- The Berkeley Packet Filter
- provides a raw interface to data link layers in a protocol
- independent fashion.
- All packets on the network, even those destined for other hosts,
- are accessible through this mechanism.
- .Pp
- The packet filter appears as a character special device,
- .Pa /dev/bpf .
- After opening the device, the file descriptor must be bound to a
- specific network interface with the
- .Dv BIOCSETIF
- ioctl.
- A given interface can be shared by multiple listeners, and the filter
- underlying each descriptor will see an identical packet stream.
- .Pp
- A separate device file is required for each minor device.
- If a file is in use, the open will fail and
- .Va errno
- will be set to
- .Er EBUSY .
- .Pp
- Associated with each open instance of a
- .Nm
- file is a user-settable packet filter.
- Whenever a packet is received by an interface,
- all file descriptors listening on that interface apply their filter.
- Each descriptor that accepts the packet receives its own copy.
- .Pp
- The packet filter will support any link level protocol that has fixed length
- headers.
- Currently, only Ethernet,
- .Tn SLIP ,
- and
- .Tn PPP
- drivers have been modified to interact with
- .Nm .
- .Pp
- Since packet data is in network byte order, applications should use the
- .Xr byteorder 3
- macros to extract multi-byte values.
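As an illustration of the byte order advice above, the following minimal sketch reads the 16-bit type field of an Ethernet frame. The helper name sketch_ethertype and the constant SKETCH_ETHERTYPE_OFFSET are hypothetical, and the offset of 12 assumes an untagged Ethernet header (two 6-byte addresses followed by the type field).

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: untagged Ethernet, type field after two 6-byte addresses. */
#define SKETCH_ETHERTYPE_OFFSET 12

/*
 * Return the EtherType in host byte order, or 0 if the frame is too short.
 * memcpy() avoids an unaligned 16-bit load; ntohs() converts from network
 * byte order.
 */
static uint16_t
sketch_ethertype(const unsigned char *frame, size_t len)
{
	uint16_t raw;

	assert(frame != NULL);
	if (len < SKETCH_ETHERTYPE_OFFSET + sizeof(raw))
		return (0);
	memcpy(&raw, frame + SKETCH_ETHERTYPE_OFFSET, sizeof(raw));
	return (ntohs(raw));
}
```

Copying the bytes out before converting keeps the helper safe on alignment sensitive machines.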
- .Pp
- A packet can be sent out on the network by writing to a
- .Nm
- file descriptor.
- The writes are unbuffered, meaning only one packet can be processed per write.
- Currently, only writes to Ethernets and
- .Tn SLIP
- links are supported.
- .Sh BUFFER MODES
- .Nm
- devices deliver packet data to the application via memory buffers provided by
- the application.
- The buffer mode is set using the
- .Dv BIOCSETBUFMODE
- ioctl, and read using the
- .Dv BIOCGETBUFMODE
- ioctl.
- .Ss Buffered read mode
- By default,
- .Nm
- devices operate in the
- .Dv BPF_BUFMODE_BUFFER
- mode, in which packet data is copied explicitly from kernel to user memory
- using the
- .Xr read 2
- system call.
- The user process will declare a fixed buffer size that will be used both for
- sizing internal buffers and for all
- .Xr read 2
- operations on the file.
- This size is queried using the
- .Dv BIOCGBLEN
- ioctl, and is set using the
- .Dv BIOCSBLEN
- ioctl.
- Note that an individual packet larger than the buffer size is necessarily
- truncated.
- .Ss Zero-copy buffer mode
- .Nm
- devices may also operate in the
- .Dv BPF_BUFMODE_ZBUF
- mode, in which packet data is written directly into two user memory buffers
- by the kernel, avoiding both system call and copying overhead.
- Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of
- the page size.
- The maximum zero-copy buffer size is returned by the
- .Dv BIOCGETZMAX
- ioctl.
- Note that an individual packet larger than the buffer size is necessarily
- truncated.
- .Pp
- The user process registers two memory buffers using the
- .Dv BIOCSETZBUF
- ioctl, which accepts a
- .Vt struct bpf_zbuf
- pointer as an argument:
- .Bd -literal
- struct bpf_zbuf {
- void *bz_bufa;
- void *bz_bufb;
- size_t bz_buflen;
- };
- .Ed
- .Pp
- .Vt bz_bufa
- is a pointer to the userspace address of the first buffer that will be
- filled, and
- .Vt bz_bufb
- is a pointer to the second buffer.
- .Nm
- will then cycle between the two buffers as they fill and are acknowledged.
- .Pp
- Each buffer begins with a fixed-length header to hold synchronization and
- data length information for the buffer:
- .Bd -literal
- struct bpf_zbuf_header {
- volatile u_int bzh_kernel_gen; /* Kernel generation number. */
- volatile u_int bzh_kernel_len; /* Length of data in the buffer. */
- volatile u_int bzh_user_gen; /* User generation number. */
- /* ...padding for future use... */
- };
- .Ed
- .Pp
- The header structure of each buffer, including all padding, should be zeroed
- before it is configured using
- .Dv BIOCSETZBUF .
- Remaining space in the buffer will be used by the kernel to store packet
- data, laid out in the same format as with buffered read mode.
- .Pp
- The kernel and the user process follow a simple acknowledgement protocol via
- the buffer header to synchronize access to the buffer: when the header
- generation numbers,
- .Vt bzh_kernel_gen
- and
- .Vt bzh_user_gen ,
- hold the same value, the kernel owns the buffer, and when they differ,
- userspace owns the buffer.
- .Pp
- While the kernel owns the buffer, the contents are unstable and may change
- asynchronously; while the user process owns the buffer, its contents are
- stable and will not be changed until the buffer has been acknowledged.
- .Pp
- Initializing the buffer headers to all 0's before registering the buffer has
- the effect of assigning initial ownership of both buffers to the kernel.
- The kernel signals that a buffer has been assigned to userspace by modifying
- .Vt bzh_kernel_gen ,
- and userspace acknowledges the buffer and returns it to the kernel by setting
- the value of
- .Vt bzh_user_gen
- to the value of
- .Vt bzh_kernel_gen .
- .Pp
- In order to avoid caching and memory re-ordering effects, the user process
- must use atomic operations and memory barriers when checking for and
- acknowledging buffers:
- .Bd -literal
- #include <machine/atomic.h>
- /*
- * Return ownership of a buffer to the kernel for reuse.
- */
- static void
- buffer_acknowledge(struct bpf_zbuf_header *bzh)
- {
- atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
- }
- /*
- * Check whether a buffer has been assigned to userspace by the kernel.
- * Return true if userspace owns the buffer, and false otherwise.
- */
- static int
- buffer_check(struct bpf_zbuf_header *bzh)
- {
- return (bzh->bzh_user_gen !=
- atomic_load_acq_int(&bzh->bzh_kernel_gen));
- }
- .Ed
- .Pp
- The user process may force the assignment of the next buffer, if any data
- is pending, to userspace using the
- .Dv BIOCROTZBUF
- ioctl.
- This allows the user process to retrieve data in a partially filled buffer
- before the buffer is full, such as following a timeout; the process must
- recheck for buffer ownership using the header generation numbers, as the
- buffer will not be assigned to userspace if no data was present.
- .Pp
- As in the buffered read mode,
- .Xr kqueue 2 ,
- .Xr poll 2 ,
- and
- .Xr select 2
- may be used to sleep awaiting the availability of a completed buffer.
- They will return a readable file descriptor when ownership of the next buffer
- is assigned to user space.
- .Pp
- In the current implementation, the kernel may assign zero, one, or both
- buffers to the user process; however, an earlier implementation maintained
- the invariant that at most one buffer could be assigned to the user process
- at a time.
- In order to both ensure progress and high performance, user processes should
- acknowledge a completely processed buffer as quickly as possible, returning
- it for reuse, and not block waiting on a second buffer while holding another
- buffer.
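The ownership rules above can be modelled in portable C11 as a rough sketch. The structure zbuf_gen_sketch is a hypothetical stand-in for the generation fields of the buffer header, and sketch_kernel_assign only imitates the kernel side for demonstration; real consumers operate on the header at the start of each registered buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three documented header fields. */
struct zbuf_gen_sketch {
	atomic_uint kernel_gen;	/* changed by the kernel on hand-off */
	atomic_uint kernel_len;	/* bytes of packet data in the buffer */
	atomic_uint user_gen;	/* set by userspace to acknowledge */
};

/* Userspace owns the buffer while the two generation numbers differ. */
static bool
sketch_user_owns(struct zbuf_gen_sketch *h)
{
	assert(h != NULL);
	return (atomic_load_explicit(&h->user_gen, memory_order_relaxed) !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/* Acknowledge: copy kernel_gen into user_gen with release ordering. */
static void
sketch_acknowledge(struct zbuf_gen_sketch *h)
{
	assert(h != NULL);
	atomic_store_explicit(&h->user_gen,
	    atomic_load_explicit(&h->kernel_gen, memory_order_relaxed),
	    memory_order_release);
}

/* Demonstration stand-in for the kernel assigning a buffer to userspace. */
static void
sketch_kernel_assign(struct zbuf_gen_sketch *h, unsigned int len)
{
	assert(h != NULL);
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1u, memory_order_release);
}
```

A zeroed header starts out owned by the kernel; after the simulated hand-off the buffer belongs to userspace until sketch_acknowledge is called.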
- .Sh IOCTLS
- The
- .Xr ioctl 2
- command codes below are defined in
- .In net/bpf.h .
- All commands require
- these includes:
- .Bd -literal
- #include <sys/types.h>
- #include <sys/time.h>
- #include <sys/ioctl.h>
- #include <net/bpf.h>
- .Ed
- .Pp
- Additionally,
- .Dv BIOCGETIF
- and
- .Dv BIOCSETIF
- require
- .In sys/socket.h
- and
- .In net/if.h .
- .Pp
- In addition to
- .Dv FIONREAD
- and
- .Dv SIOCGIFADDR ,
- the following commands may be applied to any open
- .Nm
- file.
- The (third) argument to
- .Xr ioctl 2
- should be a pointer to the type indicated.
- .Bl -tag -width BIOCGETBUFMODE
- .It Dv BIOCGBLEN
- .Pq Li u_int
- Returns the required buffer length for reads on
- .Nm
- files.
- .It Dv BIOCSBLEN
- .Pq Li u_int
- Sets the buffer length for reads on
- .Nm
- files.
- The buffer must be set before the file is attached to an interface
- with
- .Dv BIOCSETIF .
- If the requested buffer size cannot be accommodated, the closest
- allowable size will be set and returned in the argument.
- A read call will result in
- .Er EIO
- if it is passed a buffer that is not this size.
- .It Dv BIOCGDLT
- .Pq Li u_int
- Returns the type of the data link layer underlying the attached interface.
- .Er EINVAL
- is returned if no interface has been specified.
- The device types, prefixed with
- .Dq Li DLT_ ,
- are defined in
- .In net/bpf.h .
- .It Dv BIOCPROMISC
- Forces the interface into promiscuous mode.
- All packets, not just those destined for the local host, are processed.
- Since more than one file can be listening on a given interface,
- a listener that opened its interface non-promiscuously may receive
- packets promiscuously.
- This problem can be remedied with an appropriate filter.
- .It Dv BIOCFLUSH
- Flushes the buffer of incoming packets,
- and resets the statistics that are returned by
- .Dv BIOCGSTATS .
- .It Dv BIOCGETIF
- .Pq Li "struct ifreq"
- Returns the name of the hardware interface that the file is listening on.
- The name is returned in the ifr_name field of
- the
- .Li ifreq
- structure.
- All other fields are undefined.
- .It Dv BIOCSETIF
- .Pq Li "struct ifreq"
- Sets the hardware interface associated with the file.
- This
- command must be performed before any packets can be read.
- The device is indicated by name using the
- .Li ifr_name
- field of the
- .Li ifreq
- structure.
- Additionally, performs the actions of
- .Dv BIOCFLUSH .
- .It Dv BIOCSRTIMEOUT
- .It Dv BIOCGRTIMEOUT
- .Pq Li "struct timeval"
- Set or get the read timeout parameter.
- The argument
- specifies the length of time to wait before timing
- out on a read request.
- This parameter is initialized to zero by
- .Xr open 2 ,
- indicating no timeout.
- .It Dv BIOCGSTATS
- .Pq Li "struct bpf_stat"
- Returns the following structure of packet statistics:
- .Bd -literal
- struct bpf_stat {
- u_int bs_recv; /* number of packets received */
- u_int bs_drop; /* number of packets dropped */
- };
- .Ed
- .Pp
- The fields are:
- .Bl -hang -offset indent
- .It Li bs_recv
- the number of packets received by the descriptor since opened or reset
- (including any buffered since the last read call);
- and
- .It Li bs_drop
- the number of packets which were accepted by the filter but dropped by the
- kernel because of buffer overflows
- (i.e., the application's reads are not keeping up with the packet traffic).
- .El
- .It Dv BIOCIMMEDIATE
- .Pq Li u_int
- Enable or disable
- .Dq immediate mode ,
- based on the truth value of the argument.
- When immediate mode is enabled, reads return immediately upon packet
- reception.
- Otherwise, a read will block until either the kernel buffer
- becomes full or a timeout occurs.
- This is useful for programs like
- .Xr rarpd 8
- which must respond to messages in real time.
- The default for a new file is off.
- .It Dv BIOCSETF
- .It Dv BIOCSETFNR
- .Pq Li "struct bpf_program"
- Sets the read filter program used by the kernel to discard uninteresting
- packets.
- An array of instructions and its length is passed in using
- the following structure:
- .Bd -literal
- struct bpf_program {
- int bf_len;
- struct bpf_insn *bf_insns;
- };
- .Ed
- .Pp
- The filter program is pointed to by the
- .Li bf_insns
- field while its length in units of
- .Sq Li struct bpf_insn
- is given by the
- .Li bf_len
- field.
- See section
- .Sx "FILTER MACHINE"
- for an explanation of the filter language.
- The only difference between
- .Dv BIOCSETF
- and
- .Dv BIOCSETFNR
- is
- .Dv BIOCSETF
- performs the actions of
- .Dv BIOCFLUSH
- while
- .Dv BIOCSETFNR
- does not.
- .It Dv BIOCSETWF
- .Pq Li "struct bpf_program"
- Sets the write filter program used by the kernel to control what type of
- packets can be written to the interface.
- See the
- .Dv BIOCSETF
- command for more
- information on the
- .Nm
- filter program.
- .It Dv BIOCVERSION
- .Pq Li "struct bpf_version"
- Returns the major and minor version numbers of the filter language currently
- recognized by the kernel.
- Before installing a filter, applications must check
- that the current version is compatible with the running kernel.
- Version numbers are compatible if the major numbers match and the application minor
- is less than or equal to the kernel minor.
- The kernel version number is returned in the following structure:
- .Bd -literal
- struct bpf_version {
- u_short bv_major;
- u_short bv_minor;
- };
- .Ed
- .Pp
- The current version numbers are given by
- .Dv BPF_MAJOR_VERSION
- and
- .Dv BPF_MINOR_VERSION
- from
- .In net/bpf.h .
- An incompatible filter
- may result in undefined behavior (most likely, an error returned by
- .Fn ioctl
- or haphazard packet matching).
- .It Dv BIOCSHDRCMPLT
- .It Dv BIOCGHDRCMPLT
- .Pq Li u_int
- Set or get the status of the
- .Dq header complete
- flag.
- Set to zero if the link level source address should be filled in automatically
- by the interface output routine.
- Set to one if the link level source
- address will be written, as provided, to the wire.
- This flag is initialized to zero by default.
- .It Dv BIOCSSEESENT
- .It Dv BIOCGSEESENT
- .Pq Li u_int
- These commands are obsolete but left for compatibility.
- Use
- .Dv BIOCSDIRECTION
- and
- .Dv BIOCGDIRECTION
- instead.
- Set or get the flag determining whether locally generated packets on the
- interface should be returned by BPF.
- Set to zero to see only incoming packets on the interface.
- Set to one to see packets originating locally and remotely on the interface.
- This flag is initialized to one by default.
- .It Dv BIOCSDIRECTION
- .It Dv BIOCGDIRECTION
- .Pq Li u_int
- Set or get the setting determining whether incoming, outgoing, or all packets
- on the interface should be returned by BPF.
- Set to
- .Dv BPF_D_IN
- to see only incoming packets on the interface.
- Set to
- .Dv BPF_D_INOUT
- to see packets originating locally and remotely on the interface.
- Set to
- .Dv BPF_D_OUT
- to see only outgoing packets on the interface.
- This setting is initialized to
- .Dv BPF_D_INOUT
- by default.
- .It Dv BIOCSTSTAMP
- .It Dv BIOCGTSTAMP
- .Pq Li u_int
- Set or get format and resolution of the time stamps returned by BPF.
- Set to
- .Dv BPF_T_MICROTIME ,
- .Dv BPF_T_MICROTIME_FAST ,
- .Dv BPF_T_MICROTIME_MONOTONIC ,
- or
- .Dv BPF_T_MICROTIME_MONOTONIC_FAST
- to get time stamps in 64-bit
- .Vt struct timeval
- format.
- Set to
- .Dv BPF_T_NANOTIME ,
- .Dv BPF_T_NANOTIME_FAST ,
- .Dv BPF_T_NANOTIME_MONOTONIC ,
- or
- .Dv BPF_T_NANOTIME_MONOTONIC_FAST
- to get time stamps in 64-bit
- .Vt struct timespec
- format.
- Set to
- .Dv BPF_T_BINTIME ,
- .Dv BPF_T_BINTIME_FAST ,
- .Dv BPF_T_BINTIME_MONOTONIC ,
- or
- .Dv BPF_T_BINTIME_MONOTONIC_FAST
- to get time stamps in 64-bit
- .Vt struct bintime
- format.
- Set to
- .Dv BPF_T_NONE
- to ignore time stamp.
- All 64-bit time stamp formats are wrapped in
- .Vt struct bpf_ts .
- The
- .Dv BPF_T_MICROTIME_FAST ,
- .Dv BPF_T_NANOTIME_FAST ,
- .Dv BPF_T_BINTIME_FAST ,
- .Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
- .Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
- and
- .Dv BPF_T_BINTIME_MONOTONIC_FAST
- are analogs of corresponding formats without _FAST suffix but do not perform
- a full time counter query, so their accuracy is one timer tick.
- The
- .Dv BPF_T_MICROTIME_MONOTONIC ,
- .Dv BPF_T_NANOTIME_MONOTONIC ,
- .Dv BPF_T_BINTIME_MONOTONIC ,
- .Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
- .Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
- and
- .Dv BPF_T_BINTIME_MONOTONIC_FAST
- store the time elapsed since kernel boot.
- This setting is initialized to
- .Dv BPF_T_MICROTIME
- by default.
- .It Dv BIOCFEEDBACK
- .Pq Li u_int
- Set packet feedback mode.
- This allows injected packets to be fed back as input to the interface when
- output via the interface is successful.
- When
- .Dv BPF_D_INOUT
- direction is set, an injected outgoing packet is not returned by BPF to avoid
- duplication.
- This flag is initialized to zero by default.
- .It Dv BIOCLOCK
- Set the locked flag on the
- .Nm
- descriptor.
- This prevents the execution of
- ioctl commands which could change the underlying operating parameters of
- the device.
- .It Dv BIOCGETBUFMODE
- .It Dv BIOCSETBUFMODE
- .Pq Li u_int
- Get or set the current
- .Nm
- buffering mode; possible values are
- .Dv BPF_BUFMODE_BUFFER ,
- buffered read mode, and
- .Dv BPF_BUFMODE_ZBUF ,
- zero-copy buffer mode.
- .It Dv BIOCSETZBUF
- .Pq Li "struct bpf_zbuf"
- Set the current zero-copy buffer locations; buffer locations may be
- set only once zero-copy buffer mode has been selected, and prior to attaching
- to an interface.
- Buffers must be of identical size, page-aligned, and an integer multiple of
- pages in size.
- The three fields
- .Vt bz_bufa ,
- .Vt bz_bufb ,
- and
- .Vt bz_buflen
- must be filled out.
- If buffers have already been set for this device, the ioctl will fail.
- .It Dv BIOCGETZMAX
- .Pq Li size_t
- Get the largest individual zero-copy buffer size allowed.
- As two buffers are used in zero-copy buffer mode, the limit (in practice) is
- twice the returned size.
- As zero-copy buffers consume kernel address space, conservative selection of
- buffer size is suggested, especially when there are multiple
- .Nm
- descriptors in use on 32-bit systems.
- .It Dv BIOCROTZBUF
- Force ownership of the next buffer to be assigned to userspace, if any data
- is present in the buffer.
- If no data is present, the buffer will remain owned by the kernel.
- This allows consumers of zero-copy buffering to implement timeouts and
- retrieve partially filled buffers.
- In order to handle the case where no data is present in the buffer and
- therefore ownership is not assigned, the user process must check
- .Vt bzh_kernel_gen
- against
- .Vt bzh_user_gen .
- .El
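The compatibility rule given for BIOCVERSION (equal major numbers, application minor no greater than the kernel minor) can be expressed as a small predicate. This is only a sketch: struct sketch_version is a hypothetical mirror of the two documented fields so that the example builds without the system header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the major/minor pair returned by the kernel. */
struct sketch_version {
	unsigned short major;
	unsigned short minor;
};

/*
 * Return true if a filter built against version 'app' may be installed on
 * a kernel reporting version 'kern'.
 */
static bool
sketch_version_ok(const struct sketch_version *app,
    const struct sketch_version *kern)
{
	assert(app != NULL && kern != NULL);
	return (app->major == kern->major && app->minor <= kern->minor);
}
```

In a real program the application side would come from the compile-time version constants and the kernel side from the ioctl result.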
- .Sh BPF HEADER
- One of the following structures is prepended to each packet returned by
- .Xr read 2
- or via a zero-copy buffer:
- .Bd -literal
- struct bpf_xhdr {
- struct bpf_ts bh_tstamp; /* time stamp */
- uint32_t bh_caplen; /* length of captured portion */
- uint32_t bh_datalen; /* original length of packet */
- u_short bh_hdrlen; /* length of bpf header (this struct
- plus alignment padding) */
- };
- struct bpf_hdr {
- struct timeval bh_tstamp; /* time stamp */
- uint32_t bh_caplen; /* length of captured portion */
- uint32_t bh_datalen; /* original length of packet */
- u_short bh_hdrlen; /* length of bpf header (this struct
- plus alignment padding) */
- };
- .Ed
- .Pp
- The fields, whose values are stored in host order, are:
- .Pp
- .Bl -tag -compact -width bh_datalen
- .It Li bh_tstamp
- The time at which the packet was processed by the packet filter.
- .It Li bh_caplen
- The length of the captured portion of the packet.
- This is the minimum of
- the truncation amount specified by the filter and the length of the packet.
- .It Li bh_datalen
- The length of the packet off the wire.
- This value is independent of the truncation amount specified by the filter.
- .It Li bh_hdrlen
- The length of the
- .Nm
- header, which may not be equal to
- .\" XXX - not really a function call
- .Fn sizeof "struct bpf_xhdr"
- or
- .Fn sizeof "struct bpf_hdr" .
- .El
- .Pp
- The
- .Li bh_hdrlen
- field exists to account for
- padding between the header and the link level protocol.
- The purpose here is to guarantee proper alignment of the packet
- data structures, which is required on alignment sensitive
- architectures and improves performance on many other architectures.
- The packet filter ensures that the
- .Vt bpf_xhdr ,
- .Vt bpf_hdr
- and the network layer
- header will be word aligned.
- Currently,
- .Vt bpf_hdr
- is used when the time stamp is set to
- .Dv BPF_T_MICROTIME ,
- .Dv BPF_T_MICROTIME_FAST ,
- .Dv BPF_T_MICROTIME_MONOTONIC ,
- .Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
- or
- .Dv BPF_T_NONE
- for backward compatibility reasons.
- Otherwise,
- .Vt bpf_xhdr
- is used.
- However,
- .Vt bpf_hdr
- may be deprecated in the near future.
- Suitable precautions
- must be taken when accessing the link layer protocol fields on alignment
- restricted machines.
- (This is not a problem on an Ethernet, since
- the type field is a short falling on an even offset,
- and the addresses are probably accessed in a bytewise fashion).
- .Pp
- Additionally, individual packets are padded so that each starts
- on a word boundary.
- This requires that an application
- has some knowledge of how to get from packet to packet.
- The macro
- .Dv BPF_WORDALIGN
- is defined in
- .In net/bpf.h
- to facilitate
- this process.
- It rounds up its argument to the nearest word aligned value (where a word is
- .Dv BPF_ALIGNMENT
- bytes wide).
- .Pp
- For example, if
- .Sq Li p
- points to the start of a packet, this expression
- will advance it to the next packet:
- .Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
- .Pp
- For the alignment mechanisms to work properly, the
- buffer passed to
- .Xr read 2
- must itself be word aligned.
- The
- .Xr malloc 3
- function
- will always return an aligned buffer.
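The packet-to-packet stepping described above can be sketched as a loop over one buffer. To stay buildable without the system header, the example uses hypothetical stand-ins: struct sketch_hdr mirrors the documented timeval-based header, and SKETCH_WORDALIGN assumes a word is sizeof(long) bytes, which may differ from the real alignment constant on a given platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

/* Assumed word size; the real value comes from the system header. */
#define SKETCH_ALIGNMENT	sizeof(long)
#define SKETCH_WORDALIGN(x) \
	(((x) + (SKETCH_ALIGNMENT - 1)) & ~(SKETCH_ALIGNMENT - 1))

/* Hypothetical mirror of the per-packet header. */
struct sketch_hdr {
	struct timeval sh_tstamp;	/* time stamp */
	uint32_t sh_caplen;		/* length of captured portion */
	uint32_t sh_datalen;		/* original length of packet */
	unsigned short sh_hdrlen;	/* header length plus padding */
};

/*
 * Walk the 'nread' bytes in 'buf'.  Returns the number of packets seen and
 * stores the sum of their captured lengths in '*captured'.  Stops early on
 * a record that does not fit in the buffer.
 */
static size_t
sketch_walk(const unsigned char *buf, size_t nread, size_t *captured)
{
	struct sketch_hdr h;
	size_t off = 0, count = 0;

	assert(buf != NULL && captured != NULL);
	*captured = 0;
	while (off < nread && nread - off >= sizeof(h)) {
		memcpy(&h, buf + off, sizeof(h));
		if (h.sh_hdrlen < sizeof(h) || h.sh_hdrlen > nread - off ||
		    h.sh_caplen > nread - off - h.sh_hdrlen)
			break;
		/* Packet data starts at buf + off + h.sh_hdrlen. */
		*captured += h.sh_caplen;
		count++;
		off += SKETCH_WORDALIGN((size_t)h.sh_hdrlen + h.sh_caplen);
	}
	return (count);
}
```

The bounds checks are a defensive addition for the sketch; they guard against a truncated or corrupt buffer rather than reflecting any documented kernel behaviour.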
- .Sh FILTER MACHINE
- A filter program is an array of instructions, with all branches forwardly
- directed, terminated by a
- .Em return
- instruction.
- Each instruction performs some action on the pseudo-machine state,
- which consists of an accumulator, index register, scratch memory store,
- and implicit program counter.
- .Pp
- The following structure defines the instruction format:
- .Bd -literal
- struct bpf_insn {
- u_short code;
- u_char jt;
- u_char jf;
- u_long k;
- };
- .Ed
- .Pp
- The
- .Li k
- field is used in different ways by different instructions,
- and the
- .Li jt
- and
- .Li jf
- fields are used as offsets
- by the branch instructions.
- The opcodes are encoded in a semi-hierarchical fashion.
- There are eight classes of instructions:
- .Dv BPF_LD ,
- .Dv BPF_LDX ,
- .Dv BPF_ST ,
- .Dv BPF_STX ,
- .Dv BPF_ALU ,
- .Dv BPF_JMP ,
- .Dv BPF_RET ,
- and
- .Dv BPF_MISC .
- Various other mode and
- operator bits are or'd into the class to give the actual instructions.
- The classes and modes are defined in
- .In net/bpf.h .
- .Pp
- Below are the semantics for each defined
- .Nm
- instruction.
- We use the convention that A is the accumulator, X is the index register,
- P[] packet data, and M[] scratch memory store.
- P[i:n] gives the data at byte offset
- .Dq i
- in the packet,
- interpreted as a word (n=4),
- unsigned halfword (n=2), or unsigned byte (n=1).
- M[i] gives the i'th word in the scratch memory store, which is only
- addressed in word units.
- The memory store is indexed from 0 to
- .Dv BPF_MEMWORDS
- - 1.
- .Li k ,
- .Li jt ,
- and
- .Li jf
- are the corresponding fields in the
- instruction definition.
- .Dq len
- refers to the length of the packet.
- .Bl -tag -width BPF_STXx
- .It Dv BPF_LD
- These instructions copy a value into the accumulator.
- The type of the source operand is specified by an
- .Dq addressing mode
- and can be a constant
- .Pq Dv BPF_IMM ,
- packet data at a fixed offset
- .Pq Dv BPF_ABS ,
- packet data at a variable offset
- .Pq Dv BPF_IND ,
- the packet length
- .Pq Dv BPF_LEN ,
- or a word in the scratch memory store
- .Pq Dv BPF_MEM .
- For
- .Dv BPF_IND
- and
- .Dv BPF_ABS ,
- the data size must be specified as a word
- .Pq Dv BPF_W ,
- halfword
- .Pq Dv BPF_H ,
- or byte
- .Pq Dv BPF_B .
- The semantics of all the recognized
- .Dv BPF_LD
- instructions follow.
- .Bd -literal
- BPF_LD+BPF_W+BPF_ABS A <- P[k:4]
- BPF_LD+BPF_H+BPF_ABS A <- P[k:2]
- BPF_LD+BPF_B+BPF_ABS A <- P[k:1]
- BPF_LD+BPF_W+BPF_IND A <- P[X+k:4]
- BPF_LD+BPF_H+BPF_IND A <- P[X+k:2]
- BPF_LD+BPF_B+BPF_IND A <- P[X+k:1]
- BPF_LD+BPF_W+BPF_LEN A <- len
- BPF_LD+BPF_IMM A <- k
- BPF_LD+BPF_MEM A <- M[k]
- .Ed
- .It Dv BPF_LDX
- These instructions load a value into the index register.
- Note that
- the addressing modes are more restrictive than those of the accumulator loads,
- but they include
- .Dv BPF_MSH ,
- a hack for efficiently loading the IP header length.
- .Bd -literal
- BPF_LDX+BPF_W+BPF_IMM X <- k
- BPF_LDX+BPF_W+BPF_MEM X <- M[k]
- BPF_LDX+BPF_W+BPF_LEN X <- len
- BPF_LDX+BPF_B+BPF_MSH X <- 4*(P[k:1]&0xf)
- .Ed
- .It Dv BPF_ST
- This instruction stores the accumulator into the scratch memory.
- We do not need an addressing mode since there is only one possibility
- for the destination.
- .Bd -literal
- BPF_ST M[k] <- A
- .Ed
- .It Dv BPF_STX
- This instruction stores the index register in the scratch memory store.
- .Bd -literal
- BPF_STX M[k] <- X
- .Ed
- .It Dv BPF_ALU
- The alu instructions perform operations between the accumulator and
- index register or constant, and store the result back in the accumulator.
- For binary operations, a source mode is required
- .Dv ( BPF_K
- or
- .Dv BPF_X ) .
- .Bd -literal
- BPF_ALU+BPF_ADD+BPF_K A <- A + k
- BPF_ALU+BPF_SUB+BPF_K A <- A - k
- BPF_ALU+BPF_MUL+BPF_K A <- A * k
- BPF_ALU+BPF_DIV+BPF_K A <- A / k
- BPF_ALU+BPF_AND+BPF_K A <- A & k
- BPF_ALU+BPF_OR+BPF_K A <- A | k
- BPF_ALU+BPF_LSH+BPF_K A <- A << k
- BPF_ALU+BPF_RSH+BPF_K A <- A >> k
- BPF_ALU+BPF_ADD+BPF_X A <- A + X
- BPF_ALU+BPF_SUB+BPF_X A <- A - X
- BPF_ALU+BPF_MUL+BPF_X A <- A * X
- BPF_ALU+BPF_DIV+BPF_X A <- A / X
- BPF_ALU+BPF_AND+BPF_X A <- A & X
- BPF_ALU+BPF_OR+BPF_X A <- A | X
- BPF_ALU+BPF_LSH+BPF_X A <- A << X
- BPF_ALU+BPF_RSH+BPF_X A <- A >> X
- BPF_ALU+BPF_NEG A <- -A
- .Ed
- .It Dv BPF_JMP
- The jump instructions alter flow of control.
- Conditional jumps
- compare the accumulator against a constant
- .Pq Dv BPF_K
- or the index register
- .Pq Dv BPF_X .
- If the result is true (or non-zero),
- the true branch is taken, otherwise the false branch is taken.
- Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
- However, the jump always
- .Pq Dv BPF_JA
- opcode uses the 32 bit
- .Li k
- field as the offset, allowing arbitrarily distant destinations.
- All conditionals use unsigned comparison conventions.
- .Bd -literal
- BPF_JMP+BPF_JA pc += k
- BPF_JMP+BPF_JGT+BPF_K pc += (A > k) ? jt : jf
- BPF_JMP+BPF_JGE+BPF_K pc += (A >= k) ? jt : jf
- BPF_JMP+BPF_JEQ+BPF_K pc += (A == k) ? jt : jf
- BPF_JMP+BPF_JSET+BPF_K pc += (A & k) ? jt : jf
- BPF_JMP+BPF_JGT+BPF_X pc += (A > X) ? jt : jf
- BPF_JMP+BPF_JGE+BPF_X pc += (A >= X) ? jt : jf
- BPF_JMP+BPF_JEQ+BPF_X pc += (A == X) ? jt : jf
- BPF_JMP+BPF_JSET+BPF_X pc += (A & X) ? jt : jf
- .Ed
- .It Dv BPF_RET
- The return instructions terminate the filter program and specify the amount
- of packet to accept (i.e., they return the truncation amount).
- A return value of zero indicates that the packet should be ignored.
- The return value is either a constant
- .Pq Dv BPF_K
- or the accumulator
- .Pq Dv BPF_A .
- .Bd -literal
- BPF_RET+BPF_A accept A bytes
- BPF_RET+BPF_K accept k bytes
- .Ed
- .It Dv BPF_MISC
- The miscellaneous category was created for anything that does not
- fit into the above classes, and for any new instructions that might need to
- be added.
- Currently, these are the register transfer instructions
- that copy the index register to the accumulator or vice versa.
- .Bd -literal
- BPF_MISC+BPF_TAX X <- A
- BPF_MISC+BPF_TXA A <- X
- .Ed
- .El
- .Pp
- The
- .Nm
- interface provides the following macros to facilitate
- array initializers:
- .Fn BPF_STMT opcode operand
- and
- .Fn BPF_JUMP opcode operand true_offset false_offset .
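To make the semantics above concrete, here is a toy userspace interpreter for a small subset of the instruction set. It is a sketch for illustration, not the kernel implementation: the SK_ names, struct sk_insn, and the SK_STMT and SK_JUMP macros are local stand-ins, and any opcode outside the handled subset simply rejects the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for a few opcode bits; only this subset is handled. */
enum {
	SK_LD = 0x00, SK_LDX = 0x01, SK_JMP = 0x05, SK_RET = 0x06,
	SK_W = 0x00, SK_H = 0x08, SK_B = 0x10,
	SK_ABS = 0x20, SK_IND = 0x40, SK_MSH = 0xa0,
	SK_JEQ = 0x10, SK_JSET = 0x40, SK_K = 0x00
};

struct sk_insn {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

#define SK_STMT(c, k)		{ (uint16_t)(c), 0, 0, (k) }
#define SK_JUMP(c, k, t, f)	{ (uint16_t)(c), (t), (f), (k) }

/* P[off:n], big-endian; clears *ok if the load runs past the packet. */
static uint32_t
sk_fetch(const uint8_t *p, size_t len, size_t off, size_t n, int *ok)
{
	uint32_t v = 0;

	if (off > len || n > len - off) {
		*ok = 0;
		return (0);
	}
	while (n-- > 0)
		v = (v << 8) | p[off++];
	return (v);
}

/* Returns the number of bytes to accept, or 0 to reject the packet. */
static uint32_t
sk_run(const struct sk_insn *prog, size_t ninsns, const uint8_t *p, size_t len)
{
	uint32_t A = 0, X = 0;
	size_t pc;
	int ok = 1;

	assert(prog != NULL && p != NULL);
	for (pc = 0; pc < ninsns && ok; pc++) {
		const struct sk_insn *i = &prog[pc];

		switch (i->code) {
		case SK_LD + SK_W + SK_ABS:
			A = sk_fetch(p, len, i->k, 4, &ok);
			break;
		case SK_LD + SK_H + SK_ABS:
			A = sk_fetch(p, len, i->k, 2, &ok);
			break;
		case SK_LD + SK_B + SK_ABS:
			A = sk_fetch(p, len, i->k, 1, &ok);
			break;
		case SK_LD + SK_H + SK_IND:
			A = sk_fetch(p, len, (size_t)X + i->k, 2, &ok);
			break;
		case SK_LDX + SK_B + SK_MSH:
			X = 4 * (sk_fetch(p, len, i->k, 1, &ok) & 0xf);
			break;
		case SK_JMP + SK_JEQ + SK_K:
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case SK_JMP + SK_JSET + SK_K:
			pc += (A & i->k) ? i->jt : i->jf;
			break;
		case SK_RET + SK_K:
			return (i->k);
		default:
			return (0);	/* opcode outside this sketch */
		}
	}
	return (0);	/* failed load or no return instruction: reject */
}
```

Branch offsets are added before the loop increments pc, so they are counted from the instruction following the branch.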
- .Sh SYSCTL VARIABLES
- A set of
- .Xr sysctl 8
- variables controls the behaviour of the
- .Nm
- subsystem:
- .Bl -tag -width indent
- .It Va net.bpf.optimize_writers: No 0
- Various programs use BPF to send (but not receive) raw packets
- (cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
- They do not need incoming packets to be sent to them.
- Turning this option on attaches new BPF users to a write-only interface list
- until the program explicitly specifies a read filter via
- .Cm pcap_set_filter() .
- This removes any performance degradation for high-speed interfaces.
- .It Va net.bpf.stats:
- Binary interface for retrieving general statistics.
- .It Va net.bpf.zerocopy_enable: No 0
- Permits zero-copy to be used with net BPF readers. Use with caution.
- .It Va net.bpf.maxinsns: No 512
- Maximum number of instructions that a BPF program can contain.
- Use the
- .Xr tcpdump 1
- .Fl d
- option to determine the approximate number of instructions for any filter.
- .It Va net.bpf.maxbufsize: No 524288
- Maximum buffer size to allocate for the packet buffer.
- .It Va net.bpf.bufsize: No 4096
- Default buffer size to allocate for the packet buffer.
- .El
- .Sh EXAMPLES
- The following filter is taken from the Reverse ARP Daemon.
- It accepts only Reverse ARP requests.
- .Bd -literal
- struct bpf_insn insns[] = {
- BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
- BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
- BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
- sizeof(struct ether_header)),
- BPF_STMT(BPF_RET+BPF_K, 0),
- };
- .Ed
- .Pp
- This filter accepts only IP packets between host 128.3.112.15 and
- 128.3.112.35.
- .Bd -literal
- struct bpf_insn insns[] = {
- BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
- BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
- BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
- BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
- BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
- BPF_STMT(BPF_RET+BPF_K, 0),
- };
- .Ed
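The constants 0x8003700f and 0x80037023 in the filter above are the two host addresses packed most significant octet first, which is the value a word load produces. A small helper (the name sketch_ipv4_word is hypothetical) shows the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the dotted quad a.b.c.d into one 32-bit value, 'a' in the top byte. */
static uint32_t
sketch_ipv4_word(unsigned int a, unsigned int b, unsigned int c,
    unsigned int d)
{
	assert(a <= 255 && b <= 255 && c <= 255 && d <= 255);
	return ((uint32_t)a << 24 | (uint32_t)b << 16 |
	    (uint32_t)c << 8 | (uint32_t)d);
}
```

For 128.3.112.15 the octets are 0x80, 0x03, 0x70, and 0x0f.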
- .Pp
- Finally, this filter returns only TCP finger packets.
- We must parse the IP header to reach the TCP header.
- The
- .Dv BPF_JSET
- instruction
- checks that the IP fragment offset is 0 so we are sure
- that we have a TCP header.
- .Bd -literal
- struct bpf_insn insns[] = {
- BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
- BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
- BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
- BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
- BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
- BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
- BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
- BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
- BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
- BPF_STMT(BPF_RET+BPF_K, 0),
- };
- .Ed
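When reading or writing filters like those above, it helps to recall that a branch offset is counted from the instruction following the branch. The helpers below are a hypothetical sketch of that arithmetic, with instructions numbered from zero; in the finger filter, the branch at index 1 with a false offset of 10 lands on index 12, the final reject instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Index reached from the branch at 'index' with 8-bit offset 'off'. */
static size_t
sketch_jump_target(size_t index, unsigned char off)
{
	return (index + 1 + off);
}

/* True if the branch stays inside a program of 'ninsns' instructions. */
static int
sketch_jump_in_range(size_t index, unsigned char off, size_t ninsns)
{
	assert(index < ninsns);
	return (sketch_jump_target(index, off) < ninsns);
}
```

Checking every jt and jf value this way catches offsets that would run past the last instruction.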
- .Sh SEE ALSO
- .Xr tcpdump 1 ,
- .Xr ioctl 2 ,
- .Xr kqueue 2 ,
- .Xr poll 2 ,
- .Xr select 2 ,
- .Xr byteorder 3 ,
- .Xr ng_bpf 4 ,
- .Xr bpf 9
- .Rs
- .%A McCanne, S.
- .%A Jacobson, V.
- .%T "An efficient, extensible, and portable network monitor"
- .Re
- .Sh HISTORY
- The Enet packet filter was created in 1980 by Mike Accetta and
- Rick Rashid at Carnegie-Mellon University.
- Jeffrey Mogul, at
- Stanford, ported the code to
- .Bx
- and continued its development from
- 1983 on.
- Since then, it has evolved into the Ultrix Packet Filter at
- .Tn DEC ,
- a
- .Tn STREAMS
- .Tn NIT
- module under
- .Tn SunOS 4.1 ,
- and
- .Tn BPF .
- .Sh AUTHORS
- .An -nosplit
- .An Steven McCanne ,
- of Lawrence Berkeley Laboratory, implemented BPF in
- Summer 1990.
- Much of the design is due to
- .An Van Jacobson .
- .Pp
- Support for zero-copy buffers was added by
- .An Robert N. M. Watson
- under contract to Seccuris Inc.
- .Sh BUGS
- The read buffer must be of a fixed size (returned by the
- .Dv BIOCGBLEN
- ioctl).
- .Pp
- A file that does not request promiscuous mode may receive promiscuously
- received packets as a side effect of another file requesting this
- mode on the same hardware interface.
- This could be fixed in the kernel with additional processing overhead.
- However, we favor the model where
- all files must assume that the interface is promiscuous, and if
- so desired, must utilize a filter to reject foreign packets.
- .Pp
- Data link protocols with variable length headers are not currently supported.
- .Pp
- The
- .Dv SEESENT ,
- .Dv DIRECTION ,
- and
- .Dv FEEDBACK
- settings have been observed to work incorrectly on some interface
- types, including those with hardware loopback rather than software loopback,
- and point-to-point interfaces.
- They appear to function correctly on a
- broad range of Ethernet-style interfaces.