  1. .\" Copyright (c) 2007 Seccuris Inc.
  2. .\" All rights reserved.
  3. .\"
  4. .\" This software was developed by Robert N. M. Watson under contract to
  5. .\" Seccuris Inc.
  6. .\"
  7. .\" Redistribution and use in source and binary forms, with or without
  8. .\" modification, are permitted provided that the following conditions
  9. .\" are met:
  10. .\" 1. Redistributions of source code must retain the above copyright
  11. .\" notice, this list of conditions and the following disclaimer.
  12. .\" 2. Redistributions in binary form must reproduce the above copyright
  13. .\" notice, this list of conditions and the following disclaimer in the
  14. .\" documentation and/or other materials provided with the distribution.
  15. .\"
  16. .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  17. .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  18. .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  19. .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  20. .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  21. .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  22. .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  23. .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  24. .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  25. .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  26. .\" SUCH DAMAGE.
  27. .\"
  28. .\" Copyright (c) 1990 The Regents of the University of California.
  29. .\" All rights reserved.
  30. .\"
  31. .\" Redistribution and use in source and binary forms, with or without
  32. .\" modification, are permitted provided that: (1) source code distributions
  33. .\" retain the above copyright notice and this paragraph in its entirety, (2)
  34. .\" distributions including binary code include the above copyright notice and
  35. .\" this paragraph in its entirety in the documentation or other materials
  36. .\" provided with the distribution, and (3) all advertising materials mentioning
  37. .\" features or use of this software display the following acknowledgement:
  38. .\" ``This product includes software developed by the University of California,
  39. .\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
  40. .\" the University nor the names of its contributors may be used to endorse
  41. .\" or promote products derived from this software without specific prior
  42. .\" written permission.
  43. .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
  44. .\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
  45. .\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
  46. .\"
  47. .\" This document is derived in part from the enet man page (enet.4)
  48. .\" distributed with 4.3BSD Unix.
  49. .\"
  50. .\" $FreeBSD$
  51. .\"
  52. .Dd June 15, 2010
  53. .Dt BPF 4
  54. .Os
  55. .Sh NAME
  56. .Nm bpf
  57. .Nd Berkeley Packet Filter
  58. .Sh SYNOPSIS
  59. .Cd device bpf
  60. .Sh DESCRIPTION
  61. The Berkeley Packet Filter
  62. provides a raw interface to data link layers in a protocol
  63. independent fashion.
  64. All packets on the network, even those destined for other hosts,
  65. are accessible through this mechanism.
  66. .Pp
  67. The packet filter appears as a character special device,
  68. .Pa /dev/bpf .
  69. After opening the device, the file descriptor must be bound to a
  70. specific network interface with the
  71. .Dv BIOCSETIF
  72. ioctl.
  73. A given interface can be shared by multiple listeners, and the filter
  74. underlying each descriptor will see an identical packet stream.
  75. .Pp
  76. A separate device file is required for each minor device.
  77. If a file is in use, the open will fail and
  78. .Va errno
  79. will be set to
  80. .Er EBUSY .
  81. .Pp
  82. Associated with each open instance of a
  83. .Nm
  84. file is a user-settable packet filter.
  85. Whenever a packet is received by an interface,
  86. all file descriptors listening on that interface apply their filter.
  87. Each descriptor that accepts the packet receives its own copy.
  88. .Pp
  89. The packet filter will support any link level protocol that has fixed length
  90. headers.
  91. Currently, only Ethernet,
  92. .Tn SLIP ,
  93. and
  94. .Tn PPP
  95. drivers have been modified to interact with
  96. .Nm .
  97. .Pp
  98. Since packet data is in network byte order, applications should use the
  99. .Xr byteorder 3
  100. macros to extract multi-byte values.
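.Pp
As a minimal sketch of the byte order advice above, the helper below pulls the
16-bit EtherType out of a captured Ethernet frame.
The function name
frame_ethertype
and the constant
ETHERTYPE_OFFSET
are illustrative assumptions, not part of any BPF interface; the offset of 12
assumes an untagged Ethernet II frame.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed offset of the EtherType field in an untagged Ethernet II frame. */
#define ETHERTYPE_OFFSET 12

/*
 * Return the EtherType of a captured frame in host byte order.
 * memcpy() avoids an unaligned 16-bit load; ntohs() converts the value
 * from network byte order to the host's byte order.
 */
static uint16_t
frame_ethertype(const unsigned char *frame, size_t caplen)
{
	uint16_t raw;

	assert(caplen >= ETHERTYPE_OFFSET + sizeof(raw));
	memcpy(&raw, frame + ETHERTYPE_OFFSET, sizeof(raw));
	return (ntohs(raw));
}
```

Comparing the result against a host-order constant such as 0x0800 then behaves
the same on little-endian and big-endian machines.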
  101. .Pp
  102. A packet can be sent out on the network by writing to a
  103. .Nm
  104. file descriptor.
  105. The writes are unbuffered, meaning only one packet can be processed per write.
  106. Currently, only writes to Ethernets and
  107. .Tn SLIP
  108. links are supported.
  109. .Sh BUFFER MODES
  110. .Nm
  111. devices deliver packet data to the application via memory buffers provided by
  112. the application.
  113. The buffer mode is set using the
  114. .Dv BIOCSETBUFMODE
  115. ioctl, and read using the
  116. .Dv BIOCGETBUFMODE
  117. ioctl.
  118. .Ss Buffered read mode
  119. By default,
  120. .Nm
  121. devices operate in the
  122. .Dv BPF_BUFMODE_BUFFER
  123. mode, in which packet data is copied explicitly from kernel to user memory
  124. using the
  125. .Xr read 2
  126. system call.
  127. The user process will declare a fixed buffer size that will be used both for
  128. sizing internal buffers and for all
  129. .Xr read 2
  130. operations on the file.
  131. This size is queried using the
  132. .Dv BIOCGBLEN
  133. ioctl, and is set using the
  134. .Dv BIOCSBLEN
  135. ioctl.
  136. Note that an individual packet larger than the buffer size is necessarily
  137. truncated.
  138. .Ss Zero-copy buffer mode
  139. .Nm
  140. devices may also operate in the
  141. .Dv BPF_BUFMODE_ZBUF
  142. mode, in which packet data is written directly into two user memory buffers
  143. by the kernel, avoiding both system call and copying overhead.
  144. Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of
  145. the page size.
  146. The maximum zero-copy buffer size is returned by the
  147. .Dv BIOCGETZMAX
  148. ioctl.
  149. Note that an individual packet larger than the buffer size is necessarily
  150. truncated.
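.Pp
The size and alignment constraints above might be met along the following
lines.
This is only a sketch: the helper names
zbuf_length
and
zbuf_alloc
are hypothetical, the caller is assumed to supply the page size (for example
from
sysconf(_SC_PAGESIZE))
and the limit reported by
BIOCGETZMAX ,
and C11
aligned_alloc()
is used here in place of whatever allocator a real program prefers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Round `want` down to a whole number of pages, never exceeding `zmax`
 * and never returning less than one page.  Returns 0 if not even one
 * page fits under `zmax`.
 */
static size_t
zbuf_length(size_t want, size_t zmax, size_t pagesize)
{
	size_t len;

	assert(pagesize != 0);
	len = want < zmax ? want : zmax;
	len -= len % pagesize;
	if (len == 0 && zmax >= pagesize)
		len = pagesize;
	return (len);
}

/*
 * Allocate one page-aligned buffer of `len` bytes and zero it, which
 * also zeroes the space a buffer header would occupy.  Returns NULL on
 * failure.  `len` must be a multiple of `pagesize`.
 */
static void *
zbuf_alloc(size_t len, size_t pagesize)
{
	void *p;

	if (len == 0 || len % pagesize != 0)
		return (NULL);
	p = aligned_alloc(pagesize, len);
	if (p != NULL)
		memset(p, 0, len);
	return (p);
}
```

Calling the allocator twice with the same length yields two buffers of equal
size.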
  151. .Pp
  152. The user process registers two memory buffers using the
  153. .Dv BIOCSETZBUF
  154. ioctl, which accepts a
  155. .Vt struct bpf_zbuf
  156. pointer as an argument:
  157. .Bd -literal
  158. struct bpf_zbuf {
  159. void *bz_bufa;
  160. void *bz_bufb;
  161. size_t bz_buflen;
  162. };
  163. .Ed
  164. .Pp
  165. .Vt bz_bufa
  166. is a pointer to the userspace address of the first buffer that will be
  167. filled, and
  168. .Vt bz_bufb
  169. is a pointer to the second buffer.
  170. .Nm
  171. will then cycle between the two buffers as they fill and are acknowledged.
  172. .Pp
  173. Each buffer begins with a fixed-length header to hold synchronization and
  174. data length information for the buffer:
  175. .Bd -literal
  176. struct bpf_zbuf_header {
  177. volatile u_int bzh_kernel_gen; /* Kernel generation number. */
  178. volatile u_int bzh_kernel_len; /* Length of data in the buffer. */
  179. volatile u_int bzh_user_gen; /* User generation number. */
  180. /* ...padding for future use... */
  181. };
  182. .Ed
  183. .Pp
  184. The header structure of each buffer, including all padding, should be zeroed
  185. before it is configured using
  186. .Dv BIOCSETZBUF .
  187. Remaining space in the buffer will be used by the kernel to store packet
  188. data, laid out in the same format as with buffered read mode.
  189. .Pp
  190. The kernel and the user process follow a simple acknowledgement protocol via
  191. the buffer header to synchronize access to the buffer: when the header
  192. generation numbers,
  193. .Vt bzh_kernel_gen
  194. and
  195. .Vt bzh_user_gen ,
  196. hold the same value, the kernel owns the buffer, and when they differ,
  197. userspace owns the buffer.
  198. .Pp
  199. While the kernel owns the buffer, the contents are unstable and may change
  200. asynchronously; while the user process owns the buffer, its contents are
  201. stable and will not be changed until the buffer has been acknowledged.
  202. .Pp
  203. Initializing the buffer headers to all 0's before registering the buffer has
  204. the effect of assigning initial ownership of both buffers to the kernel.
  205. The kernel signals that a buffer has been assigned to userspace by modifying
  206. .Vt bzh_kernel_gen ,
  207. and userspace acknowledges the buffer and returns it to the kernel by setting
  208. the value of
  209. .Vt bzh_user_gen
  210. to the value of
  211. .Vt bzh_kernel_gen .
  212. .Pp
  213. In order to avoid caching and memory re-ordering effects, the user process
  214. must use atomic operations and memory barriers when checking for and
  215. acknowledging buffers:
  216. .Bd -literal
  217. #include <machine/atomic.h>
  218. /*
  219. * Return ownership of a buffer to the kernel for reuse.
  220. */
  221. static void
  222. buffer_acknowledge(struct bpf_zbuf_header *bzh)
  223. {
  224. atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
  225. }
  226. /*
  227. * Check whether a buffer has been assigned to userspace by the kernel.
  228. * Return true if userspace owns the buffer, and false otherwise.
  229. */
  230. static int
  231. buffer_check(struct bpf_zbuf_header *bzh)
  232. {
  233. return (bzh->bzh_user_gen !=
  234. atomic_load_acq_int(&bzh->bzh_kernel_gen));
  235. }
  236. .Ed
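.Pp
To make the handoff easier to follow, the sketch below models both sides of
the generation-number protocol in portable C11 using
stdatomic.h
rather than the
machine/atomic.h
primitives shown above.
Everything in it is hypothetical: the structure
zbuf_header_sim
merely imitates the header layout, and
sim_kernel_assign
stands in for work that only the kernel performs on a real descriptor.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct bpf_zbuf_header, using C11 atomics. */
struct zbuf_header_sim {
	atomic_uint kernel_gen;	/* advanced by the producer ("kernel") */
	atomic_uint kernel_len;	/* bytes of data in the buffer */
	atomic_uint user_gen;	/* set by the consumer to acknowledge */
};

/* Consumer side: true if the generations differ, so userspace owns it. */
static bool
sim_user_owns(struct zbuf_header_sim *h)
{
	return (atomic_load_explicit(&h->user_gen, memory_order_relaxed) !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/* Producer side: publish `len` bytes and hand the buffer to userspace. */
static void
sim_kernel_assign(struct zbuf_header_sim *h, unsigned int len)
{
	/* The producer may only assign a buffer it currently owns. */
	assert(!sim_user_owns(h));
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	/* Release ordering makes the length visible before the new gen. */
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}

/* Consumer side: acknowledge, returning ownership to the producer. */
static void
sim_user_acknowledge(struct zbuf_header_sim *h)
{
	unsigned int gen;

	gen = atomic_load_explicit(&h->kernel_gen, memory_order_relaxed);
	atomic_store_explicit(&h->user_gen, gen, memory_order_release);
}
```

Starting from an all-zero header, ownership passes to the consumer after one
call to
sim_kernel_assign
and back to the producer after
sim_user_acknowledge .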
  237. .Pp
  238. The user process may force the assignment of the next buffer, if any data
  239. is pending, to userspace using the
  240. .Dv BIOCROTZBUF
  241. ioctl.
  242. This allows the user process to retrieve data in a partially filled buffer
  243. before the buffer is full, such as following a timeout; the process must
  244. recheck for buffer ownership using the header generation numbers, as the
  245. buffer will not be assigned to userspace if no data was present.
  246. .Pp
  247. As in the buffered read mode,
  248. .Xr kqueue 2 ,
  249. .Xr poll 2 ,
  250. and
  251. .Xr select 2
  252. may be used to sleep awaiting the availability of a completed buffer.
  253. They will return a readable file descriptor when ownership of the next buffer
  254. is assigned to user space.
  255. .Pp
  256. In the current implementation, the kernel may assign zero, one, or both
  257. buffers to the user process; however, an earlier implementation maintained
  258. the invariant that at most one buffer could be assigned to the user process
  259. at a time.
  260. To ensure both progress and high performance, user processes should
  261. acknowledge a completely processed buffer as quickly as possible, returning
  262. it for reuse, and not block waiting on a second buffer while holding another
  263. buffer.
  264. .Sh IOCTLS
  265. The
  266. .Xr ioctl 2
  267. command codes below are defined in
  268. .In net/bpf.h .
  269. All commands require
  270. these includes:
  271. .Bd -literal
  272. #include <sys/types.h>
  273. #include <sys/time.h>
  274. #include <sys/ioctl.h>
  275. #include <net/bpf.h>
  276. .Ed
  277. .Pp
  278. Additionally,
  279. .Dv BIOCGETIF
  280. and
  281. .Dv BIOCSETIF
  282. require
  283. .In sys/socket.h
  284. and
  285. .In net/if.h .
  286. .Pp
  287. In addition to
  288. .Dv FIONREAD
  289. and
  290. .Dv SIOCGIFADDR ,
  291. the following commands may be applied to any open
  292. .Nm
  293. file.
  294. The (third) argument to
  295. .Xr ioctl 2
  296. should be a pointer to the type indicated.
  297. .Bl -tag -width BIOCGETBUFMODE
  298. .It Dv BIOCGBLEN
  299. .Pq Li u_int
  300. Returns the required buffer length for reads on
  301. .Nm
  302. files.
  303. .It Dv BIOCSBLEN
  304. .Pq Li u_int
  305. Sets the buffer length for reads on
  306. .Nm
  307. files.
  308. The buffer must be set before the file is attached to an interface
  309. with
  310. .Dv BIOCSETIF .
  311. If the requested buffer size cannot be accommodated, the closest
  312. allowable size will be set and returned in the argument.
  313. A read call will result in
  314. .Er EIO
  315. if it is passed a buffer that is not this size.
  316. .It Dv BIOCGDLT
  317. .Pq Li u_int
  318. Returns the type of the data link layer underlying the attached interface.
  319. .Er EINVAL
  320. is returned if no interface has been specified.
  321. The device types, prefixed with
  322. .Dq Li DLT_ ,
  323. are defined in
  324. .In net/bpf.h .
  325. .It Dv BIOCPROMISC
  326. Forces the interface into promiscuous mode.
  327. All packets, not just those destined for the local host, are processed.
  328. Since more than one file can be listening on a given interface,
  329. a listener that opened its interface non-promiscuously may receive
  330. packets promiscuously.
  331. This problem can be remedied with an appropriate filter.
  332. .It Dv BIOCFLUSH
  333. Flushes the buffer of incoming packets,
  334. and resets the statistics that are returned by BIOCGSTATS.
  335. .It Dv BIOCGETIF
  336. .Pq Li "struct ifreq"
  337. Returns the name of the hardware interface that the file is listening on.
  338. The name is returned in the ifr_name field of
  339. the
  340. .Li ifreq
  341. structure.
  342. All other fields are undefined.
  343. .It Dv BIOCSETIF
  344. .Pq Li "struct ifreq"
  345. Sets the hardware interface associated with the file.
  346. This
  347. command must be performed before any packets can be read.
  348. The device is indicated by name using the
  349. .Li ifr_name
  350. field of the
  351. .Li ifreq
  352. structure.
  353. Additionally, performs the actions of
  354. .Dv BIOCFLUSH .
  355. .It Dv BIOCSRTIMEOUT
  356. .It Dv BIOCGRTIMEOUT
  357. .Pq Li "struct timeval"
  358. Set or get the read timeout parameter.
  359. The argument
  360. specifies the length of time to wait before timing
  361. out on a read request.
  362. This parameter is initialized to zero by
  363. .Xr open 2 ,
  364. indicating no timeout.
  365. .It Dv BIOCGSTATS
  366. .Pq Li "struct bpf_stat"
  367. Returns the following structure of packet statistics:
  368. .Bd -literal
  369. struct bpf_stat {
  370. u_int bs_recv; /* number of packets received */
  371. u_int bs_drop; /* number of packets dropped */
  372. };
  373. .Ed
  374. .Pp
  375. The fields are:
  376. .Bl -hang -offset indent
  377. .It Li bs_recv
  378. the number of packets received by the descriptor since opened or reset
  379. (including any buffered since the last read call);
  380. and
  381. .It Li bs_drop
  382. the number of packets which were accepted by the filter but dropped by the
  383. kernel because of buffer overflows
  384. (i.e., the application's reads are not keeping up with the packet traffic).
  385. .El
  386. .It Dv BIOCIMMEDIATE
  387. .Pq Li u_int
  388. Enable or disable
  389. .Dq immediate mode ,
  390. based on the truth value of the argument.
  391. When immediate mode is enabled, reads return immediately upon packet
  392. reception.
  393. Otherwise, a read will block until either the kernel buffer
  394. becomes full or a timeout occurs.
  395. This is useful for programs like
  396. .Xr rarpd 8
  397. which must respond to messages in real time.
  398. The default for a new file is off.
  399. .It Dv BIOCSETF
  400. .It Dv BIOCSETFNR
  401. .Pq Li "struct bpf_program"
  402. Sets the read filter program used by the kernel to discard uninteresting
  403. packets.
  404. An array of instructions and its length is passed in using
  405. the following structure:
  406. .Bd -literal
  407. struct bpf_program {
  408. int bf_len;
  409. struct bpf_insn *bf_insns;
  410. };
  411. .Ed
  412. .Pp
  413. The filter program is pointed to by the
  414. .Li bf_insns
  415. field while its length in units of
  416. .Sq Li struct bpf_insn
  417. is given by the
  418. .Li bf_len
  419. field.
  420. See section
  421. .Sx "FILTER MACHINE"
  422. for an explanation of the filter language.
  423. The only difference between
  424. .Dv BIOCSETF
  425. and
  426. .Dv BIOCSETFNR
  427. is that
  428. .Dv BIOCSETF
  429. performs the actions of
  430. .Dv BIOCFLUSH
  431. while
  432. .Dv BIOCSETFNR
  433. does not.
  434. .It Dv BIOCSETWF
  435. .Pq Li "struct bpf_program"
  436. Sets the write filter program used by the kernel to control what type of
  437. packets can be written to the interface.
  438. See the
  439. .Dv BIOCSETF
  440. command for more
  441. information on the
  442. .Nm
  443. filter program.
  444. .It Dv BIOCVERSION
  445. .Pq Li "struct bpf_version"
  446. Returns the major and minor version numbers of the filter language currently
  447. recognized by the kernel.
  448. Before installing a filter, applications must check
  449. that the current version is compatible with the running kernel.
  450. Version numbers are compatible if the major numbers match and the application minor
  451. is less than or equal to the kernel minor.
  452. The kernel version number is returned in the following structure:
  453. .Bd -literal
  454. struct bpf_version {
  455. u_short bv_major;
  456. u_short bv_minor;
  457. };
  458. .Ed
  459. .Pp
  460. The current version numbers are given by
  461. .Dv BPF_MAJOR_VERSION
  462. and
  463. .Dv BPF_MINOR_VERSION
  464. from
  465. .In net/bpf.h .
  466. An incompatible filter
  467. may result in undefined behavior (most likely, an error returned by
  468. .Fn ioctl
  469. or haphazard packet matching).
  470. .It Dv BIOCSHDRCMPLT
  471. .It Dv BIOCGHDRCMPLT
  472. .Pq Li u_int
  473. Set or get the status of the
  474. .Dq header complete
  475. flag.
  476. Set to zero if the link level source address should be filled in automatically
  477. by the interface output routine.
  478. Set to one if the link level source
  479. address will be written, as provided, to the wire.
  480. This flag is initialized to zero by default.
  481. .It Dv BIOCSSEESENT
  482. .It Dv BIOCGSEESENT
  483. .Pq Li u_int
  484. These commands are obsolete but left for compatibility.
  485. Use
  486. .Dv BIOCSDIRECTION
  487. and
  488. .Dv BIOCGDIRECTION
  489. instead.
  490. Set or get the flag determining whether locally generated packets on the
  491. interface should be returned by BPF.
  492. Set to zero to see only incoming packets on the interface.
  493. Set to one to see packets originating locally and remotely on the interface.
  494. This flag is initialized to one by default.
  495. .It Dv BIOCSDIRECTION
  496. .It Dv BIOCGDIRECTION
  497. .Pq Li u_int
  498. Set or get the setting determining whether incoming, outgoing, or all packets
  499. on the interface should be returned by BPF.
  500. Set to
  501. .Dv BPF_D_IN
  502. to see only incoming packets on the interface.
  503. Set to
  504. .Dv BPF_D_INOUT
  505. to see packets originating locally and remotely on the interface.
  506. Set to
  507. .Dv BPF_D_OUT
  508. to see only outgoing packets on the interface.
  509. This setting is initialized to
  510. .Dv BPF_D_INOUT
  511. by default.
  512. .It Dv BIOCSTSTAMP
  513. .It Dv BIOCGTSTAMP
  514. .Pq Li u_int
  515. Set or get format and resolution of the time stamps returned by BPF.
  516. Set to
  517. .Dv BPF_T_MICROTIME ,
  518. .Dv BPF_T_MICROTIME_FAST ,
  519. .Dv BPF_T_MICROTIME_MONOTONIC ,
  520. or
  521. .Dv BPF_T_MICROTIME_MONOTONIC_FAST
  522. to get time stamps in 64-bit
  523. .Vt struct timeval
  524. format.
  525. Set to
  526. .Dv BPF_T_NANOTIME ,
  527. .Dv BPF_T_NANOTIME_FAST ,
  528. .Dv BPF_T_NANOTIME_MONOTONIC ,
  529. or
  530. .Dv BPF_T_NANOTIME_MONOTONIC_FAST
  531. to get time stamps in 64-bit
  532. .Vt struct timespec
  533. format.
  534. Set to
  535. .Dv BPF_T_BINTIME ,
  536. .Dv BPF_T_BINTIME_FAST ,
  537. .Dv BPF_T_BINTIME_MONOTONIC ,
  538. or
  539. .Dv BPF_T_BINTIME_MONOTONIC_FAST
  540. to get time stamps in 64-bit
  541. .Vt struct bintime
  542. format.
  543. Set to
  544. .Dv BPF_T_NONE
  545. to ignore the time stamp.
  546. All 64-bit time stamp formats are wrapped in
  547. .Vt struct bpf_ts .
  548. The
  549. .Dv BPF_T_MICROTIME_FAST ,
  550. .Dv BPF_T_NANOTIME_FAST ,
  551. .Dv BPF_T_BINTIME_FAST ,
  552. .Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
  553. .Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
  554. and
  555. .Dv BPF_T_BINTIME_MONOTONIC_FAST
  556. are analogs of corresponding formats without _FAST suffix but do not perform
  557. a full time counter query, so their accuracy is one timer tick.
  558. The
  559. .Dv BPF_T_MICROTIME_MONOTONIC ,
  560. .Dv BPF_T_NANOTIME_MONOTONIC ,
  561. .Dv BPF_T_BINTIME_MONOTONIC ,
  562. .Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
  563. .Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
  564. and
  565. .Dv BPF_T_BINTIME_MONOTONIC_FAST
  566. store the time elapsed since kernel boot.
  567. This setting is initialized to
  568. .Dv BPF_T_MICROTIME
  569. by default.
  570. .It Dv BIOCFEEDBACK
  571. .Pq Li u_int
  572. Set packet feedback mode.
  573. This allows injected packets to be fed back as input to the interface when
  574. output via the interface is successful.
  575. When
  576. .Dv BPF_D_INOUT
  577. direction is set, an injected outgoing packet is not returned by BPF to avoid
  578. duplication. This flag is initialized to zero by default.
  579. .It Dv BIOCLOCK
  580. Set the locked flag on the
  581. .Nm
  582. descriptor.
  583. This prevents the execution of
  584. ioctl commands which could change the underlying operating parameters of
  585. the device.
  586. .It Dv BIOCGETBUFMODE
  587. .It Dv BIOCSETBUFMODE
  588. .Pq Li u_int
  589. Get or set the current
  590. .Nm
  591. buffering mode; possible values are
  592. .Dv BPF_BUFMODE_BUFFER ,
  593. buffered read mode, and
  594. .Dv BPF_BUFMODE_ZBUF ,
  595. zero-copy buffer mode.
  596. .It Dv BIOCSETZBUF
  597. .Pq Li struct bpf_zbuf
  598. Set the current zero-copy buffer locations; buffer locations may be
  599. set only once zero-copy buffer mode has been selected, and prior to attaching
  600. to an interface.
  601. Buffers must be of identical size, page-aligned, and an integer multiple of
  602. pages in size.
  603. The three fields
  604. .Vt bz_bufa ,
  605. .Vt bz_bufb ,
  606. and
  607. .Vt bz_buflen
  608. must be filled out.
  609. If buffers have already been set for this device, the ioctl will fail.
  610. .It Dv BIOCGETZMAX
  611. .Pq Li size_t
  612. Get the largest individual zero-copy buffer size allowed.
  613. As two buffers are used in zero-copy buffer mode, the limit (in practice) is
  614. twice the returned size.
  615. As zero-copy buffers consume kernel address space, conservative selection of
  616. buffer size is suggested, especially when there are multiple
  617. .Nm
  618. descriptors in use on 32-bit systems.
  619. .It Dv BIOCROTZBUF
  620. Force ownership of the next buffer to be assigned to userspace, if any data
  621. is present in the buffer.
  622. If no data is present, the buffer will remain owned by the kernel.
  623. This allows consumers of zero-copy buffering to implement timeouts and
  624. retrieve partially filled buffers.
  625. In order to handle the case where no data is present in the buffer and
  626. therefore ownership is not assigned, the user process must check
  627. .Vt bzh_kernel_gen
  628. against
  629. .Vt bzh_user_gen .
  630. .El
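.Pp
The compatibility rule given under
.Dv BIOCVERSION
(major numbers match, application minor no greater than kernel minor) can be
written as a small predicate.
The sketch below is illustrative only: the type
version_pair
is a hypothetical mirror of
.Vt struct bpf_version ,
and a real program would fill the kernel side from the ioctl and the
application side from
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION .

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of struct bpf_version, for illustration only. */
struct version_pair {
	unsigned short major;
	unsigned short minor;
};

/*
 * Apply the compatibility rule: the major numbers must match and the
 * application's minor must not exceed the kernel's minor.
 */
static bool
filter_version_compatible(struct version_pair app, struct version_pair kern)
{
	return (app.major == kern.major && app.minor <= kern.minor);
}
```

A program would typically refuse to install its filter when this predicate is
false.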
  631. .Sh BPF HEADER
  632. One of the following structures is prepended to each packet returned by
  633. .Xr read 2
  634. or via a zero-copy buffer:
  635. .Bd -literal
  636. struct bpf_xhdr {
  637. struct bpf_ts bh_tstamp; /* time stamp */
  638. uint32_t bh_caplen; /* length of captured portion */
  639. uint32_t bh_datalen; /* original length of packet */
  640. u_short bh_hdrlen; /* length of bpf header (this struct
  641. plus alignment padding) */
  642. };
  643. struct bpf_hdr {
  644. struct timeval bh_tstamp; /* time stamp */
  645. uint32_t bh_caplen; /* length of captured portion */
  646. uint32_t bh_datalen; /* original length of packet */
  647. u_short bh_hdrlen; /* length of bpf header (this struct
  648. plus alignment padding) */
  649. };
  650. .Ed
  651. .Pp
  652. The fields, whose values are stored in host order, are:
  653. .Pp
  654. .Bl -tag -compact -width bh_datalen
  655. .It Li bh_tstamp
  656. The time at which the packet was processed by the packet filter.
  657. .It Li bh_caplen
  658. The length of the captured portion of the packet.
  659. This is the minimum of
  660. the truncation amount specified by the filter and the length of the packet.
  661. .It Li bh_datalen
  662. The length of the packet off the wire.
  663. This value is independent of the truncation amount specified by the filter.
  664. .It Li bh_hdrlen
  665. The length of the
  666. .Nm
  667. header, which may not be equal to
  668. .\" XXX - not really a function call
  669. .Fn sizeof "struct bpf_xhdr"
  670. or
  671. .Fn sizeof "struct bpf_hdr" .
  672. .El
  673. .Pp
  674. The
  675. .Li bh_hdrlen
  676. field exists to account for
  677. padding between the header and the link level protocol.
  678. The purpose here is to guarantee proper alignment of the packet
  679. data structures, which is required on alignment sensitive
  680. architectures and improves performance on many other architectures.
  681. The packet filter ensures that the
  682. .Vt bpf_xhdr ,
  683. .Vt bpf_hdr
  684. and the network layer
  685. header will be word aligned.
  686. Currently,
  687. .Vt bpf_hdr
  688. is used when the time stamp is set to
  689. .Dv BPF_T_MICROTIME ,
  690. .Dv BPF_T_MICROTIME_FAST ,
  691. .Dv BPF_T_MICROTIME_MONOTONIC ,
  692. .Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
  693. or
  694. .Dv BPF_T_NONE
  695. for backward compatibility reasons. Otherwise,
  696. .Vt bpf_xhdr
  697. is used. However,
  698. .Vt bpf_hdr
  699. may be deprecated in the near future.
  700. Suitable precautions
  701. must be taken when accessing the link layer protocol fields on alignment
  702. restricted machines.
  703. (This is not a problem on an Ethernet, since
  704. the type field is a short falling on an even offset,
  705. and the addresses are probably accessed in a bytewise fashion).
  706. .Pp
  707. Additionally, individual packets are padded so that each starts
  708. on a word boundary.
  709. This requires that an application
  710. has some knowledge of how to get from packet to packet.
  711. The macro
  712. .Dv BPF_WORDALIGN
  713. is defined in
  714. .In net/bpf.h
  715. to facilitate
  716. this process.
  717. It rounds up its argument to the nearest word aligned value (where a word is
  718. .Dv BPF_ALIGNMENT
  719. bytes wide).
  720. .Pp
  721. For example, if
  722. .Sq Li p
  723. points to the start of a packet, this expression
  724. will advance it to the next packet:
  725. .Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
  726. .Pp
  727. For the alignment mechanisms to work properly, the
  728. buffer passed to
  729. .Xr read 2
  730. must itself be word aligned.
  731. The
  732. .Xr malloc 3
  733. function
  734. will always return an aligned buffer.
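.Pp
The packet-to-packet stepping described above can be sketched as a loop over
one buffer.
To stay self-contained, the example defines its own stand-ins:
SIM_ALIGNMENT
and
SIM_WORDALIGN
are assumed to behave like
.Dv BPF_ALIGNMENT
and
.Dv BPF_WORDALIGN ,
and
sim_hdr
is a simplified, hypothetical header without a time stamp.
A real program would use the definitions from
net/bpf.h
instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for BPF_ALIGNMENT and BPF_WORDALIGN. */
#define SIM_ALIGNMENT	 sizeof(long)
#define SIM_WORDALIGN(x) (((x) + (SIM_ALIGNMENT - 1)) & ~(SIM_ALIGNMENT - 1))

/* Simplified, hypothetical per-packet header (no time stamp). */
struct sim_hdr {
	uint32_t caplen;	/* length of captured portion */
	uint32_t datalen;	/* original length of packet */
	uint16_t hdrlen;	/* header length including padding */
};

/*
 * Walk the records in buf[0..nread) and return how many were found.
 * The sum of their captured lengths is stored in *total_caplen.
 */
static size_t
count_packets(const unsigned char *buf, size_t nread, size_t *total_caplen)
{
	size_t off = 0, n = 0;

	assert(buf != NULL && total_caplen != NULL);
	*total_caplen = 0;
	while (off + sizeof(struct sim_hdr) <= nread) {
		struct sim_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.hdrlen < sizeof(h) || off + h.hdrlen + h.caplen > nread)
			break;	/* malformed or truncated record */
		*total_caplen += h.caplen;
		n++;
		/* Advance by header plus data, rounded up to a word. */
		off += SIM_WORDALIGN((size_t)h.hdrlen + h.caplen);
	}
	return (n);
}
```

The packet data of each record starts
hdrlen
bytes after the start of its header, which is why the header length is read
from the record rather than taken from
sizeof .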
  735. .Sh FILTER MACHINE
  736. A filter program is an array of instructions, with all branches forwardly
  737. directed, terminated by a
  738. .Em return
  739. instruction.
  740. Each instruction performs some action on the pseudo-machine state,
  741. which consists of an accumulator, index register, scratch memory store,
  742. and implicit program counter.
  743. .Pp
  744. The following structure defines the instruction format:
  745. .Bd -literal
  746. struct bpf_insn {
  747. u_short code;
  748. u_char jt;
  749. u_char jf;
  750. u_long k;
  751. };
  752. .Ed
  753. .Pp
  754. The
  755. .Li k
  756. field is used in different ways by different instructions,
  757. and the
  758. .Li jt
  759. and
  760. .Li jf
  761. fields are used as offsets
  762. by the branch instructions.
  763. The opcodes are encoded in a semi-hierarchical fashion.
  764. There are eight classes of instructions:
  765. .Dv BPF_LD ,
  766. .Dv BPF_LDX ,
  767. .Dv BPF_ST ,
  768. .Dv BPF_STX ,
  769. .Dv BPF_ALU ,
  770. .Dv BPF_JMP ,
  771. .Dv BPF_RET ,
  772. and
  773. .Dv BPF_MISC .
  774. Various other mode and
  775. operator bits are or'd into the class to give the actual instructions.
  776. The classes and modes are defined in
  777. .In net/bpf.h .
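.Pp
As a rough illustration of how class, size, and mode bits combine into one
opcode, the sketch below uses local copies of a few constants.
The
SIM_
prefix marks them as illustrative; the numeric values are believed to match
the classic BPF encoding but should be checked against
net/bpf.h
rather than relied on from here.
The helper
make_opcode
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local illustrative copies; real code uses the <net/bpf.h> names. */
#define SIM_BPF_LD	0x00	/* instruction classes */
#define SIM_BPF_LDX	0x01
#define SIM_BPF_W	0x00	/* operand sizes */
#define SIM_BPF_H	0x08
#define SIM_BPF_B	0x10
#define SIM_BPF_ABS	0x20	/* addressing modes */
#define SIM_BPF_MSH	0xa0

/* Field extractors: class in the low 3 bits, then size, then mode. */
#define SIM_BPF_CLASS(code)	((code) & 0x07)
#define SIM_BPF_SIZE(code)	((code) & 0x18)
#define SIM_BPF_MODE(code)	((code) & 0xe0)

/* OR the three fields together and check that they did not overlap. */
static uint16_t
make_opcode(uint16_t cls, uint16_t size, uint16_t mode)
{
	uint16_t code = (uint16_t)(cls | size | mode);

	assert(SIM_BPF_CLASS(code) == cls);
	assert(SIM_BPF_SIZE(code) == size);
	assert(SIM_BPF_MODE(code) == mode);
	return (code);
}
```

Under these values, a halfword load from a fixed packet offset encodes as
0x28, and the
.Dv BPF_MSH
index-register load encodes as 0xb1.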
  778. .Pp
  779. Below are the semantics for each defined
  780. .Nm
  781. instruction.
  782. We use the convention that A is the accumulator, X is the index register,
  783. P[] packet data, and M[] scratch memory store.
  784. P[i:n] gives the data at byte offset
  785. .Dq i
  786. in the packet,
  787. interpreted as a word (n=4),
  788. unsigned halfword (n=2), or unsigned byte (n=1).
  789. M[i] gives the i'th word in the scratch memory store, which is only
  790. addressed in word units.
  791. The memory store is indexed from 0 to
  792. .Dv BPF_MEMWORDS
  793. - 1.
  794. .Li k ,
  795. .Li jt ,
  796. and
  797. .Li jf
  798. are the corresponding fields in the
  799. instruction definition.
  800. .Dq len
  801. refers to the length of the packet.
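.Pp
The P[i:n] notation can be read as the following helper, offered as a sketch
rather than as the kernel's implementation.
The name
packet_load
is hypothetical, and the most-significant-byte-first interpretation is an
assumption that follows from packet data being in network byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of P[i:n]: read n bytes (1, 2, or 4) at byte offset i of a
 * packet of length len, most significant byte first.  Returns false,
 * leaving *out untouched, if the access would run past the packet.
 */
static bool
packet_load(const unsigned char *p, size_t len, size_t i, size_t n,
    uint32_t *out)
{
	uint32_t v = 0;
	size_t j;

	assert(n == 1 || n == 2 || n == 4);
	if (i > len || n > len - i)
		return (false);
	for (j = 0; j < n; j++)
		v = (v << 8) | p[i + j];
	*out = v;
	return (true);
}
```

With this helper, an accumulator load of P[k:2] corresponds to calling
packet_load
with n set to 2 and storing the result in A.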
  802. .Bl -tag -width BPF_STXx
  803. .It Dv BPF_LD
  804. These instructions copy a value into the accumulator.
  805. The type of the source operand is specified by an
  806. .Dq addressing mode
  807. and can be a constant
  808. .Pq Dv BPF_IMM ,
  809. packet data at a fixed offset
  810. .Pq Dv BPF_ABS ,
  811. packet data at a variable offset
  812. .Pq Dv BPF_IND ,
  813. the packet length
  814. .Pq Dv BPF_LEN ,
  815. or a word in the scratch memory store
  816. .Pq Dv BPF_MEM .
  817. For
  818. .Dv BPF_IND
  819. and
  820. .Dv BPF_ABS ,
  821. the data size must be specified as a word
  822. .Pq Dv BPF_W ,
  823. halfword
  824. .Pq Dv BPF_H ,
  825. or byte
  826. .Pq Dv BPF_B .
  827. The semantics of all the recognized
  828. .Dv BPF_LD
  829. instructions follow.
  830. .Bd -literal
  831. BPF_LD+BPF_W+BPF_ABS A <- P[k:4]
  832. BPF_LD+BPF_H+BPF_ABS A <- P[k:2]
  833. BPF_LD+BPF_B+BPF_ABS A <- P[k:1]
  834. BPF_LD+BPF_W+BPF_IND A <- P[X+k:4]
  835. BPF_LD+BPF_H+BPF_IND A <- P[X+k:2]
  836. BPF_LD+BPF_B+BPF_IND A <- P[X+k:1]
  837. BPF_LD+BPF_W+BPF_LEN A <- len
  838. BPF_LD+BPF_IMM A <- k
  839. BPF_LD+BPF_MEM A <- M[k]
  840. .Ed
  841. .It Dv BPF_LDX
  842. These instructions load a value into the index register.
  843. Note that
  844. the addressing modes are more restrictive than those of the accumulator loads,
  845. but they include
  846. .Dv BPF_MSH ,
  847. a hack for efficiently loading the IP header length.
  848. .Bd -literal
  849. BPF_LDX+BPF_W+BPF_IMM X <- k
  850. BPF_LDX+BPF_W+BPF_MEM X <- M[k]
  851. BPF_LDX+BPF_W+BPF_LEN X <- len
  852. BPF_LDX+BPF_B+BPF_MSH X <- 4*(P[k:1]&0xf)
  853. .Ed
  854. .It Dv BPF_ST
  855. This instruction stores the accumulator into the scratch memory.
  856. We do not need an addressing mode since there is only one possibility
  857. for the destination.
  858. .Bd -literal
  859. BPF_ST M[k] <- A
  860. .Ed
  861. .It Dv BPF_STX
  862. This instruction stores the index register in the scratch memory store.
  863. .Bd -literal
  864. BPF_STX M[k] <- X
  865. .Ed
  866. .It Dv BPF_ALU
  867. The ALU instructions perform operations between the accumulator and
  868. index register or constant, and store the result back in the accumulator.
  869. For binary operations, a source mode is required
  870. .Dv ( BPF_K
  871. or
  872. .Dv BPF_X ) .
  873. .Bd -literal
  874. BPF_ALU+BPF_ADD+BPF_K A <- A + k
  875. BPF_ALU+BPF_SUB+BPF_K A <- A - k
  876. BPF_ALU+BPF_MUL+BPF_K A <- A * k
  877. BPF_ALU+BPF_DIV+BPF_K A <- A / k
  878. BPF_ALU+BPF_AND+BPF_K A <- A & k
  879. BPF_ALU+BPF_OR+BPF_K A <- A | k
  880. BPF_ALU+BPF_LSH+BPF_K A <- A << k
  881. BPF_ALU+BPF_RSH+BPF_K A <- A >> k
  882. BPF_ALU+BPF_ADD+BPF_X A <- A + X
  883. BPF_ALU+BPF_SUB+BPF_X A <- A - X
  884. BPF_ALU+BPF_MUL+BPF_X A <- A * X
  885. BPF_ALU+BPF_DIV+BPF_X A <- A / X
  886. BPF_ALU+BPF_AND+BPF_X A <- A & X
  887. BPF_ALU+BPF_OR+BPF_X A <- A | X
  888. BPF_ALU+BPF_LSH+BPF_X A <- A << X
  889. BPF_ALU+BPF_RSH+BPF_X A <- A >> X
  890. BPF_ALU+BPF_NEG A <- -A
  891. .Ed
.It Dv BPF_JMP
The jump instructions alter the flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32-bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA          pc += k
BPF_JMP+BPF_JGT+BPF_K   pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K   pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K   pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K  pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X   pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X   pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X   pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X  pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A           accept A bytes
BPF_RET+BPF_K           accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX        X <- A
BPF_MISC+BPF_TXA        A <- X
.Ed
.El
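.Pp
The packet loads listed above read multi-byte values in network byte order.
The following is a minimal sketch of that behaviour in ordinary C.
The
.Fn pkt_load_word ,
.Fn pkt_load_half ,
and
.Fn pkt_load_msh
helpers are hypothetical names introduced only for illustration and are
not part of the
.Nm
interface; the bounds checks are an assumption about how an out-of-range
access should be treated.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers that mirror the load semantics in the tables above.
 * Each returns 0 on success and -1 if the access would run past the end
 * of the packet.
 */

/* A <- P[k:4]: four bytes at offset k, most significant byte first. */
static int
pkt_load_word(const uint8_t *p, size_t len, size_t k, uint32_t *out)
{
	if (len < 4 || k > len - 4)
		return (-1);
	*out = ((uint32_t)p[k] << 24) | ((uint32_t)p[k + 1] << 16) |
	    ((uint32_t)p[k + 2] << 8) | (uint32_t)p[k + 3];
	return (0);
}

/* A <- P[k:2]: two bytes at offset k, most significant byte first. */
static int
pkt_load_half(const uint8_t *p, size_t len, size_t k, uint32_t *out)
{
	if (len < 2 || k > len - 2)
		return (-1);
	*out = ((uint32_t)p[k] << 8) | (uint32_t)p[k + 1];
	return (0);
}

/* X <- 4*(P[k:1]&0xf): the BPF_MSH mode, e.g. an IP header length. */
static int
pkt_load_msh(const uint8_t *p, size_t len, size_t k, uint32_t *out)
{
	if (k >= len)
		return (-1);
	*out = 4 * (uint32_t)(p[k] & 0x0f);
	return (0);
}
```

Applied to the first byte of an IP header, whose low nibble holds the
header length in 32-bit words,
.Fn pkt_load_msh
yields the header length in bytes.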
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
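.Pp
As a rough illustration of what such initializer macros produce, the
sketch below defines stand-ins for the instruction structure and the two
macros.
The
.Vt struct sk_insn
type and the
.Dv SK_STMT ,
.Dv SK_JUMP ,
and
.Fn sk_next_pc
names are hypothetical, and the field layout is an assumption modelled on
the instruction format; real programs should use the definitions from
.In net/bpf.h .

```c
#include <stdint.h>

/*
 * Illustrative stand-in for a filter instruction.  The sk_ and SK_ names
 * are hypothetical and exist only for this sketch.
 */
struct sk_insn {
	uint16_t code;	/* opcode */
	uint8_t  jt;	/* offset taken when the test is true */
	uint8_t  jf;	/* offset taken when the test is false */
	uint32_t k;	/* generic operand */
};

/* Non-jump instruction: both jump offsets are zero. */
#define SK_STMT(code, k)		{ (uint16_t)(code), 0, 0, (k) }
/* Conditional jump: note that the operand comes before the offsets. */
#define SK_JUMP(code, k, jt, jf)	{ (uint16_t)(code), (jt), (jf), (k) }

/*
 * Index of the instruction executed after the conditional jump at index
 * pc.  Offsets count from the instruction that follows the jump.
 */
static unsigned int
sk_next_pc(unsigned int pc, const struct sk_insn *insn, int cond)
{
	return (pc + 1 + (cond ? insn->jt : insn->jf));
}
```

With offsets of 0 and 3, a conditional jump at index 1 continues at index 2
when the test is true and at index 5 when it is false.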
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem:
.Bl -tag -width indent
.It Va net.bpf.optimize_writers: No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, and DHCP relays are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on attaches new BPF users to a write-only interface list
until the program explicitly specifies a read filter via
.Fn pcap_setfilter .
This avoids the performance degradation that such write-only users would
otherwise cause on high-speed interfaces.
.It Va net.bpf.stats:
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable: No 0
Permits zero-copy to be used with net BPF readers.
Use with caution.
.It Va net.bpf.maxinsns: No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize: No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize: No 4096
Default buffer size to allocate for the packet buffer.
.El
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
	/* A <- EtherType (offset 12 in the Ethernet header) */
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
	/* A <- ARP operation (offset 6 in the ARP header) */
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
	/* accept: return the size of a complete request */
	BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
	    sizeof(struct ether_header)),
	/* reject */
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
This filter accepts only IP packets between hosts 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
	/* A <- IP source address */
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
	/* A <- IP destination address */
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
	/* source was not .15: try the opposite direction */
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
	/* accept the whole packet */
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	/* reject */
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
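.Pp
The constants 0x8003700f and 0x80037023 in the filter above are the two
host addresses written as 32-bit values in network byte order, and the
offsets 26 and 30 are the 14-byte Ethernet header plus the offsets of the
source and destination address fields in the IP header.
A minimal sketch of the address conversion follows;
.Fn ipv4_const
is a hypothetical helper used only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack a dotted-quad IPv4 address a.b.c.d into the
 * 32-bit value that a BPF_W load of the address field places in the
 * accumulator.
 */
static uint32_t
ipv4_const(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return (((uint32_t)a << 24) | ((uint32_t)b << 16) |
	    ((uint32_t)c << 8) | (uint32_t)d);
}
```

For example,
.Li ipv4_const(128, 3, 112, 15)
evaluates to 0x8003700f.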
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
	/* A <- IP protocol */
	BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
	/* A <- IP flags and fragment offset */
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
	/* X <- IP header length in bytes */
	BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
	/* A <- TCP source port */
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
	/* A <- TCP destination port */
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
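.Pp
To make the offsets in the finger filter easier to follow, the same checks
can be sketched as ordinary C operating on a captured Ethernet frame.
The
.Fn rd16
and
.Fn is_finger
functions are hypothetical and are not part of the
.Nm
interface; the explicit length checks are an assumption standing in for
the bounds checking that a filter machine would perform on each load.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit big-endian value; the caller has already checked bounds. */
static uint16_t
rd16(const uint8_t *p)
{
	return ((uint16_t)((p[0] << 8) | p[1]));
}

/*
 * Hypothetical user-space equivalent of the finger filter, for an
 * Ethernet frame of len bytes.  Returns 1 to accept and 0 to reject.
 */
static int
is_finger(const uint8_t *p, size_t len)
{
	size_t ihl;

	if (len < 14 + 20)			/* Ethernet + minimal IP header */
		return (0);
	if (rd16(p + 12) != 0x0800)		/* EtherType is not IP */
		return (0);
	if (p[23] != 6)				/* IP protocol is not TCP */
		return (0);
	if (rd16(p + 20) & 0x1fff)		/* not the first fragment */
		return (0);
	ihl = 4 * (size_t)(p[14] & 0x0f);	/* same value BPF_MSH computes */
	if (len < 14 + ihl + 4)			/* both TCP ports must be present */
		return (0);
	return (rd16(p + 14 + ihl) == 79 ||	/* TCP source port */
	    rd16(p + 16 + ihl) == 79);		/* TCP destination port */
}
```

The variable
.Va ihl
plays the role of the index register: once it holds the IP header length,
the two port fields are found at fixed distances from it.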
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr byteorder 3 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
Data link protocols with variable length headers are not currently supported.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.