  1. .\" Copyright (C) 2001 Matthew Dillon. All rights reserved.
  2. .\"
  3. .\" Redistribution and use in source and binary forms, with or without
  4. .\" modification, are permitted provided that the following conditions
  5. .\" are met:
  6. .\" 1. Redistributions of source code must retain the above copyright
  7. .\" notice, this list of conditions and the following disclaimer.
  8. .\" 2. Redistributions in binary form must reproduce the above copyright
  9. .\" notice, this list of conditions and the following disclaimer in the
  10. .\" documentation and/or other materials provided with the distribution.
  11. .\"
  12. .\" THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  13. .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  14. .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  15. .\" ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
  16. .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  17. .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  18. .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  19. .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  20. .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  21. .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  22. .\" SUCH DAMAGE.
  23. .\"
  24. .\" $FreeBSD$
  25. .\"
  26. .Dd May 11, 2012
  27. .Dt TUNING 7
  28. .Os
.Sh NAME
.Nm tuning
.Nd performance tuning under FreeBSD
.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
When using
.Xr bsdlabel 8
or
.Xr sysinstall 8
to lay out your file systems on a hard disk it is important to remember
that hard drives can transfer data much more quickly from outer tracks
than they can from inner tracks.
To take advantage of this you should
try to pack your smaller file systems and swap closer to the outer tracks,
follow with the larger file systems, and end with the largest file systems.
It is also important to size system standard file systems such that you
will not be forced to resize them later as you scale the machine up.
I usually create, in order, a 128M root, 1G swap, 128M
.Pa /var ,
128M
.Pa /var/tmp ,
3G
.Pa /usr ,
and use any remaining space for
.Pa /home .
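.Pp
As a sketch only (the device name and exact label contents are
hypothetical), such a layout edited via
.Dq Li "bsdlabel -e ad0s1"
might look roughly like:
.Bd -literal -offset indent
#       size   offset    fstype
a:      128M        *    4.2BSD    # /
b:        1G        *      swap    # swap
d:      128M        *    4.2BSD    # /var
e:      128M        *    4.2BSD    # /var/tmp
f:        3G        *    4.2BSD    # /usr
g:         *        *    4.2BSD    # /home (rest of disk)
.Ed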
.Pp
You should typically size your swap space to approximately 2x main memory
for systems with less than 2GB of RAM, or approximately 1x main memory
if you have more.
If you do not have a lot of RAM, though, you will generally want a lot
more swap.
It is not recommended that you configure any less than
256M of swap on a system and you should keep in mind future memory
expansion when sizing the swap partition.
The kernel's VM paging algorithms are tuned to perform best when there is
at least 2x swap versus main memory.
Configuring too little swap can lead
to inefficiencies in the VM page scanning code as well as create issues
later on if you add more memory to your machine.
Finally, on larger systems
with multiple SCSI disks (or multiple IDE disks operating on different
controllers), we strongly recommend that you configure swap on each drive.
The swap partitions on the drives should be approximately the same size.
The kernel can handle arbitrary sizes but
internal data structures scale to 4 times the largest swap partition.
Keeping
the swap partitions near the same size will allow the kernel to optimally
stripe swap space across the N disks.
Do not worry about overdoing it a
little; swap space is the saving grace of
.Ux
and even if you do not normally use much swap, it can give you more time to
recover from a runaway program before being forced to reboot.
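.Pp
For example (the device names are hypothetical), equal-sized swap
partitions on two disks might appear in
.Pa /etc/fstab
as:
.Bd -literal -offset indent
/dev/da0b    none    swap    sw    0    0
/dev/da1b    none    swap    sw    0    0
.Ed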
.Pp
How you size your
.Pa /var
partition depends heavily on what you intend to use the machine for.
This
partition is primarily used to hold mailboxes, the print spool, and log
files.
Some people even make
.Pa /var/log
its own partition (but except for extreme cases it is not worth the waste
of a partition ID).
If your machine is intended to act as a mail
or print server,
or you are running a heavily visited web server, you should consider
creating a much larger partition \(en perhaps a gig or more.
It is very easy
to underestimate log file storage requirements.
.Pp
Sizing
.Pa /var/tmp
depends on the kind of temporary file usage you think you will need.
128M is
the minimum we recommend.
Also note that sysinstall will create a
.Pa /tmp
directory.
Dedicating a partition for temporary file storage is important for
two reasons: first, it reduces the possibility of file system corruption
in a crash, and second it reduces the chance of a runaway process that
fills up
.Oo Pa /var Oc Ns Pa /tmp
from blowing up more critical subsystems (mail,
logging, etc).
Filling up
.Oo Pa /var Oc Ns Pa /tmp
is a very common problem to have.
.Pp
In the old days there were differences between
.Pa /tmp
and
.Pa /var/tmp ,
but the introduction of
.Pa /var
(and
.Pa /var/tmp )
led to massive confusion
by program writers, so today programs haphazardly use one or the
other and thus no real distinction can be made between the two.
So it makes sense to have just one temporary directory and
softlink to it from the other
.Pa tmp
directory locations.
However you handle
.Pa /tmp ,
the one thing you do not want to do is leave it sitting
on the root partition where it might cause root to fill up or possibly
corrupt root in a crash/reboot situation.
.Pp
The
.Pa /usr
partition holds the bulk of the files required to support the system and
a subdirectory within it called
.Pa /usr/local
holds the bulk of the files installed from the
.Xr ports 7
hierarchy.
If you do not use ports all that much and do not intend to keep
system source
.Pq Pa /usr/src
on the machine, you can get away with
a 1 GB
.Pa /usr
partition.
However, if you install a lot of ports
(especially window managers and Linux-emulated binaries), we recommend
at least a 2 GB
.Pa /usr
and if you also intend to keep system source
on the machine, we recommend a 3 GB
.Pa /usr .
Do not underestimate the
amount of space you will need in this partition; it can creep up and
surprise you!
.Pp
The
.Pa /home
partition is typically used to hold user-specific data.
I usually size it to the remainder of the disk.
.Pp
Why partition at all?
Why not create one big
.Pa /
partition and be done with it?
Then I do not have to worry about undersizing things!
Well, there are several reasons this is not a good idea.
First,
each partition has different operational characteristics and separating them
allows the file system to tune itself to those characteristics.
For example,
the root and
.Pa /usr
partitions are read-mostly, with very little writing, while
a lot of reading and writing could occur in
.Pa /var
and
.Pa /var/tmp .
By properly
partitioning your system, fragmentation introduced in the smaller, more
heavily write-loaded partitions will not bleed over into the mostly-read
partitions.
Additionally, keeping the write-loaded partitions closer to
the edge of the disk (i.e., before the really big partitions instead of after
in the partition table) will increase I/O performance in the partitions
where you need it the most.
Now it is true that you might also need I/O
performance in the larger partitions, but they are so large that shifting
them more towards the edge of the disk will not lead to a significant
performance improvement whereas moving
.Pa /var
to the edge can have a huge impact.
Finally, there are safety concerns.
Having a small neat root partition that
is essentially read-only gives it a greater chance of surviving a bad crash
intact.
.Pp
Properly partitioning your system also allows you to tune
.Xr newfs 8
and
.Xr tunefs 8
parameters.
Tuning
.Xr newfs 8
requires more experience but can lead to significant improvements in
performance.
There are three parameters that are relatively safe to tune:
.Em blocksize , bytes/i-node ,
and
.Em cylinders/group .
.Pp
.Fx
performs best when using 16K or 32K file system block sizes.
The default file system block size is 32K,
which provides best performance for most applications,
with the exception of those that perform random access on large files
(such as database server software).
Such applications tend to perform better with a smaller block size,
although modern disk characteristics are such that the performance
gain from using a smaller block size may not be worth consideration.
Using a block size larger than 32K
can cause fragmentation of the buffer cache and
lead to lower performance.
.Pp
The defaults may be unsuitable
for a file system that requires a very large number of i-nodes
or is intended to hold a large number of very small files.
Such a file system should be created with a 4K, 8K, or 16K block size.
This also requires you to specify a smaller
fragment size.
We recommend always using a fragment size that is 1/8
the block size (less testing has been done on other fragment size factors).
The
.Xr newfs 8
options for this would be
.Dq Li "newfs -f 1024 -b 8192 ..." .
.Pp
If a large partition is intended to be used to hold fewer, larger files, such
as database files, you can increase the
.Em bytes/i-node
ratio which reduces the number of i-nodes (maximum number of files and
directories that can be created) for that partition.
Decreasing the number
of i-nodes in a file system can greatly reduce
.Xr fsck 8
recovery times after a crash.
Do not use this option
unless you are actually storing large files on the partition, because if you
overcompensate you can wind up with a file system that has lots of free
space remaining but cannot accommodate any more files.
Using 65536, 131072, or 262144 bytes/i-node is recommended.
You can go higher but
it will have only incremental effects on
.Xr fsck 8
recovery times.
For example,
.Dq Li "newfs -i 65536 ..." .
.Pp
.Xr tunefs 8
may be used to further tune a file system.
This command can be run in
single-user mode without having to reformat the file system.
However, this is possibly the most abused program in the system.
Many people attempt to
increase available file system space by setting the min-free percentage to 0.
This can lead to severe file system fragmentation and we do not recommend
that you do this.
Really the only
.Xr tunefs 8
option worthwhile here is turning on
.Em softupdates
with
.Dq Li "tunefs -n enable /filesystem" .
(Note: in
.Fx 4.5
and later, softupdates can be turned on using the
.Fl U
option to
.Xr newfs 8 ,
and
.Xr sysinstall 8
will typically enable softupdates automatically for non-root file systems).
Softupdates drastically improves meta-data performance, mainly file
creation and deletion.
We recommend enabling softupdates on most file systems; however, there
are two limitations to softupdates that you should be aware of when
determining whether to use it on a file system.
First, softupdates guarantees file system consistency in the
case of a crash but could very easily be several seconds (even a minute!\&)
behind on pending writes to the physical disk.
If you crash you may lose more work
than otherwise.
Secondly, softupdates delays the freeing of file system
blocks.
If you have a file system (such as the root file system) which is
close to full, doing a major update of it, e.g.\&
.Dq Li "make installworld" ,
can run it out of space and cause the update to fail.
For this reason, softupdates will not be enabled on the root file system
during a typical install.
There is no loss of performance since the root
file system is rarely written to.
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Only use this option in conjunction with
.Xr gjournal 8 ,
as it is far too dangerous on a normal file system.
A less dangerous and more
useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
file systems normally update the last-accessed time of a file or
directory whenever it is accessed.
This operation is handled in
.Fx
with a delayed write and normally does not create a burden on the system.
However, if your system is accessing a huge number of files on a continuing
basis the buffer cache can wind up getting polluted with atime updates,
creating a burden on the system.
For example, if you are running a heavily
loaded web site, or a news server with lots of readers, you might want to
consider turning off atime updates on your larger partitions with this
.Xr mount 8
option.
However, you should not gratuitously turn off atime
updates everywhere.
For example, the
.Pa /var
file system customarily
holds mailboxes, and atime (in combination with mtime) is used to
determine whether a mailbox has new mail.
You might as well leave
atime turned on for mostly read-only partitions such as
.Pa /
and
.Pa /usr
as well.
This is especially useful for
.Pa /
since some system utilities
use the atime field for reporting.
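.Pp
For example (the device name is hypothetical), atime updates can be
disabled on a partition via an
.Pa /etc/fstab
entry such as:
.Bd -literal -offset indent
/dev/da0s1g    /home    ufs    rw,noatime    2    2
.Ed
.Pp
The same effect can be had on an already-mounted file system with
.Dq Li "mount -u -o noatime /home" .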
.Sh STRIPING DISKS
In larger systems you can stripe partitions from several drives together
to create a much larger overall partition.
Striping can also improve
the performance of a file system by splitting I/O operations across two
or more disks.
The
.Xr gstripe 8 ,
.Xr gvinum 8 ,
and
.Xr ccdconfig 8
utilities may be used to create simple striped file systems.
Generally
speaking, striping smaller partitions such as the root and
.Pa /var/tmp ,
or essentially read-only partitions such as
.Pa /usr
is a complete waste of time.
You should only stripe partitions that require serious I/O performance,
typically
.Pa /var , /home ,
or custom partitions used to hold databases and web pages.
Choosing the proper stripe size is also
important.
File systems tend to store meta-data on power-of-2 boundaries
and you usually want to reduce seeking rather than increase seeking.
This
means you want to use a large off-center stripe size such as 1152 sectors
so sequential I/O does not seek both disks and so meta-data is distributed
across both disks rather than concentrated on a single disk.
If
you really need to get sophisticated, we recommend using a real hardware
RAID controller from the list of
.Fx
supported controllers.
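.Pp
As a sketch only (the device names are hypothetical), a two-disk stripe
with an off-center stripe size of 1152 sectors (1152 * 512 = 589824 bytes)
might be created with
.Xr gstripe 8
as follows:
.Bd -literal -offset indent
gstripe label -v -s 589824 data /dev/da0 /dev/da1
newfs -U /dev/stripe/data
.Ed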
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
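.Pp
A sysctl may be changed for the running system with the
.Xr sysctl 8
utility, or set at every boot by listing it in
.Xr sysctl.conf 5 .
For example, to raise
.Va kern.ipc.somaxconn
(discussed below):
.Bd -literal -offset indent
# from the command line, for the running system:
sysctl kern.ipc.somaxconn=1024

# or persistently, in /etc/sysctl.conf:
kern.ipc.somaxconn=1024
.Ed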
.Pp
The
.Va vm.overcommit
sysctl defines the overcommit behaviour of the vm subsystem.
The virtual memory system always does accounting of the swap space
reservation, both in total for the system and per-user.
Corresponding values
are available through sysctl
.Va vm.swap_total ,
that gives the total bytes available for swapping, and
.Va vm.swap_reserved ,
that gives the number of bytes that may be needed to back all currently
allocated anonymous memory.
.Pp
Setting bit 0 of the
.Va vm.overcommit
sysctl causes the virtual memory system to return failure
to the process when allocation of memory causes
.Va vm.swap_reserved
to exceed
.Va vm.swap_total .
Bit 1 of the sysctl enforces the
.Dv RLIMIT_SWAP
limit
(see
.Xr getrlimit 2 ) .
Root is exempt from this limit.
Bit 2 allows most of the physical
memory to be counted as allocatable, except wired and free reserved pages
(accounted by the
.Va vm.stats.vm.v_free_target
and
.Va vm.stats.vm.v_wire_count
sysctls, respectively).
.Pp
The
.Va kern.ipc.maxpipekva
loader tunable is used to set a hard limit on the
amount of kernel address space allocated to mapping of pipe buffers.
Use of the mapping allows the kernel to eliminate a copy of the
data from the writer's address space into the kernel, directly copying
the content of the mapped buffer to the reader.
Increasing this value to a higher setting, such as `25165824', might
improve performance on systems where space for mapping pipe buffers
is quickly exhausted.
This exhaustion is not fatal, however; it will only cause pipes
to fall back to using double-copy.
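.Pp
Since this is a loader tunable, it must be set in
.Xr loader.conf 5
and takes effect at the next boot, for example:
.Bd -literal -offset indent
kern.ipc.maxpipekva="25165824"
.Ed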
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on).
Setting
this parameter to 1 will cause all System V shared memory segments to be
mapped to unpageable physical RAM.
This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds)
of processes, or (B) mapping large amounts of shared memory across any
number of processes.
This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
.Pp
The
.Va vfs.vmiodirenable
sysctl defaults to 1 (on).
This parameter controls how directories are cached
by the system.
Most directories are small and use but a single fragment
(typically 2K) in the file system and even less (typically 512 bytes) in
the buffer cache.
However, when operating in the default mode the buffer
cache will only cache a fixed number of directories even if you have a huge
amount of memory.
Turning on this sysctl allows the buffer cache to use
the VM Page Cache to cache the directories.
The advantage is that all of
memory is now available for caching directories.
The disadvantage is that
the minimum in-core memory used to cache a directory is the physical page
size (typically 4K) rather than 512 bytes.
We recommend turning this option off in memory-constrained environments;
however, when on, it will substantially improve the performance of services
that manipulate a large number of files.
Such services can include web caches, large mail systems, and news systems.
Turning on this option will generally not reduce performance even with the
wasted memory but you should experiment to find out.
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).
This tells the file system to issue media
writes as full clusters are collected, which typically occurs when writing
large sequential files.
The idea is to avoid saturating the buffer
cache with dirty buffers when it would not benefit I/O performance.
However,
this may stall processes and under certain circumstances you may wish to turn
it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system-wide at any given time.
It is used by the UFS file system.
The default is self-tuned and
usually sufficient but on machines with advanced controllers and lots
of disks this may be tuned up to match what the controllers buffer.
Configuring this setting to match tagged queuing capabilities of
controllers or drives with average IO size used in production works
best (for example: 16 MiB will use 128 tags with IO requests of 128 KiB).
Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.
Do not set this value arbitrarily high!
Higher write queueing values may also add latency to reads occurring at
the same time.
.Pp
The
.Va vfs.read_max
sysctl governs VFS read-ahead and is expressed as the number of blocks
to pre-read if the heuristics algorithm decides that the reads are
issued sequentially.
It is used by the UFS, ext2fs and msdosfs file systems.
With the default UFS block size of 32 KiB, a setting of 64 will allow
speculatively reading up to 2 MiB.
This setting may be increased to get around disk I/O latencies, especially
where these latencies are large such as in virtual machine emulated
environments.
It may be tuned down in specific cases where the I/O load is such that
read-ahead adversely affects performance or where system memory is really
low.
.Pp
The
.Va vfs.ncsizefactor
sysctl defines how large the VFS namecache may grow.
The number of currently allocated entries in the namecache is provided by
the
.Va debug.numcache
sysctl and the condition
debug.numcache < kern.maxvnodes * vfs.ncsizefactor
is adhered to.
.Pp
The
.Va vfs.ncnegfactor
sysctl defines how many negative entries the VFS namecache is allowed to
create.
The number of currently allocated negative entries is provided by the
.Va debug.numneg
sysctl and the condition
vfs.ncnegfactor * debug.numneg < debug.numcache
is adhered to.
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
The default sending buffer is 32K; the default receiving buffer
is 64K.
You can often
improve bandwidth utilization by increasing the default at the cost of
eating up more kernel memory for each connection.
We do not recommend
increasing the defaults if you are serving hundreds or thousands of
simultaneous connections because it is possible to quickly run the system
out of memory due to stalled connections building up.
But if you need
high bandwidth over a smaller number of connections, especially if you have
gigabit Ethernet, increasing these defaults can make a huge difference.
You can adjust the buffer size for incoming and outgoing data separately.
For example, if your machine is primarily doing web serving you may want
to decrease the recvspace in order to be able to increase the
sendspace without eating too much kernel memory.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
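.Pp
As a sketch only (the network numbers are hypothetical), route-specific
buffer defaults can be set with
.Xr route 8 :
.Bd -literal -offset indent
route change -net 192.168.1.0/24 -sendpipe 65536 -recvpipe 65536
.Ed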
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
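.Pp
A minimal sketch of such a policy (70% of a 1.544 Mbit/s T1 is roughly
1080 Kbit/s) using a
.Xr dummynet 4
pipe might look like:
.Bd -literal -offset indent
ipfw pipe 1 config bw 1080Kbit/s
ipfw add pipe 1 tcp from any 80 to any out
.Ed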
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will
result in only a marginal performance improvement unless both hosts support
the window scaling extension of the TCP protocol, which is controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC1323 support is enabled by default.
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is enabled for all applications; by setting this
sysctl to 0, only applications that specifically request keepalives
will use them.
In most environments, TCP keepalives will improve the management of
system state by expiring dead TCP connections, particularly for
systems serving dialup users who may not always terminate individual
TCP connections before disconnecting from the network.
However, in some environments, temporary network outages may be
incorrectly identified as dead sessions, resulting in unexpectedly
terminated TCP connections.
In such environments, setting the sysctl to 0 may reduce the occurrence of
TCP session disconnections.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.
Historically speaking, this feature
was designed to allow the acknowledgement of transmitted data to be returned
along with the response.
For example, when you type over a remote shell,
the acknowledgement of the character you send can be returned along with the
data representing the echo of the character.
With delayed acks turned off,
the acknowledgement may be sent in its own packet, before the remote service
has a chance to echo the data it just received.
This same concept also
applies to any interactive protocol (e.g.\& SMTP, WWW, POP3), and can cut the
number of tiny packets flowing across the network in half.
The
.Fx
delayed ACK implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.
Normally the worst a delayed ACK can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.
While we are not sure, we believe that
the several FAQs related to packages such as SAMBA and SQUID which advise
turning off delayed acks may be referring to the slow-start issue.
In
.Fx ,
it would be more beneficial to increase the slow-start flightsize via
the
.Va net.inet.tcp.slowstart_flightsize
sysctl rather than disable delayed acks.
.Pp
The
.Va net.inet.tcp.inflight.enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.
This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.
If you enable this option, you should also be sure to set
.Va net.inet.tcp.inflight.debug
to 0 (disable debugging), and for production use setting
.Va net.inet.tcp.inflight.min
to at least 6144 may be beneficial.
Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reduces the amount of data built
up in the local host's interface queue.
With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.
However, note that this feature
only affects data transmission (uploading / server-side).
It does not
affect data reception (downloading).
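.Pp
A minimal sketch of such a configuration in
.Xr sysctl.conf 5 ,
using the values suggested above, might be:
.Bd -literal -offset indent
net.inet.tcp.inflight.enable=1
net.inet.tcp.inflight.debug=0
net.inet.tcp.inflight.min=6144
.Ed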
.Pp
Adjusting
.Va net.inet.tcp.inflight.stab
is not recommended.
This parameter defaults to 20, representing 2 maximal packets added
to the bandwidth delay product window calculation.
The additional
window is required to stabilize the algorithm and improve responsiveness
to changing conditions, but it can also result in higher ping times
over slow links (though still much lower than you would get without
the inflight algorithm).
In such cases you may
wish to try reducing this parameter to 15, 10, or 5, and you may also
have to reduce
.Va net.inet.tcp.inflight.min
(for example, to 3500) to get the desired effect.
Reducing these parameters
should be done as a last resort only.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.
There are three ranges: a low range, a default range, and a
high range, selectable via the
.Dv IP_PORTRANGE
.Xr setsockopt 2
call.
Most
network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 49152 and 65535, respectively.
Bound port ranges are
used for outgoing connections, and it is possible to run the system out
of ports under certain circumstances.
This most commonly occurs when you are
running a heavily loaded web proxy.
The port range is not an issue
when running a server which handles mainly incoming connections, such as a
normal web server, or has a limited number of outgoing connections, such
as a mail relay.
For situations where you may run out of ports,
we recommend decreasing
.Va net.inet.ip.portrange.first
modestly.
A range of 10000 to 30000 ports may be reasonable.
You should also consider firewall effects when changing the port range.
Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect systems
to use higher ranges of ports for outgoing connections.
By default
.Va net.inet.ip.portrange.last
is set at the maximum allowable port number.
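.Pp
For example, to use the 10000 to 30000 range suggested above, one might
place the following in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
net.inet.ip.portrange.first=10000
net.inet.ip.portrange.last=30000
.Ed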
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments,
we recommend increasing this value to 1024 or higher.
The service daemon
may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is
typically a few thousand but you may need to bump this up to ten or twenty
thousand if you are running databases or large descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems
tend to generate a great deal of continuous pressure on free memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand
to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory sooner
rather than later, eating more swap and disk bandwidth.
In a small system
this option will have a detrimental effect but in a large system that is
already doing moderate paging this option allows the VM system to stage
whole processes into and out of memory more easily.
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including defaults
for the maximum number of open files, sizing of network memory resources, etc.
As of
.Fx 4.5 ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of the
read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
Systems older than
.Fx 4.4
must set this value via the kernel
.Xr config 8
option
.Cd maxusers
instead.
.Pp
The
.Va kern.dfldsiz
and
.Va kern.dflssiz
tunables set the default soft limits for process data and stack size
respectively.
Processes may increase these up to the hard limits by calling
.Xr setrlimit 2 .
The
.Va kern.maxdsiz ,
.Va kern.maxssiz ,
and
.Va kern.maxtsiz
tunables set the hard limits for process data, stack, and text size
respectively; processes may not exceed these limits.
The
.Va kern.sgrowsiz
tunable controls how much the stack segment will grow when a process
needs to allocate more stack.
.Pp
.Va kern.ipc.nmbclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
You can do a simple calculation to figure out how many you need.
If you have a web server which maxes out at 1000 simultaneous connections,
and each connection eats a 16K receive and 16K send buffer, you need
approximately 32MB worth of network buffers to deal with it.
A good rule of
thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768.
So for this case
you would want to set
.Va kern.ipc.nmbclusters
to 32768.
We recommend values between
1024 and 4096 for machines with moderate amounts of memory, and between 4096
and 32768 for machines with greater amounts of memory.
Under no circumstances
should you specify an arbitrarily high value for this parameter; it could
lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
Older versions of
.Fx
do not have this tunable and require that the
kernel
.Xr config 8
option
.Dv NMBCLUSTERS
be set instead.
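.Pp
Continuing the example above, the value would be set in
.Xr loader.conf 5
as:
.Bd -literal -offset indent
kern.ipc.nmbclusters="32768"
.Ed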
.Pp
More and more programs are using the
.Xr sendfile 2
system call to transmit files over the network.
The
.Va kern.ipc.nsfbufs
sysctl controls the number of file system buffers
.Xr sendfile 2
is allowed to use to perform its work.
This parameter nominally scales
with
.Va kern.maxusers
so you should not need to modify this parameter except under extreme
circumstances.
See the
.Sx TUNING
section in the
.Xr sendfile 2
manual page for details.
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
.Dv SCSI_DELAY
may be used to reduce system boot times.
The defaults are fairly high and
can be responsible for 5+ seconds of delay in the boot process.
Reducing
.Dv SCSI_DELAY
to something below 5 seconds could work (especially with modern drives).
.Pp
There are a number of
.Dv *_CPU
options that can be commented out.
If you only want the kernel to run
on a Pentium class CPU, you can easily remove
.Dv I486_CPU ,
but only remove
.Dv I586_CPU
if you are sure your CPU is being recognized as a Pentium II or better.
Some clones may be recognized as a Pentium or even a 486 and not be able
to boot without those options.
If it works, great!
The operating system
will be able to better use higher-end CPU features for MMU, task switching,
timebase, and even device operations.
Additionally, higher-end CPUs support
4MB MMU pages, which the kernel uses to map the kernel itself into memory,
increasing its efficiency under heavy syscall loads.
.Sh IDE WRITE CACHING
.Fx 4.3
flirted with turning off IDE write caching.
This reduced write bandwidth
to IDE disks but was considered necessary due to serious data consistency
issues introduced by hard drive vendors.
Basically the problem is that
IDE drives lie about when a write completes.
With IDE write caching turned
on, IDE hard drives will not only write data to disk out of order, they
will sometimes delay some of the blocks indefinitely under heavy disk
load.
A crash or power failure can result in serious file system
corruption.
So our default was changed to be safe.
Unfortunately, the
result was such a huge loss in performance that we caved in and changed the
default back to on after the release.
You should check the default on
your system by observing the
.Va hw.ata.wc
sysctl variable.
If IDE write caching is turned off, you can turn it back
on by setting the
.Va hw.ata.wc
loader tunable to 1.
More information on tuning the ATA driver system may be found in the
.Xr ata 4
manual page.
If you need performance, go with SCSI.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
programs that are causing the load and try to optimize them.
If your system
is paging to swap a lot you need to consider adding more memory.
If your
system is saturating the disk you typically see high CPU idle times and
total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth.
If disk performance is an issue and you
are using IDE drives, switching to SCSI can help a great deal.
While modern
IDE drives compare with SCSI in raw sequential bandwidth, the moment you
start seeking around the disk SCSI drives usually win.
.Pp
Finally, you might run out of network suds.
The first line of defense for
improving network performance is to make sure you are using switches instead
of hubs, especially these days when switches are almost as cheap.
Hubs
have severe problems under heavy loads due to collision back-off and one bad
host can severely degrade the entire LAN.
Second, optimize the network path
as much as possible.
For example, in
.Xr firewall 7
we describe a firewall protecting internal hosts with a topology where
the externally visible hosts are not routed through it.
Use 100BaseT rather
than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
Most bottlenecks occur at the WAN link (e.g.\&
modem, T1, DSL, whatever).
If expanding the link is not an option it may be possible to use the
.Xr dummynet 4
feature to implement peak shaving or other forms of traffic shaping to
prevent the overloaded service (such as web services) from affecting other
services (such as email), or vice versa.
In home installations this could
be used to give interactive traffic (your browser,
.Xr ssh 1
logins) priority
over services you export from your box (web services, email).
.Sh SEE ALSO
.Xr netstat 1 ,
.Xr systat 1 ,
.Xr sendfile 2 ,
.Xr ata 4 ,
.Xr dummynet 4 ,
.Xr login.conf 5 ,
.Xr rc.conf 5 ,
.Xr sysctl.conf 5 ,
.Xr eventtimers 7 ,
.Xr firewall 7 ,
.Xr hier 7 ,
.Xr ports 7 ,
.Xr boot 8 ,
.Xr bsdlabel 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr fsck 8 ,
.Xr gjournal 8 ,
.Xr gstripe 8 ,
.Xr gvinum 8 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr mount 8 ,
.Xr newfs 8 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr sysinstall 8 ,
.Xr tunefs 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.