- .\" Copyright (C) 2001 Matthew Dillon. All rights reserved.
- .\"
- .\" Redistribution and use in source and binary forms, with or without
- .\" modification, are permitted provided that the following conditions
- .\" are met:
- .\" 1. Redistributions of source code must retain the above copyright
- .\" notice, this list of conditions and the following disclaimer.
- .\" 2. Redistributions in binary form must reproduce the above copyright
- .\" notice, this list of conditions and the following disclaimer in the
- .\" documentation and/or other materials provided with the distribution.
- .\"
- .\" THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- .\" ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
- .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- .\" SUCH DAMAGE.
- .\"
- .\" $FreeBSD$
- .\"
- .Dd May 11, 2012
- .Dt TUNING 7
- .Os
- .Sh NAME
- .Nm tuning
- .Nd performance tuning under FreeBSD
- .Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
- When using
- .Xr bsdlabel 8
- or
- .Xr sysinstall 8
- to lay out your file systems on a hard disk it is important to remember
- that hard drives can transfer data much more quickly from outer tracks
- than they can from inner tracks.
- To take advantage of this you should
- try to pack your smaller file systems and swap closer to the outer tracks,
- follow with the larger file systems, and end with the largest file systems.
- It is also important to size system standard file systems such that you
- will not be forced to resize them later as you scale the machine up.
- I usually create, in order, a 128M root, 1G swap, 128M
- .Pa /var ,
- 128M
- .Pa /var/tmp ,
- 3G
- .Pa /usr ,
- and use any remaining space for
- .Pa /home .
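- .Pp
- As a concrete illustration, that layout might translate into
- .Xr fstab 5
- entries along the following lines, assuming a single ATA disk that
- appears as
- .Pa ad0
- (the device name and partition letters are hypothetical):
- .Bd -literal -offset indent
- # Device       Mountpoint  FStype  Options  Dump  Pass#
- /dev/ad0s1b    none        swap    sw       0     0
- /dev/ad0s1a    /           ufs     rw       1     1
- /dev/ad0s1d    /var        ufs     rw       2     2
- /dev/ad0s1e    /var/tmp    ufs     rw       2     2
- /dev/ad0s1f    /usr        ufs     rw       2     2
- /dev/ad0s1g    /home       ufs     rw       2     2
- .Ed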
- .Pp
- You should typically size your swap space to approximately 2x main memory
- for systems with less than 2GB of RAM, or approximately 1x main memory
- if you have more.
- If you do not have a lot of RAM, though, you will generally want a lot
- more swap.
- It is not recommended that you configure any less than
- 256M of swap on a system and you should keep in mind future memory
- expansion when sizing the swap partition.
- The kernel's VM paging algorithms are tuned to perform best when there is
- at least 2x swap versus main memory.
- Configuring too little swap can lead
- to inefficiencies in the VM page scanning code as well as create issues
- later on if you add more memory to your machine.
- Finally, on larger systems
- with multiple SCSI disks (or multiple IDE disks operating on different
- controllers), we strongly recommend that you configure swap on each drive.
- The swap partitions on the drives should be approximately the same size.
- The kernel can handle arbitrary sizes but
- internal data structures scale to 4 times the largest swap partition.
- Keeping
- the swap partitions near the same size will allow the kernel to optimally
- stripe swap space across the N disks.
- Do not worry about overdoing it a
- little; swap space is the saving grace of
- .Ux ,
- and even if you do not normally use much swap, it can give you more time to
- recover from a runaway program before being forced to reboot.
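- .Pp
- For example, to let the kernel stripe swap across two hypothetical
- disks
- .Pa ad0
- and
- .Pa ad1 ,
- simply list a swap partition of approximately the same size on each in
- .Xr fstab 5 :
- .Bd -literal -offset indent
- /dev/ad0s1b  none  swap  sw  0  0
- /dev/ad1s1b  none  swap  sw  0  0
- .Ed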
- .Pp
- How you size your
- .Pa /var
- partition depends heavily on what you intend to use the machine for.
- This
- partition is primarily used to hold mailboxes, the print spool, and log
- files.
- Some people even make
- .Pa /var/log
- its own partition (but except for extreme cases it is not worth the waste
- of a partition ID).
- If your machine is intended to act as a mail
- or print server,
- or you are running a heavily visited web server, you should consider
- creating a much larger partition \(en perhaps a gig or more.
- It is very easy
- to underestimate log file storage requirements.
- .Pp
- Sizing
- .Pa /var/tmp
- depends on the kind of temporary file usage you think you will need.
- 128M is
- the minimum we recommend.
- Also note that sysinstall will create a
- .Pa /tmp
- directory.
- Dedicating a partition for temporary file storage is important for
- two reasons: first, it reduces the possibility of file system corruption
- in a crash, and second it reduces the chance of a runaway process that
- fills up
- .Oo Pa /var Oc Ns Pa /tmp
- from blowing up more critical subsystems (mail,
- logging, etc).
- Filling up
- .Oo Pa /var Oc Ns Pa /tmp
- is a very common problem to have.
- .Pp
- In the old days there were differences between
- .Pa /tmp
- and
- .Pa /var/tmp ,
- but the introduction of
- .Pa /var
- (and
- .Pa /var/tmp )
- led to massive confusion
- among programmers, so today programs haphazardly use one or the
- other, and thus no real distinction can be made between the two.
- So it makes sense to have just one temporary directory and
- softlink to it from the other
- .Pa tmp
- directory locations.
- However you handle
- .Pa /tmp ,
- the one thing you do not want to do is leave it sitting
- on the root partition where it might cause root to fill up or possibly
- corrupt root in a crash/reboot situation.
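- .Pp
- One way to set this up, sketched here under the assumption that
- .Pa /tmp
- is empty or that its contents have already been moved aside, is:
- .Bd -literal -offset indent
- # rmdir /tmp
- # ln -s var/tmp /tmp
- .Ed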
- .Pp
- The
- .Pa /usr
- partition holds the bulk of the files required to support the system and
- a subdirectory within it called
- .Pa /usr/local
- holds the bulk of the files installed from the
- .Xr ports 7
- hierarchy.
- If you do not use ports all that much and do not intend to keep
- system source
- .Pq Pa /usr/src
- on the machine, you can get away with
- a 1 GB
- .Pa /usr
- partition.
- However, if you install a lot of ports
- (especially window managers and Linux-emulated binaries), we recommend
- at least a 2 GB
- .Pa /usr
- and if you also intend to keep system source
- on the machine, we recommend a 3 GB
- .Pa /usr .
- Do not underestimate the
- amount of space you will need in this partition; it can creep up and
- surprise you!
- .Pp
- The
- .Pa /home
- partition is typically used to hold user-specific data.
- I usually size it to the remainder of the disk.
- .Pp
- Why partition at all?
- Why not create one big
- .Pa /
- partition and be done with it?
- Then I do not have to worry about undersizing things!
- Well, there are several reasons this is not a good idea.
- First,
- each partition has different operational characteristics and separating them
- allows the file system to tune itself to those characteristics.
- For example,
- the root and
- .Pa /usr
- partitions are read-mostly, with very little writing, while
- a lot of reading and writing could occur in
- .Pa /var
- and
- .Pa /var/tmp .
- By properly
- partitioning your system fragmentation introduced in the smaller more
- heavily write-loaded partitions will not bleed over into the mostly-read
- partitions.
- Additionally, keeping the write-loaded partitions closer to
- the edge of the disk (i.e., before the really big partitions instead of after
- in the partition table) will increase I/O performance in the partitions
- where you need it the most.
- Now it is true that you might also need I/O
- performance in the larger partitions, but they are so large that shifting
- them more towards the edge of the disk will not lead to a significant
- performance improvement whereas moving
- .Pa /var
- to the edge can have a huge impact.
- Finally, there are safety concerns.
- Having a small neat root partition that
- is essentially read-only gives it a greater chance of surviving a bad crash
- intact.
- .Pp
- Properly partitioning your system also allows you to tune
- .Xr newfs 8 ,
- and
- .Xr tunefs 8
- parameters.
- Tuning
- .Xr newfs 8
- requires more experience but can lead to significant improvements in
- performance.
- There are three parameters that are relatively safe to tune:
- .Em blocksize , bytes/i-node ,
- and
- .Em cylinders/group .
- .Pp
- .Fx
- performs best when using 16K or 32K file system block sizes.
- The default file system block size is 32K,
- which provides best performance for most applications,
- with the exception of those that perform random access on large files
- (such as database server software).
- Such applications tend to perform better with a smaller block size,
- although modern disk characteristics are such that the performance
- gain from using a smaller block size may not be worth consideration.
- Using a block size larger than 32K
- can cause fragmentation of the buffer cache and
- lead to lower performance.
- .Pp
- The defaults may be unsuitable
- for a file system that requires a very large number of i-nodes
- or is intended to hold a large number of very small files.
- Such a file system should be created with a 4K, 8K, or 16K block size.
- This also requires you to specify a smaller
- fragment size.
- We recommend always using a fragment size that is 1/8
- the block size (less testing has been done on other fragment size factors).
- The
- .Xr newfs 8
- options for this would be
- .Dq Li "newfs -f 1024 -b 8192 ..." .
- .Pp
- If a large partition is intended to be used to hold fewer, larger files, such
- as database files, you can increase the
- .Em bytes/i-node
- ratio which reduces the number of i-nodes (maximum number of files and
- directories that can be created) for that partition.
- Decreasing the number
- of i-nodes in a file system can greatly reduce
- .Xr fsck 8
- recovery times after a crash.
- Do not use this option
- unless you are actually storing large files on the partition, because if you
- overcompensate you can wind up with a file system that has lots of free
- space remaining but cannot accommodate any more files.
- Using 65536, 131072, or 262144 bytes/i-node is recommended.
- You can go higher but
- it will have only incremental effects on
- .Xr fsck 8
- recovery times.
- For example,
- .Dq Li "newfs -i 65536 ..." .
- .Pp
- .Xr tunefs 8
- may be used to further tune a file system.
- This command can be run in
- single-user mode without having to reformat the file system.
- However, this is possibly the most abused program in the system.
- Many people attempt to
- increase available file system space by setting the min-free percentage to 0.
- This can lead to severe file system fragmentation and we do not recommend
- that you do this.
- Really the only
- .Xr tunefs 8
- option worthwhile here is turning on
- .Em softupdates
- with
- .Dq Li "tunefs -n enable /filesystem" .
- (Note: in
- .Fx 4.5
- and later, softupdates can be turned on using the
- .Fl U
- option to
- .Xr newfs 8 ,
- and
- .Xr sysinstall 8
- will typically enable softupdates automatically for non-root file systems).
- Softupdates drastically improves meta-data performance, mainly file
- creation and deletion.
- We recommend enabling softupdates on most file systems; however, there
- are two limitations to softupdates that you should be aware of when
- determining whether to use it on a file system.
- First, softupdates guarantees file system consistency in the
- case of a crash but could very easily be several seconds (even a minute!\&)
- behind on pending writes to the physical disk.
- If you crash you may lose more work
- than otherwise.
- Secondly, softupdates delays the freeing of file system
- blocks.
- If you have a file system (such as the root file system) which is
- close to full, doing a major update of it, e.g.\&
- .Dq Li "make installworld" ,
- can run it out of space and cause the update to fail.
- For this reason, softupdates will not be enabled on the root file system
- during a typical install.
- There is no loss of performance since the root
- file system is rarely written to.
- .Pp
- A number of run-time
- .Xr mount 8
- options exist that can help you tune the system.
- The most obvious and most dangerous one is
- .Cm async .
- Only use this option in conjunction with
- .Xr gjournal 8 ,
- as it is far too dangerous on a normal file system.
- A less dangerous and more
- useful
- .Xr mount 8
- option is called
- .Cm noatime .
- .Ux
- file systems normally update the last-accessed time of a file or
- directory whenever it is accessed.
- This operation is handled in
- .Fx
- with a delayed write and normally does not create a burden on the system.
- However, if your system is accessing a huge number of files on a continuing
- basis the buffer cache can wind up getting polluted with atime updates,
- creating a burden on the system.
- For example, if you are running a heavily
- loaded web site, or a news server with lots of readers, you might want to
- consider turning off atime updates on your larger partitions with this
- .Xr mount 8
- option.
- However, you should not gratuitously turn off atime
- updates everywhere.
- For example, the
- .Pa /var
- file system customarily
- holds mailboxes, and atime (in combination with mtime) is used to
- determine whether a mailbox has new mail.
- You might as well leave
- atime turned on for mostly read-only partitions such as
- .Pa /
- and
- .Pa /usr .
- This is especially useful for
- .Pa /
- since some system utilities
- use the atime field for reporting.
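- .Pp
- For example, atime updates could be disabled on a hypothetical
- .Pa /home
- partition at run-time with
- .Dq Li "mount -u -o noatime /home" ,
- or persistently via
- .Xr fstab 5 :
- .Bd -literal -offset indent
- /dev/ad0s1g  /home  ufs  rw,noatime  2  2
- .Ed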
- .Sh STRIPING DISKS
- In larger systems you can stripe partitions from several drives together
- to create a much larger overall partition.
- Striping can also improve
- the performance of a file system by splitting I/O operations across two
- or more disks.
- The
- .Xr gstripe 8 ,
- .Xr gvinum 8 ,
- and
- .Xr ccdconfig 8
- utilities may be used to create simple striped file systems.
- Generally
- speaking, striping smaller partitions such as the root and
- .Pa /var/tmp ,
- or essentially read-only partitions such as
- .Pa /usr
- is a complete waste of time.
- You should only stripe partitions that require serious I/O performance,
- typically
- .Pa /var , /home ,
- or custom partitions used to hold databases and web pages.
- Choosing the proper stripe size is also
- important.
- File systems tend to store meta-data on power-of-2 boundaries
- and you usually want to reduce seeking rather than increase seeking.
- This
- means you want to use a large off-center stripe size such as 1152 sectors
- so sequential I/O does not seek both disks and so meta-data is distributed
- across both disks rather than concentrated on a single disk.
- If
- you really need to get sophisticated, we recommend using a real hardware
- RAID controller from the list of
- .Fx
- supported controllers.
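- .Pp
- As a sketch, a stripe with the suggested 1152-sector (589824-byte)
- stripe size could be built from two hypothetical spare disks with
- .Xr gstripe 8 :
- .Bd -literal -offset indent
- # gstripe label -v -s 589824 data /dev/ad2 /dev/ad3
- # newfs -U /dev/stripe/data
- .Ed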
- .Sh SYSCTL TUNING
- .Xr sysctl 8
- variables permit system behavior to be monitored and controlled at
- run-time.
- Some sysctls simply report on the behavior of the system; others allow
- the system behavior to be modified;
- some may be set at boot time using
- .Xr rc.conf 5 ,
- but most will be set via
- .Xr sysctl.conf 5 .
- There are several hundred sysctls in the system, including many that appear
- to be candidates for tuning but actually are not.
- In this document we will only cover the ones that have the greatest effect
- on the system.
- .Pp
- The
- .Va vm.overcommit
- sysctl defines the overcommit behaviour of the vm subsystem.
- The virtual memory system always does accounting of the swap space
- reservation, both total for system and per-user.
- Corresponding values
- are available through sysctl
- .Va vm.swap_total ,
- that gives the total bytes available for swapping, and
- .Va vm.swap_reserved ,
- that gives the number of bytes that may be needed to back all currently
- allocated anonymous memory.
- .Pp
- Setting bit 0 of the
- .Va vm.overcommit
- sysctl causes the virtual memory system to return failure
- to the process when allocation of memory causes
- .Va vm.swap_reserved
- to exceed
- .Va vm.swap_total .
- Bit 1 of the sysctl enforces
- .Dv RLIMIT_SWAP
- limit
- (see
- .Xr getrlimit 2 ) .
- Root is exempt from this limit.
- Bit 2 allows counting most of the physical
- memory as allocatable, except wired and reserved free pages
- (accounted by the
- .Va vm.stats.vm.v_wire_count
- and
- .Va vm.stats.vm.v_free_target
- sysctls, respectively).
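- .Pp
- For example, to inspect the current accounting and then enable
- strict swap-based accounting (bit 0):
- .Bd -literal -offset indent
- # sysctl vm.swap_total vm.swap_reserved
- # sysctl vm.overcommit=1
- .Ed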
- .Pp
- The
- .Va kern.ipc.maxpipekva
- loader tunable is used to set a hard limit on the
- amount of kernel address space allocated to mapping of pipe buffers.
- Use of the mapping allows the kernel to eliminate a copy of the
- data from writer address space into the kernel, directly copying
- the content of mapped buffer to the reader.
- Increasing this value to a higher setting, such as `25165824', might
- improve performance on systems where space for mapping pipe buffers
- is quickly exhausted.
- This exhaustion is not fatal, however; it will only cause pipes
- to fall back to using double-copy.
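- .Pp
- Being a loader tunable, it must be set in
- .Xr loader.conf 5 ,
- for example:
- .Bd -literal -offset indent
- kern.ipc.maxpipekva="25165824"
- .Ed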
- .Pp
- The
- .Va kern.ipc.shm_use_phys
- sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on).
- Setting
- this parameter to 1 will cause all System V shared memory segments to be
- mapped to unpageable physical RAM.
- This feature only has an effect if you
- are either (A) mapping small amounts of shared memory across many (hundreds)
- of processes, or (B) mapping large amounts of shared memory across any
- number of processes.
- This feature allows the kernel to remove a great deal
- of internal memory management page-tracking overhead at the cost of wiring
- the shared memory into core, making it unswappable.
- .Pp
- The
- .Va vfs.vmiodirenable
- sysctl defaults to 1 (on).
- This parameter controls how directories are cached
- by the system.
- Most directories are small and use but a single fragment
- (typically 2K) in the file system and even less (typically 512 bytes) in
- the buffer cache.
- However, when operating in the default mode the buffer
- cache will only cache a fixed number of directories even if you have a huge
- amount of memory.
- Turning on this sysctl allows the buffer cache to use
- the VM Page Cache to cache the directories.
- The advantage is that all of
- memory is now available for caching directories.
- The disadvantage is that
- the minimum in-core memory used to cache a directory is the physical page
- size (typically 4K) rather than 512 bytes.
- We recommend turning this option off in memory-constrained environments;
- however, when on, it will substantially improve the performance of services
- that manipulate a large number of files.
- Such services can include web caches, large mail systems, and news systems.
- Turning on this option will generally not reduce performance even with the
- wasted memory but you should experiment to find out.
- .Pp
- The
- .Va vfs.write_behind
- sysctl defaults to 1 (on).
- This tells the file system to issue media
- writes as full clusters are collected, which typically occurs when writing
- large sequential files.
- The idea is to avoid saturating the buffer
- cache with dirty buffers when it would not benefit I/O performance.
- However,
- this may stall processes and under certain circumstances you may wish to turn
- it off.
- .Pp
- The
- .Va vfs.hirunningspace
- sysctl determines how much outstanding write I/O may be queued to
- disk controllers system-wide at any given time.
- It is used by the UFS file system.
- The default is self-tuned and
- usually sufficient but on machines with advanced controllers and lots
- of disks this may be tuned up to match what the controllers buffer.
- It works best to configure this setting to match the tagged queuing
- capabilities of your controllers or drives multiplied by the average
- I/O size used in production (for example, 128 tags of 128 KiB each
- give 16 MiB).
- Note that setting too high a value
- (exceeding the buffer cache's write threshold) can lead to extremely
- bad clustering performance.
- Do not set this value arbitrarily high!
- Higher write queueing values may also add latency to reads occurring at
- the same time.
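- .Pp
- For example, to match the 16 MiB figure above (128 tags of 128 KiB
- each, i.e., 128 * 131072 = 16777216 bytes):
- .Bd -literal -offset indent
- # sysctl vfs.hirunningspace=16777216
- .Ed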
- .Pp
- The
- .Va vfs.read_max
- sysctl governs VFS read-ahead and is expressed as the number of blocks
- to pre-read if the heuristics algorithm decides that the reads are
- issued sequentially.
- It is used by the UFS, ext2fs and msdosfs file systems.
- With the default UFS block size of 32 KiB, a setting of 64 will allow
- speculatively reading up to 2 MiB.
- This setting may be increased to get around disk I/O latencies, especially
- where these latencies are large such as in virtual machine emulated
- environments.
- It may be tuned down in specific cases where the I/O load is such that
- read-ahead adversely affects performance or where system memory is really
- low.
- .Pp
- The
- .Va vfs.ncsizefactor
- sysctl defines how large the VFS namecache may grow.
- The number of currently allocated entries in namecache is provided by
- .Va debug.numcache
- sysctl and the condition
- debug.numcache < kern.maxvnodes * vfs.ncsizefactor
- is adhered to.
- .Pp
- The
- .Va vfs.ncnegfactor
- sysctl defines how many negative entries the VFS namecache is allowed to create.
- The number of currently allocated negative entries is provided by
- .Va debug.numneg
- sysctl and the condition
- vfs.ncnegfactor * debug.numneg < debug.numcache
- is adhered to.
- .Pp
- There are various other buffer-cache and VM page cache related sysctls.
- We do not recommend modifying these values.
- As of
- .Fx 4.3 ,
- the VM system does an extremely good job tuning itself.
- .Pp
- The
- .Va net.inet.tcp.sendspace
- and
- .Va net.inet.tcp.recvspace
- sysctls are of particular interest if you are running network intensive
- applications.
- They control the amount of send and receive buffer space
- allowed for any given TCP connection.
- The default sending buffer is 32K; the default receiving buffer
- is 64K.
- You can often
- improve bandwidth utilization by increasing the default at the cost of
- eating up more kernel memory for each connection.
- We do not recommend
- increasing the defaults if you are serving hundreds or thousands of
- simultaneous connections because it is possible to quickly run the system
- out of memory due to stalled connections building up.
- But if you need
- high bandwidth over a fewer number of connections, especially if you have
- gigabit Ethernet, increasing these defaults can make a huge difference.
- You can adjust the buffer size for incoming and outgoing data separately.
- For example, if your machine is primarily doing web serving you may want
- to decrease the recvspace in order to be able to increase the
- sendspace without eating too much kernel memory.
- Note that the routing table (see
- .Xr route 8 )
- can be used to introduce route-specific send and receive buffer size
- defaults.
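- .Pp
- For example, a machine that mostly sends might skew the buffers in
- .Xr sysctl.conf 5
- along the following lines (the values are illustrative only):
- .Bd -literal -offset indent
- net.inet.tcp.sendspace=65536
- net.inet.tcp.recvspace=16384
- .Ed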
- .Pp
- As an additional management tool you can use pipes in your
- firewall rules (see
- .Xr ipfw 8 )
- to limit the bandwidth going to or from particular IP blocks or ports.
- For example, if you have a T1 you might want to limit your web traffic
- to 70% of the T1's bandwidth in order to leave the remainder available
- for mail and interactive use.
- Normally a heavily loaded web server
- will not introduce significant latencies into other services even if
- the network link is maxed out, but enforcing a limit can smooth things
- out and lead to longer term stability.
- Many people also enforce artificial
- bandwidth limitations in order to ensure that they are not charged for
- using too much bandwidth.
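- .Pp
- As a sketch, limiting outbound web traffic to roughly 70% of a T1
- (about 1080 Kbit/s; both the rate and the rule are illustrative only):
- .Bd -literal -offset indent
- # ipfw pipe 1 config bw 1080Kbit/s
- # ipfw add pipe 1 tcp from me 80 to any out
- .Ed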
- .Pp
- Setting the send or receive TCP buffer to values larger than 65535 will result
- in only a marginal performance improvement unless both hosts support the window
- scaling extension of the TCP protocol, which is controlled by the
- .Va net.inet.tcp.rfc1323
- sysctl.
- These extensions should be enabled and the TCP buffer size should be set
- to a value larger than 65536 in order to obtain good performance from
- certain types of network links; specifically, gigabit WAN links and
- high-latency satellite links.
- RFC1323 support is enabled by default.
- .Pp
- The
- .Va net.inet.tcp.always_keepalive
- sysctl determines whether or not the TCP implementation should attempt
- to detect dead TCP connections by intermittently delivering
- .Dq keepalives
- on the connection.
- By default, this is enabled for all applications; by setting this
- sysctl to 0, only applications that specifically request keepalives
- will use them.
- In most environments, TCP keepalives will improve the management of
- system state by expiring dead TCP connections, particularly for
- systems serving dialup users who may not always terminate individual
- TCP connections before disconnecting from the network.
- However, in some environments, temporary network outages may be
- incorrectly identified as dead sessions, resulting in unexpectedly
- terminated TCP connections.
- In such environments, setting the sysctl to 0 may reduce the occurrence of
- TCP session disconnections.
- .Pp
- The
- .Va net.inet.tcp.delayed_ack
- TCP feature is largely misunderstood.
- Historically speaking, this feature
- was designed to allow the acknowledgement to transmitted data to be returned
- along with the response.
- For example, when you type over a remote shell,
- the acknowledgement to the character you send can be returned along with the
- data representing the echo of the character.
- With delayed acks turned off,
- the acknowledgement may be sent in its own packet, before the remote service
- has a chance to echo the data it just received.
- This same concept also
- applies to any interactive protocol (e.g.\& SMTP, WWW, POP3), and can cut the
- number of tiny packets flowing across the network in half.
- The
- .Fx
- delayed ACK implementation also follows the TCP protocol rule that
- at least every other packet be acknowledged even if the standard 100ms
- timeout has not yet passed.
- Normally the worst a delayed ACK can do is
- slightly delay the teardown of a connection, or slightly delay the ramp-up
- of a slow-start TCP connection.
- While we are not certain, we believe that
- the several FAQs related to packages such as SAMBA and SQUID which advise
- turning off delayed acks may be referring to the slow-start issue.
- In
- .Fx ,
- it would be more beneficial to increase the slow-start flightsize via
- the
- .Va net.inet.tcp.slowstart_flightsize
- sysctl rather than disable delayed acks.
- .Pp
- The
- .Va net.inet.tcp.inflight.enable
- sysctl turns on bandwidth delay product limiting for all TCP connections.
- The system will attempt to calculate the bandwidth delay product for each
- connection and limit the amount of data queued to the network to just the
- amount required to maintain optimum throughput.
- This feature is useful
- if you are serving data over modems, GigE, or high speed WAN links (or
- any other link with a high bandwidth*delay product), especially if you are
- also using window scaling or have configured a large send window.
- If you enable this option, you should also be sure to set
- .Va net.inet.tcp.inflight.debug
- to 0 (disable debugging), and for production use setting
- .Va net.inet.tcp.inflight.min
- to at least 6144 may be beneficial.
- Note however, that setting high
- minimums may effectively disable bandwidth limiting depending on the link.
- The limiting feature reduces the amount of data built up in intermediate
- router and switch packet queues as well as reduces the amount of data built
- up in the local host's interface queue.
- With fewer packets queued up,
- interactive connections, especially over slow modems, will also be able
- to operate with lower round trip times.
- However, note that this feature
- only affects data transmission (uploading / server-side).
- It does not
- affect data reception (downloading).
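- .Pp
- A reasonable starting point in
- .Xr sysctl.conf 5 ,
- using the values suggested above, might be:
- .Bd -literal -offset indent
- net.inet.tcp.inflight.enable=1
- net.inet.tcp.inflight.debug=0
- net.inet.tcp.inflight.min=6144
- .Ed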
- .Pp
- Adjusting
- .Va net.inet.tcp.inflight.stab
- is not recommended.
- This parameter defaults to 20, representing 2 maximal packets added
- to the bandwidth delay product window calculation.
- The additional
- window is required to stabilize the algorithm and improve responsiveness
- to changing conditions, but it can also result in higher ping times
- over slow links (though still much lower than you would get without
- the inflight algorithm).
- In such cases you may
- wish to try reducing this parameter to 15, 10, or 5, and you may also
- have to reduce
- .Va net.inet.tcp.inflight.min
- (for example, to 3500) to get the desired effect.
- Reducing these parameters
- should be done as a last resort only.
- .Pp
- The
- .Va net.inet.ip.portrange.*
- sysctls control the port number ranges automatically bound to TCP and UDP
- sockets.
- There are three ranges: a low range, a default range, and a
- high range, selectable via the
- .Dv IP_PORTRANGE
- .Xr setsockopt 2
- call.
- Most
- network programs use the default range which is controlled by
- .Va net.inet.ip.portrange.first
- and
- .Va net.inet.ip.portrange.last ,
- which default to 49152 and 65535, respectively.
- Bound port ranges are
- used for outgoing connections, and it is possible to run the system out
- of ports under certain circumstances.
- This most commonly occurs when you are
- running a heavily loaded web proxy.
- The port range is not an issue
- when running a server which handles mainly incoming connections, such as a
- normal web server, or has a limited number of outgoing connections, such
- as a mail relay.
- For situations where you may run out of ports,
- we recommend decreasing
- .Va net.inet.ip.portrange.first
- modestly.
- A range of 10000 to 30000 ports may be reasonable.
- You should also consider firewall effects when changing the port range.
- Some firewalls
- may block large ranges of ports (usually low-numbered ports) and expect systems
- to use higher ranges of ports for outgoing connections.
- By default
- .Va net.inet.ip.portrange.last
- is set at the maximum allowable port number.
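- .Pp
- For example, to move the default range to 10000-30000 in
- .Xr sysctl.conf 5 :
- .Bd -literal -offset indent
- net.inet.ip.portrange.first=10000
- net.inet.ip.portrange.last=30000
- .Ed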
- .Pp
- The
- .Va kern.ipc.somaxconn
- sysctl limits the size of the listen queue for accepting new TCP connections.
- The default value of 128 is typically too low for robust handling of new
- connections in a heavily loaded web server environment.
- For such environments,
- we recommend increasing this value to 1024 or higher.
- The service daemon
- may itself limit the listen queue size (e.g.\&
- .Xr sendmail 8 ,
- apache) but will
- often have a directive in its configuration file to adjust the queue size up.
- Larger listen queues also do a better job of fending off denial of service
- attacks.
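- .Pp
- For example, in
- .Xr sysctl.conf 5 :
- .Bd -literal -offset indent
- kern.ipc.somaxconn=1024
- .Ed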
- .Pp
- The
- .Va kern.maxfiles
- sysctl determines how many open files the system supports.
- The default is
- typically a few thousand but you may need to bump this up to ten or twenty
- thousand if you are running databases or large descriptor-heavy daemons.
- The read-only
- .Va kern.openfiles
- sysctl may be interrogated to determine the current number of open files
- on the system.
- .Pp
- The
- .Va vm.swap_idle_enabled
- sysctl is useful in large multi-user systems where you have lots of users
- entering and leaving the system and lots of idle processes.
- Such systems
- tend to generate a great deal of continuous pressure on free memory reserves.
- Turning this feature on and adjusting the swapout hysteresis (in idle
- seconds) via
- .Va vm.swap_idle_threshold1
- and
- .Va vm.swap_idle_threshold2
- allows you to depress the priority of pages associated with idle processes
- more quickly than the normal pageout algorithm.
- This gives a helping hand
- to the pageout daemon.
- Do not turn this option on unless you need it,
- because the tradeoff you are making is to essentially pre-page memory sooner
- rather than later, eating more swap and disk bandwidth.
- In a small system
- this option will have a detrimental effect but in a large system that is
- already doing moderate paging this option allows the VM system to stage
- whole processes into and out of memory more easily.
- .Sh LOADER TUNABLES
- Some aspects of the system behavior may not be tunable at runtime because
- memory allocations they perform must occur early in the boot process.
- To change loader tunables, you must set their values in
- .Xr loader.conf 5
- and reboot the system.
- .Pp
- .Va kern.maxusers
- controls the scaling of a number of static system tables, including defaults
- for the maximum number of open files, sizing of network memory resources, etc.
- As of
- .Fx 4.5 ,
- .Va kern.maxusers
- is automatically sized at boot based on the amount of memory available in
- the system, and may be determined at run-time by inspecting the value of the
- read-only
- .Va kern.maxusers
- sysctl.
- Some sites will require larger or smaller values of
- .Va kern.maxusers
- and may set it as a loader tunable; values of 64, 128, and 256 are not
- uncommon.
- We do not recommend going above 256 unless you need a huge number
- of file descriptors; many of the tunable values set to their defaults by
- .Va kern.maxusers
- may be individually overridden at boot-time or run-time as described
- elsewhere in this document.
- Systems older than
- .Fx 4.4
- must set this value via the kernel
- .Xr config 8
- option
- .Cd maxusers
- instead.
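- .Pp
- When set as a loader tunable, the entry in
- .Xr loader.conf 5
- might look like:
- .Bd -literal -offset indent
- kern.maxusers="256"
- .Ed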
- .Pp
- The
- .Va kern.dfldsiz
- and
- .Va kern.dflssiz
- tunables set the default soft limits for process data and stack size
- respectively.
- Processes may increase these up to the hard limits by calling
- .Xr setrlimit 2 .
- The
- .Va kern.maxdsiz ,
- .Va kern.maxssiz ,
- and
- .Va kern.maxtsiz
- tunables set the hard limits for process data, stack, and text size
- respectively; processes may not exceed these limits.
- The
- .Va kern.sgrowsiz
- tunable controls how much the stack segment will grow when a process
- needs to allocate more stack.
- .Pp
- .Va kern.ipc.nmbclusters
- may be adjusted to increase the number of network mbufs the system is
- willing to allocate.
- Each cluster represents approximately 2K of memory,
- so a value of 1024 represents 2M of kernel memory reserved for network
- buffers.
- You can do a simple calculation to figure out how many you need.
- If you have a web server which maxes out at 1000 simultaneous connections,
- and each connection eats a 16K receive and 16K send buffer, you need
- approximately 32MB worth of network buffers to deal with it.
- A good rule of
- thumb is to multiply by 2: 32 MB x 2 = 64 MB, and 64 MB / 2K = 32768 clusters.
- So for this case
- you would want to set
- .Va kern.ipc.nmbclusters
- to 32768.
- We recommend values between
- 1024 and 4096 for machines with moderate amounts of memory, and between 4096
- and 32768 for machines with greater amounts of memory.
- Under no circumstances
- should you specify an arbitrarily high value for this parameter; it could
- lead to a boot-time crash.
- The
- .Fl m
- option to
- .Xr netstat 1
- may be used to observe network cluster use.
- Older versions of
- .Fx
- do not have this tunable and require that the
- kernel
- .Xr config 8
- option
- .Dv NMBCLUSTERS
- be set instead.
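- .Pp
- On systems with the tunable, the calculation above (1000 connections
- * 32 KiB * 2 = 64 MB, and 64 MB / 2K = 32768 clusters) would translate
- into the following
- .Xr loader.conf 5
- line:
- .Bd -literal -offset indent
- kern.ipc.nmbclusters="32768"
- .Ed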
- .Pp
- More and more programs are using the
- .Xr sendfile 2
- system call to transmit files over the network.
- The
- .Va kern.ipc.nsfbufs
- sysctl controls the number of file system buffers
- .Xr sendfile 2
- is allowed to use to perform its work.
- This parameter nominally scales
- with
- .Va kern.maxusers
- so you should not need to modify this parameter except under extreme
- circumstances.
- See the
- .Sx TUNING
- section in the
- .Xr sendfile 2
- manual page for details.
- .Sh KERNEL CONFIG TUNING
- There are a number of kernel options that you may have to fiddle with in
- a large-scale system.
- In order to change these options you need to be
- able to compile a new kernel from source.
- The
- .Xr config 8
- manual page and the handbook are good starting points for learning how to
- do this.
- Generally the first thing you do when creating your own custom
- kernel is to strip out all the drivers and services you do not use.
- Removing things like
- .Dv INET6
- and drivers you do not have will reduce the size of your kernel, sometimes
- by a megabyte or more, leaving more memory available for applications.
- .Pp
- .Dv SCSI_DELAY
- may be used to reduce system boot times.
- The defaults are fairly high and
- can be responsible for 5+ seconds of delay in the boot process.
- Reducing
- .Dv SCSI_DELAY
- to something below 5 seconds could work (especially with modern drives).
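- .Pp
- For example, to wait only two seconds for SCSI buses to settle
- (the value is in milliseconds and is illustrative only):
- .Bd -literal -offset indent
- options SCSI_DELAY=2000
- .Ed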
- .Pp
- There are a number of
- .Dv *_CPU
- options that can be commented out.
- If you only want the kernel to run
- on a Pentium class CPU, you can easily remove
- .Dv I486_CPU ,
- but only remove
- .Dv I586_CPU
- if you are sure your CPU is being recognized as a Pentium II or better.
- Some clones may be recognized as a Pentium or even a 486 and not be able
- to boot without those options.
- If it works, great!
- The operating system
- will be able to better use higher-end CPU features for MMU, task switching,
- timebase, and even device operations.
- Additionally, higher-end CPUs support
- 4MB MMU pages, which the kernel uses to map the kernel itself into memory,
- increasing its efficiency under heavy syscall loads.
- .Sh IDE WRITE CACHING
- .Fx 4.3
- flirted with turning off IDE write caching.
- This reduced write bandwidth
- to IDE disks but was considered necessary due to serious data consistency
- issues introduced by hard drive vendors.
- Basically the problem is that
- IDE drives lie about when a write completes.
- With IDE write caching turned
- on, IDE hard drives will not only write data to disk out of order, they
- will sometimes delay some of the blocks indefinitely under heavy disk
- load.
- A crash or power failure can result in serious file system
- corruption.
- So our default was changed to be safe.
- Unfortunately, the
- result was such a huge loss in performance that we caved in and changed the
- default back to on after the release.
- You should check the default on
- your system by observing the
- .Va hw.ata.wc
- sysctl variable.
- If IDE write caching is turned off, you can turn it back
- on by setting the
- .Va hw.ata.wc
- loader tunable to 1.
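- .Pp
- For example, in
- .Xr loader.conf 5 :
- .Bd -literal -offset indent
- hw.ata.wc="1"
- .Ed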
- More information on tuning the ATA driver system may be found in the
- .Xr ata 4
- manual page.
- If you need performance, go with SCSI.
- .Sh CPU, MEMORY, DISK, NETWORK
- The type of tuning you do depends heavily on where your system begins to
- bottleneck as load increases.
- If your system runs out of CPU (idle times
- are perpetually 0%) then you need to consider upgrading the CPU or moving to
- an SMP motherboard (multiple CPU's), or perhaps you need to revisit the
- programs that are causing the load and try to optimize them.
- If your system
- is paging to swap a lot you need to consider adding more memory.
- If your
- system is saturating the disk you typically see high CPU idle times and
- total disk saturation.
- .Xr systat 1
- can be used to monitor this.
- There are many solutions to saturated disks:
- increasing memory for caching, mirroring disks, distributing operations across
- several machines, and so forth.
- If disk performance is an issue and you
- are using IDE drives, switching to SCSI can help a great deal.
- While modern
- IDE drives compare with SCSI in raw sequential bandwidth, the moment you
- start seeking around the disk SCSI drives usually win.
- .Pp
- Finally, you might run out of network suds.
- The first line of defense for
- improving network performance is to make sure you are using switches instead
- of hubs, especially these days when switches are almost as cheap.
- Hubs
- have severe problems under heavy loads due to collision back-off and one bad
- host can severely degrade the entire LAN.
- Second, optimize the network path
- as much as possible.
- For example, in
- .Xr firewall 7
- we describe a firewall protecting internal hosts with a topology where
- the externally visible hosts are not routed through it.
- Use 100BaseT rather
- than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
- Most bottlenecks occur at the WAN link (e.g.\&
- modem, T1, DSL, whatever).
- If expanding the link is not an option it may be possible to use the
- .Xr dummynet 4
- feature to implement peak shaving or other forms of traffic shaping to
- prevent the overloaded service (such as web services) from affecting other
- services (such as email), or vice versa.
- In home installations this could
- be used to give interactive traffic (your browser,
- .Xr ssh 1
- logins) priority
- over services you export from your box (web services, email).
- .Sh SEE ALSO
- .Xr netstat 1 ,
- .Xr systat 1 ,
- .Xr sendfile 2 ,
- .Xr ata 4 ,
- .Xr dummynet 4 ,
- .Xr login.conf 5 ,
- .Xr rc.conf 5 ,
- .Xr sysctl.conf 5 ,
- .Xr eventtimers 7 ,
- .Xr firewall 7 ,
- .Xr hier 7 ,
- .Xr ports 7 ,
- .Xr boot 8 ,
- .Xr bsdlabel 8 ,
- .Xr ccdconfig 8 ,
- .Xr config 8 ,
- .Xr fsck 8 ,
- .Xr gjournal 8 ,
- .Xr gstripe 8 ,
- .Xr gvinum 8 ,
- .Xr ifconfig 8 ,
- .Xr ipfw 8 ,
- .Xr loader 8 ,
- .Xr mount 8 ,
- .Xr newfs 8 ,
- .Xr route 8 ,
- .Xr sysctl 8 ,
- .Xr sysinstall 8 ,
- .Xr tunefs 8
- .Sh HISTORY
- The
- .Nm
- manual page was originally written by
- .An Matthew Dillon
- and first appeared
- in
- .Fx 4.3 ,
- May 2001.