.\" Copyright (C) Caldera International Inc. 2001-2002.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions are
.\" met:
.\"
.\" Redistributions of source code and documentation must retain the above
.\" copyright notice, this list of conditions and the following
.\" disclaimer.
.\"
.\" Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\"
.\" This product includes software developed or owned by Caldera
.\" International, Inc.  Neither the name of Caldera International, Inc.
.\" nor the names of other contributors may be used to endorse or promote
.\" products derived from this software without specific prior written
.\" permission.
.\"
.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
.\" INTERNATIONAL, INC.  AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
.\" DISCLAIMED.  IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
.\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
.\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
.\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
.\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
.\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.\"	@(#)iosys	8.1 (Berkeley) 6/8/93
.\"
.\" $FreeBSD$
.EH 'PSD:3-%''The UNIX I/O System'
.OH 'The UNIX I/O System''PSD:3-%'
.TL
The UNIX I/O System
.AU
Dennis M. Ritchie
.AI
AT&T Bell Laboratories
Murray Hill, NJ
.PP
This paper gives an overview of the workings of the UNIX\(dg
.FS
\(dgUNIX is a Trademark of Bell Laboratories.
.FE
I/O system.
It was written with an eye toward providing
guidance to writers of device driver routines,
and is oriented more toward describing the environment
and nature of device drivers than the implementation
of that part of the file system which deals with
ordinary files.
.PP
It is assumed that the reader has a good knowledge
of the overall structure of the file system as discussed
in the paper ``The UNIX Time-sharing System.''
A more detailed discussion
appears in
``UNIX Implementation;''
the current document restates parts of that one,
but is still more detailed.
It is most useful in
conjunction with a copy of the system code,
since it is basically an exegesis of that code.
.SH
Device Classes
.PP
There are two classes of device:
.I block
and
.I character.
The block interface is suitable for devices
like disks, tapes, and DECtape
which work, or can work, with addressable 512-byte blocks.
Ordinary magnetic tape just barely fits in this category,
since by use of forward
and
backward spacing any block can be read, even though
blocks can be written only at the end of the tape.
Block devices can at least potentially contain a mounted
file system.
The interface to block devices is very highly structured;
the drivers for these devices share a great many routines
as well as a pool of buffers.
.PP
Character-type devices have a much
more straightforward interface, although
more work must be done by the driver itself.
.PP
Devices of both types are named by a
.I major
and a
.I minor
device number.
These numbers are generally stored as an integer
with the minor device number
in the low-order 8 bits and the major device number
in the next-higher 8 bits;
macros
.I major
and
.I minor
are available to access these numbers.
The major device number selects which driver will deal with
the device; the minor device number is not used
by the rest of the system but is passed to the
driver at appropriate times.
Typically the minor number
selects a subdevice attached to
a given controller, or one of
several similar hardware interfaces.
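.PP
The packing just described can be sketched with a few macros.
This is a minimal illustration, assuming a 16-bit device number;
.I makedev
is a hypothetical helper added here, and the definitions in the system's
own headers may differ.

```c
/* Sketch only: minor number in the low-order 8 bits, major in the next 8. */
#undef major            /* avoid clashing with any host-system definitions */
#undef minor
#undef makedev
#define major(x)      (((x) >> 8) & 0377)
#define minor(x)      ((x) & 0377)
#define makedev(x, y) ((((x) & 0377) << 8) | ((y) & 0377))   /* hypothetical */
```

With these definitions, device number 0x0305 has major number 3 and minor number 5.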
.PP
The major device numbers for block and character devices
are used as indices in separate tables;
they both start at 0 and therefore overlap.
.SH
Overview of I/O
.PP
The purpose of
the
.I open
and
.I creat
system calls is to set up entries in three separate
system tables.
The first of these is the
.I u_ofile
table,
which is stored in the system's per-process
data area
.I u.
This table is indexed by
the file descriptor returned by the
.I open
or
.I creat,
and is accessed during
a
.I read,
.I write,
or other operation on the open file.
An entry contains only
a pointer to the corresponding
entry of the
.I file
table,
which is a per-system data base.
There is one entry in the
.I file
table for each
instance of
.I open
or
.I creat.
This table is per-system because the same instance
of an open file must be shared among the several processes
which can result from
.I forks
after the file is opened.
A
.I file
table entry contains
flags which indicate whether the file
was open for reading or writing or is a pipe, and
a count which is used to decide when all processes
using the entry have terminated or closed the file
(so the entry can be abandoned).
There is also a 32-bit file offset
which is used to indicate where in the file the next read
or write will take place.
Finally, there is a pointer to the
entry for the file in the
.I inode
table,
which contains a copy of the file's i-node.
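.PP
The chain of three tables can be pictured with the following sketch.
It is not the system's declaration: the structure names, the value of
NOFILE, and the two helper functions are assumptions made for illustration,
and most fields are elided.

```c
#include <stddef.h>

#define NOFILE 20   /* assumed per-process limit on open files */

struct inode_entry {            /* in-core copy of an i-node (fields elided) */
    int i_count;                /* references to this entry */
};

struct file_entry {             /* one per instance of open or creat */
    char  f_flag;               /* read, write, or pipe */
    char  f_count;              /* descriptors sharing this entry */
    struct inode_entry *f_inode;
    long  f_offset;             /* position of the next read or write */
};

struct user_area {              /* stands in for the per-process area u */
    struct file_entry *u_ofile[NOFILE];
};

/* Follow the chain: descriptor -> file table entry -> inode table entry. */
struct inode_entry *fd_to_inode(struct user_area *u, int fd)
{
    if (fd < 0 || fd >= NOFILE || u->u_ofile[fd] == NULL)
        return NULL;
    return u->u_ofile[fd]->f_inode;
}

/* After a fork, each of the child's descriptors shares the parent's file entry. */
void share_on_fork(struct user_area *parent, struct user_area *child)
{
    for (int fd = 0; fd < NOFILE; fd++) {
        child->u_ofile[fd] = parent->u_ofile[fd];
        if (child->u_ofile[fd] != NULL)
            child->u_ofile[fd]->f_count++;
    }
}
```

Because parent and child point at one file entry, they also share its offset,
which is why the file table is per-system rather than per-process.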
.PP
Certain open files can be designated ``multiplexed''
files, and several other flags apply to such
channels.
In such a case, instead of an offset,
there is a pointer to an associated multiplex channel table.
Multiplex channels will not be discussed here.
.PP
An entry in the
.I file
table corresponds precisely to an instance of
.I open
or
.I creat;
if the same file is opened several times,
it will have several
entries in this table.
However,
there is at most one entry
in the
.I inode
table for a given file.
Also, a file may enter the
.I inode
table not only because it is open,
but also because it is the current directory
of some process or because it
is a special file containing a currently-mounted
file system.
.PP
An entry in the
.I inode
table differs somewhat from the
corresponding i-node as stored on the disk;
the modified and accessed times are not stored,
and the entry is augmented
by a flag word containing information about the entry,
a count used to determine when it may be
allowed to disappear,
and the device and i-number
whence the entry came.
Also, the several block numbers that give addressing
information for the file are expanded from
the 3-byte, compressed format used on the disk to full
.I long
quantities.
.PP
During the processing of an
.I open
or
.I creat
call for a special file,
the system always calls the device's
.I open
routine to allow for any special processing
required (rewinding a tape, turning on
the data-terminal-ready lead of a modem, etc.).
However,
the
.I close
routine is called only when the last
process closes a file,
that is, when the i-node table entry
is being deallocated.
Thus it is not feasible
for a device to maintain, or depend on,
a count of its users, although it is quite
possible to
implement an exclusive-use device which cannot
be reopened until it has been closed.
.PP
When a
.I read
or
.I write
takes place,
the user's arguments
and the
.I file
table entry are used to set up the
variables
.I u.u_base,
.I u.u_count,
and
.I u.u_offset
which respectively contain the (user) address
of the I/O target area, the byte-count for the transfer,
and the current location in the file.
If the file referred to is
a character-type special file, the appropriate read
or write routine is called; it is responsible
for transferring data and updating the
count and current location appropriately
as discussed below.
Otherwise, the current location is used to calculate
a logical block number in the file.
If the file is an ordinary file the logical block
number must be mapped (possibly using indirect blocks)
to a physical block number; a block-type
special file need not be mapped.
This mapping is performed by the
.I bmap
routine.
In any event, the resulting physical block number
is used, as discussed below, to
read or write the appropriate device.
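.PP
The calculation of a logical block number from the current location
can be sketched as follows, assuming the 512-byte block size mentioned earlier.
The function names are hypothetical, and the mapping to a physical block done by
.I bmap
is not shown.

```c
#define BLKSIZE 512L   /* assumed block size in bytes */

/* Logical block number containing byte offset off. */
long logical_block(long off)
{
    return off / BLKSIZE;
}

/* Position of byte offset off within its block. */
int block_offset(long off)
{
    return (int)(off % BLKSIZE);
}

/* Number of bytes that can be moved without crossing into the next block. */
int chunk_size(long off, int count)
{
    int room = (int)BLKSIZE - block_offset(off);
    return count < room ? count : room;
}
```

For example, byte offset 1000 lies in logical block 1 at position 488,
so at most 24 bytes can be transferred before the next block is needed.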
.SH
Character Device Drivers
.PP
The
.I cdevsw
table specifies the interface routines present for
character devices.
Each device provides five routines:
open, close, read, write, and special-function
(to implement the
.I ioctl
system call).
Any of these may be missing.
If a call on the routine
should be ignored,
(e.g.
.I open
on non-exclusive devices that require no setup)
the
.I cdevsw
entry can be given as
.I nulldev;
if it should be considered an error,
(e.g.
.I write
on read-only devices)
.I nodev
is used.
For terminals,
the
.I cdevsw
structure also contains a pointer to the
.I tty
structure associated with the terminal.
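.PP
A switch table of this kind might look like the sketch below.
It is a simplification under stated assumptions: every routine is given the
same two-argument signature, the error cell dev_error stands in for the
per-process error indication, and the read-only device and the dispatch
functions are hypothetical.

```c
typedef int (*devfn)(int dev, int arg);

struct cdevsw_entry {
    devfn d_open, d_close, d_read, d_write, d_sgtty;
};

int dev_error;   /* stands in for the per-process error indication */

/* nulldev: the call is accepted and ignored. */
int nulldev(int dev, int arg) { (void)dev; (void)arg; return 0; }

/* nodev: the call is an error. */
int nodev(int dev, int arg) { (void)dev; (void)arg; dev_error = 1; return -1; }

/* A hypothetical read-only device that needs no setup on open. */
int ro_read(int dev, int arg) { (void)arg; return dev & 0377; }

struct cdevsw_entry cdevsw[] = {
    /* open     close    read     write  special-function */
    { nulldev, nulldev, ro_read, nodev, nodev },
};

#define NCHRDEV ((int)(sizeof cdevsw / sizeof cdevsw[0]))

/* Select the driver by major device number and call its open routine. */
int cdev_open(int dev, int flag)
{
    int maj = (dev >> 8) & 0377;
    if (maj >= NCHRDEV)
        return nodev(dev, flag);
    return (*cdevsw[maj].d_open)(dev, flag);
}

/* Likewise for write; on this read-only device the entry is nodev. */
int cdev_write(int dev)
{
    int maj = (dev >> 8) & 0377;
    if (maj >= NCHRDEV)
        return nodev(dev, 0);
    return (*cdevsw[maj].d_write)(dev, 0);
}
```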
.PP
The
.I open
routine is called each time the file
is opened with the full device number as argument.
The second argument is a flag which is
non-zero only if the device is to be written upon.
.PP
The
.I close
routine is called only when the file
is closed for the last time,
that is, when the very last process in
which the file is open closes it.
This means it is not possible for the driver to
maintain its own count of its users.
The first argument is the device number;
the second is a flag which is non-zero
if the file was open for writing in the process which
performs the final
.I close.
.PP
When
.I write
is called, it is supplied the device
as argument.
The per-user variable
.I u.u_count
has been set to
the number of characters indicated by the user;
for character devices, this number may be 0
initially.
.I u.u_base
is the address supplied by the user from which to start
taking characters.
The system may call the
routine internally, so the
flag
.I u.u_segflg
is supplied that indicates,
if
.I on,
that
.I u.u_base
refers to the system address space instead of
the user's.
.PP
The
.I write
routine
should copy up to
.I u.u_count
characters from the user's buffer to the device,
decrementing
.I u.u_count
for each character passed.
For most drivers, which work one character at a time,
the routine
.I "cpass( )"
is used to pick up characters
from the user's buffer.
Successive calls on it return
the characters to be written until
.I u.u_count
goes to 0 or an error occurs,
when it returns \(mi1.
.I Cpass
takes care of interrogating
.I u.u_segflg
and updating
.I u.u_count.
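.PP
The shape of a character-at-a-time write routine can be sketched as below.
This is a user-space imitation, not kernel code: cpass_sketch is a stand-in
that mimics only the return convention described here, u holds just the two
fields the example needs, and the output array plays the part of a
hypothetical device.

```c
struct user {
    const char *u_base;   /* next character to take from the caller's buffer */
    int u_count;          /* characters remaining to be transferred */
};

struct user u;            /* stands in for the per-process area u */

/* Stand-in for cpass(): the next character, or -1 once u_count reaches 0. */
int cpass_sketch(void)
{
    if (u.u_count == 0)
        return -1;
    u.u_count--;
    return *u.u_base++ & 0377;
}

char device_out[64];      /* hypothetical device: output collects here */
int device_len;

/* Take characters one at a time until the count is exhausted or the device is full. */
void dev_write_sketch(void)
{
    int c;

    while (device_len < (int)sizeof device_out && (c = cpass_sketch()) >= 0)
        device_out[device_len++] = (char)c;
}
```

The capacity test comes first in the loop condition so that no character is
taken from the caller and then dropped.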
.PP
Write routines which want to transfer
a potentially large number of characters into an internal
buffer may also use the routine
.I "iomove(buffer, offset, count, flag)"
which is faster when many characters must be moved.
.I Iomove
transfers up to
.I count
characters into the
.I buffer
starting
.I offset
bytes from the start of the buffer;
.I flag
should be
.I B_WRITE
(which is 0) in the write case.
Caution:
the caller is responsible for making sure
the count is not too large and is non-zero.
As an efficiency note,
.I iomove
is much slower if any of
.I "buffer+offset, count"
or
.I u.u_base
is odd.
.PP
The device's
.I read
routine is called under conditions similar to
.I write,
except that
.I u.u_count
is guaranteed to be non-zero.
To return characters to the user, the routine
.I "passc(c)"
is available; it takes care of housekeeping
like
.I cpass
and returns \(mi1 as the last character
specified by
.I u.u_count
is returned to the user;
before that time, 0 is returned.
.I Iomove
is also usable as with
.I write;
the flag should be
.I B_READ
but the same cautions apply.
.PP
The ``special-functions'' routine
is invoked by the
.I stty
and
.I gtty
system calls as follows:
.I "(*p) (dev, v)"
where
.I p
is a pointer to the device's routine,
.I dev
is the device number,
and
.I v
is a vector.
In the
.I gtty
case,
the device is supposed to place up to 3 words of status information
into the vector; this will be returned to the caller.
In the
.I stty
case,
.I v
is 0;
the device should take up to 3 words of
control information from
the array
.I "u.u_arg[0...2]."
.PP
Finally, each device should have appropriate interrupt-time
routines.
When an interrupt occurs, it is turned into a C-compatible call
on the device's interrupt routine.
The interrupt-catching mechanism makes
the low-order four bits of the ``new PS'' word in the
trap vector for the interrupt available
to the interrupt handler.
This is conventionally used by drivers
which deal with multiple similar devices
to encode the minor device number.
After the interrupt has been processed,
a return from the interrupt handler will
return from the interrupt itself.
.PP
A number of subroutines are available which are useful
to character device drivers.
Most of these handlers, for example, need a place
to buffer characters in the internal interface
between their ``top half'' (read/write)
and ``bottom half'' (interrupt) routines.
For relatively low data-rate devices, the best mechanism
is the character queue maintained by the
routines
.I getc
and
.I putc.
A queue header has the structure
.DS
struct {
	int	c_cc;	/* character count */
	char	*c_cf;	/* first character */
	char	*c_cl;	/* last character */
} queue;
.DE
A character is placed on the end of a queue by
.I "putc(c, &queue)"
where
.I c
is the character and
.I queue
is the queue header.
The routine returns \(mi1 if there is no space
to put the character, 0 otherwise.
The first character on the queue may be retrieved
by
.I "getc(&queue)"
which returns either the (non-negative) character
or \(mi1 if the queue is empty.
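.PP
The return conventions of these two routines can be imitated in user space
as shown below.
This is only a sketch: it keeps each queue's characters in a small private
ring, whereas the system draws them from shared storage; the names q_putc,
q_getc and q_init, the field c_buf, and the capacity QSIZE are assumptions,
chosen in part to avoid clashing with the standard I/O library.

```c
#define QSIZE 8                 /* assumed capacity for this sketch */

struct queue {
    int   c_cc;                 /* character count */
    char *c_cf;                 /* first character */
    char *c_cl;                 /* next free slot (a simplification) */
    char  c_buf[QSIZE];         /* private storage, unlike the real system */
};

void q_init(struct queue *q)
{
    q->c_cc = 0;
    q->c_cf = q->c_cl = q->c_buf;
}

/* Place c on the end of the queue; -1 if there is no space, 0 otherwise. */
int q_putc(int c, struct queue *q)
{
    if (q->c_cc == QSIZE)
        return -1;
    *q->c_cl++ = (char)c;
    if (q->c_cl == q->c_buf + QSIZE)
        q->c_cl = q->c_buf;     /* wrap around */
    q->c_cc++;
    return 0;
}

/* Remove and return the first character, or -1 if the queue is empty. */
int q_getc(struct queue *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = *q->c_cf++ & 0377;      /* characters are returned non-negative */
    if (q->c_cf == q->c_buf + QSIZE)
        q->c_cf = q->c_buf;     /* wrap around */
    q->c_cc--;
    return c;
}
```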
.PP
Notice that the space for characters in queues is
shared among all devices in the system
and in the standard system there are only some 600
character slots available.
Thus device handlers,
especially write routines, must take
care to avoid gobbling up excessive numbers of characters.
.PP
The other major help available
to device handlers is the sleep-wakeup mechanism.
The call
.I "sleep(event, priority)"
causes the process to wait (allowing other processes to run)
until the
.I event
occurs;
at that time, the process is marked ready-to-run
and the call will return when there is no
process with higher
.I priority.
.PP
The call
.I "wakeup(event)"
indicates that the
.I event
has happened, that is, causes processes sleeping
on the event to be awakened.
The
.I event
is an arbitrary quantity agreed upon
by the sleeper and the waker-up.
By convention, it is the address of some data area used
by the driver, which guarantees that events
are unique.
.PP
Processes sleeping on an event should not assume
that the event has really happened;
they should check that the conditions which
caused them to sleep no longer hold.
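.PP
The advice to re-test the condition amounts to calling sleep inside a loop.
The following is a user-space imitation only: sleep_sketch is a stand-in
that returns immediately, and it is rigged so that its first two returns
are premature, with the device still busy.
All names and the priority value are assumptions.

```c
int dev_busy = 1;          /* the condition the caller is waiting on */
int wakeups_delivered;     /* counts returns from the stand-in sleep */

/*
 * Stand-in for sleep(event, priority): returns as if a wakeup had arrived.
 * Only on the third return has the device actually become free.
 */
void sleep_sketch(void *event, int priority)
{
    (void)event;
    (void)priority;
    if (++wakeups_delivered >= 3)
        dev_busy = 0;
}

/* Re-test the condition after every return from sleep. */
void wait_for_device(void)
{
    while (dev_busy)
        sleep_sketch(&dev_busy, 20);   /* the priority value is arbitrary here */
}
```

Had the routine used a single if in place of the loop, it would have
continued after the first premature return with the device still busy.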
.PP
Priorities can range from 0 to 127;
a higher numerical value indicates a less-favored
scheduling situation.
A distinction is made between processes sleeping
at priority less than the parameter
.I PZERO
and those at numerically larger priorities.
The former cannot
be interrupted by signals, although it
is conceivable that they may be swapped out.
Thus it is a bad idea to sleep with
priority less than PZERO on an event which might never occur.
On the other hand, calls to
.I sleep
with larger priority
may never return if the process is terminated by
some signal in the meantime.
Incidentally, it is a gross error to call
.I sleep
in a routine called at interrupt time, since the process
which is running is almost certainly not the
process which should go to sleep.
Likewise, none of the variables in the user area
``\fIu\fB.\fR''
should be touched, let alone changed, by an interrupt routine.
.PP
If a device driver
wishes to wait for some event for which it is inconvenient
or impossible to supply a
.I wakeup,
(for example, a device going on-line, which does not
generally cause an interrupt),
the call
.I "sleep(&lbolt, priority)"
may be given.
.I Lbolt
is an external cell whose address is awakened once every 4 seconds
by the clock interrupt routine.
.PP
The routines
.I "spl4( ), spl5( ), spl6( ), spl7( )"
are available to
set the processor priority level as indicated to avoid
inconvenient interrupts from the device.
.PP
If a device needs to know about real-time intervals,
then
.I "timeout(func, arg, interval)"
will be useful.
This routine arranges that after
.I interval
sixtieths of a second, the
.I func
will be called with
.I arg
as argument, in the style
.I "(*func)(arg)."
Timeouts are used, for example,
to provide real-time delays after function characters
like new-line and tab in typewriter output,
and to terminate an attempt to
read the 201 Dataphone
.I dp
if there is no response within a specified number
of seconds.
Notice that the number of sixtieths of a second is limited to 32767,
since it must appear to be positive,
and that only a bounded number of timeouts
can be going on at once.
Also, the specified
.I func
is called at clock-interrupt time, so it should
conform to the requirements of interrupt routines
in general.
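.PP
The limit on the interval can be made concrete with a small conversion
routine.
This is a sketch with hypothetical names; it only computes the interval
argument and does not call
.I timeout
itself.

```c
#define TICKS_PER_SEC 60      /* the clock ticks in sixtieths of a second */
#define MAXTICKS      32767   /* largest interval that still appears positive */

/* Convert whole seconds to an interval, or return -1 if it cannot be represented. */
int seconds_to_ticks(long seconds)
{
    if (seconds < 0 || seconds > MAXTICKS / TICKS_PER_SEC)
        return -1;
    return (int)(seconds * TICKS_PER_SEC);
}
```

The longest whole-second delay expressible this way is 546 seconds,
a little over nine minutes.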
.SH
The Block-device Interface
.PP
Handling of block devices is mediated by a collection
of routines that manage a set of buffers containing
the images of blocks of data on the various devices.
The most important purpose of these routines is to assure
that several processes that access the same block of the same
device in multiprogrammed fashion maintain a consistent
view of the data in the block.
A secondary but still important purpose is to increase
the efficiency of the system by
keeping in-core copies of blocks that are being
accessed frequently.
The main data base for this mechanism is the
table of buffers
.I buf.
Each buffer header contains a pair of pointers
.I "(b_forw, b_back)"
which maintain a doubly-linked list
of the buffers associated with a particular
block device, and a
pair of pointers
.I "(av_forw, av_back)"
which generally maintain a doubly-linked list of blocks
which are ``free,'' that is,
eligible to be reallocated for another transaction.
Buffers that have I/O in progress
or are busy for other purposes do not appear in this list.
The buffer header
also contains the device and block number to which the
buffer refers, and a pointer to the actual storage associated with
the buffer.
There is a word count
which is the negative of the number of words
to be transferred to or from the buffer;
there is also an error byte and a residual word
count used to communicate information
from an I/O routine to its caller.
Finally, there is a flag word
with bits indicating the status of the buffer.
These flags will be discussed below.
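.PP
The buffer header just described might be declared along the following lines.
This is a sketch, not the system's declaration: only the link names and
b_error come from this paper, the other field names are assumptions, and the
helper assumes 16-bit words and 512-byte blocks.

```c
struct buf_sketch {
    int   b_flags;                        /* status bits */
    struct buf_sketch *b_forw, *b_back;   /* buffers associated with one device */
    struct buf_sketch *av_forw, *av_back; /* free-list links */
    int   b_dev;                          /* device to which the buffer refers */
    long  b_blkno;                        /* block number on that device */
    char *b_addr;                         /* storage associated with the buffer */
    int   b_wcount;                       /* minus the number of words to move */
    char  b_error;                        /* error byte */
    int   b_resid;                        /* residual word count */
};

#define BLKSIZE   512   /* assumed bytes per block */
#define WORDBYTES 2     /* assumed bytes per word */

/* Set the word count for a transfer of one full block. */
void set_full_block(struct buf_sketch *bp)
{
    bp->b_wcount = -(BLKSIZE / WORDBYTES);
    bp->b_resid = 0;
}
```

Under these assumptions a full-block transfer carries a word count of minus 256.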
.PP
Seven routines constitute
the most important part of the interface with the
rest of the system.
Given a device and block number,
both
.I bread
and
.I getblk
return a pointer to a buffer header for the block;
the difference is that
.I bread
is guaranteed to return a buffer actually containing the
current data for the block,
while
.I getblk
returns a buffer which contains the data in the
block only if it is already in core (whether it is
or not is indicated by the
.I B_DONE
bit; see below).
In either case the buffer, and the corresponding
device block, is made ``busy,''
so that other processes referring to it
are obliged to wait until it becomes free.
.I Getblk
is used, for example,
when a block is about to be totally rewritten,
so that its previous contents are
not useful;
still, no other process can be allowed to refer to the block
until the new data is placed into it.
.PP
The
.I breada
routine is used to implement read-ahead.
It is logically similar to
.I bread,
but takes as an additional argument the number of
a block (on the same device) to be read asynchronously
after the specifically requested block is available.
.PP
Given a pointer to a buffer,
the
.I brelse
routine
makes the buffer again available to other processes.
It is called, for example, after
data has been extracted following a
.I bread.
There are three subtly-different write routines,
all of which take a buffer pointer as argument,
and all of which logically release the buffer for
use by others and place it on the free list.
.I Bwrite
puts the
buffer on the appropriate device queue,
waits for the write to be done,
and sets the user's error flag if required.
.I Bawrite
places the buffer on the device's queue, but does not wait
for completion, so that errors cannot be reflected directly to
the user.
.I Bdwrite
does not start any I/O operation at all,
but merely marks
the buffer so that if it happens
to be grabbed from the free list to contain
data from some other block, the data in it will
first be written
out.
.PP
.I Bwrite
is used when one wants to be sure that
I/O takes place correctly, and that
errors are reflected to the proper user;
it is used, for example, when updating i-nodes.
.I Bawrite
is useful when more overlap is desired
(because no wait is required for I/O to finish)
but when it is reasonably certain that the
write is really required.
.I Bdwrite
is used when there is doubt that the write is
needed at the moment.
For example,
.I bdwrite
is called when the last byte of a
.I write
system call falls short of the end of a
block, on the assumption that
another
.I write
will be given soon which will re-use the same block.
On the other hand,
as the end of a block is passed,
.I bawrite
is called, since probably the block will
not be accessed again soon and one might as
well start the writing process as soon as possible.
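.PP
The choice between a delayed and an asynchronous write for the block that
holds the last byte of a
.I write
can be sketched as a small decision function.
The names are hypothetical, a 512-byte block is assumed, and the function
only reports the choice; it does not call either routine.

```c
#define BLKSIZE 512L   /* assumed block size in bytes */

enum wkind { DELAYED_WRITE, ASYNC_WRITE };   /* bdwrite versus bawrite */

/*
 * Decide how to release the block containing the last byte written.
 * offset is the starting position in the file; count (> 0) is the length.
 */
enum wkind release_policy(long offset, long count)
{
    long end = offset + count;   /* position just past the last byte written */

    /* Reaching a block boundary means the block is complete: start the write. */
    return (end % BLKSIZE == 0) ? ASYNC_WRITE : DELAYED_WRITE;
}
```

A 100-byte write at offset 0 stops short of the block's end and is delayed;
a 12-byte write at offset 500 reaches the boundary and is started at once.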
.PP
In any event, notice that the routines
.I "getblk"
and
.I bread
dedicate the given block exclusively to the
use of the caller, and make others wait,
while one of
.I "brelse, bwrite, bawrite,"
or
.I bdwrite
must eventually be called to free the block for use by others.
.PP
As mentioned, each buffer header contains a flag
word which indicates the status of the buffer.
Since they provide
one important channel for information between the drivers and the
block I/O system, it is important to understand these flags.
The following names are manifest constants which
select the associated flag bits.
.IP B_READ 10
This bit is set when the buffer is handed to the device strategy routine
(see below) to indicate a read operation.
The symbol
.I B_WRITE
is defined as 0 and does not define a flag; it is provided
as a mnemonic convenience to callers of routines like
.I swap
which have a separate argument
which indicates read or write.
.IP B_DONE 10
This bit is set
to 0 when a block is handed to the device strategy
routine and is turned on when the operation completes,
whether normally or as the result of an error.
It is also used as part of the return argument of
.I getblk
to indicate, if 1, that the returned
buffer actually contains the data in the requested block.
.IP B_ERROR 10
This bit may be set to 1 when
.I B_DONE
is set to indicate that an I/O or other error occurred.
If it is set the
.I b_error
byte of the buffer header may contain an error code
if it is non-zero.
If
.I b_error
is 0 the nature of the error is not specified.
Actually no driver at present sets
.I b_error;
the latter is provided for a future improvement
whereby a more detailed error-reporting
scheme may be implemented.
.IP B_BUSY 10
This bit indicates that the buffer header is not on
the free list, i.e. is
dedicated to someone's exclusive use.
The buffer still remains attached to the list of
blocks associated with its device, however.
When
.I getblk
(or
.I bread,
which calls it) searches the buffer list
for a given device and finds the requested
block with this bit on, it sleeps until the bit
clears.
.IP B_PHYS 10
This bit is set for raw I/O transactions that
need to allocate the Unibus map on an 11/70.
.IP B_MAP 10
This bit is set on buffers that have the Unibus map allocated,
so that the
.I iodone
routine knows to deallocate the map.
.IP B_WANTED 10
This flag is used in conjunction with the
.I B_BUSY
bit.
Before sleeping as described
just above,
.I getblk
sets this flag.
Conversely, when the block is freed and the busy bit
goes down (in
.I brelse)
a
.I wakeup
is given for the block header whenever
.I B_WANTED
is on.
This stratagem avoids the overhead
of having to call
.I wakeup
every time a buffer is freed on the chance that someone
might want it.
.IP B_AGE 10
This bit may be set on buffers just before releasing them; if it
is on,
the buffer is placed at the head of the free list, rather than at the
tail.
It is a performance heuristic
used when the caller judges that the same block will not soon be used again.
.IP B_ASYNC 10
This bit is set by
.I bawrite
to indicate to the appropriate device driver
that the buffer should be released when the
write has been finished, usually at interrupt time.
The difference between
.I bwrite
and
.I bawrite
is that the former starts I/O, waits until it is done, and
frees the buffer.
The latter merely sets this bit and starts I/O.
The bit indicates that
.I brelse
should be called for the buffer on completion.
.IP B_DELWRI 10
This bit is set by
.I bdwrite
before releasing the buffer.
When
.I getblk,
while searching for a free block,
discovers the bit is 1 in a buffer it would otherwise grab,
it causes the block to be written out before reusing it.
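.PP
The interplay of the busy, wanted and age bits when a buffer is released
can be sketched as follows.
The bit values shown are illustrative assumptions rather than values taken
from this paper, wakeup_sketch is a stand-in that merely counts calls, and
brelse_flags models only the flag handling, not the free-list manipulation.

```c
#define B_WRITE  0       /* not a flag bit; a mnemonic for "not B_READ" */
#define B_READ   01      /* all bit values here are illustrative */
#define B_DONE   02
#define B_ERROR  04
#define B_BUSY   010
#define B_PHYS   020
#define B_MAP    040
#define B_WANTED 0100
#define B_AGE    0200
#define B_ASYNC  0400
#define B_DELWRI 01000

int wakeups;   /* number of calls to the stand-in wakeup */

void wakeup_sketch(void *event)
{
    (void)event;
    wakeups++;
}

/*
 * Flag handling on release.  Returns 1 if the buffer belongs at the head
 * of the free list (B_AGE was set), 0 if it belongs at the tail.
 */
int brelse_flags(int *flags)
{
    int at_head = (*flags & B_AGE) != 0;

    if (*flags & B_WANTED)          /* wake sleepers only if someone asked */
        wakeup_sketch(flags);
    *flags &= ~(B_WANTED | B_BUSY | B_ASYNC | B_AGE);
    return at_head;
}
```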
.SH
Block Device Drivers
.PP
The
.I bdevsw
table contains the names of the interface routines
and that of a table for each block device.
.PP
Just as for character devices, block device drivers may supply
an
.I open
and a
.I close
routine
called respectively on each open and on the final close
of the device.
Instead of separate read and write routines,
each block device driver has a
.I strategy
routine which is called with a pointer to a buffer
header as argument.
As discussed, the buffer header contains
a read/write flag, the core address,
the block number, a (negative) word count,
and the major and minor device number.
The role of the strategy routine
is to carry out the operation as requested by the
information in the buffer header.
When the transaction is complete the
.I B_DONE
(and possibly the
.I B_ERROR)
bits should be set.
Then if the
.I B_ASYNC
bit is set,
.I brelse
should be called;
otherwise,
.I wakeup.
In cases where the device
is capable, under error-free operation,
of transferring fewer words than requested,
the device's word-count register should be placed
in the residual count slot of
the buffer header;
otherwise, the residual count should be set to 0.
This particular mechanism is really for the benefit
of the magtape driver;
when reading this device
records shorter than requested are quite normal,
and the user should be told the actual length of the record.
.PP
Although the most usual argument
to the strategy routines
is a genuine buffer header allocated as discussed above,
all that is actually required
is that the argument be a pointer to a place containing the
appropriate information.
For example the
.I swap
routine, which manages movement
of core images to and from the swapping device,
uses the strategy routine
for this device.
Care has to be taken that
no extraneous bits get turned on in the
flag word.
.PP
The device's table specified by
.I bdevsw
has a
byte to contain an active flag and an error count,
a pair of links which constitute the
head of the chain of buffers for the device
.I "(b_forw, b_back),"
and a first and last pointer for a device queue.
Of these things, all are used solely by the device driver
itself
except for the buffer-chain pointers.
Typically the flag encodes the state of the
device, and is used at a minimum to
indicate that the device is currently engaged in
transferring information and no new command should be issued.
The error count is useful for counting retries
when errors occur.
The device queue is used to remember stacked requests;
in the simplest case it may be maintained as a first-in
first-out list.
Since buffers which have been handed over to
the strategy routines are never
on the list of free buffers,
the pointers in the buffer which maintain the free list
.I "(av_forw, av_back)"
are also used to contain the pointers
which maintain the device queues.
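.PP
In the simplest first-in first-out case, the device queue can be kept as a
singly-linked list threaded through the free-list pointer.
The sketch below assumes the field names d_active, d_actf and d_actl for the
device table and omits everything else; the function names are hypothetical.

```c
#include <stddef.h>

struct buf_sketch {
    struct buf_sketch *av_forw;     /* doubles as the device-queue link */
    long b_blkno;                   /* block number of the request */
};

struct devtab_sketch {
    char d_active;                  /* non-zero while a transfer is in progress */
    struct buf_sketch *d_actf;      /* first request in the device queue */
    struct buf_sketch *d_actl;      /* last request in the device queue */
};

/* Append a request to the tail of the device queue. */
void dq_append(struct devtab_sketch *dp, struct buf_sketch *bp)
{
    bp->av_forw = NULL;
    if (dp->d_actf == NULL)
        dp->d_actf = bp;
    else
        dp->d_actl->av_forw = bp;
    dp->d_actl = bp;
}

/* Remove and return the request at the head, or NULL if the queue is empty. */
struct buf_sketch *dq_remove(struct devtab_sketch *dp)
{
    struct buf_sketch *bp = dp->d_actf;

    if (bp != NULL)
        dp->d_actf = bp->av_forw;
    return bp;
}
```

A strategy routine would typically append the request and start the device
if it is idle; the interrupt routine would remove the finished request and
start the next one.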
.PP
A couple of routines
are provided which are useful to block device drivers.
.I "iodone(bp)"
arranges that the buffer to which
.I bp
points be released or awakened,
as appropriate,
when the
strategy module has finished with the buffer,
either normally or after an error.
(In the latter case the
.I B_ERROR
bit has presumably been set.)
.PP
The routine
.I "geterror(bp)"
can be used to examine the error bit in a buffer header
and arrange that any error indication found therein is
reflected to the user.
It may be called only in the non-interrupt
part of a driver when I/O has completed
.I (B_DONE
has been set).
.SH
Raw Block-device I/O
.PP
A scheme has been set up whereby block device drivers may
provide the ability to transfer information
directly between the user's core image and the device
without the use of buffers and in blocks as large as
the caller requests.
The method involves setting up a character-type special file
corresponding to the raw device
and providing
.I read
and
.I write
routines which set up what is usually a private,
non-shared buffer header with the appropriate information
and call the device's strategy routine.
If desired, separate
.I open
and
.I close
routines may be provided but this is usually unnecessary.
A special-function routine might come in handy, especially for
magtape.
.PP
A great deal of work has to be done to generate the
``appropriate information''
to put in the argument buffer for
the strategy module;
the worst part is to map relocated user addresses to physical addresses.
Most of this work is done by
.I "physio(strat, bp, dev, rw)"
whose arguments are the name of the
strategy routine
.I strat,
the buffer pointer
.I bp,
the device number
.I dev,
and a read-write flag
.I rw
whose value is either
.I B_READ
or
.I B_WRITE.
.I Physio
makes sure that the user's base address and count are
even (because most devices work in words)
and that the core area affected is contiguous
in physical space;
it delays until the buffer is not busy, and makes it
busy while the operation is in progress;
and it sets up user error return information.
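.PP
The first of these checks, that the base address and count are both even,
can be sketched as below.
The function name is hypothetical, and the contiguity check, the address
mapping and the busy handling are not shown.

```c
/* Returns 1 if a raw transfer may proceed, 0 if it must be rejected. */
int raw_args_ok(unsigned long base, unsigned int count)
{
    /* Both the user's base address and the byte count must be even. */
    return (base & 1) == 0 && (count & 1) == 0;
}
```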