/share/doc/psd/03.iosys/iosys

  1. .\" Copyright (C) Caldera International Inc. 2001-2002. All rights reserved.
  2. .\"
  3. .\" Redistribution and use in source and binary forms, with or without
  4. .\" modification, are permitted provided that the following conditions are
  5. .\" met:
  6. .\"
  7. .\" Redistributions of source code and documentation must retain the above
  8. .\" copyright notice, this list of conditions and the following
  9. .\" disclaimer.
  10. .\"
  11. .\" Redistributions in binary form must reproduce the above copyright
  12. .\" notice, this list of conditions and the following disclaimer in the
  13. .\" documentation and/or other materials provided with the distribution.
  14. .\"
  15. .\" All advertising materials mentioning features or use of this software
  16. .\" must display the following acknowledgement:
  17. .\"
  18. .\" This product includes software developed or owned by Caldera
  19. .\" International, Inc. Neither the name of Caldera International, Inc.
  20. .\" nor the names of other contributors may be used to endorse or promote
  21. .\" products derived from this software without specific prior written
  22. .\" permission.
  23. .\"
  24. .\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
  25. .\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
  26. .\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  27. .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  28. .\" DISCLAIMED. IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
  29. .\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
  30. .\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  31. .\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
  32. .\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  33. .\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
  34. .\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
  35. .\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  36. .\"
  37. .\" @(#)iosys 8.1 (Berkeley) 6/8/93
  38. .\"
  39. .\" $FreeBSD$
  40. .EH 'PSD:3-%''The UNIX I/O System'
  41. .OH 'The UNIX I/O System''PSD:3-%'
  42. .TL
  43. The UNIX I/O System
  44. .AU
  45. Dennis M. Ritchie
  46. .AI
  47. AT&T Bell Laboratories
  48. Murray Hill, NJ
  49. .PP
  50. This paper gives an overview of the workings of the UNIX\(dg
  51. .FS
  52. \(dgUNIX is a Trademark of Bell Laboratories.
  53. .FE
  54. I/O system.
  55. It was written with an eye toward providing
  56. guidance to writers of device driver routines,
  57. and is oriented more toward describing the environment
  58. and nature of device drivers than the implementation
  59. of that part of the file system which deals with
  60. ordinary files.
  61. .PP
  62. It is assumed that the reader has a good knowledge
  63. of the overall structure of the file system as discussed
  64. in the paper ``The UNIX Time-sharing System.''
  65. A more detailed discussion
  66. appears in
  67. ``UNIX Implementation;''
  68. the current document restates parts of that one,
  69. but is still more detailed.
  70. It is most useful in
  71. conjunction with a copy of the system code,
  72. since it is basically an exegesis of that code.
  73. .SH
  74. Device Classes
  75. .PP
  76. There are two classes of device:
  77. .I block
  78. and
  79. .I character.
  80. The block interface is suitable for devices
  81. like disks, tapes, and DECtape
  82. which work, or can work, with addressable 512-byte blocks.
  83. Ordinary magnetic tape just barely fits in this category,
  84. since by use of forward
  85. and
  86. backward spacing any block can be read, even though
  87. blocks can be written only at the end of the tape.
  88. Block devices can at least potentially contain a mounted
  89. file system.
  90. The interface to block devices is very highly structured;
  91. the drivers for these devices share a great many routines
  92. as well as a pool of buffers.
  93. .PP
  94. Character-type devices have a much
  95. more straightforward interface, although
  96. more work must be done by the driver itself.
  97. .PP
  98. Devices of both types are named by a
  99. .I major
  100. and a
  101. .I minor
  102. device number.
  103. These numbers are generally stored as an integer
  104. with the minor device number
  105. in the low-order 8 bits and the major device number
  106. in the next-higher 8 bits;
  107. macros
  108. .I major
  109. and
  110. .I minor
  111. are available to access these numbers.
  112. The major device number selects which driver will deal with
  113. the device; the minor device number is not used
  114. by the rest of the system but is passed to the
  115. driver at appropriate times.
  116. Typically the minor number
  117. selects a subdevice attached to
  118. a given controller, or one of
  119. several similar hardware interfaces.
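The packing of major and minor numbers described above can be sketched in C as follows. This is only an illustration under the stated 8-bit/8-bit layout; the names dev_major, dev_minor, and make_dev are hypothetical and are not the system's own macros.

```c
/* Illustrative only: 8-bit minor in the low byte, 8-bit major above it. */
#define DEV_MINOR_BITS 8
#define DEV_FIELD_MASK 0377

/* Extract the major number, which selects the driver. */
int dev_major(int dev)
{
    return (dev >> DEV_MINOR_BITS) & DEV_FIELD_MASK;
}

/* Extract the minor number, which is passed through to the driver. */
int dev_minor(int dev)
{
    return dev & DEV_FIELD_MASK;
}

/* Hypothetical helper: combine a major and minor number into one integer. */
int make_dev(int maj, int min)
{
    return ((maj & DEV_FIELD_MASK) << DEV_MINOR_BITS) | (min & DEV_FIELD_MASK);
}
```

With this layout, major 3 and minor 5 pack into the value 0x0305.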
  120. .PP
  121. The major device numbers for block and character devices
  122. are used as indices in separate tables;
  123. they both start at 0 and therefore overlap.
  124. .SH
  125. Overview of I/O
  126. .PP
  127. The purpose of
  128. the
  129. .I open
  130. and
  131. .I creat
  132. system calls is to set up entries in three separate
  133. system tables.
  134. The first of these is the
  135. .I u_ofile
  136. table,
  137. which is stored in the system's per-process
  138. data area
  139. .I u.
  140. This table is indexed by
  141. the file descriptor returned by the
  142. .I open
  143. or
  144. .I creat,
  145. and is accessed during
  146. a
  147. .I read,
  148. .I write,
  149. or other operation on the open file.
  150. An entry contains only
  151. a pointer to the corresponding
  152. entry of the
  153. .I file
  154. table,
  155. which is a per-system data base.
  156. There is one entry in the
  157. .I file
  158. table for each
  159. instance of
  160. .I open
  161. or
  162. .I creat.
  163. This table is per-system because the same instance
  164. of an open file must be shared among the several processes
  165. which can result from
  166. .I forks
  167. after the file is opened.
  168. A
  169. .I file
  170. table entry contains
  171. flags which indicate whether the file
  172. was open for reading or writing or is a pipe, and
  173. a count which is used to decide when all processes
  174. using the entry have terminated or closed the file
  175. (so the entry can be abandoned).
  176. There is also a 32-bit file offset
  177. which is used to indicate where in the file the next read
  178. or write will take place.
  179. Finally, there is a pointer to the
  180. entry for the file in the
  181. .I inode
  182. table,
  183. which contains a copy of the file's i-node.
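The chain from descriptor to file table entry to in-core i-node might be pictured as below. This is a sketch, not the system's declarations: the field names, the flag values, the NOFILE limit of 20, and the helpers share_files and close_fd are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NOFILE 20          /* assumed per-process descriptor limit */
#define FREAD  01          /* assumed flag values */
#define FWRITE 02
#define FPIPE  04

struct inode {             /* in-core copy of an i-node (heavily abridged) */
    int i_count;
    int i_dev;
    int i_number;
};

struct file {              /* one entry per instance of open or creat */
    int           f_flag;    /* read / write / pipe */
    int           f_count;   /* number of descriptors sharing this entry */
    int32_t       f_offset;  /* 32-bit position of the next read or write */
    struct inode *f_inode;
};

struct user {
    struct file *u_ofile[NOFILE];  /* indexed by file descriptor */
};

/* What a fork must do: the child shares the parent's file entries. */
void share_files(const struct user *parent, struct user *child)
{
    for (int i = 0; i < NOFILE; i++) {
        child->u_ofile[i] = parent->u_ofile[i];
        if (child->u_ofile[i] != NULL)
            child->u_ofile[i]->f_count++;
    }
}

/* Drop one reference; returns 1 when the entry can be abandoned. */
int close_fd(struct user *u, int fd)
{
    struct file *fp = u->u_ofile[fd];

    u->u_ofile[fd] = NULL;
    if (fp == NULL)
        return 0;
    return --fp->f_count == 0;
}
```

Because parent and child point at the same file entry, a read or write in one process moves the offset seen by the other, which is the reason the table is per-system.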
  184. .PP
  185. Certain open files can be designated ``multiplexed''
  186. files, and several other flags apply to such
  187. channels.
  188. In such a case, instead of an offset,
  189. there is a pointer to an associated multiplex channel table.
  190. Multiplex channels will not be discussed here.
  191. .PP
  192. An entry in the
  193. .I file
  194. table corresponds precisely to an instance of
  195. .I open
  196. or
  197. .I creat;
  198. if the same file is opened several times,
  199. it will have several
  200. entries in this table.
  201. However,
  202. there is at most one entry
  203. in the
  204. .I inode
  205. table for a given file.
  206. Also, a file may enter the
  207. .I inode
  208. table not only because it is open,
  209. but also because it is the current directory
  210. of some process or because it
  211. is a special file containing a currently-mounted
  212. file system.
  213. .PP
  214. An entry in the
  215. .I inode
  216. table differs somewhat from the
  217. corresponding i-node as stored on the disk;
  218. the modified and accessed times are not stored,
  219. and the entry is augmented
  220. by a flag word containing information about the entry,
  221. a count used to determine when it may be
  222. allowed to disappear,
  223. and the device and i-number
  224. whence the entry came.
  225. Also, the several block numbers that give addressing
  226. information for the file are expanded from
  227. the 3-byte, compressed format used on the disk to full
  228. .I long
  229. quantities.
  230. .PP
  231. During the processing of an
  232. .I open
  233. or
  234. .I creat
  235. call for a special file,
  236. the system always calls the device's
  237. .I open
  238. routine to allow for any special processing
  239. required (rewinding a tape, turning on
  240. the data-terminal-ready lead of a modem, etc.).
  241. However,
  242. the
  243. .I close
  244. routine is called only when the last
  245. process closes a file,
  246. that is, when the i-node table entry
  247. is being deallocated.
  248. Thus it is not feasible
  249. for a device to maintain, or depend on,
  250. a count of its users, although it is quite
  251. possible to
  252. implement an exclusive-use device which cannot
  253. be reopened until it has been closed.
  254. .PP
  255. When a
  256. .I read
  257. or
  258. .I write
  259. takes place,
  260. the user's arguments
  261. and the
  262. .I file
  263. table entry are used to set up the
  264. variables
  265. .I u.u_base,
  266. .I u.u_count,
  267. and
  268. .I u.u_offset
  269. which respectively contain the (user) address
  270. of the I/O target area, the byte-count for the transfer,
  271. and the current location in the file.
  272. If the file referred to is
  273. a character-type special file, the appropriate read
  274. or write routine is called; it is responsible
  275. for transferring data and updating the
  276. count and current location appropriately
  277. as discussed below.
  278. Otherwise, the current location is used to calculate
  279. a logical block number in the file.
  280. If the file is an ordinary file the logical block
  281. number must be mapped (possibly using indirect blocks)
  282. to a physical block number; a block-type
  283. special file need not be mapped.
  284. This mapping is performed by the
  285. .I bmap
  286. routine.
  287. In any event, the resulting physical block number
  288. is used, as discussed below, to
  289. read or write the appropriate device.
  290. .SH
  291. Character Device Drivers
  292. .PP
  293. The
  294. .I cdevsw
  295. table specifies the interface routines present for
  296. character devices.
  297. Each device provides five routines:
  298. open, close, read, write, and special-function
  299. (to implement the
  300. .I ioctl
  301. system call).
  302. Any of these may be missing.
  303. If a call on the routine
  304. should be ignored,
  305. (e.g.
  306. .I open
  307. on non-exclusive devices that require no setup)
  308. the
  309. .I cdevsw
  310. entry can be given as
  311. .I nulldev;
  312. if it should be considered an error,
  313. (e.g.
  314. .I write
  315. on read-only devices)
  316. .I nodev
  317. is used.
  318. For terminals,
  319. the
  320. .I cdevsw
  321. structure also contains a pointer to the
  322. .I tty
  323. structure associated with the terminal.
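A switch table of this kind could look roughly like the following. The sketch simplifies heavily: every routine is given the same two-argument signature, u_error stands in for the per-user error cell, and the field names, the device ro_read, and the dispatch helpers are hypothetical.

```c
#include <errno.h>

int u_error;                       /* stand-in for the per-user error cell */

typedef int (*devfn)(int dev, int arg);

struct cdevsw {
    devfn d_open;
    devfn d_close;
    devfn d_read;
    devfn d_write;
    devfn d_sgtty;                 /* the special-function routine */
};

/* Call is silently ignored. */
int nulldev(int dev, int arg)
{
    (void)dev; (void)arg;
    return 0;
}

/* Call is an error. */
int nodev(int dev, int arg)
{
    (void)dev; (void)arg;
    u_error = ENODEV;
    return -1;
}

/* Hypothetical read routine of a read-only device. */
int ro_read(int dev, int arg)
{
    (void)dev; (void)arg;
    return 0;
}

struct cdevsw cdevsw[] = {
    /* major 0: needs no setup on open, cannot be written */
    { nulldev, nulldev, ro_read, nodev, nodev },
};

/* Dispatch through the table using the major number (bits 8..15). */
int cdev_open(int dev, int flag)
{
    return (*cdevsw[(dev >> 8) & 0377].d_open)(dev, flag);
}

int cdev_write(int dev)
{
    return (*cdevsw[(dev >> 8) & 0377].d_write)(dev, 0);
}
```

Opening minor device 1 of major 0 succeeds without doing anything, while a write to it fails and leaves ENODEV in u_error.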
  324. .PP
  325. The
  326. .I open
  327. routine is called each time the file
  328. is opened with the full device number as argument.
  329. The second argument is a flag which is
  330. non-zero only if the device is to be written upon.
  331. .PP
  332. The
  333. .I close
  334. routine is called only when the file
  335. is closed for the last time,
  336. that is, when the very last process in
  337. which the file is open closes it.
  338. This means it is not possible for the driver to
  339. maintain its own count of its users.
  340. The first argument is the device number;
  341. the second is a flag which is non-zero
  342. if the file was open for writing in the process which
  343. performs the final
  344. .I close.
  345. .PP
  346. When
  347. .I write
  348. is called, it is supplied the device
  349. as argument.
  350. The per-user variable
  351. .I u.u_count
  352. has been set to
  353. the number of characters indicated by the user;
  354. for character devices, this number may be 0
  355. initially.
  356. .I u.u_base
  357. is the address supplied by the user from which to start
  358. taking characters.
  359. The system may call the
  360. routine internally, so the
  361. flag
  362. .I u.u_segflg
  363. is supplied that indicates,
  364. if
  365. .I on,
  366. that
  367. .I u.u_base
  368. refers to the system address space instead of
  369. the user's.
  370. .PP
  371. The
  372. .I write
  373. routine
  374. should copy up to
  375. .I u.u_count
  376. characters from the user's buffer to the device,
  377. decrementing
  378. .I u.u_count
  379. for each character passed.
  380. For most drivers, which work one character at a time,
  381. the routine
  382. .I "cpass( )"
  383. is used to pick up characters
  384. from the user's buffer.
  385. Successive calls on it return
  386. the characters to be written until
  387. .I u.u_count
  388. goes to 0 or an error occurs,
  389. when it returns \(mi1.
  390. .I Cpass
  391. takes care of interrogating
  392. .I u.u_segflg
  393. and updating
  394. .I u.u_count.
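The character-at-a-time write loop can be illustrated in user space as below. This is a simulation under stated assumptions: the cpass shown here is a simplified stand-in that ignores u.u_segflg, the struct user is cut down to two fields, and xxwrite, dev_out, and dev_len are hypothetical names for a driver routine and its output sink.

```c
struct user {
    const char *u_base;   /* where to take the next character from */
    int         u_count;  /* characters still to be transferred */
};

struct user u;

/* Simplified stand-in for cpass(): next character, or -1 when done. */
int cpass(void)
{
    if (u.u_count == 0)
        return -1;
    u.u_count--;
    return (unsigned char)*u.u_base++;
}

char dev_out[64];         /* pretend device: characters "sent" so far */
int  dev_len;

/* Hypothetical driver write routine working one character at a time. */
void xxwrite(int dev)
{
    int c;

    (void)dev;
    /* Check for room first so no character is consumed and then lost. */
    while (dev_len < (int)sizeof dev_out && (c = cpass()) >= 0)
        dev_out[dev_len++] = (char)c;
}
```

After a call with u.u_count set to 5, the loop has moved five characters and u.u_count has reached 0.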
  395. .PP
  396. Write routines which want to transfer
  397. a probably large number of characters into an internal
  398. buffer may also use the routine
  399. .I "iomove(buffer, offset, count, flag)"
  400. which is faster when many characters must be moved.
  401. .I Iomove
  402. transfers up to
  403. .I count
  404. characters into the
  405. .I buffer
  406. starting
  407. .I offset
  408. bytes from the start of the buffer;
  409. .I flag
  410. should be
  411. .I B_WRITE
  412. (which is 0) in the write case.
  413. Caution:
  414. the caller is responsible for making sure
  415. the count is not too large and is non-zero.
  416. As an efficiency note,
  417. .I iomove
  418. is much slower if any of
  419. .I "buffer+offset, count"
  420. or
  421. .I u.u_base
  422. is odd.
  423. .PP
  424. The device's
  425. .I read
  426. routine is called under conditions similar to
  427. .I write,
  428. except that
  429. .I u.u_count
  430. is guaranteed to be non-zero.
  431. To return characters to the user, the routine
  432. .I "passc(c)"
  433. is available; it takes care of housekeeping
  434. like
  435. .I cpass
  436. and returns \(mi1 as the last character
  437. specified by
  438. .I u.u_count
  439. is returned to the user;
  440. before that time, 0 is returned.
  441. .I Iomove
  442. is also usable as with
  443. .I write;
  444. the flag should be
  445. .I B_READ
  446. but the same cautions apply.
  447. .PP
  448. The ``special-functions'' routine
  449. is invoked by the
  450. .I stty
  451. and
  452. .I gtty
  453. system calls as follows:
  454. .I "(*p) (dev, v)"
  455. where
  456. .I p
  457. is a pointer to the device's routine,
  458. .I dev
  459. is the device number,
  460. and
  461. .I v
  462. is a vector.
  463. In the
  464. .I gtty
  465. case,
  466. the device is supposed to place up to 3 words of status information
  467. into the vector; this will be returned to the caller.
  468. In the
  469. .I stty
  470. case,
  471. .I v
  472. is 0;
  473. the device should take up to 3 words of
  474. control information from
  475. the array
  476. .I "u.u_arg[0...2]."
  477. .PP
  478. Finally, each device should have appropriate interrupt-time
  479. routines.
  480. When an interrupt occurs, it is turned into a C-compatible call
  481. on the device's interrupt routine.
  482. The interrupt-catching mechanism makes
  483. the low-order four bits of the ``new PS'' word in the
  484. trap vector for the interrupt available
  485. to the interrupt handler.
  486. This is conventionally used by drivers
  487. which deal with multiple similar devices
  488. to encode the minor device number.
  489. After the interrupt has been processed,
  490. a return from the interrupt handler will
  491. return from the interrupt itself.
  492. .PP
  493. A number of subroutines are available which are useful
  494. to character device drivers.
  495. Most of these handlers, for example, need a place
  496. to buffer characters in the internal interface
  497. between their ``top half'' (read/write)
  498. and ``bottom half'' (interrupt) routines.
  499. For relatively low data-rate devices, the best mechanism
  500. is the character queue maintained by the
  501. routines
  502. .I getc
  503. and
  504. .I putc.
  505. A queue header has the structure
  506. .DS
  507. struct {
  508. int c_cc; /* character count */
  509. char *c_cf; /* first character */
  510. char *c_cl; /* last character */
  511. } queue;
  512. .DE
  513. A character is placed on the end of a queue by
  514. .I "putc(c, &queue)"
  515. where
  516. .I c
  517. is the character and
  518. .I queue
  519. is the queue header.
  520. The routine returns \(mi1 if there is no space
  521. to put the character, 0 otherwise.
  522. The first character on the queue may be retrieved
  523. by
  524. .I "getc(&queue)"
  525. which returns either the (non-negative) character
  526. or \(mi1 if the queue is empty.
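The return conventions of the two queue routines can be demonstrated with a small stand-alone model. It is not the system's implementation: a fixed private ring replaces the shared character pool, the header layout differs from the one shown above, and the names q_putc, q_getc, and QSIZE are hypothetical.

```c
#define QSIZE 8               /* tiny on purpose; real space is shared */

struct queue {
    int  c_cc;                /* character count */
    int  head;                /* index of the first character */
    char buf[QSIZE];
};

/* Append c; returns -1 if there is no space, 0 otherwise. */
int q_putc(int c, struct queue *q)
{
    if (q->c_cc >= QSIZE)
        return -1;
    q->buf[(q->head + q->c_cc) % QSIZE] = (char)c;
    q->c_cc++;
    return 0;
}

/* Remove and return the first character, or -1 if the queue is empty. */
int q_getc(struct queue *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = (unsigned char)q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->c_cc--;
    return c;
}
```

A caller must test the result of the put routine, since a full queue is reported only through the -1 return.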
  527. .PP
  528. Notice that the space for characters in queues is
  529. shared among all devices in the system
  530. and in the standard system there are only some 600
  531. character slots available.
  532. Thus device handlers,
  533. especially write routines, must take
  534. care to avoid gobbling up excessive numbers of characters.
  535. .PP
  536. The other major help available
  537. to device handlers is the sleep-wakeup mechanism.
  538. The call
  539. .I "sleep(event, priority)"
  540. causes the process to wait (allowing other processes to run)
  541. until the
  542. .I event
  543. occurs;
  544. at that time, the process is marked ready-to-run
  545. and the call will return when there is no
  546. process with higher
  547. .I priority.
  548. .PP
  549. The call
  550. .I "wakeup(event)"
  551. indicates that the
  552. .I event
  553. has happened, that is, causes processes sleeping
  554. on the event to be awakened.
  555. The
  556. .I event
  557. is an arbitrary quantity agreed upon
  558. by the sleeper and the waker-up.
  559. By convention, it is the address of some data area used
  560. by the driver, which guarantees that events
  561. are unique.
  562. .PP
  563. Processes sleeping on an event should not assume
  564. that the event has really happened;
  565. they should check that the conditions which
  566. caused them to sleep no longer hold.
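The advice to re-test the condition after every wakeup amounts to sleeping inside a loop. The fragment below simulates this in user space; fake_sleep is a hypothetical stand-in that returns once without the condition holding and once with it, and the priority value 10 is arbitrary.

```c
int data_ready;      /* the condition the sleeper is really waiting for */
int wakeups_seen;    /* how many times the stand-in sleep has returned */

/* Stand-in for sleep(event, priority): the first return is "spurious",
   the second one coincides with the condition becoming true. */
void fake_sleep(void *event, int priority)
{
    (void)event; (void)priority;
    wakeups_seen++;
    if (wakeups_seen >= 2)
        data_ready = 1;
}

/* Sleep until data is ready; returns the number of times it slept. */
int wait_for_data(void)
{
    int slept = 0;

    while (!data_ready) {            /* re-test after every wakeup */
        fake_sleep(&data_ready, 10);
        slept++;
    }
    return slept;
}
```

Had the loop been an if, the caller would have continued after the first return with no data available.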
  567. .PP
  568. Priorities can range from 0 to 127;
  569. a higher numerical value indicates a less-favored
  570. scheduling situation.
  571. A distinction is made between processes sleeping
  572. at priority less than the parameter
  573. .I PZERO
  574. and those at numerically larger priorities.
  575. The former cannot
  576. be interrupted by signals, although it
  577. is conceivable that they may be swapped out.
  578. Thus it is a bad idea to sleep with
  579. priority less than PZERO on an event which might never occur.
  580. On the other hand, calls to
  581. .I sleep
  582. with larger priority
  583. may never return if the process is terminated by
  584. some signal in the meantime.
  585. Incidentally, it is a gross error to call
  586. .I sleep
  587. in a routine called at interrupt time, since the process
  588. which is running is almost certainly not the
  589. process which should go to sleep.
  590. Likewise, none of the variables in the user area
  591. ``\fIu\fB.\fR''
  592. should be touched, let alone changed, by an interrupt routine.
  593. .PP
  594. If a device driver
  595. wishes to wait for some event for which it is inconvenient
  596. or impossible to supply a
  597. .I wakeup,
  598. (for example, a device going on-line, which does not
  599. generally cause an interrupt),
  600. the call
  601. .I "sleep(&lbolt, priority)"
  602. may be given.
  603. .I Lbolt
  604. is an external cell whose address is awakened once every 4 seconds
  605. by the clock interrupt routine.
  606. .PP
  607. The routines
  608. .I "spl4( ), spl5( ), spl6( ), spl7( )"
  609. are available to
  610. set the processor priority level as indicated to avoid
  611. inconvenient interrupts from the device.
  612. .PP
  613. If a device needs to know about real-time intervals,
  614. then
  615. .I "timeout(func, arg, interval)"
  616. will be useful.
  617. This routine arranges that after
  618. .I interval
  619. sixtieths of a second, the
  620. .I func
  621. will be called with
  622. .I arg
  623. as argument, in the style
  624. .I "(*func)(arg)."
  625. Timeouts are used, for example,
  626. to provide real-time delays after function characters
  627. like new-line and tab in typewriter output,
  628. and to terminate an attempt to
  629. read the 201 Dataphone
  630. .I dp
  631. if there is no response within a specified number
  632. of seconds.
  633. Notice that the number of sixtieths of a second is limited to 32767,
  634. since it must appear to be positive,
  635. and that only a bounded number of timeouts
  636. can be going on at once.
  637. Also, the specified
  638. .I func
  639. is called at clock-interrupt time, so it should
  640. conform to the requirements of interrupt routines
  641. in general.
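The 32767 limit on the interval works out to about 546 seconds, or roughly nine minutes, at sixty ticks per second. A caller might guard against overflow along these lines; seconds_to_ticks and the constant names are hypothetical helpers, not part of the system.

```c
#define HZ        60       /* clock ticks per second, per the text */
#define MAX_TICKS 32767    /* largest interval that still appears positive */

/* Convert seconds to ticks; returns -1 if the interval cannot be expressed. */
int seconds_to_ticks(int seconds)
{
    long ticks = (long)seconds * HZ;

    if (seconds < 0 || ticks > MAX_TICKS)
        return -1;
    return (int)ticks;
}
```

Longer delays would have to be built from several shorter timeouts in succession.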
  642. .SH
  643. The Block-device Interface
  644. .PP
  645. Handling of block devices is mediated by a collection
  646. of routines that manage a set of buffers containing
  647. the images of blocks of data on the various devices.
  648. The most important purpose of these routines is to assure
  649. that several processes that access the same block of the same
  650. device in multiprogrammed fashion maintain a consistent
  651. view of the data in the block.
  652. A secondary but still important purpose is to increase
  653. the efficiency of the system by
  654. keeping in-core copies of blocks that are being
  655. accessed frequently.
  656. The main data base for this mechanism is the
  657. table of buffers
  658. .I buf.
  659. Each buffer header contains a pair of pointers
  660. .I "(b_forw, b_back)"
  661. which maintain a doubly-linked list
  662. of the buffers associated with a particular
  663. block device, and a
  664. pair of pointers
  665. .I "(av_forw, av_back)"
  666. which generally maintain a doubly-linked list of blocks
  667. which are ``free,'' that is,
  668. eligible to be reallocated for another transaction.
  669. Buffers that have I/O in progress
  670. or are busy for other purposes do not appear in this list.
  671. The buffer header
  672. also contains the device and block number to which the
  673. buffer refers, and a pointer to the actual storage associated with
  674. the buffer.
  675. There is a word count
  676. which is the negative of the number of words
  677. to be transferred to or from the buffer;
  678. there is also an error byte and a residual word
  679. count used to communicate information
  680. from an I/O routine to its caller.
  681. Finally, there is a flag word
  682. with bits indicating the status of the buffer.
  683. These flags will be discussed below.
  684. .PP
  685. Seven routines constitute
  686. the most important part of the interface with the
  687. rest of the system.
  688. Given a device and block number,
  689. both
  690. .I bread
  691. and
  692. .I getblk
  693. return a pointer to a buffer header for the block;
  694. the difference is that
  695. .I bread
  696. is guaranteed to return a buffer actually containing the
  697. current data for the block,
  698. while
  699. .I getblk
  700. returns a buffer which contains the data in the
  701. block only if it is already in core (whether it is
  702. or not is indicated by the
  703. .I B_DONE
  704. bit; see below).
  705. In either case the buffer, and the corresponding
  706. device block, is made ``busy,''
  707. so that other processes referring to it
  708. are obliged to wait until it becomes free.
  709. .I Getblk
  710. is used, for example,
  711. when a block is about to be totally rewritten,
  712. so that its previous contents are
  713. not useful;
  714. still, no other process can be allowed to refer to the block
  715. until the new data is placed into it.
  716. .PP
  717. The
  718. .I breada
  719. routine is used to implement read-ahead.
  720. It is logically similar to
  721. .I bread,
  722. but takes as an additional argument the number of
  723. a block (on the same device) to be read asynchronously
  724. after the specifically requested block is available.
  725. .PP
  726. Given a pointer to a buffer,
  727. the
  728. .I brelse
  729. routine
  730. makes the buffer again available to other processes.
  731. It is called, for example, after
  732. data has been extracted following a
  733. .I bread.
  734. There are three subtly-different write routines,
  735. all of which take a buffer pointer as argument,
  736. and all of which logically release the buffer for
  737. use by others and place it on the free list.
  738. .I Bwrite
  739. puts the
  740. buffer on the appropriate device queue,
  741. waits for the write to be done,
  742. and sets the user's error flag if required.
  743. .I Bawrite
  744. places the buffer on the device's queue, but does not wait
  745. for completion, so that errors cannot be reflected directly to
  746. the user.
  747. .I Bdwrite
  748. does not start any I/O operation at all,
  749. but merely marks
  750. the buffer so that if it happens
  751. to be grabbed from the free list to contain
  752. data from some other block, the data in it will
  753. first be written
  754. out.
  755. .PP
  756. .I Bwrite
  757. is used when one wants to be sure that
  758. I/O takes place correctly, and that
  759. errors are reflected to the proper user;
  760. it is used, for example, when updating i-nodes.
  761. .I Bawrite
  762. is useful when more overlap is desired
  763. (because no wait is required for I/O to finish)
  764. but when it is reasonably certain that the
  765. write is really required.
  766. .I Bdwrite
  767. is used when there is doubt that the write is
  768. needed at the moment.
  769. For example,
  770. .I bdwrite
  771. is called when the last byte of a
  772. .I write
  773. system call falls short of the end of a
  774. block, on the assumption that
  775. another
  776. .I write
  777. will be given soon which will re-use the same block.
  778. On the other hand,
  779. as the end of a block is passed,
  780. .I bawrite
  781. is called, since probably the block will
  782. not be accessed again soon and one might as
  783. well start the writing process as soon as possible.
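The choice between a delayed and an asynchronous write can be reduced to a test on where the data ends. The sketch below assumes 512-byte blocks and a transfer that has already been cut down to lie within a single block; choose_write and the enum names are hypothetical.

```c
#define BSIZE 512   /* block size assumed from the text */

enum wkind { DELAYED_WRITE, ASYNC_WRITE };

/* offset: byte position in the file where this piece of the write starts.
   count:  bytes in this piece, assumed not to cross a block boundary. */
enum wkind choose_write(long offset, int count)
{
    long end = offset + count;

    /* Ending exactly on a block boundary means the block is full:
       start the write now.  Otherwise expect more data soon and delay. */
    return (end % BSIZE == 0) ? ASYNC_WRITE : DELAYED_WRITE;
}
```

A 100-byte write at offset 0 is delayed, while a 100-byte write at offset 412 fills the block and is started at once.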
  784. .PP
  785. In any event, notice that the routines
  786. .I "getblk"
  787. and
  788. .I bread
  789. dedicate the given block exclusively to the
  790. use of the caller, and make others wait,
  791. while one of
  792. .I "brelse, bwrite, bawrite,"
  793. or
  794. .I bdwrite
  795. must eventually be called to free the block for use by others.
  796. .PP
  797. As mentioned, each buffer header contains a flag
  798. word which indicates the status of the buffer.
  799. Since they provide
  800. one important channel for information between the drivers and the
  801. block I/O system, it is important to understand these flags.
  802. The following names are manifest constants which
  803. select the associated flag bits.
  804. .IP B_READ 10
  805. This bit is set when the buffer is handed to the device strategy routine
  806. (see below) to indicate a read operation.
  807. The symbol
  808. .I B_WRITE
  809. is defined as 0 and does not define a flag; it is provided
  810. as a mnemonic convenience to callers of routines like
  811. .I swap
  812. which have a separate argument
  813. which indicates read or write.
  814. .IP B_DONE 10
  815. This bit is set
  816. to 0 when a block is handed to the device strategy
  817. routine and is turned on when the operation completes,
  818. whether normally or as the result of an error.
  819. It is also used as part of the return argument of
  820. .I getblk
  821. to indicate, if 1, that the returned
  822. buffer actually contains the data in the requested block.
  823. .IP B_ERROR 10
  824. This bit may be set to 1 when
  825. .I B_DONE
  826. is set to indicate that an I/O or other error occurred.
  827. If it is set the
  828. .I b_error
  829. byte of the buffer header may contain an error code
  830. if it is non-zero.
  831. If
  832. .I b_error
  833. is 0 the nature of the error is not specified.
  834. Actually no driver at present sets
  835. .I b_error;
  836. the latter is provided for a future improvement
  837. whereby a more detailed error-reporting
  838. scheme may be implemented.
  839. .IP B_BUSY 10
  840. This bit indicates that the buffer header is not on
  841. the free list, i.e. is
  842. dedicated to someone's exclusive use.
  843. The buffer still remains attached to the list of
  844. blocks associated with its device, however.
  845. When
  846. .I getblk
  847. (or
  848. .I bread,
  849. which calls it) searches the buffer list
  850. for a given device and finds the requested
  851. block with this bit on, it sleeps until the bit
  852. clears.
  853. .IP B_PHYS 10
  854. This bit is set for raw I/O transactions that
  855. need to allocate the Unibus map on an 11/70.
  856. .IP B_MAP 10
  857. This bit is set on buffers that have the Unibus map allocated,
  858. so that the
  859. .I iodone
  860. routine knows to deallocate the map.
  861. .IP B_WANTED 10
  862. This flag is used in conjunction with the
  863. .I B_BUSY
  864. bit.
  865. Before sleeping as described
  866. just above,
  867. .I getblk
  868. sets this flag.
  869. Conversely, when the block is freed and the busy bit
  870. goes down (in
  871. .I brelse)
  872. a
  873. .I wakeup
  874. is given for the block header whenever
  875. .I B_WANTED
  876. is on.
  877. This stratagem avoids the overhead
  878. of having to call
  879. .I wakeup
  880. every time a buffer is freed on the chance that someone
  881. might want it.
  882. .IP B_AGE 10
  883. This bit may be set on buffers just before releasing them; if it
  884. is on,
  885. the buffer is placed at the head of the free list, rather than at the
  886. tail.
  887. It is a performance heuristic
  888. used when the caller judges that the same block will not soon be used again.
  889. .IP B_ASYNC 10
  890. This bit is set by
  891. .I bawrite
  892. to indicate to the appropriate device driver
  893. that the buffer should be released when the
  894. write has been finished, usually at interrupt time.
  895. The difference between
  896. .I bwrite
  897. and
  898. .I bawrite
  899. is that the former starts I/O, waits until it is done, and
  900. frees the buffer.
  901. The latter merely sets this bit and starts I/O.
  902. The bit indicates that
  903. .I brelse
  904. should be called for the buffer on completion.
  905. .IP B_DELWRI 10
  906. This bit is set by
  907. .I bdwrite
  908. before releasing the buffer.
  909. When
  910. .I getblk,
  911. while searching for a free block,
  912. discovers the bit is 1 in a buffer it would otherwise grab,
  913. it causes the block to be written out before reusing it.
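The interplay of B_BUSY and B_WANTED on release can be sketched as follows. Only B_WRITE being 0 is taken from the text; the other bit values are illustrative assumptions, the buffer header is reduced to its flag word, and release_sketch and fake_wakeup are hypothetical stand-ins rather than the system's routines.

```c
#define B_WRITE  0       /* not a flag; mnemonic only */
#define B_READ   01      /* the values below are assumed for illustration */
#define B_DONE   02
#define B_ERROR  04
#define B_BUSY   010
#define B_WANTED 0100
#define B_AGE    0200
#define B_ASYNC  0400
#define B_DELWRI 01000

struct buf {
    int b_flags;
};

int wakeups;             /* counts calls to the stand-in wakeup */

void fake_wakeup(void *event)
{
    (void)event;
    wakeups++;
}

/* Release a buffer: wake sleepers only if someone asked for it. */
void release_sketch(struct buf *bp)
{
    if (bp->b_flags & B_WANTED)
        fake_wakeup(bp);
    bp->b_flags &= ~(B_WANTED | B_BUSY | B_ASYNC | B_AGE);
}
```

A buffer nobody waited for is released without any wakeup call, which is the saving the B_WANTED flag is meant to provide.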
  914. .SH
  915. Block Device Drivers
  916. .PP
  917. The
  918. .I bdevsw
  919. table contains the names of the interface routines
  920. and that of a table for each block device.
  921. .PP
  922. Just as for character devices, block device drivers may supply
  923. an
  924. .I open
  925. and a
  926. .I close
  927. routine
  928. called respectively on each open and on the final close
  929. of the device.
  930. Instead of separate read and write routines,
  931. each block device driver has a
  932. .I strategy
  933. routine which is called with a pointer to a buffer
  934. header as argument.
  935. As discussed, the buffer header contains
  936. a read/write flag, the core address,
  937. the block number, a (negative) word count,
  938. and the major and minor device number.
  939. The role of the strategy routine
  940. is to carry out the operation as requested by the
  941. information in the buffer header.
  942. When the transaction is complete the
  943. .I B_DONE
  944. (and possibly the
  945. .I B_ERROR)
  946. bits should be set.
  947. Then if the
  948. .I B_ASYNC
  949. bit is set,
  950. .I brelse
  951. should be called;
  952. otherwise, the process awaiting the buffer should be awakened with
  953. .I wakeup.
  954. In cases where the device
  955. is capable, under error-free operation,
  956. of transferring fewer words than requested,
  957. the device's word-count register should be placed
  958. in the residual count slot of
  959. the buffer header;
  960. otherwise, the residual count should be set to 0.
  961. This particular mechanism is really for the benefit
  962. of the magtape driver;
  963. when reading this device,
  964. records shorter than requested are quite normal,
  965. and the user should be told the actual length of the record.
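.PP
The completion steps just described can be sketched as follows.
This is only an illustration under stated assumptions:
the flag values are arbitrary,
.I brelse_stub
and
.I wakeup_stub
are hypothetical stand-ins that merely count calls,
and
.I sketch_complete
is an invented name rather than a kernel routine.
.DS
```c
/* Flag values are illustrative, not the kernel's actual bit assignments. */
#define B_DONE  01
#define B_ERROR 02
#define B_ASYNC 04

struct buf {
    int b_flags;
    unsigned b_resid;     /* words left untransferred */
};

static int released;      /* times the buffer would have been released */
static int awakened;      /* times a sleeper would have been awakened */

static void brelse_stub(struct buf *bp) { (void)bp; released++; }
static void wakeup_stub(struct buf *bp) { (void)bp; awakened++; }

/*
 * Finish a transfer.  `failed` is nonzero if the device reported an
 * error; `resid` is the leftover count (0 for a full transfer).
 */
void sketch_complete(struct buf *bp, int failed, unsigned resid)
{
    if (failed)
        bp->b_flags |= B_ERROR;
    bp->b_resid = resid;
    bp->b_flags |= B_DONE;
    if (bp->b_flags & B_ASYNC)
        brelse_stub(bp);      /* nobody is waiting: free the buffer */
    else
        wakeup_stub(bp);      /* a process is sleeping on the header */
}
```
.DE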
  966. .PP
  967. Although the most usual argument
  968. to the strategy routines
  969. is a genuine buffer header allocated as discussed above,
  970. all that is actually required
  971. is that the argument be a pointer to a place containing the
  972. appropriate information.
  973. For example, the
  974. .I swap
  975. routine, which manages movement
  976. of core images to and from the swapping device,
  977. uses the strategy routine
  978. for this device.
  979. Care has to be taken that
  980. no extraneous bits get turned on in the
  981. flag word.
  982. .PP
  983. The device's table specified by
  984. .I bdevsw
  985. has a
  986. byte to contain an active flag and an error count,
  987. a pair of links which constitute the
  988. head of the chain of buffers for the device
  989. .I "(b_forw, b_back),"
  990. and a first and last pointer for a device queue.
  991. Of these things, all are used solely by the device driver
  992. itself
  993. except for the buffer-chain pointers.
  994. Typically the flag encodes the state of the
  995. device, and is used at a minimum to
  996. indicate that the device is currently engaged in
  997. transferring information and no new command should be issued.
  998. The error count is useful for counting retries
  999. when errors occur.
  1000. The device queue is used to remember stacked requests;
  1001. in the simplest case it may be maintained as a first-in
  1002. first-out list.
  1003. Since buffers which have been handed over to
  1004. the strategy routines are never
  1005. on the list of free buffers,
  1006. the pointers in the buffer which maintain the free list
  1007. .I "(av_forw, av_back)"
  1008. are also used to contain the pointers
  1009. which maintain the device queues.
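.PP
A first-in first-out device queue of the kind described might look like the sketch below.
The field names
.I "d_active, d_errcnt, d_actf, d_actl"
and all of the routine names are assumptions chosen for the example;
the buffer-chain links
.I "(b_forw, b_back)"
are left out, and loading the device registers is replaced by recording a block number in
.I started.
.DS
```c
#include <stddef.h>

struct buf {
    struct buf *av_forw;    /* free-list link, reused as the queue link */
    int b_blkno;
};

/* Hypothetical device table; the field names are assumptions. */
struct devtab {
    char d_active;          /* nonzero while a transfer is in progress */
    char d_errcnt;          /* retries on the current request */
    struct buf *d_actf;     /* first request in the device queue */
    struct buf *d_actl;     /* last request in the device queue */
};

static int started = -1;    /* block number of the last transfer begun */

/* Begin the request at the head of the queue if the device is idle. */
static void sketch_start(struct devtab *dp)
{
    if (dp->d_actf == NULL || dp->d_active)
        return;
    dp->d_active = 1;
    dp->d_errcnt = 0;
    started = dp->d_actf->b_blkno;   /* stand-in for the device registers */
}

/* Strategy routine: append the request and start the device if idle. */
void sketch_strategy(struct devtab *dp, struct buf *bp)
{
    bp->av_forw = NULL;
    if (dp->d_actf == NULL)
        dp->d_actf = bp;
    else
        dp->d_actl->av_forw = bp;
    dp->d_actl = bp;
    sketch_start(dp);
}

/* Interrupt routine: retire the finished request and start the next. */
void sketch_intr(struct devtab *dp)
{
    struct buf *bp = dp->d_actf;

    if (bp == NULL || !dp->d_active)
        return;                      /* spurious interrupt */
    dp->d_actf = bp->av_forw;
    dp->d_active = 0;
    sketch_start(dp);
}
```
.DE
.PP
The active flag is what keeps the second request from being issued
while the first is still in progress; it is started only from the interrupt routine.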
  1010. .PP
  1011. A couple of routines
  1012. are provided which are useful to block device drivers.
  1013. .I "iodone(bp)"
  1014. arranges that the buffer to which
  1015. .I bp
  1016. points be released or awakened,
  1017. as appropriate,
  1018. when the
  1019. strategy module has finished with the buffer,
  1020. either normally or after an error.
  1021. (In the latter case the
  1022. .I B_ERROR
  1023. bit has presumably been set.)
  1024. .PP
  1025. The routine
  1026. .I "geterror(bp)"
  1027. can be used to examine the error bit in a buffer header
  1028. and arrange that any error indication found therein is
  1029. reflected to the user.
  1030. It may be called only in the non-interrupt
  1031. part of a driver when I/O has completed
  1032. .I (B_DONE
  1033. has been set).
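.PP
One plausible shape for such an error check is sketched below.
The
.I b_error
field, the variable
.I u_error
standing in for the per-process error indication,
and the default of
.I EIO
are assumptions made for the example and are not taken from the text above.
.DS
```c
#include <errno.h>

/* Flag values are illustrative, not the kernel's actual bit assignments. */
#define B_DONE  01
#define B_ERROR 02

struct buf {
    int b_flags;
    int b_error;      /* assumed: specific error code, or 0 if unknown */
};

static int u_error;   /* stand-in for the per-process error indication */

/* Reflect an error recorded in the buffer header to the caller. */
void sketch_geterror(const struct buf *bp)
{
    if (bp->b_flags & B_ERROR)
        u_error = bp->b_error ? bp->b_error : EIO;
}
```
.DE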
  1034. .SH
  1035. Raw Block-device I/O
  1036. .PP
  1037. A scheme has been set up whereby block device drivers may
  1038. provide the ability to transfer information
  1039. directly between the user's core image and the device
  1040. without the use of buffers and in blocks as large as
  1041. the caller requests.
  1042. The method involves setting up a character-type special file
  1043. corresponding to the raw device
  1044. and providing
  1045. .I read
  1046. and
  1047. .I write
  1048. routines which set up what is usually a private,
  1049. non-shared buffer header with the appropriate information
  1050. and call the device's strategy routine.
  1051. If desired, separate
  1052. .I open
  1053. and
  1054. .I close
  1055. routines may be provided, but this is usually unnecessary.
  1056. A special-function routine might come in handy, especially for
  1057. magtape.
  1058. .PP
  1059. A great deal of work has to be done to generate the
  1060. ``appropriate information''
  1061. to put in the argument buffer for
  1062. the strategy module;
  1063. the worst part is to map relocated user addresses to physical addresses.
  1064. Most of this work is done by
  1065. .I "physio(strat, bp, dev, rw)"
  1066. whose arguments are the name of the
  1067. strategy routine
  1068. .I strat,
  1069. the buffer pointer
  1070. .I bp,
  1071. the device number
  1072. .I dev,
  1073. and a read-write flag
  1074. .I rw
  1075. whose value is either
  1076. .I B_READ
  1077. or
  1078. .I B_WRITE.
  1079. .I Physio
  1080. makes sure that the user's base address and count are
  1081. even (because most devices work in words)
  1082. and that the core area affected is contiguous
  1083. in physical space;
  1084. it delays until the buffer is not busy, and makes it
  1085. busy while the operation is in progress;
  1086. and it sets up user error return information.
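.PP
The checks and bookkeeping attributed to
.I physio
can be approximated in user-space C as below.
This is a simplified sketch, not the kernel routine:
the user's base address and byte count are passed as explicit arguments
.I base
and
.I count
(an assumption of the example),
the contiguity check and address mapping are omitted,
a busy buffer yields
.I EBUSY
where the kernel would wait,
and the strategy routine is assumed to complete before it returns.
The flag values and the two sample strategy routines are likewise hypothetical.
.DS
```c
#include <errno.h>
#include <stdint.h>

/* Flag values are illustrative, not the kernel's actual bit assignments. */
#define B_WRITE 0
#define B_READ  01
#define B_DONE  02
#define B_ERROR 04
#define B_BUSY  010

struct buf {
    int b_flags;
    int b_dev;
    uintptr_t b_addr;
    int b_wcount;                 /* negative word count */
};

typedef void (*strategy_fn)(struct buf *);

/* Sample strategy routines for the sketch: one succeeds, one fails. */
void sketch_ok_strategy(struct buf *bp)  { bp->b_flags |= B_DONE; }
void sketch_bad_strategy(struct buf *bp) { bp->b_flags |= B_DONE | B_ERROR; }

int sketch_physio(strategy_fn strat, struct buf *bp, int dev, int rw,
                  uintptr_t base, unsigned count)
{
    int failed;

    if ((base & 1) || (count & 1))
        return EFAULT;            /* odd address or odd byte count */
    if (bp->b_flags & B_BUSY)
        return EBUSY;             /* the kernel would delay here instead */
    bp->b_flags = B_BUSY | rw;    /* busy for the duration of the transfer */
    bp->b_dev = dev;
    bp->b_addr = base;
    bp->b_wcount = -(int)(count / 2);
    strat(bp);                    /* assumed synchronous in this sketch */
    failed = bp->b_flags & B_ERROR;
    bp->b_flags &= ~B_BUSY;
    return failed ? EIO : 0;      /* error return information for the user */
}
```
.DE
.PP
A raw read routine built on this sketch would simply call
.I sketch_physio
with its own strategy routine, its private buffer header, the device number, and
.I B_READ.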