- .\" ----------------------------------------------------------------------------
- .\" "THE BEER-WARE LICENSE" (Revision 42):
- .\" <phk@FreeBSD.ORG> wrote this file. As long as you retain this notice you
- .\" can do whatever you want with this stuff. If we meet some day, and you think
- .\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
- .\" ----------------------------------------------------------------------------
- .\"
- .\" $FreeBSD$
- .\"
- .if n .ftr C R
- .nr PI 2n
- .TL
- The case for struct bio
- .br
- - or -
- .br
- A road map for a stackable BIO subsystem in FreeBSD
- .AU
- Poul-Henning Kamp <phk@FreeBSD.org>
- .AI
- The FreeBSD Project
- .AB
- Historically, the only translation performed on I/O requests after
- they left the file-system layer was the logical sub-disk handling
- done in the device driver. No universal standard for how sub-disks
- are configured and implemented exists; in fact pretty much every
- platform and operating system has done it its own way. As FreeBSD
- migrates to other platforms it needs to understand these local
- conventions to be able to co-exist with other operating systems on
- the same disk.
- .PP
- Recently a number of technologies like RAID have expanded the
- concept of "a disk" a fair bit and while these technologies initially
- were implemented in separate hardware they increasingly migrate into
- the operating systems as standard functionality.
- .PP
- Both of these factors indicate the need for a structured approach to
- systematic "geometry manipulation" facilities in FreeBSD.
- .PP
- This paper contains the road-map for a stackable "BIO" system in
- FreeBSD, which will support these facilities.
- .AE
- .NH
- The miseducation of \fCstruct buf\fP.
- .PP
- To fully appreciate the topic, I include a little historic overview
- of struct buf; it is a most enlightening case not so much of bit-rot
- as of design-rot.
- .PP
- In the beginning, which for this purpose extends until virtual
- memory was introduced into UNIX, all disk I/O was done from or
- to a struct buf. In the 6th edition sources, as printed in the
- Lions book, struct buf looks like this:
- .DS
- .ft C
- .ps -1
- struct buf
- {
- int b_flags; /* see defines below */
- struct buf *b_forw; /* headed by devtab of b_dev */
- struct buf *b_back; /* ' */
- struct buf *av_forw; /* position on free list, */
- struct buf *av_back; /* if not BUSY*/
- int b_dev; /* major+minor device name */
- int b_wcount; /* transfer count (usu. words) */
- char *b_addr; /* low order core address */
- char *b_xmem; /* high order core address */
- char *b_blkno; /* block # on device */
- char b_error; /* returned after I/O */
- char *b_resid; /* words not transferred after
- error */
- } buf[NBUF];
- .ps +1
- .ft P
- .DE
- .PP
- At this point in time, struct buf had only two functions:
- To act as a cache
- and to transport I/O operations to device drivers. For the purpose of
- this document, the cache functionality is uninteresting and will be
- ignored.
- .PP
- The I/O operations functionality consists of three parts:
- .IP "" 5n
- \(bu Where in RAM/core the data is located (b_addr, b_xmem, b_wcount).
- .IP
- \(bu Where on disk the data is located (b_dev, b_blkno)
- .IP
- \(bu Request and result information (b_flags, b_error, b_resid)
- .PP
- In addition to this, the av_forw and av_back elements are
- used by the disk device drivers to put requests on a linked list.
- All in all the majority of struct buf is involved with the I/O
- aspect and only a few fields relate exclusively to the cache aspect.
- .PP
- If we step forward to the BSD 4.4-Lite-2 release, struct buf has grown
- a bit here or there:
- .DS
- .ft C
- .ps -1
- struct buf {
- LIST_ENTRY(buf) b_hash; /* Hash chain. */
- LIST_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
- TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
- struct buf *b_actf, **b_actb; /* Device driver queue when active. */
- struct proc *b_proc; /* Associated proc; NULL if kernel. */
- volatile long b_flags; /* B_* flags. */
- int b_error; /* Errno value. */
- long b_bufsize; /* Allocated buffer size. */
- long b_bcount; /* Valid bytes in buffer. */
- long b_resid; /* Remaining I/O. */
- dev_t b_dev; /* Device associated with buffer. */
- struct {
- caddr_t b_addr; /* Memory, superblocks, indirect etc. */
- } b_un;
- void *b_saveaddr; /* Original b_addr for physio. */
- daddr_t b_lblkno; /* Logical block number. */
- daddr_t b_blkno; /* Underlying physical block number. */
- /* Function to call upon completion. */
- void (*b_iodone) __P((struct buf *));
- struct vnode *b_vp; /* Device vnode. */
- long b_pfcent; /* Center page when swapping cluster. */
- /* XXX pfcent should be int; overld. */
- int b_dirtyoff; /* Offset in buffer of dirty region. */
- int b_dirtyend; /* Offset of end of dirty region. */
- struct ucred *b_rcred; /* Read credentials reference. */
- struct ucred *b_wcred; /* Write credentials reference. */
- int b_validoff; /* Offset in buffer of valid region. */
- int b_validend; /* Offset of end of valid region. */
- };
- .ps +1
- .ft P
- .DE
- .PP
- The main piece of action is the addition of vnodes, a VM system and
- a prototype LFS filesystem, all of which needed some handles on
- struct buf. Comparison will show that the I/O aspect of struct buf
- is in essence unchanged: the length field is now in bytes instead
- of words, the linked list the drivers can use has been renamed
- (b_actf, b_actb) and a b_iodone pointer for callback notification
- has been added, but otherwise the fields which represent the I/O
- aspect are untouched. All the new fields relate to the cache
- aspect: they link buffers to the VM system, provide hacks for
- file-systems (b_lblkno) and so on.
- .PP
- By the time we get to FreeBSD 3.0 more stuff has grown on struct buf:
- .DS
- .ft C
- .ps -1
- struct buf {
- LIST_ENTRY(buf) b_hash; /* Hash chain. */
- LIST_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
- TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
- TAILQ_ENTRY(buf) b_act; /* Device driver queue when active. *new* */
- struct proc *b_proc; /* Associated proc; NULL if kernel. */
- long b_flags; /* B_* flags. */
- unsigned short b_qindex; /* buffer queue index */
- unsigned char b_usecount; /* buffer use count */
- int b_error; /* Errno value. */
- long b_bufsize; /* Allocated buffer size. */
- long b_bcount; /* Valid bytes in buffer. */
- long b_resid; /* Remaining I/O. */
- dev_t b_dev; /* Device associated with buffer. */
- caddr_t b_data; /* Memory, superblocks, indirect etc. */
- caddr_t b_kvabase; /* base kva for buffer */
- int b_kvasize; /* size of kva for buffer */
- daddr_t b_lblkno; /* Logical block number. */
- daddr_t b_blkno; /* Underlying physical block number. */
- off_t b_offset; /* Offset into file */
- /* Function to call upon completion. */
- void (*b_iodone) __P((struct buf *));
- /* For nested b_iodone's. */
- struct iodone_chain *b_iodone_chain;
- struct vnode *b_vp; /* Device vnode. */
- int b_dirtyoff; /* Offset in buffer of dirty region. */
- int b_dirtyend; /* Offset of end of dirty region. */
- struct ucred *b_rcred; /* Read credentials reference. */
- struct ucred *b_wcred; /* Write credentials reference. */
- int b_validoff; /* Offset in buffer of valid region. */
- int b_validend; /* Offset of end of valid region. */
- daddr_t b_pblkno; /* physical block number */
- void *b_saveaddr; /* Original b_addr for physio. */
- caddr_t b_savekva; /* saved kva for transfer while bouncing */
- void *b_driver1; /* for private use by the driver */
- void *b_driver2; /* for private use by the driver */
- void *b_spc;
- union cluster_info {
- TAILQ_HEAD(cluster_list_head, buf) cluster_head;
- TAILQ_ENTRY(buf) cluster_entry;
- } b_cluster;
- struct vm_page *b_pages[btoc(MAXPHYS)];
- int b_npages;
- struct workhead b_dep; /* List of filesystem dependencies. */
- };
- .ps +1
- .ft P
- .DE
- .PP
- Still we find that the I/O aspect of struct buf is in essence
- unchanged. A couple of fields which allow the driver to hang local
- data off the buf while working on it have been added (b_driver1,
- b_driver2), and so has a "physical block number" (b_pblkno).
- .PP
- This b_pblkno is relevant: it was added because the disklabel/slice
- code has been abstracted out of the device drivers. The filesystem
- asks for b_blkno, and the slice/label code translates this into the
- b_pblkno which the device driver operates on.
- .PP
- After this point some minor cleanups have happened and some unused
- fields have been removed, but the I/O aspect of struct buf is still
- only a fraction of the entire structure: less than a quarter of the
- bytes in a struct buf are used for the I/O aspect, and struct buf
- seems to keep on growing.
- .PP
- Since version 6 as documented in the Lions book, three significant
- pieces of code have emerged which need to do non-trivial
- translations of the I/O request before it reaches the device
- drivers: CCD, slice/label and Vinum. They all basically do the same
- thing: they map I/O requests from a logical space to a physical
- space, and the mappings they perform can be 1:1 or 1:N. \**
- .FS
- It is interesting to note that Lions in his comments to the \fCrkaddr\fP
- routine (p. 16-2) writes \fIThe code in this procedure incorporates
- a special feature for files which extend over more than one disk
- drive. This feature is described in the UPM Section "RK(IV)". Its
- usefulness seems to be restricted.\fP This more than hints at the
- presence already then of various hacks to stripe/span multiple devices.
- .FE
- .PP
- The 1:1 mapping of the slice/label code is rather trivial, and the
- addition of the b_pblkno field catered for the majority of the
- issues it raised, leaving but one: reads or writes to the magic
- "disklabel" or equally magic "MBR" sectors on a disk must be caught,
- examined and in some cases modified before being passed on to the
- device driver. This need resulted in the addition of the
- b_iodone_chain field, which adds a limited ability to stack I/O
- operations.
- .PP
- The 1:N mappings of CCD and Vinum are far more interesting. These
- two subsystems look like device drivers, but rather than drive some
- piece of hardware, they allocate new struct buf data structures,
- populate these and pass them on to other device drivers.
- .PP
- Apart from it being inefficient to lug about a 348-byte data
- structure when 80 bytes would have done, it also leads to
- significant code rot when programmers don't know what to do about
- the remaining fields or, even worse, "borrow" a field or two for
- their own uses.
- .PP
- .ID
- .if t .PSPIC bufsize.eps
- .if n [graph not available in this format]
- .DE
- .I
- Conclusions:
- .IP "" 5n
- \(bu Struct buf is a victim of chronic bloat.
- .IP
- \(bu The I/O aspect of
- struct buf is practically constant and only about \(14 of the total bytes.
- .IP
- \(bu Struct buf currently has several users, vinum, ccd and to a
- limited extent diskslice/label, which
- need only the I/O aspect, not the vnode, caching or VM linkage.
- .IP
- .I
- The I/O aspect of struct buf should be put in a separate \fCstruct bio\fP.
- .R
- .NH 1
- Implications for future struct buf improvements
- .PP
- Concerns have been raised about the implications this separation
- will have for future work on struct buf; I will try to address
- these concerns here.
- .PP
- As the existence and popularity of vinum and ccd proves, there is
- a legitimate and valid requirement to be able to do I/O operations
- which are not initiated by a vnode or filesystem operation.
- In other words, an I/O request is a fully valid entity in its own
- right and should be treated like that.
- .PP
- Without doubt, the I/O request has to be tuned to fit the needs
- of struct buf users in the best possible way, and consequently
- any future changes in struct buf are likely to affect the I/O request
- semantics.
- .PP
- One particular change which has been proposed is to drop the present
- requirement that a struct buf be mapped contiguously into kernel
- address space. The argument goes that since many modern drivers use
- physical address DMA to transfer the data, maintaining such a mapping
- is needless overhead.
- .PP
- Of course some drivers will still need to be able to access the
- buffer in kernel address space and some kind of compatibility
- must be provided there.
- .PP
- The question is whether such a change would be made impossible by
- the separation of the I/O aspect into its own data structure.
- .PP
- The answer to this is ``no''.
- Anything that could be added to or done with
- the I/O aspect of struct buf can also be added to or done
- with the I/O aspect if it lives in a new "struct bio".
- .NH 1
- Implementing a \fCstruct bio\fP
- .PP
- The first decision to be made was who got to use the name "struct buf",
- and considering the fact that it is the I/O aspect which gets separated
- out and that it only covers about \(14 of the bytes in struct buf,
- obviously the new structure for the I/O aspect gets a new name.
- Examining the naming in the kernel, the "bio" prefix seemed a given:
- for instance, the function to signal completion of an I/O request is
- already named "biodone()".
- .PP
- Making the transition smooth is obviously also a priority and after
- some prototyping \**
- .FS
- The software development technique previously known as "Trial & Error".
- .FE
- it was found that a totally transparent transition could be made by
- embedding a copy of the new "struct bio" as the first element of "struct buf"
- and by using cpp(1) macros to alias the fields to the legacy struct buf
- names.
- .NH 2
- The b_flags problem.
- .PP
- Struct bio was defined by examining all code existing in the driver tree
- and finding all the struct buf fields which were legitimately used (as
- opposed to "hi-jacked" fields).
- One field was found to have "dual-use": the b_flags field.
- This required special attention.
- Examination showed that b_flags was used for three things:
- .IP "" 5n
- \(bu Communication of the I/O command (READ, WRITE, FORMAT, DELETE)
- .IP
- \(bu Communication of ordering and error status
- .IP
- \(bu General status for non I/O aspect consumers of struct buf.
- .PP
- For historic reasons B_WRITE was defined to be zero, which led to
- confusion and bugs; this pushed the decision to have a separate
- "b_iocmd" field in struct buf and struct bio for communicating
- only the action to be performed.
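- .PP
- As a purely illustrative sketch (not a quote from the tree), a driver
- which used to decode b_flags can instead switch on the command field.
- The BIO_* names follow the convention used for the new field; "foo"
- and its helper routines are hypothetical:
- .DS
- .ft C
- .ps -1
- /* Illustrative only: dispatch on b_iocmd instead of b_flags. */
- static void
- foostart(struct buf *bp)
- {
- 
-         switch (bp->b_iocmd) {
-         case BIO_READ:
-         case BIO_WRITE:
-                 foo_rw(bp);             /* hypothetical read/write path */
-                 break;
-         case BIO_DELETE:
-                 foo_delete(bp);         /* hypothetical sector-free path */
-                 break;
-         default:
-                 foo_reject(bp, EOPNOTSUPP); /* hypothetical error path */
-                 break;
-         }
- }
- .ps +1
- .ft P
- .DE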
- .PP
- The ordering and error status bits were put in a new flag field "b_ioflag".
- This has left sufficiently many now unused bits in b_flags that the b_xflags element
- can now be merged back into b_flags.
- .NH 2
- Definition of struct bio
- .PP
- With the cleanup of b_flags in place, the definition of struct bio looks like this:
- .DS
- .ft C
- .ps -1
- struct bio {
- u_int bio_cmd; /* I/O operation. */
- dev_t bio_dev; /* Device to do I/O on. */
- daddr_t bio_blkno; /* Underlying physical block number. */
- off_t bio_offset; /* Offset into file. */
- long bio_bcount; /* Valid bytes in buffer. */
- caddr_t bio_data; /* Memory, superblocks, indirect etc. */
- u_int bio_flags; /* BIO_ flags. */
- struct buf *_bio_buf; /* Parent buffer. */
- int bio_error; /* Errno for BIO_ERROR. */
- long bio_resid; /* Remaining I/O in bytes. */
- void (*bio_done) __P((struct buf *));
- void *bio_driver1; /* Private use by the callee. */
- void *bio_driver2; /* Private use by the callee. */
- void *bio_caller1; /* Private use by the caller. */
- void *bio_caller2; /* Private use by the caller. */
- TAILQ_ENTRY(bio) bio_queue; /* Disksort queue. */
- daddr_t bio_pblkno; /* physical block number */
- struct iodone_chain *bio_done_chain;
- };
- .ps +1
- .ft P
- .DE
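- .PP
- To give a feel for the I/O aspect in isolation, here is a minimal
- sketch of how a consumer might fill in a struct bio and hand it to a
- driver. It is illustrative only: foo_strategy() stands in for whatever
- the real dispatch resolves to, and a real consumer would allocate the
- bio stably and sleep until the completion callback fires:
- .DS
- .ft C
- .ps -1
- static void
- example_done(struct buf *bp)    /* callback type per bio_done above */
- {
- 
-         /* inspect bio_error, bio_resid here and wake up the caller */
- }
- 
- static void
- example_read(dev_t dev, daddr_t blkno, caddr_t data, long bcount)
- {
-         struct bio bio;
- 
-         bzero(&bio, sizeof(bio));
-         bio.bio_cmd = BIO_READ;         /* what to do */
-         bio.bio_dev = dev;              /* which device */
-         bio.bio_blkno = blkno;          /* where on the device */
-         bio.bio_data = data;            /* where in memory */
-         bio.bio_bcount = bcount;        /* how much */
-         bio.bio_done = example_done;    /* how to hear about completion */
-         foo_strategy(&bio);             /* hypothetical dispatch */
- }
- .ps +1
- .ft P
- .DE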
- .NH 2
- Definition of struct buf
- .PP
- After adding a struct bio to struct buf and aliasing the fields into
- it, struct buf looks like this:
- .DS
- .ft C
- .ps -1
- struct buf {
- /* XXX: b_io must be the first element of struct buf for now /phk */
- struct bio b_io; /* "Builtin" I/O request. */
- #define b_bcount b_io.bio_bcount
- #define b_blkno b_io.bio_blkno
- #define b_caller1 b_io.bio_caller1
- #define b_caller2 b_io.bio_caller2
- #define b_data b_io.bio_data
- #define b_dev b_io.bio_dev
- #define b_driver1 b_io.bio_driver1
- #define b_driver2 b_io.bio_driver2
- #define b_error b_io.bio_error
- #define b_iocmd b_io.bio_cmd
- #define b_iodone b_io.bio_done
- #define b_iodone_chain b_io.bio_done_chain
- #define b_ioflags b_io.bio_flags
- #define b_offset b_io.bio_offset
- #define b_pblkno b_io.bio_pblkno
- #define b_resid b_io.bio_resid
- LIST_ENTRY(buf) b_hash; /* Hash chain. */
- TAILQ_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
- TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
- TAILQ_ENTRY(buf) b_act; /* Device driver queue when active. *new* */
- long b_flags; /* B_* flags. */
- unsigned short b_qindex; /* buffer queue index */
- unsigned char b_xflags; /* extra flags */
- [...]
- .ps +1
- .ft P
- .DE
- .PP
- Putting the struct bio as the first element in struct buf during a
- transition period allows a pointer to either one to be cast to a
- pointer to the other,
- which means that certain pieces of code can be left un-converted with the
- use of a couple of casts while the remaining pieces of code are tested.
- The ccd and vinum modules have been left un-converted like this for now.
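- .PP
- As a purely illustrative sketch of what "a couple of casts" means in
- practice (the function names are invented, not taken from ccd or vinum):
- .DS
- .ft C
- .ps -1
- /*
-  * Valid only while b_io is the first element of struct buf: a buf
-  * can be handed to bio-only code and recovered again with casts.
-  */
- static void
- unconverted_strategy(struct buf *bp)
- {
- 
-         converted_bio_code((struct bio *)bp);   /* buf -> bio */
- }
- 
- static void
- converted_done(struct bio *bip)
- {
-         struct buf *bp = (struct buf *)bip;     /* bio -> buf */
- 
-         old_buf_completion(bp);                 /* hypothetical */
- }
- .ps +1
- .ft P
- .DE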
- .PP
- This is basically where FreeBSD-current stands today.
- .PP
- The next step is to substitute struct bio for struct buf in all the code
- which only cares about the I/O aspect: device drivers, diskslice/label.
- The patch to do this is up for review. \**
- .FS
- And can be found at http://phk.freebsd.dk/misc
- .FE
- and consists mainly of systematic substitutions like these
- .DS
- .ft C
- s/struct buf/struct bio/
- s/b_flags/bio_flags/
- s/b_bcount/bio_bcount/
- &c &c
- .ft P
- .DE
- .NH 2
- Future work
- .PP
- It can be successfully argued that the cpp(1) macros used for aliasing
- above are ugly and should be expanded in place. It would certainly
- be trivial to do so, but not by definition worthwhile.
- .PP
- Retaining the aliasing for the b_* and bio_* name-spaces this way
- leaves us with considerable flexibility in modifying the future
- interaction between the two. The DEV_STRATEGY() macro is the single
- point where a struct buf is turned into a struct bio and launched
- into the drivers to fulfill the I/O request, and this provides us
- with a single isolated location for performing non-trivial translations.
- .PP
- As an example of this flexibility: it has been proposed to
- essentially drop the b_blkno field and use the b_offset field to
- communicate the on-disk location of the data. b_blkno is a 32-bit
- offset in DEV_BSIZE (512) byte sectors, which allows us to address
- two terabytes worth of data. Using b_offset as a 64-bit byte
- address would not only allow us to address disks 8 million times
- larger, it would also make it possible to accommodate disks which
- use a non-power-of-two sector-size, Audio CD-ROMs for instance.
- .PP
- The above mentioned flexibility makes an implementation almost trivial:
- .IP "" 5n
- \(bu Add code to DEV_STRATEGY() to populate b_offset from b_blkno in the
- cases where it is not valid (a sketch follows this list). Today it is
- only valid for a struct buf marked B_PHYS.
- .IP
- \(bu Change diskslice/label, ccd, vinum and device drivers to use b_offset
- instead of b_blkno.
- .IP
- \(bu Remove the bio_blkno field from struct bio, add it to struct buf as
- b_blkno and remove the cpp(1) macro which aliased it into struct bio.
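- .PP
- A sketch of the first step (illustrative only; DEV_BSHIFT is the
- kernel's shift for 512-byte sectors, and the exact placement inside
- DEV_STRATEGY() is an implementation detail):
- .DS
- .ft C
- .ps -1
- /*
-  * Derive the byte offset from the sector number before the request
-  * is launched into the drivers, unless the originator already filled
-  * it in (which today only physio does, hence the B_PHYS test).
-  */
- if ((bp->b_flags & B_PHYS) == 0)
-         bp->b_offset = (off_t)bp->b_blkno << DEV_BSHIFT;
- .ps +1
- .ft P
- .DE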
- .PP
- Another possible transition could be to not have a "built-in" struct bio
- in struct buf. If for some reason struct bio grows fields of no relevance
- to struct buf it might be cheaper to remove struct bio from struct buf,
- un-alias the fields and have DEV_STRATEGY() allocate a struct bio and populate
- the relevant fields from struct buf.
- This would also be entirely transparent to both users of struct buf and
- struct bio as long as we retain the aliasing mechanism and DEV_STRATEGY().
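- .PP
- A sketch of what DEV_STRATEGY() might then do (illustrative only; the
- allocation, completion plumbing and error handling are waved away,
- and allocate_bio(), bufdonebio() and launch_bio() are invented names):
- .DS
- .ft C
- .ps -1
- static void
- dev_strategy_sketch(struct buf *bp)
- {
-         struct bio *bip;
- 
-         bip = allocate_bio();           /* hypothetical allocator */
-         bip->bio_cmd = bp->b_iocmd;
-         bip->bio_dev = bp->b_dev;
-         bip->bio_offset = bp->b_offset;
-         bip->bio_bcount = bp->b_bcount;
-         bip->bio_data = bp->b_data;
-         bip->bio_done = bufdonebio;     /* completes bp, frees bip */
-         bip->_bio_buf = bp;             /* parent buffer */
-         launch_bio(bip);                /* hypothetical dispatch */
- }
- .ps +1
- .ft P
- .DE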
- .bp
- .NH 1
- Towards a stackable BIO subsystem.
- .PP
- Considering that we now have three distinct pieces of code living
- in the no-man's-land between DEV_STRATEGY() and the device drivers:
- diskslice/label, ccd and vinum, it is not unreasonable to start
- to look for a more structured and powerful API for these pieces
- of code.
- .PP
- In traditional UNIX semantics a "disk" is a one-dimensional array of
- 512-byte sectors which can be read or written. Support for sectors
- which are a multiple of 512 bytes was implemented with a sort of
- "don't ask, don't tell" policy, where the system administrator would
- specify a larger minimum sector-size to the filesystem and things
- would "just work", but no formal communication about the size of the
- smallest possible transfer was exchanged between the disk driver and
- the filesystem.
- .PP
- A truly generalised concept of a disk needs to be more flexible and more
- expressive. For instance, a user of a disk will want to know:
- .IP "" 5n
- \(bu What is the sector size. Sector-size these days may not be a power
- of two, for instance Audio CDs have 2352 byte "sectors".
- .IP
- \(bu How many sectors are there.
- .IP
- \(bu Is writing of sectors supported.
- .IP
- \(bu Is freeing of sectors supported. This is important for
- flash-based devices, where a wear-distribution function in software
- or hardware uses the information about which sectors are actually in
- use to keep the use of the slow erase function to a minimum.
- .IP
- \(bu Is opening this device in a specific mode (read-only or read-write)
- allowed. The VM system and the file-systems generally assume that nobody
- writes to "their storage" under their feet, and therefore opens which
- would make that possible should be rejected.
- .IP
- \(bu What is the "native" geometry of this device (Sectors/Heads/Cylinders).
- This is useful for staying compatible with badly designed on-disk formats
- from other operating systems.
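- .PP
- To make the list above concrete, the answers could be collected in
- something like the following property structure. This is purely
- hypothetical; none of these names exist in the tree, and per-property
- query calls would work just as well:
- .DS
- .ft C
- .ps -1
- /* Hypothetical sketch of the properties a "disk" would report. */
- struct disk_properties {
-         u_int   dp_sectorsize;          /* bytes, maybe not a power of 2 */
-         off_t   dp_sectors;             /* number of sectors */
-         u_int   dp_heads;               /* "native" geometry, if any */
-         u_int   dp_cylinders;
-         u_int   dp_secpertrack;
-         u_int   dp_flags;
- #define DISK_CANWRITE   0x1             /* writing supported */
- #define DISK_CANDELETE  0x2             /* freeing of sectors supported */
- };
- /* Whether an open in a given mode is allowed is a per-request
-    question and belongs with the open path, not here. */
- .ps +1
- .ft P
- .DE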
- .PP
- Obviously, all of these properties are dynamic in the sense that
- disks these days are removable devices, and they may therefore change
- at any time. While some devices like CD-ROMs can lock the media in
- place with a special command, this cannot be done for all devices,
- in particular it cannot be done with normal floppy disk drives.
- .PP
- If we adopt such a model for disks, retain the existing
- "strategy/biodone" model of I/O scheduling and decide to use a
- modular or stackable approach to geometry translations, we find
- that nearly endless flexibility emerges:
- Mirroring, RAID, striping, interleaving, disk-labels and sub-disks, all of
- these techniques would get a common framework to operate in.
- .PP
- In practice, of course, such a scheme must not complicate the use or
- installation of FreeBSD. The code will have to act and react exactly
- like the current code but fortunately the current behaviour is not at
- all hard to emulate so implementation-wise this is a non-issue.
- .PP
- But let's look at some drawings to see what this means in practice.
- .PP
- Today the plumbing might look like this on a machine:
- .DS
- .PS
- Ad0: box "disk (ad0)"
- arrow up from Ad0.n
- SL0: box "slice/label"
- Ad1: box "disk (ad1)" with .w at Ad0.e + (.2,0)
- arrow up from Ad1.n
- SL1: box "slice/label"
- Ad2: box "disk (ad2)" with .w at Ad1.e + (.2,0)
- arrow up from Ad2.n
- SL2: box "slice/label"
- Ad3: box "disk (ad3)" with .w at Ad2.e + (.2,0)
- arrow up from Ad3.n
- SL3: box "slice/label"
- DML: box dashed width 4i height .9i with .sw at SL0.sw + (-.2,-.2)
- "Disk-mini-layer" with .n at DML.s + (0, .1)
- V: box "vinum" at 1/2 <SL1.n, SL2.n> + (0,1.2)
- A0A: arrow up from 1/4 <SL0.nw, SL0.ne>
- A0B: arrow up from 2/4 <SL0.nw, SL0.ne>
- A0E: arrow up from 3/4 <SL0.nw, SL0.ne>
- A1C: arrow up from 2/4 <SL1.nw, SL1.ne>
- arrow to 1/3 <V.sw, V.se>
- A2C: arrow up from 2/4 <SL2.nw, SL2.ne>
- arrow to 2/3 <V.sw, V.se>
- A3A: arrow up from 1/4 <SL3.nw, SL3.ne>
- A3E: arrow up from 2/4 <SL3.nw, SL3.ne>
- A3F: arrow up from 3/4 <SL3.nw, SL3.ne>
- "ad0s1a" with .s at A0A.n + (0, .1)
- "ad0s1b" with .s at A0B.n + (0, .3)
- "ad0s1e" with .s at A0E.n + (0, .5)
- "ad1s1c" with .s at A1C.n + (0, .1)
- "ad2s1c" with .s at A2C.n + (0, .1)
- "ad3s4a" with .s at A3A.n + (0, .1)
- "ad3s4e" with .s at A3E.n + (0, .3)
- "ad3s4f" with .s at A3F.n + (0, .5)
- V1: arrow up from 1/4 <V.nw, V.ne>
- V2: arrow up from 2/4 <V.nw, V.ne>
- V3: arrow up from 3/4 <V.nw, V.ne>
- "V1" with .s at V1.n + (0, .1)
- "V2" with .s at V2.n + (0, .1)
- "V3" with .s at V3.n + (0, .1)
- .PE
- .DE
- .PP
- And while this drawing looks nice and clean, the code underneath isn't.
- With a stackable BIO implementation, the picture would look like this:
- .DS
- .PS
- Ad0: box "disk (ad0)"
- arrow up from Ad0.n
- M0: box "MBR"
- arrow up
- B0: box "BSD"
- A0A: arrow up from 1/4 <B0.nw, B0.ne>
- A0B: arrow up from 2/4 <B0.nw, B0.ne>
- A0E: arrow up from 3/4 <B0.nw, B0.ne>
-
- Ad1: box "disk (ad1)" with .w at Ad0.e + (.2,0)
- Ad2: box "disk (ad2)" with .w at Ad1.e + (.2,0)
- Ad3: box "disk (ad3)" with .w at Ad2.e + (.2,0)
- arrow up from Ad3.n
- SL3: box "MBR"
- arrow up
- B3: box "BSD"
- V: box "vinum" at 1/2 <Ad1.n, Ad2.n> + (0,.8)
- arrow from Ad1.n to 1/3 <V.sw, V.se>
- arrow from Ad2.n to 2/3 <V.sw, V.se>
- A3A: arrow from 1/4 <B3.nw, B3.ne>
- A3E: arrow from 2/4 <B3.nw, B3.ne>
- A3F: arrow from 3/4 <B3.nw, B3.ne>
- "ad0s1a" with .s at A0A.n + (0, .1)
- "ad0s1b" with .s at A0B.n + (0, .3)
- "ad0s1e" with .s at A0E.n + (0, .5)
- "ad3s4a" with .s at A3A.n + (0, .1)
- "ad3s4e" with .s at A3E.n + (0, .3)
- "ad3s4f" with .s at A3F.n + (0, .5)
- V1: arrow up from 1/4 <V.nw, V.ne>
- V2: arrow up from 2/4 <V.nw, V.ne>
- V3: arrow up from 3/4 <V.nw, V.ne>
- "V1" with .s at V1.n + (0, .1)
- "V2" with .s at V2.n + (0, .1)
- "V3" with .s at V3.n + (0, .1)
- .PE
- .DE
- .PP
- The first thing we notice is that the disk mini-layer is gone;
- instead, separate modules for the Microsoft-style MBR and the
- BSD-style disklabel are now stacked over the disk. We can also see
- that Vinum no longer needs to go through the BSD/MBR layers if it
- wants access to the entire physical disk; it can be stacked right
- over the disk.
- .PP
- Now, imagine that a ZIP drive is connected to the machine, and the
- user loads a ZIP disk in it. First the device driver notices the
- new disk and instantiates a new disk:
- .DS
- .PS
- box "disk (da0)"
- .PE
- .DE
- .PP
- A number of the geometry modules have registered as "auto-discovering"
- and will be polled sequentially to see if any of them recognise what
- is on this disk. The MBR module finds an MBR in sector 0 and attaches
- an instance of itself to the disk:
- .DS
- .PS
- D: box "disk (da0)"
- arrow up from D.n
- M: box "MBR"
- M1: arrow up from 1/3 <M.nw, M.ne>
- M2: arrow up from 2/3 <M.nw, M.ne>
- .PE
- .DE
- .PP
- It finds two "slices" in the MBR and creates two new "disks", one for
- each of these. The polling of modules is repeated and this time the
- BSD label module recognises a FreeBSD label on one of the slices and
- attaches itself:
- .DS
- .PS
- D: box "disk (da0)"
- arrow "O" up from D.n
- M: box "MBR"
- M1: line up .3i from 1/3 <M.nw, M.ne>
- arrow "O" left
- M2: arrow "O" up from 2/3 <M.nw, M.ne>
- B: box "BSD"
- B1: arrow "O" up from 1/4 <B.nw, B.ne>
- B2: arrow "O" up from 2/4 <B.nw, B.ne>
- B3: arrow "O" up from 3/4 <B.nw, B.ne>
-
- .PE
- .DE
- .PP
- The BSD module finds three partitions, creates them as disks and the
- polling is repeated for each of these. No modules recognise these
- and the process ends. In theory one could have a module recognise
- the UFS superblock and extract from there the path to mount the disk
- on, but this is probably better implemented in a general "device-daemon"
- in user-land.
- .PP
- On this last drawing I have marked with "O" the "disks" which can be
- accessed from user-land or kernel. The VM and file-systems generally
- prefer to have exclusive write access to the disk sectors they use,
- so we need to enforce this policy. Since we cannot know what transformation
- a particular module implements, we need to ask the modules if the open
- is OK, and they may need to ask their neighbours before they can answer.
- .PP
- We decide to mount a filesystem on one of the BSD partitions at the very top.
- The open request is passed to the BSD module, which finds that none of
- the other open partitions (there are none) overlap this one, so far no
- objections. It then passes the open to the MBR module, which goes
- through basically the same procedure, finds no objections and passes
- the request to the disk driver, which, since it was not previously
- open, approves of the open.
- .PP
- Next we mount a filesystem on the next BSD partition. The
- BSD module again checks for overlapping open partitions and finds
- none. This time, however, it finds that it has already opened the
- "downstream" in R/W mode, so it does not need to ask for permission
- again and the open is OK.
- .PP
- Next we mount an msdos filesystem on the other MBR slice. This is the
- same case: the MBR module finds no overlapping open slices and has
- already opened "downstream", so the open is OK.
- .PP
- Now suppose we try to open the other slice for writing, the one which
- already has the BSD module attached. The open is passed to the MBR
- module, which notes that the slice is already opened for writing by a
- module (the BSD module), and consequently the open is refused.
- .PP
- While this sounds complicated, it actually took less than 200 lines
- of code to implement in a prototype.
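- .PP
- The walk can be sketched as follows. All names are invented for
- illustration, the usual kernel headers are assumed for FREAD, FWRITE
- and EBUSY, and real code would also have to deal with locking and
- with closes:
- .DS
- .ft C
- .ps -1
- #define MAXSUB  16
- 
- struct subdev {
-         off_t   s_start, s_len;         /* extent in the layer below */
-         int     s_mode;                 /* FREAD/FWRITE already granted */
- };
- 
- struct layer {
-         struct layer    *l_down;        /* towards the disk, NULL at bottom */
-         int             l_downsub;      /* which subdev of l_down we occupy */
-         int             l_downmode;     /* modes we already hold downstream */
-         struct subdev   l_sub[MAXSUB];
- };
- 
- static int
- layer_open(struct layer *lp, int sub, int mode)
- {
-         struct subdev *sp = &lp->l_sub[sub];
-         int i, error;
- 
-         /* An open, overlapping subdev vetoes us if either side writes. */
-         for (i = 0; i < MAXSUB; i++) {
-                 if (lp->l_sub[i].s_mode == 0)
-                         continue;
-                 if (((mode | lp->l_sub[i].s_mode) & FWRITE) &&
-                     sp->s_start < lp->l_sub[i].s_start + lp->l_sub[i].s_len &&
-                     lp->l_sub[i].s_start < sp->s_start + sp->s_len)
-                         return (EBUSY);
-         }
-         /* Ask downstream, but only for modes we do not hold already. */
-         if (lp->l_down != NULL && (mode & ~lp->l_downmode) != 0) {
-                 error = layer_open(lp->l_down, lp->l_downsub,
-                     mode | lp->l_downmode);
-                 if (error != 0)
-                         return (error);
-         }
-         lp->l_downmode |= mode;
-         sp->s_mode |= mode;
-         return (0);
- }
- .ps +1
- .ft P
- .DE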
- .PP
- Now the user ejects the ZIP disk. If the hardware can give a
- notification of intent to eject, a call-up from the driver can try to
- get devices synchronised and closed; this is pretty trivial. If the
- hardware just disappears, like an unplugged parallel-port ZIP drive, a
- floppy disk or a PC-card, we have no choice but to dismantle the
- setup. The device driver sends a "gone" notification to the MBR
- module, which replicates this upwards to the mounted msdosfs and the
- BSD module. The msdosfs unmounts forcefully, invalidates any blocks
- in the buf/vm system and returns. The BSD module replicates the
- "gone" to the two mounted file-systems, which in turn unmount
- forcefully, invalidate blocks and return, after which the BSD module
- releases any resources held and returns, the MBR module releases any
- resources held and returns, and all traces of the device have been
- removed.
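- .PP
- A sketch of the "gone" walk (again with invented names): each module
- exports an event handler which replicates the event to everything
- stacked above it before tearing down its own state:
- .DS
- .ft C
- .ps -1
- struct module;
- typedef void module_event_t(struct module *, int);
- #define MODULE_GONE     1               /* hypothetical event code */
- 
- struct module {
-         module_event_t  *m_event;       /* this module's event handler */
-         struct module   *m_up[8];       /* consumers stacked above us */
-         int             m_nup;
- };
- 
- static void
- example_module_event(struct module *mp, int event)
- {
-         int i;
- 
-         if (event != MODULE_GONE)
-                 return;
-         /* Replicate the event to every consumer stacked above us... */
-         for (i = 0; i < mp->m_nup; i++)
-                 mp->m_up[i]->m_event(mp->m_up[i], event);
-         /* ...and only then invalidate buffers and release resources. */
- }
- .ps +1
- .ft P
- .DE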
- .PP
- Now, let us get a bit more complicated. We add another disk and mirror
- two of the MBR slices:
- .DS
- .PS
- D0: box "disk (da0)"
- arrow "O" up from D0.n
- M0: box "MBR"
- M01: line up .3i from 1/3 <M0.nw, M0.ne>
- arrow "O" left
- M02: arrow "O" up from 2/3 <M0.nw, M0.ne>
- D1: box "disk (da1)" with .w at D0.e + (.2,0)
- arrow "O" up from D1.n
- M1: box "MBR"
- M11: line up .3i from 1/3 <M1.nw, M1.ne>
- line "O" left
- M11a: arrow up .2i
- I: box "Mirror" with .s at 1/2 <M02.n, M11a.n>
- arrow "O" up
- BB: box "BSD"
- BB1: arrow "O" up from 1/4 <BB.nw, BB.ne>
- BB2: arrow "O" up from 2/4 <BB.nw, BB.ne>
- BB3: arrow "O" up from 3/4 <BB.nw, BB.ne>
- M12: arrow "O" up from 2/3 <M1.nw, M1.ne>
- B: box "BSD"
- B1: arrow "O" up from 1/4 <B.nw, B.ne>
- B2: arrow "O" up from 2/4 <B.nw, B.ne>
- B3: arrow "O" up from 3/4 <B.nw, B.ne>
- .PE
- .DE
- .PP
- Now assume that we lose disk da0. The notification goes up like
- before, but the mirror module still has a valid mirror on disk da1,
- so it doesn't propagate the "gone" notification further up, and the
- three mounted file-systems are not affected.
- .PP
- It is possible to modify the graph while in action, as long as the
- modules know that they will not affect any I/O in progress. This is
- very handy for moving things around. At any of the arrows we can
- insert a mirroring module, since it has a 1:1 mapping from input
- to output. Next we can add another copy to the mirror and give the
- mirror time to sync the two copies, then detach the first mirror copy
- and remove the mirror module. We have now, in essence, moved a partition
- from one disk to another transparently.
- .NH 1
- Getting stackable BIO layers from where we are today.
- .PP
- Most of the infrastructure is in place now to implement stackable
- BIO layers:
- .IP "" 5n
- \(bu The dev_t change gave us a public structure where
- information about devices can be put. This enabled us to get rid
- of all the NFOO limits on the number of instances of a particular
- driver/device, and significantly cleaned up the vnode aliasing for
- device vnodes.
- .IP
- \(bu The disk-mini-layer has
- taken the knowledge about diskslice/labels out of the
- majority of the disk-drivers, saving on average 100 lines of code per
- driver.
- .IP
- \(bu The struct bio/buf divorce is giving us an IO request of manageable
- size which can be modified without affecting all the filesystem and
- VM system users of struct buf.
- .PP
- The missing bits are:
- .IP "" 5n
- \(bu changes to struct bio to make it more
- stackable. This mostly relates to the handling of the biodone()
- event, something which will be transparent to all current users
- of struct buf/bio.
- .IP
- \(bu code to stitch modules together and to pass events and notifications
- between them.
- .NH 1
- An Implementation plan for stackable BIO layers
- .PP
- My plan for implementing stackable BIO layers is to first complete
- the struct bio/buf divorce with the already mentioned patch.
- .PP
- The next step is to re-implement the monolithic disk-mini-layer so
- that it becomes the stackable BIO system. Vinum, CCD and all
- other consumers should not be able to tell the difference between
- the current and the new disk-mini-layer. The new implementation
- will initially use a static stacking to remain compatible with the
- current behaviour. This will be the next logical checkpoint commit.
- .PP
- The next step is to make the stackable layers configurable,
- to provide the means to initialise the stacking and to subsequently
- change it. This will be the next logical checkpoint commit.
- .PP
- At this point new functionality can be added inside the stackable
- BIO system: CCD can be re-implemented as a mirror module and a stripe
- module. Vinum can be integrated either as one "macro-module" or
- as separate functions in separate modules. Also modules for other
- purposes can be added: sub-disk handling for Solaris, MacOS, and so
- on. These modules can be committed one at a time.