/share/man/man4/geom.4
https://bitbucket.org/freebsd/freebsd-head/ · Forth · 467 lines · 467 code · 0 blank · 0 comment · 25 complexity · 90475073c8df5f102e6fb0f2a1c09570 MD5 · raw file
- .\"
- .\" Copyright (c) 2002 Poul-Henning Kamp
- .\" Copyright (c) 2002 Networks Associates Technology, Inc.
- .\" All rights reserved.
- .\"
- .\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
- .\" and NAI Labs, the Security Research Division of Network Associates, Inc.
- .\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
- .\" DARPA CHATS research program.
- .\"
- .\" Redistribution and use in source and binary forms, with or without
- .\" modification, are permitted provided that the following conditions
- .\" are met:
- .\" 1. Redistributions of source code must retain the above copyright
- .\" notice, this list of conditions and the following disclaimer.
- .\" 2. Redistributions in binary form must reproduce the above copyright
- .\" notice, this list of conditions and the following disclaimer in the
- .\" documentation and/or other materials provided with the distribution.
- .\" 3. The names of the authors may not be used to endorse or promote
- .\" products derived from this software without specific prior written
- .\" permission.
- .\"
- .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
- .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- .\" SUCH DAMAGE.
- .\"
- .\" $FreeBSD$
- .\"
- .Dd May 25, 2006
- .Dt GEOM 4
- .Os
- .Sh NAME
- .Nm GEOM
- .Nd "modular disk I/O request transformation framework"
- .Sh DESCRIPTION
- The
- .Nm
- framework provides an infrastructure in which
- .Dq classes
- can perform transformations on disk I/O requests on their path from
- the upper kernel to the device drivers and back.
- .Pp
- Transformations in a
- .Nm
- context range from the simple geometric
- displacement performed in typical disk partitioning modules over RAID
- algorithms and device multipath resolution to full blown cryptographic
- protection of the stored data.
- .Pp
- Compared to traditional
- .Dq "volume management" ,
- .Nm
- differs from most
- and in some cases all previous implementations in the following ways:
- .Bl -bullet
- .It
- .Nm
- is extensible.
- It is trivially simple to write a new class
- of transformation and it will not be given stepchild treatment.
- If
- someone for some reason wanted to mount IBM MVS diskpacks, a class
- recognizing and configuring their VTOC information would be a trivial
- matter.
- .It
- .Nm
- is topologically agnostic.
- Most volume management implementations
- have very strict notions of how classes can fit together, very often
- one fixed hierarchy is provided, for instance, subdisk - plex -
- volume.
- .El
- .Pp
- Being extensible means that new transformations are treated no differently
- than existing transformations.
- .Pp
- Fixed hierarchies are bad because they make it impossible to express
- the intent efficiently.
- In the fixed hierarchy above, it is not possible to mirror two
- physical disks and then partition the mirror into subdisks, instead
- one is forced to make subdisks on the physical volumes and to mirror
- these two and two, resulting in a much more complex configuration.
- .Nm
- on the other hand does not care in which order things are done,
- the only restriction is that cycles in the graph will not be allowed.
- .Sh "TERMINOLOGY AND TOPOLOGY"
- .Nm
- is quite object oriented and consequently the terminology
- borrows a lot of context and semantics from the OO vocabulary:
- .Pp
- A
- .Dq class ,
- represented by the data structure
- .Vt g_class
- implements one
- particular kind of transformation.
- Typical examples are MBR disk
- partition, BSD disklabel, and RAID5 classes.
- .Pp
- An instance of a class is called a
- .Dq geom
- and represented by the data structure
- .Vt g_geom .
- In a typical i386
- .Fx
- system, there
- will be one geom of class MBR for each disk.
- .Pp
- A
- .Dq provider ,
- represented by the data structure
- .Vt g_provider ,
- is the front gate at which a geom offers service.
- A provider is
- .Do
- a disk-like thing which appears in
- .Pa /dev
- .Dc - a logical
- disk in other words.
- All providers have three main properties:
- .Dq name ,
- .Dq sectorsize
- and
- .Dq size .
- .Pp
- A
- .Dq consumer
- is the backdoor through which a geom connects to another
- geom provider and through which I/O requests are sent.
- .Pp
- The topological relationship between these entities are as follows:
- .Bl -bullet
- .It
- A class has zero or more geom instances.
- .It
- A geom has exactly one class it is derived from.
- .It
- A geom has zero or more consumers.
- .It
- A geom has zero or more providers.
- .It
- A consumer can be attached to zero or one providers.
- .It
- A provider can have zero or more consumers attached.
- .El
- .Pp
- All geoms have a rank-number assigned, which is used to detect and
- prevent loops in the acyclic directed graph.
- This rank number is
- assigned as follows:
- .Bl -enum
- .It
- A geom with no attached consumers has rank=1.
- .It
- A geom with attached consumers has a rank one higher than the
- highest rank of the geoms of the providers its consumers are
- attached to.
- .El
- .Sh "SPECIAL TOPOLOGICAL MANEUVERS"
- In addition to the straightforward attach, which attaches a consumer
- to a provider, and detach, which breaks the bond, a number of special
- topological maneuvers exists to facilitate configuration and to
- improve the overall flexibility.
- .Bl -inset
- .It Em TASTING
- is a process that happens whenever a new class or new provider
- is created, and it provides the class a chance to automatically configure an
- instance on providers which it recognizes as its own.
- A typical example is the MBR disk-partition class which will look for
- the MBR table in the first sector and, if found and validated, will
- instantiate a geom to multiplex according to the contents of the MBR.
- .Pp
- A new class will be offered to all existing providers in turn and a new
- provider will be offered to all classes in turn.
- .Pp
- Exactly what a class does to recognize if it should accept the offered
- provider is not defined by
- .Nm ,
- but the sensible set of options are:
- .Bl -bullet
- .It
- Examine specific data structures on the disk.
- .It
- Examine properties like
- .Dq sectorsize
- or
- .Dq mediasize
- for the provider.
- .It
- Examine the rank number of the provider's geom.
- .It
- Examine the method name of the provider's geom.
- .El
- .It Em ORPHANIZATION
- is the process by which a provider is removed while
- it potentially is still being used.
- .Pp
- When a geom orphans a provider, all future I/O requests will
- .Dq bounce
- on the provider with an error code set by the geom.
- Any
- consumers attached to the provider will receive notification about
- the orphanization when the event loop gets around to it, and they
- can take appropriate action at that time.
- .Pp
- A geom which came into being as a result of a normal taste operation
- should self-destruct unless it has a way to keep functioning whilst
- lacking the orphaned provider.
- Geoms like disk slicers should therefore self-destruct whereas
- RAID5 or mirror geoms will be able to continue as long as they do
- not lose quorum.
- .Pp
- When a provider is orphaned, this does not necessarily result in any
- immediate change in the topology: any attached consumers are still
- attached, any opened paths are still open, any outstanding I/O
- requests are still outstanding.
- .Pp
- The typical scenario is:
- .Pp
- .Bl -bullet -offset indent -compact
- .It
- A device driver detects a disk has departed and orphans the provider for it.
- .It
- The geoms on top of the disk receive the orphanization event and
- orphan all their providers in turn.
- Providers which are not attached to will typically self-destruct
- right away.
- This process continues in a quasi-recursive fashion until all
- relevant pieces of the tree have heard the bad news.
- .It
- Eventually the buck stops when it reaches geom_dev at the top
- of the stack.
- .It
- Geom_dev will call
- .Xr destroy_dev 9
- to stop any more requests from
- coming in.
- It will sleep until any and all outstanding I/O requests have
- been returned.
- It will explicitly close (i.e.: zero the access counts), a change
- which will propagate all the way down through the mesh.
- It will then detach and destroy its geom.
- .It
- The geom whose provider is now detached will destroy the provider,
- detach and destroy its consumer and destroy its geom.
- .It
- This process percolates all the way down through the mesh, until
- the cleanup is complete.
- .El
- .Pp
- While this approach seems byzantine, it does provide the maximum
- flexibility and robustness in handling disappearing devices.
- .Pp
- The one absolutely crucial detail to be aware of is that if the
- device driver does not return all I/O requests, the tree will
- not unravel.
- .It Em SPOILING
- is a special case of orphanization used to protect
- against stale metadata.
- It is probably easiest to understand spoiling by going through
- an example.
- .Pp
- Imagine a disk,
- .Pa da0 ,
- on top of which an MBR geom provides
- .Pa da0s1
- and
- .Pa da0s2 ,
- and on top of
- .Pa da0s1
- a BSD geom provides
- .Pa da0s1a
- through
- .Pa da0s1e ,
- and that both the MBR and BSD geoms have
- autoconfigured based on data structures on the disk media.
- Now imagine the case where
- .Pa da0
- is opened for writing and those
- data structures are modified or overwritten: now the geoms would
- be operating on stale metadata unless some notification system
- can inform them otherwise.
- .Pp
- To avoid this situation, when the open of
- .Pa da0
- for write happens,
- all attached consumers are told about this and geoms like
- MBR and BSD will self-destruct as a result.
- When
- .Pa da0
- is closed, it will be offered for tasting again
- and, if the data structures for MBR and BSD are still there, new
- geoms will instantiate themselves anew.
- .Pp
- Now for the fine print:
- .Pp
- If any of the paths through the MBR or BSD module were open, they
- would have opened downwards with an exclusive bit thus rendering it
- impossible to open
- .Pa da0
- for writing in that case.
- Conversely,
- the requested exclusive bit would render it impossible to open a
- path through the MBR geom while
- .Pa da0
- is open for writing.
- .Pp
- From this it also follows that changing the size of open geoms can
- only be done with their cooperation.
- .Pp
- Finally: the spoiling only happens when the write count goes from
- zero to non-zero and the retasting happens only when the write count goes
- from non-zero to zero.
- .It Em INSERT/DELETE
- are very special operations which allow a new geom
- to be instantiated between a consumer and a provider attached to
- each other and to remove it again.
- .Pp
- To understand the utility of this, imagine a provider
- being mounted as a file system.
- Between the DEVFS geom's consumer and its provider we insert
- a mirror module which configures itself with one mirror
- copy and consequently is transparent to the I/O requests
- on the path.
- We can now configure yet a mirror copy on the mirror geom,
- request a synchronization, and finally drop the first mirror
- copy.
- We have now, in essence, moved a mounted file system from one
- disk to another while it was being used.
- At this point the mirror geom can be deleted from the path
- again; it has served its purpose.
- .It Em CONFIGURE
- is the process where the administrator issues instructions
- for a particular class to instantiate itself.
- There are multiple
- ways to express intent in this case - a particular provider may be
- specified with a level of override forcing, for instance, a BSD
- disklabel module to attach to a provider which was not found palatable
- during the TASTE operation.
- .Pp
- Finally, I/O is the reason we even do this: it concerns itself with
- sending I/O requests through the graph.
- .It Em "I/O REQUESTS" ,
- represented by
- .Vt "struct bio" ,
- originate at a consumer,
- are scheduled on its attached provider and, when processed, are returned
- to the consumer.
- It is important to realize that the
- .Vt "struct bio"
- which enters through the provider of a particular geom does not
- .Do
- come out on the other side
- .Dc .
- Even simple transformations like MBR and BSD will clone the
- .Vt "struct bio" ,
- modify the clone, and schedule the clone on their
- own consumer.
- Note that cloning the
- .Vt "struct bio"
- does not involve cloning the
- actual data area specified in the I/O request.
- .Pp
- In total, four different I/O requests exist in
- .Nm :
- read, write, delete, and
- .Dq "get attribute".
- .Pp
- Read and write are self explanatory.
- .Pp
- Delete indicates that a certain range of data is no longer used
- and that it can be erased or freed as the underlying technology
- supports.
- Technologies like flash adaptation layers can arrange to erase
- the relevant blocks before they will become reassigned and
- cryptographic devices may want to fill random bits into the
- range to reduce the amount of data available for attack.
- .Pp
- It is important to recognize that a delete indication is not a
- request and consequently there is no guarantee that the data actually
- will be erased or made unavailable unless guaranteed by specific
- geoms in the graph.
- If
- .Dq "secure delete"
- semantics are required, a
- geom should be pushed which converts delete indications into (a
- sequence of) write requests.
- .Pp
- .Dq "Get attribute"
- supports inspection and manipulation
- of out-of-band attributes on a particular provider or path.
- Attributes are named by
- .Tn ASCII
- strings and they will be discussed in
- a separate section below.
- .El
- .Pp
- (Stay tuned while the author rests his brain and fingers: more to come.)
- .Sh DIAGNOSTICS
- Several flags are provided for tracing
- .Nm
- operations and unlocking
- protection mechanisms via the
- .Va kern.geom.debugflags
- sysctl.
- All of these flags are off by default, and great care should be taken in
- turning them on.
- .Bl -tag -width indent
- .It 0x01 Pq Dv G_T_TOPOLOGY
- Provide tracing of topology change events.
- .It 0x02 Pq Dv G_T_BIO
- Provide tracing of buffer I/O requests.
- .It 0x04 Pq Dv G_T_ACCESS
- Provide tracing of access check controls.
- .It 0x08 (unused)
- .It 0x10 (allow foot shooting)
- Allow writing to Rank 1 providers.
- This would, for example, allow the super-user to overwrite the MBR on the root
- disk or write random sectors elsewhere to a mounted disk.
- The implications are obvious.
- .It 0x40 Pq Dv G_F_DISKIOCTL
- This is unused at this time.
- .It 0x80 Pq Dv G_F_CTLDUMP
- Dump contents of gctl requests.
- .El
- .Sh SEE ALSO
- .Xr libgeom 3 ,
- .Xr disk 9 ,
- .Xr DECLARE_GEOM_CLASS 9 ,
- .Xr g_access 9 ,
- .Xr g_attach 9 ,
- .Xr g_bio 9 ,
- .Xr g_consumer 9 ,
- .Xr g_data 9 ,
- .Xr g_event 9 ,
- .Xr g_geom 9 ,
- .Xr g_provider 9 ,
- .Xr g_provider_by_name 9
- .Sh HISTORY
- This software was developed for the
- .Fx
- Project by
- .An Poul-Henning Kamp
- and NAI Labs, the Security Research Division of Network Associates, Inc.\&
- under DARPA/SPAWAR contract N66001-01-C-8035
- .Pq Dq CBOSS ,
- as part of the
- DARPA CHATS research program.
- .Pp
- The first precursor for
- .Nm
- was a gruesome hack to Minix 1.2 and was
- never distributed.
- An earlier attempt to implement a less general scheme
- in
- .Fx
- never succeeded.
- .Sh AUTHORS
- .An "Poul-Henning Kamp" Aq phk@FreeBSD.org