.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\" products derived from this software without specific prior written
.\" permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd May 25, 2006
.Dt GEOM 4
.Os
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most,
and in some cases all, previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these pairwise, resulting in a much more complex configuration.
.Nm ,
on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and is represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc ,
in other words a logical disk.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank number assigned, which is used to detect and
prevent loops in the directed acyclic graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
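.Pp
The two rules can be sketched in userland C.
This is a minimal illustration under stated assumptions, not the kernel
implementation: the
.Vt toy_geom ,
.Vt toy_provider
and
.Vt toy_consumer
types and the
.Fn toy_rank
function are hypothetical stand-ins invented for this example.
.Bd -literal -offset indent
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAXCONS 4

/* Toy stand-ins; the real kernel structures hold far more. */
struct toy_geom;

struct toy_provider {
    struct toy_geom *geom;          /* geom offering this provider */
};

struct toy_consumer {
    struct toy_provider *provider;  /* NULL while detached */
};

struct toy_geom {
    const char *name;
    struct toy_consumer cons[MAXCONS];
    int ncons;
};

/*
 * Rule 1: a geom with no attached consumers has rank 1.
 * Rule 2: otherwise, one more than the highest rank below.
 * The recursion terminates because the graph has no cycles.
 */
static int
toy_rank(const struct toy_geom *gp)
{
    int i, r, best = 0;

    for (i = 0; i < gp->ncons; i++) {
        if (gp->cons[i].provider == NULL)
            continue;
        r = toy_rank(gp->cons[i].provider->geom);
        if (r > best)
            best = r;
    }
    return (best + 1);
}

int
main(void)
{
    struct toy_geom da0 = { "da0", { { NULL } }, 0 };
    struct toy_geom da1 = { "da1", { { NULL } }, 0 };
    struct toy_provider pda0 = { &da0 }, pda1 = { &da1 };
    struct toy_geom mirror =
        { "mirror", { { &pda0 }, { &pda1 } }, 2 };
    struct toy_provider pmirror = { &mirror };
    struct toy_geom mbr = { "mbr", { { &pmirror } }, 1 };

    assert(toy_rank(&da0) == 1);
    assert(toy_rank(&mirror) == 2);
    assert(toy_rank(&mbr) == 3);
    printf("da0=%d mirror=%d mbr=%d", toy_rank(&da0),
        toy_rank(&mirror), toy_rank(&mbr));
    puts("");
    return (0);
}
.Ed
.Pp
In this sketch a mirror built on two disks gets rank 2, and an MBR geom
stacked on the mirror gets rank 3.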
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
for the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the class name of the provider's geom.
.El
.It Em ORPHANIZATION
is the process by which a provider is removed while
it is potentially still in use.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects that a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
.It Em INSERT/DELETE
are very special operations which allow a new geom
to be instantiated between a consumer and a provider attached to
each other and to remove it again.
.Pp
To understand the utility of this, imagine a provider
being mounted as a file system.
Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror
copy and consequently is transparent to the I/O requests
on the path.
We can now configure yet another mirror copy on the mirror geom,
request a synchronization, and finally drop the first mirror
copy.
We have now, in essence, moved a mounted file system from one
disk to another while it was being used.
At this point the mirror geom can be deleted from the path
again; it has served its purpose.
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
.Pp
In total, four different types of I/O request exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they will become reassigned and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
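.Pp
The clone-and-displace step performed by a slicing geom can be sketched
in userland C.
This is only an illustration of the idea described under I/O requests:
.Vt toy_bio ,
.Vt toy_slice
and
.Fn toy_slice_start
are hypothetical names invented here, and the kernel interfaces
documented in
.Xr g_bio 9
differ in detail.
.Bd -literal -offset indent
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy request; the real struct bio carries many more fields. */
struct toy_bio {
    int64_t offset;          /* byte offset on the provider */
    int64_t length;          /* byte count */
    void *data;              /* caller's buffer */
    struct toy_bio *parent;  /* request this was cloned from */
};

/* Hypothetical slice: start offset and size, in bytes. */
struct toy_slice {
    int64_t start;
    int64_t size;
};

/*
 * Fill in a clone of bp aimed at the consumer below the slice.
 * Returns 0 on success, -1 if bp falls outside the slice.
 */
static int
toy_slice_start(const struct toy_slice *sl, struct toy_bio *bp,
    struct toy_bio *clone)
{
    if (bp->offset < 0 || bp->length < 0 ||
        bp->offset > sl->size ||
        bp->length > sl->size - bp->offset)
        return (-1);
    *clone = *bp;                /* header only, data is shared */
    clone->parent = bp;
    clone->offset += sl->start;  /* geometric displacement */
    return (0);
}

int
main(void)
{
    char buf[512];
    struct toy_slice s1 = { 63 * 512, 1000 * 512 };
    struct toy_bio req = { 1024, sizeof(buf), buf, NULL };
    struct toy_bio down;

    assert(toy_slice_start(&s1, &req, &down) == 0);
    assert(down.offset == 63 * 512 + 1024);
    assert(down.data == req.data);  /* buffer was not copied */
    assert(req.offset == 1024);     /* original is untouched */

    req.offset = 1000 * 512;        /* starts at end of slice */
    assert(toy_slice_start(&s1, &req, &down) == -1);
    puts("ok");
    return (0);
}
.Ed
.Pp
The original request keeps its own offset, so it can be completed
unchanged towards the consumer that issued it once the clone returns.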
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x08 (unused)
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
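.Pp
Because the flags are independent bits, a value of
.Va kern.geom.debugflags
is the bitwise OR of the entries above.
The following userland sketch decodes such a value; the table simply
restates the list above, and
.Fn describe_flags
is a hypothetical helper written for this example, not part of any
library.
.Bd -literal -offset indent
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit values and labels as listed in the table above. */
static const struct {
    unsigned bit;
    const char *what;
} flagtab[] = {
    { 0x01, "G_T_TOPOLOGY" },
    { 0x02, "G_T_BIO" },
    { 0x04, "G_T_ACCESS" },
    { 0x10, "allow foot shooting" },
    { 0x40, "G_F_DISKIOCTL" },
    { 0x80, "G_F_CTLDUMP" },
};

/* Write a comma separated list of the set flags into buf. */
static void
describe_flags(unsigned flags, char *buf, size_t len)
{
    size_t i, n = sizeof(flagtab) / sizeof(flagtab[0]);

    if (len == 0)
        return;
    buf[0] = 0;
    for (i = 0; i < n; i++) {
        if ((flags & flagtab[i].bit) == 0)
            continue;
        if (buf[0] != 0)
            strncat(buf, ", ", len - strlen(buf) - 1);
        strncat(buf, flagtab[i].what, len - strlen(buf) - 1);
    }
}

int
main(void)
{
    char buf[128];

    /* 0x11 = topology tracing plus the foot shooting bit. */
    describe_flags(0x11, buf, sizeof(buf));
    assert(strcmp(buf, "G_T_TOPOLOGY, allow foot shooting") == 0);
    puts(buf);
    return (0);
}
.Ed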
.Sh SEE ALSO
.Xr libgeom 3 ,
.Xr disk 9 ,
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr g_access 9 ,
.Xr g_attach 9 ,
.Xr g_bio 9 ,
.Xr g_consumer 9 ,
.Xr g_data 9 ,
.Xr g_event 9 ,
.Xr g_geom 9 ,
.Xr g_provider 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An "Poul-Henning Kamp" Aq phk@FreeBSD.org