  1. .\" Copyright (c) 1993
  2. .\" The Regents of the University of California. All rights reserved.
  3. .\"
  4. .\" This document is derived from software contributed to Berkeley by
  5. .\" Rick Macklem at The University of Guelph.
  6. .\"
  7. .\" Redistribution and use in source and binary forms, with or without
  8. .\" modification, are permitted provided that the following conditions
  9. .\" are met:
  10. .\" 1. Redistributions of source code must retain the above copyright
  11. .\" notice, this list of conditions and the following disclaimer.
  12. .\" 2. Redistributions in binary form must reproduce the above copyright
  13. .\" notice, this list of conditions and the following disclaimer in the
  14. .\" documentation and/or other materials provided with the distribution.
  15. .\" 3. All advertising materials mentioning features or use of this software
  16. .\" must display the following acknowledgement:
  17. .\" This product includes software developed by the University of
  18. .\" California, Berkeley and its contributors.
  19. .\" 4. Neither the name of the University nor the names of its contributors
  20. .\" may be used to endorse or promote products derived from this software
  21. .\" without specific prior written permission.
  22. .\"
  23. .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  24. .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  25. .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  26. .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  27. .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  28. .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  29. .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  30. .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  31. .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  32. .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  33. .\" SUCH DAMAGE.
  34. .\"
  35. .\" @(#)2.t 8.1 (Berkeley) 6/8/93
  36. .\"
  37. .\" $FreeBSD$
  38. .\"
  39. .sh 1 "Not Quite NFS, Crash Tolerant Cache Consistency for NFS"
  40. .pp
  41. Not Quite NFS (NQNFS) is an NFS-like protocol designed to maintain full cache
  42. consistency between clients in a crash-tolerant manner.
  43. It is an adaptation of the NFS protocol such that the server supports both NFS
  44. and NQNFS clients while maintaining full consistency between the server and
  45. NQNFS clients.
  46. This section borrows heavily from work done on Spritely-NFS [Srinivasan89],
  47. but uses Leases [Gray89] to avoid the need to recover server state information
  48. after a crash.
  49. The reader is strongly encouraged to read these references before
  50. trying to grasp the material presented here.
  51. .sh 2 "Overview"
  52. .pp
  53. NQNFS maintains cache consistency with a protocol somewhat like
  54. that of Sprite [Nelson88],
  55. but is based on short-term leases\** instead of hard state information
  56. about open files.
  57. .(f
  58. \** A lease is a ticket permitting an activity that is
  59. valid until some expiry time.
  60. .)f
  61. The basic principle is that the protocol will disable client caching of a
  62. file whenever that file is write shared\**.
  63. .(f
  64. \** Write sharing occurs when at least one client is modifying a file while
  65. other client(s) are reading the file.
  66. .)f
  67. Whenever a client wishes to cache data for a file it must hold a valid lease.
  68. There are three types of leases: read caching, write caching and non-caching.
  69. The last type requires that all file operations be done synchronously with
  70. the server via RPCs.
  71. A read caching lease allows for client data caching, but no file modifications
  72. may be done.
  73. A write caching lease allows for client caching of writes,
  74. but requires that all writes be pushed to the server when the lease expires.
  75. If a client has dirty buffers\**
  76. .(f
  77. \** Cached write data not yet pushed (written) to the server.
  78. .)f
  79. when a write caching lease has almost expired, it will attempt to
  80. extend the lease but is required to push the dirty buffers if extension fails.
  81. A client gets leases by either doing a \fBGetLease RPC\fR or by piggybacking
  82. a \fBGetLease Request\fR onto another RPC. Piggybacking is supported for the
  83. frequent RPCs Getattr, Setattr, Lookup, Readlink, Read, Write and Readdir
  84. in an effort to minimize the number of \fBGetLease RPCs\fR required.
  85. All leases are at the granularity of a file, since all NFS RPCs operate on
  86. individual files and NFS has no intrinsic notion of a file hierarchy.
  87. Directories, symbolic links and file attributes may be read cached but
  88. are not write cached.
  89. The exception here is the attribute file_size, which is updated during cached
  90. writing on the client to reflect a growing file.
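.pp
The rule for a write caching lease that is about to expire can be sketched as
follows. This is only an illustration, not the kernel implementation: the
function name and the try_extend and push callbacks are hypothetical, and
letting the lease lapse when there are no dirty buffers is an assumption of
the sketch.

```python
def handle_expiring_write_lease(dirty_buffers, try_extend, push):
    """Return True if the lease was extended, False otherwise.

    dirty_buffers: list of cached writes not yet sent to the server.
    try_extend:    hypothetical callable, True if the server extended the lease.
    push:          hypothetical callable that writes one buffer to the server.
    """
    if not dirty_buffers:
        # Assumption of this sketch: with nothing dirty, simply let the lease lapse.
        return False
    if try_extend():
        # Extension succeeded, so the dirty buffers may stay cached.
        return True
    # Extension failed: every dirty buffer must be pushed to the server.
    for buf in list(dirty_buffers):
        push(buf)
    dirty_buffers.clear()
    return False
```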
  91. .pp
  92. It is the server's responsibility to ensure that consistency is maintained
  93. among the NQNFS clients by disabling client caching whenever a server file
  94. operation would cause inconsistencies.
  95. Inconsistencies become possible whenever a client holds
  96. a write caching lease and any other client,
  97. or a local operation on the server,
  98. tries to access the file, or when
  99. a modify operation is attempted on a file being read cached by client(s).
  100. At this time, the server sends an \fBeviction notice\fR to all clients holding
  101. the lease and then waits for lease termination.
  102. Lease termination occurs when a \fBvacated the premises\fR message has been
  103. received from all the clients that have signed the lease or when the lease
  104. expires via timeout.
  105. The message pair \fBeviction notice\fR and \fBvacated the premises\fR roughly
  106. corresponds to a Sprite server\(->client callback, but is not implemented as an
  107. actual RPC, to avoid the server waiting indefinitely for a reply from a dead
  108. client.
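.pp
The lease termination condition described above can be sketched as a small
piece of server-side bookkeeping. The class and method names are hypothetical;
only the rule itself (terminate when every holder has vacated or when the
lease times out) comes from the text.

```python
class ServerLease:
    """Server-side record of one lease on one file (illustrative only)."""

    def __init__(self, holders, expiry):
        self.holders = set(holders)  # clients that have "signed" the lease
        self.expiry = expiry         # server time at which the lease times out

    def start_eviction(self, send_notice):
        # The eviction notice is a one-way message: the server does not
        # block waiting for a reply, so a dead client cannot hang it.
        for client in sorted(self.holders):
            send_notice(client)

    def vacated(self, client):
        # Called when a "vacated the premises" message arrives from a client.
        self.holders.discard(client)

    def terminated(self, now):
        # Terminated once all holders have vacated, or on timeout.
        return not self.holders or now >= self.expiry
```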
  109. .pp
  110. Server consistency checking can be viewed as issuing intrinsic leases for a
  111. file operation for the duration of the operation only. For example, the
  112. \fBCreate RPC\fR will get an intrinsic write lease on the directory in which
  113. the file is being created, disabling client read caches for that directory.
  114. .pp
  115. By relegating this responsibility to the server, consistency between the
  116. server and NQNFS clients is maintained when NFS clients are modifying the
  117. file system as well.\**
  118. .(f
  119. \** The NFS clients will continue to be \fIapproximately\fR consistent with
  120. the server.
  121. .)f
  122. .pp
  123. The leases are issued as time intervals to avoid the requirement of time of day
  124. clock synchronization. There are three important time constants known to
  125. the server. The \fBmaximum_lease_term\fR sets an upper bound on lease duration.
  126. The \fBclock_skew\fR is added to all lease terms on the server to correct for
  127. differing clock speeds between the client and server. The \fBwrite_slack\fR is
  128. the number of seconds the server is willing to wait for a client with
  129. an expired write caching lease to push dirty writes.
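.pp
One way to read the first two constants is sketched below. The numeric values
are assumed purely for illustration, and capping the requested duration at
maximum_lease_term is an interpretation of "upper bound", not a quotation of
the server code.

```python
# Illustrative values only; the real constants are server configuration.
MAXIMUM_LEASE_TERM = 60  # seconds, upper bound on any lease duration
CLOCK_SKEW = 3           # seconds added on the server side only


def grant_lease(requested_duration, now):
    """Return (duration told to the client, expiry time kept by the server).

    Both clocks measure intervals, not time of day, so `now` may be any
    monotonically increasing server clock reading.
    """
    duration = min(requested_duration, MAXIMUM_LEASE_TERM)
    # The server holds the lease slightly longer than the client believes,
    # which covers a client clock that runs slow relative to the server.
    server_expiry = now + duration + CLOCK_SKEW
    return duration, server_expiry
```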
  130. .pp
  131. The server maintains a \fBmodify_revision\fR number for each file. It is
  132. defined as an unsigned quadword integer that is never zero and that must
  133. increase whenever the corresponding file is modified on the server.
  134. It is used
  135. by the client to determine whether or not cached data for the file is
  136. stale.
  137. Generating this value is easier said than done. The current implementation
  138. uses the following technique, which is believed to be adequate.
  139. The high order longword is stored in the ufs inode and is initialized to one
  140. when an inode is first allocated.
  141. The low order longword is stored in main memory only and is initialized to
  142. zero when an inode is read in from disk.
  143. When the file is modified for the first time within a given second of
  144. wall clock time, the high order longword is incremented by one and
  145. the low order longword reset to zero.
  146. For subsequent modifications within the same second of wall clock
  147. time, the low order longword is incremented. If the low order longword wraps
  148. around to zero, the high order longword is incremented again.
  149. Since the high order longword only increments once per second and the inode
  150. is pushed to disk frequently during file modification, this implies
  151. 0 \(<= Current\(miDisk \(<= 5.
  152. When the inode is read in from disk, 10
  153. is added to the high order longword, which ensures that the quadword
  154. is greater than any value it could have had before a crash.
  155. This introduces apparent modifications every time the inode falls out of
  156. the LRU inode cache, but this should only reduce the client caching performance
  157. by a (hopefully) small margin.
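.pp
The generation technique above can be modelled in a few lines. This is a
sketch of the described behaviour, not the ufs code: the class and method
names are hypothetical, the low order longword is assumed to start at zero
for a freshly allocated inode, and overflow of the high order longword is
ignored.

```python
MASK32 = 0xFFFFFFFF


class ModifyRev:
    """Model of the 64-bit modify_revision split into two longwords."""

    def __init__(self, high, low=0):
        self.high = high          # stored in the inode on disk
        self.low = low            # kept in main memory only
        self.last_second = None   # wall clock second of the last modification

    @classmethod
    def allocate(cls):
        # A newly allocated inode starts with the high longword set to one.
        return cls(high=1)

    @classmethod
    def read_from_disk(cls, disk_high):
        # Adding 10 exceeds the largest lag (5) between memory and disk.
        return cls(high=disk_high + 10)

    def modified(self, now_seconds):
        if now_seconds != self.last_second:
            # First modification within this second of wall clock time.
            self.last_second = now_seconds
            self.high += 1
            self.low = 0
        else:
            self.low = (self.low + 1) & MASK32
            if self.low == 0:     # low longword wrapped around
                self.high += 1

    @property
    def value(self):
        return (self.high << 32) | self.low
```

Because the reloaded value starts at least five above anything the in-memory
copy could have reached, a client comparing revisions after a server crash
sees its cached data as stale rather than falsely current.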
  158. .sh 2 "Crash Recovery and other Failure Scenarios"
  159. .pp
  160. The server must maintain the state of all the current leases held by clients.
  161. The nice thing about short term leases is that maximum_lease_term seconds
  162. after the server stops issuing leases, there are no current leases left.
  163. As such, server crash recovery does not require any state recovery. After
  164. rebooting, the server refuses to service any RPCs except for writes until
  165. write_slack seconds after the last lease would have expired\**.
  166. .(f
  167. \** The last lease expiry time may be safely estimated as
  168. "boottime+maximum_lease_term+clock_skew" for machines that cannot store
  169. it in nonvolatile RAM.
  170. .)f
  171. By then, the server would not have any outstanding leases to recover the
  172. state of and the clients have had at least write_slack seconds to push dirty
  173. writes to the server and get the server sync'd up to date. After this, the
  174. server simply services requests in a manner similar to NFS.
  175. In an effort to minimize the effect of "recovery storms" [Baker91],
  176. the server replies \fBtry_again_later\fR to the RPCs it is not
  177. yet ready to service.
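.pp
The recovery window can be expressed as simple arithmetic on the three time
constants. In this sketch the function names and the default constant values
are assumptions; the boot-time estimate of the last lease expiry is the one
given in the footnote above.

```python
NQNFS_TRYLATER = 501  # "stat" value defined by this protocol


def recovery_ends_at(boottime, maximum_lease_term, clock_skew, write_slack):
    # Estimate of the last possible lease expiry when it was not saved
    # in nonvolatile RAM, as suggested in the footnote.
    last_lease_expiry = boottime + maximum_lease_term + clock_skew
    return last_lease_expiry + write_slack


def recovery_reply(proc, now, boottime,
                   maximum_lease_term=60, clock_skew=3, write_slack=5):
    """Return NQNFS_TRYLATER if the RPC must wait, or None to service it.

    `proc` is a hypothetical lowercase RPC name; the defaults are
    illustrative values, not the server's configuration.
    """
    if proc == "write":
        return None  # writes are always serviced so clients can push dirty data
    end = recovery_ends_at(boottime, maximum_lease_term, clock_skew, write_slack)
    if now < end:
        return NQNFS_TRYLATER
    return None
```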
  178. .pp
  179. After a client crashes, the server may have to wait for a lease to time out
  180. before servicing a request if write sharing of a file with a cachable lease
  181. on the client is about to occur.
  182. As for the client, it simply starts up and gets any leases it now needs. Any
  183. leases outstanding on the server for that client prior to the crash will either be renewed or expire
  184. via timeout.
  185. .pp
  186. Certain network partitioning failures are more problematic. If a client to
  187. server network connection is severed just before a write caching lease expires,
  188. the client cannot push the dirty writes to the server. After the lease expires
  189. on the server, the server permits other clients to access the file with the
  190. potential of getting stale data. Unfortunately, I believe this failure scenario
  191. is intrinsic to any delayed write caching scheme unless the server is required to
  192. wait \fBforever\fR for a client to regain contact\**.
  193. .(f
  194. \** Gray and Cheriton avoid this problem by using a \fBwrite through\fR policy.
  195. .)f
  196. Since the write caching lease has expired on the client,
  197. it will sync up with the
  198. server as soon as the network connection has been re-established.
  199. .pp
  200. There is another failure condition that can occur when the server is congested.
  201. The worst case scenario would have the client pushing dirty writes to the server
  202. but a large request queue on the server delays these writes for more than
  203. \fBwrite_slack\fR seconds. It is hoped that a congestion control scheme using
  204. the \fBtry_again_later\fR RPC reply after booting combined with
  205. the following lease termination rule for write caching leases
  206. can minimize the risk of this occurrence.
  207. A write caching lease is only terminated on the server when there have
  208. been no writes to the file and the server has not been overloaded during
  209. the previous write_slack seconds. The condition that the server has not been overloaded
  210. is approximated by a test for sleeping nfsd(s) at the end of the write_slack
  211. period.
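.pp
The termination rule for write caching leases might be sketched as the
predicate below. The function and parameter names are hypothetical, and the
sketch assumes it is only evaluated for a lease that has already expired, at
the end of a write_slack period.

```python
def may_terminate_write_lease(now, last_write_time, sleeping_nfsd_count,
                              write_slack):
    """Decide whether an expired write caching lease may be terminated.

    last_write_time:     server time of the most recent write to the file.
    sleeping_nfsd_count: number of idle nfsd processes observed now; at
                         least one idle nfsd stands in for "the server has
                         not been overloaded".
    """
    no_recent_writes = (now - last_write_time) >= write_slack
    not_overloaded = sleeping_nfsd_count > 0
    return no_recent_writes and not_overloaded
```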
  212. .sh 2 "Server Disk Full"
  213. .pp
  214. There is a serious unresolved problem for delayed write caching with respect to
  215. server disk space allocation.
  216. When the disk on the file server is full, delayed write RPCs can fail
  217. due to "out of space".
  218. For NFS, this occurrence results in an error return from the close system
  219. call on the file, since the dirty blocks are pushed on close.
  220. Processes writing important files can check for this error return
  221. to ensure that the file was written successfully.
  222. For NQNFS, the dirty blocks are not pushed on close and as such the client
  223. may not attempt the write RPC until after the process has done the close
  224. which implies no error return from the close.
  225. For the current prototype,
  226. the only solution is to modify programs writing important
  227. file(s) to call fsync and check for an error return from it instead of close.
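.pp
For a program writing an important file, the workaround amounts to checking
fsync rather than close. A minimal sketch using the standard os module is
shown here; the function name is hypothetical and error handling is left to
the caller.

```python
import os


def write_important_file(path, data):
    """Write `data` (bytes) to `path` and force it to stable storage.

    os.fsync raises OSError (for example ENOSPC when the server disk is
    full) if the delayed writes cannot be pushed, so a failure surfaces
    here instead of being lost after close.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)
```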
  228. .sh 2 "Protocol Details"
  229. .pp
  230. The protocol specification is identical to that of NFS [Sun89] except for
  231. the following changes.
  232. .ip \(bu
  233. RPC Information
  234. .(l
  235. Program Number 300105
  236. Version Number 1
  237. .)l
  238. .ip \(bu
  239. Readdir_and_Lookup RPC
  240. .(l
  241. struct readdirlookargs {
  242. fhandle file;
  243. nfscookie cookie;
  244. unsigned count;
  245. unsigned duration;
  246. };
  247. struct entry {
  248. unsigned cachable;
  249. unsigned duration;
  250. modifyrev rev;
  251. fhandle entry_fh;
  252. nqnfs_fattr entry_attrib;
  253. unsigned fileid;
  254. filename name;
  255. nfscookie cookie;
  256. entry *nextentry;
  257. };
  258. union readdirlookres switch (stat status) {
  259. case NFS_OK:
  260. struct {
  261. entry *entries;
  262. bool eof;
  263. } readdirlookok;
  264. default:
  265. void;
  266. };
  267. readdirlookres
  268. NQNFSPROC_READDIRLOOK(readdirlookargs) = 18;
  269. .)l
  270. Reads entries in a directory in a manner analogous to the NFSPROC_READDIR RPC
  271. in NFS, but returns the file handle and attributes of each entry as well.
  272. This allows the attribute and lookup caches to be primed.
  273. .ip \(bu
  274. Get Lease RPC
  275. .(l
  276. struct getleaseargs {
  277. fhandle file;
  278. cachetype readwrite;
  279. unsigned duration;
  280. };
  281. union getleaseres switch (stat status) {
  282. case NFS_OK:
  283. bool cachable;
  284. unsigned duration;
  285. modifyrev rev;
  286. nqnfs_fattr attributes;
  287. default:
  288. void;
  289. };
  290. getleaseres
  291. NQNFSPROC_GETLEASE(getleaseargs) = 19;
  292. .)l
  293. Gets a lease for "file" valid for "duration" seconds from when the lease
  294. was issued on the server\**.
  295. .(f
  296. \** To be safe, the client may only assume that the lease is valid
  297. for ``duration'' seconds from when the RPC request was sent to the server.
  298. .)f
  299. The lease permits client caching if "cachable" is true.
  300. The modify revision level and attributes for the file are also returned.
  301. .ip \(bu
  302. Eviction Message
  303. .(l
  304. void
  305. NQNFSPROC_EVICTED (fhandle) = 21;
  306. .)l
  307. This message is sent from the server to the client. When the client receives
  308. the message, it should flush data associated with the file represented by
  309. "fhandle" from its caches and then send the \fBVacated Message\fR back to
  310. the server. Flushing includes pushing any dirty writes via write RPCs.
  311. .ip \(bu
  312. Vacated Message
  313. .(l
  314. void
  315. NQNFSPROC_VACATED (fhandle) = 20;
  316. .)l
  317. This message is sent from the client to the server in response to the
  318. \fBEviction Message\fR. See above.
  319. .ip \(bu
  320. Access RPC
  321. .(l
  322. struct accessargs {
  323. fhandle file;
  324. bool read_access;
  325. bool write_access;
  326. bool exec_access;
  327. };
  328. stat
  329. NQNFSPROC_ACCESS(accessargs) = 22;
  330. .)l
  331. The access RPC does permission checking on the server for the given type
  332. of access required by the client for the file.
  333. Use of this RPC avoids accessibility problems caused by client->server uid
  334. mapping.
  335. .ip \(bu
  336. Piggybacked Get Lease Request
  337. .pp
  338. The piggybacked get lease request is functionally equivalent to the Get Lease
  339. RPC except that it is attached to one of the other NQNFS RPC requests as follows.
  340. A getleaserequest is prepended to all of the request arguments for NQNFS
  341. and a getleaserequestres is inserted in all NFS result structures just after
  342. the "stat" field only if "stat == NFS_OK".
  343. .(l
  344. union getleaserequest switch (cachetype type) {
  345. case NQLREAD:
  346. case NQLWRITE:
  347. unsigned duration;
  348. default:
  349. void;
  350. };
  351. union getleaserequestres switch (cachetype type) {
  352. case NQLREAD:
  353. case NQLWRITE:
  354. bool cachable;
  355. unsigned duration;
  356. modifyrev rev;
  357. default:
  358. void;
  359. };
  360. .)l
  361. The get lease request applies to the file that the attached RPC operates on
  362. and the file attributes remain in the same location as for the NFS RPC reply
  363. structure.
  364. .ip \(bu
  365. Three additional "stat" values
  366. .pp
  367. Three additional values have been added to the enumerated type "stat".
  368. .(l
  369. NQNFS_EXPIRED=500
  370. NQNFS_TRYLATER=501
  371. NQNFS_AUTHERR=502
  372. .)l
  373. The "expired" value indicates that a lease has expired.
  374. The "try later"
  375. value is returned by the server when it wishes the client to retry the
  376. RPC request after a short delay. It is used during crash recovery (Section 2)
  377. and may also be useful for server congestion control.
  378. The "authentication error" value is returned for kerberized mount points to
  379. indicate that there is no cached authentication mapping and a Kerberos ticket
  380. for the principal is required.
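.pp
On the client side, the Get Lease footnote and the "try later" value combine
naturally: the client times the lease from the moment the request was sent
and retries after a short delay when told to. The sketch below assumes a
hypothetical send_getlease transport that returns a (status, reply) pair; the
retry delay and retry limit are also assumptions.

```python
import time

NFS_OK = 0
NQNFS_TRYLATER = 501


def get_lease(send_getlease, duration, clock=time.monotonic,
              sleep=time.sleep, retry_delay=1.0, max_tries=5):
    """Return (cachable, valid_until) or None if no lease was obtained.

    send_getlease(duration) is a hypothetical callable returning
    (status, reply), where reply is a dict with "cachable" and "duration".
    """
    for _ in range(max_tries):
        sent_at = clock()  # measured before the request leaves the client
        status, reply = send_getlease(duration)
        if status == NQNFS_TRYLATER:
            sleep(retry_delay)
            continue
        if status != NFS_OK:
            return None
        # Conservative: count the term from the send time, not the reply time.
        return reply["cachable"], sent_at + reply["duration"]
    return None
```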
  381. .sh 2 "Data Types"
  382. .ip \(bu
  383. cachetype
  384. .(l
  385. enum cachetype {
  386. NQLNONE = 0,
  387. NQLREAD = 1,
  388. NQLWRITE = 2
  389. };
  390. .)l
  391. Type of lease requested. NQLNONE is used to indicate no piggybacked lease
  392. request.
  393. .ip \(bu
  394. modifyrev
  395. .(l
  396. typedef unsigned hyper modifyrev;
  397. .)l
  398. The "modifyrev" is an unsigned quadword integer value that is never zero
  399. and increases every time the corresponding file is modified on the server.
  400. .ip \(bu
  401. nqnfs_time
  402. .(l
  403. struct nqnfs_time {
  404. unsigned seconds;
  405. unsigned nano_seconds;
  406. };
  407. .)l
  408. For NQNFS, times are handled at nanosecond resolution instead of the microsecond
  409. resolution used by NFS.
  410. .ip \(bu
  411. nqnfs_fattr
  412. .(l
  413. struct nqnfs_fattr {
  414. ftype type;
  415. unsigned mode;
  416. unsigned nlink;
  417. unsigned uid;
  418. unsigned gid;
  419. unsigned hyper size;
  420. unsigned blocksize;
  421. unsigned rdev;
  422. unsigned hyper bytes;
  423. unsigned fsid;
  424. unsigned fileid;
  425. nqnfs_time atime;
  426. nqnfs_time mtime;
  427. nqnfs_time ctime;
  428. unsigned flags;
  429. unsigned generation;
  430. modifyrev rev;
  431. };
  432. .)l
  433. The nqnfs_fattr structure is modified from the NFS fattr so that it stores
  434. the file size as a 64-bit quantity and the storage occupied as a 64-bit number
  435. of bytes. It also has fields added for the 4.4BSD va_flags and va_gen fields
  436. as well as the file's modify rev level.
  437. .ip \(bu
  438. nqnfs_sattr
  439. .(l
  440. struct nqnfs_sattr {
  441. unsigned mode;
  442. unsigned uid;
  443. unsigned gid;
  444. unsigned hyper size;
  445. nqnfs_time atime;
  446. nqnfs_time mtime;
  447. unsigned flags;
  448. unsigned rdev;
  449. };
  450. .)l
  451. The nqnfs_sattr structure is modified from the NFS sattr structure in the
  452. same manner as fattr.
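.pp
Since these types are specified in XDR, their wire form follows the usual XDR
rules: enums and unsigned integers are four big-endian bytes, an unsigned
hyper is eight, and a discriminated union is its discriminant followed by the
selected arm. The helper names below are hypothetical and the sketch covers
only three of the types, including the piggybacked getleaserequest union from
the protocol details.

```python
import struct

NQLNONE, NQLREAD, NQLWRITE = 0, 1, 2  # enum cachetype


def pack_getleaserequest(cachetype, duration=0):
    """Encode the getleaserequest union: discriminant, then the arm."""
    if cachetype in (NQLREAD, NQLWRITE):
        return struct.pack(">II", cachetype, duration)
    return struct.pack(">I", cachetype)  # default arm is void


def pack_modifyrev(rev):
    """Encode a modifyrev as an XDR unsigned hyper (never zero)."""
    if rev == 0:
        raise ValueError("modifyrev is never zero")
    return struct.pack(">Q", rev)


def pack_nqnfs_time(seconds, nano_seconds):
    """Encode struct nqnfs_time as two unsigned 32-bit integers."""
    return struct.pack(">II", seconds, nano_seconds)
```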
  453. .lp
  454. The arguments to several of the NFS RPCs have been modified as well. Mostly,
  455. these are minor changes to use 64-bit file offsets or similar. The modified
  456. argument structures follow.
  457. .ip \(bu
  458. Lookup RPC
  459. .(l
  460. struct lookup_diropargs {
  461. unsigned duration;
  462. fhandle dir;
  463. filename name;
  464. };
  465. union lookup_diropres switch (stat status) {
  466. case NFS_OK:
  467. struct {
  468. union getleaserequestres lookup_lease;
  469. fhandle file;
  470. nqnfs_fattr attributes;
  471. } lookup_diropok;
  472. default:
  473. void;
  474. };
  475. .)l
  476. If the additional "duration" argument is non-zero, the server gets a lease
  477. for the name being looked up, and the lease is returned
  478. in "lookup_lease".
  479. .ip \(bu
  480. Read RPC
  481. .(l
  482. struct nqnfs_readargs {
  483. fhandle file;
  484. unsigned hyper offset;
  485. unsigned count;
  486. };
  487. .)l
  488. .ip \(bu
  489. Write RPC
  490. .(l
  491. struct nqnfs_writeargs {
  492. fhandle file;
  493. unsigned hyper offset;
  494. bool append;
  495. nfsdata data;
  496. };
  497. .)l
  498. The "append" argument is true for append-only write operations.
  499. .ip \(bu
  500. Get Filesystem Attributes RPC
  501. .(l
  502. union nqnfs_statfsres switch (stat status) {
  503. case NFS_OK:
  504. struct {
  505. unsigned tsize;
  506. unsigned bsize;
  507. unsigned blocks;
  508. unsigned bfree;
  509. unsigned bavail;
  510. unsigned files;
  511. unsigned files_free;
  512. } info;
  513. default:
  514. void;
  515. };
  516. .)l
  517. The "files" field is the number of files in the file system and the "files_free"
  518. field is the number of additional files that can be created.
  519. .sh 1 "Summary"
  520. .pp
  521. The configuration and tuning of an NFS environment tends to be a bit of a
  522. mystic art, but hopefully this paper along with the man pages and other
  523. reading will be helpful. Good Luck.