.\" Copyright (c) 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This document is derived from software contributed to Berkeley by
.\" Rick Macklem at The University of Guelph.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)2.t	8.1 (Berkeley) 6/8/93
.\"
.\"	$FreeBSD$
.\"
.sh 1 "Not Quite NFS, Crash Tolerant Cache Consistency for NFS"
.pp
Not Quite NFS (NQNFS) is an NFS-like protocol designed to maintain full cache
consistency between clients in a crash tolerant manner.
It is an adaptation of the NFS protocol such that the server supports both NFS
and NQNFS clients while maintaining full consistency between the server and
NQNFS clients.
This section borrows heavily from work done on Spritely-NFS [Srinivasan89],
but uses Leases [Gray89] to avoid the need to recover server state information
after a crash.
The reader is strongly encouraged to read these references before
trying to grasp the material presented here.
.sh 2 "Overview"
.pp
The protocol maintains cache consistency in a manner somewhat like
that of Sprite [Nelson88],
but is based on short term leases\** instead of hard state information
about open files.
.(f
\** A lease is a ticket permitting an activity that is
valid until some expiry time.
.)f
The basic principle is that the protocol will disable client caching of a
file whenever that file is write shared\**.
.(f
\** Write sharing occurs when at least one client is modifying a file while
other client(s) are reading the file.
.)f
Whenever a client wishes to cache data for a file it must hold a valid lease.
There are three types of leases: read caching, write caching and non-caching.
The last type requires that all file operations be done synchronously with
the server via RPCs.
A read caching lease allows for client data caching, but no file modifications
may be done.
A write caching lease allows for client caching of writes,
but requires that all writes be pushed to the server when the lease expires.
If a client has dirty buffers\**
.(f
\** Dirty buffers hold cached write data that has not yet been pushed
(written) to the server.
.)f
when a write cache lease has almost expired, it will attempt to
extend the lease but is required to push the dirty buffers if extension fails.
A client gets leases by either doing a \fBGetLease RPC\fR or by piggybacking
a \fBGetLease Request\fR onto another RPC. Piggybacking is supported for the
frequent RPCs Getattr, Setattr, Lookup, Readlink, Read, Write and Readdir
in an effort to minimize the number of \fBGetLease RPCs\fR required.
All leases are at the granularity of a file, since all NFS RPCs operate on
individual files and NFS has no intrinsic notion of a file hierarchy.
Directories, symbolic links and file attributes may be read cached but
are not write cached.
The exception here is the attribute file_size, which is updated during cached
writing on the client to reflect a growing file.
.pp
It is the server's responsibility to ensure that consistency is maintained
among the NQNFS clients by disabling client caching whenever a server file
operation would cause inconsistencies.
The possibility of inconsistencies occurs whenever a client has
a write caching lease and any other client,
or a local operation on the server,
tries to access the file, or when
a modify operation is attempted on a file being read cached by client(s).
At this time, the server sends an \fBeviction notice\fR to all clients holding
the lease and then waits for lease termination.
Lease termination occurs when a \fBvacated the premises\fR message has been
received from all the clients that have signed the lease or when the lease
expires via timeout.
The message pair \fBeviction notice\fR and \fBvacated the premises\fR roughly
corresponds to a Sprite server\(->client callback, but is not implemented as an
actual RPC, to avoid the server waiting indefinitely for a reply from a dead
client.
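.pp
The conflict rule in the preceding paragraph can be sketched in Python as
follows.
The function name, the dictionary representation of the leases on one file
and the use of None for a local server operation are all hypothetical.
Excluding the requesting client from its own eviction is an assumption,
since the text speaks only of "any other client".
.(l
```python
def clients_to_evict(leases, requester, modifies):
    """Return the set of clients that must receive an eviction notice.

    leases maps a client name to "read" or "write" for a single file.
    requester is a client name, or None for a local server operation.
    modifies is True when the requested operation changes the file.
    """
    others = {c: kind for c, kind in leases.items() if c != requester}
    write_cached_elsewhere = "write" in others.values()
    if write_cached_elsewhere or modifies:
        # Any access conflicts with another client's write caching
        # lease, and a modify conflicts with every read caching lease.
        return set(others)
    return set()
```
.)l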
.pp
Server consistency checking can be viewed as issuing intrinsic leases for a
file operation for the duration of the operation only. For example, the
\fBCreate RPC\fR will get an intrinsic write lease on the directory in which
the file is being created, disabling client read caches for that directory.
.pp
By relegating this responsibility to the server, consistency between the
server and NQNFS clients is maintained when NFS clients are modifying the
file system as well.\**
.(f
\** The NFS clients will continue to be \fIapproximately\fR consistent with
the server.
.)f
.pp
The leases are issued as time intervals to avoid the requirement of time of day
clock synchronization. There are three important time constants known to
the server. The \fBmaximum_lease_term\fR sets an upper bound on lease duration.
The \fBclock_skew\fR is added to all lease terms on the server to correct for
differing clock speeds between the client and server.
The \fBwrite_slack\fR is
the number of seconds the server is willing to wait for a client with
an expired write caching lease to push dirty writes.
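.pp
A minimal sketch of how the three time constants might combine on the server
is given below in Python.
The numeric values and the function names are hypothetical.
The arithmetic follows the description above: the granted duration is capped
by maximum_lease_term, the server adds clock_skew to its own view of the
expiry, and a write caching lease is given a further write_slack seconds.
.(l
```python
MAXIMUM_LEASE_TERM = 60   # seconds, hypothetical value
CLOCK_SKEW = 3            # seconds, hypothetical value
WRITE_SLACK = 5           # seconds, hypothetical value


def grant_lease(requested_duration, now):
    """Return (duration told to the client, expiry time kept by the server)."""
    duration = min(requested_duration, MAXIMUM_LEASE_TERM)
    server_expiry = now + duration + CLOCK_SKEW
    return duration, server_expiry


def write_lease_deadline(server_expiry):
    """Latest time the server waits for dirty writes after expiry."""
    return server_expiry + WRITE_SLACK
```
.)l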
.pp
The server maintains a \fBmodify_revision\fR number for each file. It is
defined as an unsigned quadword integer that is never zero and that must
increase whenever the corresponding file is modified on the server.
It is used by the client to determine whether or not cached data for the
file is stale.
Generating this value is easier said than done. The current implementation
uses the following technique, which is believed to be adequate.
The high order longword is stored in the ufs inode and is initialized to one
when an inode is first allocated.
The low order longword is stored in main memory only and is initialized to
zero when an inode is read in from disk.
When the file is modified for the first time within a given second of
wall clock time, the high order longword is incremented by one and
the low order longword reset to zero.
For subsequent modifications within the same second of wall clock
time, the low order longword is incremented. If the low order longword wraps
around to zero, the high order longword is incremented again.
Since the high order longword only increments once per second and the inode
is pushed to disk frequently during file modification, this implies that the
in-memory (Current) and on-disk (Disk) values of the high order longword
satisfy
0 \(<= Current\(miDisk \(<= 5.
When the inode is read in from disk, 10
is added to the high order longword, which ensures that the quadword
is greater than any value it could have had before a crash.
This introduces apparent modifications every time the inode falls out of
the LRU inode cache, but this should only reduce the client caching performance
by a (hopefully) small margin.
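.pp
The technique just described can be sketched in Python as follows.
The class and method names are hypothetical, the sketch always adds 10 when
an object is built from the on-disk value, and it ignores overflow of the
high order longword.
.(l
```python
MASK32 = 0xFFFFFFFF


class ModifyRev:
    """In-memory modify_revision state for one inode (illustrative only)."""

    def __init__(self, disk_high):
        # Inode read in from disk: add 10 so the quadword exceeds any
        # value it could have had before a crash.
        self.high = disk_high + 10
        self.low = 0
        self.last_second = None

    def modified(self, wall_second):
        """Record one modification and return the new 64-bit revision."""
        if wall_second != self.last_second:
            # First modification within this second of wall clock time.
            self.high += 1
            self.low = 0
            self.last_second = wall_second
        else:
            self.low = (self.low + 1) & MASK32
            if self.low == 0:        # low order longword wrapped around
                self.high += 1
        return self.value()

    def value(self):
        return (self.high << 32) | self.low
```
.)l
.pp
If the on-disk high order longword lags the in-memory one by at most 5,
reloading it after a crash yields a value at least 5 larger than the last
in-memory one, so clients always see cached data as stale.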
.sh 2 "Crash Recovery and other Failure Scenarios"
.pp
The server must maintain the state of all the current leases held by clients.
The nice thing about short term leases is that maximum_lease_term seconds
after the server stops issuing leases, there are no current leases left.
As such, server crash recovery does not require any state recovery. After
rebooting, the server refuses to service any RPCs except for writes until
write_slack seconds after the last lease would have expired\**.
.(f
\** The last lease expiry time may be safely estimated as
"boottime+maximum_lease_term+clock_skew" for machines that cannot store
it in nonvolatile RAM.
.)f
By then, the server has no outstanding leases whose state would need to be
recovered, and the clients have had at least write_slack seconds to push dirty
writes to the server and bring the server up to date. After this, the
server simply services requests in a manner similar to NFS.
In an effort to minimize the effect of "recovery storms" [Baker91],
the server replies \fBtry_again_later\fR to the RPCs it is not
yet ready to service.
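.pp
The recovery window can be sketched in Python as below, using the estimate
from the footnote for the last lease expiry time.
The function names, the string labels and the sample numbers in the comment
are hypothetical.
.(l
```python
def recovery_end(boottime, maximum_lease_term, clock_skew, write_slack):
    """Time at which the rebooted server resumes normal service."""
    last_lease_expiry = boottime + maximum_lease_term + clock_skew
    return last_lease_expiry + write_slack


def admit(procedure, now, end_of_recovery):
    """Decide how to answer one RPC that arrives at time now."""
    if now >= end_of_recovery or procedure == "write":
        return "service"             # writes are always accepted
    return "try_again_later"


# Example: boot at t=1000 with terms 60, 3 and 5 seconds gives 1068.
```
.)l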
.pp
After a client crashes, the server may have to wait for a lease to time out
before servicing a request if write sharing of a file with a cachable lease
on the client is about to occur.
As for the client, it simply starts up getting any leases it now needs. Any
outstanding leases for that client on the server prior to the crash will
either be renewed or expire via timeout.
.pp
Certain network partitioning failures are more problematic. If a client to
server network connection is severed just before a write caching lease expires,
the client cannot push the dirty writes to the server. After the lease expires
on the server, the server permits other clients to access the file with the
potential of getting stale data. Unfortunately I believe this failure scenario
is intrinsic to any delayed write caching scheme unless the server is required
to wait \fBforever\fR for a client to regain contact\**.
.(f
\** Gray and Cheriton avoid this problem by using a \fBwrite through\fR policy.
.)f
Since the write caching lease has expired on the client,
it will sync up with the
server as soon as the network connection has been re-established.
.pp
There is another failure condition that can occur when the server is congested.
The worst case scenario would have the client pushing dirty writes to the server
but a large request queue on the server delays these writes for more than
\fBwrite_slack\fR seconds. It is hoped that a congestion control scheme using
the \fBtry_again_later\fR RPC reply after booting combined with
the following lease termination rule for write caching leases
can minimize the risk of this occurrence.
A write caching lease is only terminated on the server when there have
been no writes to the file and the server has not been overloaded during
the previous write_slack seconds. The condition that the server has not been
overloaded is approximated by a test for sleeping nfsd(s) at the end of the
write_slack period.
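.pp
One possible reading of the termination rule for write caching leases is
sketched below in Python.
The function name and its arguments are hypothetical, and the requirement
that the lease has been expired for a full write_slack period before the
test is applied is an assumption drawn from the definition of write_slack.
.(l
```python
def may_terminate(now, server_expiry, last_write, nfsd_sleeping, write_slack):
    """Return True if an expired write caching lease may be terminated.

    last_write is the time of the most recent write to the file, or None.
    nfsd_sleeping is True if at least one nfsd was found sleeping at the
    end of the write_slack period (the overload approximation).
    """
    if now < server_expiry + write_slack:
        return False                 # client still has time to push writes
    quiet = last_write is None or now - last_write >= write_slack
    return quiet and nfsd_sleeping
```
.)l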
.sh 2 "Server Disk Full"
.pp
There is a serious unresolved problem for delayed write caching with respect to
server disk space allocation.
When the disk on the file server is full, delayed write RPCs can fail
due to "out of space".
For NFS, this occurrence results in an error return from the close system
call on the file, since the dirty blocks are pushed on close.
Processes writing important files can check for this error return
to ensure that the file was written successfully.
For NQNFS, the dirty blocks are not pushed on close and as such the client
may not attempt the write RPC until after the process has done the close,
which implies no error return from the close.
For the current prototype,
the only solution is to modify programs writing important
file(s) to call fsync and check for an error return from it instead of close.
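.pp
As an illustration of that workaround, a program might write an important
file as in the Python sketch below.
The function name write_important is hypothetical.
The calls os.open, os.write, os.fsync and os.close are standard, and a
failure such as "out of space" surfaces as an OSError raised by os.write or
os.fsync instead of being lost at close.
.(l
```python
import os


def write_important(path, data):
    """Write data to path and force it to the server before closing."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                  # os.write may accept only part
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                 # errors are reported here as OSError
    finally:
        os.close(fd)
```
.)l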
.sh 2 "Protocol Details"
.pp
The protocol specification is identical to that of NFS [Sun89] except for
the following changes.
.ip \(bu
RPC Information
.(l
        Program Number 300105
        Version Number 1
.)l
.ip \(bu
Readdir_and_Lookup RPC
.(l
        struct readdirlookargs {
                fhandle file;
                nfscookie cookie;
                unsigned count;
                unsigned duration;
        };

        struct entry {
                unsigned cachable;
                unsigned duration;
                modifyrev rev;
                fhandle entry_fh;
                nqnfs_fattr entry_attrib;
                unsigned fileid;
                filename name;
                nfscookie cookie;
                entry *nextentry;
        };

        union readdirlookres switch (stat status) {
        case NFS_OK:
                struct {
                        entry *entries;
                        bool eof;
                } readdirlookok;
        default:
                void;
        };

        readdirlookres
        NQNFSPROC_READDIRLOOK(readdirlookargs) = 18;
.)l
Reads entries in a directory in a manner analogous to the NFSPROC_READDIR RPC
in NFS, but returns the file handle and attributes of each entry as well.
This allows the attribute and lookup caches to be primed.
.ip \(bu
Get Lease RPC
.(l
        struct getleaseargs {
                fhandle file;
                cachetype readwrite;
                unsigned duration;
        };

        union getleaseres switch (stat status) {
        case NFS_OK:
                bool cachable;
                unsigned duration;
                modifyrev rev;
                nqnfs_fattr attributes;
        default:
                void;
        };

        getleaseres
        NQNFSPROC_GETLEASE(getleaseargs) = 19;
.)l
Gets a lease for "file" valid for "duration" seconds from when the lease
was issued on the server\**.
.(f
\** To be safe, the client may only assume that the lease is valid
for ``duration'' seconds from when the RPC request was sent to the server.
.)f
The lease permits client caching if "cachable" is true.
The modify revision level and attributes for the file are also returned.
.ip \(bu
Eviction Message
.(l
        void
        NQNFSPROC_EVICTED (fhandle) = 21;
.)l
This message is sent from the server to the client. When the client receives
the message, it should flush data associated with the file represented by
"fhandle" from its caches and then send the \fBVacated Message\fR back to
the server. Flushing includes pushing any dirty writes via write RPCs.
.ip \(bu
Vacated Message
.(l
        void
        NQNFSPROC_VACATED (fhandle) = 20;
.)l
This message is sent from the client to the server in response to the
\fBEviction Message\fR. See above.
.ip \(bu
Access RPC
.(l
        struct accessargs {
                fhandle file;
                bool read_access;
                bool write_access;
                bool exec_access;
        };

        stat
        NQNFSPROC_ACCESS(accessargs) = 22;
.)l
The access RPC does permission checking on the server for the given type
of access required by the client for the file.
Use of this RPC avoids accessibility problems caused by client\(->server uid
mapping.
.ip \(bu
Piggybacked Get Lease Request
.pp
The piggybacked get lease request is functionally equivalent to the Get Lease
RPC except that it is attached to one of the other NQNFS RPC requests as follows.
A getleaserequest is prepended to all of the request arguments for NQNFS
and a getleaserequestres is inserted in all NFS result structures just after
the "stat" field, but only if "stat == NFS_OK".
.(l
        union getleaserequest switch (cachetype type) {
        case NQLREAD:
        case NQLWRITE:
                unsigned duration;
        default:
                void;
        };

        union getleaserequestres switch (cachetype type) {
        case NQLREAD:
        case NQLWRITE:
                bool cachable;
                unsigned duration;
                modifyrev rev;
        default:
                void;
        };
.)l
The get lease request applies to the file that the attached RPC operates on
and the file attributes remain in the same location as for the NFS RPC reply
structure.
.ip \(bu
Three additional "stat" values
.pp
Three additional values have been added to the enumerated type "stat".
.(l
        NQNFS_EXPIRED=500
        NQNFS_TRYLATER=501
        NQNFS_AUTHERR=502
.)l
The "expired" value indicates that a lease has expired.
The "try later"
value is returned by the server when it wishes the client to retry the
RPC request after a short delay. It is used during crash recovery (Section 2)
and may also be useful for server congestion control.
The "authentication error" value is returned for kerberized mount points to
indicate that there is no cached authentication mapping and a Kerberos ticket
for the principal is required.
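.pp
To tie together the Get Lease RPC and the "try later" status described
above, the Python sketch below shows one way a client might obtain a lease.
The function get_lease, the send_rpc transport hook, the retry delay and the
retry limit are hypothetical.
The value 0 for NFS_OK is taken from the NFS specification and the other
status values come from the list above.
The expiry is measured from the time the request was sent, as the Get Lease
footnote requires, and any other error status is simply raised, which is an
assumption about client policy.
.(l
```python
NFS_OK = 0
NQNFS_EXPIRED = 500
NQNFS_TRYLATER = 501
NQNFS_AUTHERR = 502


def get_lease(send_rpc, duration, clock, sleep, retry_delay=1.0, max_tries=5):
    """Return (cachable, client_expiry) for a newly granted lease.

    send_rpc(duration) is a hypothetical hook that performs the RPC and
    returns (status, cachable, granted_duration).  clock() returns the
    current time in seconds and sleep(seconds) pauses the caller.
    """
    for _ in range(max_tries):
        sent_at = clock()            # taken before the request goes out
        status, cachable, granted = send_rpc(duration)
        if status == NQNFS_TRYLATER:
            sleep(retry_delay)       # server asked for a short delay
            continue
        if status != NFS_OK:
            raise OSError("GetLease failed with stat %d" % status)
        # Conservative: valid for granted seconds from when we sent.
        return cachable, sent_at + granted
    raise TimeoutError("server kept replying try later")
```
.)l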
.sh 2 "Data Types"
.ip \(bu
cachetype
.(l
        enum cachetype {
                NQLNONE = 0,
                NQLREAD = 1,
                NQLWRITE = 2
        };
.)l
Type of lease requested. NQLNONE is used to indicate no piggybacked lease
request.
.ip \(bu
modifyrev
.(l
        typedef unsigned hyper modifyrev;
.)l
The "modifyrev" is an unsigned quadword integer value that is never zero
and increases every time the corresponding file is modified on the server.
.ip \(bu
nqnfs_time
.(l
        struct nqnfs_time {
                unsigned seconds;
                unsigned nano_seconds;
        };
.)l
For NQNFS, times are handled at nanosecond resolution instead of the
microsecond resolution used by NFS.
.ip \(bu
nqnfs_fattr
.(l
        struct nqnfs_fattr {
                ftype type;
                unsigned mode;
                unsigned nlink;
                unsigned uid;
                unsigned gid;
                unsigned hyper size;
                unsigned blocksize;
                unsigned rdev;
                unsigned hyper bytes;
                unsigned fsid;
                unsigned fileid;
                nqnfs_time atime;
                nqnfs_time mtime;
                nqnfs_time ctime;
                unsigned flags;
                unsigned generation;
                modifyrev rev;
        };
.)l
The nqnfs_fattr structure is modified from the NFS fattr so that it stores
the file size as a 64-bit quantity and the storage occupied as a 64-bit number
of bytes. It also has fields added for the 4.4BSD va_flags and va_gen fields
as well as the file's modify rev level.
.ip \(bu
nqnfs_sattr
.(l
        struct nqnfs_sattr {
                unsigned mode;
                unsigned uid;
                unsigned gid;
                unsigned hyper size;
                nqnfs_time atime;
                nqnfs_time mtime;
                unsigned flags;
                unsigned rdev;
        };
.)l
The nqnfs_sattr structure is modified from the NFS sattr structure in the
same manner as fattr.
.lp
The arguments to several of the NFS RPCs have been modified as well. Mostly,
these are minor changes to use 64-bit file offsets or similar. The modified
argument structures follow.
.ip \(bu
Lookup RPC
.(l
        struct lookup_diropargs {
                unsigned duration;
                fhandle dir;
                filename name;
        };

        union lookup_diropres switch (stat status) {
        case NFS_OK:
                struct {
                        union getleaserequestres lookup_lease;
                        fhandle file;
                        nqnfs_fattr attributes;
                } lookup_diropok;
        default:
                void;
        };
.)l
The additional "duration" argument, if non-zero, tells the server to get a
lease for the name being looked up; the lease is returned
in "lookup_lease".
.ip \(bu
Read RPC
.(l
        struct nqnfs_readargs {
                fhandle file;
                unsigned hyper offset;
                unsigned count;
        };
.)l
.ip \(bu
Write RPC
.(l
        struct nqnfs_writeargs {
                fhandle file;
                unsigned hyper offset;
                bool append;
                nfsdata data;
        };
.)l
The "append" argument is true for append only write operations.
.ip \(bu
Get Filesystem Attributes RPC
.(l
        union nqnfs_statfsres switch (stat status) {
        case NFS_OK:
                struct {
                        unsigned tsize;
                        unsigned bsize;
                        unsigned blocks;
                        unsigned bfree;
                        unsigned bavail;
                        unsigned files;
                        unsigned files_free;
                } info;
        default:
                void;
        };
.)l
The "files" field is the number of files in the file system and the "files_free"
field is the number of additional files that can be created.
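.pp
The declarations in this section are written in the XDR language.
Assuming they are encoded by the standard XDR rules (big-endian, 4 bytes for
an enum, bool or unsigned, 8 bytes for an unsigned hyper, and a union sent
as its discriminant followed by the selected arm), the piggybacked lease
request and its result could be packed and unpacked as in the Python sketch
below.
The function names and the dictionary layout are hypothetical.
.(l
```python
import struct

NQLNONE, NQLREAD, NQLWRITE = 0, 1, 2


def pack_getleaserequest(cachetype, duration=0):
    """Encode a getleaserequest union as XDR bytes."""
    if cachetype in (NQLREAD, NQLWRITE):
        return struct.pack(">II", cachetype, duration)
    return struct.pack(">I", cachetype)      # void arm: discriminant only


def unpack_getleaserequestres(buf):
    """Decode a getleaserequestres; return (cachetype, lease, bytes_used)."""
    (cachetype,) = struct.unpack_from(">I", buf, 0)
    if cachetype not in (NQLREAD, NQLWRITE):
        return cachetype, None, 4
    # bool cachable, unsigned duration, unsigned hyper (modifyrev) rev
    cachable, duration, rev = struct.unpack_from(">IIQ", buf, 4)
    lease = {"cachable": bool(cachable), "duration": duration, "rev": rev}
    return cachetype, lease, 20
```
.)l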
.sh 1 "Summary"
.pp
The configuration and tuning of an NFS environment tends to be a bit of a
mystic art, but hopefully this paper along with the man pages and other
reading will be helpful. Good Luck.