% Source: /docs/report/src/content.tex (https://gitlab.com/sorind/tagfs)
\begin{abstract}
%\boldmath
File systems are an integral part of every operating system. Given the
high capacity of modern storage devices, file systems need better ways of
organizing and accessing data so that users can retrieve
exactly the files they are looking for.
In addition, users' need to personalize stored content and
to find specific data pushes manufacturers to explore alternatives to the
current design.
tagSys implements a tag-based file system in Linux, using a user space application
that supports tagging files and browsing files by tags,
and a kernel module that hooks into the VFS\cite{rlove} to keep metadata about files.
We present a simple way to implement such a system and show how the regular user can benefit
from file tagging.
\end{abstract}
% keywords
\textit{\textbf{Keywords}}: File systems, Tags, VFS, metadata
\section{Introduction}
In most operating systems, files are organized hierarchically.
This means that there is usually a starting point, or root directory.
In Windows-based systems there are multiple starting points,
one for each drive or partition. In UNIX-like systems,
there is a single root directory, with mount points that let users
attach or detach subtrees residing on different drives, partitions, etc.

In these filesystems a user organizes related data by storing it in the same
folder. But suppose that a user, Bob, has two separate folders: one for
photos taken in the mountains (Mountain-pics) and one for photos in which
a certain person appears (Alice-pics). Two questions arise: first, where should
Bob store a picture taken in the mountains in which Alice appears, and second, how
can Bob find all the pictures taken in the mountains in which Alice appears?
For current filesystems, the answer to the first question might be to store the photo in one
folder and create a link to it in the other, or to store it in both
folders. The answer to the second might be to name the photos in such a way
that they can be retrieved based on the stated criteria.

The tagSys file system takes a different approach, based on tags rather than on the
hierarchical organization that has long been rooted in modern operating systems.
For the above example, the tagSys answer to both questions is
to add tags to the photos ($<$mountains$><$Alice$>$) and then search for files that
carry these tags. The question of where to store a specific photo
is no longer that important.

A pure tag file system is difficult to implement from scratch, so we
adapted the existing file system infrastructure in Linux to support tags and examined how the two
systems can coexist on an end-user machine.

A tag file system should be able to organize files and data on disk independently of any
hierarchical logical layout. The position of the files on the disk is irrelevant and
completely transparent to the user: the file system stores files on disk
and simply retrieves them on demand based on tag queries.

In our approach, directory-based organization and file tags coexist, so that we can
see how the two systems fit together and how users can employ alternative ways of searching
and clustering their information.

We implemented a tag layer in the Linux Virtual File System and tested how this affects
the regular user. We added ways for the user to manipulate tags (add, delete,
search) in order to increase the flexibility of the filesystem and of the way it interacts with the user.
tagSys is expected to make it easier to work with files, especially personal ones.
\section{State of the art}
The idea of tagging files in order to access them more easily is not
new, and various attempts to implement it have been made. Some of
these are specialized solutions for a particular kind of data, such as Calibre\cite{calibre},
a userspace application that makes ebook management easier. The vast majority
of these applications rely on a database that stores the mappings between files and
their associated metadata, and expose a set of commands that translate
into specific database queries.
\subsection[Nepomuk-KDE]{Nepomuk-KDE}
Nepomuk-KDE\cite{nepomuk} is an implementation of Nepomuk that
has been integrated with KDE. It allows adding metadata
to items stored on a computer and making queries based on
that metadata. Following the Nepomuk specification, Nepomuk-KDE
stores semantic data from desktop applications in an RDF
(Resource Description Framework) store. For example,
the Dolphin file manager can add simple tags or
more elaborate comments to files. This solution is not limited
to file metadata: almost any application can use the RDF
store to add semantic metadata to its objects. For example,
KMail can do so for emails and Amarok for music files,
but the store is used mostly for tagging files.
\subsection{TaggedFrog}
TaggedFrog\cite{taggedfrog} is a Windows application
based on the convenient drag-and-drop technique. It lets users organize files,
documents and Web links simply by adding objects to a library and tagging them with
arbitrary keywords. Moreover, files can be tagged directly from Windows Explorer,
because the application is integrated with Explorer's context menu.
\subsection{pyTAGSfs}
pyTAGSfs\cite{pytagfs}
is a FUSE filesystem, written in Python for Linux and Mac OS X systems, that
arranges media files in a virtual directory structure based on the file tags.
File tags can be changed by moving and renaming virtual files and directories.
The virtual files can also be modified directly and, of course, can be opened
and played just like regular files.
\subsection{TagFS: Bringing Semantic Metadata to the Filesystem}
\textit{TagFS: Bringing Semantic Metadata to the Filesystem}\cite{tagfssemantic} is a research project
started at the University of Koblenz which, like Nepomuk, relies on RDF for
defining semantics and on SPARQL for queries. Metadata is stored in a repository with an
associated graph, on which various operations (additions,
updates, etc.) can be performed.
$\\$
\section{tagSys}
tagSys is a software application that implements a tag-based filesystem in
Linux. More specifically, tagSys allows tagging a file at creation time or
later, adding and removing tags, listing the tags currently associated with a file
and, most importantly,
browsing for files that have specific tag(s).
Unlike the other implementations, tagSys leaves the filesystem hierarchy unchanged and merely
associates tags with files (an example is presented in
\figref{img:hierarchy}).
\fig[scale=0.5]{figs/hierarchy.pdf}{img:hierarchy}{tagSys hierarchy}
The tagSys application architecture is presented in \figref{arch}.
\fig[scale=0.5]{figs/archall.pdf}{arch}{tagSys architecture}
The \textit{CLI} is used to issue commands for tag manipulation. To make the transition
to this new file system as user-friendly as possible, tags are manipulated through
two types of commands. The first type consists of file manipulation commands available on
every UNIX-like operating system, such as \textit{ls, touch, mv, cp}, whose behaviour
and implementation were slightly changed in tagSys in order to support tags.
The second type consists of new tagSys commands that
provide further tag-related operations: listing, removing and adding tags.
The implementation changes are related to hooks created in \textit{VFS} and are detailed in
subsection \textit{Implementation details}.
No implementation changes at the level of individual filesystems were required.
\subsection{Architectural decisions}
tagSys started as an idea to create a more user-friendly file system: remembering
tags is easier than remembering the name of a file or the place where it is stored.
At the same time, a pure tag filesystem might not offer a simple way of organizing data
hierarchically. Moreover, implementing tagSys as a new filesystem from
scratch would have been time consuming and would have required many changes to
the kernel and to the user interface.
The other choice was to implement tagSys as hooks in
VFS that store and retrieve the needed metadata.
Since the changes are made at VFS level, they introduce an overhead even for filesystems that
are used without tag support. A main concern in the implementation
was to keep this overhead as small as possible.
$\\$
From the beginning, the focus was on the changes needed at VFS and file system
level rather than on how to store the mappings between files and
their associated tags, so these mappings are kept in a file that always resides in RAM.
$\\$
The notion of an \textit{entry point} was introduced to designate
a point in the file system hierarchy from which a
distinct tagSys begins, meaning that tags apply only to that part of the
file system; multiple entry points can be declared for a file system.
This keeps the changes required by tagSys
isolated, so that performance in the normal case
is not affected: the only difference from a vanilla
kernel is one string comparison for each entry point
defined.
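
The entry-point check described above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the tagSys kernel code: the names \texttt{tag\_entry\_points}, \texttt{tag\_path\_under} and \texttt{tag\_path\_is\_taggable}, the fixed table size and the use of absolute paths are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size table of entry points (absolute directory paths). */
#define TAG_MAX_ENTRY_POINTS 8

struct tag_entry_points {
    const char *path[TAG_MAX_ENTRY_POINTS];
    size_t count;
};

/* Returns true if `path` is the entry point itself or lies below it. */
static bool tag_path_under(const char *path, const char *entry)
{
    size_t n = strlen(entry);

    if (n == 0 || strncmp(path, entry, n) != 0)
        return false;
    /* Reject "/home/bobby" when the entry point is "/home/bob". */
    return path[n] == '\0' || path[n] == '/' || entry[n - 1] == '/';
}

/* At most one prefix comparison per declared entry point. */
static bool tag_path_is_taggable(const struct tag_entry_points *eps,
                                 const char *path)
{
    for (size_t i = 0; i < eps->count; i++)
        if (tag_path_under(path, eps->path[i]))
            return true;
    return false;
}
```

A path outside every entry point costs only these comparisons before the normal VFS code path continues, which is the property the entry points are meant to provide.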
\subsection{Storage}
The mapping between a file and its associated tags is presented in \figref{storage}.
\fig[scale=0.4]{figs/storage.pdf}{storage}{Tag Storage}
Each file contains a bitmask in which each bit represents a tag. If a bit is set to 1, then
the file has the tag with that id; otherwise it does not. The bitmask has a fixed size, so
only a limited number of tags can exist. The limit defaults to 8192 tags and is configurable
through the kernel config file, with an impact on the memory and disk space used.
Each tag has a list of pointers to the files that carry it. This avoids
duplicating file information and means that updates are needed in only one place.
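
A minimal sketch of such a fixed-size per-file bitmask is given below. The default of 8192 tags comes from the text; the type and function names (\texttt{tag\_mask}, \texttt{tag\_mask\_set}, and so on) and the word-array layout are assumptions made for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define TAG_MAX_TAGS      8192  /* default limit quoted in the text */
#define TAG_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define TAG_MASK_WORDS \
    ((TAG_MAX_TAGS + TAG_BITS_PER_WORD - 1) / TAG_BITS_PER_WORD)

/* One bit per tag id; bit i set means "this file has tag i". */
struct tag_mask {
    unsigned long word[TAG_MASK_WORDS];
};

static void tag_mask_clear_all(struct tag_mask *m)
{
    memset(m, 0, sizeof(*m));
}

/* Each helper returns false when the id is outside the fixed range. */
static bool tag_mask_set(struct tag_mask *m, unsigned int id)
{
    if (id >= TAG_MAX_TAGS)
        return false;
    m->word[id / TAG_BITS_PER_WORD] |= 1UL << (id % TAG_BITS_PER_WORD);
    return true;
}

static bool tag_mask_unset(struct tag_mask *m, unsigned int id)
{
    if (id >= TAG_MAX_TAGS)
        return false;
    m->word[id / TAG_BITS_PER_WORD] &= ~(1UL << (id % TAG_BITS_PER_WORD));
    return true;
}

static bool tag_mask_test(const struct tag_mask *m, unsigned int id)
{
    if (id >= TAG_MAX_TAGS)
        return false;
    return (m->word[id / TAG_BITS_PER_WORD] >> (id % TAG_BITS_PER_WORD)) & 1UL;
}
```

With the default limit, each mask occupies $8192 / 8 = 1024$ bytes per tracked file, which is the memory cost that the configurable limit trades against.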
\subsection{Tag commands}
The idea of tagging files is to be able to attach a number of tags to a file.
Since the number of tags that will be associated with a file is not known beforehand,
we establish the convention that a filename is separated from its
associated tags by ":" whenever a command that involves tags is issued.
Tags are likewise separated from one another by ":".
$\\$
The behaviour of the \textit{ls} command was changed so that, when issued with
an argument
beginning with ":", it lists all the files that are tagged with the given words.
$\\$
An existing file can be assigned tags by issuing the \textit{tag} command
with the \textit{-a} parameter, followed by the filename and the tags to
attach to that file. Two constraints apply
when adding tags to a file. First, the
tagSys implementation limits the number of tags that can be assigned to a file to 256.
Second, the kernel implementation requires
that the total length of the filename, tags and separators
be less than or equal to the value of MAX$\_$PATH$\_$LENGTH (256).
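
The \texttt{filename:tag[:tag]*} convention and the two limits above can be illustrated with a small parser. This is a sketch under assumptions, not the tagSys parser: the names \texttt{tag\_arg} and \texttt{tag\_arg\_parse} are hypothetical, and rejecting empty tags is a choice made here that the text does not specify.

```c
#include <stddef.h>
#include <string.h>

#define TAG_MAX_ARG_LEN  256  /* MAX_PATH_LENGTH from the text */
#define TAG_MAX_PER_FILE 256  /* per-file tag limit from the text */
#define TAG_SEPARATOR    ':'

struct tag_arg {
    char buf[TAG_MAX_ARG_LEN + 1];      /* private copy, split in place */
    const char *name;                   /* empty for ":tag" style queries */
    const char *tag[TAG_MAX_PER_FILE];
    size_t ntags;
};

/*
 * Splits "name:tag1:tag2" into its parts.
 * Returns 0 on success, -1 if the argument is too long,
 * contains an empty tag, or names too many tags.
 */
static int tag_arg_parse(struct tag_arg *out, const char *arg)
{
    size_t len = strlen(arg);
    char *p;

    if (len > TAG_MAX_ARG_LEN)
        return -1;
    memcpy(out->buf, arg, len + 1);
    out->name = out->buf;
    out->ntags = 0;

    p = strchr(out->buf, TAG_SEPARATOR);
    while (p != NULL) {
        *p++ = '\0';                    /* terminate the previous field */
        if (*p == '\0' || *p == TAG_SEPARATOR)
            return -1;                  /* empty tag (assumed invalid) */
        if (out->ntags == TAG_MAX_PER_FILE)
            return -1;
        out->tag[out->ntags++] = p;
        p = strchr(p, TAG_SEPARATOR);
    }
    return 0;
}
```

For the argument \texttt{tag\_test:soa:tag:fs} the parser yields the name \texttt{tag\_test} and the three tags \texttt{soa}, \texttt{tag} and \texttt{fs}; for \texttt{:soa} the name is empty, which corresponds to the tag query form accepted by \textit{ls}.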
$\\$
Tags can be removed from a file with the \textit{tag -d filename:tag[:tag]*} command,
subject to the second constraint stated above.
$\\$
The output of the \textit{tag -l filename} command is the list of tags currently associated
with a file.
$\\$
tagSys permits the creation of entry points in the file system, which indicate
that tags may be used from that point down the hierarchy.
This was introduced in order to reduce the overhead for file systems where
the user does not need tags, limiting it to a couple of comparisons.
An entry point is created with the \textit{tag -c} command, which makes
the current working directory a new tagSys entry point.
$\\$
The available commands, together with a short description of each, are listed in Table~\ref{table:commands}.
\begin{center}
\begin{table}[htb]
\begin{center}
\begin{tabular}{ | l | l | l | l |}
\hline
\textbf{Command}&\textbf{Params} &\textbf{Args}&\textbf{Description}\\ \hline
ls & - & :tag[:tag]* & List files having the specified tags\\ \hline
tag & -c & - & Create a new entry point for tagSys\\ \hline
tag & -a & filename:tag[:tag]* & Add tags to file\\ \hline
tag & -d & filename:tag[:tag]* & Remove tags from filename\\ \hline
tag & -l & filename & List all the tags for filename\\
\hline
\end{tabular}
\end{center}
\caption{tagSys implemented commands}
\label{table:commands}
\end{table}
\end{center}
\subsection{Implementation details}
\subsubsection[Metadata]{Metadata structure}$\\$
We keep two hashtables in memory: one holds the
tags in the system and the other holds the files. Entries
are added to these hashtables when tags are added to a file. If the
file is not yet tracked, we create a new entry for it in the file hashtable;
the same is done for each new tag. Each entry in the tag hashtable
holds a list of file pointers, each pointing to a specific file, so
we do not have to duplicate the information about each file.
To associate tags with a file, each file has a bitvector with
one bit for each possible tag.
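
As an illustration of the tag hashtable, the sketch below maps a tag name to the numeric id that indexes the per-file bitvector, creating the entry on first use. It is a user-space approximation under assumptions: the table is deliberately small, it uses open addressing with an FNV-1a hash, it never deletes entries, and the per-tag list of file pointers is left out. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define TAG_SLOTS    256   /* sketch-sized table; must be a power of two */
#define TAG_NAME_LEN 32    /* assumed maximum tag length, including NUL */

struct tag_slot {
    char name[TAG_NAME_LEN];   /* name[0] == '\0' marks an empty slot */
    int id;                    /* bit index in the per-file bitvector */
};

/* Must be zero-initialized before first use. */
struct tag_table {
    struct tag_slot slot[TAG_SLOTS];
    int next_id;
};

/* FNV-1a string hash. */
static unsigned int tag_hash(const char *s)
{
    unsigned int h = 2166136261u;

    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/*
 * Returns the id of `name`. If the tag is unknown it is inserted when
 * `create` is nonzero; otherwise, or if the table is full or the name
 * is empty or too long, -1 is returned.
 */
static int tag_table_get(struct tag_table *t, const char *name, int create)
{
    size_t len = strlen(name);
    unsigned int start, probe;

    if (len == 0 || len >= TAG_NAME_LEN)
        return -1;
    start = tag_hash(name) & (TAG_SLOTS - 1);
    for (probe = 0; probe < TAG_SLOTS; probe++) {
        struct tag_slot *s = &t->slot[(start + probe) & (TAG_SLOTS - 1)];

        if (s->name[0] == '\0') {          /* free slot: tag not present */
            if (!create)
                return -1;
            memcpy(s->name, name, len + 1);
            s->id = t->next_id++;
            return s->id;
        }
        if (strcmp(s->name, name) == 0)
            return s->id;
    }
    return -1;                             /* table full */
}
```

Adding tags to a file would call this with \texttt{create} set and then set the returned bit in the file's bitvector, while a search would call it without \texttt{create} and fail early if any requested tag is unknown.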
\subsubsection{VFS Hooks} $\\$
The VFS hooks allow tagSys to break out of the normal flow
of the kernel and perform checks to
determine whether a particular file should be treated as a taggable
file and, if necessary, to make the required
changes. The hooks usually strip the tags from the filename
so that normal operations work as expected. For example, executing \textit{ls
dir:tag} would try to open the file \textit{dir:tag}, which does
not exist, so the command would fail; this is why the tags
must be stripped. After this, the hooks carry on with
operation-specific work.
To determine where we needed to insert our hooks, we ran strace
on a command and checked which syscalls it makes. For example, Table~\ref{table:ls}
lists the syscalls made by an "ls dir" command.
\begin{center}
\begin{table}[htb]
\begin{center}
\begin{tabular}{ | p{5.5cm} | p{5.5cm} | }
\hline
\textbf{Syscall}&\textbf{Action}\\ \hline
fstat64(path) & Remove tags from path\\ \hline
open(path) & Remove tags from path and make association between $<$process, fd$>$ and tags\\ \hline
getdents64(fd) & Get the tags associated with the fd and search the storage. Store the results in the dents structure passed from userspace.\\ \hline
close(fd) & Remove the association between fd and tags\\ \hline
\end{tabular}
\end{center}
\caption{Ls syscalls}
\label{table:ls}
\end{table}
\end{center}
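
The search step performed in the getdents64 hook can be approximated as a subset test on bitmasks: a file matches if every requested tag bit is set in its mask. The sketch below is an assumption-laden illustration, not the kernel code; it uses a small mask and a flat array of files in place of the hashtables, and the names \texttt{tagged\_file}, \texttt{tag\_mask\_contains} and \texttt{tag\_filter} are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TAG_MASK_WORDS 4   /* sketch-sized mask, not the 8192-bit default */

struct tagged_file {
    const char *name;
    unsigned long mask[TAG_MASK_WORDS];   /* bit i set: file has tag i */
};

/* True if every bit set in `want` is also set in `have`. */
static bool tag_mask_contains(const unsigned long *have,
                              const unsigned long *want)
{
    for (size_t w = 0; w < TAG_MASK_WORDS; w++)
        if ((have[w] & want[w]) != want[w])
            return false;
    return true;
}

/*
 * Stores in `out` the names of up to `max` files that carry all the
 * wanted tags and returns how many names were stored.
 */
static size_t tag_filter(const struct tagged_file *files, size_t nfiles,
                         const unsigned long *want,
                         const char **out, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < nfiles && found < max; i++)
        if (tag_mask_contains(files[i].mask, want))
            out[found++] = files[i].name;
    return found;
}
```

In the hook, the names collected this way would be written into the directory entries buffer passed from userspace instead of the normal directory listing.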
\subsubsection{Userspace application}
$\\$ The userspace application adds the tag layer to the normal
file operations through the \textit{tag} command, a
normal user space program that can be invoked by the user.
With it the user can add tags to a file, remove them, and
list the current tags. The application also allows creating an entry
point in the file system in a similar way. The tagging application
runs in user space and calls a specific API that forwards
the requests to the kernel. Adding and removing
tags follows the convention described above, using ":" as the delimiter;
tag listing uses the same format. The tag
application uses the fcntl system call to send the operation
type and the required arguments. New fcntl command types have been
defined for the tagSys operations. From the fcntl syscall, other
kernel-level tag-specific functions are called for adding tags,
deleting tags, searching and creating entry points.
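
The user-space side of this exchange might look like the sketch below. The command numbers \texttt{F\_TAG\_ADD} and \texttt{F\_TAG\_DEL} are placeholders invented for illustration (the real values are defined by the tagSys kernel changes), and passing the \texttt{filename:tag[:tag]*} string as the third fcntl argument on some open descriptor is likewise an assumption. On a stock kernel the call simply fails.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command numbers; NOT the values used by tagSys. */
#define F_TAG_ADD 0x7401
#define F_TAG_DEL 0x7402

#define TAG_MAX_ARG_LEN 256   /* MAX_PATH_LENGTH from the text */

/*
 * Builds "name:tag1:tag2" in `buf`. Returns the resulting length, or -1
 * if the result would exceed TAG_MAX_ARG_LEN or not fit in `size` bytes.
 */
static int tag_build_arg(char *buf, size_t size, const char *name,
                         const char *const *tags, size_t ntags)
{
    size_t used = strlen(name);

    if (used > TAG_MAX_ARG_LEN || used >= size)
        return -1;
    memcpy(buf, name, used + 1);
    for (size_t i = 0; i < ntags; i++) {
        size_t n = strlen(tags[i]);

        if (used + 1 + n > TAG_MAX_ARG_LEN || used + 1 + n >= size)
            return -1;
        buf[used++] = ':';
        memcpy(buf + used, tags[i], n + 1);
        used += n;
    }
    return (int)used;
}

/* Sends one tag operation to the kernel; returns -1 on failure. */
static int tag_request(int fd, int cmd, const char *arg)
{
    return fcntl(fd, cmd, arg);
}
```

A \textit{tag -a} invocation would then amount to building the argument string and issuing one \texttt{tag\_request} with the add command, leaving all parsing and storage work to the kernel-level functions.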
\section{Experimental results}
$\\$
Because of the way tagSys is implemented, there are two major testing scenarios.
The first tests the UNIX-based commands, while the second
tests the tag manipulation commands. An additional scenario could cover
performance testing, but this is outside the scope of this article, since the main goal of tagSys
is to show that a tag-based file system is implementable and more intuitive than a hierarchical one.
Testing consists of executing a series of commands, noting their output and comparing it with the expected results.
\\
The test platform is Fedora 14 with a 2.6.35.10 kernel
compiled with the tagSys changes.
At the moment, \textit{ls} is the only UNIX-based command modified to allow
the use of tags, and three tagSys commands (tag -c, tag -a, tag -d) are fully
implemented and tested. We started with \textit{ls} because it is the
most important command, the main use
of tags being to search by them, and we did not have enough time to implement
the other commands.
A first test scenario was to add tags to a file and browse for the file based on
one or more of its associated tags.
\lstset{numbers=none,captionpos=b,frame=single,language=bash,caption=First test scenario,label=lst:first-scenario}
\begin{lstlisting}
$ cd ~/tagSys
$ ls
tag_test untagged
$ tag -c
$ tag -a tag_test:soa:tag:fs
$ ls :soa
tag_test
$ ls :fs:soa:tag
tag_test
$ tag -d tag_test:soa
$ ls :fs:soa:tag
$
\end{lstlisting}
A second test scenario is to have several tagged files and browse by various
combinations of their tags.
\lstset{numbers=none,captionpos=b,frame=single,language=bash,caption=Second test scenario,label=lst:second-scenario}
\begin{lstlisting}
$ ls
tag_test untagged
$ tag -a untagged:soa:fs:test
$ tag -a tag_test:second:soa:test
$ ls :soa:test
tag_test untagged
\end{lstlisting}
We wanted to have little impact on the normal performance of the kernel,
which is why we introduced the entry points. Because of them, each VFS syscall
incurs an overhead of only a few string comparisons. To check that performance
is still acceptable, we did a kernel build while running the tagSys-modified kernel.
The performance differences were small, 10--20 seconds on average. Because
building the kernel stresses the filesystem heavily, we are confident that
performance in normal cases remains very good.
\section{Conclusion}
tagSys can be a significant step towards a differently organized file system for end users, thanks to its low overhead,
good performance and usability.
Changes in the core kernel code are hidden behind a kernel config option (CONFIG\_TAGFS), so they can be turned off.
Pushing the changes upstream could make tagSys more useful and might lead to a better interface to the filesystem, but persistent storage of this kind is usually not accepted in the vanilla kernel.
$\\$
There are still a couple of common UNIX commands whose behaviour we want to
modify so that they can be issued with tags. We would also like to add a tag-specific command for listing
the tags associated with a file. These commands are summarized in Table~\ref{table:future-work}.
\begin{center}
\begin{table}[htb]
\begin{center}
\begin{tabular}{ | p{2.5cm} | p{6.5cm} | p{2.5cm} |}
\hline
\textbf{Command}&\textbf{Description}&\textbf{VFS function}\\ \hline
touch &Add tags to a file at creation time&do$\_$sys$\_$open\\ \hline
mv &The new file keeps tag information&rename\\ \hline
cp &The new file copies tag information from the old one&unlink\\ \hline
tag -l &tagSys new command for listing tags associated to a file&-\\
\hline
\end{tabular}
\end{center}
\caption{More tag-based commands}
\label{table:future-work}
\end{table}
\end{center}
\textit{Possible future work}

In a pure tag file system, the on-disk layout could be improved as follows.
Since files carry tags and there is no hierarchical structure,
blocks of files could be grouped on disk based on their tags, which could reduce disk fragmentation.
Clustering tag data can give insight into how much space files with a certain tag require
and how accessible they should be to the user. Used properly, this could lower the external fragmentation of the disk.
However, more tests are needed on this matter.

Given that Linux implements Extended Attributes\cite{extattr} (also called xattrs), which are name/value pairs
associated with files as an extension to the normal inode-based attributes, tags could be stored in xattrs
and easily displayed by graphical file browsers.

The current implementation, in which hooks added to kernel syscalls let existing applications
use tagSys, can be augmented with new syscalls or fcntl options so that new applications can use the
system better.