\begin{abstract}
%\boldmath

File systems are an integral part of every operating system. Because of the 
high capacity of modern storage devices, file systems need a better way of 
organizing and accessing data, so that users can more easily retrieve
exactly the files they are looking for.
Also, users' need to personalize stored content and
to find specific data pushes manufacturers to employ alternatives to the 
current design.

tagSys implements a tag-based file system in Linux, using a user space application
that offers support for tagging files and browsing files by tags, 
and a kernel module that hooks into the VFS\cite{rlove} to keep metadata about files. 
We present a simple way to implement such a system and show how the regular user can benefit 
from file tagging.
\end{abstract}
% keywords
\textit{\textbf{Keywords}}: File systems, Tags, VFS, metadata

\section{Introduction}
In most operating systems files are organized hierarchically. 
This means that there usually is a starting point, or parent directory. 
In Windows-based systems there are multiple starting points, 
based on the physical hard drive partitions. In UNIX-like systems 
there is a single root directory, with mount points through which users
can add or remove subtrees from different drives, partitions, etc. 

In these file systems a user organizes related data by storing it in the same 
folder. Suppose, however, that a user, Bob, has two separate folders: one for
photos taken in the mountains (Mountain-pics) and one for photos in which 
a certain person appears (Alice-pics). Two questions arise. First, where should 
Bob store a picture taken in the mountains in which Alice appears? Second, how 
can Bob find the pictures taken in the mountains in which Alice appears? With 
current file systems, the answer to the first question might be to store the photo
in one folder and create a link to it in the other, or to store it in both
folders. The answer to the second could be to name the photos in such a way
that they can be retrieved based on these criteria. 

The tagSys file system takes a different approach, based on tags rather than on the
hierarchical system that has long been rooted in modern operating systems. 
In the example above, the tagSys answer to both questions is
to add tags to the photos ($<$mountains$><$Alice$>$) and then search for files that 
contain these tags. The question of where to store a specific photo 
becomes much less important. 

A pure tag file system is difficult to implement from scratch, so we 
adapted the current file system infrastructure in Linux to support tags, in order to see how the two 
systems can coexist on an end-user machine. 
A tag file system should be able to organize files and data on the disk regardless of 
any hierarchical logical layout. The position of the files on the disk is irrelevant and 
completely transparent to the user. The file system should be able to put files on disk
and simply retrieve them on demand based on tag requests. 
In our approach, logical directory-based organization and file tags coexist, in order
to see how the two systems fit together and how users can use alternatives for searching
and clustering the information they have. 

We implemented a tag layer in the Linux Virtual File System and tested how it affects 
the regular user. We added ways for the user to manipulate tags (add, delete,
search) in order to increase the flexibility of the file system and of the way it interacts with the user.
tagSys is expected to make it easier to work with files, especially personal ones.

%\subsection{Subsection Heading Here}
%Subsection text here.


%\subsubsection{Subsubsection Heading Here}
%Subsubsection text here.

\section{State of the art}

The idea of tagging files in order to access them more easily is not
new, and various attempts to implement it have been made. Some of
these are specialized solutions for particular kinds of data, such as Calibre\cite{calibre},
a userspace application that makes ebook management easier. The vast majority
of these applications rely on a database that stores the mappings between files and 
their associated metadata, and expose a set of commands that translate
to specific queries on the database.  

\subsection[Nepomuk-KDE]{Nepomuk-KDE}
Nepomuk-KDE\cite{nepomuk} is an implementation of Nepomuk that
has been integrated with KDE. It allows adding metadata
to items stored on a computer and making queries based on
that metadata. Based on the Nepomuk specification, Nepomuk-KDE
is able to store semantic data from desktop applications in an
RDF (Resource Description Framework) store. For example,
the Dolphin file manager is able to add simple tags
or more elaborate comments to files. The solution is not limited
to file metadata: almost every application can use the RDF
store to add semantic metadata to its objects. For example,
KMail can do so for emails and Amarok for music files,
but the store is used mostly for tagging files.


\subsection{TaggedFrog}
TaggedFrog\cite{taggedfrog} is a Windows application 
based on the convenient drag-and-drop technique. It allows users to organize their files, 
documents and Web links by adding objects to the library and tagging them with 
any keywords. Moreover, files can be tagged directly from Windows File Explorer, 
because the application is integrated with Explorer's context menu.

\subsection{pyTAGSfs}
pyTAGSfs\cite{pytagfs}
is a FUSE filesystem, written in Python for Linux and Mac OS X systems, that 
arranges media files in a virtual directory structure based on the file tags. 
File tags can be changed by moving and renaming virtual files and directories. 
The virtual files can also be modified directly, and, of course, can be opened 
and played just like regular files.

\subsection{TagFS: Bringing Semantic Metadata to the Filesystem}
\textit{TagFS: Bringing Semantic Metadata to the Filesystem}\cite{tagfssemantic} is a research project
started at the University of Koblenz which, like Nepomuk, relies on RDF for
defining semantics and on SPARQL for queries. Metadata is stored in a repository with an
associated graph, on which various operations can be performed (additions, 
updates, etc.).
$\\$

\section{tagSys}
    
tagSys is a software application that implements a tag-based file system in 
Linux. More specifically, tagSys allows tagging a file at creation time or
later, adding and removing tags, listing the tags currently associated with a file
and, most importantly, browsing for files that have specific tags.

What differs from the other implementations is that the file system hierarchy remains unchanged, while files have
associated tags (an example is presented in 
\figref{img:hierarchy}).
\fig[scale=0.5]{figs/hierarchy.pdf}{img:hierarchy}{tagSys hierarchy}

The tagSys application architecture is presented in \figref{arch}. 
\fig[scale=0.5]{figs/archall.pdf}{arch}{tagSys architecture}
The \textit{CLI} is used to issue commands for tag manipulation. To make the transition 
to this new file system as user-friendly as possible, tags are manipulated through two types
of commands. The first type consists of file manipulation commands available on 
every UNIX-like operating system, such as \textit{ls, touch, mv, cp}, whose behaviour 
and implementation were slightly changed in tagSys in order to support tags. 
The second type refers to new tagSys commands that 
provide further tag-related operations: listing, removing and adding tags.  
The implementation changes are related to hooks created in the \textit{VFS} and are detailed in the
subsection \textit{VFS Hooks}. 
No implementation changes at filesystem level were required.

\subsection{Architectural decisions}
tagSys started as an idea to create a more user-friendly file system: remembering
tags is easier than remembering the name of a file or the place where it is stored.
At the same time, a pure tag file system might not offer a simple way of organizing data
in a hierarchical manner. Implementing tagSys as a new file system from 
scratch would therefore have been time consuming and would have required many changes to
the kernel and the user interface.
The other choice was to implement tagSys as hooks in
the VFS that store and retrieve the needed metadata.
Since the changes are made at VFS level, they add overhead even for file systems that
are used without tag support. A main concern of the implementation
was to keep this overhead as small as possible.
$\\$
From the beginning the focus was on the changes needed at VFS and file system 
level, not on how the mappings between files and
their associated tags are stored. These mappings are therefore stored in a file that is always kept in RAM.
$\\$
We define an \textit{entry point} as
a point in the file system hierarchy at which a
distinct tagSys begins, meaning that tags apply only to that part of the
file system; multiple entry points
can be declared for a file system. This keeps the changes necessary
for tagSys isolated, so that performance in the normal case
is not affected: the only difference from a vanilla
kernel is one string comparison for each entry point
defined.
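
As a minimal sketch of the entry-point check described above, the following function decides whether a path falls under a registered entry point using only string-prefix comparisons. The table \texttt{entry\_points}, its example paths and the function name \texttt{under\_entry\_point} are hypothetical and are not taken from the tagSys sources.
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <string.h>

/* Hypothetical table of registered entry points (absolute paths). */
static const char *entry_points[] = { "/home/bob/tagSys", "/srv/photos" };

/* Return 1 if path is an entry point or lies below one, 0 otherwise. */
int under_entry_point(const char *path)
{
    size_t count = sizeof(entry_points) / sizeof(entry_points[0]);

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(entry_points[i]);

        /* The prefix must end at a path component boundary. */
        if (strncmp(path, entry_points[i], len) == 0 &&
            (path[len] == '\0' || path[len] == '/'))
            return 1;
    }
    return 0;
}
\end{lstlisting}
Paths outside every entry point fail each comparison and would continue along the unmodified code path, which is how the cost for the normal case stays limited to the comparisons themselves.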
      
\subsection{Storage}
The mapping between a file and its associated tags is presented in \figref{storage}.
\fig[scale=0.4]{figs/storage.pdf}{storage}{Tag Storage}
Each file has a bitmask in which each bit represents a tag. If the bit is set to 1, then
the file has the tag with that id; otherwise it does not. The bitmask has a static size, so only
a limited number of tags is available. The limit defaults to 8192 tags, but it is configurable
through the kernel config file, with an impact on the memory and disk space used.

Each tag has a list of file pointers to the files that have that tag. This way information is not
duplicated and updates need to be made in only one place.
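
The per-file bitmask can be sketched as follows, assuming the default limit of 8192 tags. The structure \texttt{tagged\_file} and the three helper functions are illustrative names only; the layout actually used by tagSys may differ.
\begin{lstlisting}[language=C]
#include <limits.h>

#define TAGSYS_MAX_TAGS 8192  /* default limit quoted in the text */
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS      (TAGSYS_MAX_TAGS / BITS_PER_WORD)

/* Hypothetical per-file record: one bit per tag id. */
struct tagged_file {
    unsigned long mask[MASK_WORDS];
};

/* Mark the file as carrying the tag with the given id. */
void file_set_tag(struct tagged_file *f, unsigned int id)
{
    f->mask[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
}

/* Remove the tag with the given id from the file. */
void file_clear_tag(struct tagged_file *f, unsigned int id)
{
    f->mask[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}

/* Return 1 if the file carries the tag with the given id, 0 otherwise. */
int file_has_tag(const struct tagged_file *f, unsigned int id)
{
    return (int)((f->mask[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1UL);
}
\end{lstlisting}
With 8192 tags such a mask occupies 1024 bytes per tracked file, which illustrates why changing the limit affects the memory used. Callers are assumed to pass ids below \texttt{TAGSYS\_MAX\_TAGS}.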

\subsection{Tag commands}
The idea of tagging is to be able to attach any number of tags to a file.
Since the number of tags that will be associated with a file is not known beforehand,
we establish the convention that, whenever a command that involves tags is issued,
a filename is separated from its associated tags by \texttt{:}. Tags are also
separated from one another by \texttt{:}.
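
To illustrate the convention, the sketch below splits an argument of the form \texttt{filename:tag[:tag]*} in place into a filename and an array of tags. The function name \texttt{split\_tag\_arg} and its interface are assumptions made for this example and do not describe the actual tagSys parser.
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <string.h>

/*
 * Split "name:tag1:tag2" in place. After the call, arg holds only the
 * name (empty for a pure query such as ":tag1") and tags[] points at
 * the individual tags. Returns the number of tags found, or -1 if the
 * argument holds more than max_tags tags.
 */
int split_tag_arg(char *arg, char *tags[], int max_tags)
{
    int n = 0;
    char *sep = strchr(arg, ':');

    while (sep != NULL) {
        *sep = '\0';              /* terminate the previous field */
        if (n == max_tags)
            return -1;
        tags[n++] = sep + 1;      /* next field starts after ':' */
        sep = strchr(sep + 1, ':');
    }
    return n;
}
\end{lstlisting}
Because the separator is fixed, neither filenames nor tags may contain \texttt{:} under this scheme. Two consecutive separators produce an empty tag, which a caller would need to reject.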
$\\$
The behaviour of the \textit{ls} command was changed so that, when issued with 
an argument
beginning with \texttt{:}, it lists all the files that are tagged with the given words.
$\\$
An existing file can be assigned tags by issuing the \textit{tag} command
with the \textit{-a} parameter, followed by the filename and the tags to
attach to that file. Two constraints apply when
adding tags to a file. First, the tagSys implementation
limits the number of tags that can be assigned to a file to 256.
Second, the kernel implementation requires
that the total length of the filename, tags and separators
be less than or equal to MAX$\_$PATH$\_$LENGTH (256).
$\\$
Tags can be removed from a file with the \textit{tag -d filename:tag[:tag]*} command,
subject to the second constraint stated above.
$\\$
The \textit{tag -l filename} command outputs the list of tags currently associated
with a file.
$\\$
tagSys permits the creation of entry points in the file system, which indicate 
that tags may be used from that point down the hierarchy.
They were introduced to reduce the overhead for file systems where
the user does not need tags, limiting it to a couple of comparisons.
An entry point is created with the \textit{tag -c} command, which makes
the current working directory a new tagSys entry point.  
$\\$
The available commands, together with a short description of each, are listed in Table~\ref{table:commands}.

\begin{center}
  \begin{table}[htb]
  \begin{center}
  \begin{tabular}{ | l | l | l | l |}
    \hline
      \textbf{Command}&\textbf{Params} &\textbf{Args}&\textbf{Description}\\ \hline
        ls  & -  & :tag[:tag]*         & List files having the specified tags\\ \hline
        tag & -c & -                   & Create a new entry point for tagSys\\ \hline
        tag & -a & filename:tag[:tag]* & Add tags to file\\ \hline
        tag & -d & filename:tag[:tag]* & Remove tags from filename\\ \hline
        tag & -l & filename            & List all the tags for filename\\
    \hline
  \end{tabular}
  \end{center}
  \caption{tagSys implemented commands}
  \label{table:commands}
  \end{table}
\end{center}

\subsection{Implementation details}
\subsubsection[Metadata]{Metadata structure}$\\$
We keep two hashtables in memory. One of them holds the
tags in the system and the other holds the files. Entries
are added to these hashtables when tags are added to a file: if the
file is not yet tracked, we create a new entry for it in the file hashtable,
and the same is done for each new tag. Each entry in the tag hashtable
holds a list of file pointers, each pointing to a specific file. This
way we do not have to duplicate the information about each file.
To associate tags with a file, each file has a bitvector with
one bit for each possible tag.
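
The relation between tag entries and file entries can be sketched as below. The hashtables themselves are omitted, the bitvector is reduced to a single machine word, and the file list is a small fixed array; all names (\texttt{file\_entry}, \texttt{tag\_entry}, \texttt{tag\_attach}, \texttt{tag\_search}) are hypothetical and chosen only for this illustration.
\begin{lstlisting}[language=C]
#include <stddef.h>

#define MAX_FILES_PER_TAG 16  /* illustrative bound, not from tagSys */

struct file_entry {
    const char *path;
    unsigned long mask;       /* bit i set: file carries tag id i */
};

struct tag_entry {
    const char *name;
    unsigned int id;          /* assumed smaller than the word size */
    struct file_entry *files[MAX_FILES_PER_TAG];
    size_t nfiles;
};

/* Attach tag t to file f. Returns 0 on success, -1 if the list is full. */
int tag_attach(struct tag_entry *t, struct file_entry *f)
{
    if (f->mask & (1UL << t->id))
        return 0;                     /* already tagged */
    if (t->nfiles == MAX_FILES_PER_TAG)
        return -1;
    t->files[t->nfiles++] = f;        /* tag side: pointer to the file */
    f->mask |= 1UL << t->id;          /* file side: bit in the mask */
    return 0;
}

/*
 * Collect the files that carry every tag in query_mask by walking the
 * file list of one of the queried tags and filtering on the masks.
 */
size_t tag_search(const struct tag_entry *first, unsigned long query_mask,
                  struct file_entry *out[], size_t max_out)
{
    size_t n = 0;

    for (size_t i = 0; i < first->nfiles && n < max_out; i++)
        if ((first->files[i]->mask & query_mask) == query_mask)
            out[n++] = first->files[i];
    return n;
}
\end{lstlisting}
A query for several tags only needs to walk the list of one of them, since the mask of each candidate file answers whether the remaining tags are present.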


\subsubsection{VFS Hooks} $\\$
The VFS hooks allow tagSys to break out of the normal flow
of the kernel and perform certain checks in order to
determine whether a particular file should be treated as a taggable
file and, if necessary, perform the desired
changes. The hooks usually strip the tags from the filename
so that normal operations work as expected. For example, executing
\textit{ls dir:tag} would try to open the file ``dir:tag'', which does
not exist, so the call would fail unless the tags are
stripped first. After this, the hooks continue with the
operation-specific work.
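
The stripping step can be illustrated with the following userspace sketch, which copies the part of a path before the first \texttt{:} and reports where the tags begin. The function \texttt{strip\_tags} and its return convention are assumptions for this example; it also assumes that \texttt{:} does not occur elsewhere in the path.
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <string.h>

/*
 * Copy path without its ":tag..." suffix into clean (size bytes).
 * On success *tags points at the first tag inside path, or is NULL
 * when the path carries no tags.
 * Returns 1 if tags were found, 0 if not, -1 if clean is too small.
 */
int strip_tags(const char *path, char *clean, size_t size,
               const char **tags)
{
    const char *sep = strchr(path, ':');
    size_t len = sep ? (size_t)(sep - path) : strlen(path);

    if (len >= size)
        return -1;
    memcpy(clean, path, len);
    clean[len] = '\0';
    *tags = sep ? sep + 1 : NULL;
    return sep != NULL;
}
\end{lstlisting}
For the argument \texttt{dir:tag} the cleaned path is \texttt{dir}, which exists and can be opened normally, while the tag part remains available to the hook.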

To determine where we needed to insert our hooks, we ran \textit{strace}
on a command and checked which syscalls it makes. For example, for an \textit{ls dir}
command:
\begin{center}
	\begin{table}[htb]
	\begin{center}
	\begin{tabular}{ | p{5.5cm} | p{5.5cm} | }
	\hline
	\textbf{Syscall}&\textbf{Action}\\ \hline
	fstat64(path) & Remove tags from path\\ \hline
	open(path) & Remove tags from path and make association between $<$process, fd$>$ and tags\\ \hline
	getdents64(fd) & Get the tags associated with the fd and search the storage. Store the results in the dents structure passed from userspace.\\ \hline
	close(fd) & Remove the association between fd and tags\\ \hline
	\end{tabular}
	\end{center}
	\caption{Ls syscalls}
	\label{table:ls}
	\end{table}
\end{center}
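
The association between $<$process, fd$>$ and tags used by the \textit{open}, \textit{getdents64} and \textit{close} hooks can be sketched with a small fixed-size table. This is an illustration only: the names \texttt{assoc\_add}, \texttt{assoc\_tags} and \texttt{assoc\_remove}, the table size and the lack of locking are all assumptions, and the kernel implementation is not described at this level in the text.
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MAX_ASSOC    64   /* illustrative table size */
#define MAX_TAGS_LEN 256

struct fd_assoc {
    int used;
    pid_t pid;
    int fd;
    char tags[MAX_TAGS_LEN];
};

static struct fd_assoc assoc_table[MAX_ASSOC];

static struct fd_assoc *assoc_find(pid_t pid, int fd)
{
    for (size_t i = 0; i < MAX_ASSOC; i++)
        if (assoc_table[i].used && assoc_table[i].pid == pid &&
            assoc_table[i].fd == fd)
            return &assoc_table[i];
    return NULL;
}

/* open(): remember the tags requested for <pid, fd>. */
int assoc_add(pid_t pid, int fd, const char *tags)
{
    if (strlen(tags) >= MAX_TAGS_LEN)
        return -1;
    for (size_t i = 0; i < MAX_ASSOC; i++) {
        if (!assoc_table[i].used) {
            assoc_table[i].used = 1;
            assoc_table[i].pid = pid;
            assoc_table[i].fd = fd;
            strcpy(assoc_table[i].tags, tags);
            return 0;
        }
    }
    return -1;                        /* table full */
}

/* getdents64(): look up the tags for <pid, fd>, NULL if none. */
const char *assoc_tags(pid_t pid, int fd)
{
    struct fd_assoc *a = assoc_find(pid, fd);

    return a ? a->tags : NULL;
}

/* close(): drop the association for <pid, fd>. */
void assoc_remove(pid_t pid, int fd)
{
    struct fd_assoc *a = assoc_find(pid, fd);

    if (a)
        a->used = 0;
}
\end{lstlisting}
A descriptor opened without tags has no entry in the table, so the lookup returns \texttt{NULL} and the directory listing would proceed unchanged.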

\subsubsection{Userspace application} 
$\\$ The userspace application adds the tag layer to normal
file operations through the \textit{tag} command, an
ordinary user space program that the user can call
to add tags to a file, remove them, and
list the current tags. The application creates entry
points in the file system in a similar way. It
calls a specific API that forwards
the requests to the kernel. Adding and removing
tags follow the convention described above, using ``:'' as delimiter,
and tag listing uses the same format. The \textit{tag}
application uses the \textit{fcntl} system call to send the operation
type and the required arguments; new command types have been
defined for \textit{fcntl} for the tagSys operations. From the \textit{fcntl} syscall, other
kernel-level tag-specific functions are called for adding tags,
deleting tags, searching and creating entry points.
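
From the point of view of the user space application, such a request could look like the sketch below. The command numbers \texttt{TAGSYS\_F\_ADD}, \texttt{TAGSYS\_F\_DEL} and \texttt{TAGSYS\_F\_LIST} are invented for this example, since the values defined by tagSys are not given here, and the choice of passing the \texttt{filename:tag[:tag]*} string as the third argument is likewise an assumption.
\begin{lstlisting}[language=C]
#include <fcntl.h>

/* Hypothetical command numbers, not the ones defined by tagSys. */
#define TAGSYS_F_ADD  0x7a01
#define TAGSYS_F_DEL  0x7a02
#define TAGSYS_F_LIST 0x7a03

/*
 * Forward a tag request to the kernel through fcntl(). The descriptor
 * fd is assumed to be open on a directory inside a tagSys entry point.
 * On a kernel without tagSys support the command is unknown, so the
 * call is expected to fail and return -1.
 */
int tagsys_request(int fd, int cmd, const char *arg)
{
    return fcntl(fd, cmd, arg);
}
\end{lstlisting}
Reusing \textit{fcntl} keeps the user space side small: the \textit{tag} program only has to pick a command number and pass the argument string through.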



\section{Experimental results}
$\\$
Because of the way tagSys is implemented, there are two major testing scenarios. 
The first tests the modified UNIX commands, while the second 
tests the tag manipulation commands. A further scenario could cover 
performance testing in depth, but this is outside the scope of this article, as the main goal of tagSys 
is to show that a tag-based file system is implementable and more intuitive than a hierarchical one.
Testing consists of executing a series of commands, noting their output and comparing it with the expected results. 
\\
The test platform is Fedora 14 with a 2.6.35.10 kernel 
compiled with the tagSys changes.
At the moment \textit{ls} is the only UNIX command modified to allow 
the use of tags, and three tagSys commands (\textit{tag -c}, \textit{tag -a}, \textit{tag -d}) are fully 
implemented and tested. We started with \textit{ls} because it is the
most important command, the main use
of tags being to search by them, and we did not have enough time to implement
the others.

A first test scenario was to add tags to a file and browse for the file based on 
one or more of its associated tags.
\lstset{numbers=none,captionpos=b,frame=single,language=bash,caption=First test scenario,label=lst:tossimnet}
\begin{lstlisting}
$ cd ~/tagSys
$ ls 
tag_test  untagged
$ tag -c 
$ tag -a tag_test:soa:tag:fs
$ ls :soa
tag_test
$ ls :fs:soa:tag
tag_test
$ tag -d tag_test:soa
$ ls :fs:soa:tag
$
\end{lstlisting}
A second test scenario is to have several tagged files and to browse by various
combinations of their tags.
\lstset{numbers=none,captionpos=b,frame=single,language=bash,caption=Second test scenario,label=lst:secondtest}
\begin{lstlisting}
$ ls
tag_test  untagged
$ tag -a untagged:soa:fs:test
$ tag -a tag_test:second:soa:test
$ ls :soa:test
tag_test  untagged
\end{lstlisting}

We wanted to have little impact on the normal performance of the kernel,
which is why we introduced entry points. Because of them, each VFS syscall
only has an overhead of a few string comparisons. To check that performance
is still acceptable, we did a kernel build while running the tagSys-modified kernel.
The difference in build time was small, 10-20 seconds on average. Because
building the kernel stresses the file system heavily, we expect
performance in normal cases to remain good.

\section{Conclusion}
tagSys can be a significant step towards a differently organized file system for end users, due to its low overhead,
good performance and usability.

Changes in the core kernel code are hidden behind a kernel config option (CONFIG\_TAGFS), so they can be turned off.
Pushing the changes upstream could make them more useful and might lead to a better interface to the file system, but persistent storage is something that is usually not accepted in the vanilla kernel. 
  
$\\$
There are still a couple of common UNIX commands whose behaviour we want to 
modify so that they can be issued with tags. We would also like to add a tag-specific command for listing
the tags associated with a file. These are summarized in Table~\ref{table:future-work}.
\begin{center}
	\begin{table}[htb]
	\begin{center}
	\begin{tabular}{ | p{2.5cm} | p{6.5cm} | p{2.5cm} |}
	\hline
	\textbf{Command}&\textbf{Description}&\textbf{VFS function}\\ \hline
	 touch &Add tags to a file at creation time&do$\_$sys$\_$open\\ \hline
	 mv &The new file keeps tag information&rename\\ \hline
	 cp &The new file copies tag information from the old one&unlink\\ \hline
	 tag -l &tagSys new command for listing tags associated to a file&-\\
     \hline
	\end{tabular}
	\end{center}
	\caption{More tag-based commands}
	\label{table:future-work}
	\end{table}
\end{center}

\textit{Possible future work}

In a pure tag file system, the on-disk layout could be improved as follows.
Since files carry tags and there is no hierarchical structure,
blocks of files could be grouped on disk based on their tags, which could reduce disk fragmentation. 
Clustering tag data can give insight into how much space is required for files with a certain tag
and how accessible they should be to the user. If properly used, this could lower the external fragmentation
of the disk. However, more tests are needed on this point. 

Given that Linux implements Extended Attributes\cite{extattr} (also called xattrs), which are name/value pairs 
associated with files as an extension to the normal inode-based attributes, tags could be stored in xattrs
and easily displayed by graphical file browsers.

The current implementation, in which we add hooks to the kernel syscalls so that existing applications
can use tagSys, can be augmented with new syscalls or \textit{fcntl} options so that new applications can make
better use of the system.