World's Worst File System - WWFS


The idea is to build a simple file staging area on top of GridSite for the OMII Application Hosting Environment (AHE) project (http://kato.mvc.mcc.ac.uk/rss-wiki/ApplicationHostingEnvironment). The key extension to GridSite is the addition of support for extensive metadata about the files. The code is written in Perl and runs as a CGI script under GridSite.

This is only a prototype but someone might find it useful.

WWFS was built according to the constraints of REST (http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm) defined in Roy Fielding's doctoral thesis. From the thesis: "Another conflict with the resource interface of REST occurs when software attempts to treat the Web as a distributed file system. Since file systems expose the implementation of their information, tools exist to "mirror" that information across to multiple sites as a means of load balancing and redistributing the content closer to users. However, they can do so only because files have a fixed set of semantics (a named sequence of bytes) that can be duplicated easily. In contrast, attempts to mirror the content of a Web server as files will fail because the resource interface does not always match the semantics of a file system, and because both data and metadata are included within, and significant to, the semantics of a representation."

WebDAV (http://www.ietf.org/rfc/rfc2518.txt) was not used because it was decided that it was not RESTful: it uses stateful interactions for locking, its URIs are not opaque (WebDAV uses a hierarchical namespace of URIs) and the introduction of the extra WebDAV verbs conflicts with the concept of a uniform interface. It is possible to argue that WebDAV is RESTful, but that is another story...

The features of the WWFS that make it a candidate for the world's worst file system are:

 No directories - allows URIs to be opaque
 No file locking
 No POSIX interface
 All access through HTTP (GET, POST, DELETE, HEAD, PUT)
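
Taken together these constraints mean a client talks to every WWFS resource with nothing but plain HTTP requests on opaque URIs. A rough sketch of what that looks like from Perl (the file URI below is made up for illustration, not a real WWFS path):

 # Illustrative only: the resource URI is hypothetical, not a real WWFS path.
 use LWP::UserAgent;
 use HTTP::Request;

 my $ua  = LWP::UserAgent->new;
 my $uri = 'https://example.org/cgi-bin/wwfs/a/File/42';   # one opaque file URI

 my $get  = $ua->get($uri);                                       # read the file
 my $head = $ua->head($uri);                                      # just the headers
 my $put  = $ua->request(HTTP::Request->new(PUT => $uri, [], 'new contents'));   # replace it
 my $del  = $ua->request(HTTP::Request->new(DELETE => $uri));     # remove it

 print join(' ', map { $_->code } $get, $head, $put, $del), "\n";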

Features that make it useful:

Extensive metadata support

MD5 checking support - any file written is MD5-checked after writing to catch errors.
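
The check itself is simple. A minimal sketch of the idea in Perl using the standard Digest::MD5 module (the routine name is invented; this is not the actual WWFS code):

 # Minimal sketch of the post-write check: re-read the file that was just
 # written and compare its MD5 digest with the digest of the bytes the client sent.
 use Digest::MD5 qw(md5_hex);

 sub write_verified {                 # invented name, not the real WWFS routine
     my ($path, $bytes) = @_;
     open my $out, '>', $path or return 0;
     binmode $out;
     print {$out} $bytes;
     close $out or return 0;

     open my $in, '<', $path or return 0;
     binmode $in;
     my $on_disk = Digest::MD5->new->addfile($in)->hexdigest;
     close $in;
     return $on_disk eq md5_hex($bytes);
 }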

Tags - allow you to create "Tags", unique URIs, to label a file, which is useful for searching. A file can have many Tags associated with it; for example, you could mark all the files for a particular job with a unique Tag. Tags are themselves resources, so you can get/set data about them.

CommonName - you can give your files common names, and many files can have the same name (there is no concept of a directory, so there are no name clashes).

Cache-Control - you can set the cache control for a file, which allows conditional GETs and makes HEAD useful.

Supports conditional GETs; HEAD can be used to check whether a file has changed.
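
From the client side, a conditional GET might be used like this (the URI is a placeholder, and this is just one way a client could exploit the feature):

 # Hypothetical client use of HEAD and a conditional GET; the URI is a placeholder.
 use LWP::UserAgent;

 my $ua  = LWP::UserAgent->new;
 my $uri = 'https://example.org/cgi-bin/wwfs/a/File/42';

 # Cheap change check: HEAD returns the headers without the body
 my $head          = $ua->head($uri);
 my $last_modified = $head->header('Last-Modified');

 # Conditional GET: expect 304 Not Modified if the file has not changed
 my $res = $ua->get($uri, 'If-Modified-Since' => $last_modified);
 if ($res->code == 304) {
     print "local copy is still current\n";
 } else {
     print "file changed, ", length($res->content), " bytes fetched\n";
 }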

All metadata is stored in a MySQL database, including the file location - files can be stored in places other than the Web server and can be moved around behind the scenes without clients being aware.
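
A rough sketch of that indirection on the server side, with invented table and column names (the real schema is in the database-structure file linked at the bottom of this page):

 # Rough sketch of the indirection; table and column names are invented.
 use DBI;

 my $dbh = DBI->connect('DBI:mysql:database=wwfs', 'wwfsuser', 'secret',
                        { RaiseError => 1 });

 my $resource_id = 42;   # the opaque id taken from the request URI

 # Clients only ever see the resource id; the physical location is a private
 # detail in the metadata table and can be changed without anyone noticing.
 my ($path) = $dbh->selectrow_array(
     'SELECT file_location FROM file_metadata WHERE resource_id = ?',
     undef, $resource_id);

 open my $fh, '<', $path or die "cannot open $path: $!";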

The database handles concurrency issues.

Use "stable writes" for PUTs to handle write failures - when a PUT is invoked a new file is created, and only if the new file is written correctly is the database updated. Database transactions are used in case of failure during the database update.
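
A minimal sketch of the stable-write idea, assuming an invented schema and storage directory (not the actual WWFS code):

 # Sketch of a stable write: write the new bytes to a fresh file, verify them,
 # and only then repoint the metadata inside a transaction, so a failure at any
 # step never corrupts the old copy.
 use DBI;
 use Digest::MD5 qw(md5_hex);
 use File::Temp qw(tempfile);

 sub stable_put {
     my ($dbh, $resource_id, $bytes) = @_;

     # 1. Write the content to a brand new file, never over the old one
     my ($out, $new_path) = tempfile(DIR => '/var/wwfs/store');   # invented directory
     binmode $out;
     print {$out} $bytes;
     close $out or return 0;

     # 2. Verify what reached the disk before touching the metadata
     open my $in, '<', $new_path or return 0;
     binmode $in;
     my $ok = Digest::MD5->new->addfile($in)->hexdigest eq md5_hex($bytes);
     close $in;
     return 0 unless $ok;

     # 3. Repoint the metadata at the new file inside a transaction
     $dbh->begin_work;
     eval {
         $dbh->do('UPDATE file_metadata SET file_location = ?, md5 = ? WHERE resource_id = ?',
                  undef, $new_path, md5_hex($bytes), $resource_id);
         $dbh->commit;
     };
     if ($@) { $dbh->rollback; return 0; }
     return 1;
 }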

Everything has an ACL (Access Control List), expressed in GACL (GridSite's access control language) - even the ACLs themselves. If an ACL is not set for a resource then the owner of the resource is the only person with access.

Use links to ACLs to set access control, i.e. you can create your favourite ACL and use the URI of that ACL to set the access control for lots of things (this adds a layer of indirection, which adds complexity, but seems to be worth it).
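
As an illustration, and assuming that access control is set by PUTting a GACL document to an ACL resource (the URI and the exact GACL syntax below are rough guesses rather than the verified WWFS interface):

 # Illustrative only: the ACL URI, and the idea that access control is set by
 # PUTting a GACL document to it, are assumptions; the GACL below follows the
 # rough shape of GridSite's format rather than a verified example.
 use LWP::UserAgent;
 use HTTP::Request;

 my $gacl = '<gacl version="0.0.1"><entry>'
          . '<person><dn>/C=UK/O=eScience/OU=Example/CN=some user</dn></person>'
          . '<allow><read/><write/></allow>'
          . '</entry></gacl>';

 my $ua  = LWP::UserAgent->new;
 my $req = HTTP::Request->new(PUT => 'https://example.org/cgi-bin/wwfs/a/ACL/7',
                              [ 'Content-Type' => 'text/xml' ], $gacl);
 print $ua->request($req)->status_line, "\n";

 # The same ACL URI can then be linked from the metadata of many resources to
 # give them all identical access control.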

Extensive use of XLink - this allows you to search for and navigate to files; you can select files by common name, last-modified date, creation date, Tag, file type, etc.

Supports user-defined metadata, i.e. you can create and add your own metadata; for example, for an output file you could attach a JPEG as metadata.

Logging information is provided for all file accesses.

You can chain files together - there is a place in the metadata for identifying a "previous version" of the file.

The ACLs are serialised into the database - making searching fast and simple.

TODO:

 Output file logs as ATOM/RSS
 Logging is only available for files at the moment; access to all resources is logged, but those logs are not yet exposed.
 Add support for CSS/XSLT/XHTML so browsers display the data nicely
 Add support for uploading files through browser
 Add support for XACML - use Accept header in request to decide which format to return
 Add support for conditional gets on ACLs
 Add RDF and Semantic Web technology
 Add support for remote ACLs, use caching, cache-control, HEAD and conditional get for performance
 Allow users to upload metadata about themselves.
 Add MD5 header support for everything not just files
 Collections - a user can create a file with a set of Xlinks in it to create a collection, should formal support be added?
 Create client test suite.
 Add support for authentication methods other than X.509
 There is far too much repeated code - it should be refactored, especially the authorisation step.

If you have a UK e-Science certificate you can upload a file to the service at https://garfield.mvc.mcc.ac.uk/cgi-bin/zzcgumk/a/File: simply POST a file to that address over HTTP (you will need to use mutual authentication). You should get back XML containing XLinks that allow you to navigate around the WWFS. When you POST a file for the first time a resource is created that holds information about all your files; from this point you can access all the files you have created. When you do a GET on a file an extra HTTP header, Meta-Location, is added; it is set to a URI identifying the resource that holds the metadata for the file - so if someone sends you a pointer to a file in the WWFS it is possible to access the metadata for that file.
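
A sketch of such a client in Perl; the certificate paths and the file URI in the final GET are placeholders, and ssl_opts assumes a reasonably recent LWP::UserAgent with IO::Socket::SSL underneath:

 # Sketch of a client upload; the certificate paths and the file URI in the
 # final GET are placeholders, and ssl_opts needs a reasonably recent
 # LWP::UserAgent with IO::Socket::SSL underneath.
 use LWP::UserAgent;

 my $ua = LWP::UserAgent->new;
 $ua->ssl_opts(
     SSL_cert_file => "$ENV{HOME}/.globus/usercert.pem",
     SSL_key_file  => "$ENV{HOME}/.globus/userkey.pem",
 );

 # POST the file contents to the File resource
 open my $fh, '<', 'results.dat' or die $!;
 binmode $fh;
 my $bytes = do { local $/; <$fh> };
 my $post  = $ua->post('https://garfield.mvc.mcc.ac.uk/cgi-bin/zzcgumk/a/File',
                       Content_Type => 'application/octet-stream',
                       Content      => $bytes);
 print $post->status_line, "\n", $post->decoded_content, "\n";

 # A later GET on one of your file URIs carries the Meta-Location header
 my $get = $ua->get('https://garfield.mvc.mcc.ac.uk/cgi-bin/zzcgumk/a/File/42');
 print 'metadata lives at: ', $get->header('Meta-Location'), "\n";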

Interesting Issues

GACL did not have a "delete" permission tag; this is an example of the mismatch that Fielding pointed out when comparing a file system to the Web. There is no delete permission in a UNIX file system: if you have write access to a directory you can delete files in it.

When you POST a file, should the Location HTTP header in the response give the location of the file or the location of the metadata? Or should the Content-Location header be used? The WWFS returns the location of the metadata; part of the metadata is the location of the file.

The code is written as a Perl CGI script - mod_perl was not used; the mod_perl books reckon it is not worthwhile for file upload/download services, as the bottleneck is uploading/downloading the files rather than parsing the script.

The code is available at http://garfield.mvc.mcc.ac.uk/cgi-bin/zzcgumk/a/File/118 and the database structure is available at http://garfield.mvc.mcc.ac.uk/cgi-bin/zzcgumk/a/File/117 - when you do a GET on the files you can check the HTTP Headers for the Meta-Location ;-) --Mark mc keown 16:55, 4 Oct 2005 (BST)