[SGVLUG] Cluster Filesystems

Sun Jan 8 01:39:25 PST 2006

On 1/7/06, Max Clark <max at clarksys.com> wrote:
> A recent failure of a customer's NetApp has again left me looking for a
> better approach to network storage - specifically in the redundancy and
> replication model. For the sites that can afford it we recommend and
> sell the Isilon systems which give what I am looking for... multiple
> nodes striped together to provide a distributed pool of storage that can
> survive a node failure.

I was pretty sure that NetApp had a way to approximate a functional
high availability system (something where one node would take over
the IP of a failed node). It isn't perfect, but it is functional.

> Ideally I'd love to run the Google File System
> (http://labs.google.com/papers/gfs.html) but I don't think they would
> give me a copy.

The Google File System would probably not work too well for you
anyway. It isn't a proper POSIX file system (really, it's an API for
managing data). It's optimised for a specific problem domain. It makes
assumptions that are unlikely to be true in the general case (perhaps
true in your case), like that files are mostly quite large, that one
doesn't ever need to write to a file with anything other than an
append, etc.

That said, if you are looking for something like it, you can look at
the code in the nutch project:

http://lucene.apache.org/nutch/

They have implemented their own data management system, which is based
on principles similar to those of the Google File System.

> Which leaves me with AFS and CODA. Can anyone give me
> real world examples/tips/tricks with these systems?

Don't use CODA for anything serious. AFS is a nice file system, but
it's not really a cluster filesystem. It does function better (mildly)
in the event of a server failure, but it is really just a network
filesystem.

> I would like to be able to take a bunch of 1U single CPU machines with
> dual 250GB-500GB hard drives and cluster them together as a single NAS
> system supporting NFS/CIFS clients. I figure I should be able to get
> 0.2TB of usable protected storage into a node for ~$800/ea, this would
> mean $5,600 for 1TB of protected storage (assuming parity and n+1).
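
As a sanity check on the quoted numbers, here is a rough Python sketch
of the arithmetic. The per-node price and usable capacity come from the
quote; reading "parity and n+1" as two nodes' worth of overhead (one
for parity, one spare) is my assumption, and the function name is made
up for illustration.

```python
NODE_COST_USD = 800    # quoted price per node
NODE_USABLE_GB = 200   # quoted 0.2TB of usable storage per node


def cluster_estimate(nodes, overhead_nodes=2):
    """Return (total cost in USD, protected capacity in GB).

    overhead_nodes is an assumption: one node's worth of parity
    plus one spare node for n+1.
    """
    if nodes <= overhead_nodes:
        raise ValueError("need more nodes than the parity/spare overhead")
    cost = nodes * NODE_COST_USD
    capacity_gb = (nodes - overhead_nodes) * NODE_USABLE_GB
    return cost, capacity_gb


# Seven nodes is the count implied by $5,600 at $800 each.
print(cluster_estimate(7))
```

Under that reading, seven nodes line up with the quoted $5,600 for
1TB of protected storage.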

May I ask why you want to use multiple machines if you're still going
to present an NFS/CIFS interface? In general, clustered filesystems
really only make sense if the clients access them via their native
interface.

If you think about it, a single 4U machine with a nice RAID storage
system would do the job. Heck, with SATA drives you can actually get
that kind of storage out of a 1U (4 400GB drives with RAID-5 and
you've got 1.2TB of storage) and at a bargain basement price (although
without the same kind of high transaction rate performance you'd
expect from higher end drives). While not a super high availability
system, it'd have as good availability as what you are envisioning.
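
For anyone who wants to check the RAID-5 figure, a minimal sketch
follows. It assumes equal-sized drives and the usual RAID-5 rule that
one drive's worth of space goes to parity; the function name is just
for illustration.

```python
def raid5_usable_gb(drives, drive_gb):
    """Usable capacity of a RAID-5 set of equal-sized drives, in GB.

    RAID-5 spreads one drive's worth of parity across the set, so
    the usable space is (drives - 1) * drive_gb.
    """
    if drives < 3:
        raise ValueError("RAID-5 needs at least three drives")
    return (drives - 1) * drive_gb


# Four 400GB SATA drives, as in the 1U example above.
print(raid5_usable_gb(4, 400))  # 1200 GB, i.e. 1.2TB
```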

If you are doing NFS/CIFS, you just aren't going to get the kind of
redundancy you are talking about. If a client is talking to an
NFS/CIFS server when it dies, there is going to be a service
interruption (although particularly with UDP NFS you can do some
clever things to provide a fairly smooth transition). Probably the
simplest way to build what you describe is to have a designated master
which serves an NFS/CIFS interface and then use Linux's network block
device to RAID together the drives on all the other machines.

> Thoughts and opinions would be very welcome.

You probably are looking for a clustered file system. The ones that
come to mind immediately are Lustre, SGI's CXFS, Red Hat's GFS, OCFS,
PVFS, and PVFS2 (there are others).

We are experimenting with Lustre at Yahoo Research, and I can say that
the early results show just amazing performance, although you do need
a really nice switch to sustain it. The down side of Lustre is that it
only supports Linux clients. CXFS has a fairly broad set of client
drivers, but I don't know of a Windows client. Same for GFS really. I
think only PVFS has one; maybe OCFS has one too, but I never looked. PVFS
and PVFS2 are more geared towards massively parallel computer systems
(to a certain extent all of the ones I mentioned are, but PVFS and
PVFS2 are exceptional), so unless you are working on that you are
probably better off with a more general purpose clustered filesystem.

--
Chris