MaverickClusterFilesystems

Summary

This specification defines the steps deemed valuable and necessary to improve support for clustered file systems, especially in the cloud, for the Maverick release.

Release Note

Support for GlusterFS has been added to the cloud-init tool to make deploying GlusterFS in the cloud simpler.

The Ceph clustered file system is available as a technology preview via the ceph package.

Rationale

In order to support our users building highly available, scalable, distributed applications in both traditional and cloud-computing environments, we should foster the use of best-of-breed clustered filesystems. GlusterFS has been researched and appears to be stable and highly performant. Ceph looks to be even more exciting for high-performance applications and should take advantage of BTRFS's natural copy-on-write abilities.

User stories

As a web application developer, I want to be able to use scalable storage for large data sets easily. I create cloud images with cloud-init, specifying gluster-client and a configuration for finding the servers, and when I deploy these images they automatically have access to the clustered filesystem.

As a web application developer, I want to test my application on top of Ceph to see if it will be a viable option for the future. I install ceph, set up a small ceph cluster, and then remove it when I am done.

Assumptions

  • Ceph requires BTRFS to be useful.
  • Gluster only requires a filesystem with extended attribute support (ext4, our default, qualifies, as do most modern Linux filesystems); see the quick check below.
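A quick way to confirm extended attribute support on a candidate backend filesystem (the path is just an example; both tools come from the attr package):

    touch /srv/gluster/xattr-test
    setfattr -n user.test -v works /srv/gluster/xattr-test
    getfattr -n user.test /srv/gluster/xattr-test    # should print user.test="works"
    rm /srv/gluster/xattr-test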

Design

Gluster

Lucid has excellent gluster server and client support in Universe, as synced from Debian. These packages should be kept current with the latest release of Gluster.
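A quick way to compare what the archive carries against upstream (package names assumed to follow the current Debian split):

    apt-cache policy glusterfs-server glusterfs-client   # archive versions
    # compare against the latest release announced on http://www.gluster.org/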

cloud-init

The ability to deploy images with gluster mounts or exports should be added to cloud-init.
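A rough idea of what this could look like from the user's side. This is purely illustrative: no glusterfs cloud-config key exists today, and the key names below are placeholders for whatever the implementation settles on.

    #cloud-config
    # hypothetical sketch only - no "glusterfs" cloud-config stanza exists yet;
    # these keys are placeholders for the behaviour this blueprint proposes
    glusterfs:
      servers:
        - server1.example.com
        - server2.example.com
      volume: testvol
      mountpoint: /mnt/gluster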

Ceph

Ceph is still marked as experimental, and in fact the user space tools warn the user of that fact at every opportunity. However, it makes good sense to help people try it out now that it is in the upstream kernel.

Kernel

Ceph was merged into the kernel with the 2.6.34 release, which is currently the kernel of choice for Maverick.

The kernel team has committed to building ceph support as a module for Maverick.
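Assuming the module ships as planned, verifying it on a Maverick system should amount to something like:

    sudo modprobe ceph                 # load the ceph filesystem module
    grep ceph /proc/filesystems        # "ceph" should now be listed
    modinfo ceph | head -n 5           # shows the module description and version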

Upstream Debian Packages

The Ceph project already produces Debian packages. These packages should be reviewed and uploaded into Universe (or possibly directly into Debian?)

http://ceph.newdream.net/wiki/Debian
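If the review goes well, trying the upstream packages ahead of a Universe upload could look roughly like this; the repository line is a guess at the layout and must be verified against the wiki page above before use:

    # hypothetical sources.list entry - confirm the exact URL and suite upstream
    echo "deb http://ceph.newdream.net/debian/ lucid main" | \
        sudo tee /etc/apt/sources.list.d/ceph.list
    sudo apt-get update
    sudo apt-get install ceph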

Implementation

Please see the whiteboard of the server-maverick-cloud-gluster blueprint, under WorkItems.

Test/Demo Plan

Install gluster servers

  • Configure glusterfsd to export one volume using the replicate (AFR) translator
  • Set up both servers to serve the same volume files (see the setup sketch after this list)
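One possible way to set this up with the GlusterFS 3.0-era tools in Universe (a sketch; command options and generated file names should be checked against the packaged version):

    # run once, on either server: generates matching server and client volfiles
    # for a two-node replicated ("raid 1") volume named testvol
    glusterfs-volgen --name testvol --raid 1 \
        server1:/export/testvol server2:/export/testvol
    # copy the generated *-export.vol to /etc/glusterfs/glusterfsd.vol on each server,
    # then restart the server daemon there:
    sudo service glusterfs-server restart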

Test gluster clients

  • Test mounting gluster under /mnt
  • Test mounting /home on gluster (example mount commands follow this list)
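Example mount commands for the client tests, assuming the client volfile generated above was installed as /etc/glusterfs/testvol-tcp.vol (the name may differ):

    sudo mount -t glusterfs /etc/glusterfs/testvol-tcp.vol /mnt
    df -h /mnt                          # sanity check: the replicated volume should appear
    sudo umount /mnt
    sudo mount -t glusterfs /etc/glusterfs/testvol-tcp.vol /home   # for the /home test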

Test fault tolerance

  • Test removing one server during reads
  • Test removing one server during writes
  • Test re-inserting the failed server (a possible failure-injection sketch follows this list)
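A possible failure-injection sketch for these tests (paths and sizes are arbitrary examples):

    dd if=/dev/zero of=/mnt/faulttest bs=1M count=2048 &   # long-running write on the client
    # on one server, simulate a failure mid-transfer:
    sudo service glusterfs-server stop
    # later, bring it back; in this GlusterFS generation self-heal is triggered on read
    sudo service glusterfs-server start
    ls -lR /mnt > /dev/null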

Benchmark gluster vs. NFS

  • Set up NFS on one of the gluster servers
  • Run bonnie++ on 1 client at a time
  • Run bonnie++ on 3 clients at a time (see the example invocation below)
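An example bonnie++ invocation, run identically against the gluster mount and the NFS mount (mount points, user, and sizes are illustrative):

    bonnie++ -d /mnt/gluster -u ubuntu -s 4096 -n 64 > gluster-1client.txt
    bonnie++ -d /mnt/nfs -u ubuntu -s 4096 -n 64 > nfs-1client.txt
    # for the 3-client run, start the same command on three clients simultaneously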

Post a blog entry to ubuntuserver.wordpress.com with the test results.

Unresolved issues

gluster NFS support

Gluster includes support for mounting volumes via NFS. This may prove useful for legacy integration, but it is being left out of this discussion.

Gluster Storage Platform (GlusterSP, Glusterweb)

glusterweb is only available in release form on gluster.org as a .src.rpm, which is fairly simple to extract but would require special handling for users wanting to update the package.
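For reference, unpacking the .src.rpm by hand only needs the standard rpm2cpio and cpio tools (the file name below is illustrative):

    rpm2cpio glusterweb-X.Y.src.rpm | cpio -idmv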

According to a developer upstream here:

http://www.mail-archive.com/gluster-devel@nongnu.org/msg06924.html

These components of gluster are not ready for packaging. Upon inspection, it is clear that they are designed to be used only on a dedicated management node. This still might be useful in cloud environments, where the management node could even be spawned only when needed.

Setting up such an image is beyond the scope of this blueprint. Some time should be devoted to contacting the GlusterFS community to prepare for possibly packaging these components, or to contributing upstream to get them into a deployable state.

BoF agenda and discussion

Integration options:

  1. package in Universe, ensure it stays up to date (ceph)
  2. support in images (gluster)

Cloud file systems (need application support)

mogile fs (more of a datastore - object store)

  • need packaging - base available from digg (http://mirrors.digg.com)
    • check what else they packaged
  • backend is Perl -> cross-language support (libmogilefs) is limited
  • anagram: OMG Files!
  • requires a metadata db (single point of failure -> HA required)
    • uses Perl DBI -> any DBI-supported SQL db could work
  • use cases: S3-like - providing files over HTTP
    • lots and lots of little files; not so good for big files; store media artifacts; CDN in a box
    • users: digg, livejournal
  • needs application awareness (HTTP GET / PUT)
  • bottleneck: Tracker (database bound)

hdfs (hadoop filesystem)

Sheepdog Project

Cluster filesystems (can be mounted like a normal filesystem; existing applications can use them unmodified)

gluster fs

  • in universe since intrepid+.
  • uses existing FS
    • 'cross network raid'
  • open web UI for admin
  • easy to configure
  • mountable through fuse
  • replication support (configurable)
  • performance?
    • rackspace testing it now, eta few months
    • self healing by checking every server on read
  • central server?
    • not needed, p2p based on hashing
    • consistent hashing to work out where things go
    • additions/changes will need to be communicated to the client
      • this includes machines going away/coming back
  • Use cases / naming:
    • Glusterfs.com vs .org? GlusterFS is the filesystem; Gluster (the company) offers a storage platform (GlusterFS + GUI tooling), which they call Gluster.
  • Maverick options:
    • cloud image support:
      • client: cloud-config options naming the gluster servers, to configure an instance as a gluster client
      • server: adding a storage node is tricky (dynamism is being worked on); the backend store needs a filesystem with extended attributes (e.g. EBS)

ceph (+btrfs)

  • Use cases:
    • S3 interface.
  • merged in kernel, will be in maverick
  • Debian packages maintained/available from upstream website.
  • experimental -> can't be only option?

  • http://ceph.newdream.net/wiki/Installing_on_Debian

  • Roadmap: "We hope to have the system usable (for non-critical applications) by the end of 2009 in a single-mds configuration. Even then, however, we would not recommend going without backups for any important data, or deploying in situations where availability is critical."

pvfs2 (martinbogo)

xtreemfs

  • (similar in function to Tahoe)
    • uses FIPS 140-2 approved transport
    • parallel I/O (read/write)
    • replication / striping / POSIX interfaces/semantics
    • open & packaged by volunteers for Debian

gfs2, ocfs2. Defer to ha-cluster-stack blueprint.

  • packaging state?
    • gfs1 is gone, gfs2 and ocfs2 are in main
    • gfs2 will be built with support for pacemaker

tahoe-lafs

  • Use case:
    • - Client outside of the cloud storing data into the cloud (securely, and redundantly).
  • http://en.wikipedia.org/wiki/Tahoe_Least-Authority_Filesystem

  • http://allmydata.org/trac/tahoe-lafs

  • in Ubuntu Universe, current version: 1.6.1, upstream: 1.6.1
  • Tahoe, the Least Authority File System, is a distributed filesystem that features high reliability, strong security properties, and a fine-grained sharing model. Files are encrypted, signed, erasure-coded, then distributed over multiple servers, such that any (configurable) subset of the servers will be sufficient to recover the data. The default 3-of-10 configuration tolerates up to 7 server failures before data becomes unrecoverable.
  • Tahoe offers "provider-independent security": the confidentiality and integrity of your data do not depend upon the behavior of the servers. The use of erasure-coding means that reliability and availability depend only upon a subset of the servers.
  • Tahoe files are accessed through a RESTful web API, a human-oriented web server interface, and CLI tools.
    • working on a FUSE plugin (performance?)
  • http://en.wikipedia.org/wiki/Redundant_Array_of_Inexpensive_Nodes RAIN


CategorySpec
