Ceph: a scalable distributed storage system for Linux

Time: 15:45 - 16:30
Day: Wednesday 20 January 2010
Location: Renouf 1 (MFC)
Project: Ceph distributed storage system

Ceph is a scalable distributed storage system for Linux consisting of two main components. An object storage layer provides reliable, scalable, and high-performance parallel access to gigabytes to petabytes of data objects. A distributed file system is constructed on top of this object store, providing high-performance cache-coherent parallel access to a single shared file system namespace with POSIX semantics. This talk will focus, in turn, on both parts of the system: their architecture, implementation, and deployment. The intended audience is a mix of developers and system administrators.

The object store provides a generic, scalable cloud storage platform (much like Amazon S3) with advanced features like snapshots and distributed computation. The storage cluster is designed to be relatively self-managing: data replication, failure recovery, and data migration (during cluster expansion or contraction) are handled semi-autonomously by the storage nodes comprising the cluster. A well-known data distribution function allows clients to calculate object locations within the cluster without consulting any central directory or index servers, providing fast, direct parallel access to data.
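To illustrate how a client might compute an object's location with no directory lookup, here is a minimal sketch that uses rendezvous (highest-random-weight) hashing as a stand-in. It is not Ceph's actual distribution function; the node names, the `locate` helper, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
from typing import List, Sequence


def _score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random score for one (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}\0{node}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def locate(object_name: str, nodes: Sequence[str], replicas: int = 2) -> List[str]:
    """Return the nodes that should hold `object_name` (hypothetical helper).

    Every client that knows the same node list computes the same answer,
    so no central index has to be consulted.
    """
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replication level")
    ranked = sorted(nodes, key=lambda n: _score(object_name, n), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    cluster = ["osd0", "osd1", "osd2", "osd3"]  # made-up node names
    print(locate("image-0001", cluster, replicas=2))
```

One property worth noting in this stand-in: removing a node that does not hold a given object leaves that object's placement unchanged, so only the data on the departed node has to move.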

The store logically consists of some number of independent object pools, each providing an independent object namespace. Each pool has some associated level of (n-way) replication and placement constraints (e.g., affinity for a given class of storage nodes), which can be adjusted at any time. A simple computation infrastructure allows an administrator to dynamically load object "methods" into the cluster, extending the basic set of supported operations (read, write, truncate, remove, get/set xattr, etc.). For example, a large application hosting image content may load an image manipulation library, allowing applications to rotate, resize, or crop image objects on the storage nodes themselves without an over-the-net read/modify/write cycle.
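The pool and object-method ideas can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the `Pool` class, its method names, and the byte-reversal "method" (standing in for an image operation such as rotate) are hypothetical and do not reflect Ceph's real interfaces.

```python
from typing import Callable, Dict


class Pool:
    """Toy object pool: an independent namespace plus a replication level."""

    def __init__(self, name: str, replicas: int) -> None:
        self.name = name
        self.replicas = replicas  # n-way replication; adjustable at any time
        self.objects: Dict[str, bytes] = {}
        self.methods: Dict[str, Callable[[bytes], bytes]] = {}

    def write(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def read(self, key: str) -> bytes:
        return self.objects[key]

    def register_method(self, method_name: str, fn: Callable[[bytes], bytes]) -> None:
        """Load an extra object 'method' alongside the built-in operations."""
        self.methods[method_name] = fn

    def call(self, key: str, method_name: str) -> int:
        """Run a loaded method where the data lives; return the new size.

        Only the small result crosses the network in this model, instead of
        a full read/modify/write cycle by the client.
        """
        self.objects[key] = self.methods[method_name](self.objects[key])
        return len(self.objects[key])


if __name__ == "__main__":
    images = Pool("images", replicas=3)
    images.write("photo", b"abc")
    images.register_method("reverse", lambda data: data[::-1])
    images.call("photo", "reverse")
    print(images.read("photo"))
```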

The Ceph distributed file system brings a new level of scalability to Linux cluster file systems. Unlike conventional shared-disk file systems like GFS and OCFS2, Ceph utilizes a metadata server (MDS) cluster that mediates access to file data in the shared object store. The cluster essentially acts as a special-purpose distributed in-memory metadata cache, providing fast access to file system metadata, while managing cache coherence between clients mounting the file system.

The file system incorporates two interesting features. First, by maintaining "recursive accounting" information within the directory hierarchy, clients can trivially see how many files and how much data are contained in any directory (and its children) in the system. A recursive "mtime" similarly allows applications like backup software to quickly identify which portions of the hierarchy contain recent changes. Second, Ceph implements snapshots on arbitrary directories, without requiring the file system to be divided a priori into separate subvolumes. A simple interface (mkdir .snap/foo, rmdir .snap/foo) makes snapshots usable by individual, non-privileged users.
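The recursive accounting idea can be modelled in a few lines. The sketch below is an assumption-laden illustration rather than Ceph's implementation: the `Dir` class and the field names `rfiles`, `rbytes`, and `rmtime` are invented here, and updates are propagated eagerly to the root for simplicity.

```python
from typing import Dict, Iterator, Optional


class Dir:
    """Directory node carrying recursive totals for its whole subtree."""

    def __init__(self, name: str, parent: Optional["Dir"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Dir"] = {}
        self.rfiles = 0    # files in this directory and all descendants
        self.rbytes = 0    # bytes in this directory and all descendants
        self.rmtime = 0.0  # newest modification time anywhere below
        if parent is not None:
            parent.children[name] = self

    def add_file(self, size: int, mtime: float) -> None:
        """Record a new file and update every ancestor's totals."""
        node = self
        while node is not None:
            node.rfiles += 1
            node.rbytes += size
            node.rmtime = max(node.rmtime, mtime)
            node = node.parent


def changed_since(root: Dir, cutoff: float) -> Iterator[Dir]:
    """Yield directories with changes after `cutoff`, pruning cold subtrees."""
    if root.rmtime <= cutoff:
        return  # nothing below here is newer, so skip the whole subtree
    yield root
    for child in root.children.values():
        yield from changed_since(child, cutoff)


if __name__ == "__main__":
    root = Dir("/")
    home, tmp = Dir("home", root), Dir("tmp", root)
    home.add_file(size=100, mtime=10.0)
    tmp.add_file(size=50, mtime=20.0)
    print(root.rfiles, root.rbytes, [d.name for d in changed_since(root, 15.0)])
```

A backup tool built on this model would only descend into directories returned by `changed_since`, which is the saving the recursive "mtime" is meant to provide.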

Ceph is licensed under a combination of the GPL and LGPL (version 2), and its developers are actively working to merge the file system client into the mainline Linux kernel.

Sage Weil

Sage Weil designed Ceph as part of his PhD research in storage systems
at the University of California, Santa Cruz. Since graduating, he has
continued to refine the system with the goal of providing a stable
next generation distributed file system for Linux. Prior to his
graduate work, Sage helped found New Dream Network, the company behind
Dreamhost web hosting, which now supports a small team
of Ceph developers.