Lustre

Lustre is a scalable, distributed cluster file system that
  • is open-source software licensed under the GPL
  • is POSIX compliant
  • is in use as the high-performance file system on most of the top 30 supercomputers world-wide (e.g. Jaguar at Oak Ridge National Lab)
  • has been developed mainly by the company Whamcloud since 2010/2011

Lustre has become the basis for storing experimental data at GSI. It is operated by the HPC group.

Status of the GSI installation

  • Multi-petabyte net capacity on more than 130 OSSs, in production since spring 2008.
  • Running version 1.8.4 on Debian Lenny since 2010-09-17.
  • Available on the BatchFarm and dedicated InteractiveMachines (see also: FileSystems).
  • Cost: 350 Euro per terabyte (as of 2011-12).

Documentation

There is no backup of this filespace!
  • The risk of data loss is reduced by storing the data on RAID arrays. However, this does not guarantee the safety of the data: it does not protect against RAID controller errors, silent data corruption, or other sources of data loss.
  • The metadata is copied out about every two weeks. This may help in case of total destruction of the metadata target (which would render all files on Lustre useless). Restoring the metadata from such a copy has never been demonstrated at production scale, and any file operations more recent than the copy would be lost.

Lustre on GSI machines

Lustre is mounted on
  • GSI batch farm nodes lxb*
  • subgroups of the GSI interactive machines:
    • lennylust32.gsi.de 32 bit Debian "Lenny"
    • lennylust64.gsi.de 64 bit Debian "Lenny"
    • squeezelust64.gsi.de 64 bit Debian "Squeeze"
  • some of the restricted interactive machines lxir*

On these Lustre clients, the mountpoint is /lustre.

Lustre usage

Since Lustre is POSIX compliant, you can use it like any other Linux file system. Under the /lustre mountpoint there is a top-level directory for each group that has bought file space on Lustre.
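
Because access is plain POSIX I/O, no special API is required. A minimal Python sketch, assuming a hypothetical group directory /lustre/mygroup (replace it with your group's actual top-level directory):

  import os

  # Hypothetical group directory: replace with your group's top-level directory.
  workdir = "/lustre/mygroup/analysis"
  if not os.path.isdir(workdir):
      os.makedirs(workdir)

  # Ordinary POSIX I/O works exactly as on any local Linux file system.
  path = os.path.join(workdir, "example.txt")
  with open(path, "w") as f:
      f.write("hello from a Lustre client\n")

  with open(path) as f:
      print(f.read())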

While Lustre excels at reading and writing massively parallel data streams, metadata operations are much more costly:
  • Any ls, find, df, or du command, and generally any command involving a stat() on a file or directory, goes through the MDS (metadata server).
  • Metadata performance drops significantly when a large number of such operations are carried out simultaneously: Lustre will feel slow if you run find in several hundred batch jobs at the same time.
  • A corresponding caveat applies to directory size: while Lustre developers claim to have tested directories containing 10^8 files, listing a directory with just 10000 files will already feel slow.
  • Avoid file locks: do not read from and write to the same file in many concurrent processes; the file-locking overhead will seriously slow down the whole system.
  • Avoid concurrent parallel reads and writes on the same file under any circumstances (the first sketch after this list shows a lock-friendly alternative).
  • You are therefore strongly discouraged from using find or similar commands in a massively parallel manner to retrieve information about your or your group's files (the second sketch after this list shows a metadata-friendly alternative).
  • Refrain from recursing through deeply nested directory trees to gather information about file names, sizes, locations, etc.
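
The usual way around lock contention is to give every batch job its own output file and merge the pieces afterwards, instead of having many jobs append to one shared file. A minimal sketch, assuming a hypothetical output directory and a job index handed in by the batch system:

  import os
  import sys

  # Hypothetical output directory and job index (e.g. passed in by the batch system).
  outdir = "/lustre/mygroup/results"
  job_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0

  # One output file per job: no two processes ever write to the same file,
  # so the jobs never compete for Lustre file locks.
  outfile = os.path.join(outdir, "part-%05d.txt" % job_id)
  with open(outfile, "w") as f:
      for i in range(1000):
          f.write("job %d record %d\n" % (job_id, i))

  # Once all jobs have finished, the pieces can be merged in a single
  # post-processing step, e.g. 'cat part-*.txt > merged.txt'.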
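
When you only need file names, avoid tools and calls that stat every entry (ls -l, recursive find, os.stat in a loop). A minimal Python sketch, assuming a hypothetical data directory, that builds the name list without per-entry stat() calls and then stats only the few files that are actually needed:

  import os

  # Hypothetical data directory; replace with one of your group's directories.
  datadir = "/lustre/mygroup/run42"

  # os.scandir() gets the entry names (and usually the file types) from the
  # directory listing itself, so no additional stat() per entry -- and thus
  # no extra metadata request -- is needed just to build the list.
  wanted = []
  for entry in os.scandir(datadir):
      if entry.name.endswith(".root"):
          wanted.append(entry.name)

  # Stat only the few entries that are actually of interest,
  # instead of every file in the tree.
  for name in wanted[:10]:
      size = os.stat(os.path.join(datadir, name)).st_size
      print(name, size)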

Architecture/Glossary

Lustre uses the "classic" architecture of a clustered file system:

Meta Data Server (MDS)
Contains the metadata of the Lustre file system (file names and layout, permissions, ownership, etc.), stored in the metadata target (MDT). Each Lustre file system has exactly one MDS/MDT, ideally implemented as a fault-tolerant cluster.
Object Storage Servers (OSS)
The data itself is distributed over multiple object storage targets (OSTs) located on object storage servers (OSS). Each OSS may serve multiple OSTs.
Lustre Clients
The clients mount the Lustre file-system.
Configuration Management Server (MGS)
Central point of contact providing configuration information about Lustre file systems. This is the entry point for Lustre communication (e.g. for a client intending to mount a Lustre file system). The MGS is usually co-located with the MDT on one machine.
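
How a file's data is spread over the OSTs can be inspected from any client with the standard lfs getstripe command. A small Python sketch, assuming a hypothetical file path, that simply wraps the command and prints its layout report (stripe count, stripe size and the OSTs holding the file's objects):

  import subprocess

  # Hypothetical file on Lustre; replace with one of your own files.
  path = "/lustre/mygroup/run42/data_0001.root"

  # 'lfs getstripe' queries the file's layout and reports, among other
  # things, over how many OSTs (and which ones) the file's objects are
  # distributed.
  output = subprocess.check_output(["lfs", "getstripe", path])
  print(output.decode())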

-- ThomasRoth, ChristopherHuhn - 2010 -- WalterSchoen - 15 Jun 2012