Lustre file systems at GSI

File system               State
Nyx                       PRODUCTION
Lustre (codename Hebe)

Data safety

Warning! Lustre storage has no backup!

Risk of data loss is reduced by storing data on RAID arrays. However, this does not guarantee the safety of data and it does not protect against errors of RAID controllers, silent data corruption and other sources of data loss.

The namespace metadata is copied at regular intervals. This may help in case of total destruction of the MDS (which may render all files on Lustre useless). Successful recovery of metadata at production scale has not been demonstrated. Any file operations more recent than the last backup copy of the namespace metadata would be lost.

Using Lustre

Lustre is a scalable and efficient file system. Nevertheless, since our Lustre installations manage millions of files and petabytes of data, certain common practices for handling I/O on smaller network file systems do not apply to Lustre. To understand these boundary conditions, users have to understand how data access is established.

Why should you care?

It would be much more convenient if it were possible to hide all the particularities of a highly complex system like Lustre from users. However, application performance and, to some degree, Lustre's serviceability depend on a basic understanding of the Lustre architecture. Simple changes to your application code or runtime environment may have a big impact on the reliability and performance of Lustre file systems.

The difficulty of implementing parallel I/O for scientists stems from the difference between data representation from the scientific point of view (particles, cells, etc.) and the conditions of physical data storage. Furthermore, the software layers (like Lustre) between applications and physical devices are complex and varied. Nevertheless, the simple insights described in "Best Practices" here will optimize application performance with a moderate amount of work.
Keep in mind that you can only compute the data as fast as you can move it.

How does Lustre I/O work?

Clients (jobs running on a compute node) create, read and write files by first sending a request to the MDS. Using its local metadata, the MDS determines the location and structural distribution of a file. Second, the MDS redirects the client to the OSS managing the OST that stores the requested file objects. Once a client has established a network connection to the target OSS, the MDS is no longer involved in the I/O process. Third, the actual data access of clients goes directly via the OSS(s) to the OST(s), which perform operations like file locking, disk allocation, and so forth.

Where are the limits?

Since Lustre enforces coherence when multiple clients access the same file at the same time, distributed file lock management is required to guarantee consistency for all clients. A Lustre system is limited in the number of operations it can perform per second. Therefore, it is strongly recommended to keep the number of parallel file-open and file-lock operations as small as possible to reduce contention.

Best Practices

Lustre is a system shared by all users of the compute clusters. Due to the architecture of Lustre, users have a significant impact on the overall performance depending on how they access data (reading and writing). Optimizing the I/O performance of each application that uses distributed storage will decrease the overall load on Lustre and hence improve the user experience for everyone. The following is some general advice for I/O on Lustre.

Keep in mind that sub-optimal access on Lustre multiplies with the number of jobs (concurrent processes) submitted to the compute clusters. Many of the topics discussed further down will unfold with severity depending on the scale of cluster applications.

Avoid flooding the MDS

Commands used to investigate file metadata like ls, find, du, or df can flood the MDS with huge numbers of requests, especially if they descend into a deep directory tree structure with many files.

Metadata such as ownership or permissions is stored on the MDS, whereas the size of a file is only available from the respective OST. For example, ls -lR issues a request to the MDS and to an OST for each file or directory. Similarly, recursive scanning of directory trees with find is very expensive and takes a long time to complete.

As far as possible, avoid searching for input/output files in jobs. If you need to check for the existence of a file with ls, for example, omit unnecessary command options that read irrelevant metadata (like -l or --color). If you do not need sorted output from ls, use the option -U, which improves the response time for listing.
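As a minimal sketch of this advice for jobs driven by a Python script: check a known path directly instead of searching for it, and read only the entry names when a listing is needed. The function names are hypothetical. os.scandir returns entries in directory order (unsorted, comparable to ls -U), and reading entry.name does not request size, owner, or timestamps.

```python
import os


def file_exists(path):
    """Check one known path directly instead of searching a directory tree."""
    return os.path.exists(path)


def list_names(directory):
    """Return entry names in directory order (unsorted, comparable to `ls -U`).

    Only the names are read; no size, owner, or timestamp is requested.
    """
    with os.scandir(directory) as entries:
        return [entry.name for entry in entries]
```

If a job needs several files from the same directory, calling list_names once and testing membership in the result is likely cheaper than one lookup per candidate name.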

Likewise, if you use rsync to move data between Lustre and execution nodes, make sure to copy only the files you actually need and do not use --exclude PATTERN options. In general, avoid wildcards in commands like tar or rm when they match huge lists of files. For example, rm -rf /path/to/files/* on a directory with millions of files may never finish, and the expansion of the wildcard * alone has a strongly negative impact on the overall responsiveness of Lustre.
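One possible way to clean up a very large directory without shell wildcard expansion is to stream over its entries and remove them one at a time. The sketch below is an illustration, not a GSI-provided tool; the function name and the pause_every / pause_seconds throttle are hypothetical. It still sends one unlink request per file, but it never builds the full file list in memory and it can pause periodically to spread the load.

```python
import os
import time


def remove_files(directory, pause_every=1000, pause_seconds=0.1):
    """Delete the regular files directly inside `directory`, one by one.

    Subdirectories and their contents are left untouched.
    Set pause_every=0 to disable the throttle.
    Returns the number of files removed.
    """
    removed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip subdirectories and symlinks
            os.unlink(entry.path)
            removed += 1
            # Hypothetical throttle: sleep briefly after every batch.
            if pause_every and removed % pause_every == 0:
                time.sleep(pause_seconds)
    return removed
```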

Don't install software in Lustre

The HPC department provides a dedicated infrastructure for software deployment on the compute clusters. Please refer to the CvmFs documentation and/or contact the software coordinator of your experiment / working group.

Avoid installing software frameworks and libraries in Lustre, because they usually contain lots of very small files. When many jobs start to initialize applications with library dependencies inside a Lustre file system, the MDS will be flooded with requests.

Do not compile software in Lustre! Since the build process generates plenty of compile artifacts and temporary files, it floods the MDS with requests. If you absolutely need to deploy binaries on Lustre, compile them in a temporary directory on the Interactive Machines and install them afterwards.

Beware of executables in Lustre

Lustre clients can block I/O under high load, since Lustre uses a "strong" client/server coupling to enable connection recovery after infrastructure failures. In such cases, programs usually crash when instructions have to be loaded into memory from an inaccessible executable. It is possible to submit binaries as jobs with the cluster management system, but we recommend copying the executable to the temporary scratch space local to the execution node before executing it. Furthermore, executables run from Lustre suffer a performance penalty due to network latency.
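The recommendation to copy an executable to node-local scratch space could be sketched as follows. The function name is hypothetical, and the location of the local scratch space is an assumption: with scratch_root=None, Python's tempfile module falls back to $TMPDIR or /tmp, which may or may not be the right place on a given execution node.

```python
import os
import shutil
import subprocess
import tempfile


def run_from_local_scratch(executable, args=(), scratch_root=None):
    """Copy `executable` to node-local scratch, run it there, return its exit code."""
    # scratch_root=None lets tempfile choose $TMPDIR or /tmp (an assumption;
    # check where local scratch lives on your execution nodes).
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        local_copy = os.path.join(workdir, os.path.basename(executable))
        # copy2 also copies the permission bits, so the copy stays executable.
        shutil.copy2(executable, local_copy)
        result = subprocess.run([local_copy, *args])
        return result.returncode
```

The temporary directory and the copy are removed when the program has finished, so nothing accumulates on the execution node.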

Don't store too many files in a directory

Concurrent access to several files in the same directory creates contention, because Lustre has to maintain a lock on the directory. Therefore, make sure to use subdirectories and keep the number of files per directory in the range of thousands.
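A simple way to spread files over subdirectories is to derive the subdirectory from a hash of the file name. This is a sketch under assumptions: the function name, the three-digit directory names, and the default of 256 subdirectories are illustrative choices, not a site convention.

```python
import hashlib
import os


def sharded_path(base_dir, filename, n_subdirs=256):
    """Return a path for `filename` inside one of `n_subdirs` subdirectories.

    The subdirectory is derived from a hash of the file name, so the same
    name always maps to the same place and files spread out evenly.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_subdirs
    return os.path.join(base_dir, f"{bucket:03d}", filename)
```

Because the mapping is deterministic, a reader can compute the path of a file again from its name and does not need to search for it.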

Avoid many small files

The optimal access strategy for handling the I/O of a single job would be to have exactly one file containing input data and to write exactly one output file. Neither input nor output files should be shared with another job or process. Naturally, in many applications this is impossible or very difficult due to the nature of the computation itself. Nevertheless, the number of files accessed by your application should be as small as possible.

For this reason we recommend file sizes bigger than 1 GB for input data if feasible. If your experiment data consists of many small files, consider merging the data once before you execute several processing applications on significantly bigger data files. The number of output files needs to be kept very limited too. Remember that log files are output files as well. For example, avoid separating standard output and standard error, and avoid writing additional log streams from child processes.
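Merging many small files once could look like the sketch below, which packs the regular files of one directory into a single uncompressed tar archive. The function name is hypothetical and tar is only one possible container; a format native to your analysis framework may be a better fit. The archive should be written outside the source directory.

```python
import os
import tarfile


def merge_into_archive(source_dir, archive_path):
    """Pack all regular files directly inside `source_dir` into one tar file.

    Returns the number of files added. Subdirectories are skipped.
    """
    count = 0
    with tarfile.open(archive_path, "w") as archive:
        with os.scandir(source_dir) as entries:
            # Sort by name so the archive layout is reproducible.
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_file():
                    archive.add(entry.path, arcname=entry.name)
                    count += 1
    return count
```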

Don't write one file from many processes

Concurrent access to the same file is a burden for Lustre. In massively parallel computations (with a large number of processes or threads), contention has to be taken into account. Instead of allowing all processes to do I/O, choose just a few processes to do it. For writes, a few processes should collect the data from the other processes and merge it before writing to storage. For reads, these few processes should read the data and then broadcast it to the others.
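The choice of "a few processes" can be expressed as a mapping from each process rank to the rank that performs I/O on its behalf. The sketch below shows only this bookkeeping; the function names are hypothetical, and the actual data exchange (for example gather and broadcast calls of an MPI library) is left out.

```python
def aggregator_for(rank, n_ranks, n_aggregators):
    """Return the rank that performs I/O on behalf of `rank`.

    Ranks are split into contiguous groups; the first rank of each group
    acts as its aggregator. At most `n_aggregators` groups are formed.
    """
    if n_aggregators < 1:
        raise ValueError("n_aggregators must be at least 1")
    if not 0 <= rank < n_ranks:
        raise ValueError("rank out of range")
    group_size = -(-n_ranks // n_aggregators)  # ceiling division
    return (rank // group_size) * group_size


def is_aggregator(rank, n_ranks, n_aggregators):
    """True if `rank` is one of the few ranks that touch the file system."""
    return aggregator_for(rank, n_ranks, n_aggregators) == rank
```

With 10 ranks and 3 aggregators, the groups have 4 ranks each (the last one 2), and ranks 0, 4 and 8 do the I/O.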

Keep files open and buffer data

Each file-open is a metadata request to the MDS followed by a redirection to an OSS/OST. Keep file handles open during the execution of your application. Make sure to buffer output as long as possible (1 MB or more) before you flush it to storage. In general, aggregate small read and write operations into larger ones (e.g. with MPI-IO collective buffering).
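In Python, both points can be covered by opening the output file once and passing a large buffer size to open. The function name and the 4 MiB default are illustrative assumptions; any value above the 1 MB mentioned above follows the same idea.

```python
def write_records(path, records, buffer_bytes=4 * 1024 * 1024):
    """Write an iterable of bytes objects through one handle with a large buffer.

    The file is opened once; small writes are collected in memory and
    reach storage in chunks of roughly `buffer_bytes`.
    Returns the total number of bytes written.
    """
    written = 0
    with open(path, "wb", buffering=buffer_bytes) as out:
        for record in records:
            out.write(record)
            written += len(record)
    return written
```

The opposite pattern, reopening the file in append mode for every record, would cost one metadata request per record.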

Architecture / Glossary

You might hear HPC people referring to certain Lustre machine types without giving explanations, so here they are:

  1. The Metadata Server (MDS) stores namespace metadata such as file and directory names or access permissions. This single service directs each file request to the corresponding server storing the data. Once the file is opened, the MDS is no longer involved.
  2. The Object Storage Servers (OSS) typically manage several Object Storage Targets (OST) by controlling I/O access and handling network requests. OSTs are storage devices that store your files and consist of physical disks (in a RAID configuration). File data is stored in one or more objects, where each object is distributed to a different OST. The capacity of a Lustre file system is the sum of the capacities provided by all OSTs.
  3. Clients (normally compute nodes in a cluster) access file data. Lustre presents a single namespace to users, visible like a usual network file system via a mounted path in the local directory tree (standard POSIX semantics). It allows concurrent and coherent read and write access to the distributed file system.
  4. Management Server (MGS): this is the central point of contact providing configuration information about Lustre file systems and the entry point for Lustre communication (e.g. for a client intending to mount a Lustre file system). The MGS is usually co-located with the MDT (Metadata Target) on one machine.
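The statement in item 2 that file data is stored in objects on different OSTs can be pictured as round-robin striping. The sketch below is a simplified model: the function name, the stripe size of 1 MiB and the stripe count of 4 are illustrative assumptions, not the settings of the file systems described here.

```python
def locate(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a byte offset in a file to (object index, offset inside that object).

    Consecutive chunks of `stripe_size` bytes are assigned to the objects
    in turn; each object lives on a different OST.
    """
    stripe_number = offset // stripe_size
    object_index = stripe_number % stripe_count
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset
```

In this model, the first MiB of a file lands in object 0, the second in object 1, and the fifth wraps around to object 0 again.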
Topic revision: r26 - 2018-08-29, ThomasRoth