
Cluster Computing

The HPC computing facility is built for physics simulations and the analysis of experiment data. It is available to all scientists associated with GSI and/or FAIR experiments.

TIP: Issues with the cluster infrastructure are posted to the mailing list HPC-INFO (subscribe / sign off).

Slurm User Manual

A Slurm user manual can be found here: Slurm

Overview of the GSI Cluster

Prometheus

The Prometheus cluster was decommissioned on 30.6.2016.

Kronos

For further documentation about the Kronos cluster, click here.

Batch Computing on the Grid Engine

Detailed instructions on Batch Computing: Grid Engine

Terminology

What is a job?

In general, a job stands for an action the user wishes to be performed on a computing machine of the Computing Center. It could be an executable file, a set of commands, a script, etc. A job (which may consist of several tasks) is developed and tested on one of the InteractiveMachines of the Computing Center before being submitted to the Computing Cluster.
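In its simplest form, a job is just a small shell script that wraps the program to be executed. The following is a hypothetical sketch; the script name, the program and the file names are placeholders and not part of the GSI setup:

   #!/bin/bash
   # analysis_job.sh -- hypothetical example of a minimal job script
   # Runs a placeholder analysis program on one input file.
   ./my_analysis --input run_0042.dat --output run_0042.root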

What is the Resource Management System good for?

The Resource Management System (RMS), such as LSF or Grid Engine, accepts work requests as "jobs" from users and puts them into a pending area. As soon as computing resources become available, the system selects a matching job from the pending area and sends it to an execution host. Jobs are managed by the RMS while they are running. When a job has finished, accounting data, logs and results are returned.
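Assuming the Slurm RMS used on the Kronos cluster, the typical job lifecycle from the user's point of view looks roughly like this (the commands are standard Slurm; the job id is a placeholder):

   sbatch analysis_job.sh    # submit the job; it enters the pending area
   squeue -u $USER           # list your own pending and running jobs
   sacct -j <jobid>          # after the job has finished: accounting data (state, runtime, exit code)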

What is job scheduling?

Jobs are submitted by sending them to the job scheduler. The scheduler is the common entry point to the Computing Cluster. Its role is to receive jobs, match them to appropriate computing resources and manage them while they run on an execution host. The coordinated distribution of computing resources (memory, disk space, CPU) among all users allows an optimal use of the cluster.
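The scheduler can only match a job to suitable resources if the job states its requirements. A hedged Slurm sketch of such resource requests follows; the partition name "main" is an assumption and has to be replaced by an existing partition (see the Kronos documentation):

   #!/bin/bash
   #SBATCH --partition=main      # target partition (assumed name, check the Kronos documentation)
   #SBATCH --time=02:00:00       # maximum run time (hh:mm:ss)
   #SBATCH --mem=2000            # requested memory in MB
   #SBATCH --cpus-per-task=1     # number of CPU cores
   #SBATCH --output=job_%j.log   # file for the job's stdout/stderr (%j = job id)
   ./my_analysis --input run_0042.dat --output run_0042.root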

What is persistent storage?

Computational results (output data) and data produced by a physics experiment (input data) are stored in a shared file system available on the entire computing facility. This storage is persistent, in contrast to the local storage available on the execution hosts, which is used as temporary space for intermediate data. Any data that is to remain available for a longer period of time must be written to persistent storage.
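In a job script this typically means writing intermediate files to the node-local scratch space and copying only the final results back to the shared file system. The following sketch uses assumed locations ($TMPDIR for local scratch, a /lustre/... directory for persistent storage); the actual paths are documented on the cluster pages:

   #!/bin/bash
   #SBATCH --time=01:00:00
   # Work in node-local (temporary) scratch space -- the path is an assumption
   cd "$TMPDIR"
   /lustre/mygroup/bin/my_analysis --input /lustre/mygroup/input/run_0042.dat --output tmp_result.root
   # Copy only the final result back to persistent (shared) storage
   cp tmp_result.root /lustre/mygroup/results/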

Why Is My Job Not Starting?

Many factors affect when a job gets dispatched for execution: resource requirements, availability of eligible execution hosts, job dependency conditions, fair-share or priority constraints, and cluster load conditions.
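Slurm reports the reason why a job is still pending. Two standard ways to check (the job id is a placeholder):

   squeue -u $USER -t PENDING -o "%.10i %.9P %.20j %.8T %R"   # last column shows the pending reason (e.g. Priority, Resources, Dependency)
   scontrol show job <jobid> | grep -i reason                 # detailed reason for one specific job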