
Cluster Computing

TIP Prometheus/Hera decommissioning:

Detailed instructions on Batch Computing: Grid Engine

The HPC computing facility is built for physics simulation and the analysis of experiment data. It is available to all scientists associated with GSI and/or FAIR experiments.

TIP Issues with the cluster infrastructure are posted to the HPC-INFO mailing list (subscribe / sign off).

Prometheus
  Operating system: Debian 6 Squeeze (64-bit)
  Size: ~380 nodes (~9000 cores)
  Read: GridEngine, LustreFs, CvmFs
  Hardware: AMD Opteron 6238 (24 cores), 64 GB RAM
  Login to or for nodes with Infiniband
  Interconnect: Mellanox QDR Infiniband
  File systems: /hera, /cvmfs
Kronos
  see separate topic


What is a job?

In general, a job is an action that the user wants performed on a machine of the Computing Center. It can be an executable file, a set of commands, a script, and so on. A job (which may consist of several tasks) is developed and tested on one of the InteractiveMachines of the Computing Center before it is submitted to the Computing Cluster.

What is the Resource Management System good for?

The Resource Management System (RMS), for example LSF or GridEngine, accepts work requests ("jobs") from users and puts them into a pending area. As soon as computing resources become available, the system selects a matching job from the pending area and sends it to an execution host. The RMS manages jobs while they are running. When a job has finished, accounting data, logs and results are returned.

What is job scheduling?

Jobs are submitted by sending them to the job scheduler. The scheduler is the common entry point to the Computing Cluster. Its role is to receive jobs, match them to appropriate computing resources, and manage them while they run on the execution host. Coordinating the distribution of all computing resources (memory, disk space, CPU) among all users allows the cluster to be used as efficiently as possible.
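The matching step described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not how GridEngine is implemented: the `Job` and `Host` classes, the `dispatch` function and the host name `node1` are hypothetical, and the only resources considered are cores and memory. A real scheduler also applies policies such as priorities and fairshare.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description: a name plus requested resources.
    name: str
    cores: int
    memory_gb: int


@dataclass
class Host:
    # Hypothetical execution host with its currently free resources.
    name: str
    free_cores: int
    free_memory_gb: int

    def fits(self, job: Job) -> bool:
        return job.cores <= self.free_cores and job.memory_gb <= self.free_memory_gb


def dispatch(pending: deque, hosts: list) -> list:
    """Start every pending job that fits on some host; keep the rest pending."""
    started = []
    still_pending = deque()
    while pending:
        job = pending.popleft()
        # First host with enough free resources, or None if there is none.
        host = next((h for h in hosts if h.fits(job)), None)
        if host is None:
            still_pending.append(job)  # no matching resources yet
            continue
        host.free_cores -= job.cores
        host.free_memory_gb -= job.memory_gb
        started.append((job.name, host.name))
    pending.extend(still_pending)
    return started


hosts = [Host("node1", free_cores=24, free_memory_gb=64)]
pending = deque([
    Job("simulation", cores=16, memory_gb=32),
    Job("analysis", cores=16, memory_gb=16),
    Job("merge", cores=4, memory_gb=8),
])
started = dispatch(pending, hosts)
print(started)                        # jobs that were sent to a host
print([job.name for job in pending])  # jobs still in the pending area
```

In this toy run, "simulation" and "merge" start on the single host, while "analysis" stays in the pending area until 16 cores become free again.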

What is persistent storage?

Computational results (output data) and data produced by a physics experiment (input data) are stored in a shared file system that is available on the entire computing facility. This storage is persistent, in contrast to the local storage on an execution host, which serves as temporary space for intermediate data. Any data that must remain available for a longer period of time has to be stored on persistent storage.
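A common pattern that follows from this distinction is to keep intermediate files in temporary local space and to copy only the final result to persistent storage. The sketch below illustrates that pattern with the Python standard library. The function `run_job`, the file names and the `results` directory are hypothetical, and the demo uses a temporary directory as a stand-in for the shared file system so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def run_job(persistent_dir: Path) -> Path:
    """Compute in temporary scratch space, then copy the result to persistent storage."""
    persistent_dir.mkdir(parents=True, exist_ok=True)
    # The scratch directory is deleted automatically when the block ends,
    # just like local temporary space should not be relied on after a job.
    with tempfile.TemporaryDirectory() as scratch:
        scratch_dir = Path(scratch)
        intermediate = scratch_dir / "partial.txt"
        intermediate.write_text("1 2 3 4")
        total = sum(int(x) for x in intermediate.read_text().split())
        result = scratch_dir / "result.txt"
        result.write_text(f"{total}\n")
        # Only the final result is copied to the persistent location.
        saved = Path(shutil.copy2(result, persistent_dir / result.name))
    return saved


# Stand-in for a directory on the shared file system (assumption for the demo).
demo_root = Path(tempfile.mkdtemp())
saved = run_job(demo_root / "results")
print(saved.read_text())
```

After `run_job` returns, the intermediate file is gone together with the scratch directory, and only `result.txt` remains in the persistent location.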

Why is my job not starting?

Many factors affect the point in time at which a job is dispatched for execution: resource requirements, availability of eligible execution hosts, job dependency conditions, fairshare or priority constraints, and cluster load conditions.
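Two of these factors, job dependencies and resource requirements, can be made concrete with a short sketch. It is a simplified illustration under stated assumptions, not the logic of a real scheduler: `PendingJob`, `pending_reasons` and the host names are hypothetical, and fairshare, priorities and cluster load are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class PendingJob:
    # Hypothetical pending job: requested cores and jobs it must wait for.
    name: str
    cores: int
    depends_on: list = field(default_factory=list)


def pending_reasons(job, free_cores_per_host, finished_jobs):
    """Return human-readable reasons why the job cannot start right now."""
    reasons = []
    unmet = [dep for dep in job.depends_on if dep not in finished_jobs]
    if unmet:
        reasons.append("waiting for jobs: " + ", ".join(unmet))
    if not any(free >= job.cores for free in free_cores_per_host.values()):
        reasons.append(f"no host has {job.cores} free cores")
    return reasons


job = PendingJob("analysis", cores=16, depends_on=["simulation"])
reasons = pending_reasons(job, {"node1": 8, "node2": 12}, finished_jobs=set())
print(reasons)
```

An empty list means that, within this simplified model, nothing prevents the job from being dispatched.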
Topic revision: r63 - 2016-06-06, ChristopherHuhn