On-call duty / Rufbereitschaft - Main Page

These pages collect necessary information for on-call duty. Relevant information about system access, location of log files, restart/reboot operations and support for specific applications is all collected or linked here.

0. Organizational and general information regarding on-call duty
- On-call duty in general
- Duties in addition to the actual on-call duty
HowTo

0. Organizational and general information regarding on-call duty

On-call duty in general

Handout On-call duty (GSI)
[[Timing/Intern/TimingSystemArbeitszeitUndRufbereitschaft][Work hours and On-call duty (Timing Wiki)]]
Overview, whom to call (ACO)
Telephone numbers (Control rooms, etc.)

Duties in addition to the actual on-call duty

Visit the HKR (main control room) once or twice a day and ask about issues. Spend more time there if there are problems.
Monitor OLOG entries
- Shift and technical defects (Shift and Cone icon)
Attend the "Morning Briefing" (8:30, Zoom) and be ready to report on APP OLOG entries. If you cannot participate, arrange for an APP colleague to attend instead.
Attend the "Mittagssitzung" (midday meeting, 12:45, Zoom) and be ready to report on APP OLOG entries. If you cannot participate, arrange for an APP colleague to attend instead.
Attend the "OCM Meeting" on Wednesdays (9:00, Zoom) if neither JF nor AW is going, and be ready to report on the issues listed in the Technical Defect Report (OLog login required).
(If nobody from APP can attend, talk to Regine Pfeil or HH and report in detail on the APP issues so that they can report on the status instead.)

HowTo

Services (Dienste)

Status of services can be found in Icinga -> https://monitoring-acoapp.acc.gsi.de/icinga/monitoring/list/servicegrid?flipped=1
Existing Services, URLs, Ports and their configuration: ApServiceLocationDefinitions
Analyzing problems
- Check logging for errors -> https://logging.acc.gsi.de -> login as anonymous
- asl15* -> bob -> heapdump -> AppOnCallDutyMain#Create_a_Heap_Dump
- asl15* -> bob -> journalctl --user -u lsa-server-gsi -b 0
PRO services are managed with systemd and run as user bob
- Known hosts: asl151.acc.gsi.de (LSA services), asl156.acc.gsi.de (former CSCOAP group services), asl157.acc.gsi.de (former CSCOSV group services)
- Documentation of the setup and of the most important commands from INN: https://git.acc.gsi.de/handel/bob-services/src/branch/master/bob.md#systemd
- Services are rolled out with Ansible; if a different version is rolled out, a restart should happen automatically.
  - See also AppOnCallDutyMain#Rollout
  - For a manual restart use: asl15* -> bob -> systemctl --user status / systemctl --user restart
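
As a rough illustration of the manual restart path, the following sketch wraps the status, restart and journal commands in one shell function. The names `run` and `restart_service` and the `DRY_RUN` switch are hypothetical helpers, not part of the bob setup. `lsa-server-gsi` is the unit name used in the journalctl example above; other unit names have to be looked up on the respective host.

```shell
# Sketch only: run as user bob on the service host (e.g. asl151).
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restart_service UNIT: show status, restart, show status again,
# then print the last journal lines of the current boot.
restart_service() {
    unit="$1"
    run systemctl --user --no-pager status "$unit"
    run systemctl --user restart "$unit"
    run systemctl --user --no-pager status "$unit"
    run journalctl --user -u "$unit" -b 0 -n 50 --no-pager
}
```

A dry run such as `DRY_RUN=1; restart_service lsa-server-gsi` only lists the four commands, which can be useful for checking the unit name before touching a PRO service.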

Rollout

If a new software version needs to be rolled out, coordinate with the shift leader in the main control room beforehand and inform them about constraints or consequences. If a rollout is not possible at the moment because of operational restrictions, or if it cannot be done before one of the status meetings (such as the Morning Briefing or the Mittagssitzung), please raise the topic so that a convenient time for the rollout can be agreed.

To roll out services on PRO we use Ansible:

https://git.acc.gsi.de/fcc-commons/acoapp-ansible#usage

Problem solving using the "Panic" App

expert-cs-panic-app is an application that resets the state of several central services and resupplies them with the required data, essentially resetting the whole control system to a defined state.

Before the reset: if you suspect that the problem is timing related, please contact the Timing on-call duty first so that state and logs can be secured for later diagnosis.

The reset is done with expert-cs-panic-app (an ACO-APP expert program); its documentation is on the front page of its Git repository.

Full data supply ("Vollversorgung")

In the context of the Scheduling App, a full data supply ("Vollversorgung") means: remove the problematic chains (non-resident), supply the changes ("Versorgen / Supply", upper right), add the chains you want to be available again, and supply again. Depending on the problem, the on-call staff has to assess which chains are affected (if, for example, 2 of 3 chains work without problems, it probably makes sense as a first step to remove and re-add only the problematic chain).

In the context of the devices / LSA, a full data supply ("Vollversorgung") means: use ParamModi, select the context, and instead of the usual "An Geräte schicken" ("send to devices") choose the item "Ganzen Kontext schicken" ("send whole context") from the button's drop-down menu. If multiple contexts need to be supplied, you have to select them manually one by one (this can be necessary if a device was restarted or its settings were reset; in that case all contexts containing the device need to be resupplied).

Diagnosis in case of OutOfMemory-Errors

If an application is malfunctioning (stuck, wrong values, visual artifacts, etc.) and you suspect an (out-of-)memory problem, it makes sense to analyze its state further. Creating a heap dump is often useful for later investigation.

If possible, create a heap dump and secure it (e.g. on the cluster's scratch). If the problematic behavior took place in the MCR and the on-call staff cannot be there in person, you can ask the MCR to execute the following steps.

Determine Process-ID

There are two options to determine the PID: either find it yourself through the logging system (see further down), or ask the MCR to run the terminal program jps on the console in question and check the program name. Example:

jps
2529 DigitizerExpertApp
28707 Jps
486 LauncherApp

For the DigitizerExpertApp, for example, the PID is 2529.
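
If the PID has to be picked out repeatedly or passed on to another command, the lookup can be scripted. The following is a minimal sketch; the function name `pid_of_app` is a hypothetical helper, and it assumes the default jps output format shown above (PID in the first column, program name in the second).

```shell
# pid_of_app NAME: read jps output on stdin and print the PID of every
# entry whose program name (second column) is exactly NAME.
pid_of_app() {
    awk -v name="$1" '$2 == name { print $1 }'
}
```

Usage on the console in question: `jps | pid_of_app DigitizerExpertApp`. If several instances of the same application are running, one PID per line is printed and the right one still has to be chosen by hand.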

Create a Heap Dump

jmap -dump:live,format=b,file=<filename>.hprof <PID>

Example: jmap -dump:live,format=b,file=/tmp/dump_digitizer_expert_app_tcl1030.hprof 2529
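
When several dumps are taken in one session it helps to keep the file names distinct. The sketch below builds a name following the pattern of the example above and adds a timestamp; the timestamp suffix and the function names `heap_dump_file` and `create_heap_dump` are assumptions for illustration, not an established convention.

```shell
# heap_dump_file APP HOST [STAMP]: print a dump path below /tmp.
# STAMP defaults to the current date and time.
heap_dump_file() {
    stamp="${3:-$(date +%Y%m%d-%H%M%S)}"
    echo "/tmp/dump_${1}_${2}_${stamp}.hprof"
}

# create_heap_dump PID FILE: write a heap dump of live objects to FILE,
# using the same jmap options as shown above.
create_heap_dump() {
    jmap -dump:live,format=b,file="$2" "$1"
}
```

For example: `create_heap_dump 2529 "$(heap_dump_file digitizer_expert_app tcl1030)"`.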

Copying the dump to the cluster (the user needs a cluster account for this to work):

scp /path/to/<dump-filename> <username>@asl75<n>:/tmp

(n is the number of the asl cluster computer used. Since /tmp is a machine-local directory, you need the name of the cluster machine to find the copied dump file.)

As an example: scp /tmp/dump_digitizer_expert_app_tcl1030.hprof awalter@asl754:/tmp

Depending on the default permissions of the user who copies the file, they may need to grant us access to the file:

ssh <username>@asl75<n>
chmod a+rw /tmp/<dump-filename>

For later analysis, move the dump file from the /tmp folder to the /common/scratch/cscoap/dumps directory or to your own home folder; otherwise it may be deleted automatically from /tmp. After the file has been analyzed, it can usually be deleted.
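
The copy, permission and move steps can be written down as one command sequence, for instance to hand it to the MCR. The following sketch only prints the commands. The function name `secure_dump_commands` is hypothetical, and the sketch assumes that the same cluster account performs all three steps, that this account may write to /common/scratch/cscoap/dumps, and that the paths contain no spaces.

```shell
# secure_dump_commands DUMP USER HOST: print (do not execute) the commands
# that copy DUMP to /tmp on the cluster machine HOST, make it accessible
# and move it to the dumps directory on the scratch.
secure_dump_commands() {
    dump="$1"    # local dump file
    user="$2"    # cluster account
    host="$3"    # cluster machine, e.g. asl754
    name=$(basename "$dump")
    echo "scp $dump $user@$host:/tmp"
    echo "ssh $user@$host chmod a+rw /tmp/$name"
    echo "ssh $user@$host mv /tmp/$name /common/scratch/cscoap/dumps/"
}
```

Calling `secure_dump_commands /tmp/dump_digitizer_expert_app_tcl1030.hprof awalter asl754` prints three lines that can be reviewed and then pasted into a terminal one by one.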

Many tools exist for the analysis; we frequently use e.g. jvisualvm or Eclipse Memory Analyzer.

Trace or understand errors using the Logging-System

Out-of-memory errors are often caught by our "Uncaught-Exception-Handler" and forwarded to the logging system. In this case you can search the logging for the following terms (possibly restricting the search to the StackTrace field):

OutOfMemory

java.lang.OutOfMemoryError

StackTrace:*OutOfMemory

"heap space"
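
The same terms can also be applied to log text that is available locally, for example the journal of a service on asl15*. The following is a minimal sketch for that case only; it does not query the web logging system, and the function name `find_oom_lines` is a hypothetical helper.

```shell
# find_oom_lines: read log text on stdin and print every line, with its
# line number, that mentions an out-of-memory error.
# "OutOfMemory" also matches "java.lang.OutOfMemoryError".
find_oom_lines() {
    grep -n -E 'OutOfMemory|heap space'
}
```

For example, as user bob: `journalctl --user -u lsa-server-gsi -b 0 | find_oom_lines`.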

Diagnosis of stuck / crashed applications

Similar to a heap dump, it can be useful to create a thread dump, which gives an overview of the open or running threads of the application. This can be done with the following command:

jstack <PID> > /tmp/thread-dump-example-app_tcl1030.txt

How to determine the pid is described in AppOnCallDutyMain#Determine_Process_45ID
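
For a stuck application, a single thread dump shows only one moment; taking a few dumps some seconds apart makes it easier to see which threads do not move. The following sketch does that. The function name `thread_dumps`, its defaults (3 dumps, 5 seconds apart) and the `JSTACK` override are assumptions for illustration.

```shell
# thread_dumps PID PREFIX [COUNT] [INTERVAL]: write COUNT thread dumps of
# process PID to PREFIX-1.txt, PREFIX-2.txt, ..., waiting INTERVAL seconds
# between them. JSTACK can point to a different jstack binary.
thread_dumps() {
    pid="$1"
    prefix="$2"
    count="${3:-3}"
    interval="${4:-5}"
    i=1
    while [ "$i" -le "$count" ]; do
        "${JSTACK:-jstack}" "$pid" > "${prefix}-${i}.txt"
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example: `thread_dumps 2529 /tmp/thread-dump-example-app_tcl1030`.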


Topic revision: r5 - 2026-06-12, RaphaelMueller
