Shifting Checklist
Daily Checklist
- Approve or Extend VOMS membership
Email notifications will be sent out when there are requests that need to be approved or extended. Click here for membership extension; click here for membership approval. Links will also be included in the email. You must be an admin before performing the above actions.
- Screen sessions
There are a total of six screen sessions you need to worry about. You can use
screen -r <screen name/Id>
to see them. You can use -x to see a screen someone else is currently looking at.
On Buffer1:
- dflow_register registers new incoming raw files.
- offsite_transfers handles the transfers from buffer 1 to the grid storage.
On cedar1:
- benchmarking submits benchmarking jobs to cedar.
- offline_processing creates and inserts processing jobs to the database for incoming data.
On liverpool:
- enqueue_processing submits and monitors processing Dirac jobs through ganga.
- enqueue_production submits and monitors production Dirac jobs through ganga.
If the screens on Buffer1 need to be relaunched, first kill them and then run
~/data-flow/cron/launch_processing_screen/launch_dataflow.sh
If the other screens need to be relaunched, go to the cedar or liverpool action; this will kill and relaunch the screens on that site. If the action says "This workflow was disabled manually.", ssh to the site, kill the screen, git pull, and run
~/data-flow/cron/launch_processing_screen/launch_processing.sh
The variables in launch_processing_settings.sh denote which screens will be launched, so modify them to true/false accordingly.
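For quick reference, a minimal sketch of finding and attaching to these sessions with standard screen commands (the session names are those listed above; the exact IDs will differ on each machine):
screen -ls
screen -r offline_processing
screen -x offline_processing
The first command lists all sessions on the current machine, the second reattaches to a detached session, and the third attaches to a session someone else is already viewing; detach again with Ctrl-a d.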
- Production requests
You may receive a production request like the following:
Production Request for:
Rat Version: 6.18.9
Rat DB Tag: partialfill-physics-v2
Priority: medium
Total CPU-Year Request: 1.6
Run-by-run: Attached
Further Information: With the ratdb tag.
Production requests will come with attachments, usually a .json file and a .txt file. You should copy those to Cedar into a temporary directory:
scp <file1> <file2> username@cedar.computecanada.ca:~/<my_temp_dir>
If the file is short, you can just copy it through the terminal. Once the files are on Cedar, source the data-flow environment:
cd ~/data-flow
source envpy3.sh
cd gasp
Now you are ready to submit the request:
python bin/production_submit.py -t <Rat DB Tag> -o ~/<my_temp_dir>/<file_name>.exactly -L ~/<my_temp_dir>/<file_name>.txt config jobfile
where
<file_name>.exactly
is the name of the output file, containing a list of everything submitted. It is important to keep track of these files, as they are referenced by members of the group, so consider making a separate directory for storing them. Just specify the config and jobfile names; DO NOT type out the words config or jobfile followed by the name of the file.
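As a hedged illustration only, a submission for the sample request above might look like the following; the temporary directory and file names are hypothetical placeholders, and the config and jobfile arguments are whatever the request calls for:
scp runlist.json runlist.txt username@cedar.computecanada.ca:~/prod_tmp
cd ~/data-flow
source envpy3.sh
cd gasp
python bin/production_submit.py -t partialfill-physics-v2 -o ~/prod_tmp/partialfill-physics-v2.exactly -L ~/prod_tmp/runlist.txt <config> <jobfile>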
- Reprocessing requests
A reprocessing request looks similar to a production request:
Reprocessing Request for:
Rat Version: 6.18.9
Priority: high
Rat DB Tag: partialfill-physics-v2
Run-by-run: Attached
Further Information: Submit once the Analysis40R is done (including the resubmission of the partfailed).
Second pass processing: https://github.com/snoplus/rat/blob/master/mac/processing/neck_fill/second_pass_processing.mac
Third pass processing:
- name = Analysis40R_CrOff
- nhit cut = 40
- macro = https://github.com/snoplus/rat/blob/master/mac/processing/neck_fill/third_pass_analysis_processing_classifier.mac
Most of them can now be submitted through the action: go here, copy the contents of the
.txt
file into the list of run numbers field, select the ratV and other parameters, and click run workflow.
If any of the options are missing, edit the file here to add them. Here are the instructions to do it manually:
Reprocessing requests will also come with attachments, usually a
.txt
runlist file. Copy this to Cedar into a temporary directory:
scp <file1> username@cedar.computecanada.ca:~/<my_temp_dir>
Double-check that the processing module file exists in
~/data-flow/gasp/modules/processing_information_X_Y_Z.py
, where
X_Y_Z
is the RAT version that will be used. For example,
processing_information_6_18_9.py
for RAT 6.18.9. If it is not there, then it needs to be created. Since little changes between the modules, the process to create a new one is just:
- Make a copy of the latest processing_information_X_Y_Z.py file
- Rename it to include the latest RAT version
- Ask Valentina Lozza about any new changes to include, and add them in
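As a rough sketch of the copy/rename steps (the new version 7_0_1 below is purely hypothetical; use the RAT version from the request):
cd ~/data-flow/gasp/modules
cp processing_information_6_18_9.py processing_information_7_0_1.py
Then edit the new file to add any changes Valentina suggests.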
Once the file is on Cedar and the module is in the right place, source the data-flow environment:
cd ~/data-flow
source envpy3.sh
cd gasp
Now you are ready to submit the request:
python bin/offline_processing.py config/<site>.cfg <ratV> L1 -t <Rat DB Tag> -N <Second Pass Module> -N <Third Pass Module> -L ~/<my_temp_dir>/<file_name>.txt -o ~/<my_temp_dir>/<file_name>.exactly
where
<file_name>.exactly
is the name of the output file, containing a list of everything submitted. It is important to keep track of these files, as they are referenced by members of the group, so consider making a separate directory for storing them. A module is slightly different from a macro; the easiest way to find the module corresponding to a given macro is to look it up in the processing_information_X_Y_Z.py for that RAT version, since the file is basically just a giant mapping of modules to macros with given arguments.
Note:
- Use
-f
only when you need to set the Analysis job's Processing prerequisite to the largest (most recent) pass number of Processing for each run, instead of requiring it to be part of the submission.
- Sometimes you need to submit both processing and analysis jobs, with the processing ones being a prerequisite for the analysis. Use
-N <Processing module> -N <Analysis module>
to submit them all together. It is best to do this in one step, specifying all modules, rather than submitting the processing modules first and then the analysis modules, which won't always work. You don't need the
-f
flag in this case.
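Purely as a sketch of how these pieces fit together for the sample request above: the site config, RAT version format, module names, and paths below are hypothetical placeholders, so check them against the actual processing_information file before submitting anything like this:
cd ~/data-flow
source envpy3.sh
cd gasp
python bin/offline_processing.py config/cedar.cfg 6.18.9 L1 -t partialfill-physics-v2 -N <Second Pass Processing module> -N <Third Pass Analysis module> -L ~/reproc_tmp/runlist.txt -o ~/reproc_tmp/reprocessing_6_18_9.exactly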
- Failed jobs resubmission
We have an automatic script
~/cron/retry_jobs.sh
that resubmits certain failed jobs satisfying some specific requirements twice a day. retry_failed.py has been integrated into gasp_client to resubmit part failed and failed jobs, but only after run 270000; this works for both Processing and Production jobs. There are still some other jobs that have to be resubmitted manually.
- Part Failed jobs: Sometimes we need to resubmit part failed jobs. Use retry_failed.py to resubmit them instead of offline_processing.py, which would give these jobs a new, unwanted pass.
- Identify issues: We usually have email notifications turned on for failed jobs, and the execution log will be attached to the email. If the email includes
Note, no attachments for error/output logs!
this means the job has no execution log. This usually happens when a job fails before actually running, which can be solved by resubmission. If you want further information, you can get the
dirac_id
from its data document and search for that on the Dirac monitoring page.
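For orientation only, the twice-daily retry could be driven by a crontab entry along these lines; the times and log path here are assumptions, not the actual schedule on the machine:
0 6,18 * * * ~/cron/retry_jobs.sh >> ~/cron/retry_jobs.log 2>&1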
- Draining the queue
If something goes wrong with the jobs, you may need to remove all the jobs from the clusters.
Processing:
cd ~/data-flow/cron
screen -S drain ./processing/enqueue_processing_drain.sh
Production:
cd ~/data-flow/cron
screen -S drain ./production/enqueue_production_drain.sh
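To check on a drain that is underway, the usual screen commands apply ("drain" is just the session name given above):
screen -ls | grep drain
screen -r drain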
Weekly Checklist
- DWG meeting
You need to attend the weekly DWG call on Tuesdays at 2:00 p.m. EST and give updates on anything notable that happened during the last week. Sign up to the mailing list, snoplus@snolab.ca, for further information.