Overview
Processing and production use largely the same codebase, with a few exceptions for processing- or production-specific scripts and modules. The system uses Ganga for job submission, CouchDB for the database, and Python for the majority of the codebase.
A passing understanding of how a CouchDB database works will make the design choices clearer and debugging less painful. A CouchDB database, in the most general sense, maps keys to values. At the simplest level the database is a collection of documents: the keys are arbitrary (but unique), and the associated documents are JSON objects (analogous to Python dictionaries). Views are generated by functions which emit zero or more (KEY, VALUE) pairs per document and, with an appropriately chosen key, can be queried to retrieve the originating documents by key or key range. For more details on the inner workings of CouchDB see the official documentation. The views for the production and processing databases are the same and can be found under the database folder in the data-flow repository. The details of the database are glossed over in the following description.
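As a concrete and purely illustrative example, the following sketch queries a view from Python with the python-couchdb package; the server URL, database name, and view name are assumptions, not the actual names used in the data-flow repository.

    import couchdb

    # Connect to a CouchDB server and open a database (URL and database
    # name are illustrative).
    server = couchdb.Server("http://localhost:5984/")
    db = server["processing"]

    # A view emits zero or more (KEY, VALUE) pairs per document; querying
    # it by key (or key range) returns the matching rows, and
    # include_docs=True also returns each originating document.
    for row in db.view("jobs/by_run_module", key=[100000, "ntuple"],
                       include_docs=True):
        print(row.key, row.doc["_id"])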
Requested jobs are stored as documents in the CouchDB database. A single job document exists for each unique (RUN, MODULE) tuple and contains every pass for that (RUN, MODULE). A pass for a job may contain one or more subjobs, depending on whether subfiles are submitted to the queue separately or not.
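A job document might look roughly like the following; this is only an illustration of the structure described above, and the field names are assumptions rather than the actual schema from the data-flow repository.

    # Hypothetical job document: one document per (RUN, MODULE) tuple,
    # holding one entry per pass, where each pass may contain several
    # subjobs (one per subfile). Field names are illustrative only.
    job_doc = {
        "_id": "job_100000_ntuple",   # unique per (RUN, MODULE)
        "run": 100000,
        "module": "ntuple",
        "passes": {
            "0": {"subjobs": [{"subfile": 0, "status": "complete"},
                              {"subfile": 1, "status": "complete"}]},
            "1": {"subjobs": [{"subfile": 0, "status": "queued"}]},
        },
    }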
One or more clients can access the database, select a subset of jobs (typically by job priority; see the queue types available in offline_processing.py) and, assuming some prerequisite conditions are met, run the jobs. The particular module types define the prerequisite conditions to be checked. Production modules have few or no prerequisites, but for processing, e.g., the input data is confirmed to exist before the job is launched.
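The client-side logic is roughly as follows. The view name, priority range, and helper functions in this sketch are hypothetical placeholders; the real queue types and prerequisite checks live in offline_processing.py and the module definitions.

    import couchdb

    def prerequisites_met(job):
        # Placeholder: e.g. for processing modules, confirm the input
        # data exists before launching; production modules may have no
        # prerequisites at all.
        return True

    def run_job(job):
        # Placeholder: submit the job, e.g. via Ganga.
        pass

    db = couchdb.Server("http://localhost:5984/")["processing"]

    # Select a subset of requested jobs, typically by priority, and run
    # those whose prerequisites are satisfied.
    for row in db.view("jobs/by_priority", startkey=0, endkey=10,
                       include_docs=True):
        job = row.doc
        if prerequisites_met(job):
            run_job(job)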
Output data is then archived (e.g. to Grid storage or to a database, depending on the output type) and catalogued in the same database. A unique data document exists for each type of data for each pass that is run; subjobs/subfiles of the same data type and pass are aggregated into a single document. Types of data include RAT full data-structure files (ratds), RAT ROOT ntuples (ntuple), and variations on these.
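Again purely as an illustration, a data document aggregating the outputs of one data type for one pass might look like this; field names and file paths are assumptions.

    # Hypothetical data document: one per data type per pass, aggregating
    # all subjobs/subfiles of that type. Field names and file paths are
    # illustrative only.
    data_doc = {
        "_id": "data_100000_ntuple_p1",
        "run": 100000,
        "pass": 1,
        "data_type": "ntuple",        # e.g. "ratds" or "ntuple"
        "files": [
            {"subfile": 0, "location": "grid", "path": "/path/to/file_0.root"},
            {"subfile": 1, "location": "grid", "path": "/path/to/file_1.root"},
        ],
    }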
To submit jobs at a particular site the gasp_client must be configured and run. The gasp_client should only be submitting jobs at one site at a time per database; otherwise jobs may (unlikely) get submitted at more than one site or (more likely) job submission will be very inefficient due to document update collisions. Production and processing use separate databases, so no synchronization is required between them. Processing is generally run at only one site since it requires less CPU time than production, which is run at many sites. Production therefore requires a synchronization scheme to ensure the gasp_client is only submitting at one site at a time. One option is to run cron jobs at non-overlapping times at each site; however, this does not protect against a client running for longer than expected, among other drawbacks. A better option is the enqueue_command script, which uses CouchDB to create a robust, exclusive lock ensuring that only one site is running at a time. This script may invoke gasp_client as frequently as desired (i.e. in a loop) at each site, but will only run when no other site is actively running. See the section on enqueue_command.py for an example script that runs gasp_client in this way.
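The exclusive lock relies on the fact that creating a CouchDB document with a fixed _id is atomic: a second writer receives a conflict. The following is a minimal sketch of that idea, not the actual enqueue_command implementation; the lock id, database name, and site name are assumptions.

    import couchdb
    from couchdb.http import ResourceConflict

    db = couchdb.Server("http://localhost:5984/")["production"]

    def acquire_lock(site):
        # Creating a document with a fixed _id is atomic: if another site
        # already holds the lock, CouchDB returns a conflict.
        try:
            db.save({"_id": "site_lock", "site": site})
            return True
        except ResourceConflict:
            return False

    def release_lock():
        doc = db.get("site_lock")
        if doc is not None:
            db.delete(doc)

    if acquire_lock("example_site"):
        try:
            pass  # invoke gasp_client here, e.g. in a loop
        finally:
            release_lock()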