Overview

Much of our understanding of the data depends on the CouchDB databases maintained by the group, so the databases and the data documents they contain need rigorous validation.

A passing understanding of how a CouchDB database works will make design choices clearer and debugging less painful. In the most general sense, a CouchDB database maps keys to values. At the simplest level the database is a collection of documents: the keys are arbitrary (but unique), and the associated documents are JSON objects (analogous to Python dictionaries). Views are generated by functions that create zero or more (KEY, VALUE) pairs per document and, with an appropriately chosen key, may be used in queries to retrieve the originating document by key or key range. For more details on the inner workings of CouchDB see the documentation.

The views for the production and processing databases are the same and can be found under the database folder in the data-flow repository. The details of the database will be glossed over in the following description.
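To make the view mechanism described above more concrete, here is a minimal pure-Python simulation. It is not the CouchDB API (real views are map functions stored in design documents, typically written in JavaScript), and the document ids and field names (`type`, `run`, `data_type`) are invented for illustration rather than taken from the data-flow repository.

```python
from bisect import bisect_left, bisect_right

# Hypothetical documents keyed by unique id; field names are illustrative only.
docs = {
    "a1": {"type": "data", "run": 100, "data_type": "ratds"},
    "b2": {"type": "data", "run": 101, "data_type": "ntuple"},
    "c3": {"type": "data", "run": 103, "data_type": "ratds"},
    "d4": {"type": "note", "text": "no run here"},
}


def map_by_run(doc):
    """Emit zero or more (key, value) pairs for one document."""
    if doc.get("type") == "data" and "run" in doc:
        yield doc["run"], doc["data_type"]


def build_view(docs, map_fn):
    """Apply map_fn to every document and return rows sorted by key."""
    rows = [
        (key, value, doc_id)
        for doc_id, doc in docs.items()
        for key, value in map_fn(doc)
    ]
    return sorted(rows, key=lambda row: row[0])


def query(view, startkey, endkey):
    """Return ids of documents whose emitted key lies in [startkey, endkey]."""
    keys = [row[0] for row in view]
    lo = bisect_left(keys, startkey)
    hi = bisect_right(keys, endkey)
    return [row[2] for row in view[lo:hi]]


view = build_view(docs, map_by_run)
ids_in_range = query(view, 100, 101)  # documents for runs 100 through 101
```

Document `d4` emits no pairs, so it never appears in the view. Because the rows are kept sorted by key, a single-key lookup is just a range query with equal start and end keys.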

Raw data is archived in triplicate on Grid storage. Each run has a unique data document to keep track of the location and size of the files.
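As an illustration of the kind of check a validation could run on a raw data document, the sketch below verifies that every file records a size and three distinct storage locations. The document layout (`files`, `size`, `locations`), the file names, and the site names are assumptions made for this example; the actual schema is not described here.

```python
REQUIRED_COPIES = 3  # raw data is archived in triplicate

# Hypothetical raw data document; the real schema may differ.
raw_doc = {
    "_id": "raw_run_100",
    "run": 100,
    "files": {
        "run_100_000.raw": {
            "size": 1024,
            "locations": ["site_a", "site_b", "site_c"],
        },
        "run_100_001.raw": {
            "size": 2048,
            "locations": ["site_a", "site_b"],
        },
    },
}


def validate_raw_doc(doc):
    """Return a list of human-readable problems found in one document."""
    problems = []
    for name, info in sorted(doc.get("files", {}).items()):
        size = info.get("size")
        if not isinstance(size, int) or size <= 0:
            problems.append(f"{name}: missing or non-positive size")
        copies = len(set(info.get("locations", [])))
        if copies != REQUIRED_COPIES:
            problems.append(
                f"{name}: {copies} distinct locations, expected {REQUIRED_COPIES}"
            )
    return problems


problems = validate_raw_doc(raw_doc)
```

Returning a list of problems rather than raising on the first one lets a validation pass report everything wrong with a document in a single run.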

Output data is also archived (e.g. to Grid storage or to a database, depending on output type) and catalogued in the CouchDB database. There is a unique data document for each type of data for each pass that is run; subjobs/subfiles for the same data type and pass are aggregated into a single document. Types of data include RAT full datastructure files (ratds), RAT root ntuples (ntuple), and variations on these.
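The aggregation of subfiles into one document per data type and pass could be sketched as follows. The id scheme (`<data_type>_pass<N>`) and the field names are hypothetical, and a real document would presumably also identify the run or production job it belongs to.

```python
# Hypothetical subfile records, e.g. one per finished subjob.
subfiles = [
    {"data_type": "ratds", "pass": 1, "name": "out_0.root", "size": 100},
    {"data_type": "ratds", "pass": 1, "name": "out_1.root", "size": 200},
    {"data_type": "ntuple", "pass": 1, "name": "out_0.ntuple.root", "size": 10},
]


def aggregate(subfiles):
    """Group subfiles into one document per (data type, pass)."""
    docs = {}
    for sub in subfiles:
        doc_id = f"{sub['data_type']}_pass{sub['pass']}"  # assumed id scheme
        doc = docs.setdefault(
            doc_id,
            {
                "_id": doc_id,
                "data_type": sub["data_type"],
                "pass": sub["pass"],
                "files": [],
                "total_size": 0,
            },
        )
        doc["files"].append({"name": sub["name"], "size": sub["size"]})
        doc["total_size"] += sub["size"]
    return docs


output_docs = aggregate(subfiles)
```

Here the two ratds subfiles end up in a single document, while the ntuple output gets its own.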

Production and processing are in separate databases, meaning no synchronization is required between them.