CalibPipe DataBase Data Model
Summary
It is important to define a CalibPipe data model to be used for the CalibPipe DataBase, in which we will store the calibration coefficients. The sequence diagrams below show the current functionality and the proposed design for queries and data retrieval.
Current model
The current DB access and data read system is the following:
However, it is necessary to guarantee that the CalibPipe DB is normalized, and thus free from insertion, update, and deletion anomalies.
Proposed model
Normalization ensures that relevant information is well organized and any ambiguity can be solved by the usage of primary keys, which might be composite.
To summarize the main normalization strategy, the following two rules should at least be satisfied:
- each element of the table identified by the pair {row, column} should not be a nested relation (i.e. an array, a dictionary, another table, etc.)
- each column which is not a primary key or part of it (in case of a composite primary key) should functionally depend on the primary key only (and not on other columns, or a subset of a composite primary key)
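As a sketch of these two rules (an in-memory SQLite example with hypothetical table and column names, not the actual CalibPipe schema): a column holding an array violates the first rule and is split into one row per value; a column depending only on part of a composite key violates the second and is moved to a table keyed by that part alone.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Rule 1: no nested relations. Instead of a single "coefficients" column
# holding an array, each value gets its own row.
cur.execute("""
    CREATE TABLE coefficient (
        table_id INTEGER,
        channel  INTEGER,
        value    REAL,
        PRIMARY KEY (table_id, channel)
    )
""")

# Rule 2: full functional dependency. With the composite key
# {name, version}, a column "description" depending on "name" only
# is moved to a table keyed by "name" alone.
cur.execute("""
    CREATE TABLE table_description (
        name        TEXT PRIMARY KEY,
        description TEXT
    )
""")
cur.execute("""
    CREATE TABLE table_version (
        name    TEXT,
        version INTEGER,
        PRIMARY KEY (name, version),
        FOREIGN KEY (name) REFERENCES table_description (name)
    )
""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Every cell now holds a single scalar, and every non-key column depends on the whole primary key of its table.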
Version control table
First, the current version_control_table, used to retrieve information about the version of the data tables storing coefficients in a certain period of validity, has the following design:
This table has a composite primary key, made by {name, version}; the period of validity depends on both of these keys. However, the most frequent update operation on this table will be extending the validity period (the minor version can be kept fixed at 1 in the DB for simplicity), while it will be rare to insert a new row due to a major update that introduces an incompatibility. Therefore, the version_control_table can be redesigned as:
where the validity period can be accessed easily through the index table_id.
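A minimal sketch of this redesign (in-memory SQLite; the column names are assumptions, not the actual CalibPipe schema): the validity period lives in its own table keyed by table_id, so the frequent operation of extending it is a single UPDATE that never touches the name/version rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Redesigned version_control_table: one surrogate key per (name, version).
cur.execute("""
    CREATE TABLE version_control_table (
        table_id INTEGER PRIMARY KEY,
        name     TEXT,
        version  INTEGER,
        UNIQUE (name, version)
    )
""")
# Validity period accessed directly through table_id.
cur.execute("""
    CREATE TABLE validity (
        table_id       INTEGER PRIMARY KEY
                       REFERENCES version_control_table (table_id),
        validity_start REAL,
        validity_end   REAL
    )
""")
cur.execute("INSERT INTO version_control_table VALUES (1, 'atmosphere', 1)")
cur.execute("INSERT INTO validity VALUES (1, 59000.0, 59030.0)")

# Most frequent operation: extend the validity period in place.
cur.execute("UPDATE validity SET validity_end = 59060.0 WHERE table_id = 1")
new_end = cur.execute(
    "SELECT validity_end FROM validity WHERE table_id = 1").fetchone()[0]
```

A major, incompatible update instead inserts a new (name, version) row and a fresh validity entry.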
Coefficient tables
For coefficient tables, we must ensure that they are normalized and safe from ambiguity when accessing data through queries. According to the existing documents and requirements, each calibration coefficient must be coupled to a period of validity. In certain cases this can be an observation block, a shorter period, or even a longer period of time (e.g. seasonal). Therefore, timestamps are the best solution for storing such information.
The normalized table which corresponds to this requirement should be the following:
Here the array_id can refer to the entire array (North/South), a subarray, or a single telescope. Calibration coefficients that are calculated with the same frequency should be stored in the same table, to optimize queries and data access. The primary key is made by {array_id, validity_start, validity_end}. However, since queries will usually search within a range of validity (between validity_start and validity_end), it might be worth splitting the table into two separate tables: one storing unique values associated with this composite key, and the other storing coefficients that are uniquely identified by those values, which work as primary keys:
In this way, the main queries will operate on a table of only 4 columns. The main drawback of this design is the increased number of tables: for each calibration coefficient (or group of coefficients calculated with the same frequency) there will be a validity table storing the entries for fast access.
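A sketch of the split, again with assumed names on in-memory SQLite: a narrow 4-column validity table that answers the range query, and a coefficient table keyed by the surrogate validity id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Narrow table holding only the composite key plus a surrogate id:
# range queries touch these 4 columns only.
cur.execute("""
    CREATE TABLE coeff_validity (
        validity_id    INTEGER PRIMARY KEY,
        array_id       INTEGER,
        validity_start REAL,
        validity_end   REAL,
        UNIQUE (array_id, validity_start, validity_end)
    )
""")
# Table with the actual coefficients, keyed by validity_id.
cur.execute("""
    CREATE TABLE coeff_values (
        validity_id REFERENCES coeff_validity (validity_id),
        coefficient REAL
    )
""")
cur.execute("INSERT INTO coeff_validity VALUES (1, 42, 59000.0, 59001.0)")
cur.execute("INSERT INTO coeff_values VALUES (1, 3.14)")

# Typical query: find the coefficient valid at a given timestamp.
row = cur.execute("""
    SELECT v.coefficient
    FROM coeff_validity AS p
    JOIN coeff_values AS v USING (validity_id)
    WHERE p.array_id = 42
      AND 59000.5 BETWEEN p.validity_start AND p.validity_end
""").fetchone()
```

The join only happens after the cheap range search has narrowed the candidates down to a single validity_id.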
Note: in either model, indices should be created on array_id and on the timestamps, in order to accelerate queries and keep the execution time nearly independent of the size of the table.
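The effect of such indices can be inspected with SQLite's EXPLAIN QUERY PLAN (a sketch; the production DB engine may differ, but the principle carries over): with a composite index on (array_id, validity_start, validity_end), the range query becomes an index search rather than a full-table scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE coeff_validity (
        array_id       INTEGER,
        validity_start REAL,
        validity_end   REAL,
        coefficient    REAL
    )
""")
# Composite index on array_id and the timestamps.
cur.execute("""
    CREATE INDEX idx_validity
    ON coeff_validity (array_id, validity_start, validity_end)
""")

# The plan text reports whether the index is used for the lookup.
plan = " ".join(
    detail for *_, detail in cur.execute("""
        EXPLAIN QUERY PLAN
        SELECT coefficient FROM coeff_validity
        WHERE array_id = 1
          AND validity_start <= 59000.5
          AND validity_end   >= 59000.5
    """)
)
```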
Additional information can also be stored in separate meta tables, such as:
and grouped in the same table if they depend on the same primary key.
The advantage of using timestamp ranges is that they provide the granularity needed by some calibration tasks (e.g. event-based or season-based). Furthermore, it would be preferable to use the same timestamp format in each table, e.g. the mjd format, as it can be stored as a float in the DB.
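A stdlib-only sketch of the mjd format (in production a library such as astropy would be the safer choice): MJD counts days since 1858-11-17 00:00 UTC, so a UTC datetime converts to a single float suitable for a REAL column.

```python
from datetime import datetime, timedelta, timezone

MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)

def to_mjd(dt: datetime) -> float:
    """Convert a timezone-aware UTC datetime to a Modified Julian Date float."""
    return (dt - MJD_EPOCH).total_seconds() / 86400.0

def from_mjd(mjd: float) -> datetime:
    """Inverse conversion, for reading timestamps back from the DB."""
    return MJD_EPOCH + timedelta(days=mjd)

# Round-trip example: a validity_start stored as a single float.
ts = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
mjd = to_mjd(ts)
```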
On the other hand, if a calibration task has the same granularity as the observation block, it might be worth using the primary key {sb_creator_id, obs_id, array_id} instead, which is a globally unique key as described in the DL0 data model. In this case, indices will be defined on these attributes, and a unique value can be defined in a separate table as in the previous case.
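A sketch of this alternative (SQLite; the attribute names come from the DL0 data model as cited in the text, the rest are assumptions): the composite key {sb_creator_id, obs_id, array_id} identifies the coefficient directly, and the primary-key constraint guarantees one value per observation block.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# OB-granularity coefficients keyed by the DL0 globally unique key.
cur.execute("""
    CREATE TABLE ob_coefficient (
        sb_creator_id INTEGER,
        obs_id        INTEGER,
        array_id      INTEGER,
        coefficient   REAL,
        PRIMARY KEY (sb_creator_id, obs_id, array_id)
    )
""")
cur.execute("INSERT INTO ob_coefficient VALUES (7, 1001, 42, 0.98)")

# A duplicate key is rejected: one coefficient per observation block.
try:
    cur.execute("INSERT INTO ob_coefficient VALUES (7, 1001, 42, 0.99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

value = cur.execute("""
    SELECT coefficient FROM ob_coefficient
    WHERE sb_creator_id = 7 AND obs_id = 1001 AND array_id = 42
""").fetchone()[0]
```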
An additional table storing validity timestamps for this primary key can also be considered.
Additional meta tables
Tables storing information about the array, subarrays, and telescopes can also be useful, e.g.:
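One possible shape for such meta tables (a sketch with assumed names): a telescope table plus a membership table mapping telescopes to subarrays, so that an array_id can be resolved to its member telescopes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("""
    CREATE TABLE telescope (
        tel_id   INTEGER PRIMARY KEY,
        tel_type TEXT
    )
""")
cur.execute("""
    CREATE TABLE subarray_member (
        array_id INTEGER,
        tel_id   INTEGER REFERENCES telescope (tel_id),
        PRIMARY KEY (array_id, tel_id)
    )
""")
cur.executemany("INSERT INTO telescope VALUES (?, ?)",
                [(1, "LST"), (2, "MST")])
cur.executemany("INSERT INTO subarray_member VALUES (?, ?)",
                [(42, 1), (42, 2)])

# Resolve a subarray id to the types of its member telescopes.
members = [t for (t,) in cur.execute("""
    SELECT tel_type FROM subarray_member
    JOIN telescope USING (tel_id)
    WHERE array_id = 42 ORDER BY tel_id
""")]
```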
Large arrays tables
For certain pixel-wise calibration tasks (e.g. camera calibration), building a normalized table might not be an option: according to the rules listed above, it would require tables with an extremely large number of columns (at least n_gain_channels x n_pixels) for most of the calibration coefficients. Therefore, large arrays should be stored as compressed byte strings in the DB, using the NDArray class already available in the current implementation.
Note: this functionality has to be tested.
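A minimal stdlib-only stand-in for that approach (the actual NDArray class in the current implementation may differ): the pixel-wise array is packed to bytes, zlib-compressed, stored as a BLOB, and reconstructed on read.

```python
import sqlite3
import struct
import zlib

def pack(values):
    """Compress a flat list of floats into a byte string for DB storage."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(raw)

def unpack(blob):
    """Inverse of pack: decompress and rebuild the list of floats."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE camera_coeff (array_id INTEGER, data BLOB)")

# e.g. n_gain_channels x n_pixels values, flattened into one row.
coeffs = [0.5, 1.0, 1.5, 2.0]
cur.execute("INSERT INTO camera_coeff VALUES (?, ?)", (42, pack(coeffs)))

blob = cur.execute(
    "SELECT data FROM camera_coeff WHERE array_id = 42").fetchone()[0]
restored = unpack(blob)
```

The trade-off is that the DB can no longer query individual pixels; the blob is opaque until decompressed on the client side.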