Molecular atmosphere calibration: what to store in the DB?
Discussion on which calibration products we need saved in the CalibPipe DB, from the point of view of the molecular atmosphere calibration. We can split this discussion into two parts: one concerning daily calibration products, either for cat-B or cat-C data, and another concerning reference calibration products, to be updated once per year or even less frequently.
Daily calibration products
The goal of the molecular atmospheric calibration is to provide an atmospheric model, i.e. the state of the atmosphere over the observatory during an Observation Block (OB). Here one can split even further, taking into account the two different approaches to providing IRFs for a given OB: either selecting them from a precalculated IRF grid or creating them on the fly (tailored MCs). Care should be taken that, apart from saving the selected model, we save all the information necessary to reproduce the results, as well as to mitigate unavailability of the external data sources. Finally, it should be stressed that there are still some open factors that can affect which calibration products should be stored:
- Reproducibility of cat-B data
- Cat-C data production process
Precalculated IRFs
Here we simply select the best-matching atmospheric model. The selection is performed by matching the molecular number density at ~15 km a.s.l. The computation of the density does not include the MACOBAC value.
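As a minimal sketch of that selection, assuming hypothetical model names and density values (the real reference densities would come from the model set itself):

```python
# Sketch of the model selection: pick the reference model whose molecular
# number density at ~15 km a.s.l. is closest to the measured one.
# Model names and density values are placeholders, not real numbers.
REFERENCE_DENSITIES = {  # number density at 15 km [m^-3]
    "winter": 4.0e24,
    "intermediate": 4.2e24,
    "summer": 4.4e24,
}

def select_model(measured_density_15km: float) -> str:
    """Return the reference model with the closest density at 15 km."""
    return min(
        REFERENCE_DENSITIES,
        key=lambda model: abs(REFERENCE_DENSITIES[model] - measured_density_15km),
    )
```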
Cat-B data:
- Selected atmospheric model (type enum)
- Date (MJD or datetime.date or string) of the selection
- Date (MJD or datetime.date or string) of the observation night in question
In case reproducibility of cat-B data is required, we should also store the names of the GDAS files we used in order to select the model. That could be a list of strings or a text file (<< 1 kB). The reason is that GDAS data are not uploaded to RDA in real time, so there is a possibility that a data request for a given timeseries will return only a part of it.
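As an illustration, a cat-B record along these lines would cover all three items plus the optional GDAS file list; this is a sketch only, with placeholder table, column, and enum names (assuming SQLAlchemy, which is not mandated anywhere above):

```python
import enum
from sqlalchemy import Column, Date, Enum, Integer, JSON
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class AtmosphericModel(enum.Enum):  # placeholder enum values
    WINTER = "winter"
    INTERMEDIATE = "intermediate"
    SUMMER = "summer"

class CatBAtmosphericSelection(Base):
    """One row per observation night (placeholder schema)."""
    __tablename__ = "catb_atmospheric_selection"

    id = Column(Integer, primary_key=True)
    selected_model = Column(Enum(AtmosphericModel), nullable=False)
    selection_date = Column(Date, nullable=False)     # when the selection ran
    observation_night = Column(Date, nullable=False)  # night being calibrated
    # Optional, only if cat-B reproducibility is required (<< 1 kB):
    gdas_files = Column(JSON, nullable=True)          # list of file names
```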
The date of selection and the date of the observation night should be identical unless external data are unavailable. The idea is that we send a request to GDAS every morning (?) after an observation night. If we retrieve the data, we launch the main scenario and select a reference atmospheric model. If the data retrieval fails, we activate exception scenario 1, where we use the already existing and stored model; here the date of selection starts to differ from the date of the observation night. We calculate the difference between the date the model was selected and the date of the observation night in question. If that difference is more than 5 days, we activate exception scenario 2, where we ask Copernicus for the atmospheric data of the specific night in question. We then save again the enum variable with the selected model together with the date of the requested Copernicus data, so the same table will do. One might consider that, in case cat-B reproducibility is required, one should also save the data assimilation system (DAS) used (GDAS or ECMWF via Copernicus). However, this information need not be stored, since it will be evident from the names of the used files. (Of course, saving the DAS name would make automating the reproduction easier, but here we are talking about reproducing cat-B data produced in the rather unlikely exception scenario 2. Do we care?)
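The branching logic boils down to a few lines; a sketch with hypothetical inputs, using the 5-day threshold quoted above:

```python
import datetime

MAX_MODEL_AGE_DAYS = 5  # the threshold quoted above

def choose_scenario(night: datetime.date,
                    last_selection_date: datetime.date,
                    gdas_available: bool) -> str:
    """Decide which scenario applies for a given observation night.

    'main'        -> fresh GDAS data, selection date == observation night
    'exception-1' -> reuse the stored model (the two dates start to differ)
    'exception-2' -> stored model older than 5 days, ask Copernicus instead
    """
    if gdas_available:
        return "main"
    if (night - last_selection_date).days <= MAX_MODEL_AGE_DAYS:
        return "exception-1"
    return "exception-2"
```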
Cat-C data:
There is no clear strategy on when the cat-C data will be produced. A bare minimum of what should be saved in the DB is:
- Selected atmospheric model (type enum)
- Date (MJD or datetime.date or string) of the observation night in question
- Data assimilation system (DAS) used
This list implicitly assumes that the production takes place well after the night in question, so all the requested timeseries data should be available. If for any reason a file is missing (it shouldn't be), that should be caught by a proper exception, because it is an unexpected (and hopefully rare) occasion.
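In code terms, that means failing loudly instead of silently falling back to an older model; a minimal sketch with a hypothetical exception class:

```python
from pathlib import Path

class MissingTimeseriesFileError(RuntimeError):
    """Raised when an expected (archived) timeseries file is absent."""

def check_timeseries_complete(files: list[str]) -> None:
    """Fail loudly if any expected file is absent (cat-C production)."""
    missing = [f for f in files if not Path(f).exists()]
    if missing:
        # For cat-C the data should be long archived, so a missing file is
        # an unexpected condition, not something to silently work around.
        raise MissingTimeseriesFileError(f"missing timeseries files: {missing}")
```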
The above discussion is based on one assumption: that the template request is not modified on a nightly basis. The only things expected to change in a request for a given observatory are the date, obviously, and the timeseries duration (since the night duration changes with the seasons). However, the night duration can be predicted; the changes should be reflected in different configuration files for different (parts of) seasons, and it should be clear to the pipeline user which configuration to use for any given date.
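One way to keep that unambiguous is an explicit, date-driven mapping from seasonal periods to configuration files; everything below (season boundaries, file names) is a placeholder:

```python
import datetime

# Placeholder mapping from the (start_month, start_day) of each seasonal
# period to its configuration file; real boundaries would come from climatology.
SEASONAL_CONFIGS = [
    ((11, 1), "request_winter.yaml"),        # Nov 1 .. end of Feb
    ((3, 1), "request_intermediate.yaml"),   # Mar 1 .. May 31
    ((6, 1), "request_summer.yaml"),         # Jun 1 .. Oct 31
]

def config_for(night: datetime.date) -> str:
    """Return the configuration file valid for a given observation night."""
    starts = sorted(SEASONAL_CONFIGS, key=lambda item: item[0])
    chosen = starts[-1][1]  # default: the last period wraps over Jan 1
    for (month, day), cfg in starts:
        if (night.month, night.day) >= (month, day):
            chosen = cfg
    return chosen
```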
Tailored MCs
Here, obviously, we are talking only about cat-C data. The idea is that we don't simply select a precalculated atmospheric model; instead, we create one reflecting the state of the atmosphere on that given night.
- Bare minimum: data assimilation system (enum)
That should be enough to first order. The pipeline user, at any time in the future, knowing which DAS was used, can download the timeseries and reproduce the model. However, there are a few tricky things that need to be considered:
- timeseries duration: required in the extreme case where we want the state of the atmosphere only for the specific OB and not the whole night. This possibility only opens up with Copernicus data, which offer a time resolution of 1 hour (something I think the stakeholders were not aware of).
- ecsv file with the full atmospheric profile: that's the safest and heaviest option. It ensures 100% reproducibility, and it is also the most straightforward way for someone to reproduce or crosscheck the results. Now, why might one not be able to fully reproduce the results if they simply know the DAS used? For two reasons:
- ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. Users are notified if this occurs.
- Here we need MACOBAC. But how are we going to use MACOBAC? Are we calculating it once per calendar year and storing the value in the DB? Or, for tailored MCs, do we calculate MACOBAC on the fly? FYI, we provide tools to calculate the MACOBAC of the last 12 months (that's the requirement: 12MACOBAC; see the sketch below). If someone wants to calculate the 12MACOBAC of, e.g., June 2019, they have to do it themselves (unless we want to enhance the MACOBAC functionality, but no one has asked us to do so).
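For reference, the 12MACOBAC itself is just a 12-month average; a minimal sketch over a monthly CO2 series (the data source and the exact input format are outside its scope):

```python
from statistics import fmean

def macobac_12(monthly_co2_ppm: dict[str, float]) -> float:
    """12MACOBAC: mean CO2 over the last 12 calendar months, in ppm.

    `monthly_co2_ppm` maps "YYYY-MM" to the monthly mean CO2; which
    product the values come from is not specified here.
    """
    last_12 = sorted(monthly_co2_ppm)[-12:]  # lexicographic == chronological
    if len(last_12) < 12:
        raise ValueError("need at least 12 months of CO2 data")
    return fmean(monthly_co2_ppm[month] for month in last_12)
```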
Reference products
The details of which reference products need to be stored, and in exactly which format, depend on which simulation tools we plan to use in CTA. The discussion here assumes Corsika 7 and sim_telarray. Here, apart from MACOBAC, all other products are grouped around a reference atmospheric model.
Let's start with 12MACOBAC
12MACOBAC
- CO2 value averaged over the last 12 months, in ppm
- date of estimation
Reference atmospheric model set
Those models are, e.g., summer in La Palma, winter in Atacama, and so on. Let's see what we need:
- Reference MDP (molecular density profile): used only for model selection. Currently it is not an input to the simulations.
- Atmospheric model for Corsika: contains the following profiles, as a function of height. It implicitly includes MACOBAC.
  - Pressure
  - Temperature
  - Partial water pressure
  - Refractive index
  - Density (g/cm³)
  - Thickness (g/cm²)
- Molecular Scattering Profiles (MSPs): optical depth (or transmission) per wavelength per height. It implicitly includes MACOBAC.
- Molecular Absorption Profiles (MAPs): I don't know exactly what our Croatian friends will give us, but I would expect transmission (that's what MODTRAN gives) per molecule per wavelength per height. We should then add up the extinction from MAPs and MSPs in order to calculate the overall molecular extinction (see the sketch below). One might ask why we insist on storing MAPs and MSPs separately. The reason (to my knowledge) is to account for variations in the ozone mixing ratio: if we monitor ozone, we can account and correct for them. There is a dropped requirement to store Potential Vorticity, which is highly correlated with the transport of ozone from the stratosphere to the troposphere. However, the Word document states that this should be a SUSS functionality. I disagree; I would propose to monitor ozone via ECMWF data, but I haven't had the chance to discuss it yet. To make a long story short: if we don't plan to correct for ozone, we can simply sum up the molecular absorption for every molecule, add it to the Rayleigh scattering, and store only Molecular Extinction Profiles (MEPs).
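The summation itself is trivial, since optical depths are additive; a sketch with numpy, assuming MAPs and MSPs are given as optical depths on a common (wavelength, height) grid:

```python
import numpy as np

def molecular_extinction_profile(
    tau_scattering: np.ndarray,                     # MSP, shape (n_wl, n_h)
    tau_absorption_per_molecule: list[np.ndarray],  # MAPs, same shape each
) -> np.ndarray:
    """MEP as total optical depth on a (wavelength, height) grid.

    Optical depths are additive, so the total molecular extinction is the
    Rayleigh-scattering optical depth plus the sum over molecular species.
    """
    return tau_scattering + np.sum(tau_absorption_per_molecule, axis=0)

# If transmission is needed instead of optical depth: T = exp(-tau).
```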
All the above profiles can be provided in the form of ecsv files. They should be stored together with their dates of validity.
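Since the ecsv format carries units and free-form metadata, the validity dates can travel inside the file itself; a sketch with astropy (the column set here is illustrative, the real ones are listed above):

```python
from astropy.table import Table
import astropy.units as u

# Illustrative profile table; the actual column sets are listed above.
profile = Table(
    {
        "height": [0.0, 1.0, 2.0] * u.km,
        "density": [1.2e-3, 1.1e-3, 1.0e-3] * u.g / u.cm**3,
        "temperature": [290.0, 283.0, 276.0] * u.K,
    }
)
# Validity dates stored in the table metadata, serialized into the ecsv header.
profile.meta["valid_from"] = "2025-11-01"
profile.meta["valid_to"] = "2026-04-30"
profile.write("reference_profile.ecsv", format="ascii.ecsv", overwrite=True)
```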
Systematics
As I see it, each reference set should come with its own set of systematics, calculated in climatological studies. I would propose to store in the DB a table like:
- Ref model (enum) : IRF (string or enum) : systematic uncertainty (float)
So, for example, one can easily find that for Atacama summers the systematic uncertainty in effective area is within x%.
If we cannot connect the ref model with its validity dates, we should store the validity dates here as well.
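The lookup would then be as simple as the following sketch (model names, IRF names, and numbers are invented):

```python
# Placeholder systematics table: (reference model, IRF) -> uncertainty [%].
# If the ref model cannot be linked to its validity dates elsewhere, those
# dates could be added here as extra columns.
SYSTEMATICS = {
    ("atacama_summer", "effective_area"): 4.0,
    ("atacama_winter", "effective_area"): 6.0,
    ("la_palma_summer", "energy_resolution"): 3.0,
}

def systematic_uncertainty(ref_model: str, irf: str) -> float:
    """Return the climatological systematic uncertainty in percent."""
    return SYSTEMATICS[(ref_model, irf)]

# e.g. systematic_uncertainty("atacama_summer", "effective_area") -> 4.0
```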