Data

Data from various sources are used in this project as input to the U-net model and for validation. Descriptions of each data source are found below. This guide assumes you have followed the instructions on the Installation page.

Contents

Data sources
- TCR-2 NOₓ emissions
  - The surface NOₓ emissions used in the “y” input files for the U-net.
- TCR-2 NO₂ surface data
  - Surface NO₂ concentrations used in the “x” input files for the U-net.
- ERA5 meteorological data
  - 7 meteorological variables used in the “x” input files for the U-net, plus the land-sea mask.
  - Download ERA5 data
- USA EPA Air Quality data
  - Ground-based NO₂ measurements used to supplement the TCR-2 NO₂ surface data in the “x” input files for stage 2 of the U-net.
  - Downloading USA EPA Air Quality data
- Potential future data: ECCC Air Quality data
  - Ground-based measurements are currently used only over the USA. Adding data from ECCC would provide coverage for Canada in the “x” input files for stage 2 of the U-net.
Input files for the U-net model
- Input file structure
- Creating input files

Data sources

The datafiles directory contains data files and scripts for making the input files of the U-net model, which estimates North American NOₓ emissions. These scripts pull from data files kept in shared directories on Animus, usually within /data/high_res/. Each description below covers the contents and usage of a data source and, if the data are publicly available, how to obtain them. When listing the latitude and longitude resolutions, a “±” symbol indicates that the resolution is irregular: the value before the symbol is the average difference between consecutive coordinate values, and the value after the symbol is the standard deviation of those differences. For example, the TCR-2 NOₓ emissions latitude values have an average resolution of 1.121483870967742 degrees with a standard deviation of 0.0004997397866077013 degrees.
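
As a minimal sketch of how such a resolution value could be computed, the snippet below takes the mean and standard deviation of the differences between consecutive coordinate values. The function name grid_resolution is hypothetical, and the use of the population standard deviation is an assumption; the values quoted in this guide may have been computed differently.

```python
from statistics import mean, pstdev


def grid_resolution(coords):
    """Return (mean, std) of the spacing between consecutive coordinates.

    Assumption: population standard deviation (pstdev); the values quoted
    in this guide may have been computed differently.
    """
    diffs = [b - a for a, b in zip(coords, coords[1:])]
    return mean(diffs), pstdev(diffs)


# A regular longitude grid like the TCR-2 NOx emissions one:
# 60 points starting at -126.0 in steps of 1.125 degrees.
lons = [-126.0 + 1.125 * i for i in range(60)]
lon_res, lon_std = grid_resolution(lons)
print(f"{lon_res} ± {lon_std}")
```

A regular axis gives a standard deviation of zero, which is why only the irregular axes are listed with a “±”.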

TCR-2 (Tropospheric Chemistry Reanalysis v2) NOₓ emissions

Directory on Animus: /data/high_res/emacdonald/unet/datafiles/t106/
Filename convention: nox_20XX_t106_US.nc
- where 20XX is the year
Contains the variable:
- nox: Surface NOₓ emissions
Latitude extent: 24.112 to 58.878
- Resolution: 1.121483870967742 ± 0.0004997397866077013
Longitude extent: -126.0 to -59.625
- Resolution: 1.125
Grid size: 32 x 60
Daily time frequency
Example file: datafiles/sample_data/nox_2019_t106_US.nc
- There is also datafiles/sample_data/TROPESS_reanalysis_mon_emi_nox_anth_2021.nc. This sample file has the same latitude-longitude grid as the one above, but it has been down-sampled to monthly frequency to allow for quick testing.

Note: These data are not publicly available.

TCR-2 (Tropospheric Chemistry Reanalysis v2) NO₂

Directory on Animus: /data/high_res/emacdonald/unet/datafiles/TROPESS/
Institution: Jet Propulsion Laboratory
Filename convention: TROPESS_reanalysis_2hr_no2_sfc_20XX.nc
- where 20XX is the year
Contains the variable:
- no2: TROPESS Chemical Reanalysis Surface NO₂ 2-Hourly 2-Dimensional Product
Latitude extent: -89.14 to 89.14
- Resolution: 1.121483870967742 ± 0.0004997397866077013
Longitude extent: 0 to 358.9
- Resolution: 1.125
Grid size: 160 x 320
2-hour time frequency

TO BE ADDED: Instructions on how to download these files.

ERA5 Data

Directory on Animus: /data/high_res/ERA5concatenated/
Institution: European Centre for Medium-Range Weather Forecasts
Filename convention: 20XX_<var>.nc
- where 20XX is the year and <var> is one of the following variables:
  - blh: Boundary layer height
  - lsm: Land-sea mask
  - skt: Skin temperature
  - sp: Surface pressure
  - ssrd: Surface short-wave (solar) radiation downwards
  - t2m: 2 metre temperature
  - u10: 10 metre U wind component
  - v10: 10 metre V wind component
Latitude extent: 11.78 to 73.46
- Resolution: 1.121472716331482 ± 0.0004995913477614522
Longitude extent: -174.4 to -40.5
- Resolution: 1.125
Grid size: 56 x 120
Daily time frequency
Example file: datafiles/sample_data/2019_u10.nc

Downloading ERA5 data

The ERA5 data files described above have already been downloaded to /data/high_res/ERA5concatenated/ on Animus. However, if you ever need to re-download these files, to obtain a different geographical region for example, follow the instructions below.

The extent, time frame, and variables for which to download ERA5 data are defined in the era5_download.py Python script. Each of these can be changed from its default as follows:

Extent
- The era5_download.py Python script uses the variable DEFAULT_EXTENT from unox/data.py, where the default values can be changed.
Time frame
- The years over which to run the era5_download.py Python script are defined by the input arguments to the era5_download.sh Bash script (see below).
Variables
- Modify the list of variables in the variable_names dictionary in the era5_download.py Python script. The keys (short names) are arbitrary and are used to rename the files once downloaded, but the values (long names) must be valid ERA5 variable names. To check the long name of a variable, go to the Climate Data Store page for this dataset, select the variables of interest from the list, scroll down to the “Corresponding API request,” and copy the “variable” list. The code under “Corresponding API request” is what was used to create the era5_download.py Python script.
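
As a rough sketch of how these pieces fit together, the block below shows what a variable_names dictionary and a Climate Data Store request might look like. The long names are the ERA5 single-level names as I understand them and should be checked against the “Corresponding API request”. The build_request helper, its extent ordering, the request keys, and the example extent values are hypothetical and are not taken from era5_download.py.

```python
# Assumed contents of variable_names: short name -> ERA5 long name.
# The dictionary in era5_download.py may differ.
variable_names = {
    "blh": "boundary_layer_height",
    "lsm": "land_sea_mask",
    "skt": "skin_temperature",
    "sp": "surface_pressure",
    "ssrd": "surface_solar_radiation_downwards",
    "t2m": "2m_temperature",
    "u10": "10m_u_component_of_wind",
    "v10": "10m_v_component_of_wind",
}


def build_request(year, month, long_name, extent):
    """Build a request dictionary for one variable and one month.

    extent is (lat_min, lat_max, lon_min, lon_max); this ordering is a
    hypothetical stand-in for DEFAULT_EXTENT in unox/data.py.
    """
    lat_min, lat_max, lon_min, lon_max = extent
    return {
        "product_type": ["reanalysis"],
        "variable": [long_name],
        "year": [str(year)],
        "month": [f"{month:02d}"],
        "day": [f"{d:02d}" for d in range(1, 32)],
        # Every second hour, matching the 2-hour frequency of the downloads.
        "time": [f"{h:02d}:00" for h in range(0, 24, 2)],
        # The area is ordered as [north, west, south, east].
        "area": [lat_max, lon_min, lat_min, lon_max],
        "data_format": "netcdf",
    }


# Illustrative extent only; the real default lives in unox/data.py.
request = build_request(2019, 1, variable_names["u10"], (11.78, 73.46, -174.4, -40.5))
print(request["variable"], request["area"])
```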

To start the download, use the era5_download.sh Bash script, which accepts arguments for the start and end years to download. Only run this if the data are no longer on Animus in /data/high_res/:

username@animus-c:~/unox$ bash datafiles/era5_download.sh 2005 2020 > datafiles/era5_download_log.txt 2>&1

That will run the era5_download.py script for each month within those years (2005–2020) and send the log output of each call to the file datafiles/era5_download_log.txt.

When downloaded, the ERA5 data are at 2-hour frequency and 0.25 degree resolution, whereas the U-net model takes daily averages on the grid given by (datafiles/lats.npy, datafiles/lons.npy). Running the era5_concatenate.py script will find all the downloaded ERA5 files in unox/datafiles/era5_downloads/ and concatenate them into one file for each year, written to unox/datafiles/ERA5concatenated/. As this process takes a long time, I recommend launching a tmux session before starting so that any network interruption between your local machine and Animus won’t stop the script from running. Be sure to activate the conda environment on Animus after launching tmux.

username@animus-c:~/unox$ tmux
username@animus-c:~/unox$ conda activate env_name
(env_name) username@animus-c:~/unox$ python datafiles/era5_concatenate.py
Creating directory: /home/mschee/unox/datafiles/ERA5concatenated
era5_dirs: ['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']
Processing year directory: 2005
        Processing variable: u10 for year 2005
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2005/2005_01_u10/data_stream-oper_stepType-instant.nc
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2005/2005_02_u10/data_stream-oper_stepType-instant.nc
        ...
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2006/2006_12_u10/data_stream-oper_stepType-instant.nc
        Frozen({'valid_time': 365, 'latitude': 56, 'longitude': 120})
        Processing variable: lsm for year 2006
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2006/2006_01_lsm/data_stream-oper_stepType-instant.nc
        ...
Processing year directory: 2007
        Processing variable: u10 for year 2007
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2007/2007_01_u10/data_stream-oper_stepType-instant.nc
        ...
Processing year directory: 2020
        Processing variable: u10 for year 2020
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2020/2020_01_u10/data_stream-oper_stepType-instant.nc
        ...
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2020/2020_12_u10/data_stream-oper_stepType-instant.nc
        Frozen({'valid_time': 366, 'latitude': 56, 'longitude': 120})
        Processing variable: lsm for year 2020
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2020/2020_01_lsm/data_stream-oper_stepType-instant.nc
        ...
        Opening file: /home/mschee/unox/datafiles/era5_downloads/2020/2020_12_lsm/data_stream-oper_stepType-instant.nc
        Frozen({'valid_time': 366, 'latitude': 56, 'longitude': 120})

While running, the era5_concatenate.py script regrids the ERA5 data to the grid defined by the lats.npy and lons.npy files (I believe those values came from the t106 files) and to a daily frequency. The output is then in the format needed to make input files for the U-net: (365,56,120), or (366,56,120) for leap years.
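
The change from 2-hour to daily frequency can be sketched with NumPy as below. This is only an illustration of the averaging step under stated assumptions (12 time steps per day, a time axis that starts at the beginning of a day); the daily_mean helper is hypothetical, the spatial regridding is not shown, and era5_concatenate.py may do this differently.

```python
import numpy as np


def daily_mean(two_hourly, steps_per_day=12):
    """Average an array of shape (time, lat, lon) into daily means.

    Assumes the time axis starts at the first step of a day and holds a
    whole number of days.
    """
    n_time, n_lat, n_lon = two_hourly.shape
    if n_time % steps_per_day != 0:
        raise ValueError("time axis does not hold a whole number of days")
    n_days = n_time // steps_per_day
    # Split the time axis into (day, step within day), then average the steps.
    return two_hourly.reshape(n_days, steps_per_day, n_lat, n_lon).mean(axis=1)


# Synthetic stand-in for three days of 2-hourly data on the 56 x 120 grid.
rng = np.random.default_rng(0)
two_hourly = rng.random((3 * 12, 56, 120))
daily = daily_mean(two_hourly)
print(daily.shape)
```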

To close the tmux session when done, simply run the exit command.

USA EPA Air Quality data

Directory on Animus: /data/high_res/US_EPA/
Institution: United States of America Environmental Protection Agency
Filename convention: <species>/<frequency>_<species>/<frequency>_42602_20XX.csv
- where 20XX is the year, <frequency> is either daily or hourly, and <species> is one of the IDs listed in the case block for $SPECIES in the datafiles/US_EPA_data_download.sh script
Latitude extent: 18.198712 to 64.84569
- Resolution: 0.10229600438596492 ± 0.7920890308379249
Longitude extent: -159.36624 to -66.052237
- Resolution: 0.20463597149122806 ± 1.216681660681876
Grid size: 56 x 120
Daily time frequency
Example file: datafiles/sample_data/daily_42602_2019.csv
- 42602 is the ID for NO₂
- Note: The latitude and longitude extents and resolutions were calculated solely on this example file.

Note that the USA EPA Air Quality data are provided in .csv format. To do: check whether zero values are removed from the list of differences in lat/lon before calculating the average and standard deviation.

Downloading USA EPA Air Quality data

Air Quality data are currently available from the US EPA’s Air Data site. The data products used in this study were found on the Pre-Generated Data Files page, which lists the files available for each species and each year. While each file can be downloaded individually from this page, I have created the datafiles/US_EPA_data_download.sh script to automate that process, downloading the data into the /data/high_res/US_EPA/ directory on Animus.

That script takes in the following arguments, all of which are optional:

-s: Species
- Default: NO2
- Can choose all to download every species listed.
-b: Begin year
- Default: 1980
-e: End year
- Default: 2024
-f: Frequency
- Default: daily
- Other options: hourly or both

Running the script without specifying any arguments downloads exactly the same NO₂ data as are currently in the /data/high_res/US_EPA/ directory on Animus. As an example of using the arguments, the command below downloads data for CO between 2000 and 2005 at an hourly frequency. Only do this if the data are no longer on Animus in /data/high_res/.

username@animus-c:~/unox$ bash datafiles/US_EPA_data_download.sh -s CO -b 2000 -e 2005 -f hourly
Downloading hourly US EPA data for species: CO from 2000 to 2005
...

Input files for the U-net model

When running the U-net model, the input data are loaded from a netCDF input file. These files are created to have a consistent structure, with data from all the above sources interpolated onto a common grid in space and time. You will generally only need to create new input files when investigating a new geographic area, a different species, or adding new variables. The process of creating new input files can take some time. However, the model run configuration files (discussed in the Workflow guide) can be used to specify exactly what data are pulled from the input netCDF files for a particular run. Therefore, after spending the time to create an input file, you should be able to try many different kinds of model runs by modifying the configuration file.

Input file structure

The input files are netCDFs. Opening such a file with xarray shows its structure. Below is a text representation of the output. However, if the Python commands below are executed in a Jupyter Notebook cell on Animus, the structure becomes interactive, allowing for more exploration (see Analysis).

import xarray as xr

xr.open_dataset('inputfiles/no2_2019_JFM/no2_2019_JFM.nc')

<xarray.Dataset> Size: 62MB
Dimensions:     (time: 89, lat: 56, lon: 120, var: 1)
Coordinates:
  * lat         (lat) float32 224B 11.78 12.9 14.02 15.14 ... 71.21 72.34 73.46
  * lon         (lon) float32 480B -174.4 -173.2 -172.1 ... -42.75 -41.62 -40.5
  * time        (time) object 712B 2019-01-02 00:00:00 ... 2019-03-31 00:00:00
Dimensions without coordinates: var
Data variables: (13/13)
    nox         (time, lat, lon, var) float64 5MB ...
    no2         (time, lat, lon) float64 5MB ...
    no2_tm1     (time, lat, lon) float64 5MB ...
    u10         (time, lat, lon) float64 5MB ...
    v10         (time, lat, lon) float64 5MB ...
    blh         (time, lat, lon) float64 5MB ...
    sp          (time, lat, lon) float64 5MB ...
    skt         (time, lat, lon) float64 5MB ...
    t2m         (time, lat, lon) float64 5MB ...
    ssrd        (time, lat, lon) float64 5MB ...
    lsm         (time, lat, lon) float64 5MB ...
    no2_s2      (time, lat, lon) float64 5MB ...
    no2_s2_tm1  (time, lat, lon) float64 5MB ...
Attributes: (13/17)
    description:        Input data for the Unet model. Data for each year is ...
    y_var:              nox
    emiss_dir:          /data/high_res/emacdonald/unet/datafiles/t106
    emiss_pre:          nox_
    emiss_post:         _t106_US.nc
    nan_fill:           0
    ...                 ...
    x1_vars:            ['no2', 'no2_tm1', 'u10', 'v10', 'blh', 'sp', 'skt', 't2m', 'ssrd']
    x2_vars:            ['no2_s2', 'no2_s2_tm1', 'u10', 'v10', 'blh', 'sp', 'skt', 't2m', 'ssrd']
    data_dir:           /data/high_res
    chemra_path:        emacdonald/unet/datafiles/TROPESS/TROPESS_reanalysis_...
    insitu_path:        US_EPA/NO2/daily_NO2/daily_42602_
    era5_path:          ERA5concatenated
    stages:             [1 2]

The attributes indicate which variables are “y” and which are “x”.

“y” variable
- The variable which the model is trying to emulate, the “target” variable.
- In this case, y_var is nox.
- Note that this variable has an extra var dimension. This is a dummy dimension to ensure that the “y” variable data has the same number of dimensions as the “x” variables when bundled together.
“x” variables
- The variables which the model combines in particular ways to create a mapping to the target “y” variable.
- Ideally, none of these variables should be dependent on each other.
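
To illustrate why the dummy var dimension is useful, the sketch below bundles nine “x” variables along a trailing axis and compares the result with a “y” array that carries the extra dimension. The arrays are synthetic stand-ins, and stacking with np.stack on the last axis is an assumption about how the variables are bundled, not a description of the model code.

```python
import numpy as np

n_time, n_lat, n_lon = 89, 56, 120
x1_vars = ["no2", "no2_tm1", "u10", "v10", "blh", "sp", "skt", "t2m", "ssrd"]

# Synthetic stand-ins for the data variables in the input file.
fields = {name: np.zeros((n_time, n_lat, n_lon), dtype=np.float32) for name in x1_vars}
# The "y" variable carries the dummy var dimension of length 1.
nox = np.zeros((n_time, n_lat, n_lon, 1), dtype=np.float32)

# Bundle the "x" variables along a new trailing axis.
x = np.stack([fields[name] for name in x1_vars], axis=-1)

print(x.shape, nox.shape)
```

With the dummy dimension in place, both arrays have four dimensions and differ only in the length of the last one.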

The input file contains two lists of “x” variables: x1_vars and x2_vars. These correspond to the “x” variables used in Stage 1 and Stage 2 of the training. In Stage 1, the ground-based data is not used, only chemical data from reanalyses. In Stage 2, the pre-trained model from Stage 1 is retrained on input data which is supplemented with ground-based data. For the case above, the chemical data from reanalyses that is used in Stage 1 is no2. When training in Stage 2, the model is given no2_s2 which is the same as no2 except for locations and times for which ground-based data is available.
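
The relationship between no2 and no2_s2 can be sketched as follows. This is only an illustration and not the project's own code: fill_with_insitu is a hypothetical helper, and it assumes the ground-based measurements have already been placed on the model grid, with NaN wherever no measurement exists.

```python
import numpy as np


def fill_with_insitu(no2, insitu_on_grid):
    """Return no2 with values replaced wherever in-situ data exist.

    insitu_on_grid has the same shape as no2 and holds NaN where no
    ground-based measurement is available (an assumed representation).
    """
    return np.where(np.isnan(insitu_on_grid), no2, insitu_on_grid)


no2 = np.full((2, 3, 4), 1.0)        # reanalysis values (time, lat, lon)
insitu = np.full((2, 3, 4), np.nan)  # no measurements anywhere...
insitu[0, 1, 2] = 5.0                # ...except one cell on the first day

no2_s2 = fill_with_insitu(no2, insitu)
print(no2_s2[0, 1, 2], no2_s2[1, 1, 2])
```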

There is also the variable no2_tm1. This represents the no2 at “T-minus 1 day”, that is, the value of no2 at the same location the day before. It is for this reason that the dataset does not start on January 1st, where the value of no2_tm1 would be from December 31st of the previous year. Therefore, the dataset starts on January 2nd where the value of no2_tm1 is the value of no2 from January 1st. The variable no2_s2_tm1 is the equivalent of no2_tm1 for Stage 2.
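
A minimal NumPy sketch of this one-day shift, using a synthetic array in place of the real no2 data for January through March 2019:

```python
import numpy as np

# Synthetic stand-in for daily no2 over January-March 2019 (90 days).
n_days = 31 + 28 + 31
no2_full = np.arange(n_days, dtype=float).reshape(n_days, 1, 1)

no2 = no2_full[1:]       # January 2nd onwards
no2_tm1 = no2_full[:-1]  # the value from the previous day

print(no2.shape[0])
```

Dropping the first day leaves one fewer time step than there are days in the period, which is consistent with the 89 time steps in the example file above.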

Overall, for Stage 1 or 2, the number of “x” variables is equal to the number of ERA5 meteorological variables (of which there are 7 currently) plus 2 (one for no2 and one for no2_tm1).

The data in the input netCDF were originally stored as separate .npy files, each one containing a Numpy array of either the “x” or “y” data for a particular year for a particular stage. These contained no metadata, and so it became difficult to document which input files covered which geographical regions and contained which variables. I reconfigured the files to be in netCDF format so that the metadata is readily accessible and easily readable by both human and machine.

Creating input files

To create an input netCDF, use the make_all_input_files() function from the input module. This function has no required arguments; however, it is a good idea to pass in output_dir, the name of the subdirectory that will be created under the inputfiles directory, where the netCDF and corresponding input_metadata.json file are stored. The input_metadata.json file contains a dictionary, built in the process of creating the input file, that gives an overview of what the netCDF contains. If you are working with multiple input files, the input_metadata.json files offer a quick way to check which file you might want to use, since each might span a different set of years, contain a different set of variables, or use different data sources.
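
One way to compare several input files at a glance is to load every input_metadata.json under the inputfiles directory. The helper below is a hypothetical sketch using only the standard library; it assumes the layout described above (one subdirectory per input file) and makes no assumption about which keys the metadata contains.

```python
import json
from pathlib import Path


def load_input_metadata(inputfiles_dir="inputfiles"):
    """Map each input subdirectory name to its input_metadata.json contents."""
    metadata = {}
    for path in sorted(Path(inputfiles_dir).glob("*/input_metadata.json")):
        with path.open() as f:
            metadata[path.parent.name] = json.load(f)
    return metadata


for name, meta in load_input_metadata().items():
    # The keys inside each file depend on how the input file was made.
    print(name, sorted(meta))
```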

By default, the make_all_input_files() function will add all the data from 2005 through 2020 from the data sources described above associated with NOₓ, including variables for Stage 2 training. However, you can pass keyword arguments to change the behavior in many ways.

Note that it takes a long time, approximately an hour on Animus, to create an input netCDF, as can be seen by the timing information in the output below. This is largely due to the process of creating the Stage 2 data variables which involves a nested for-loop. Because there is presently little need to create many input files rapidly, I have not optimized this part of the code.

from unox.input import make_all_input_files

make_all_input_files(
    output_dir='no2_2005-2020',
)

Note: It may take around an hour to generate all input files.
Creating y input data...
	Creating y input data for nox in 2005...
	Creating y input data for nox in 2006...
	Creating y input data for nox in 2007...
	Creating y input data for nox in 2008...
	Creating y input data for nox in 2009...
	Creating y input data for nox in 2010...
	Creating y input data for nox in 2011...
	Creating y input data for nox in 2012...
	Creating y input data for nox in 2013...
	Creating y input data for nox in 2014...
	Creating y input data for nox in 2015...
	Creating y input data for nox in 2016...
	Creating y input data for nox in 2017...
	Creating y input data for nox in 2018...
	Creating y input data for nox in 2019...
	Creating y input data for nox in 2020...
Concatenating the y datasets
Saving y inputs to inputfiles/no2_2005-2020/no2_2005-2020.nc
Creating x input data...
	Creating x input file for 2005...
	Creating x input file for 2006...
	Creating x input file for 2007...
	Creating x input file for 2008...
	Creating x input file for 2009...
	Creating x input file for 2010...
	Creating x input file for 2011...
	Creating x input file for 2012...
	Creating x input file for 2013...
	Creating x input file for 2014...
	Adding stage 2 data for no2 in 2014
	Function 'fill_w_insitu' execution time: 0:06:42.550980
	Creating x input file for 2015...
	Adding stage 2 data for no2 in 2015
	Function 'fill_w_insitu' execution time: 0:06:58.153890
	Creating x input file for 2016...
	Adding stage 2 data for no2 in 2016
	Function 'fill_w_insitu' execution time: 0:07:04.098303
	Creating x input file for 2017...
	Adding stage 2 data for no2 in 2017
	Function 'fill_w_insitu' execution time: 0:06:56.500722
	Creating x input file for 2018...
	Adding stage 2 data for no2 in 2018
	Function 'fill_w_insitu' execution time: 0:07:04.373900
	Creating x input file for 2019...
	Adding stage 2 data for no2 in 2019
	Function 'fill_w_insitu' execution time: 0:07:05.905807
	Creating x input file for 2020...
	Adding stage 2 data for no2 in 2020
	Function 'fill_w_insitu' execution time: 0:07:14.309329
Concatenating the x datasets
Saving x inputs to inputfiles/no2_2005-2020/no2_2005-2020.nc
Sorting the dataset by time.
Sorting the y data by time.
Completed making all input files.
	Function 'make_all_input_files' execution time: 1:10:13.798157

Note that, in the process of making an input file, all instances of February 29th are dropped using the method convert_calendar('noleap'). This is to make the years all the same length.
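
The effect of dropping February 29th can be illustrated with the standard library alone. This sketch is not the xarray call used when making the input file; the days_without_leap_day helper is hypothetical and only shows that every year ends up with the same number of days.

```python
from datetime import date, timedelta


def days_without_leap_day(year):
    """Return every date in the year except February 29th."""
    day = date(year, 1, 1)
    days = []
    while day.year == year:
        if (day.month, day.day) != (2, 29):
            days.append(day)
        day += timedelta(days=1)
    return days


print(len(days_without_leap_day(2019)), len(days_without_leap_day(2020)))
```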

Once an input file has been created, you can now go through the Workflow of setting up a run on Animus, transferring that to Trillium, running the U-net model, and bringing the result back to Animus for analysis.