Running the model

The documentation below describes how to run the U-net model and retrieve the results. This guide assumes you have followed the instructions on the Installation and Data pages.


Introduction

As mentioned under “Creating virtual environments” on the Installation page, the two remote machines, Trillium and Animus, are used for different parts of this project. Trillium has GPU resources which allow the U-net model to run quickly and efficiently. However, because Trillium is an Alliance Canada system, it is fairly restrictive, which makes plotting and analysis tasks difficult there. Therefore, after running the model on Trillium, I transfer the output to Animus for analysis. Animus also holds the data used to create the inputs for the model. In general, I use Animus for every task in this project except running the model itself.

This guide details how to prepare a model run on Animus, transfer that preparation to Trillium and run the model, then transfer the model output back to Animus. A demonstration of how to use the analysis tools can be found in the Example usage.



Preparing a model run

The preparation for a model run starts on Animus. In principle, you could use your local machine and avoid Animus altogether. However, to do so you would need to download the relevant data, some of which is not currently publicly available.


From Animus to HPC

The process of creating an input netCDF is explained on the Data page. Below is an explanation of the command used to transfer input files from Animus to HPC. You only need to do this once per input file: if you plan on running many jobs with the same input file, you do not need to repeat this step each time.

The script HPC_from_animus.sh facilitates the transfer so you do not need to remember how to format an scp command. It takes the following arguments:

  • -f: Filename

    • The name of the file to transfer.

    • Can be used on its own to transfer individual files, with one -f flag per file.

  • -i: Inputfile

    • If specified, the script will look for an input file with the name given in the -f flag.

    • This flag takes no argument; it is a simple on/off switch.

  • -c: Cluster

    • The name of the cluster to transfer to. The default is trillium.

Below is an example of transferring the no2_2005-2020 input file from Animus to Trillium. Note: this must be run on Animus.

(env_name) username@animus-c:~/unox$ bash HPC_from_animus.sh -f no2_2005-2020 -i 
-c, No cluster specified, defaulting to trillium
-i, Copying full input file directory for no2_2005-2020 to trillium from Animus
Enter passphrase for key '/home/<username>/.ssh/<GH_id>': 
(<username>@trillium.alliancecan.ca) Duo two-factor login for <username>

Enter a passcode or select one of the following options:

 1. Duo Push to <mobile device>

Passcode or option (1-1): 1
Success. Logging you in...
input_metadata.json         100% 1319   232.4KB/s   00:00    
no2_2005-2020.nc            100% 3882MB  77.1MB/s   00:50
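
For readers who want to see what such a transfer involves, here is a minimal sketch of how an scp upload command could be assembled. It is not the contents of HPC_from_animus.sh: the helper build_scp_command, the username alice, and both paths are assumptions made for illustration. Only the host name and the scp -r flag are taken from the session above and from scp itself.

```python
import shlex


def build_scp_command(local_path, remote_path, user, host, recursive=False):
    """Return the argument list for an scp upload (hypothetical helper)."""
    command = ["scp"]
    if recursive:
        command.append("-r")  # copy a directory and everything in it
    command += [local_path, f"{user}@{host}:{remote_path}"]
    return command


# Assumed values for illustration only: the user "alice" and both paths.
command = build_scp_command(
    local_path="inputfiles/no2_2005-2020",
    remote_path="/scratch/alice/unox/inputfiles/",
    user="alice",
    host="trillium.alliancecan.ca",
    recursive=True,
)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if a file name ever contains spaces.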


Input configuration files

The parameters that a model run will use are defined in “input configuration” files. These are .json files stored in inputfiles/_input_configs/. The contents of the default configuration file, inputfiles/_input_configs/sample_config.json, are shown below.

{
    "input_set": "no2_2005-2020",
    "x_vars": [
        "no2",
        "no2_tm1",
        "u10",
        "v10",
        "blh",
        "sp",
        "skt",
        "t2m",
        "ssrd"
    ],
    "zfi_vars": [
    ],
    "lsm_vars": [
    ],
    "stage_2": true,
    "stage_2_cutoff": 2013,
    "n_epochs": 100,
    "split_year": 2019,
    "split_value": 0.9,
    "grid_size": [56, 120],
    "act_reg": "L1",
    "act_reg_factor": 1e-08
}

All configuration files should follow this format. The attributes are explained below:

  • input_set: The name of the input netCDF to use.

  • x_vars: The list of variables to use as input to the model.

    • See the Data page for documentation of these variables.

    • Note, the y-variable is determined by an attribute in the input file.

  • zfi_vars: A list of variables for which to run Zeroed-Feature Importance experiments.

  • lsm_vars: A list of variables on which to apply the land-sea mask (lsm).

  • stage_2: A boolean as to whether to run Stage 2 of training.

  • stage_2_cutoff: The cutoff year for Stage 2 training.

    • Stage 2 training will start the year after the one specified here.

  • n_epochs: The number of epochs for which to run the training.

    • More epochs give the model more chances to improve its predictions, but extend the run time.

  • split_year: The year on which to make the split between the training / testing data and the validation data.

    • Note that this is inclusive. For example, if split_year is 2019, the data from 2019 and all following years will be kept for validation. The y-variable in the validation data is never shown to the model.

  • split_value: The fraction of the data to be used for training, with the remainder used for testing.

    • Note that this applies to the data left over after splitting off the validation data.

  • grid_size: A list of the number of grid cells to use in latitude and longitude.

  • act_reg: The type of activity regularizer to use in the model.

  • act_reg_factor: The value of the factor to use in the activity regularizer.
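
To make the year-based attributes concrete, the sketch below works through the sample configuration for an input set covering 2005 to 2020. The helper names partition_years and split_counts are hypothetical, and truncating the training count to an integer is an assumption about how the split is rounded. With these assumptions, the year lists match the train_years and pred_years recorded in the example log later on this page, and the counts match the array shapes printed there (5096 samples split into 4586 and 510).

```python
import json

# Only the attributes needed for this illustration, copied from sample_config.json.
config = json.loads(
    '{"stage_2": true, "stage_2_cutoff": 2013, "split_year": 2019, "split_value": 0.9}'
)


def partition_years(first_year, last_year, config):
    """Split a range of years into stage 1, stage 2, and validation years."""
    years = list(range(first_year, last_year + 1))
    # split_year is inclusive: it and all later years are held out for validation.
    validation = [y for y in years if y >= config["split_year"]]
    stage1 = [y for y in years if y < config["split_year"]]
    # Stage 2 starts the year after stage_2_cutoff.
    stage2 = [y for y in stage1 if y > config["stage_2_cutoff"]] if config["stage_2"] else []
    return {"stage1": stage1, "stage2": stage2, "validation": validation}


def split_counts(n_samples, split_value):
    """Return (training, testing) sample counts; truncation is an assumption."""
    n_train = int(n_samples * split_value)
    return n_train, n_samples - n_train


parts = partition_years(2005, 2020, config)
print(parts["stage1"][0], parts["stage1"][-1])  # first and last stage 1 year
print(parts["stage2"])
print(parts["validation"])
print(split_counts(5096, config["split_value"]))
```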

Note that JSON syntax differs slightly from that of a Python dictionary:

  • Lists cannot have a comma after the last item in the list.

  • Boolean values must be lower case. That is, true and false.
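
Both rules can be checked with Python's standard json module before transferring a configuration file. The snippet below is a small sketch; the helper is_valid_json is a hypothetical name, and the strings are shortened stand-ins for a real configuration file.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Valid JSON: lower-case true, no trailing comma.
config = json.loads('{"stage_2": true, "grid_size": [56, 120]}')
print(type(config["stage_2"]))  # JSON true is loaded as a Python bool

# Valid as Python literals, but rejected by the JSON parser.
trailing_comma = '{"x_vars": ["no2", "u10",]}'
python_bool = '{"stage_2": True}'
print(is_valid_json(trailing_comma), is_valid_json(python_bool))
```

Running a check like this on Animus catches syntax mistakes before a job spends time in the queue.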

When preparing for a model run, make sure the configuration file you wish to use is present on the HPC cluster in the inputfiles/_input_configs/ directory. This can be accomplished by creating a configuration file on Animus, then using the HPC_from_animus.sh script to transfer it.

(env_name) username@animus-c:~/unox$ bash HPC_from_animus.sh -f inputfiles/_input_configs/my_new_config.json 
-c, No cluster specified, defaulting to trillium
Enter passphrase for key '/home/<username>/.ssh/<GH_id>': 
(<username>@trillium.alliancecan.ca) Duo two-factor login for <username>

Enter a passcode or select one of the following options:

1. Duo Push to <mobile device>

Passcode or option (1-1): 1
Success. Logging you in...
my_new_config.json         100%  443   137.4KB/s   00:00

Alternatively, you can create a new configuration file directly on HPC, which is what I usually do.



Running the model on HPC

To actually run the model, log in to the HPC cluster, in this case Trillium.


Submitting a model run

I have created a script, HPC_job_submit.sh, which handles much of the boilerplate needed to submit a job to the Alliance Canada scheduler. It takes the following arguments:

  • -j: Job name

    • The name of the job to submit, the default being test_unet.

    • This should be a short and identifiable name (e.g., grid_test0, grid_test1, etc.).

    • If a directory under HPC_runs/ with the specified name already exists, the script will prompt you to decide whether to overwrite it.

  • -i: Input configuration file

    • The name of the configuration file in the inputfiles/_input_configs/ directory to use.

    • The default is sample_config.

  • -t: Type of run

    • The type of model run to use. Current options are:

      • test: The default job which runs the run_model.py script once using the HPC_GPU_slurm.sh launcher.

      • zfi_set: A Zeroed Feature Importance run. This runs the run_model.py script a number of times equal to the number of “x” input variables using the HPC_GPU_slurm.sh launcher.

  • -v: Version

    • The version of the code to use, either 1 (default, current code) or 0 (legacy code).

    • This was implemented during the transition from Mist to Trillium and is deprecated. You can safely ignore this argument if only running on Trillium.

  • -c: Cluster

    • The name of the cluster to run on. The default is trillium.

Here is an example of submitting a job named no2_example_run on HPC:

username@HPC: unox$ bash HPC_job_submit.sh -j no2_example_run
===== Begin HPC_job_submit.sh =====
-j, Name specified, using JOBNAME=no2_example_run
-i, No config file specified, using CONFIG_FILE=sample_config
    Configuration file inputfiles/_input_configs/sample_config.json found.
-t, No run type specified, using TYPE=test
    Using LAUNCHER=HPC_GPU_slurm.sh
-v, No version specified, using VERSION=1
-c, Using cluster: trillium
Directory for job HPC_runs/no2_example_run already exists
Would you like to overwrite it? (y/n)
y
Overwriting directory HPC_runs/no2_example_run
Sending HPC notifications to email: <your_email@domain>
Submitted batch job 199403
[<username>@trig-login01 unox]$ 

The output can be used to confirm you set the arguments as expected.
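
If you submit several jobs with different settings, it can help to generate the command lines rather than type them. The sketch below only builds the argument lists from the flags documented above; it does not submit anything. The helper build_submit_command and the job and configuration names are hypothetical, and the deprecated -v flag is left out.

```python
RUN_TYPES = ("test", "zfi_set")  # run types documented for the -t flag


def build_submit_command(job_name="test_unet", config="sample_config",
                         run_type="test", cluster="trillium"):
    """Return the argument list for one HPC_job_submit.sh call (hypothetical helper)."""
    if run_type not in RUN_TYPES:
        raise ValueError(f"unknown run type {run_type!r}; expected one of {RUN_TYPES}")
    return ["bash", "HPC_job_submit.sh",
            "-j", job_name, "-i", config, "-t", run_type, "-c", cluster]


# Hypothetical job and configuration names, for illustration only.
for index in range(2):
    command = build_submit_command(job_name=f"grid_test{index}",
                                   config=f"grid_config{index}")
    print(" ".join(command))
```

Because the script prompts before overwriting an existing directory under HPC_runs/, unattended use works best with job names that do not already exist.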


Monitoring a job

Most standard jobs take around 40 minutes to an hour to run. There are three main ways to monitor jobs as they are running: by email, the scheduler queue, or the log file. For more information, see the Alliance Canada documentation on Monitoring Jobs.

Email monitoring

By adding your email to the HPC_slurm.sh script on HPC as described on the Installation page, you should receive emails every time a job begins or ends. Here is an example of the email sent when a job starts:

Subject line: Trillium-GPU slurm Job_id=199403 Name=no2_example_run Began, Queued time 00:00:01

Body of message:

scontrol show jobid 199403
JobId=199403 JobName=no2_example_run
  UserId=<username>(<userID>) GroupId=<username>(<userID>) MCS_label=N/A
  Priority=958038 Nice=0 Account=def-dylan QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:00:02 TimeLimit=01:00:00 TimeMin=N/A
  SubmitTime=2026-01-13T14:13:03 EligibleTime=2026-01-13T14:13:03
  AccrueTime=2026-01-13T14:13:03
  StartTime=2026-01-13T14:13:04 EndTime=2026-01-13T15:13:04 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-13T14:13:04 Scheduler=Main
  Partition=compute AllocNode:Sid=trig-login01:1848577
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=trig0012
  BatchHost=trig0012
  NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  ReqTRES=cpu=1,mem=192500M,node=1,billing=1,gres/gpu=1
  AllocTRES=cpu=24,mem=192500M,node=1,billing=1,gres/gpu=1
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryNode=192500M MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/scratch/<username>/unox/HPC_GPU_slurm.sh
  WorkDir=/scratch/<username>/unox
  Comment=/opt/slurm/bin/sbatch --export=NONE --get-user-env=L --job-name=no2_example_run HPC_GPU_slurm.sh -j no2_example_run -i default -t test -v 1 -c trillium
  StdErr=/scratch/<username>/unox/HPC_runs/no2_example_run/log_199403.txt
  StdIn=/dev/null
  StdOut=/scratch/<username>/unox/HPC_runs/no2_example_run/log_199403.txt
  CpusPerTres=gpu:24
  TresPerNode=gres/gpu:1
  MailUser=<your_email@domain> MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

I highly recommend setting up a rule in your email client to automatically send emails from the scheduler to its own folder so they don’t clog up your main inbox. You will not receive emails before the job has started and, depending on the day and how long you allot for the job to run, it can spend a significant time in the queue.

Once a job completes, you will receive an email like the one below:

Subject line: Trillium-GPU slurm Job_id=199403 Name=no2_example_run Ended, Run time 00:44:00, COMPLETED, ExitCode 0

Body of message:

scontrol show jobid 199403
JobId=199403 JobName=no2_example_run
  UserId=<username>(<user_ID>) GroupId=<username>(<user_ID>) MCS_label=N/A
  Priority=958038 Nice=0 Account=def-dylan QOS=normal
  JobState=COMPLETED Reason=None Dependency=(null)
  Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:44:00 TimeLimit=01:00:00 TimeMin=N/A
  SubmitTime=2026-01-13T14:13:03 EligibleTime=2026-01-13T14:13:03
  AccrueTime=2026-01-13T14:13:03
  StartTime=2026-01-13T14:13:04 EndTime=2026-01-13T14:57:04 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-13T14:13:04 Scheduler=Main
  Partition=compute AllocNode:Sid=trig-login01:1848577
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=trig0012
  BatchHost=trig0012
  NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  ReqTRES=cpu=1,mem=192500M,node=1,billing=1,gres/gpu=1
  AllocTRES=cpu=24,mem=192500M,node=1,billing=1,gres/gpu=1
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryNode=192500M MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/scratch/<username>/unox/HPC_GPU_slurm.sh
  WorkDir=/scratch/<username>/unox
  Comment=/opt/slurm/bin/sbatch --export=NONE --get-user-env=L --job-name=no2_example_run HPC_GPU_slurm.sh -j no2_example_run -i default -t test -v 1 -c trillium
  StdErr=/scratch/<username>/unox/HPC_runs/no2_example_run/log_199403.txt
  StdIn=/dev/null
  StdOut=/scratch/<username>/unox/HPC_runs/no2_example_run/log_199403.txt
  CpusPerTres=gpu:24
  TresPerNode=gres/gpu:1
  MailUser=<your_email@domain> MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT


sacct -j 199403
JobID           JobName    Account    Elapsed  MaxVMSize     MaxRSS  SystemCPU    UserCPU ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
199403       write_doc+  def-dylan   00:44:00                        04:10.694  15:13.927      0:0
199403.batch      batch  def-dylan   00:44:00          0  69889304K  04:10.692  15:13.927      0:0
199403.exte+     extern  def-dylan   00:44:00          0        28K  00:00.001   00:00:00      0:0

Scheduler queue monitoring

To monitor the jobs you have submitted, including those in the queue, you can use the squeue command and add the -u flag with your username. To make this command easy to execute, I recommend adding this line to your ~/.bashrc file on HPC:

# .bashrc
...
alias mysq='squeue -u <username>'
...

Then, monitoring the queue looks like this:

username@HPC: unox$ mysq
  JOBID           USER      ACCOUNT             NAME  ST  TIME_LEFT  PARTITION NODES  TRES_PER_NODE NODELIST (REASON)
 199403     <username>    def-dylan  no2_example_run   R      40:24    compute     1     gres/gpu:1 trig0012 (None)

The output is formatted to be very wide, so to make the columns line up correctly, you need to make your console window wide enough.

The ST column gives the status, which is usually R for running or PD for pending. Once a job completes, it will no longer show up in this queue and you should get an email to notify you that it is done. The TIME_LEFT column gives the amount of allocated time left that the job can use. Jobs will run until they complete or they hit this limit.
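
As a rough illustration of reading these columns in a script, the sketch below pulls the job ID, status, and remaining time out of one row of the output shown above and converts TIME_LEFT to seconds. The helper parse_time_left is hypothetical, and the username alice stands in for the real one. It assumes the usual Slurm time layout of [days-][hours:]minutes:seconds and would raise a ValueError on non-numeric values.

```python
def parse_time_left(text):
    """Convert a Slurm time string such as '40:24' or '1-02:03:04' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(piece) for piece in text.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing leading fields (hours) with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# One row in the column order shown above, with "alice" as a stand-in username.
row = "199403 alice def-dylan no2_example_run R 40:24 compute 1 gres/gpu:1 trig0012 (None)"
fields = row.split()
job = {"JOBID": fields[0], "ST": fields[4], "TIME_LEFT": fields[5]}
print(job["JOBID"], job["ST"], parse_time_left(job["TIME_LEFT"]))
```

Only the first six columns are read here, since the trailing NODELIST and reason fields can differ for pending jobs.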

If you need to cancel a job, use the scancel command with the job ID.

username@HPC: unox$ scancel 199403
scancel: Terminating job 199403

Log file monitoring

The last way to monitor a job is through the log file, which is continuously updated while the job is running. These logs are in HPC_runs/<name_of_run>/log_<job_ID>.txt on HPC and capture everything the code sends to standard output, such as echo and print statements. If you open this file in VSCodium, it will update every time you navigate back to that tab. This can be a useful way to see what part of the code a particular run is currently in. Note that the log files are very extensive, reaching tens of thousands of lines.

Relevant sections of an example log file:
===== Begin HPC_slurm.sh =====
-j, Name specified, using JOBNAME=no2_example_run
-i, Input files specified, using CONFIG_FILE=default
-t, Run type specified, using TYPE=test
    Using CODEFILE=src/unox/HPC/run_model.py
-c, Using cluster: trillium

Loading modules for Trillium HPC environment
-v 1, using updated code
Activating virtualenv from /home/<username>/.virtualenvs/unoxTrilliumNC/bin/activate

Directory for job HPC_runs/no2_example_run already exists

Running src/unox/HPC/run_model.py with savedir HPC_runs/no2_example_run

2026-01-13 14:13:22.248220: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-01-13 14:13:23.058670: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-01-13 14:13:23.304601: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-01-13 14:13:23.380209: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-01-13 14:13:23.980752: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-13 14:13:31.845597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78763 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:06:00.0, compute capability: 9.0

===== Begin run_model.py =====
Current working directory: /scratch/<username>/unox
Using input arguments:
	argv[1], savedir: HPC_runs/no2_example_run/
	argv[2], config_file: inputfiles/_input_configs/sample_config.json
	argv[3], version: 1
	Shape of first xtrain file: (364, 56, 120, 9)
	Shape of first ytrain file: (364, 56, 120, 1)
After concatenation:
	Shape of xtrain: (5096, 56, 120, 9)
	Shape of ytrain: (5096, 56, 120, 1)
After data split:
	Shape of xtrain: (4586, 56, 120, 9)
	Shape of ytrain: (4586, 56, 120, 1)
	Shape of xvalid: (510, 56, 120, 9)
	Shape of yvalid: (510, 56, 120, 1)
Done loading data sets for stage 1
(56, 120, 9)
	Shape of model input layer to build: ((56, 120, 9))
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ model_input         │ (None, 56, 120,   │          0 │ -                 │
│ (InputLayer)        │ 9)                │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block1_Conv1        │ (None, 56, 120,   │     10,496 │ model_input[0][0] │
│ (Conv2D)            │ 128)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block1_Conv2        │ (None, 56, 120,   │    295,168 │ Block1_Conv1[0][… │
│ (Conv2D)            │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block1_MaxPool      │ (None, 28, 60,    │          0 │ Block1_Conv2[0][… │
│ (MaxPooling2D)      │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block2_Conv1        │ (None, 28, 60,    │    590,080 │ Block1_MaxPool[0… │
│ (Conv2D)            │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block2_Conv2        │ (None, 28, 60,    │  1,180,160 │ Block2_Conv1[0][… │
│ (Conv2D)            │ 512)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block2_MaxPool      │ (None, 14, 30,    │          0 │ Block2_Conv2[0][… │
│ (MaxPooling2D)      │ 512)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block3_Conv1        │ (None, 14, 30,    │  2,359,808 │ Block2_MaxPool[0… │
│ (Conv2D)            │ 512)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block3_Conv2        │ (None, 14, 30,    │  4,719,616 │ Block3_Conv1[0][… │
│ (Conv2D)            │ 1024)             │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block3_MaxPool      │ (None, 7, 15,     │          0 │ Block3_Conv2[0][… │
│ (MaxPooling2D)      │ 1024)             │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block4_Conv1        │ (None, 7, 15,     │  9,438,208 │ Block3_MaxPool[0… │
│ (Conv2D)            │ 1024)             │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block4_Conv2        │ (None, 7, 15,     │  9,438,208 │ Block4_Conv1[0][… │
│ (Conv2D)            │ 1024)             │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block4_Permute1     │ (None, 1024, 7,   │          0 │ Block4_Conv2[0][… │
│ (Permute)           │ 15)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block4_Reshape      │ (None, 1024, 105) │          0 │ Block4_Permute1[… │
│ (Reshape)           │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block4_Permute2     │ (None, 105, 1024) │          0 │ Block4_Reshape[0… │
│ (Permute)           │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ LSTM1 (LSTM)        │ (None, 105, 1024) │  8,392,704 │ Block4_Permute2[… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ LSTM2 (LSTM)        │ (None, 105, 1024) │  8,392,704 │ LSTM1[0][0]       │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block5_Reshape      │ (None, 7, 15,     │          0 │ LSTM2[0][0]       │
│ (Reshape)           │ 1024)             │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block5_UpConv       │ (None, 14, 30,    │  2,097,664 │ Block5_Reshape[0… │
│ (Conv2DTranspose)   │ 512)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lambda (Lambda)     │ (None, 14, 30,    │          0 │ Block5_UpConv[0]… │
│                     │ 512)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lambda_1 (Lambda)   │ (None, 14, 30,    │          0 │ Block2_Conv2[0][… │
│                     │ 512)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate         │ (None, 14, 30,    │          0 │ lambda[0][0],     │
│ (Concatenate)       │ 2048)             │            │ Block3_Conv2[0][… │
│                     │                   │            │ lambda_1[0][0]    │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block5_Conv1        │ (None, 14, 30,    │  4,718,848 │ concatenate[0][0] │
│ (Conv2D)            │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block5_Conv2        │ (None, 14, 30,    │    590,080 │ Block5_Conv1[0][… │
│ (Conv2D)            │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block6_UpConv       │ (None, 28, 60,    │    262,400 │ Block5_Conv2[0][… │
│ (Conv2DTranspose)   │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lambda_2 (Lambda)   │ (None, 28, 60,    │          0 │ Block6_UpConv[0]… │
│                     │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lambda_3 (Lambda)   │ (None, 28, 60,    │          0 │ Block1_Conv2[0][… │
│                     │ 256)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate_1       │ (None, 28, 60,    │          0 │ lambda_2[0][0],   │
│ (Concatenate)       │ 1024)             │            │ Block2_Conv2[0][… │
│                     │                   │            │ lambda_3[0][0]    │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block6_Conv1        │ (None, 28, 60,    │  1,179,776 │ concatenate_1[0]… │
│ (Conv2D)            │ 128)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block6_Conv2        │ (None, 28, 60,    │    147,584 │ Block6_Conv1[0][… │
│ (Conv2D)            │ 128)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block7_UpConv       │ (None, 56, 120,   │     65,664 │ Block6_Conv2[0][… │
│ (Conv2DTranspose)   │ 128)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lambda_4 (Lambda)   │ (None, 56, 120,   │          0 │ Block7_UpConv[0]… │
│                     │ 128)              │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ concatenate_2       │ (None, 56, 120,   │          0 │ lambda_4[0][0],   │
│ (Concatenate)       │ 384)              │            │ Block1_Conv2[0][… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block7_Conv1        │ (None, 56, 120,   │    221,248 │ concatenate_2[0]… │
│ (Conv2D)            │ 64)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Block7_Conv2        │ (None, 56, 120,   │     36,928 │ Block7_Conv1[0][… │
│ (Conv2D)            │ 64)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ model_output        │ (None, 56, 120,   │         65 │ Block7_Conv2[0][… │
│ (Conv2D)            │ 1)                │            │                   │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘
 Total params: 54,137,409 (206.52 MB)
 Trainable params: 54,137,409 (206.52 MB)
 Non-trainable params: 0 (0.00 B)

#### Begin training stage 1 ####
Epoch 1/250
2026-01-13 14:13:42.012413: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8905
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1768331622.661541 3933740 gpu_timer.cc:114] Skipping the delay kernel, measurement accuracy will be reduced
...
W0000 00:00:1768331626.336762 3933742 gpu_timer.cc:114] Skipping the delay kernel, measurement accuracy will be reduced

  1/153 ━━━━━━━━━━━━━━━━━━━━ 20:16 8s/step - loss: 109.0138 - msenonzero: 109.0138 - r2_keras: -0.0754
  3/153 ━━━━━━━━━━━━━━━━━━━━ 6s 44ms/step - loss: 108.5387 - msenonzero: 108.5387 - r2_keras: -0.0764 
...

Epoch 250: val_loss did not improve from 7.37319

55/55 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - loss: 5.4319 - msenonzero: 5.4319 - r2_keras: 0.9313 - val_loss: 7.3870 - val_msenonzero: 7.3870 - val_r2_keras: 0.9012
Generating predictions for year: 2019

 1/12 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step
 5/12 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
 9/12 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
Generating predictions for year: 2020

 1/12 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
 5/12 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
 9/12 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
output_metadata: {'savedir': 'HPC_runs/no2_example_run/', 'config_path': 'inputfiles/_input_configs/sample_config.json', 'config_dict': {'input_set': 'no2_2005-2020', 'x_vars': ['no2', 'no2_tm1', 'u10', 'v10', 'blh', 'sp', 'skt', 't2m', 'ssrd'], 'stage_2': True, 'stage_2_cutoff': 2013, 'lsm_vars': [], 'grid_size': [56, 120]}, 'version': 1, 'n_epochs': 250, 'model_fmt': 'keras', 'input_fmt': 'nc', 'split_year': 2019, 'split_value': 0.9, 'train_years': {'stage1': [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018], 'stage2': [2014, 2015, 2016, 2017, 2018]}, 'pred_years': {'stage1': [2019, 2020], 'stage2': [2019, 2020]}, 'unet_build_shape': (56, 120, 9)}

Done running test_unet.py

scontrol show job 199403
JobId=199403 JobName=no2_example_run
   UserId=<username>(<user_ID>) GroupId=<username>(<user_ID>) MCS_label=N/A
   Priority=958038 Nice=0 Account=def-dylan QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:44:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2026-01-13T14:13:03 EligibleTime=2026-01-13T14:13:03
   AccrueTime=2026-01-13T14:13:03
   StartTime=2026-01-13T14:13:04 EndTime=2026-01-13T14:57:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-13T14:13:04 Scheduler=Main
   Partition=compute AllocNode:Sid=trig-login01:1848577
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=trig0012
   BatchHost=trig0012
   NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=192500M,node=1,billing=1,gres/gpu=1
   AllocTRES=cpu=24,mem=192500M,node=1,billing=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=192500M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/<username>/unox/HPC_GPU_slurm.sh
   WorkDir=/scratch/<username>/unox
   Comment=/opt/slurm/bin/sbatch --export=NONE --get-user-env=L --job-name=no2_example_run HPC_GPU_slurm.sh -j no2_example_run -i default -t test -v 1 -c trillium 
   StdErr=/scratch/<username>/unox/HPC_runs/no2_example_run/log_199403.txt
   StdIn=/dev/null
   StdOut=/scratch/<username>/unox/HPC_runs/no2_example_run/log_199403.txt
   CpusPerTres=gpu:24
   TresPerNode=gres/gpu:1
   MailUser=<your_email@domain> MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   

sacct -j 199403
JobID           JobName    Account    Elapsed  MaxVMSize     MaxRSS  SystemCPU    UserCPU ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
199403       write_doc+  def-dylan   00:44:00                         00:00:00   00:00:00      0:0 
199403.batch      batch  def-dylan   00:44:00                         00:00:00   00:00:00      0:0 
199403.exte+     extern  def-dylan   00:44:00                         00:00:00   00:00:00      0:0 

The end of the log file contains the same information as in the email notification that the job finished running.
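
Since the logs are so long, a small script can be a quicker way to see how far training has progressed. The sketch below scans log lines for the "Epoch N/M" markers seen in the example above and reports the most recent one. The helper latest_epoch is hypothetical, and the "Epoch 2/250" line is added only for illustration.

```python
import re

# Matches progress markers such as "Epoch 1/250", but not "Epoch 250: val_loss ...".
EPOCH_PATTERN = re.compile(r"^Epoch (\d+)/(\d+)$")


def latest_epoch(log_lines):
    """Return (current, total) for the last epoch marker seen, or None if there is none."""
    latest = None
    for line in log_lines:
        match = EPOCH_PATTERN.match(line.strip())
        if match:
            latest = (int(match.group(1)), int(match.group(2)))
    return latest


sample_lines = [
    "#### Begin training stage 1 ####",
    "Epoch 1/250",
    "Epoch 2/250",  # illustrative line, not taken from the example log
    "Epoch 250: val_loss did not improve from 7.37319",
]
print(latest_epoch(sample_lines))

# On HPC, the same function could be applied to a real log file, for example:
# with open("HPC_runs/<name_of_run>/log_<job_ID>.txt") as log_file:
#     print(latest_epoch(log_file))
```

Because each training stage prints its own epoch markers, the result reflects whichever stage was most recently written to the log.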



Gathering model output

Once the job has completed successfully, the next step is to get the relevant output back to Animus for analysis.


From HPC to Animus

The script HPC_to_animus.sh facilitates the transfer of job outputs so you do not need to remember how to format an scp command. It takes the following arguments:

  • -f: Filename

    • The name of the file to transfer.

    • Can be used on its own to transfer individual files, with one -f flag per file.

  • -j: HPC job

    • If specified, the script will look for a directory within HPC_runs/ with the name given in the -f flag.

    • This flag takes no argument; it is a simple on/off switch.

  • -c: Cluster

    • The name of the cluster to transfer from. The default is trillium.

  • -m: Model

    • Whether to transfer the model (.keras) files. Note, these files are large.

    • The default behavior will not transfer model files.

    • This flag takes no argument; it is a simple on/off switch.

Below is an example of transferring the no2_example_run model run from Trillium to Animus. Note: this must be run on Animus.

(env_name) username@animus-c:~/unox$ bash HPC_to_animus.sh -f no2_example_run -j 
-c, No cluster specified, defaulting to trillium
-j, Copying full HPC job directory for no2_example_run from trillium to Animus
Enter passphrase for key '/home/<username>/.ssh/<GH_id>': 
(<username>@trillium.alliancecan.ca) Duo two-factor login for <username>

Enter a passcode or select one of the following options:

 1. Duo Push to <mobile device>

Passcode or option (1-1): 1
Success. Logging you in...
Completed file transfer to Animus

Now that the output from the model run is back on Animus, it can be analyzed. See the Analysis page for details. To run multiple jobs at once, see the guide on Running ensemble models.