
Merlin workflow example with heat conduction problem

"Kevin" Seung Whan Chung edited this page May 2, 2024 · 4 revisions

Motivation

The overall ROM workflow involves several stages: offline FOM sample generation, POD/DMD training, projection-ROM operator building, and online ROM prediction. In general, these stages cannot all be executed automatically with a single click:

  • some manual pre/post-processing is required at each stage;
  • even with shell scripts that automate such pre/post-processing, large-scale simulations must be submitted to a job-queueing system (such as slurm/moab), and each stage has to wait for its jobs to be queued and completed.

merlin makes the workflow seamless and automatic, even when combined with an HPC job submission system. The main advantages of using merlin are:

  • The entire workflow can be overviewed and orchestrated with one yaml file.
  • HPC job submission can be automatic: jobs are submitted, queued, and executed automatically as soon as their prerequisite jobs are complete.
  • Each run is self-contained: every execution of the workflow stores all its command scripts, results, and error messages in a separate directory. This helps with debugging the workflow and prevents unintended data corruption from previous runs.

Prerequisites

Installation

The installation procedure for merlin is described in the merlin documentation.

Server configuration

merlin can be used locally without any job submission, in which case no server needs to be configured. To support parallel job submission, a server must be configured for merlin. Detailed procedures are documented in the merlin documentation.

For LC machines, dedicated IT servers can be created and configured. For detailed instructions, see the LLNL LC Confluence page.

Heat conduction demo

This demo executes a workflow equivalent to examples/dmd/heat_conduction_hdf.sh. For the detailed setup of the workflow, see examples/merlin/heat_conduction_hdf.yaml.

Instruction

Starting location

Assuming libROM is built at $LIBROM_DIR, we move to $LIBROM_DIR/examples/merlin, where two files are available: heat_conduction_hdf.yaml and heat_conduction_hdf_samples.csv.

heat_conduction_hdf.yaml is the merlin config file for orchestrating the entire workflow. heat_conduction_hdf_samples.csv provides the sample parameter values that will be run in the workflow.

Case 1: running locally

If not using the server, we can use merlin locally. We first set up the batch type in heat_conduction_hdf.yaml:

batch:
   type: local

Simply running the following command will start the workflow:

merlin run --local heat_conduction_hdf.yaml

Results

Once the workflow is initiated, a new directory heat_conduction_hdf_cases is created, whether or not the run succeeds.

All results of the workflow we just executed are stored in a directory tagged with a time stamp, for example heat_conduction_hdf_cases/heat_conduction_hdf_20240502-142042.

This result directory has the following structure:

   heat_conduction_hdf_cases
   |- heat_conduction_hdf_$(time_label)
      |- dmd_data
      |- dmd_list
      |- merlin_info
      |- prepare_dir
      |- sample_foms
      |- test_fom

Each directory corresponds to:

  • dmd_data: snapshots for the training/test parameter values
  • dmd_list: list files for parameter values
  • merlin_info: detailed merlin info that corresponds to this run case
  • prepare_dir: command script/output/error of the step prepare_dir
  • sample_foms: command script/output/error of the step sample_foms
  • test_fom: command script/output/error of the step test_fom
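
Because every run gets its own time-stamped directory, repeated runs accumulate under heat_conduction_hdf_cases. The following is a minimal sketch for locating the most recent one. It assumes directory names follow the <study>_<YYYYMMDD-HHMMSS> pattern seen above, so that alphabetical order matches chronological order; the function name latest_run_dir is illustrative and not part of merlin.

```shell
#!/bin/sh
# Print the newest run directory under a merlin cases directory.
# Assumption: run directories are named <study>_<YYYYMMDD-HHMMSS>, so the
# sorted order of the glob expansion is also the chronological order.
# The function body is a subshell, so its variables do not leak to the caller.
latest_run_dir() (
    cases_dir=$1   # e.g. heat_conduction_hdf_cases
    study=$2       # e.g. heat_conduction_hdf
    latest=""
    for d in "$cases_dir"/"$study"_*; do
        [ -d "$d" ] || continue   # skip non-directories and unmatched globs
        latest=$d                 # the last match in sorted order is the newest
    done
    # Print the result; the status is non-zero if no run directory was found.
    [ -n "$latest" ] && printf '%s\n' "$latest"
)
```

From $LIBROM_DIR/examples/merlin, this could be used as run_dir=$(latest_run_dir heat_conduction_hdf_cases heat_conduction_hdf), after which $run_dir/test_fom and the other step directories can be inspected directly.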

For the step test_fom, the results are stored as follows:

  • MERLIN_FINISHED is an empty text file that indicates the successful run of the step test_fom.
  • test_fom.sh is the actual command line script that was executed for the step.
  • test_fom.out is the output result of executing test_fom.sh
  • test_fom.err is the error message from executing test_fom.sh, if failed. If successful, this file is an empty text file.
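
The file conventions above make it easy to check the state of a step from the command line. Below is a rough sketch, not a merlin feature: the function name step_status and the three state labels are made up for illustration, and treating a non-empty .err file as a sign of trouble is only a heuristic based on the description above.

```shell
#!/bin/sh
# Classify a step directory using the files merlin leaves in it:
#   finished  MERLIN_FINISHED exists
#   errors    no MERLIN_FINISHED, but some *.err file is non-empty
#   pending   neither of the above (not started, or still running quietly)
# The function body is a subshell, so 'exit' only leaves the function.
step_status() (
    step_dir=$1   # directory holding MERLIN_FINISHED, *.sh, *.out, *.err
    if [ -f "$step_dir/MERLIN_FINISHED" ]; then
        echo finished
        exit 0
    fi
    for err in "$step_dir"/*.err; do
        if [ -s "$err" ]; then   # -s: file exists and is not empty
            echo errors
            exit 0
        fi
    done
    echo pending
)
```

For instance, step_status "$run_dir/test_fom" prints finished for the successful run shown above, where run_dir stands for the time-stamped result directory.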

Case 2: running distributed

To run in distributed mode, set the batch type in heat_conduction_hdf.yaml to flux, and specify the bank, queue, and node resources:

batch:
   type: flux
   bank: asccasc
   queue: pdebug
   shell: /bin/bash
   nodes: 1

We first run the configuration file to initiate the workflow:

merlin run heat_conduction_hdf.yaml

Unlike running locally, this does not start the jobs immediately. Instead, it sends the workflow tasks to the server (configured for merlin), where they stay queued until workers process them. We then start the workers, which pick up and execute the queued jobs:

merlin run-workers heat_conduction_hdf.yaml

This produces command-line output similar to that in Case 1. Once all the jobs are finished, we stop the workers:

merlin stop-workers

This creates the same result directory as in Case 1.
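
Knowing when "all the jobs are finished" otherwise means checking by hand. One possible way to automate it, sketched below, is to poll for the MERLIN_FINISHED marker described in Case 1 before stopping the workers. The helper wait_for_steps is hypothetical and not part of merlin; which directories to pass, and suitable timeout and polling values, are assumptions that depend on the workflow.

```shell
#!/bin/sh
# Block until every listed directory contains MERLIN_FINISHED, or give up.
# Usage: wait_for_steps <timeout_seconds> <poll_seconds> <step_dir>...
# Returns 0 when all markers exist, 1 if the timeout is reached first.
# The function body is a subshell, so 'exit' only leaves the function.
wait_for_steps() (
    timeout=$1
    poll=$2
    shift 2
    waited=0
    while :; do
        missing=0
        for step_dir in "$@"; do
            [ -f "$step_dir/MERLIN_FINISHED" ] || missing=$((missing + 1))
        done
        [ "$missing" -eq 0 ] && exit 0          # every step has finished
        [ "$waited" -ge "$timeout" ] && exit 1  # timed out
        sleep "$poll"
        waited=$((waited + poll))
    done
)

# Possible use after 'merlin run' and 'merlin run-workers' (run_dir is the
# time-stamped result directory; only test_fom is shown because its layout
# is the one documented above):
#   wait_for_steps 3600 30 "$run_dir/test_fom" && merlin stop-workers
```

The timeout keeps the script from waiting forever if a step fails and never writes its marker; in that case the .err files in the step directories are the place to look.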