
Migrating Production Workflows to the Unified Workflow using the Strangler Fig Pattern

Christina Holt edited this page Feb 21, 2023 · 2 revisions

State of the Existing Workflow System

The collection of UFS Applications includes many code bases, each of which defines its own workflow:

  • Medium Range Weather (MRW) uses the global_workflow and is responsible for the GFS, GDAS, and GEFS operational systems
  • Short Range Weather (SRW) uses its self-contained workflow and is responsible for the RRFS, RTMA, and CMAQ operational systems
  • The Hurricane Analysis and Forecast System (HAFS) uses its self-contained workflow and is responsible for the HAFS operational system
  • The Reforecast and Reanalysis system
  • The Subseasonal to Seasonal (S2S) or Seasonal Forecast System (SFS) is under development.

Each App follows the NCEP Central Operations (NCO) standards for operational implementation (available here), meaning that each system is architected in a nearly identical way. The implementation of that architecture, however, can vary tremendously from App to App. To meet NCO standards, each App must be compatible with the ecFlow workflow manager and must run each of its tasks via run scripts, or drivers, that follow a layered design, as outlined in this figure.

[Figure: layered design of operational run scripts]

The job card, or submission script, is the interface between the platform scheduler (PBSPro, Slurm, etc.) and the workflow manager. ecFlow is used in operations, while Rocoto and Cylc are typically supported as alternatives in research settings. This script typically takes the form of a template for ecFlow and Cylc, and is generated automatically by Rocoto. The job card typically exports a minimal set of variables related to the time of day and the location of code and configuration files.

The J-Job is typically, but not necessarily, a bash script that sets up location (application/data directory) and temporal (date/cycle) variables, initializes the data and working directories, and calls the ex-script.

The ex-script is the driver for the bulk of the application, and may call one or more ush (utility bash) scripts. Neither of these layers is necessarily written in bash. The tasks in the ex-script layer include:

  • Data staging in the working directory
  • Setting task-specific parameters in respective configuration files
  • Running compiled executable(s)
  • Delivering data to downstream locations

This basic structure aims to ensure a clear delineation of responsibilities across many, many systems currently running in operations at NCEP.

Each of the Applications are used in both research and operations, so should maintain both reliability standards while being flexible enough to allow for innovation. This often means setting different parameters for a variety of compiled executables, some of which change the requirements on the order, number, or types of tasks to be run in the workflow. To coordinate such changes, all of the Applications have implemented some sort of "Configuration Layer" that is not typically delivered to NCO with an operational implementation.

The configuration layer is responsible for meshing a small subset of user-defined configuration files (typically written in bash, YAML, or INI formats) with a much larger set of default settings to produce an experiment setup configuration file (in a likely corresponding format). Alongside that experiment file (or files), the configuration layer may also write workflow manager files (Rocoto XML, job cards for ecFlow or Cylc), component configuration files (namelists, XMLs, etc.), and create a work space for the experiment. It may also link in pre-configured static data (fix files).

In addition to the creation of these files, the configuration layer will typically also perform some sort of validation of the user-provided settings. There may be checks for whether certain choices work together nicely, if the data type is correctly specified, or if file paths exist on disk.
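A minimal sketch of this meshing step, assuming the default and user configuration files have already been parsed into Python dictionaries; the key names here are hypothetical, not any App's actual schema:

```python
def merge_settings(defaults: dict, user: dict) -> dict:
    """Recursively overlay user-provided settings on top of defaults."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections rather than replacing them wholesale
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged


# A small set of defaults and a user file that overrides only two of them
defaults = {"forecast_hours": 120, "physics": {"time_step": 180, "scheme": "GFS"}}
user = {"forecast_hours": 48, "physics": {"time_step": 90}}
experiment = merge_settings(defaults, user)
```

The same merge applies regardless of the on-disk format: once bash, YAML, or INI settings are read into dictionaries, the user's small file overlays the much larger default set.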

The configuration layer is the main user interface to any of these systems. Once an experiment is defined, the rest of the process of running batch jobs is automated.

Monitoring jobs through the batch queueing system and the workflow manager is the primary mechanism for observing the system. Debugging is limited to reading log files written by the workflow manager, the scheduler, or the task itself.

In summary, a user of these Applications might expect to take the following steps to run a workflow:

  1. Clone a workflow repository
  2. Build Fortran executables included with the App
  3. Modify configuration settings in a user config file and/or export environment variables
  4. Run a configuration layer script on the command line, providing the user config file and appropriate environment
  5. (Situational) Start a server or cron job that iteratively completes all the tasks in the experiment workflow
  6. Monitor jobs via the batch queueing system
  7. Debug failures by reading experiment logs

Apps vary significantly in the specifics of this user interaction, and may or may not require users to manually stage input data, build other supporting software, choose which components must be built for their experiment, and on and on.

What is the Strangler Fig Pattern?

The "Strangler Fig Pattern" is a software design pattern that involves gradually replacing an existing system with a new one, using the old one as a foundation. The new system is built around the existing system and slowly takes over its responsibilities until the old system is eventually replaced. The pattern is named after the Strangler Fig plant, which grows around a host tree and gradually chokes it, eventually replacing it entirely. The pattern is often used in cases where replacing a legacy system all at once is not feasible, but a slow and incremental approach can be taken[^1].

[Figure: the Strangler Fig Pattern][^3]

How is the Strangler Fig Pattern Implemented?

The implementation of the Strangler Fig Pattern involves several steps, outlined below. It requires understanding the legacy system, choosing a replacement, building a facade between the two, and incrementally shifting responsibilities until the legacy system is gone.[^1][^2][^3]

The Strangler Fig Pattern steps can be defined as:

  1. Identify the functionality of the legacy system: The first step is to understand the functionality of the legacy system and how it is used. This will help determine what parts of the system can be replaced and when.

  2. Decide on the new system: The next step is to decide on the new system that will eventually replace the legacy system. This system should be chosen based on the requirements of the organization and the legacy system.

  3. Create a facade: A facade is created that acts as an interface between the legacy system and the new system. The facade redirects incoming requests to either the legacy system or the new system, depending on which system is best suited to handle the request.

  4. Incrementally replace the legacy system: The legacy system is gradually replaced by the new system. The new system takes over the responsibilities of the legacy system, one piece at a time, until the entire system has been replaced.

  5. Monitor and maintain: Finally, the new system must be monitored and maintained to ensure it is functioning as expected. This may involve making updates or changes to the new system as required.

The implementation of the Strangler Fig Pattern requires a strong understanding of the legacy system, a well-defined plan, and good communication between the development team and the stakeholders.
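The facade in step 3 can be sketched in a few lines. This is a toy illustration, not any real system's code; the handler names and the request routing are hypothetical:

```python
def legacy_handler(request: str) -> str:
    """Stand-in for the legacy system's handling of a request."""
    return f"legacy handled {request}"


def new_handler(request: str) -> str:
    """Stand-in for the new system's handling of a request."""
    return f"new system handled {request}"


# As migration proceeds, requests move from the legacy set into this one.
MIGRATED = {"inventory"}


def facade(request: str) -> str:
    """Route each request to whichever system currently owns it."""
    if request in MIGRATED:
        return new_handler(request)
    return legacy_handler(request)
```

The key property is that callers only ever talk to `facade()`; growing the `MIGRATED` set shifts traffic to the new system one responsibility at a time, with no change visible to the caller.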

Example Strangler Fig Pattern Implementation

A company has a legacy system that manages its customer orders and inventory. The system is outdated and difficult to maintain, but a complete replacement is not feasible due to the size and complexity of the system.

  1. Understanding the legacy system: The company's development team starts by studying the legacy system to understand how it works and what it does. They determine what parts of the system can be replaced and when.

  2. Choosing a new system: The development team decides to implement a new system using microservices architecture, which is more flexible and easier to maintain than the legacy system.

  3. Creating a facade: The development team creates a facade that acts as an interface between the legacy system and the new system. The facade receives incoming requests and redirects them to either the legacy system or the new system, depending on which system is best suited to handle the request.

  4. Incrementally replacing the legacy system: The development team starts by replacing the parts of the legacy system that are most outdated and difficult to maintain. For example, they might start by replacing the inventory management component with a new microservice. The new microservice takes over the responsibilities of the legacy component and communicates with the rest of the system through the facade. The development team continues this process, replacing more and more components of the legacy system with new microservices until the entire system has been replaced.

  5. Monitoring and maintaining: Once the new system is in place, the development team monitors and maintains it to ensure it is functioning as expected. They make updates and changes as required to keep the system running smoothly.

This example shows how the Strangler Fig Pattern can be used to gradually replace a legacy system with a new one. The incremental approach allows the company to replace the legacy system without disrupting its operations and to ensure the new system is working as expected before fully replacing the legacy system.

The Strangler Fig Pattern Applied to the Unified Workflow

The Unified Workflow, following the Strangler Fig Pattern, will be composed of three essential subsystems that work together to deliver an end product given user-defined settings: the Configuration Subsystem, the Workflow Manager, and the Component Drivers. While additional components may be added to the full system in the future, the unification can be labeled complete, or successful, with just these three in place.

The Configuration Subsystem

From an analysis of the existing UFS Apps, the configuration subsystem serves as the main user interface, and will be treated as the Strangler Fig facade. It is responsible for several tasks, including:

  • Gathering user provided parameters, both required and optional, that may overwrite any number of default settings
  • Managing default settings for all portions of the System:
    • Workflow definition, including which tasks to run and their dependencies
    • Computational platform, including available resources, data locations, run environment requirements, and batch scheduler information
    • Component parameters, including required resources and Fortran-required parameters available in namelists, etc.
  • Validating that settings are appropriate and compatible, examples include
    • Files exist where expected
    • Parameters are consistent with their required types, e.g., a string was provided where an integer was required
    • Parameters are consistent with each other, e.g., a large physics time step was set with a very small grid spacing
    • A task is turned off, but no other way is specified to obtain the data it produces
  • Creating experiment directories
  • Populating the directories with necessary fix files, workflow definition files, and experiment configuration files.
  • Optionally starting the workflow manager.
  • Logging and reporting

As a facade, the Configuration Subsystem will be responsible for ensuring proper interfaces to the Workflow Manager. Standalone tools will be needed to interface with the existing j-jobs/ex-scripts for their configuration, which often relies on global variables that are "sourced" from other bash config scripts.

The Workflow Manager

The UFS Community and operations are already familiar with, and therefore require, the use of specific workflow managers. It is not the mission of this project to recreate a workflow manager, but to interface with existing ones. Since the Unified Workflow will follow an object-oriented design, the concepts used in other Python-defined workflow managers (such as Luigi) will be leveraged to help build a more standard solution for UFS.

The Unified Workflow plans to support ecFlow, Rocoto, and Cylc through interfaces that should make compatibility with other platform-agnostic workflow managers, such as Luigi, Airflow, or Parsl, possible. These concepts should also allow for compatibility with many of the cloud service providers' workflow management solutions.

Not only will these interfaces meet our needs in the current computational environments, they could lead us more readily into the future of cloud and distributed computing.

Component Drivers

The UW Component Drivers will replace existing run scripts, which are in large part written in bash and incredibly specific to the App for which they run. In the legacy system, these run scripts follow the layered structure as described above, and are in large part independent from each other. These Component Drivers can then be replaced one at a time, or even in parallel, until all are replaced for a given App, as in Step 4 of applying the Strangler Fig Pattern.

The design of the Component Drivers is important in ensuring we have configured an experiment appropriately, and can run it as expected. The drivers will be defined by classes with required methods adopting standard interfaces, which will unlock the polymorphism needed for the "plug and play" feel we're after.

Similar to Luigi, UW will adopt the requirement for three methods on a Task class: requires(), run(), and output(). They are responsible for the following:

  • The requires() method defines the dependencies on files, other tasks, etc.
  • The run() method includes the logic of how to achieve completion of a given task.
  • The output() method defines a data structure of expected output of the task when complete, i.e., a list or dict of files.
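The three-method contract can be expressed as an abstract base class. This is a sketch of the Luigi-style interface described above; UW's actual class names and signatures may differ:

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Base class requiring the three methods every Task must implement."""

    @abstractmethod
    def requires(self) -> list:
        """Dependencies on files, other tasks, etc."""

    @abstractmethod
    def run(self) -> None:
        """The logic of how to achieve completion of this task."""

    @abstractmethod
    def output(self) -> dict:
        """Expected output of the task when complete, e.g. a dict of files."""
```

Any concrete Task must implement all three methods before it can be instantiated, which is what lets the Workflow Manager interface treat every task polymorphically.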

Unlike Luigi, however, the configuration of each Task will be handled via the Unified Workflow config objects (described here), and a few additional methods will be used that are specific to the HPC environments in which UFS workflows operate. They include:

  • A resources() method to return a Config Object of the Task-specific HPC resources needed.
  • A job_card() method that turns the resources config object into a batch card for the configured Task
  • A validate_config() method that ensures a prepared experiment configuration object meets the requirements of the Component in question.
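The HPC-specific methods might look like the following sketch for a hypothetical Forecast driver. The resource keys, the Slurm-style directives, and the config layout are illustrative assumptions only:

```python
class Forecast:
    """Hypothetical Component Driver for the forecast task."""

    def __init__(self, config: dict):
        self.config = config

    def resources(self) -> dict:
        """Return the Task-specific HPC resources needed."""
        return {"nodes": self.config["nodes"],
                "walltime": self.config["walltime"]}

    def job_card(self) -> str:
        """Turn the resources into a batch card for the configured Task."""
        res = self.resources()
        return (f"#SBATCH --nodes={res['nodes']}\n"
                f"#SBATCH --time={res['walltime']}\n"
                'python -c "import expt_workflow; expt_workflow.forecast.run()"')

    def validate_config(self) -> bool:
        """Ensure the prepared configuration meets this Component's needs."""
        return all(key in self.config for key in ("nodes", "walltime"))
```

Because `job_card()` is derived from `resources()`, a single source of truth in the config object drives both the workflow manager files and the batch submission.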

In addition to the required methods, there can be any number of support methods that perform auxiliary tasks for the Component. One such set of methods might validate different sections of the Component configuration object, perhaps one each for dictable configuration files (like the namelist and model_configure for the weather model) and templated configuration files (like the NEMS configure file, also for the weather model). A similar set of methods might exist for the creation of those files.

Further still, methods would be needed for data gathering for fix files and those files listed by the requires() method, wherever they may be.

This design allows a Task object to communicate clearly with the Configuration Subsystem, the Workflow manager (or its interface), and other Task objects, about its needs and requirements, decoupling the Task-specifics from any other subsystem of the App.

The definition of an experiment workflow, then, boils down to the creation of a set of objects that is available to each of the subsystems in an App.

Behavior of a Unified Workflow App

Once each of these subsystems is fully in place, the new system should behave as follows. Here, we will focus on running a regional cold start forecast.

The user creates a YAML file to define the well-documented required settings, alongside any optional ones for the desired experiment.

A cold start forecast for the ufs-weather-model is pretty well-defined by all the major UFS Apps, and includes running tasks that make the grid, climatology files, and orography files, followed by tasks that gather initial conditions and boundary conditions, and finally run the weather model forecast. Since this set is a frequently-used workflow definition (on its own, or as part of a more complicated workflow), it is pre-defined. That means there is a Python definition of the set of objects corresponding to those tasks, and configurable by the YAML file the user provides.
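Such a user file might look something like the following YAML fragment; the key names and values are hypothetical, not the actual UW schema:

```yaml
# Hypothetical user configuration for a pre-defined cold start workflow
workflow: regional_cold_start_forecast
platform: hera
forecast:
  cycle: 2023-02-21T00:00:00
  length_hours: 48
```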

The user then selects this pre-defined workflow as a configuration option, also setting that in their YAML file. They run a command like this, choosing a machine:

```
./create_experiment.py my_config.yaml hera.yaml
```

Because the user chose to run with the Rocoto workflow manager, an XML file was generated that defines the workflow. A Python file that defines the objects in that workflow is also written – let's call that expt_workflow.py. The user must start a cron job to complete the defined workflow (other workflow managers might require starting or connecting to a centralized server at this step). The XML has already been filled in with all the resource requirements of each of the Task objects, defined by their resources() methods. The dependencies are also filled in based on their requires() methods.

This step also called the validate_config() methods of all the objects to ensure that the defaults and the user-provided parameters were appropriate for each of the objects created.

Each time Rocoto submits a job to the Slurm scheduler, Rocoto creates a job card in memory. The command the job card runs will be something like this example for the forecast (of course, the appropriate handling of environments will also be required):

```
python -c "import expt_workflow; expt_workflow.forecast.run()"
```

Perhaps the user didn't run rocotorun with the appropriate verbosity to see what the job card would have looked like, and it's important to their debugging process. The forecast object can help them out by simply running the following on the command line on a front-end node:

```
python -c "import expt_workflow; expt_workflow.forecast.job_card()"
```

In fact, many of the configuration settings of any object in the workflow can be interrogated in much the same way, including re-running in an interactive session on a compute node.

Discussion and Feedback

Discussion and feedback pages for the wiki can be found here

Further Reading

https://martinfowler.com/articles/patterns-legacy-displacement/

https://martinfowler.com/articles/break-monolith-into-microservices.html

References

[^1]: Fowler, M. (2004, June 29). Bliki: Stranglerfigapplication. martinfowler.com. Retrieved February 3, 2023, from https://martinfowler.com/bliki/StranglerFigApplication.html

[^2]: Martinekuan. (n.d.). Strangler fig pattern - azure architecture center. Azure Architecture Center | Microsoft Learn. Retrieved February 3, 2023, from https://learn.microsoft.com/en-us/azure/architecture/patterns/strangler-fig

[^3]: Heusser, M. (2020, June 29). What is the strangler pattern and how does it work?: TechTarget. App Architecture. Retrieved February 3, 2023, from https://www.techtarget.com/searchapparchitecture/tip/A-detailed-intro-to-the-strangler-pattern