Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job_manager: Add htcondorVC3 job manager #251

Open
wants to merge 89 commits into
base: maint-0.6
Choose a base branch
from

Conversation

CodyKank
Copy link

Integrating the VC3 job manager class for 0.6 branch.

khurtado and others added 30 commits February 19, 2019 14:54
Added htcondor submission for workflow steps.
No monitoring of any file transfers is implemented yet.
Add Cody changes for condor submission
…ller

Conflicts:
	reana_job_controller/rest.py

Merged upstream changes. Commit condor changes.

Updated to most recent rjc. Added delete job for condor.
…ller into cody_dev

Conflicts:
	reana_job_controller/rest.py
	setup.py

Merge khurtado's RJC with reanahub's current master.

Conflicts involved changing the hardcoded paths
within rjc/rest.py. Setup.py didn't like the
added 'htcondor' requirement.
Create first instance of HTCondor's job manager
There are a few VC3 specifics within.
Testing is still needed.
Added ability to call htcondor_job_manager
over the already defined kubernetes_job_manager.
- Implement watch jobs method for condor.
Changed how the htcondor job wrapper searches for
a container. Instead of assuming singularity
now searches through a list to be expanded
later.

Singularity is still hardcoded with arguments.
Erroneously left out shifter in module search list,
fixed incorrect usage of cntr_path in job_wrapper.
To account for environments with bash versions > 4.0
associative arrays in job_wrapper for htcondor were
removed and replaced with simple arrays.
Adjusted call to modules within job_wrapper due to
failure on Blue Waters.
First implementation to search for shifter
khurtado and others added 6 commits April 6, 2020 18:20
Should only apply to tag 0.6.0
Removed an older method for passing the htcondor schedd address
to the container, instead of at build time we moved it long ago
to within the yaml configuration file for reana-cluster
Bump HTCondor installed through pip from 8.9.1 to 8.9.6
condor.py was used in the past before the creation of
job_manager classes.

import os

MAX_JOB_RESTARTS = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these variables are used only in VC3 context, it would be good to name the file more specifically. Examples:

  • Do we need to have them as separate file? If not, consider moving into htcondorvc3 class.
  • Shall they be reused for other VC3 backends than just HTCondor? If yes, leave them in separate file named like vc3_config.py or vc3_variables.py.
  • We can always use vc3_... name prefix technique for other possible VC3-specific files such as job wrapper. Currently it lives in files/job_wrapper.sh, it may be better to use something like scripts/vc3/job_wrapper.sh.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khurtado any thoughts on if we'd like to move these or prefix the naming?

Copy link
Contributor

@khurtado khurtado May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into it, it seems we do not use these 2 anymore:

SHARED_VOLUME_PATH = os.getenv('SHARED_VOLUME_PATH', '/var/reana')
SHARED_VOLUME_PATH_ROOT = os.getenv('SHARED_VOLUME_PATH_ROOT', SHARED_VOLUME_PATH)

So, I think we can simply remove variables.py, get rid of this line and set MAX_JOB_RESTARTS = 3 below the import calls in htcondorvc3_job_manager.py

# -*- coding: utf-8 -*-
#
# This file is part of REANA.
# Copyright (C) 2017, 2018 CERN.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see your contribution!

  • Please amend copyright years.
  • Please add yourself and @khurtado to the author list in AUTHORS.rst.
  • Regarding merging, we usually merge feature branches with linear history. There are basically two options: (1) Shall we rebase and keep the history? (2) Shall we squash everything under one commit? What would you prefer?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add us to AUTHORS.rst!

Regarding merging, we are fine with whatever would be easier for the REANA team. If squashing everything under one commit is easier for you then we are fine with that, the history of our commits is not as important to us as integrating them.

#!/bin/bash

# Replicate input files directory structure
# @TODO: This could be executed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have:

etc/job_wrapper.sh
files/job_wrapper.sh

It is not clear which files are for what.

What about using:

etc/htcondorcern/job_wrapper.sh
etc/htcondorvc3/job_wrapper.sh

so that we can easily support several backends and distinguish between them?

This would mean that we'd have to move quite a few CERN-specific files from etc/* to /etc/htcondorcern/*` ourselves... but merging this branch will be quite a lot of work anyway, so could perhaps do it alongside?!

(An alternative would be to use the same directory but strictly "cern" and "vc" naming prefix everywhere.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like creating different subdirectories in etc:

etc/htcondorvc3/job_wrapper.sh
etc/htcondorcern/job_wrapper.sh

As you said, handling several different backends would be easier to manage and visualize than prefixes or suffixes (job_wrapper_cern.sh / cern_job_wrapper.sh).

If you're OK with the subdirectories in etc/, I'd be happy to help move CERN specific files as well as the VC3 files.

Dockerfile Outdated
@@ -77,5 +80,6 @@ EXPOSE 5000

ENV COMPUTE_BACKENDS $COMPUTE_BACKENDS
ENV FLASK_APP reana_job_controller/app.py
ENV REANA_LOG_LEVEL DEBUG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be necessary to hard-code REANA_LOG_LEVEL in Dockerfile. One could pass it as environment variable...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add a bit more information on this: the way of passing this configuration is a bit cumbersome right now as one would have to: go inside RWC, read the config value and pass it as an environment variable to RJC. In reanahub/reana#277 there is a description of the current process and possible solutions we could take to centralise these configuration variables.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this should have been taken out, I missed this when cleaning things up.

def __init__(self, docker_img=None, cmd=None, prettified_cmd=None,
env_vars=None, workflow_uuid=None, workflow_workspace=None,
cvmfs_mounts='false', shared_file_system=False,
job_name=None, kerberos=False, kubernetes_uid=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have added in the meantime support for running unpacked Singularity images from CVMFS. (See unpacked_img in master branch.) Is this something that would interest you for VC3 integration as well? If yes, this will be something to look into. If not, we can either raise NotImplemented or pass while merging. Asking just for the future in order to know how to propagate v0.6.0 -> master changes to the HTCondorJobManagerVC3 class the best.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current moment we haven't explored CVMFS very much as we are targeting mainly National Science Foundation HPC sites here in the US which will not support it, so for now when merging a raise NotImplemented may be best.

Dockerfile Outdated
Comment on lines 13 to 14
pip install --upgrade pip && \
pip install htcondor==8.9.6 retrying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have these libraries installed at this specific point in the Dockerfile, maybe because of the deb packages? Otherwise, we could just continue setting them in setup.py so they will be installed at this point, and later available in the code .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, these should be handled in setup.py, I'll clean these up.

Comment on lines 72 to 76
# How is this set in the environment?
# It is hardcoded in Dockerfile and there is no code
# in workflow-controller to override that for the job-controller.
#SUPPORTED_COMPUTE_BACKENDS = os.getenv('COMPUTE_BACKENDS',
# DEFAULT_COMPUTE_BACKEND).split(",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point for improvement 👍. The way this is set is a bit shady right now as one could pass it in REANA-Workflow-Controller with the process described in https://github.com/reanahub/reana-job-controller/pull/251/files#r416428666, but actually this is set at build time because that knowledge is used to selectively install the correct packages.

So when we build images for reana.cern.ch we use the Makefile like:

BUILD_ARGUMENTS="COMPUTE_BACKENDS=htcondorcern,slurmcern,kubernetes" BUILD_TYPE=release make build

What is happening behind the scenes is that we pass these build args to REANA-Job-Controller docker build, which results in the final image having this environment variable.

Side note, as it might be of interest here: Then we take this image and we name it reanahub/reana-job-controller-htcondorcern-slurmcern, not very pretty but this is what we have for now :) (see previous discussions here).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the note Diego, it's very helpful! We'll be sure to follow that from now on. Perhaps we've been 'cheating' while building our images, we have simply been building them straight from the dockerfile within the RJC repo rather than through the reanahub/reana repo. I suppose this is a bad habit of ours.

CodyKank and others added 12 commits May 11, 2020 14:23
Rename MAX_JOB_RESTARTS to MAX_NUM_RETRIES to make it consistent with the HTCondor CERN implementation.
To better organize job wrappers, files/job_wrapper
has been moved to etc/htcondorvc3/job_wrapper.sh
Increase number of default retries.
Fix issue with krb5 dialog
Increase parrot timeout
htcondor\_submit.py no longer used
Remove old settings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants