Commit a14e5e1

v4.4.10.0 Updates for acquire, logging and cpu test (#1054)

RobHanna-NOAA authored Feb 2, 2024
1 parent 682c15b

Showing 17 changed files with 214 additions and 263 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -1,2 +1,3 @@
 # Set the default behavior for end of lines.
 * text eol=lf
+docs/CHANGELOG.md merge=union
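
For context, `merge=union` tells git to resolve conflicting hunks in `docs/CHANGELOG.md` by keeping the lines from both sides rather than stopping on a merge conflict, which suits an append-style changelog. A minimal sketch of the behavior (hypothetical repo and file contents, assuming git is configured with a user name and email):
```
# Hypothetical demo of the union merge driver on a changelog-style file
set -e
git init -q union-demo && cd union-demo
echo 'docs/CHANGELOG.md merge=union' > .gitattributes
mkdir docs && echo 'base entry' > docs/CHANGELOG.md
git add -A && git commit -qm 'base'

git checkout -qb feature
printf 'feature entry\nbase entry\n' > docs/CHANGELOG.md
git commit -qam 'feature'

git checkout -q -                 # back to the default branch
printf 'main entry\nbase entry\n' > docs/CHANGELOG.md
git commit -qam 'main'

git merge feature                 # no conflict: union keeps both new entries
cat docs/CHANGELOG.md             # both 'main entry' and 'feature entry' survive
```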
5 changes: 3 additions & 2 deletions .pre-commit-config.yaml
@@ -6,9 +6,10 @@ repos:
     rev: v4.4.0
     hooks:
       - id: trailing-whitespace
-        exclude: CHANGELOG.md
+        # Below is python regex to exclude all .md files
+        exclude: .*md$
       - id: end-of-file-fixer
-        exclude: Pipfile.lock
+        exclude: Pipfile.lock
       - id: check-added-large-files
         args: ['--maxkb=5000']
       - id: check-json
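
As a quick sanity check of that pattern (a sketch; this assumes pre-commit's usual behavior of applying `exclude` as a Python regex via `re.search` against each file path):
```
import re

# ".*md$" matches any path ending in "md" -- every Markdown file, though it
# would also match a non-Markdown name such as "notes.txtmd".
pattern = re.compile(r".*md$")

for path in ["docs/CHANGELOG.md", "README.md", "src/main.py"]:
    print(path, "excluded" if pattern.search(path) else "checked")
```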
84 changes: 58 additions & 26 deletions CONTRIBUTING.md
@@ -45,46 +45,77 @@ If you would like to contribute, please follow these steps:
 ```
 git checkout -b <dev-your-bugfix-or-feature>
 ```
-4. [pre-commit](https://pre-commit.com/) is used to run auto-formatting and linting tools. Please follow the steps in the link provided to install `pre-commit` locally. `pre-commit` can be installed as a git hook to verify your staged changes adhere to the project's style and format requirements (configuration defined in [pyproject.toml](/pyproject.toml)).
-5. After you've installed `pre-commit` via `pip`, `homebrew`, or `conda`, check the version to verify installation:
-   ```
-   $ pre-commit --version
-   ```
-
-6. Initialize the pre-commit hooks included within the root directory of `inundation-mapping`:
-
+4. pre-commit installation:
+
+   [pre-commit](https://pre-commit.com/) is used to run auto-formatting and enforce styling rules.
+   It is a critical part of development and is enforced at the `git commit` step. Key tools are included **inside the Docker container** if you want to run the correctly configured linting and formatting command-line executables there. If you intend to run `flake8`, `black`, or `isort` from the command line **outside of the Docker container**, additional configuration and installation is required, which is not described here.
+
+   **Note: the steps below are similar to another required critical step (pre-commit configuration) later in this document, which also needs to be run.**
+
+   If pre-commit is not already installed on your system:
+   ```
+   pip install pre-commit
+   ```
+   All related tools (git hook scripts) are installed at the `pre-commit install` step, not at this level. See https://pre-commit.com/#install
+
+   If you get an error message during the install of pre-commit which says:
+
+   *Installing collected packages: pre-commit
+   WARNING: The script pre-commit is installed in '/home/{your_user_name}/.local/bin' which is not on PATH.
+   Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.*
+
+   then you will need to do some additional configuration: adjust your PATH, using exactly the path shown in the WARNING message from your console output.
+   ```
+   export PATH="/home/{your_user_name}/.local/bin:$PATH"
+   ```
+   To test that it installed, is pathed correctly, and to check the version:
+   ```
+   pre-commit --version
+   ```
+   It should respond with something like *pre-commit 3.6.0* (the exact version may differ).
+
+5. pre-commit configuration:
+
+   Now you need to configure your local clone of the repo to honor the pre-commit hooks.
+   The `pre-commit` package is used to pick up the pre-commit hooks which verify that your staged changes adhere to the project's style and format requirements (configuration defined in [pyproject.toml](/pyproject.toml)).
+
+   Initialize the pre-commit hooks included within the root directory of this code repository (`inundation-mapping`):
 ```
 $ pre-commit install
 ```
-7. At this point, you should be set up with `pre-commit`. When a commit is made it will run the pre-commit hooks defined in [`.pre-commit-config.yaml`](.pre-commit-config.yaml). For reference, you may run any of the pre-commit hooks manually before issuing the `git commit` command (see below). Some tools used by the pre-commit git hook (`isort`, `flake8`, & `black`) are also available as command line executables within the Docker container; however, it is recommended to run them through `pre-commit` outside of the container, as it picks up the correct configuration.
+6. At this point, you should be set up with `pre-commit`. When a commit is made it will run the pre-commit hooks defined in [`.pre-commit-config.yaml`](.pre-commit-config.yaml). For reference, you may run any of the pre-commit hooks manually before issuing the `git commit` command (see below). Some tools used by the pre-commit git hook (`isort`, `flake8`, & `black`) are also available as command line executables **within the Docker container**; however, it is recommended to run them through `pre-commit` **outside of the container**, as it picks up the correct configuration.
 ```
 # Check only the staged changes
 pre-commit run
 # Check all files in the repo
 pre-commit run -a
-# Run only the flake8 formatting tool
-pre-commit run -a flake8
+# Run only flake8, isort, or black
+pre-commit run -a flake8
+pre-commit run -a isort
+pre-commit run -a black
 ```
-8. Build the Docker container:
+7. Build the Docker container:
 ```
 docker build -f Dockerfile -t <image_name>:<tag> <path/to/repository>
 ```
-9. [Within the container](README.md#startrun-the-docker-container), ensure unit tests pass ([instructions here](/unit_tests/README.md)).
+8. [Within the container](README.md#startrun-the-docker-container), ensure unit tests pass ([instructions here](/unit_tests/README.md)).
 ```
 pytest unit_tests/
 ```
-10. Outside of the Docker container, commit your changes:
+9. Outside of the Docker container, commit your changes:
 ```
-git commit -m <descriptive sentence or two changes>
+git commit -m "<descriptive sentence or two of changes>"
 ```
-This will invoke the pre-commit hooks mentioned in step 7 that will lint & format the code. In many cases non-compliant code will be rectified automatically, but in some cases manual changes will be necessary. Make sure all of these checks pass; if not, make the necessary changes and re-issue `git commit -m <...>`.
-9. Push to your forked branch:
+This will invoke the pre-commit hooks mentioned in step 6 that will lint & format the code (some others as well). In many cases non-compliant code will be rectified automatically, but in some cases manual changes will be necessary. Make sure all of these checks pass. If not, make the necessary changes (`git add <...>`) and re-issue `git commit -m "<...>"`.
+10. Push to your forked branch:
 ```
 git push -u origin
 ```
@@ -93,4 +124,5 @@ If you would like to contribute, please follow these steps:
 ```
 git push --set-upstream origin <your branch>
 ```
-10. Submit a pull request on [inundation-mapping's GitHub page](https://github.com/NOAA-OWP/inundation-mapping) (please review checklist in [PR template](/.github/PULL_REQUEST_TEMPLATE.md) for additional PR guidance).
+11. Submit a pull request on [inundation-mapping's GitHub page](https://github.com/NOAA-OWP/inundation-mapping) (please review checklist in [PR template](/.github/PULL_REQUEST_TEMPLATE.md) for additional PR guidance).
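
As a concrete instance of the Docker build step above (step 7 in the updated numbering; the image name and tag here are placeholders, not project conventions):
```
# Hypothetical image name and tag; point the final argument at your clone
docker build -f Dockerfile -t fim:dev /path/to/inundation-mapping
```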

91 changes: 45 additions & 46 deletions data/usgs/acquire_and_preprocess_3dep_dems.py
100644 → 100755
@@ -39,7 +39,7 @@ def acquire_and_preprocess_3dep_dems(
     extent_file_path,
     target_output_folder_path='',
     number_of_jobs=1,
-    retry=False,
+    repair=False,
     skip_polygons=False,
     target_projection='EPSG:5070',
 ):
@@ -79,9 +79,11 @@
     - number_of_jobs (int):
         This program supports multiple procs if multiple procs/cores are available.
-    - retry (True / False):
-        If retry is True and the file exists (either the raw downloaded DEM and/or
-        the projected one), then skip it.
+    - repair (True / False):
+        If repair is True then look for output DEMs that are missing or are too small (under 10 MB).
+        This happens often as there can be instability when running long-running processes.
+        USGS calls and networks can blip and some of the full BED can take many, many hours.
+        It will also look for DEMs that were missed entirely on previous runs.
     - skip_polygons (bool)
         If True, then we will not attempt to create polygon files for each dem file. If false,
@@ -120,37 +122,28 @@
                 f"For the output path of {target_output_folder_path}, the child directory"
                 " need not exist but the parent folder must."
             )
-        os.makedirs(target_output_folder_path)
+        os.mkdir(target_output_folder_path)

     else:  # path exists
-        if not retry:
+        if repair is False:
             file_list = os.listdir(target_output_folder_path)
             if len(file_list) > 0:
                 print()
-                msg = f"The target output folder of {target_output_folder_path} appears to not be empty.\n\n"
-                "Do you want to empty the folder first?\n"
-                " -- Type 'overwrite' if you want to empty the folder and continue.\n"
-                " -- Type any other value to abort and stop the program.\n"
-                " ?="
-
-                resp = input(msg).lower()
+                msg = (
+                    f"The target output folder of {target_output_folder_path} appears to not be empty.\n\n"
+                    "Do you want to empty the folder first?\n"
+                    " -- Type 'overwrite' if you want to empty the folder and continue.\n"
+                    " -- Type any other value to abort and stop the program.\n"
+                )
+                print(msg)
+                resp = input(" ?=").lower()
                 if resp == "overwrite":
                     shutil.rmtree(target_output_folder_path)
                     os.mkdir(target_output_folder_path)
                 else:
                     print("Program stopped\n")
                     sys.exit(0)
-        else:  # might want to retry but the folder isn't there yet
-            # It is ok if the child directory does not exist, but the parent folder must
-            # parent directory, we want to reset it
-            parent_dir = os.path.abspath(os.path.join(target_output_folder_path, os.pardir))
-            print(parent_dir)
-            if not os.path.exists(parent_dir):
-                raise ValueError(
-                    f"For the output path of {target_output_folder_path}, the child directory"
-                    " need not exist but the parent folder must."
-                )
-            shutil.rmtree(target_output_folder_path)
-            os.mkdir(target_output_folder_path)
+        # no else:

     # I don't need the crs_number for now
     crs_is_valid, err_msg, crs_number = val.is_valid_crs(target_projection)
@@ -177,25 +170,25 @@

     # download dems, setting projection, block size, etc
     __download_usgs_dems(
-        extent_file_names, target_output_folder_path, number_of_jobs, retry, target_projection
+        extent_file_names, target_output_folder_path, number_of_jobs, repair, target_projection
     )

     if skip_polygons is False:
         polygonize(target_output_folder_path)

     end_time = datetime.utcnow()
-    fh.print_end_header('Loading 3dep dems', start_time, end_time)
+    fh.print_end_header('Loading 3dep dems complete', start_time, end_time)

     print()
     print(
         '---- NOTE: Remember to scan the log file for any failures. If you find errors in the'
-        ' log file, delete the output file and retry'
+        ' log file, delete the output file and repair.'
     )
     print()
     logging.info(fh.print_date_time_duration(start_time, end_time))


-def __download_usgs_dems(extent_files, output_folder_path, number_of_jobs, retry, target_projection):
+def __download_usgs_dems(extent_files, output_folder_path, number_of_jobs, repair, target_projection):
     '''
     Process:
     ----------
@@ -244,7 +237,7 @@
         'download_url': __USGS_3DEP_10M_VRT_URL,
         'target_projection': target_projection,
         'base_cmd': base_cmd,
-        'retry': retry,
+        'repair': repair,
     }

     try:
@@ -268,7 +261,9 @@
     print("==========================================================")


-def download_usgs_dem_file(extent_file, output_folder_path, download_url, target_projection, base_cmd, retry):
+def download_usgs_dem_file(
+    extent_file, output_folder_path, download_url, target_projection, base_cmd, repair
+):
     '''
     Process:
     ----------
@@ -289,8 +284,9 @@
         ie) EPSG:5070 or EPSG:2276, etc
     - base_cmd (str)
         The basic GDAL command with string formatting holes for key values.
-    - retry (bool)
-        If True, and the file exists, downloading will be skipped.
+    - repair (bool)
+        If True, and the file does not exist or is too small (under 10 MB),
+        it will attempt to download.
     '''

@@ -302,11 +298,14 @@
     # which is related to the spatial extents of the dem and the vrt combined.
     # so, super small .tifs are correct.

-    if (retry) and (os.path.exists(target_path_raw)):
-        msg = f" - Downloading -- {target_file_name_raw} - Skipped (already exists (see retry flag))"
-        print(msg)
-        logging.info(msg)
-        return
+    if (repair) and (os.path.exists(target_path_raw)):
+        if os.stat(target_path_raw).st_size < 1000000:
+            os.remove(target_path_raw)
+        else:
+            msg = f" - Downloading -- {target_file_name_raw} - Skipped (already exists (see repair flag))"
+            print(msg)
+            logging.info(msg)
+            return

     msg = f" - Downloading -- {target_file_name_raw} - Started"
     print(msg)
@@ -356,7 +355,7 @@ def polygonize(target_output_folder_path):
     """
     dem_domain_file = os.path.join(target_output_folder_path, 'DEM_Domain.gpkg')

-    msg = f" - Polygonizing -- {dem_domain_file} - Started"
+    msg = f" - Polygonizing -- {dem_domain_file} - Started (be patient, it can take a while)"
     print(msg)
     logging.info(msg)

@@ -368,6 +367,7 @@
     dem_gpkgs = gpd.GeoDataFrame()

     for n, dem_file in enumerate(dem_files):
+        print(f"Polygonizing: {dem_file}")
         edge_tif = f'{os.path.splitext(dem_file)[0]}_edge.tif'
         edge_gpkg = f'{os.path.splitext(edge_tif)[0]}.gpkg'
@@ -446,9 +446,8 @@ def __setup_logger(output_folder_path):
     '''
     sample usage (min params):
         python3 /foss_fim/data/usgs/acquire_and_preprocess_3dep_dems.py
-            -e /data/inputs/wbd/HUC6_ESPG_5070/
-            -t /data/inputs/3dep_dems/10m_5070/
-            -r
+            -e /data/inputs/wbd/wbd/HUC8_South_Alaska/
+            -t /data/inputs/3dep_dems/10m_South_Alaska/
             -j 20

     Notes:
@@ -499,10 +498,10 @@
     )

     parser.add_argument(
-        '-r',
-        '--retry',
-        help='OPTIONAL: If included, it will skip files that already exist.'
-        ' Default is all will be loaded/reloaded.',
+        '-rp',
+        '--repair',
+        help='OPTIONAL: If included, it processes only HUCs missing output DEMs or whose output DEM'
+        ' is too small (under 10 MB), which does happen.',
         required=False,
         action='store_true',
         default=False,
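
The repair behavior in this diff reduces to a simple pre-download check. A standalone sketch of the pattern (the helper name is illustrative; the threshold mirrors the `st_size < 1000000` comparison above):
```
import os

MIN_VALID_SIZE = 1_000_000  # bytes; smaller files are treated as failed partial downloads

def needs_download(target_path, repair):
    """Return True if the DEM at target_path should be (re)downloaded."""
    if repair and os.path.exists(target_path):
        if os.stat(target_path).st_size < MIN_VALID_SIZE:
            os.remove(target_path)  # discard the partial/corrupt file and re-fetch
            return True
        return False  # present and plausibly complete: skip
    return True  # normal runs always download
```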
Empty file modified data/usgs/preprocess_ahps_usgs.py
100644 → 100755
Empty file modified data/usgs/preprocess_download_usgs_grids.py
100644 → 100755
28 changes: 28 additions & 0 deletions docs/CHANGELOG.md
@@ -1,6 +1,34 @@
 All notable changes to this project will be documented in this file.
 We follow the [Semantic Versioning 2.0.0](http://semver.org/) format.

+## v4.4.10.0 - 2024-02-02 - [PR#1054](https://github.com/NOAA-OWP/inundation-mapping/pull/1054)
+
+Recent testing exposed a bug in the `acquire_and_preprocess_3dep_dems.py` script: it lost the ability to be re-run so that files from earlier unsuccessful attempts are found and tried again. The feature may have been lost due to confusion over the word "retry". Now "retry" means restart the entire run, and a new flag called "repair" has been added, meaning fix what failed earlier. This is a key feature, as communication failures are common when calling USGS to download DEMs, and with some runs taking many hours it becomes important.
+
+Also used the opportunity to fix a couple of other minor issues:
+1) Reduce log output.
+2) Add a test ensuring the user does not submit job numbers (number of CPU requests) that exceed the system's max CPUs. This test exists in a number of places in the code, but much later in the processing stack, after a lot of processing has been done. Now it is done at the start of the fim pipeline stack.
+3) Remove arguments for "isaws", which is no longer in use and has not been for a while.
+4) Quick upgrade to the tracker log that keeps track of the duration of each unit being processed.
+
+### Changes
+
+
+- `data/usgs/`
+  - `acquire_and_preprocess_3dep_dems.py`: Re-add a feature which allows restarting a run to redo missing or partial outputs. The system is now named the "repair" system.
+- `fim_pipeline.sh`: Remove the parallel `--eta` flag to reduce logging (it was not needed); also removed the "isaws" flag.
+- `fim_pre_processing.sh`: Added validation tests for maximum CPU requests (job numbers); a sketch of such a guard follows this changelog excerpt.
+- `fim_post_processing.sh`: Added a permissions update, as output folders were being locked due to permissions.
+- `fim_process_unit_wb.sh`: Fixed a bug with output folders being locked due to permissions; the earlier fix was not recursive.
+- `src/`
+  - `bash_functions.sh`: Added a function so the unit timing logs also include the time as a percentage, so it can easily be used to calculate averages.
+  - `delineate_hydros_and_produce_HAND.sh`: Removed some unnecessary logging. Changed a few gdal calls to be less verbose.
+  - `derive_level_paths.py`: Changed verbose to false to reduce unnecessary logging.
+  - `run_by_branch.sh`: Removed some unnecessary logging. Added a duration system so we know how long each branch took to process.
+  - `run_unit_by_wb.sh`: Removed some unnecessary logging. Changed a few gdal calls to be less verbose.
+  - `split_flows.py`: Removed a progress bar which was unnecessary and was adding to logging.
+
+<br/><br/>

 ## v4.4.9.2 - 2024-02-02 - [PR#1066](https://github.com/NOAA-OWP/inundation-mapping/pull/1066)
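
The CPU validation described in item 2 above amounts to comparing the requested job counts against the machine's core count before any processing starts. A sketch of such a guard (variable names are illustrative, not necessarily those used in `fim_pre_processing.sh`):
```
# Illustrative guard: abort early if requested jobs exceed available cores
total_cpus_available=$(nproc)

if (( jobHucLimit * jobBranchLimit > total_cpus_available )); then
    echo "Error: requested jobs ($jobHucLimit x $jobBranchLimit) exceed available CPUs ($total_cpus_available)."
    exit 1
fi
```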
6 changes: 2 additions & 4 deletions fim_pipeline.sh
@@ -46,8 +46,6 @@ usage()
     -o : Overwrite outputs if they already exist.
     -skipcal : If this param is included, updating the S.R.C. via the calibration points
                will be skipped.
-    -isaws : If this param is included, AWS objects will be used where possible
-             - Note: This feature is not yet implemented.
     -x : If this param is included, the crosswalk will be evaluated.
@@ -104,13 +102,13 @@ if [ -f "$hucList" ]; then
     if [ "$jobHucLimit" = "1" ]; then
         parallel --verbose --lb -j $jobHucLimit --colsep ',' --joblog $logFile -- $process_wb_file $runName :::: $hucList
     else
-        parallel --eta -j $jobHucLimit --colsep ',' --joblog $logFile -- $process_wb_file $runName :::: $hucList
+        parallel -j $jobHucLimit --colsep ',' --joblog $logFile -- $process_wb_file $runName :::: $hucList
     fi
 else
     if [ "$jobHucLimit" = "1" ]; then
         parallel --verbose --lb -j $jobHucLimit --colsep ',' --joblog $logFile -- $process_wb_file $runName ::: $hucList
     else
-        parallel --eta -j $jobHucLimit --colsep ',' --joblog $logFile -- $process_wb_file $runName ::: $hucList
+        parallel -j $jobHucLimit --colsep ',' --joblog $logFile -- $process_wb_file $runName ::: $hucList
     fi
 fi
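
For reference, dropping `--eta` only removes the on-screen progress estimate; per-job timing remains available in the `--joblog` file. A sketch of inspecting it (this assumes GNU parallel's standard joblog layout):
```
# GNU parallel's --joblog output is tab-separated:
# Seq  Host  Starttime  JobRuntime  Send  Receive  Exitval  Signal  Command
# List the slowest jobs first (skip the header, sort by JobRuntime):
tail -n +2 "$logFile" | sort -t$'\t' -k4,4 -gr | head
```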
