Public repository for tracking permits published by the U.S. Army Corps in certain districts.


Wetlands Tracker

Overview

The U.S. Army Corps of Engineers (USACE) evaluates permit applications for any work, including construction and dredging, in the nation's navigable waters. The Wetlands Impact Tracker compiles USACE public notices of those permit applications, which are typically published as PDFs. The data pulled from these notices can help users better understand the impact of development projects on sensitive areas by surfacing and summarizing individual notices and aggregating data across them. We encourage you to use this tool to explore how development projects are affecting the communities you work with and live in.

Data

All data powering the dashboard is located in an AWS S3 bucket with public read access. Here are the links to the CSVs:

Prerequisites and Installation

  • Install Python: If you don't have Python installed, download and install it from the official Python website.
  • Clone or Download the Repository:
    • If Git is installed, clone the repository using the following command in Git Bash on Windows or terminal in other systems:
      cd the-path-you-would-like-to-hold-the-repository
      git clone https://github.com/AtlasPublicPolicy/wetlands-tracker.git
      
    • If you don't have Git installed, you can download the repository as a ZIP file from the GitHub page. Click on the "Code" button and select "Download ZIP." After downloading, extract the ZIP file to the directory of your choice.
  • Set Up and Activate a Virtual Environment in PowerShell or Command Prompt on Windows or in terminal in Other Systems:
    • Create a new virtual environment:
      # Navigate to the project directory:
      cd the-path-you-hold-the-repository
      
      # Create a virtual environment:
      virtualenv venv
      
      # or
      python -m venv venv
      
    • Activate the virtual environment:
      Windows:
      .\venv\Scripts\activate
      
      macOS and Linux
      source venv/bin/activate
      
  • Set up an AWS S3 bucket:
    • A folder to place the scraped data: dashboard-data
    • A folder to place notice PDFs: full-pdf
      NOTE: If you do not want to use an AWS S3 bucket, you can store the data locally instead: uncomment the 8th parameter in the configuration (self.directory = "data_schema/") and the export call in main() ([main_extractor.dataframe_to_csv(main_tbls[df_name], df_name, config.directory) for df_name in main_tbls]).
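
The local-storage fallback can be sketched as follows. This is a hypothetical, stdlib-only illustration of what an export helper like main_extractor.dataframe_to_csv might do; the function signature and the table shape (a list of dicts) are assumptions, not the project's actual implementation.

```python
import csv
import os

def dataframe_to_csv(rows, name, directory="data_schema/"):
    """Write one table to <directory>/<name>.csv (hypothetical sketch).

    rows is assumed to be a list of dicts sharing the same keys;
    the real project likely passes DataFrame-like objects instead.
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{name}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path
```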

Usage

In an active virtual environment:

  1. Set up the configuration in main.py:

    • Create a file named "api_key.env" and provide the following keys:

      • AZURE_API_KEY=your_azure_api_key
      • AZURE_ENDPOINT=https://your-azure-endpoint.com
      • AWS_ACCESS_KEY_ID=your_aws_access_key_id
      • AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
      • OPENAI_API_KEY=your_openai_api_key
      • REDIVIS_API_KEY=your_redivis_api_key
    • Modify the following parameters as needed:

      • update: Whether to scrape all historical notices or only recently updated ones: 0, first-time scraping (all historical notices); 1, update only; default is 1. Note: Be cautious when setting update to 0 (scraping all historical notices), as the run may take an extended period and incur high costs for Azure and LLM services.

      • n_days: How many days in the past to search for updated notices: a number from 0 to 500; default is 14.

      • max_notices: Maximum number of notices (sorted by date) to download.

      • district: Which district to scrape: "New Orleans", "Galveston", "Jacksonville", "Mobile", or "all"; default is "all".

      • tbl_to_upload: Which table(s) to upload to Redivis: any of the tables in the list ["main", "manager", "character", "mitigation", "location", "fulltext", "summary", "wetland", "embed", "validation", "aws", "geocoded"], "none", or "all"; default is "all".

      • price_cap: Price cap (in $) for Azure summarization; default is 5.

      • n_sentences: Number of sentences in each summary; default is 4.

      • directory: File directory; default is "data_schema/".

      • overwrite_redivis: Overwrite files with the same name on Redivis: 1, yes; 0, no; default is 0 (no).

      • skipPaid: Skip paid services, including OpenAI and Azure summaries: 1, skip; 0, do not skip; default is 0.

      • tesseract_path: If you have problems running OCR (Optical Character Recognition), specify the path to tesseract.exe, such as "C:/Program Files/Tesseract-OCR/tesseract.exe"; default is None.

      • GPT_MODEL_SET: GPT model to use; default is "gpt-3.5-turbo-0613".
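
Taken together, the parameters above might be collected into a configuration object along these lines. This is a hedged sketch using a dataclass; the class name Config and the max_notices default are assumptions (the field names and other defaults follow the list above), and the actual main.py may structure its configuration differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    # Defaults mirror the parameter list above.
    update: int = 1                  # 1 = recently updated only; 0 = first-time scrape
    n_days: int = 14                 # look-back window, 0 to 500 days
    max_notices: int = 100           # assumed example cap; no default is documented
    district: str = "all"
    tbl_to_upload: str = "all"
    price_cap: float = 5.0           # dollars, for Azure summarization
    n_sentences: int = 4
    directory: str = "data_schema/"
    overwrite_redivis: int = 0
    skipPaid: int = 0
    tesseract_path: Optional[str] = None
    GPT_MODEL_SET: str = "gpt-3.5-turbo-0613"
```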

  2. Run main.py in the virtual environment:

    (venv) $ python main.py
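
At startup, main.py needs the keys from api_key.env available in its environment. A minimal stdlib-only sketch of that loading step (the project's real code may instead use a library such as python-dotenv; the function name here is hypothetical):

```python
import os

def load_env_file(path="api_key.env"):
    """Load simple KEY=VALUE lines into os.environ (hypothetical sketch)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault so variables already in the environment take precedence
            os.environ.setdefault(key.strip(), value.strip())
```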
    

File Descriptions

  • requirements.txt: Lists all Python dependencies required for running the project.
  • Other scripts: workflow

Troubleshooting

  • log.txt: You can find messages, warnings, and errors here.
  • error_report.md: This file captures potential problems with the PDF reading process, special notices, regex patterns, and LLM performance. These issues do not stop the run and are not reported in log.txt.

Contributing

Users are encouraged to report issues directly in the GitHub repository. We plan to maintain this repository at least through 2024. While we welcome pull requests, we cannot guarantee that they will be reviewed or accepted in a timely manner.

Contact

Please reach out to us at [email protected] with any questions or comments.
