This repository contains ETL (Extract, Transform, Load) functions for collecting and processing sensor data from various sources, including the US Geological Survey (USGS) and South Florida Water Management District (SFWMD). The functions are deployed as AWS Lambda functions, triggered on scheduled intervals to gather and process real-time data from these sensors.
- Introduction
- Lambda Functions
- Project Structure
- ETL Process Overview
- Functions Breakdown
- Environment Variables
- Deployment
This project consists of multiple AWS Lambda functions that extract sensor data from USGS and SFWMD. Each function is triggered by a scheduled CRON job, defined in the serverless.yml configuration file, allowing for periodic data extraction, processing, and storage. The functions are individually packaged to optimize deployment efficiency.
These functions extract data from various USGS sensors at defined intervals:
- etlGroundwaterSensor: Collects groundwater sensor data.
- etlGaugeHeightSensor: Gathers gauge height sensor data.
- etlDischargeRateSensor: Retrieves discharge rate sensor data.
- etlStreamElevationSensor: Fetches stream elevation sensor data.
- etlPrecipitationSensor: Acquires precipitation sensor data.
These functions process and load the data collected from the USGS sensors:
- etlGroundwaterData
- etlGaugeHeightData
- etlDischargeRateData
- etlStreamElevationData
- etlPrecipitationData
These functions handle SFWMD stations and data:
- wmdUploadStations: Uploads SFWMD station data.
- etlWmdTimeseriesData: Extracts time series data from SFWMD.
- etlWmdAggregateData: Processes and aggregates daily mean data.
The project is organized into the following directories, which include utilities, services, and ETL functions for handling data from various sources:
src/
├── library/
│   └── data/                 # Utilities handling data operations
├── services/                 # HTTP request methods
└── functions/
    ├── etl-data/
    │   ├── sfwmd/
    │   │   ├── aggregate/    # ETL for aggregated data
    │   │   └── timeseries/   # ETL for time series data
    │   └── usgs/             # ETL for USGS data
    └── etl-sensors/
        ├── sfwmd/            # ETL for SFWMD sensors
        └── usgs/             # ETL for USGS sensors
The ETL process for sensors and data consists of the following steps:
- Extract: Retrieve sensor metadata and data from SFWMD and USGS APIs.
- Transform: Process, filter, and structure the data for the target database.
- Load: Push the transformed data into the database or other storage solutions.
- Sensor ETL: Fetches existing stations (sensors) from the storage system, compares them against incoming data, and uploads new or updated stations.
- Data ETL: Aggregates or retrieves timeseries observations for these stations, transforms the data into the required format, and stores it in the system.
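The three steps above can be sketched as a small runner that receives its extract, transform, and load steps as arguments. This is only an illustration: `runEtl`, `exampleSteps`, and the record fields (`site`, `value`, `stationId`) are hypothetical names, not taken from the repository, and the in-memory array stands in for the real APIs and database.

```javascript
// Hypothetical sketch: none of these names come from the repository.
async function runEtl({ extract, transform, load }) {
  const raw = await extract(); // Extract: pull raw records from a source
  const rows = raw
    .map((record) => transform(record)) // Transform: reshape each record
    .filter((row) => row !== null); // drop records the transform rejected
  const loaded = await load(rows); // Load: push the rows to storage
  return { extracted: raw.length, loaded };
}

// Example steps backed by in-memory data instead of real APIs.
const database = [];
const exampleSteps = {
  extract: async () => [
    { site: "A1", value: "2.5" },
    { site: "B2", value: "n/a" }, // unparsable reading
  ],
  transform: (record) => {
    const value = Number.parseFloat(record.value);
    return Number.isNaN(value) ? null : { stationId: record.site, value };
  },
  load: async (rows) => {
    database.push(...rows);
    return rows.length;
  },
};

runEtl(exampleSteps).then((summary) => console.log(summary));
// { extracted: 2, loaded: 1 }
```

Passing the steps in as functions keeps the runner independent of any particular source, so the same shape could serve both the sensor ETL and the data ETL.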
The sensor ETL functions manage the metadata associated with water monitoring stations (sensors). They ensure that sensor metadata is handled and updated correctly in the system.
- File: etl-sensors/sfwmd/model.js
  This file defines the logic for handling SFWMD sensor stations, specifically for extracting, comparing, and loading station metadata.
  Methods:
  - getExistingStations(): Retrieves existing sensor stations from the storage.
  - compareStations(): Compares incoming stations with the existing ones to check for duplicates.
  - loadStation(): Uploads new stations to the storage if they don’t already exist.
- File: etl-sensors/sfwmd/index.js
  The entry point for processing SFWMD stations, which calls the methods from the UploadStations class to handle the stations in a batch.
  Process:
  - Fetches existing stations.
  - Compares new station data to avoid duplicates.
  - Uploads new stations.
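As a rough illustration of the fetch, compare, and upload flow, the sketch below uses the three method names listed above. Everything else is an assumption: the class name `UploadStationsSketch`, the `run()` helper, the `stationId` field, and the use of a `Map` in place of the real storage. It only handles new stations, not updated ones, and the actual UploadStations class may look quite different.

```javascript
// Hypothetical sketch of the sensor ETL flow; the real UploadStations class may differ.
class UploadStationsSketch {
  constructor(storage) {
    this.storage = storage; // assumed shape: Map keyed by stationId
  }

  // Returns the stations already held in storage.
  getExistingStations() {
    return [...this.storage.values()];
  }

  // Keeps only incoming stations whose stationId is not stored yet.
  compareStations(incoming, existing) {
    const known = new Set(existing.map((s) => s.stationId));
    return incoming.filter((s) => !known.has(s.stationId));
  }

  // Stores a station unless it already exists; returns true when it was added.
  loadStation(station) {
    if (this.storage.has(station.stationId)) return false;
    this.storage.set(station.stationId, station);
    return true;
  }

  // Batch entry point: fetch, compare, upload. Returns the ids that were added.
  run(incoming) {
    const existing = this.getExistingStations();
    const fresh = this.compareStations(incoming, existing);
    return fresh.filter((s) => this.loadStation(s)).map((s) => s.stationId);
  }
}

const storage = new Map([["S-1", { stationId: "S-1", name: "Canal A" }]]);
const uploader = new UploadStationsSketch(storage);
const added = uploader.run([
  { stationId: "S-1", name: "Canal A" }, // already stored, skipped
  { stationId: "S-2", name: "Canal B" }, // new, uploaded
]);
console.log(added); // [ 'S-2' ]
```

The duplicate check appears twice on purpose: `compareStations()` filters against what was stored before the batch, while `loadStation()` also guards against the same station appearing twice within one batch.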
The ETL data functions manage the timeseries and aggregate data collected from the sensors. These functions are responsible for pulling raw data, transforming it, and pushing it to the database.
- File: etl-data/sfwmd/aggregate/model.js
  This file handles the process of fetching aggregated data from SFWMD sensors, transforming the data, and loading it into the system.
  Methods:
  - filterStations(): Filters stations based on specific properties (e.g., stationId).
  - checkStoredObservations(): Checks if there are any stored observations for the station.
  - extractStationObservations(): Extracts the latest sensor observations from the external API.
  - transform(): Transforms the raw observation data into the required structure.
  - compareStationObservations(): Compares extracted observations with existing ones.
  - loadObservations(): Loads the new or updated observations into the database.
- File: etl-data/sfwmd/aggregate/index.js
  The entry point for processing SFWMD aggregated data. It coordinates the ETL process, calling various methods to extract, transform, and load data.
  Process:
  - Filters active stations.
  - Extracts observations.
  - Transforms and compares the new observations with the existing data.
  - Loads the updated observations into the system.
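The transform and compare steps might look roughly like the following. The function names match the methods listed above, but their signatures and the record fields (`timestamp`, `value`, `date`, `stationId`) are assumptions made for this sketch; the real SFWMD payload and the stored schema are not described in this document.

```javascript
// Hypothetical sketch: signatures and field names are assumptions.

// Transform: convert a raw API record into the stored shape.
function transform(stationId, raw) {
  return {
    stationId,
    date: raw.timestamp.slice(0, 10), // keep the YYYY-MM-DD part
    value: Number(raw.value),
  };
}

// Compare: return observations that are new or whose value changed.
function compareStationObservations(extracted, stored) {
  const storedByDate = new Map(stored.map((o) => [o.date, o.value]));
  return extracted.filter(
    (o) => !storedByDate.has(o.date) || storedByDate.get(o.date) !== o.value
  );
}

const stored = [{ stationId: "S-1", date: "2024-05-01", value: 3.2 }];
const extracted = [
  { timestamp: "2024-05-01T00:00:00Z", value: "3.2" }, // unchanged
  { timestamp: "2024-05-02T00:00:00Z", value: "3.4" }, // new
].map((raw) => transform("S-1", raw));

const toLoad = compareStationObservations(extracted, stored);
console.log(toLoad.map((o) => o.date)); // [ '2024-05-02' ]
```

Comparing before loading means an unchanged observation is never written twice, which keeps repeated scheduled runs idempotent.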
The .env file is used during development for testing individual functions. It includes basic parameters such as API keys and sensor settings.
API_KEY= # The key required to authenticate requests during development.
SERVER_ENDPOINT= # The URL for the server endpoint you're interacting with.
SENSOR_CODE= # A 5-digit code representing the USGS sensor being used.
HAS_UPSTREAM_DOWNSTREAM= # Boolean indicating whether upstream and downstream data is included.
RECORDS_PERIOD= # The time period for which records are fetched, e.g., 1 day (P1D).
RECORDS_INTERVAL= # The interval between each record fetch, e.g., 15 minutes (PT15M).
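The example values P1D and PT15M are ISO 8601 durations. To make their meaning concrete, the sketch below converts such durations to milliseconds and derives how many records a period contains at a given interval. The helpers `durationToMs` and `expectedRecordCount` are hypothetical, and whether the functions in this repository perform such a calculation is an assumption; the parser deliberately supports only days, hours, minutes, and seconds.

```javascript
// Hypothetical helpers, not part of the repository.

// Parses a simple ISO 8601 duration (days, hours, minutes, seconds only)
// into milliseconds. Years, months, and weeks are deliberately unsupported.
function durationToMs(text) {
  const match = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(text);
  if (!match || text === "P" || text.endsWith("T")) {
    throw new Error(`Unsupported duration: ${text}`);
  }
  const [, days = 0, hours = 0, minutes = 0, seconds = 0] = match;
  const totalSeconds =
    ((Number(days) * 24 + Number(hours)) * 60 + Number(minutes)) * 60 + Number(seconds);
  return totalSeconds * 1000;
}

// Number of records in one period at the given interval.
// Falls back to the example values from the .env description when unset.
function expectedRecordCount(env) {
  const period = env.RECORDS_PERIOD || "P1D";
  const interval = env.RECORDS_INTERVAL || "PT15M";
  return durationToMs(period) / durationToMs(interval);
}

// In a Lambda function, process.env would be passed instead of a literal.
console.log(expectedRecordCount({ RECORDS_PERIOD: "P1D", RECORDS_INTERVAL: "PT15M" })); // 96
```

With a one-day period and a 15-minute interval, 24 × 4 = 96 records are expected per station.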
The .env.yml file is used for defining project-specific parameters and deployment environment settings. This is where you specify AWS regions, area tags, API keys for different environments, and API endpoints.
region: # The AWS region where resources are deployed.
areatag: # An optional tag to identify or categorize specific areas.
dev-apikey: # The API key or secret for the development environment.
prod-apikey: # The API key or secret for the production environment.
dev-serverEndPoint: # The URL of the API gateway endpoint in the development environment.
prod-serverEndPoint: # The URL of the API gateway endpoint in the production environment.
The deployment process for this repository involves creating multiple AWS Lambda functions, each with its own CRON schedule as defined in the serverless.yml file. The deployment is managed using the AWS CLI and the Serverless Framework. Each function is packaged individually for efficient deployment and versioning.
Steps to deploy:
- Install the Serverless Framework and AWS CLI if not already installed.
- Install Node packages:
npm install
- Ensure that the necessary AWS credentials are configured using the serverless profile, or in ~/.aws/credentials.
- Run the following command to deploy all the Lambda functions defined in the serverless.yml file:
  sls deploy --stage <stage-name>
  Or
  npm run deploy-dev
- The Lambda functions will be deployed, and their CRON schedules will be set according to the intervals specified in serverless.yml (e.g., every 12 hours, 15 minutes, etc.).
Each function will be triggered automatically based on its schedule without manual intervention, ensuring timely extraction and processing of sensor data from the USGS and SFWMD systems.