pycontw/pycon-etl

PyConTW ETL

Using Airflow to implement our ETL pipelines.

Prerequisites

Installation

There are several tools available to create a virtual environment in Python.

Below are the steps to manage a virtual environment using venv:

  1. Create a Virtual Environment

    To create a virtual environment, run the following command:

    python -m venv venv

    In this example, venv is the name of the virtual environment directory, but you can replace it with any name you prefer.

  2. Activate the Virtual Environment

    After creating the virtual environment, activate it using the following command:

    source venv/bin/activate

  3. Install Dependencies

    After activating the virtual environment, you can install the required dependencies:

    # Install airflow and dev dependencies
    pip install -r requirements.txt -r requirements-dev.txt -c constraints-3.8.txt
    
    # black conflicts with click, so install both separately with pinned versions
    pip install black==19.10b0 click==7.1.2

  4. Deactivate the Virtual Environment

    When you're done working in the virtual environment, you can deactivate it with:

    deactivate
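
The install step above assumes the virtual environment is active. As a minimal sketch, a guard like the following could refuse to run pip otherwise. The function name require_venv is hypothetical and not part of this repository; the check relies on the VIRTUAL_ENV variable, which venv's activate script exports and deactivate unsets.

```shell
# Hypothetical helper, not part of this repository.
# Succeeds only when a virtual environment has been activated in this shell.
require_venv() {
    # venv's activate script exports VIRTUAL_ENV; deactivate unsets it again.
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "No virtual environment is active; run: source venv/bin/activate" >&2
        return 1
    fi
    echo "Using virtual environment: $VIRTUAL_ENV"
}

# Example (commented out so that sourcing this file installs nothing):
# require_venv && pip install -r requirements.txt -r requirements-dev.txt -c constraints-3.8.txt
```

The guard only inspects the environment variable, so it stays cheap enough to call before every install command.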

Configuration

  1. For development or testing, run cp .env.template .env.staging. For production, run cp .env.template .env.production.

  2. Follow the instructions in .env.<staging|production> and fill in your secrets. If you are running the staging instance for development as a sandbox and do not need to access any specific third-party services, leaving .env.staging as-is should be fine.

Contact the maintainer if you don't have these secrets.

⚠ WARNING: About .env

Please do not use the .env file for local development, as it might affect the production tables.
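
The copy step in this section can be sketched as a small helper that only accepts the two documented targets and never overwrites a file you have already filled in. The function name init_env_file is hypothetical; the file names come from the steps above.

```shell
# Hypothetical helper, not part of this repository.
# Creates .env.staging or .env.production from .env.template.
init_env_file() {
    # Accept only the two documented targets, so a plain .env (or any
    # other name) is never created by accident.
    case "$1" in
        staging|production) ;;
        *)
            echo "usage: init_env_file <staging|production>" >&2
            return 2
            ;;
    esac
    target=".env.$1"
    # Keep an existing file so secrets that were already filled in survive.
    if [ -e "$target" ]; then
        echo "$target already exists; leaving it untouched" >&2
        return 0
    fi
    cp .env.template "$target"
}

# Example (run from the repository root):
# init_env_file staging
```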

BigQuery (Optional)

Set up the Authentication for GCP: https://googleapis.dev/python/google-api-core/latest/auth.html

  • After running gcloud auth application-default login, you will get a credentials file located at $HOME/.config/gcloud/application_default_credentials.json. Run export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json" if you have it.
  • service-account.json: Please contact @david30907d via email or Discord. You do not need this JSON file if you are running the sandbox staging instance for development.
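
Before starting a task that talks to BigQuery, it can help to confirm which credentials file would be picked up. The following is a minimal sketch, assuming the usual lookup order of GOOGLE_APPLICATION_CREDENTIALS first and the gcloud default file second. The function name find_gcp_credentials is hypothetical, and the check only confirms that a file exists, not that its contents are valid.

```shell
# Hypothetical helper, not part of this repository.
# Prints the credentials file that would be used, or fails if none exists.
find_gcp_credentials() {
    if [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        # An explicitly exported key file takes precedence.
        candidate=$GOOGLE_APPLICATION_CREDENTIALS
    else
        # Default location written by: gcloud auth application-default login
        candidate="$HOME/.config/gcloud/application_default_credentials.json"
    fi
    if [ -f "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        echo "credentials file not found: $candidate" >&2
        return 1
    fi
}

# Example:
# find_gcp_credentials || echo "run: gcloud auth application-default login"
```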

Running the Project

If you are a developer 👨‍💻, please check the Contributing Guide.

If you are a maintainer 👨‍🔧, please check the Maintenance Guide.

Local Environment with Docker

For development/testing:

# Build the local dev/test image
make build-dev

# Start dev/test services
make deploy-dev

# Stop dev/test services
make down-dev

The difference between production and dev/test compose files is that the dev/test compose file uses a locally built image, while the production compose file uses the image from Docker Hub.

If you are an authorized maintainer, you can pull the image from the GCP Artifact Registry.

Your Docker client must first be configured to use the GCP Artifact Registry:

gcloud auth configure-docker asia-east1-docker.pkg.dev

Then, pull the image:

docker pull asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl:{tag}

There are several tags available:

  • cache: cache the image for faster deployment
  • test: for testing purposes, including the test dependencies
  • staging: when pushing to the staging environment
  • latest: when pushing to the production environment
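
To avoid typos in the {tag} placeholder, the image reference can be assembled by a small function that accepts only the four tags listed above. This is a sketch: the names REGISTRY_IMAGE and image_ref are hypothetical, while the registry path is copied from the pull command above.

```shell
# Hypothetical helper, not part of this repository.
REGISTRY_IMAGE="asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl"

# Prints the full image reference for a documented tag, or fails for any other tag.
image_ref() {
    case "$1" in
        cache|test|staging|latest)
            printf '%s:%s\n' "$REGISTRY_IMAGE" "$1"
            ;;
        *)
            echo "unknown tag: $1 (expected cache, test, staging, or latest)" >&2
            return 1
            ;;
    esac
}

# Example (requires the gcloud auth configure-docker step above):
# docker pull "$(image_ref staging)"
```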

Production

Please check the Production Deployment Guide.

Contact

PyCon TW Volunteer Data Team - Discord
