The goal of this project is to take the data exported from Tyrrrz/DiscordChatExporter and load it into a relational database, so that aggregations can be calculated easily and the data can be used in other parts of an ETL pipeline.
- Scraping Discord
- This page explains how to get your own Discord data to feed into this ETL pipeline
- Setup Postgres
- This doc contains instructions to set up and access a local Postgres server
- Setup Postgraphile
- Postgraphile generates and runs a GraphQL API by inspecting a Postgres database
- neo4j Docs
- Explains how to set up neo4j and contains some example queries, including how to reset the database
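As a rough illustration of the goal described at the top, the sketch below flattens a tiny export into one relational table and runs an aggregation over it. It is not this project's actual schema or loader: `sqlite3` stands in for Postgres so the example runs without a server, the `messages` table and the helper names are hypothetical, and the JSON field names (`channel`, `messages`, `author`, `timestamp`, `content`) are assumed from DiscordChatExporter's JSON output.

```python
import json
import sqlite3

# Hypothetical, heavily trimmed export. The field names are assumptions
# based on DiscordChatExporter's JSON format; real exports carry more fields.
SAMPLE_EXPORT = json.loads("""
{
  "channel": {"id": "100", "name": "general"},
  "messages": [
    {"id": "1", "timestamp": "2023-01-01T10:00:00+00:00", "content": "hello",
     "author": {"id": "7", "name": "alice"}},
    {"id": "2", "timestamp": "2023-01-01T10:01:00+00:00", "content": "hi",
     "author": {"id": "8", "name": "bob"}},
    {"id": "3", "timestamp": "2023-01-01T10:02:00+00:00", "content": "how are you?",
     "author": {"id": "7", "name": "alice"}}
  ]
}
""")


def load_messages(conn, export):
    """Flatten one export into a single (hypothetical) messages table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id TEXT PRIMARY KEY,
               channel_id TEXT,
               author_id TEXT,
               author_name TEXT,
               created_at TEXT,
               content TEXT
           )"""
    )
    rows = [
        (
            m["id"],
            export["channel"]["id"],
            m["author"]["id"],
            m["author"]["name"],
            m["timestamp"],
            m["content"],
        )
        for m in export["messages"]
    ]
    # INSERT OR REPLACE keeps the load idempotent if an export is re-run.
    conn.executemany("INSERT OR REPLACE INTO messages VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def messages_per_author(conn):
    """Example aggregation: message count per author, most active first."""
    return conn.execute(
        "SELECT author_name, COUNT(*) FROM messages "
        "GROUP BY author_name ORDER BY COUNT(*) DESC, author_name"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
load_messages(conn, SAMPLE_EXPORT)
print(messages_per_author(conn))  # [('alice', 2), ('bob', 1)]
```

Once the data is in tabular form, the same kind of `GROUP BY` query should carry over to Postgres with little change.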
Transforming the data from DiscordChatExporter
Requirements:
- S3 Bucket loaded with data from DiscordChatExporter
- Postgres database; you can use postgres.dockercompose.yml if you do not have one already set up
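Before running the pipeline, it can help to confirm that the bucket actually contains exports. The sketch below assumes the `boto3` library is available (the pipeline's own S3 client may differ) and uses two hypothetical helpers, `make_s3_client` and `sample_export_keys`. The arguments mirror the S3 variables in the .env file.

```python
def make_s3_client(access_key, secret_key, endpoint_url):
    """Build an S3 client from the values stored in .env.

    boto3 is a third-party package; it is imported lazily so the rest of
    this sketch can be read and tested without it. Whether the pipeline
    itself uses boto3 is an assumption.
    """
    import boto3

    return boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        # An empty endpoint_url falls back to the default AWS endpoint.
        endpoint_url=endpoint_url or None,
    )


def sample_export_keys(client, bucket_name, limit=5):
    """Return up to `limit` object keys from the bucket (empty list if none)."""
    response = client.list_objects_v2(Bucket=bucket_name, MaxKeys=limit)
    # "Contents" is absent from the response when the bucket has no objects.
    return [obj["Key"] for obj in response.get("Contents", [])]


# Example usage (requires real credentials and network access):
#   client = make_s3_client("<access key>", "<secret key>", "<endpoint url>")
#   print(sample_export_keys(client, "<bucket name>"))
```

An empty list suggests the bucket has not been loaded yet, in which case the Scraping Discord page is the place to start.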
Steps:
Set up a Python virtual environment and install requirements.txt
Python 3.10 is the minimum version unless you install the dependencies manually
# install pip
curl https://bootstrap.pypa.io/get-pip.py | python3
python3 -m pip install virtualenv
sudo apt install python3-venv # Debian-based distros
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
Set environment variables using a .env file
cp .env_example .env
$EDITOR .env
Update the environment variables under DB Select and S3, shown below:
# DB Select
db_select='postgres'
db_url='psql://$USER:$PASS@$HOSTNAME:$PORT/$DATABASE_NAME'
# S3
aws_access_key_id=''
aws_secret_access_key=''
endpoint_url=''
bucket_name=''
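A quick sanity check of the .env values can catch empty or malformed settings before the pipeline starts. The sketch below is a minimal, standard-library-only illustration and not how the pipeline itself loads its configuration. The helper names (`parse_env_text`, `missing_keys`, `describe_db_url`) and the credentials in the example are hypothetical.

```python
from urllib.parse import urlsplit

# The variable names listed in the DB Select and S3 sections of .env.
REQUIRED_KEYS = (
    "db_select",
    "db_url",
    "aws_access_key_id",
    "aws_secret_access_key",
    "endpoint_url",
    "bucket_name",
)


def parse_env_text(text):
    """Parse simple KEY='value' lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values):
    """Return the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def describe_db_url(db_url):
    """Split db_url into its parts so typos are easy to spot."""
    parts = urlsplit(db_url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Made-up values for illustration; the S3 fields are intentionally left empty.
example = """
# DB Select
db_select='postgres'
db_url='psql://etl_user:secret@localhost:5432/discord'
# S3
aws_access_key_id=''
aws_secret_access_key=''
endpoint_url=''
bucket_name=''
"""

values = parse_env_text(example)
print(missing_keys(values))
print(describe_db_url(values["db_url"]))
```

To check a real file, pass the contents of .env to `parse_env_text` and confirm that `missing_keys` returns an empty list.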
Run the ETL pipeline. Consider running it inside tmux so it keeps going if your terminal session closes.
# Using Bash
source env/bin/activate
python3 run_dag.py &
cat *.log