Standalone client to scrape lobbyist disclosure pages from www.sec.state.ma.us/LobbyistPublicSearch/ and upload them to a Postgres database.

The `docker-compose.yml` file configures a Postgres database and a Python container for the scraper. Note that the container runs as root and will change the file permissions on the files it writes. You can run `sudo chown -R $(id -u):$(id -g) .` to reset permissions.
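For orientation, here is a minimal sketch of what such a `docker-compose.yml` might look like. The service name `lobby` matches the commands below, but the image tag, volume names, and mount paths are assumptions for illustration, not the project's actual file:

```yaml
services:
  db:
    image: postgres:15            # assumed version
    env_file: .env                # credentials described below
    volumes:
      - pgdata:/var/lib/postgresql/data   # deleted by `docker compose down -v`
  lobby:
    build: .                      # Python scraper image built from this repo
    volumes:
      - .:/app                    # mounts the source tree, so edits are reflected live
    depends_on:
      - db

volumes:
  pgdata:
```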
- Install Docker and Docker Compose v2.
- Build the images with `docker compose build`.
- Start the services with `docker compose up -d`. This will return once they're up.
- Open a shell into the Python container with `docker compose exec lobby bash`. This gives you a terminal into the development environment, connected to your source directory, so it will reflect changes you make.
- Run your scraper commands, e.g. `poetry run python main.py`.
- Shut down the services with `docker compose down`. Add `-v` to also delete the database volume.
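The scraper itself lives in `main.py` and its internals aren't shown here, but a disclosure page is essentially HTML tables, so the parsing step might look something like this stdlib-only sketch (the `DisclosureParser` class and the sample HTML fragment are hypothetical, not taken from the project):

```python
from html.parser import HTMLParser

class DisclosureParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell of the current row
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Hypothetical fragment of a disclosure table:
sample = "<table><tr><td>Acme Corp</td><td>$1,000</td></tr></table>"
parser = DisclosureParser()
parser.feed(sample)
print(parser.rows)  # [['Acme Corp', '$1,000']]
```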
Put credentials for the Postgres database into `.env`. The credentials for the cloud database are stored in the AWS console:

```
DB_HOST="localhost"
DB_PORT="5432"
DB_USER="postgres"
DB_PASSWORD="password"
DB_NAME="lobbying"
```
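These variables can be read straight from the environment. A minimal sketch using `os.environ`, with defaults mirroring the local values above (in the real project the `.env` file is presumably loaded by Docker Compose or a library such as python-dotenv):

```python
import os

def connection_params() -> dict:
    """Build Postgres connection parameters from the DB_* environment variables."""
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "port": int(os.environ.get("DB_PORT", "5432")),
        "user": os.environ.get("DB_USER", "postgres"),
        "password": os.environ.get("DB_PASSWORD", "password"),
        "dbname": os.environ.get("DB_NAME", "lobbying"),
    }

params = connection_params()
```

The resulting dict can be passed directly to a driver such as psycopg2 via `psycopg2.connect(**params)`.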
Then run this command to create a file with the contents of the table (this example produces zipped JSON via the `--format json` flag):

```
poetry run python dump_lobbying_site_view.py --format json lobbying_view.json.zip
```
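`dump_lobbying_site_view.py` is part of the repo and not reproduced here, but the core of such a dump script can be sketched with the standard library alone. This illustration runs against an in-memory SQLite table rather than the real Postgres view, and the table name and columns are made up:

```python
import json
import sqlite3
import zipfile

def dump_table_to_json_zip(conn, table, zip_path):
    """Query every row of `table` and write it as a JSON array inside a zip archive."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{table}.json", json.dumps(rows, indent=2))

# Illustration with a throwaway SQLite table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lobbying_view (lobbyist TEXT, amount REAL)")
conn.execute("INSERT INTO lobbying_view VALUES ('Acme Corp', 1000.0)")
dump_table_to_json_zip(conn, "lobbying_view", "lobbying_view.json.zip")
```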