This repository has been archived by the owner on Dec 21, 2018. It is now read-only.

Extract, Transform and Load processes for loading grant data into Beehive Data

TechforgoodCAST/beehive-data-etl

Beehive Data

Status

CircleCI

Initial setup

  • Set up MongoDB and get it running
  • Set up a virtual environment and activate it
  • Install the requirements with pip install -r requirements.txt
  • Install the application package with pip install -e .
  • Set the environment variable FLASK_APP to beehivedata.beehivedata
  • Initialise the database by running flask init_db
  • Fetch charity data by running flask fetch_charities and then flask import_charities
  • Fetch the grants data using flask fetch_all
  • Run the server with flask run

External libraries used

(Installed through requirements.txt)

Run development server

Run flask run from the command line.

For development/debug mode, set the FLASK_DEBUG environment variable to 1.

Fetch charity data

Charity data can be downloaded using the flask fetch_charities command, and then imported into MongoDB by running flask import_charities. The data comes from:

When fetching data on Scottish charities you'll need to agree to the terms and conditions.

Fetch 360Giving data

This step can either be run in one go using flask fetch_all, or in the individual steps shown below. You can also run all the update procedures without fetching new data by running flask update_all.

1. Download published data and import into Mongo

This command fetches the data registry and saves it to a MongoDB database. It then goes through the data registry, downloads each file, converts it to JSON (if needed) and saves all the grants in the database.

Run the command using:

$ flask fetch_data

The command can also be run to just fetch the files that have been updated since a given date:

$ flask fetch_data --files-since 2017-01-01

You can also restrict the download to particular funders, using a comma-separated list of funder prefixes, slugs or names. For example:

$ flask fetch_data --funders 360G-ocf

The command line options for this are:

  • --files-since: fetch only files updated after this date (in YYYY-MM-DD format, default all files)
  • --funders: only fetch these funders (list of funder prefixes separated by comma, default all funders)
  • --registry: where to find the data registry (default http://data.threesixtygiving.org/data.json)
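
The two filtering options can be pictured as filters applied to the registry entries before anything is downloaded. The following is a minimal sketch only, not the project's implementation: it assumes each registry entry is a dict with a modified timestamp and a publisher holding a prefix and a name, and the helper names parse_funders and select_files are hypothetical.

```python
from datetime import date, datetime


def parse_funders(value):
    """Split a comma-separated --funders value into a set of lowercase keys."""
    if not value:
        return set()
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def select_files(registry, files_since=None, funders=None):
    """Return the registry entries that pass both optional filters.

    registry    -- list of dicts (assumed shape: "modified" and "publisher")
    files_since -- "YYYY-MM-DD" string, or None to keep all files
    funders     -- comma-separated string, or None to keep all funders
    """
    since = None
    if files_since:
        since = datetime.strptime(files_since, "%Y-%m-%d").date()
    wanted = parse_funders(funders)

    selected = []
    for entry in registry:
        # Compare only the date part of the (assumed ISO 8601) timestamp.
        modified = date.fromisoformat(entry["modified"][:10])
        if since and modified <= since:
            continue  # keep only files updated after the given date
        publisher = entry["publisher"]
        keys = {publisher["prefix"].lower(), publisher["name"].lower()}
        if wanted and not (wanted & keys):
            continue  # funder not in the requested list
        selected.append(entry)
    return selected
```

Matching here is case-insensitive and covers prefixes and names only; matching on slugs would need the slug rule the project uses, which is not shown in this document.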

2. Update organisation and charity details

These two steps update the organisations in the data. They are run using:

$ flask update_organisations
$ flask update_charity

update_organisations tries to guess the organisation type of the recipient organisation and applies the Beehive codes to it. It also processes each grant using the same function as fetch_data, so it can be useful to rerun when you don't want to fetch all the data again.

update_charities gets data about the recipient from the charities MongoDB collection. It then tries to work out the type of organisation, how long they have operated for, and get the latest financial information.

Note: this stage allows for multiple recipients, but the end result only outputs the first recipient.

@todo: Add in companies data here too.

3. Update beneficiaries

This step uses regexes and other techniques to try to identify the beneficiaries of each grant, including the age range and gender.

$ flask update_beneficiaries
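
As a rough illustration of the regex approach, the sketch below pulls a gender and an age range out of a grant description. The patterns, field names and the guess_beneficiaries helper are hypothetical and are not taken from the project's code, which may use quite different rules.

```python
import re

# Hypothetical patterns for illustration only.
GENDER_PATTERNS = {
    "female": re.compile(r"\b(women|woman|girls?|female)\b", re.IGNORECASE),
    "male": re.compile(r"\b(men|man|boys?|male)\b", re.IGNORECASE),
}

# Matches phrases such as "aged 11-18" or "ages 5 to 11".
AGE_RANGE = re.compile(
    r"\bage[sd]?\s+(\d{1,2})\s*(?:-|to)\s*(\d{1,3})\b", re.IGNORECASE
)


def guess_beneficiaries(text):
    """Return the genders and age range mentioned in a grant description."""
    genders = sorted(
        name for name, pattern in GENDER_PATTERNS.items() if pattern.search(text)
    )
    match = AGE_RANGE.search(text)
    age_range = (int(match.group(1)), int(match.group(2))) if match else None
    return {"genders": genders, "age_range": age_range}


result = guess_beneficiaries("Youth club for girls aged 11-18")
# result == {"genders": ["female"], "age_range": (11, 18)}
```

The word boundaries (\b) matter: without them, "men" would match inside "women" and "male" inside "female".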

4. Update geography

This step uses regexes and other techniques to try to identify the countries served by each grant.

$ flask update_geography
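
One plausible way to do this kind of matching is to compile a single regex from a list of country names and map each hit to a country code. The sketch below assumes a tiny hand-written lookup of names to ISO 3166-1 alpha-2 codes; the lookup, the guess_countries helper and the output format are illustrative assumptions, not the project's actual data or code.

```python
import re

# Illustrative subset only; a real lookup would cover every country.
COUNTRIES = {
    "Kenya": "KE",
    "Uganda": "UG",
    "Sierra Leone": "SL",
    "India": "IN",
}

# Longest names first, so a multi-word name wins over any shorter overlap.
_names = sorted(COUNTRIES, key=len, reverse=True)
COUNTRY_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _names) + r")\b",
    re.IGNORECASE,
)
_BY_LOWER = {name.lower(): code for name, code in COUNTRIES.items()}


def guess_countries(text):
    """Return the sorted, de-duplicated country codes mentioned in text."""
    found = {_BY_LOWER[m.group(1).lower()] for m in COUNTRY_RE.finditer(text)}
    return sorted(found)


codes = guess_countries("Clean water projects in Kenya and Sierra Leone")
# codes == ["KE", "SL"]
```

Name matching alone misses grants that mention only a city or region, which is presumably where the "other techniques" come in.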

Deploy to Heroku

The site is designed to be deployed using Heroku. You'll need to run a MongoDB instance and make its connection URI available as the config variable MONGODB_URI.
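
Heroku exposes config variables to the application as environment variables. The sketch below shows one way the connection URI could be read and split into its parts; the mongo_settings helper and the local default URI are assumptions for illustration, and the application's own configuration code may differ.

```python
import os
from urllib.parse import urlparse

# Hypothetical fallback for local development; 27017 is MongoDB's default port.
DEFAULT_URI = "mongodb://localhost:27017/beehive-data"


def mongo_settings(environ=os.environ):
    """Read MONGODB_URI from the environment and split out its parts."""
    uri = environ.get("MONGODB_URI", DEFAULT_URI)
    parsed = urlparse(uri)
    return {
        "uri": uri,
        "host": parsed.hostname,
        "port": parsed.port or 27017,
        "database": parsed.path.lstrip("/") or None,
    }
```

Passing the environment in as an argument keeps the function easy to test without touching the real process environment.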

Run tests

The site uses pytest to run the tests. The test database will be created with a different database name, and then destroyed at the end of every test.

The tests use seed data from tests/seed_data, which is based on actual 360Giving data. Some of the files have been changed to give a wider range of test scenarios.

Run the tests with:

$ python -m pytest tests

The deployed version of the site also has CircleCI integration, so the tests are run after every GitHub commit. The current test status is:

CircleCI
