diff --git a/README.md b/README.md
index 20c09a6..245212c 100644
--- a/README.md
+++ b/README.md
@@ -1,62 +1,81 @@
 # IATI Tables

-## Documentation
+IATI Tables transforms IATI data into relational tables.

-https://iati-tables.readthedocs.io/en/latest/
+To access the data, please go to the [website](https://iati-tables.codeforiati.org/), and for more information on how to use the data, please see the [documentation site](https://iati-tables.readthedocs.io/en/latest/).

-## Installation
+## How to run the processing job

-### Backend Python code (batch job)
+The processing job is a Python application which downloads the data from the [IATI Data Dump](https://iati-data-dump.codeforiati.org/), transforms the data into tables, and outputs the data in various formats such as CSV, PostgreSQL and SQLite. It is a batch job, designed to be run on a schedule.
+
+### Prerequisites
+
+- PostgreSQL
+- SQLite
+- zip
+
+### Install Python requirements

 ```
-git clone https://github.com/codeforIATI/iati-tables.git
-cd iati-tables
 python3 -m venv .ve
 source .ve/bin/activate
-pip install -r requirements_dev.txt
+pip install pip-tools
+pip-sync requirements_dev.txt
 ```

-Install postgres, sqlite and zip. e.g. on Ubuntu:
+### Set up the PostgreSQL database
+
+Create user `iatitables`:

 ```
-sudo apt install postgresql sqlite3 zip
+sudo -u postgres psql -c "create user iatitables with password 'PASSWORD_CHANGEME'"
 ```

-Create a iatitables user and database:
+Create database `iatitables`:

 ```
-sudo -u postgres psql -c "create user iatitables with password 'PASSWORD_CHANGEME'"
 sudo -u postgres psql -c "create database iatitables encoding utf8 owner iatitables"
 ```

-Run the code:
+Set the `DATABASE_URL` environment variable:

 ```
 export DATABASE_URL="postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"
-export IATI_TABLES_S3_DESTINATION=-
-export IATI_TABLES_SCHEMA=iati
-python -c 'import iatidata; iatidata.run_all(processes=6, sample=50)'
 ```

-Run with refresh=False to avoid fetching all the data every time it's run. This
-is very useful for quicker debugging.
+### Configure the processing job

-```
-python -c 'import iatidata; iatidata.run_all(processes=6, sample=50, refresh=False)'
-```
+The processing job can be configured using the following environment variables:
+
+`DATABASE_URL` (Required)
+
+- The PostgreSQL database to use for the processing job.
+
+`IATI_TABLES_OUTPUT` (Optional)
+
+- The path to output data to. The default is the directory that IATI Tables is run from.
+
+`IATI_TABLES_SCHEMA` (Optional)
+
+- The schema to use in the PostgreSQL database.
+
+`IATI_TABLES_S3_DESTINATION` (Optional)

-`processes` is the number of processes spawned, and `sample` is the number of
-publishers data processed. A sample size of 50 is pretty quick and generally
-works. Smaller sample sizes, e.g. 1 fail because not all tables get created,
-see https://github.com/codeforIATI/iati-tables/issues/10
+- By default, IATI Tables will output local files in various formats, e.g. pg_dump, SQLite, and CSV. To additionally upload files to S3, set the environment variable `IATI_TABLES_S3_DESTINATION` to the path of your S3 bucket, e.g. `s3://my_bucket`.

-Running the tests:
+### Run the processing job

 ```
-python -m pytest iatidata/
+python -c 'import iatidata; iatidata.run_all(processes=6, sample=50, refresh=False)'
 ```

-Linting:
+Parameters:
+
+- `processes` (`int`, default=`5`): The number of workers to use for the parts of the process which can run in parallel.
+- `sample` (`int`, default=`None`): The number of datasets to process. This is useful for local development, because processing the entire data dump can take several hours. A minimum sample size of 50 is recommended, because enough data is needed to dynamically create all of the required tables (see https://github.com/codeforIATI/iati-tables/issues/10).
+- `refresh` (`bool`, default=`True`): Whether to download the latest data at the start of the processing job. It is useful to set this to `False` when running locally, to avoid re-downloading the data every time the process is run.
+
+## How to run linting and formatting

 ```
 isort iatidata/
@@ -65,25 +84,27 @@ flake8 iatidata/
 mypy iatidata/
 ```

-### Web front-end
-
-Install Node JS 20. e.g. on Ubuntu:
+## How to run unit tests

 ```
-curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
-sudo apt install nodejs
+python -m pytest iatidata/
 ```

-Install yarn:
+## How to run the web front-end
+
+### Prerequisites
+
+- Node.js v20
+
+Change the working directory:

 ```
-sudo npm install -g yarn
+cd site
 ```

 Install dependencies:

 ```
-cd site
 yarn install
 ```

@@ -93,7 +114,7 @@ Start the development server:

 ```
 yarn serve
 ```

-Build and view the site:
+Or, build and view the site in production mode:

 ```
 yarn build
@@ -101,18 +122,10 @@ cd site/dist
 python3 -m http.server --bind 127.0.0.1 8000
 ```

-### Docs
+## How to run the documentation

-For live preview while writing docs, run the following command and go to http://127.0.0.1:8000
+The documentation site is built with Sphinx. To view the live preview locally, run the following command:

 ```
 sphinx-autobuild docs docs/_build/html
 ```
-
-## Update requirements
-
-```
-pip install pip-tools
-pip-compile --upgrade
-pip-sync requirements.txt
-```
diff --git a/docs/schema.rst b/docs/schema.rst
index b14461f..1d2aad9 100644
--- a/docs/schema.rst
+++ b/docs/schema.rst
@@ -2,6 +2,18 @@ Data Schema
 ===========

-The `IATI Tables homepage `_ shows a list of the tables and columns available, with descriptions.
+See the available tables and columns on the `IATI Datasette instance `_.

-The :code:`_link` column acts as a primary key for each table. The :code:`_link_activity` column acts as a foreign key back to the :code:`activity` table.
+Global Columns
+--------------
+
+The following columns are available in all tables:
+
+:code:`_link`
+  The primary key for each table.
+:code:`_link_activity` or :code:`_link_organisation`
+  The foreign key to the :code:`activity` or :code:`organisation` table respectively.
+:code:`dataset`
+  The name of the dataset this row came from. This can be used to find the dataset in the IATI registry, using the URL: :code:`https://www.iatiregistry.org/dataset/`.
+:code:`prefix`
+  The registry publisher ID this row came from. This can be used to find the publisher in the IATI registry, using the URL: :code:`https://www.iatiregistry.org/publisher/`.
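As a quick illustration of the `_link` columns documented in the schema change above, the following sketch joins a child table back to `activity` via `_link_activity`, using an in-memory SQLite database. The toy schema is purely illustrative (the `iatiidentifier` and `value` columns, and the sample values, are assumptions; the real export has many more tables and columns):

```python
import sqlite3

# Build a tiny stand-in for the SQLite export. Only the _link columns
# mirror the documented schema; everything else is illustrative.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE activity (_link TEXT PRIMARY KEY, iatiidentifier TEXT);
    CREATE TABLE "transaction" (_link TEXT PRIMARY KEY, _link_activity TEXT, value REAL);
    INSERT INTO activity VALUES ('1', 'XM-EX-1');
    INSERT INTO "transaction" VALUES ('1.transaction.0', '1', 100.0);
    """
)

# _link is the primary key of each table; _link_activity points back to
# the parent activity row, so child tables join on it.
rows = con.execute(
    """
    SELECT a.iatiidentifier, t.value
    FROM "transaction" t
    JOIN activity a ON t._link_activity = a._link
    """
).fetchall()
print(rows)  # [('XM-EX-1', 100.0)]
```

Note that `transaction` is a reserved word in SQLite, so the table name must be quoted in queries.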
diff --git a/iatidata/__init__.py b/iatidata/__init__.py
index 7b81d8c..b39adc9 100644
--- a/iatidata/__init__.py
+++ b/iatidata/__init__.py
@@ -45,7 +45,7 @@ schema = os.environ.get("IATI_TABLES_SCHEMA")

-s3_destination = os.environ.get("IATI_TABLES_S3_DESTINATION", "s3://iati/")
+s3_destination = os.environ.get("IATI_TABLES_S3_DESTINATION", "-")

 output_path = pathlib.Path(output_dir)
@@ -1271,8 +1271,14 @@ def transaction_breakdown():


 def sql_process():
-    augment_transaction()
-    transaction_breakdown()
+    try:
+        augment_transaction()
+        transaction_breakdown()
+    except Exception:
+        logger.error(
+            "Processing on the 'transaction' table failed; this is usually caused by the sample size being too small"
+        )
+        raise


 def export_stats():
@@ -1615,6 +1621,8 @@ def upload_all():
             ["s3cmd", "setacl", f"{s3_destination}{file}", "--acl-public"],
             check=True,
         )
+    else:
+        logger.info("Skipping upload to S3")


 def run_all(
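The change to `s3_destination` above makes `-` the default, acting as a sentinel value meaning "do not upload to S3", which the new `else` branch in `upload_all` then logs. A minimal sketch of that guard pattern (the `should_upload` helper is hypothetical; the real check presumably lives in `upload_all`, which shells out to `s3cmd`):

```python
import os


def should_upload(env: dict) -> bool:
    # "-" is the sentinel default: no S3 destination configured.
    s3_destination = env.get("IATI_TABLES_S3_DESTINATION", "-")
    return s3_destination != "-"


print(should_upload({}))  # False: default sentinel skips the upload
print(should_upload({"IATI_TABLES_S3_DESTINATION": "s3://my_bucket"}))  # True
```

Using a non-empty sentinel (rather than an empty string) keeps the value safe to interpolate into shell commands and log messages without extra `None` handling.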