diff --git a/README.md b/README.md
index 20c09a6..245212c 100644
--- a/README.md
+++ b/README.md
@@ -1,62 +1,81 @@
# IATI Tables
-## Documentation
+IATI Tables transforms IATI data into relational tables.
-https://iati-tables.readthedocs.io/en/latest/
+To access the data, go to the [website](https://iati-tables.codeforiati.org/). For more information on how to use the data, see the [documentation site](https://iati-tables.readthedocs.io/en/latest/).
-## Installation
+## How to run the processing job
-### Backend Python code (batch job)
+The processing job is a Python application which downloads data from the [IATI Data Dump](https://iati-data-dump.codeforiati.org/), transforms it into relational tables, and outputs it in several formats including CSV, PostgreSQL and SQLite. It is a batch job, designed to be run on a schedule.
+
+### Prerequisites
+
+- PostgreSQL
+- SQLite
+- zip
+
+### Install Python requirements
```
-git clone https://github.com/codeforIATI/iati-tables.git
-cd iati-tables
python3 -m venv .ve
source .ve/bin/activate
-pip install -r requirements_dev.txt
+pip install pip-tools
+pip-sync requirements_dev.txt
```
-Install postgres, sqlite and zip. e.g. on Ubuntu:
+### Set up the PostgreSQL database
+
+Create user `iatitables`:
```
-sudo apt install postgresql sqlite3 zip
+sudo -u postgres psql -c "create user iatitables with password 'PASSWORD_CHANGEME'"
```
-Create a iatitables user and database:
+Create database `iatitables`:
```
-sudo -u postgres psql -c "create user iatitables with password 'PASSWORD_CHANGEME'"
sudo -u postgres psql -c "create database iatitables encoding utf8 owner iatitables"
```
-Run the code:
+Set the `DATABASE_URL` environment variable:
```
export DATABASE_URL="postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"
-export IATI_TABLES_S3_DESTINATION=-
-export IATI_TABLES_SCHEMA=iati
-python -c 'import iatidata; iatidata.run_all(processes=6, sample=50)'
```
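+`DATABASE_URL` is a standard PostgreSQL connection URL. As a quick sanity check (a hypothetical snippet, not part of the project), Python's standard library can confirm each component parses as expected:
```python
from urllib.parse import urlparse

# Parse the example connection string used above
url = urlparse("postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables")
print(url.scheme, url.username, url.hostname, url.path.lstrip("/"))
# postgresql iatitables localhost iatitables
```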
-Run with refresh=False to avoid fetching all the data every time it's run. This
-is very useful for quicker debugging.
+### Configure the processing job
-```
-python -c 'import iatidata; iatidata.run_all(processes=6, sample=50, refresh=False)'
-```
+The processing job can be configured using the following environment variables:
+
+`DATABASE_URL` (Required)
+
+- The connection string of the PostgreSQL database to use for the processing job, e.g. `postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables`.
+
+`IATI_TABLES_OUTPUT` (Optional)
+
+- The directory to write output files to. The default is the directory that IATI Tables is run from.
+
+`IATI_TABLES_SCHEMA` (Optional)
+
+- The schema to use in the PostgreSQL database.
+
+`IATI_TABLES_S3_DESTINATION` (Optional)
-`processes` is the number of processes spawned, and `sample` is the number of
-publishers data processed. A sample size of 50 is pretty quick and generally
-works. Smaller sample sizes, e.g. 1 fail because not all tables get created,
-see https://github.com/codeforIATI/iati-tables/issues/10
+- By default, IATI Tables will output local files in various formats, e.g. pg_dump, sqlite, and CSV. To additionally upload files to S3, set the environment variable `IATI_TABLES_S3_DESTINATION` with the path to your S3 bucket, e.g. `s3://my_bucket`.
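+
+Taken together, a typical local configuration might look like the following (values are placeholders; `IATI_TABLES_S3_DESTINATION` is simply left unset so no upload is attempted):
```shell
# Required: PostgreSQL connection string
export DATABASE_URL="postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"
# Optional: directory to write output files to (defaults to the current directory)
export IATI_TABLES_OUTPUT="$PWD/iati-output"
# Optional: PostgreSQL schema to use
export IATI_TABLES_SCHEMA="iati"
```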
-Running the tests:
+### Run the processing job
```
-python -m pytest iatidata/
+python -c 'import iatidata; iatidata.run_all(processes=6, sample=50, refresh=False)'
```
-Linting:
+Parameters:
+
+- `processes` (`int`, default=`5`): The number of workers to use for the parts of the process that can run in parallel.
+- `sample` (`int`, default=`None`): The number of datasets to process. This is useful for local development because processing the entire data dump can take several hours. A minimum sample size of 50 is recommended, because smaller samples may not contain enough data to dynamically create all of the required tables (see https://github.com/codeforIATI/iati-tables/issues/10).
+- `refresh` (`bool`, default=`True`): Whether to download the latest data at the start of the processing job. Set this to `False` when running locally to avoid re-downloading the data on every run.
+
+## How to run linting and formatting
```
isort iatidata/
@@ -65,25 +84,27 @@ flake8 iatidata/
mypy iatidata/
```
-### Web front-end
-
-Install Node JS 20. e.g. on Ubuntu:
+## How to run unit tests
```
-curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
-sudo apt install nodejs
+python -m pytest iatidata/
```
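+
+Pytest discovers files named `test_*.py` and functions named `test_*` under `iatidata/`. A minimal example of the shape of such a test (hypothetical helper and test, not taken from the codebase):
```python
# Hypothetical helper and pytest-style test: plain functions with bare asserts
def normalise_publisher_id(publisher_id: str) -> str:
    """Strip whitespace and lower-case a registry publisher ID."""
    return publisher_id.strip().lower()

def test_normalise_publisher_id():
    assert normalise_publisher_id("  CodeForIATI ") == "codeforiati"
```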
-Install yarn:
+## How to run the web front-end
+
+### Prerequisites
+
+- Node.js v20
+
+Change the working directory:
```
-sudo npm install -g yarn
+cd site
```
Install dependencies:
```
-cd site
yarn install
```
@@ -93,7 +114,7 @@ Start the development server:
yarn serve
```
-Build and view the site:
+Or, build and view the site in production mode:
```
yarn build
@@ -101,18 +122,10 @@ cd site/dist
python3 -m http.server --bind 127.0.0.1 8000
```
-### Docs
+## How to build the documentation
-For live preview while writing docs, run the following command and go to http://127.0.0.1:8000
+The documentation site is built with Sphinx. For a live preview while writing docs, run the following command and go to http://127.0.0.1:8000:
```
sphinx-autobuild docs docs/_build/html
```
-
-## Update requirements
-
-```
-pip install pip-tools
-pip-compile --upgrade
-pip-sync requirements.txt
-```
diff --git a/docs/schema.rst b/docs/schema.rst
index b14461f..1d2aad9 100644
--- a/docs/schema.rst
+++ b/docs/schema.rst
@@ -2,6 +2,18 @@
Data Schema
===========
-The `IATI Tables homepage `_ shows a list of the tables and columns available, with descriptions.
+See the available tables and columns on the `IATI Datasette instance `_.
-The :code:`_link` column acts as a primary key for each table. The :code:`_link_activity` column acts as a foreign key back to the :code:`activity` table.
+Global Columns
+--------------
+
+The following columns are available in all tables:
+
+:code:`_link`
+ The primary key for each table.
+:code:`_link_activity` or :code:`_link_organisation`
+ The foreign key to the :code:`activity` or :code:`organisation` table respectively.
+:code:`dataset`
+  The name of the dataset this row came from. This can be used to find the dataset in the IATI Registry, by appending the dataset name to :code:`https://www.iatiregistry.org/dataset/`.
+:code:`prefix`
+  The registry publisher ID this row came from. This can be used to find the publisher in the IATI Registry, by appending the publisher ID to :code:`https://www.iatiregistry.org/publisher/`.
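For example, rows in a child table can be joined back to their parent activity through :code:`_link_activity`. A sketch of the join pattern using :code:`sqlite3` with made-up rows (the table is named :code:`transaction_` here purely to avoid the SQL keyword; real table contents will differ):

.. code-block:: python

    import sqlite3

    # Toy schema: names only illustrate the _link/_link_activity pattern
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE activity (_link TEXT, iatiidentifier TEXT);
        CREATE TABLE transaction_ (_link TEXT, _link_activity TEXT, value REAL);
        INSERT INTO activity VALUES ('1', 'XM-EXAMPLE-1');
        INSERT INTO transaction_ VALUES ('1.0', '1', 100.0), ('1.1', '1', 250.0);
        """
    )
    # Join each transaction back to its activity via the _link_activity foreign key
    rows = con.execute(
        "SELECT a.iatiidentifier, SUM(t.value) "
        "FROM transaction_ t JOIN activity a ON t._link_activity = a._link "
        "GROUP BY a._link"
    ).fetchall()
    print(rows)  # [('XM-EXAMPLE-1', 350.0)]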
diff --git a/iatidata/__init__.py b/iatidata/__init__.py
index 7b81d8c..b39adc9 100644
--- a/iatidata/__init__.py
+++ b/iatidata/__init__.py
@@ -45,7 +45,7 @@
schema = os.environ.get("IATI_TABLES_SCHEMA")
-s3_destination = os.environ.get("IATI_TABLES_S3_DESTINATION", "s3://iati/")
+s3_destination = os.environ.get("IATI_TABLES_S3_DESTINATION", "-")
output_path = pathlib.Path(output_dir)
@@ -1271,8 +1271,14 @@ def transaction_breakdown():
def sql_process():
- augment_transaction()
- transaction_breakdown()
+ try:
+ augment_transaction()
+ transaction_breakdown()
+ except Exception:
+ logger.error(
+            "Processing on the 'transaction' table failed; this is usually caused by the sample size being too small"
+ )
+ raise
def export_stats():
@@ -1615,6 +1621,8 @@ def upload_all():
["s3cmd", "setacl", f"{s3_destination}{file}", "--acl-public"],
check=True,
)
+ else:
+ logger.info("Skipping upload to S3")
def run_all(