Expand documentation #50

Merged 5 commits, Aug 28, 2024
105 changes: 59 additions & 46 deletions README.md
# IATI Tables

IATI Tables transforms IATI data into relational tables.

To access the data, please go to the [website](https://iati-tables.codeforiati.org/), and for more information on how to use the data, please see the [documentation site](https://iati-tables.readthedocs.io/en/latest/).

## How to run the processing job

The processing job is a Python application which downloads the data from the [IATI Data Dump](https://iati-data-dump.codeforiati.org/), transforms the data into tables, and outputs the data in various formats such as CSV, PostgreSQL and SQLite. It is a batch job, designed to be run on a schedule.

### Prerequisites

- postgresql
- sqlite
- zip

### Install Python requirements

```
git clone https://github.com/codeforIATI/iati-tables.git
cd iati-tables
python3 -m venv .ve
source .ve/bin/activate
pip install pip-tools
pip-sync requirements_dev.txt
```

### Set up the PostgreSQL database

Create user `iatitables`:

```
sudo -u postgres psql -c "create user iatitables with password 'PASSWORD_CHANGEME'"
```

Create database `iatitables`:

```
sudo -u postgres psql -c "create database iatitables encoding utf8 owner iatitables"
```

Set the `DATABASE_URL` environment variable:

```
export DATABASE_URL="postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"
```
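The connection URL follows the standard form `postgresql://USER:PASSWORD@HOST/DBNAME`. As a quick illustration (using the sample value above, which is not a real credential), the parts can be inspected with Python's standard library:

```python
from urllib.parse import urlsplit

# Illustrative only: split the sample DATABASE_URL from above into its parts.
url = urlsplit("postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables")

scheme = url.scheme            # "postgresql"
user = url.username            # "iatitables"
host = url.hostname            # "localhost"
dbname = url.path.lstrip("/")  # "iatitables"
```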

### Configure the processing job

The processing job can be configured using the following environment variables:

`DATABASE_URL` (Required)

- The connection URL of the PostgreSQL database to use for the processing job, in the form `postgresql://USER:PASSWORD@HOST/DBNAME`.

`IATI_TABLES_OUTPUT` (Optional)

- The path to output data to. The default is the directory that IATI Tables is run from.

`IATI_TABLES_SCHEMA` (Optional)

- The schema to use in the PostgreSQL database.

`IATI_TABLES_S3_DESTINATION` (Optional)

- By default, IATI Tables will output local files in various formats, e.g. pg_dump, sqlite, and CSV. To additionally upload files to S3, set the environment variable `IATI_TABLES_S3_DESTINATION` with the path to your S3 bucket, e.g. `s3://my_bucket`.
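A hedged sketch of how the variables above could be resolved with their documented defaults (names match this README; `resolve_config` and the exact logic are illustrative, not iatidata's real code):

```python
# Illustrative sketch: resolve the configuration above from a mapping of
# environment variables, applying the documented defaults.
def resolve_config(env: dict) -> dict:
    return {
        "database_url": env["DATABASE_URL"],                           # required
        "output_dir": env.get("IATI_TABLES_OUTPUT", "."),              # default: run directory
        "schema": env.get("IATI_TABLES_SCHEMA"),                       # optional
        "s3_destination": env.get("IATI_TABLES_S3_DESTINATION", "-"),  # "-" means no S3 upload
    }

config = resolve_config(
    {"DATABASE_URL": "postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"}
)
```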

### Run the processing job

```
python -c 'import iatidata; iatidata.run_all(processes=6, sample=50, refresh=False)'
```

Parameters:

- `processes` (`int`, default=`5`): The number of workers to use for parts of the process which are able to run in parallel.
- `sample` (`int`, default=`None`): The number of datasets to process. This is useful for local development because processing the entire data dump can take several hours. A minimum sample size of 50 is recommended, because smaller samples may not contain enough data to dynamically create all required tables (see [issue #10](https://github.com/codeforIATI/iati-tables/issues/10)).
- `refresh` (`bool`, default=`True`): Whether to download the latest data at the start of the processing job. It is useful to set this to `False` when running locally to avoid re-downloading the data every time the process is run.
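How the three parameters interact can be sketched as follows (illustrative only, not iatidata's actual implementation; `plan_run` and the dataset names are made up):

```python
# Illustrative sketch: `refresh` gates the download step, `sample` limits how
# many datasets are processed, and `processes` sizes the worker pool.
def plan_run(datasets, processes=5, sample=None, refresh=True):
    steps = []
    if refresh:
        steps.append("download data dump")
    selected = datasets if sample is None else datasets[:sample]
    steps.append(f"process {len(selected)} datasets with {processes} workers")
    return steps

plan = plan_run(["pub-a", "pub-b", "pub-c"], processes=2, sample=2, refresh=False)
# plan == ["process 2 datasets with 2 workers"]
```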

## How to run linting and formatting

```
isort iatidata/
flake8 iatidata/
mypy iatidata/
```

## How to run unit tests

```
python -m pytest iatidata/
```

## How to run the web front-end

### Prerequisites

- Node.js v20

Change the working directory:

```
cd site
```

Install dependencies:

```
yarn install
```

Start the development server:

```
yarn serve
```

Or, build and view the site in production mode:

```
yarn build
cd dist
python3 -m http.server --bind 127.0.0.1 8000
```

## How to run the documentation

The documentation site is built with Sphinx. For a live preview while writing docs, run the following command and open http://127.0.0.1:8000 in your browser:

```
sphinx-autobuild docs docs/_build/html
```

## Update requirements

```
pip install pip-tools
pip-compile --upgrade
pip-sync requirements.txt
```
16 changes: 14 additions & 2 deletions docs/schema.rst
Data Schema
===========

See the available tables and columns on the `IATI Datasette instance <https://datasette.codeforiati.org/iati>`_.

Global Columns
--------------

The following columns are available in all tables:

:code:`_link`
The primary key for each table.
:code:`_link_activity` or :code:`_link_organisation`
The foreign key to the :code:`activity` or :code:`organisation` table respectively.
:code:`dataset`
The name of the dataset this row came from. This can be used to find the dataset in the IATI registry, using the URL: :code:`https://www.iatiregistry.org/dataset/<DATASET_NAME>`.
:code:`prefix`
The registry publisher ID this row came from. This can be used to find the dataset in the IATI registry, using the URL: :code:`https://www.iatiregistry.org/publisher/<PREFIX>`.
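As an illustration of how the :code:`_link` columns join tables together, here is a small Python sketch using an in-memory SQLite database with made-up rows (the real data comes from the IATI Tables SQLite output; the row values here are hypothetical):

```python
import sqlite3

# Illustrative sketch: join a child table back to activity via _link_activity.
# "transaction" is quoted because it is an SQL keyword.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE activity (_link TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE "transaction" (
        _link TEXT PRIMARY KEY, _link_activity TEXT, value REAL
    );
    INSERT INTO activity VALUES ('1', 'Example activity');
    INSERT INTO "transaction" VALUES ('1.0', '1', 100.0), ('1.1', '1', 250.0);
""")

# Total transaction value per activity, joined on the _link columns.
rows = con.execute("""
    SELECT a.title, SUM(t.value)
    FROM "transaction" t
    JOIN activity a ON t._link_activity = a._link
    GROUP BY a._link
""").fetchall()
# rows == [('Example activity', 350.0)]
```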
14 changes: 11 additions & 3 deletions iatidata/__init__.py
```
schema = os.environ.get("IATI_TABLES_SCHEMA")

s3_destination = os.environ.get("IATI_TABLES_S3_DESTINATION", "-")

output_path = pathlib.Path(output_dir)
```

```
def sql_process():
    try:
        augment_transaction()
        transaction_breakdown()
    except Exception:
        logger.error(
            "Processing on the 'transaction' table failed, this is usually caused by the sample size being too small"
        )
        raise
```


In `upload_all()`, files are now only uploaded when an S3 destination is configured:

```
    ["s3cmd", "setacl", f"{s3_destination}{file}", "--acl-public"],
    check=True,
)
else:
    logger.info("Skipping upload to S3")
```

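The change above makes `-` a sentinel value meaning "do not upload". A minimal sketch of that guard (`should_upload` is an illustrative name, not the real `upload_all` logic):

```python
# Illustrative guard mirroring the change above: "-", the new default for
# IATI_TABLES_S3_DESTINATION, disables the S3 upload entirely.
def should_upload(s3_destination: str) -> bool:
    return s3_destination != "-"
```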