Expand documentation #50

Merged 5 commits, Aug 28, 2024
105 changes: 59 additions & 46 deletions README.md
# IATI Tables

IATI Tables transforms IATI data into relational tables.

To access the data, please go to the [website](https://iati-tables.codeforiati.org/), and for more information on how to use the data, please see the [documentation site](https://iati-tables.readthedocs.io/en/latest/).

## How to run the processing job

The processing job is a Python application which downloads the data from the [IATI Data Dump](https://iati-data-dump.codeforiati.org/), transforms the data into tables, and outputs the data in various formats such as CSV, PostgreSQL and SQLite. It is a batch job, designed to be run on a schedule.

### Prerequisites

- postgresql
- sqlite
- zip

### Install Python requirements

```
git clone https://github.com/codeforIATI/iati-tables.git
cd iati-tables
python3 -m venv .ve
source .ve/bin/activate
pip install pip-tools
pip-sync requirements_dev.txt
```

### Set up the PostgreSQL database

Create user `iatitables`:

```
sudo -u postgres psql -c "create user iatitables with password 'PASSWORD_CHANGEME'"
```

Create database `iatitables`:

```
sudo -u postgres psql -c "create database iatitables encoding utf8 owner iatitables"
```

Set the `DATABASE_URL` environment variable:

```
export DATABASE_URL="postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"
```
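The connection URL follows the standard form `postgresql://USER:PASSWORD@HOST/DBNAME`. As a quick illustration (using the sample value above, which is not a real credential), the parts can be inspected with Python's standard library:

```python
from urllib.parse import urlsplit

# Illustrative only: split the sample DATABASE_URL from above into its parts.
url = urlsplit("postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables")

scheme = url.scheme            # "postgresql"
user = url.username            # "iatitables"
host = url.hostname            # "localhost"
dbname = url.path.lstrip("/")  # "iatitables"
```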

### Configure the processing job

The processing job can be configured using the following environment variables:

`DATABASE_URL` (Required)

- The connection URL of the PostgreSQL database to use for the processing job, in the form `postgresql://USER:PASSWORD@HOST/DBNAME`.

`IATI_TABLES_OUTPUT` (Optional)

- The path to output data to. The default is the directory that IATI Tables is run from.

`IATI_TABLES_SCHEMA` (Optional)

- The schema to use in the PostgreSQL database.

`IATI_TABLES_S3_DESTINATION` (Optional)

- By default, IATI Tables will output local files in various formats, e.g. pg_dump, sqlite, and CSV. To additionally upload files to S3, set the environment variable `IATI_TABLES_S3_DESTINATION` with the path to your S3 bucket, e.g. `s3://my_bucket`.
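A hedged sketch of how the variables above could be resolved with their documented defaults (names match this README; `resolve_config` and the exact logic are illustrative, not iatidata's real code):

```python
# Illustrative sketch: resolve the configuration above from a mapping of
# environment variables, applying the documented defaults.
def resolve_config(env: dict) -> dict:
    return {
        "database_url": env["DATABASE_URL"],                           # required
        "output_dir": env.get("IATI_TABLES_OUTPUT", "."),              # default: run directory
        "schema": env.get("IATI_TABLES_SCHEMA"),                       # optional
        "s3_destination": env.get("IATI_TABLES_S3_DESTINATION", "-"),  # "-" means no S3 upload
    }

config = resolve_config(
    {"DATABASE_URL": "postgresql://iatitables:PASSWORD_CHANGEME@localhost/iatitables"}
)
```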

### Run the processing job

```
python -c 'import iatidata; iatidata.run_all(processes=6, sample=50, refresh=False)'
```

Parameters:

- `processes` (`int`, default=`5`): The number of workers to use for parts of the process which are able to run in parallel.
- `sample` (`int`, default=`None`): The number of datasets to process. This is useful for local development because processing the entire data dump can take several hours. A minimum sample size of 50 is recommended, because smaller samples may not contain enough data to dynamically create all required tables (see [issue #10](https://github.com/codeforIATI/iati-tables/issues/10)).
- `refresh` (`bool`, default=`True`): Whether to download the latest data at the start of the processing job. It is useful to set this to `False` when running locally to avoid re-downloading the data every time the process is run.
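How the three parameters interact can be sketched as follows (illustrative only, not iatidata's actual implementation; `plan_run` and the dataset names are made up):

```python
# Illustrative sketch: `refresh` gates the download step, `sample` limits how
# many datasets are processed, and `processes` sizes the worker pool.
def plan_run(datasets, processes=5, sample=None, refresh=True):
    steps = []
    if refresh:
        steps.append("download data dump")
    selected = datasets if sample is None else datasets[:sample]
    steps.append(f"process {len(selected)} datasets with {processes} workers")
    return steps

plan = plan_run(["pub-a", "pub-b", "pub-c"], processes=2, sample=2, refresh=False)
# plan == ["process 2 datasets with 2 workers"]
```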

## How to run linting and formatting

```
isort iatidata/
flake8 iatidata/
mypy iatidata/
```

## How to run unit tests

```
python -m pytest iatidata/
```

## How to run the web front-end

### Prerequisites

- Node.js v20

Change the working directory:

```
cd site
```

Install dependencies:

```
yarn install
```

Start the development server:

```
yarn serve
```

Or, build and view the site in production mode:

```
yarn build
cd dist
python3 -m http.server --bind 127.0.0.1 8000
```

## How to run the documentation

The documentation site is built with Sphinx. For a live preview while writing docs, run the following command and open http://127.0.0.1:8000 in your browser:

```
sphinx-autobuild docs docs/_build/html
```

## Update requirements

```
pip install pip-tools
pip-compile --upgrade
pip-sync requirements.txt
```
16 changes: 14 additions & 2 deletions docs/schema.rst
Data Schema
===========

See the available tables and columns on the `IATI Datasette instance <https://datasette.codeforiati.org/iati>`_.

Global Columns
--------------

The following columns are available in all tables:

:code:`_link`
The primary key for each table.
:code:`_link_activity` or :code:`_link_organisation`
The foreign key to the :code:`activity` or :code:`organisation` table respectively.
:code:`dataset`
The name of the dataset this row came from. This can be used to find the dataset in the IATI registry, using the URL: :code:`https://www.iatiregistry.org/dataset/<DATASET_NAME>`.
:code:`prefix`
The registry publisher ID this row came from. This can be used to find the dataset in the IATI registry, using the URL: :code:`https://www.iatiregistry.org/publisher/<PREFIX>`.
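As an illustration of how the :code:`_link` columns join tables together, here is a small Python sketch using an in-memory SQLite database with made-up rows (the real data comes from the IATI Tables SQLite output; the row values here are hypothetical):

```python
import sqlite3

# Illustrative sketch: join a child table back to activity via _link_activity.
# "transaction" is quoted because it is an SQL keyword.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE activity (_link TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE "transaction" (
        _link TEXT PRIMARY KEY, _link_activity TEXT, value REAL
    );
    INSERT INTO activity VALUES ('1', 'Example activity');
    INSERT INTO "transaction" VALUES ('1.0', '1', 100.0), ('1.1', '1', 250.0);
""")

# Total transaction value per activity, joined on the _link columns.
rows = con.execute("""
    SELECT a.title, SUM(t.value)
    FROM "transaction" t
    JOIN activity a ON t._link_activity = a._link
    GROUP BY a._link
""").fetchall()
# rows == [('Example activity', 350.0)]
```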
14 changes: 11 additions & 3 deletions iatidata/__init__.py
```
schema = os.environ.get("IATI_TABLES_SCHEMA")

s3_destination = os.environ.get("IATI_TABLES_S3_DESTINATION", "-")

output_path = pathlib.Path(output_dir)
```

```
def sql_process():
    try:
        augment_transaction()
        transaction_breakdown()
    except Exception:
        logger.error(
            "Processing on the 'transaction' table failed, this is usually caused by the sample size being too small"
        )
        raise
```


In `upload_all()`, files are now only uploaded when an S3 destination is configured:

```
    ["s3cmd", "setacl", f"{s3_destination}{file}", "--acl-public"],
    check=True,
)
else:
    logger.info("Skipping upload to S3")
```

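The change above makes `-` a sentinel value meaning "do not upload". A minimal sketch of that guard (`should_upload` is an illustrative name, not the real `upload_all` logic):

```python
# Illustrative guard mirroring the change above: "-", the new default for
# IATI_TABLES_S3_DESTINATION, disables the S3 upload entirely.
def should_upload(s3_destination: str) -> bool:
    return s3_destination != "-"
```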