Bigmetadata comprises four parts:
ETL | Observatory | Metadata | Catalog
Everything is dockerized and can be run on a standalone EC2 instance provided Docker is available.
- ETL: Luigi tasks that extract data from anywhere on the web, transform the data in modular and observable steps, generate metadata, and upload observatory datasets in the process. While the ETL pushes transformed datasets to CartoDB, it is actually backed by its own separate Postgres/PostGIS database.
- Observatory: transformed and standardized datasets living on a CartoDB instance. This is where we pull our actual data from, whether shuffling bytes or using PL/Proxy. It is also where preview visuals using widgets or other Carto JS interfaces get their underlying data.
- Metadata: human- and machine-readable descriptions of the data in the observatory. The table schema can be found in `tasks/meta.py`. There are six related tables: `obs_table`, `obs_column_table`, `obs_column`, `obs_column_tag`, `obs_tag`, and `obs_column_to_column`. An overarching denormalized view can be found in `obs_meta` (see the sketch after this list).
- Catalog: a static HTML guide to data in the observatory, generated from the metadata. Docs are generated using Sphinx and hosted on GitHub Pages.
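As a quick way to see these pieces in the ETL database, the metadata tables can be listed from inside the bigmetadata container. This is a minimal sketch, assuming `psql` inside the container is already pointed at the ETL's Postgres/PostGIS database (as the `make psql` target described below suggests) and that the metadata tables live in the `observatory` schema:

```sh
# List the obs_* metadata and data tables in the observatory schema.
# Assumes the container's psql connection defaults (host, database, user)
# are already configured.
docker-compose run --rm bigmetadata psql -c '\dt observatory.obs_*'
```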
See QUICKSTART.
Most of the common tasks have already been wrapped up in the `Makefile`:

- `make deploy-html-redirect`: Deploy a redirect (https://carto.com/data/) to GitHub Pages
- `make catalog`: Regenerate the catalog
- `make deploy-html-catalog`: [DEPRECATED] Deploy the catalog to GitHub Pages
- `make sh`: Drop into the bigmetadata container to run shell scripts
- `make python`: Drop into an interactive Python shell in the bigmetadata container
- `make psql`: Drop into an interactive psql session in the bigmetadata container's database
- `make rebuild-all`: Clean and rebuild all the `*-all` tasks. You can disable the heavy tasks (currently just `us-all`) by adding `RUN_HEAVY_TASKS=false` (see the example below)
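For instance, regenerating the catalog and then rebuilding everything without the heavy tasks might look like the following. Passing `RUN_HEAVY_TASKS=false` on the `make` command line is an assumption; exporting it as an environment variable beforehand should also work if the Makefile reads it from the environment:

```sh
# Regenerate the static catalog from the metadata.
make catalog

# Clean and rebuild all *-all tasks, skipping the heavy ones (currently us-all).
make rebuild-all RUN_HEAVY_TASKS=false
```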
The ETL tasks have also been wrapped up in the Makefile:

- `make [docker-]au-all`: ETL the entirety of the Australian data
- `make [docker-]br-all`: ETL the entirety of the Brazilian data
- `make [docker-]ca-all`: ETL the entirety of the Canadian data
- `make [docker-]es-all`: ETL the entirety of the Spanish data
- `make [docker-]eu-all`: ETL the entirety of the Eurostat data
- `make [docker-]fr-all`: ETL the entirety of the French data
- `make [docker-]mx-all`: ETL the entirety of the Mexican data
- `make [docker-]uk-all`: ETL the entirety of the United Kingdom data
- `make [docker-]us-all`: ETL the entirety of the United States data
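For example, to run the entire United States ETL through the dockerized wrapper, or the Spanish ETL directly:

```sh
# Full United States ETL, using the docker- prefixed variant.
make docker-us-all

# Full Spanish ETL, without the docker- prefix.
make es-all
```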
Any other task can be run using `docker-compose`:

```sh
docker-compose run bigmetadata luigi --module tasks.path.to.task \
    TaskName --param1 val1 --param2 val2
```

Or, more conveniently, with `make -- run` (which will use the local scheduler):

```sh
make -- run path.to.task.TaskName --param1 val1 --param2 val2
```

For example, to run QCEW numbers for one quarter:

```sh
make -- run us.bls.QCEW --year 2014 --qtr 4
```

Or using Docker:

```sh
make -- docker-run us.bls.QCEW --year 2014 --qtr 4
```
If you want to use the local scheduler, you can add `SCHEDULER=--local-scheduler` to the `make` task.
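As a sketch, assuming the Makefile forwards `SCHEDULER` to the underlying luigi invocation, running a country ETL against the local scheduler might look like this:

```sh
# Run the full Canadian ETL with luigi's local scheduler instead of a central
# scheduler daemon (assumes the Makefile passes SCHEDULER through to luigi).
make ca-all SCHEDULER=--local-scheduler
```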
We've added a new script that watches container execution and notifies when it ends. It is located in `scripts/watch_containers.py`:

```
usage: python3 watch_containers.py [-h] [--since SINCE]
                                   [--pooling-time POOLING_TIME]
                                   [--notification-channel {stdout,slack,logfile}]
                                   name
```
The options are:

- `name`: Name, complete or partial, of the container to watch
- `--since`: [optional] Date, in `YYYY-mm-dd HH:MM:SS` UTC format, from which the script starts looking for Docker containers to watch. Defaults to the current UTC time (`utcnow()`)
- `--pooling-time`: [optional] How often, in seconds, to check for container completion. Defaults to 60 seconds
- `--notification-channel`: [optional] Where to send the notifications: `stdout`, `slack`, or `logfile`. Defaults to `stdout`
For the Slack notification, you need to have an environment variable called `SLACK_WEBHOOK` that contains the URL for the incoming webhook you've created to send the messages.
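A hypothetical invocation, watching any container whose name contains `bigmetadata`, polling every 120 seconds, and reporting to Slack:

```sh
# "bigmetadata" is just an example fragment; pass any substring of the
# container name you want to watch. The webhook URL is a placeholder.
export SLACK_WEBHOOK='https://hooks.slack.com/services/...'
python3 scripts/watch_containers.py bigmetadata \
    --pooling-time 120 \
    --notification-channel slack
```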
`docker-compose.yml` allows configuring two environment variables:

- `DO_LOCAL_POSTGRESQL_LIB_DIR` (default: `./postgres/data`)
- `DO_LOCAL_POSTGRESQL_TMP_DIR` (default: `./tmp`)

This is useful if you have an external hard drive for storing the Data Observatory (DO) data.
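For example, to keep the ETL's Postgres data and temporary files on an external drive, the two variables can be exported (or placed in an `.env` file) before running any of the commands above. The `/mnt/external/...` paths below are placeholders:

```sh
# Store the Postgres data directory and temp files on an external disk
# instead of the repository defaults.
export DO_LOCAL_POSTGRESQL_LIB_DIR=/mnt/external/do/postgres/data
export DO_LOCAL_POSTGRESQL_TMP_DIR=/mnt/external/do/tmp
```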
In bigmetadata's postgres, tables live in the `observatory` schema, with an auto-generated hash for the tablename prefixed by `obs` (e.g. `obs_<hash>`). In `obs_table`, the table ID refers back to the task which created that table as well as the parameters called.

Column IDs in metadata are fully qualified, like tables, such as `us.bls.avg_wkly_wage_trade_transportation_and_utilities`. In tables themselves, column names are the `colname` from metadata, so the above would be `avg_wkly_wage_trade_transportation_and_utilities` in an actual table.
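As an illustration of how the metadata ties a fully-qualified column ID back to a physical table and column name, a query along these lines could be run against the ETL database. The join columns used here (`column_id`, `table_id`, `colname`, `tablename`) follow the schema described above but are assumptions to verify against `tasks/meta.py`:

```sh
# Hypothetical lookup: which physical observatory table and column hold the
# metadata column us.bls.avg_wkly_wage_trade_transportation_and_utilities?
docker-compose run --rm bigmetadata psql -c "
  SELECT t.tablename, ct.colname
  FROM observatory.obs_column_table ct
  JOIN observatory.obs_table t ON t.id = ct.table_id
  WHERE ct.column_id = 'us.bls.avg_wkly_wage_trade_transportation_and_utilities';"
```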
Our users need third-party data in their maps to better interpret their own data.
In order to make this as simple as possible, we need to rethink the prevailing model of finding external data by where it's sourced from. Instead, we should think of finding external data by need.
For example, instead of requiring our users to think "I need race or demographic data alongside my data, I'm in the US, so I should look at the census", we should enable them to look up race or demographic data directly -- and figure out which columns they need from the census without having to delve into the source.