
# Analytics for data.gov.ua

**PLEASE USE RESPONSIBLY WITH RESPECT TO THE DATA.GOV.UA INFRASTRUCTURE**

A simple-to-use, complete ETL component that reliably fetches dataset metadata from data.gov.ua and uploads it to Elasticsearch, exposing Kibana as the search and analytics UI.

To use the crawler application separately or to change its parameters, check out the `app` folder.

The default crawler options follow the data.gov.ua robots.txt `Crawl-delay` parameter as of 09/01/2017, which is a 10-second delay between requests.
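
For reference, a `Crawl-delay` entry in robots.txt looks like this (reconstructed from the delay value above, not copied verbatim from the live site):

```
User-agent: *
Crawl-delay: 10
```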

Starting it is one command away: see [Quickstart](#quickstart).

*(Screenshot: Kibana dashboard)*

## Prerequisite

- Docker (on OS X or Windows, use only the native Docker distribution)

## Quickstart

```sh
docker-compose up
```
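
To keep the stack running in the background instead, pass Compose's standard detached flag:

```sh
docker-compose up -d
```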

Wait at least 30 minutes for some data to be downloaded and indexed in Elasticsearch, then open localhost:5601 to access Kibana.
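
If you want to check indexing progress without Kibana, you can query the Elasticsearch API directly (the `_cat` and `_count` endpoints below are standard Elasticsearch APIs):

```sh
# List all indices; the data.gov.ua-* indices should appear once data arrives
curl 'http://localhost:9200/_cat/indices?v'

# Count the documents indexed so far
curl 'http://localhost:9200/data.gov.ua-*/_count'
```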

Uncheck the *Index contains time-based events* checkbox, type `data.gov.ua-*` in the *Index pattern* field, and press *Create*. Use Kibana to query the metadata and set up your visualizations.

## Kibana

If you are already familiar with Kibana's time range functionality, you may instead leave the *Index contains time-based events* checkbox checked and choose `@timestamp`, `created`, or `changed` as the default time field for the `data.gov.ua-*` index.

To add the default visualizations and dashboard like the ones in the screenshot above, follow these steps:

1. Open Kibana at localhost:5601.
2. Go to *Management* -> *Index Patterns*.
3. Set `data.gov.ua-*` as the index pattern and, if you want to run Timelion queries, choose one of the `created` or `changed` fields as your time-based field (see the example after this list).
4. Go to *Management* -> *Saved Objects*.
5. Click the *Import* button and choose the `dashboard.json` file from the `kibana` folder.
6. Go to *Dashboard* and click *Open*.
7. Select one of the two available dashboards.
8. If you chose a time-based field when setting the index pattern, you will not see any statistics until you widen the time range in the top-right corner of the Kibana dashboard.
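
If you picked a time field, a minimal Timelion expression to plot the metadata over time looks like this (assuming the `created` field is mapped as a date; adjust the field name to your mapping):

```
.es(index=data.gov.ua-*, timefield=created, metric=count)
```

This charts how many datasets fall into each time bucket of the selected time range.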

For more information on how to use Kibana, consult its documentation.

## Cleaning Up

To stop containers, execute:

```sh
docker-compose stop
```

To fully clean up the system, removing all the downloaded data, containers, and images, run:

```sh
docker-compose down --rmi all --volumes
```

Note that without `--volumes`, the named data volumes (and the crawled data in them) are kept.

For any other commands, consult the Docker Compose documentation.

## How it works

It schedules a batch crawling job with the cron string `0 10 0 * * 6`, which means the crawler runs every Saturday at 00:10. Check out the `app` folder for more options. It also runs a Docker container with a rotating proxy.
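
For reference, this is a six-field cron expression with a leading seconds field (the layout used by node-cron-style schedulers), which reads as follows:

```
0 10 0 * * 6
│ │  │ │ │ └── day of week  (6 = Saturday)
│ │  │ │ └──── month        (any)
│ │  │ └────── day of month (any)
│ │  └──────── hour         (0)
│ └─────────── minute       (10)
└───────────── second       (0)
```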

### Services

- `crawler` - a node.js app that crawls the data and stores it in a file. Has an `http_proxy` environment variable set so that it uses the rotating proxy server.
- `proxy` - the rotating proxy server.
- `elasticsearch` - the Elasticsearch service. Exposes port 9200, so use localhost:9200 to access the Elasticsearch API.
- `logstash` - the Logstash service. Configuration files can be found in the `logstash` folder.
- `kibana` - the Kibana service. Exposes port 5601, so open localhost:5601 to access the Kibana UI.
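
For orientation, here is a minimal sketch of how such a stack is typically wired together in `docker-compose.yml`. It is illustrative only: the image names, the proxy port, and the container paths are assumptions, and the authoritative definition is the `docker-compose.yml` in this repository.

```yaml
version: "2"
services:
  crawler:
    build: ./app                      # the node.js crawler
    environment:
      - http_proxy=http://proxy:8080  # route requests via the rotating proxy (port assumed)
    volumes:
      - metadata:/data                # crawled metadata files land here (path assumed)
  proxy:
    image: example/rotating-proxy     # placeholder image name
  elasticsearch:
    image: elasticsearch
    ports:
      - "9200:9200"                   # Elasticsearch API on localhost:9200
  logstash:
    image: logstash
    volumes:
      - metadata:/data:ro             # reads the crawled files and ships them to Elasticsearch
  kibana:
    image: kibana
    ports:
      - "5601:5601"                   # Kibana UI on localhost:5601
volumes:
  metadata:
```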

### Data Volumes

- `metadata` - stores the crawled metadata files
- `elasticsearch_config` - stores the Elasticsearch configuration files
- `elasticsearch_data` - stores the Elasticsearch data files
- `kibana_config` - stores the Kibana configuration files

To list the names of the data volumes, run:

```sh
docker volume ls
```

then

```sh
docker volume inspect [volume_name]
```

The `Mountpoint` field is the path to the volume's files on your local filesystem.
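
To print just the mountpoint, `docker volume inspect` also accepts Docker's standard `--format` template flag:

```sh
docker volume inspect --format '{{ .Mountpoint }}' [volume_name]
```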

## License

MIT (c) Artem Sorokin