# Analytics for data.gov.ua
**PLEASE USE RESPONSIBLY WITH RESPECT TO THE DATA.GOV.UA INFRASTRUCTURE**
An easy-to-use, full ETL component that reliably fetches dataset metadata from data.gov.ua and loads it into Elasticsearch, exposing Kibana as the search and analytics UI.
To use the crawler application separately or to change its parameters, check out the `app` folder.
Default crawler options are set according to the data.gov.ua `robots.txt` `Crawl-delay` parameter as of 09/01/2017, i.e. a 10-second delay between requests.
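To see what the site currently requests, you can fetch its `robots.txt` directly. A quick check (this hits live data.gov.ua infrastructure, so run it sparingly):

```sh
# Print the current Crawl-delay directive, if any, from data.gov.ua.
curl -s https://data.gov.ua/robots.txt | grep -i 'crawl-delay'
```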
## Quickstart

You are one command away from starting it. The only prerequisite is Docker (on macOS or Windows, use only the native Docker distribution).

```sh
docker-compose up
```
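If you would rather keep your terminal free, the standard detached-mode flag runs the same stack in the background:

```sh
docker-compose up -d
```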
Wait at least 30 minutes for some data to be downloaded and indexed in ES, then open localhost:5601 to access Kibana. Uncheck the "Index contains time-based events" checkbox, type `data.gov.ua-*` in the "Index Patterns" field, and press "Create". Use Kibana to query the metadata and set up your visualizations.

If you are already familiar with Kibana's time range functionality, you may instead leave the time-based events checkbox checked and choose `@timestamp`, `created`, or `changed` as the default time field for the `data.gov.ua-*` index.
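Before opening Kibana, you can confirm that documents are actually arriving. A minimal sanity check against the ES API, assuming the index names match the `data.gov.ua-*` pattern used above:

```sh
# List the data.gov.ua indices with document counts and sizes.
curl -s 'localhost:9200/_cat/indices/data.gov.ua-*?v'
```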
To add the default visualizations and dashboard shown on the screenshot above, follow these steps:

- Open Kibana - localhost:5601
- Go to `Management` -> `Index Patterns`
- Set `data.gov.ua-*` as an index pattern and choose either the `created` or the `updated` field as your time-based field if you want to run Timelion queries (see the Timelion example after this list).
- Go to `Management` -> `Saved Objects`
- Click the `Import` button and choose the `dashboard.json` file from the `kibana` folder.
- Go to `Dashboard` and click `Open`.
- Select one of the 2 available dashboards.
- If you chose a time-based field when setting the index pattern, you will not see any statistics until you change the time range in the top right corner of the Kibana dashboard.
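For the Timelion option, a minimal expression sketch (this assumes you picked `created` as the time field; swap in `updated` if that is what you chose):

```
.es(index='data.gov.ua-*', timefield='created')
```

This plots the count of indexed datasets over time, which is a quick way to verify that the weekly crawls are landing.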
To get more information on how to use Kibana, consult its documentation.
To stop the containers, execute:

```sh
docker-compose stop
```
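To resume the same containers later (standard Compose command, nothing is re-downloaded):

```sh
docker-compose start
```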
To fully clean up the system, removing the containers, images, and all downloaded data, run:

```sh
docker-compose down --rmi all -v
```

Note: without the `-v` flag, `docker-compose down` keeps the named data volumes listed below, so the crawled metadata and ES data would survive the cleanup.
For any other commands, consult the Docker Compose documentation.
## How it works

The Compose setup schedules a batch crawling job with the following cron string: `0 10 0 * * 6`. This means the crawler will run every Saturday at 00:10. Check out the `app` folder for more options. It also runs a Docker container with a rotating proxy.
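The cron string reads as a six-field expression with a leading seconds field (node-cron style is an assumption here, but it matches the stated Saturday 00:10 behavior):

```
# ┌──────────── second        (0)
# │ ┌────────── minute        (10)
# │ │ ┌──────── hour          (0)
# │ │ │ ┌────── day of month  (any)
# │ │ │ │ ┌──── month         (any)
# │ │ │ │ │ ┌── day of week   (6 = Saturday)
  0 10 0 * * 6
```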
## Services

- `crawler` - node.js app that crawls the data and stores it in a file. Has an `http_proxy` environment variable set to use the rotating proxy server.
- `proxy` - proxy server.
- `elasticsearch` - Elasticsearch service. Exposes port 9200, so use localhost:9200 to access the ES API (see the example query after this list).
- `logstash` - Logstash service. Configuration files can be found in the `logstash` folder.
- `kibana` - Kibana service. Exposes port 5601, so open localhost:5601 to access the Kibana UI.
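As a sketch of what querying the metadata directly looks like (the query is a plain `match_all`, so no field names are assumed; fetch one document first to inspect the real schema):

```sh
# Return a single indexed metadata document from the data.gov.ua-* indices.
curl -s 'localhost:9200/data.gov.ua-*/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match_all": {}}, "size": 1}'
```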
## Volumes

- `metadata` - stores crawled metadata files
- `elasticsearch_config` - stores Elasticsearch configuration files
- `elasticsearch_data` - stores Elasticsearch data files
- `kibana_config` - stores Kibana configuration files
To list the names of the data volumes, run:

```sh
docker volume ls
```

then

```sh
docker volume inspect [volume_name]
```

The `Mountpoint` field represents the path to the files on your local filesystem.
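To print just that path, `docker volume inspect` accepts a Go-template `--format` flag. Note that Compose prefixes volume names with the project (folder) name, so `<project>_metadata` below is a placeholder you should take from `docker volume ls`:

```sh
# Print only the host path of the crawled-metadata volume.
docker volume inspect --format '{{ .Mountpoint }}' <project>_metadata
```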
## License

MIT (c) Artem Sorokin