This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Scripts to get metadata from data.gov.ua website and to upload it to ElasticSearch exposing Kibana interface to get metadata insights and view analytics

sorjef/data-gov-ua-analytics

Analytics for data.gov.ua

PLEASE USE RESPONSIBLY WITH RESPECT TO DATA.GOV.UA INFRASTRUCTURE

An easy-to-use, complete ETL component that reliably fetches dataset metadata from data.gov.ua and loads it into ElasticSearch, with Kibana as the search and analytics UI.

To use the crawler application separately or to change its parameters, check out the app folder.

Default crawler options follow the Crawl-delay parameter in the data.gov.ua robots.txt as of 09/01/2017, which specifies a 10-second delay between requests.
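Since the Crawl-delay value may change over time, it can be worth re-checking it before adjusting the crawler options. The following is a minimal sketch, not part of this repository: the crawl_delay helper and the sample robots.txt text are hypothetical and only mirror the 10-second value mentioned above.

```shell
#!/bin/sh
# crawl_delay: print the first Crawl-delay value found in robots.txt text on stdin.
# (Hypothetical helper for illustration; the crawler itself does not use it.)
crawl_delay() {
  awk -F': *' 'tolower($1) == "crawl-delay" { print $2; exit }'
}

# Sample robots.txt content (an assumption, not a copy of the live file).
sample_robots='User-agent: *
Crawl-delay: 10
Disallow: /admin/'

delay=$(printf '%s\n' "$sample_robots" | crawl_delay)
echo "Crawl-delay: ${delay} seconds"
```

To check the live value, the same helper could be fed from curl, for example `curl -s https://data.gov.ua/robots.txt | crawl_delay` (the exact URL scheme is an assumption).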

It is one command away from starting: see Quickstart.

Kibana Screenshot

Prerequisite

  • Docker (on OSX or Windows, use only the native Docker distribution)

Quickstart

docker-compose up

Wait at least 30 minutes for some data to be downloaded and indexed in ES, then open localhost:5601 to access Kibana.

Uncheck the "Index contains time-based events" checkbox, type data.gov.ua-* in the "Index Patterns" field, and then press "Create". Use Kibana to query metadata and set up your visualizations.

Kibana

If you are already familiar with Kibana's time range functionality, you may also leave the time-based events checkbox checked and choose @timestamp, created or changed as the default time field for the data.gov.ua-* index.

To add the default visualizations and dashboards, like the one in the screenshot above, follow these steps:

  1. Open Kibana at localhost:5601
  2. Go to Management -> Index Patterns
  3. Set data.gov.ua-* as the index pattern and choose any created or updated field as your time-based field if you want to run Timelion queries.
  4. Go to Management -> Saved Objects
  5. Click the Import button and choose the dashboard.json file from the kibana folder
  6. Go to Dashboard and click Open
  7. Select one of the two available dashboards
  8. If you chose a time-based field when setting the index pattern, you will not see any statistics until you change the time range in the top right corner of the Kibana dashboard.

For more information on how to use Kibana, consult its documentation.

Cleaning Up

To stop containers, execute:

docker-compose stop

To fully clean up the system, removing the containers, images, and all the downloaded data, run:

docker-compose down --rmi all --volumes

For any other commands, consult the Docker Compose documentation.

How it works

It schedules a batch crawling job with the cron string 0 10 0 * * 6 (six fields, seconds first), which means the crawler runs every Saturday at 00:10. Check out the app folder for more options. It also runs a Docker container with a rotating proxy.
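To make the schedule easier to read, the cron string can be split into its fields. This sketch assumes the six-field layout with seconds first (second, minute, hour, day of month, month, day of week), which is the reading that matches "every Saturday at 00:10"; the variable names are illustrative only.

```shell
#!/bin/sh
# Cron string quoted in this README.
cron='0 10 0 * * 6'

# Assumption: six fields, seconds first, day-of-week last (0 = Sunday ... 6 = Saturday).
read -r second minute hour day_of_month month day_of_week <<EOF
$cron
EOF

# Map the day-of-week number to a name.
case "$day_of_week" in
  0) day_name=Sunday ;;
  1) day_name=Monday ;;
  2) day_name=Tuesday ;;
  3) day_name=Wednesday ;;
  4) day_name=Thursday ;;
  5) day_name=Friday ;;
  6) day_name=Saturday ;;
  *) day_name="day-of-week $day_of_week" ;;
esac

# Prints: Runs on Saturday at 00:10:00
printf 'Runs on %s at %02d:%02d:%02d\n' "$day_name" "$hour" "$minute" "$second"
```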

Services

  • crawler - Node.js app that crawls data and stores it in a file. Its http_proxy environment variable is set so that requests go through the rotating proxy server.
  • proxy - Proxy server.
  • elasticsearch - ElasticSearch service. Exposes port 9200, so use localhost:9200 to access the ES API.
  • logstash - Logstash service. Configuration files can be found in the logstash folder.
  • kibana - Kibana service. Exposes port 5601, so open localhost:5601 to access the Kibana UI.
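Because the ES API is reachable on port 9200, it can be used to check whether any metadata has been indexed yet before opening Kibana. The sketch below is an illustration under assumptions: the count_url and indexed_count helpers are hypothetical, and the index pattern is the data.gov.ua-* pattern used in the Kibana steps.

```shell
#!/bin/sh
# Base URL of the ES API (port 9200 is exposed by the elasticsearch service).
ES_URL="${ES_URL:-http://localhost:9200}"
# Index pattern used when setting up Kibana.
INDEX_PATTERN='data.gov.ua-*'

# Build the URL of the _count endpoint for the crawled indices.
count_url() {
  printf '%s/%s/_count\n' "$ES_URL" "$INDEX_PATTERN"
}

# Ask ElasticSearch how many documents are indexed so far.
# Prints the raw JSON response; needs curl and a running stack.
indexed_count() {
  curl -s "$(count_url)"
}

echo "Query URL: $(count_url)"
# With the stack running, call: indexed_count
```

A count of zero shortly after startup is expected, since the first crawl takes a while to produce data.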

Data Volumes

  • metadata - stores crawled metadata files
  • elasticsearch_config - stores ElasticSearch configuration files
  • elasticsearch_data - stores ElasticSearch data files
  • kibana_config - stores Kibana configuration files

To list the names of the data volumes, run:

docker volume ls

Then inspect a specific volume:

docker volume inspect [volume_name]

The Mountpoint field in the output shows the path to the volume's files on your local filesystem.
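The two commands above can be combined to print the mountpoint of each volume in one go. This is a sketch under assumptions: Docker Compose normally prefixes named volumes with the project name, so the filter matches on the name's suffix, and the myproject_ prefix in the sample is hypothetical.

```shell
#!/bin/sh
# Keep only volume names ending in one of the four volume names listed above.
# Assumption: names look like "<project>_metadata"; the prefix depends on your setup.
stack_volumes() {
  grep -E '(^|_)(metadata|elasticsearch_config|elasticsearch_data|kibana_config)$'
}

# Print "name -> mountpoint" for each matching volume (needs a running Docker daemon).
show_mountpoints() {
  docker volume ls -q | stack_volumes | while read -r name; do
    printf '%s -> %s\n' "$name" \
      "$(docker volume inspect --format '{{ .Mountpoint }}' "$name")"
  done
}

# Demonstrate the filter on sample names without touching Docker.
printf '%s\n' myproject_metadata unrelated_cache myproject_kibana_config | stack_volumes
```

With Docker running, call show_mountpoints to see the real paths.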

License

MIT (c) Artem Sorokin
