
Adding a regional data source

Richard Martin-Nielsen edited this page May 3, 2021 · 1 revision

Adding a new data set of regional (sub-national) data to covidregionaldata

Thank you for contributing to covidregionaldata! Please make sure you have read our contributing guide before reading on (see it here).

If you are adding data for an individual country, read on. If you wish to add national level data (data spanning multiple countries), check out the guide on adding national level data in addition to reading this.

Adding a prototype function

Our datasets are implemented as R6 classes. You can read more about R6 here.

This document will not describe in detail the mechanics of the top-level DataClass class, but the main (and sometimes only) thing you need to do to add a new data source to covidregionaldata is to create a new class which inherits from DataClass and "fills out" the framework it provides. Much of the hard work of downloading, processing and returning data is done by methods of the DataClass ancestor or by helper functions in utils.R.

With this basis, there are two approaches to making a new class. The first is to work from the CountryTemplate.R class. The second is to model your class on an existing one.
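In either case, the end result is a class along these lines. This is a minimal sketch, not a real source: MyCountry is a placeholder, the URL is the Cuba source used as an example later on this page, and the raw column name in clean_common is invented.

```r
# Minimal sketch of a new source class. "MyCountry" is a placeholder;
# the URL is the Cuba source used as an example later in this guide,
# and "fecha" is a hypothetical raw column name.
MyCountry <- R6::R6Class("MyCountry",
  inherit = DataClass,
  public = list(
    origin = "My Country",
    supported_levels = list("1"),
    supported_region_names = list("1" = "province"),
    supported_region_codes = list("1" = "iso_3166_2"),
    common_data_urls = list(
      "main" = "https://covid19cubadata.github.io/data/covid19-casos.csv"
    ),
    source_data_cols = c("cases_new"),

    # Cleaning shared by all levels; see "How to clean data" below
    clean_common = function() {
      self$data$clean <- self$data$raw$main %>%
        dplyr::mutate(date = lubridate::as_date(fecha))
      # ... rename and select the standard columns here
    }
  )
)
```

The sections below walk through what each of these fields and methods should contain.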

A (mostly) blank slate: working from the template

To help get you started we have provided a template here. As a first step copy this template into the R folder of your local copy of covidregionaldata and rename it to the country or data source you are adding support for. You should also rename all the CountryTemplate uses in the template (either using camel case if in code or using title case if written as text). For the next steps see here for an example simple dataset class and here for a complex class.

Imitation is sincere flattery: copying from an existing model

covidregionaldata already provides data from several different sources, and classes have been written to handle a variety of inputs. Someone has probably already done something similar to what you need to do; in that case you can see which steps have gone into each method, and which methods have been left out.

This approach lets you start with a framework that already has the parts you need, and you'll be able to see what other classes have done in each method to get an idea of what yours requires.

Simple flow diagram for choosing a model to work from

  • Q1: Do you have data only for level 1?
    • Yes
      • Model based on Cuba or Canada or SouthAfrica or Italy or India [or ...]
    • No
      • Q2: Is your data available as separate sources for levels 1 and 2?
        • Model on USA, look at France
      • Q3: Is your data available only for level 2, with level 1 data obtained by aggregation?
        • Model on Lithuania or Brazil or Germany
      • Q4: Is some data available only at certain levels?
        • Model on Belgium (see also France)
      • Q5: Are you doing something different from all the others?
        • Look at UK, but this probably isn't what you want to do.

Overview of DataClass function flow

Most users will never create DataClass objects or interact with them directly, but will rely on their work when they call get_regional_data.

When get_regional_data is called for a particular source, the source is identified and an object of the correct class is created. The parameters specified in the call to get_regional_data are transferred into the class's internal fields (e.g. self$verbose), and then get is called, which calls the following methods in this hierarchical order.

  • get

    • download

    • clean

      • clean_common
      • clean_level_1 OR clean_level_2
    • process

    • filter

In most cases you will need to implement at least clean_common. clean_common and the level-specific clean methods (if present) should take data from self$data$raw (where there may be several named tibbles coming from separate data sources) and put it into self$data$clean as a single tibble, with the data columns renamed and formatted (dates converted, names standardised) and the columns region_level_1 and level_1_region_code present. For level 2 data, region_level_2 is also required; level_2_region_code is optional.

You may only need to provide one of clean_level_1 and clean_level_2. These are called after clean_common, so any cleaning logic common to both levels should live in clean_common.
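For example, a source whose raw data arrives at level 2 might do the shared work in clean_common and aggregate up to level 1 in clean_level_1. The following is a sketch only; the raw column names (county, municipality) are placeholders, not from a real source.

```r
# Sketch: shared cleaning in clean_common, aggregation in clean_level_1.
# The source column names here are placeholders.
clean_common = function() {
  self$data$clean <- self$data$raw$main %>%
    dplyr::mutate(date = lubridate::as_date(date)) %>%
    dplyr::rename(
      region_level_1 = county,
      region_level_2 = municipality
    )
},
clean_level_1 = function() {
  # roll the level 2 rows up to level 1
  self$data$clean <- self$data$clean %>%
    dplyr::group_by(date, region_level_1) %>%
    dplyr::summarise(cases_new = sum(cases_new), .groups = "drop")
}
```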

You will only need a custom download method if your data is not available as CSV files at static URLs (as is the case for Mexico).

If you provide a new method for download, call super$download() first within it.
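If you do need one, a custom download might look like this sketch: super$download() does the actual fetching, and the step that follows is a hypothetical example of source-specific post-processing.

```r
# Sketch of a custom download method. super$download() fetches the
# URLs listed in the *_data_urls fields into self$data$raw; the
# filtering step afterwards is a hypothetical example of
# source-specific post-processing.
download = function() {
  super$download()
  self$data$raw$main <- self$data$raw$main %>%
    dplyr::filter(!is.na(date)) # e.g. drop rows with malformed dates
}
```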

process does much of the work of filling in NA values, adding empty columns and providing totals where requested.

You should not need to write new methods for clean, filter or process.

Source data

You need an open and accessible data source, preferably in the form of a CSV file updated on a regular basis and accessible for download with a fixed (or predictable) URL.

You will place these urls into the common_data_urls named list, or into the level-specific data urls named lists, or into some combination of the two.

DataClass$download will download all files listed in common_data_urls and place the contents of each into self$data$raw, each as a tibble named after the corresponding entry in the common_data_urls list, so

  common_data_urls = list(
      "main" = "https://covid19cubadata.github.io/data/covid19-casos.csv"
  )

results in the data being downloaded from https://covid19cubadata.github.io/data/covid19-casos.csv and placed into self$data$raw$main.

Output data

Below is a list of the columns which get_regional_data will return for each country. You probably won't have data for all these columns and you do not need to generate empty or NA columns. Gaps in your data will be filled with NA, and cumulative sums will be calculated where necessary.

At a minimum, your cleaned data should provide date, one of region_level_1 or region_level_2 (as appropriate), one of level_1_region_code or level_2_region_code (as appropriate), and at least one of cases_new, cases_total, deaths_new, deaths_total, recovered_new, recovered_total, tested_new, tested_total.

date : the date that the counts were reported (YYYY-MM-DD).

region_level_1 : the level 1 region name. This column will be named differently for different countries (e.g. state, province), but this renaming is done by the functions which call your cleaning methods, based on what is present in get_info_covidregionaldata (see below).

level_1_region_code : a standard code for the level 1 region. The column name reflects the specific administrative code used. Typically this is the ISO 3166-2 code, although where that is not available the column will be named differently to reflect its source.

region_level_2 : the level 2 region name. This column will be named differently for different countries (e.g. city, county). This renaming is done by DataClass functions based on metadata stored in the class.

level_2_region_code : a standard code for the level 2 region. The column will be named differently for different countries (e.g. fips in the USA).

cases_new : new reported cases for that day

cases_total : total reported cases up to and including that day

deaths_new : new reported deaths for that day

deaths_total : total reported deaths up to and including that day

recovered_new : new reported recoveries for that day

recovered_total : total reported recoveries up to and including that day

hosp_new : new reported hospitalisations for that day

hosp_total : total reported hospitalisations up to and including that day (note this is cumulative total of new reported, not total currently in hospital)

tested_new : new tests reported for that day

tested_total : total tests completed up to and including that day

How to download data

The R6 structure of DataClass means that you probably don't have to write code to download the data. By listing the source URLs for CSV files of your data in the common_data_urls and/or level_data_urls named lists, you invoke a generic download function which downloads the contents of each URL and places it into self$data$raw$[name].

Looking at an example from Belgium, the following instructs the download function to download data on cases (main) and hospitalization (hosp) into self$data$raw$main and self$data$raw$hosp for every invocation. If data for level 1 is being generated, then self$data$raw$deaths is also filled with the contents of the specified downloaded CSV file.

    #' @field common_data_urls List of named links to raw data that are common
    #' across levels.
    common_data_urls = list(
      "main" = "https://epistat.sciensano.be/Data/COVID19BE_CASES_AGESEX.csv",
      "hosp" = "https://epistat.sciensano.be/Data/COVID19BE_HOSP.csv"
    ),
    #' @field level_data_urls List of named links to raw data specific to
    #' each level of regions. For Belgium, there are only additional data for
    #' level 1 regions.
    level_data_urls = list(
      "1" = list(
        "deaths" = "https://epistat.sciensano.be/Data/COVID19BE_MORT.csv"
      )
    ),

For more complex examples, have a look at the code for Mexico and the UK.

How to clean data

Clean your data. You should probably use lubridate::as_date (or a similar date-parsing function) to generate the date field.

You may need to convert local region names, or adjust them.

You may want to remove fields which are of no use to the end user (e.g. codes used for regions which do not correspond to the ISO 3166 standard or to the codes you will return).

General practice is to return only the data fields which covidregionaldata processes and provides, as listed above. If you don't have all these fields, they will be calculated where possible or filled with NA where appropriate.
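As a self-contained illustration of these steps (the raw column names and values are invented, mimicking a source CSV with local-language headers), a typical cleaning pipeline looks like:

```r
library(dplyr)
library(lubridate)

# Invented raw data, mimicking a source CSV with local column names
raw <- tibble(
  Fecha = c("03/05/2021", "04/05/2021"),
  Provincia = c("La Habana", "La Habana"),
  Casos = c(12L, 9L)
)

clean <- raw %>%
  mutate(date = dmy(Fecha)) %>%           # parse the local date format
  rename(region_level_1 = Provincia,      # standardise column names
         cases_new = Casos) %>%
  select(date, region_level_1, cases_new) # keep only standard columns
```

In a real class this logic would live in clean_common, reading from self$data$raw and writing to self$data$clean.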

How to clean data when there is only one level

How to clean data with multiple levels

Finding ISO 3166-2 codes

Ideally we use ISO 3166-2 codes for sub-national regions at levels 1 and 2. Wikipedia provides a list of countries with the ISO 3166-2 codes available. The ISO provides an online browsing platform of country codes giving authoritative data, but the results mix levels 1 and 2 making it less straightforward for conversion and use.

The following code, adjusted from a version for France, was initially used to create lookup tables of Lithuanian municipality and county codes.

iso_3166_2_url <- "https://en.wikipedia.org/wiki/ISO_3166-2:LT"
iso_3166_2_table <- iso_3166_2_url %>%
  xml2::read_html() %>%
  rvest::html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  rvest::html_table(fill = TRUE)

iso_3166_2_table
#> [[1]]
#> # A tibble: 10 x 3
#>    Code  `Subdivision Name (lt)` `Subdivision Name (en)[note 1]`
#>    <chr> <chr>                   <chr>                          
#>  1 LT-AL Alytaus apskritis       Alytus County                  
#>  2 LT-KU Kauno apskritis         Kaunas County                  
#>  3 LT-KL Klaipėdos apskritis     Klaipėda County                
#>  4 LT-MR Marijampolės apskritis  Marijampolė County             
#>  5 LT-PN Panevėžio apskritis     Panevėžys County               
#>  6 LT-SA Šiaulių apskritis       Šiauliai County                
#>  7 LT-TA Tauragės apskritis      Tauragė County                 
#>  8 LT-TE Telšių apskritis        Telšiai County                 
#>  9 LT-UT Utenos apskritis        Utena County                   
#> 10 LT-VL Vilniaus apskritis      Vilnius County                 
#> 
#> [[2]]
#> # A tibble: 60 x 3
#>    Code  `Subdivision name` `Subdivision category`
#>    <chr> <chr>              <chr>                 
#>  1 LT-01 Akmenė             district municipality 
#>  2 LT-02 Alytaus miestas    city municipality     
#>  3 LT-03 Alytus             district municipality 
#>  4 LT-04 Anykščiai          district municipality 
#>  5 LT-05 Birštono           municipality          
#>  6 LT-06 Biržai             district municipality 
#>  7 LT-07 Druskininkai       municipality          
#>  8 LT-08 Elektrėnai         municipality          
#>  9 LT-09 Ignalina           district municipality 
#> 10 LT-10 Jonava             district municipality 
#> # … with 50 more rows
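Once scraped, a lookup table like the first one above can be joined onto cleaned data by region name. This is a sketch: it assumes the English names in the lookup match the region names in your cleaned data, which in practice may need translation or adjustment first.

```r
# Sketch: attach ISO 3166-2 county codes to cleaned level 1 data.
# iso_3166_2_table[[1]] is the county table scraped above; the join
# assumes region names in the data match its English names.
county_codes <- iso_3166_2_table[[1]] %>%
  dplyr::select(
    level_1_region_code = Code,
    region_level_1 = `Subdivision Name (en)[note 1]`
  )

self$data$clean <- self$data$clean %>%
  dplyr::left_join(county_codes, by = "region_level_1")
```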

How to incrementally implement the prototype class

How to test the prototype class

Tips and general guidance

Run lintr

lintr will pedantically check your code for style. Note that the maximum line length for covidregionaldata is now set to 120, so you can ignore some warnings about line length.

lintr::lint("R/CountryName.R")

Run styler

styler will apply most of the fixes which you would have to do by hand to make lintr happier with your code.

styler::style_file("R/CountryName.R")

Use prefixer to clean non-ASCII source

Your data source or your region names may use non-ASCII characters. The prefixer package's RStudio add-in has a handy tool for converting non-ASCII characters to escaped versions.
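For example, the Lithuanian county names above contain characters that can be written in escaped, ASCII-only form (the escapes here are the standard Unicode code points for these letters):

```r
# Escaped (ASCII-only) spellings of non-ASCII region names
klaipeda <- "Klaip\u0117da" # ė is U+0117
siauliai <- "\u0160iauliai" # Š is U+0160
```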

This is a work in progress. Please comment in this issue if interested in expanding this guide.