Skip to content

Historical disease database (19th-20th century) for municipalities in the Netherlands

License

Notifications You must be signed in to change notification settings

sodascience/disease_database

Repository files navigation

Disease database

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Creating a historical disease database (19th-20th century) for municipalities in the Netherlands.

Cholera in the Netherlands

Preparation

This project uses pyproject.toml to handle its dependencies. You can install them using pip like so:

pip install .

We recommend using uv to manage the environment. First, install uv, then clone / download this repo, then run:

uv sync

this will automatically install the right python version, create a virtual environment, and install the required packages.

Note that if you encountered error: command 'cmake' failed: No such file or directory, you need to install cmake first. On macOS, run brew install cmake. Similarly, you may have to install apache-arrow separately as well (e.g., on macOS brew install apache-arrow).

Once these dependency issues are solved, run uv sync one more time.

Data extraction (1830-1879)

Between 1830 and 1879, Delpher historical news article data can be downloaded manually from here. The downloaded files, which are zip folders, take up lots of disk space because of inefficient data format.

The src/process_open_archive/extract_article_data.py script extracts the titles and texts from the zip folder for each article. Then, it stores all extracted data as a polars dataframe with three columns article_id, article_title and article_text. Finally, it is saved as a parquet file (article_data_{start_year}_{end_year}.parquet), with a much smaller size under processed_data/texts/from_1830_to_1879/.

With the src/process_open_archive/extract_meta_data.py script, we extract meta information about both the newspapers and the individual articles. This results in two kinds of polars dataframes saved in parquet format under processed_data/metadata/newspapers/from_1830_to_1879 and processed_data/metadata/articles/from_1830_to_1879, respectively.

  1. newspaper_meta_data_{start_year}_{end_year}.parquet includes these columns: newspaper_name, newspaper_location, newspaper_date, newspaper_years_digitalised, newspaper_years_issued, newspaper_language, newspaper_temporal, newspaper_publisher and newspaper_spatial.
  2. article_meta_data_{start_year}_{end_year}.parquet includes these columns: newspaper_id, article_id and article_subject.

Before you run the following script, make sure to put all the Delpher zip files under raw_data/open_archive.

python src/process_open_archive/extract_article_data.py
python src/process_open_archive/extract_meta_data.py

Then, run

python src/process_open_archive/combine_and_chunk.py

to join all the available datasets and create a yearly-chunked series of parquet files in the folder processed_data/combined.

Data harvesting (1880-1940)

After 1880, the data is not public and can only be obtained through the Delpher API:

  1. Obtain an api key (which looks like this df2e02aa-8504-4af2-b3d9-64d107f4479a) from Delpher, then put the api key in the file harvest_delpher_api/apikey.txt.
  2. Harvest the data following readme in the delpher api folder: src/harvest_delpher_api/readme.md

Data analysis

The script src/query/faster_query.py uses the prepared combined data to search for mentions of diseases and locations in articles. The file produces the plot shown above. It also produces this plot about Utrecht:

Contact

SoDa logo

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks? File an issue in the issue tracker or feel free to contact the team at odissei-soda.nl

About

Historical disease database (19th-20th century) for municipalities in the Netherlands

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published