GitHub - ammar257ammar/biohackathon2022-bioschemas-consumer: An ETL pipeline to construct IDPcentral Knowledge Graph using Bioschemas JSON-LD data feeds

An ETL pipeline to construct the Intrinsically Disordered Proteins Knowledge Graph (IDP-KG) using Bioschemas JSON-LD data dumps

As part of the one-week BioHackathon Europe 2022 in Paris, France, a group was formed to work on Project 23, titled “Publishing and Consuming Schema.org DataFeeds”. Schema.org and Bioschemas, which is built on top of it (Gray et al., 2017), are lightweight vocabularies that aim to make the contents of web pages machine-readable so that software agents can consume that content and act on it. For very large sites, extracting markup by visiting every page is impractical because of the time needed to process each page, and it imposes processing requirements on both the publisher and the consumer. In February 2022, the Schema.org community proposed a method for exchanging the markup from many pages as a DataFeed published at a recognized address. The feed could consist of a single file containing the entire dataset, or it could be split into several files covering different aspects of the dataset, such as proteins and molecular entities in the case of ChEMBL. This would ease processing requirements for publishers and consumers and accelerate data collection.

The aim of the project is to explore the implementation of the Schema.org proposal from both a producer and a consumer perspective, for a variety of resources implementing different Bioschemas profiles. This report focuses on the consumer part of the project, where we explored an extract-transform-load (ETL) approach and implemented a consumption pipeline that ingests data dumps into knowledge graphs (KGs).
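To make the consumer side more concrete, the following is a minimal sketch of the extract-transform-load idea described above. It is not the code in the `pipeline` folder. It assumes a feed shaped as a Schema.org `DataFeed` whose `dataFeedElement` entries are `DataFeedItem` objects wrapping one entity each in `item`. The feed content, the `example.org` identifiers, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical. Only flat entities with a single string `@type` are handled; a real pipeline would more likely use a full JSON-LD processor and a triple store.

```python
import json

# Hypothetical, inlined DataFeed used only for illustration; a real feed
# would be downloaded from the publisher's data dump location.
FEED_JSON = """
{
  "@context": "https://schema.org/",
  "@type": "DataFeed",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "dateModified": "2022-11-10",
      "item": {
        "@id": "https://example.org/protein/P1",
        "@type": "Protein",
        "name": "Example protein 1"
      }
    },
    {
      "@type": "DataFeedItem",
      "dateModified": "2022-11-10",
      "item": {
        "@id": "https://example.org/protein/P2",
        "@type": "Protein",
        "name": "Example protein 2"
      }
    }
  ]
}
"""

SCHEMA = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def extract(feed_text):
    """Extract: parse the feed and yield the entity wrapped by each feed element."""
    feed = json.loads(feed_text)
    for element in feed.get("dataFeedElement", []):
        item = element.get("item")
        # Skip elements without an identifiable entity.
        if isinstance(item, dict) and "@id" in item:
            yield item


def transform(item):
    """Transform: turn one flat entity into (subject, predicate, object) triples."""
    subject = item["@id"]
    triples = []
    for key, value in item.items():
        if key == "@id":
            continue
        if key == "@type":
            # Assumption: @type is a single string in the schema.org namespace.
            triples.append((subject, RDF_TYPE, SCHEMA + value))
        else:
            triples.append((subject, SCHEMA + key, value))
    return triples


def load(graph, triples):
    """Load: add triples to an in-memory set that stands in for a triple store."""
    graph.update(triples)


def run_pipeline(feed_text):
    graph = set()
    for item in extract(feed_text):
        load(graph, transform(item))
    return graph


graph = run_pipeline(FEED_JSON)
for triple in sorted(graph):
    print(triple)
```

Because the consumer reads one feed document instead of crawling every page, the extract step reduces to a single parse, which is the saving the DataFeed proposal aims for.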

Preprint: https://biohackrxiv.org/7f95d/

Folders and files (35 commits):

- figures
- input
- output
- pipeline
- LICENSE
- README.md
- paper.bib
- paper.md
