Healthcare Data Pipeline

The project aims to build a single source of true data storage for large healthcare datasets using Spark and S3. Some dashboards are also made in this project for visualization.

Tech Stack

Data lake: Amazon S3
Data source: PostgreSQL
Data read storage: MySQL on Amazone RDS
Processing layer: Apache Spark on EMR
Visualization: Power BI

Architecture

The architecture of this project is presented as follows:

Data is sourced from PostgreSQL and ingested into raw zone of Data Lake hosted on S3.
Raw data is cleansed and standardized before moving to cleansed zone.
Cleansed data is transformed into reportable form and loaded into curated zone.
Publish data from curated zone to Data read storage for higher performance report when connection from BI Tool.
Reports are created in Power BI from the data in MySQL.

Data Source

Source of raw data is from CMS. Data used is Medicare Part D.
Data source in PostgreSQL has 4 tables, total size around 10 GB:
- Prescriber_drug: ~ 25M rows
- Prescriber: ~ 1.1M rows
- Drug: ~115K rows
- State: ~30K rows

Visualization

Some dashboards create from the data from data read storage

Drug report

Prescriber report

Achievement in learning

Apache Spark

Components of Spark and how Spark works.
How to adjust resource (RAM, CPU, instances,...) for optimizing Spark performance and costs.
Tuning Spark application by using partition
Use Spark to implement a full data pipeline.
Fundamental of how to write Spark correct.
Manage Jar files for JDBC connection

Project set up

Implement logging and log file to track the Spark application
Test project on local mode before run on cluster.

AWS

Set up EMR for Spark
Track the resource utilization in EMR

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Healthcare Data Pipeline

Tech Stack

Architecture

Data Source

Visualization

Achievement in learning

Apache Spark

Project set up

AWS

Files

README.md

Latest commit

History

README.md

File metadata and controls

Healthcare Data Pipeline

Tech Stack

Architecture

Data Source

Visualization

Achievement in learning

Apache Spark

Project set up

AWS