
Data Warehouse on Cloud

Project summary

This project builds a data warehouse on Redshift for a music application. The application produces event logs (users' historical music listening) and song library data, both stored as JSON in AWS S3. Three main tasks accomplish the project:

  • Extract data from S3 into staging tables designed on Redshift
  • Transform the staged data and load it into the dimension and fact tables on Redshift (see the sketch after this list)
  • Verify and review the fact data with analysis queries
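
Conceptually, the pipeline reduces to a handful of SQL statements driven from Python. The sketch below shows that flow in miniature, assuming the connection details and IAM role ARN live in dwh.cfg; the config keys, S3 paths, and column lists here are illustrative rather than copied from the repo.

import configparser
import psycopg2

# Read the cluster endpoint, database credentials, and IAM role from dwh.cfg.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Extract: Redshift pulls the raw JSON straight out of S3 with COPY.
cur.execute(
    """
    COPY staging_songs FROM 's3://udacity-dend/song-data'
    IAM_ROLE %s REGION 'us-west-2' FORMAT AS JSON 'auto';
    """,
    (config["IAM_ROLE"]["ARN"],),
)

# Transform + load: one INSERT ... SELECT per target table.
cur.execute(
    """
    INSERT INTO music.songs (song_id, title, artist_id, year, duration)
    SELECT DISTINCT song_id, title, artist_id, year, duration
    FROM staging_songs;
    """
)

conn.commit()
conn.close()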

How to run the Python scripts

  1. Run python create_tables.py to create the schema and the tables: the fact table (songplays) and the dimension tables (users, songs, artists, and time).

  2. Then run python etl.py to transfer data from S3 into the staging tables (staging_events & staging_songs), and to transform and load the staged data into the target tables of the Redshift data warehouse.

  3. Run python test.py to examine the results. For example, the query below answers which ten songs were listened to most in 2018 (a minimal runner for it is sketched after the result).

SELECT s.title AS song_title,
       COUNT(songplay_id) AS listened_times,
       a.name AS artist_name
FROM music.songplays sp
INNER JOIN music.songs s ON s.song_id = sp.song_id
INNER JOIN music.artists a ON a.artist_id = sp.artist_id
WHERE EXTRACT(YEAR FROM start_time) = 2018
GROUP BY s.title, a.name
ORDER BY listened_times DESC
LIMIT 10;

Result: Top listened songs in 2018
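
test.py presumably wraps queries like the one above; a hypothetical runner built on psycopg2 (reusing the illustrative dwh.cfg keys from the earlier sketch) could look like this:

import configparser
import psycopg2

# The query shown above, kept as a module-level string.
TOP_SONGS_2018 = """
    SELECT s.title, COUNT(sp.songplay_id) AS listened_times, a.name
    FROM music.songplays sp
    INNER JOIN music.songs s ON s.song_id = sp.song_id
    INNER JOIN music.artists a ON a.artist_id = sp.artist_id
    WHERE EXTRACT(YEAR FROM sp.start_time) = 2018
    GROUP BY s.title, a.name
    ORDER BY listened_times DESC
    LIMIT 10;
"""

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
with conn, conn.cursor() as cur:
    cur.execute(TOP_SONGS_2018)
    for title, plays, artist in cur.fetchall():
        print(f"{plays:>6}  {title} ({artist})")
conn.close()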

Files in the repository

  1. dwh.cfg contains configuration parameters to access AWS Redshift cluster, S3 bucket and IAM role.
  2. create_tables.py drops and recreates the tables. Every run of this script resets all tables to empty.
  3. etl.py extracts data from S3 into the staging tables on Redshift, then transforms and loads it into the relational tables created beforehand by the create_tables.py script.
  4. sql_queries.py contains all the SQL queries used to drop, create, insert into, and select from the Redshift database (an illustrative skeleton follows this list).
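
In projects of this shape, sql_queries.py typically holds each statement as a module-level string plus lists that create_tables.py and etl.py iterate over. Here is a skeleton of that layout, with illustrative (not the repo's exact) column definitions:

# Illustrative skeleton; the real DDL and the remaining tables live in the repo.
songplay_table_drop = "DROP TABLE IF EXISTS music.songplays;"

songplay_table_create = """
    CREATE TABLE IF NOT EXISTS music.songplays (
        songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INTEGER,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,
        artist_id   VARCHAR,
        session_id  INTEGER,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

# create_tables.py loops over these lists, so one run rebuilds the whole schema.
drop_table_queries = [songplay_table_drop]      # ...plus the other tables
create_table_queries = [songplay_table_create]  # ...plus the other tables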

Dataset used in S3

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is the filepath to one file in this dataset:

s3.ObjectSummary(bucket_name='udacity-dend', key='song-data/A/A/A/TRAAAAK128F9318786.json')
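
The ObjectSummary line above is what boto3 prints when iterating a bucket. For instance, a few files under one partition can be listed like this (assuming AWS credentials are configured locally; the prefix is taken from the key above):

import boto3

# Browse one partition of the song dataset; each iteration prints an
# s3.ObjectSummary(...) line like the one shown above.
s3 = boto3.resource("s3")
for obj in s3.Bucket("udacity-dend").objects.filter(Prefix="song-data/A/A/A/").limit(5):
    print(obj)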

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app under specified configurations. For example:

s3.ObjectSummary(bucket_name='udacity-dend', key='log-data/2018/11/2018-11-01-events.json')
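
Each daily file bundles many events. Assuming the simulator writes newline-delimited JSON (one event object per line), a single day's file can be inspected like this:

import json

import boto3

# Download one day's log file and parse it line by line.
s3 = boto3.resource("s3")
obj = s3.Object("udacity-dend", "log-data/2018/11/2018-11-01-events.json")
lines = obj.get()["Body"].read().decode("utf-8").splitlines()
events = [json.loads(line) for line in lines if line.strip()]
print(len(events), "events; sample fields:", sorted(events[0])[:5])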
