Modern Data Platform PoC

A proof of concept for the core of a Modern Data Platform that uses DataOps, Kubernetes, and the Cloud-Native ecosystem to build a resilient Big Data platform based on the Data Lakehouse architecture, which serves as the foundation for Machine Learning (MLOps) and Artificial Intelligence (AIOps).

Note

This project is part of my Master of Science in Data Engineering at Edinburgh Napier University (April 2023).

Contents

  • Architecture
  • Deployment
  • Benchmarking

Architecture

Core Components

The core components of the platform are:

  • Infrastructure (Kubernetes)
  • Data Ingestion (Argo Workflows + Python)
  • Data Storage (MinIO)
  • Data Processing (Dremio)

Initial Model

To visualise the interactions of the current implementation, the C4 software architecture model (Context, Containers, Components, and Code) is used.

The following is a simplified view of the initial architecture model, with all the C4 abstraction levels combined into a single diagram.

Modern Data Platform Initial Architecture Model

Deployment

Prerequisites: asdf, a Linux operating system, and Docker Engine (tested with asdf 0.11.1, Ubuntu 20.04.5 LTS, and Docker Engine Community 23.0.1).

The following tools are used for development:

  • Helm
  • KinD
  • Kubectl
  • Kustomize

They can be installed at the pinned versions via asdf:

asdf install
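
If the required asdf plugins have not been added yet, they may need to be added first; a minimal sketch, assuming the plugin names match the default asdf plugin registry:

asdf plugin add helm
asdf plugin add kind
asdf plugin add kubectl
asdf plugin add kustomize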

Create the local Kubernetes cluster:

kind create cluster \
  --config clusters/local/kind-cluster-config.yaml
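
To confirm the cluster is up before deploying anything, the nodes can be listed; a quick check, assuming the cluster name in kind-cluster-config.yaml is the KinD default (kind):

kubectl cluster-info --context kind-kind
kubectl get nodes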

Deploy the applications to the Kubernetes cluster:

kustomize build --enable-helm clusters/local | kubectl apply -f -
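
Optionally, the rendered manifests can be validated before being applied, using kubectl's client-side dry run:

# Render the manifests and validate them without applying anything.
kustomize build --enable-helm clusters/local | kubectl apply --dry-run=client -f -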

Wait for deployments to be ready:

# Ingress-Nginx.
kubectl rollout status deployment \
  --watch --namespace ingress-nginx ingress-nginx-controller

# MinIO.
kubectl rollout status deployment \
  --watch --namespace minio minio

# Argo Workflows.
kubectl rollout status deployment \
  --watch --namespace argo-workflows argo-workflows-server

# Dremio.
kubectl rollout status statefulset \
  --watch --namespace dremio dremio-master
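
As a quick overall check, the pods of all the components can also be listed across namespaces:

kubectl get pods --all-namespaces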

Apply the data pipeline:

kubectl apply --namespace argo-workflows --filename \
  pipelines/ingestion/argo-workflow-covid19-subnational-data.yaml
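
Once the pipeline is applied, the resulting workflow runs can be watched from the command line, and the Argo Workflows UI can be reached via a port-forward; a sketch, assuming the Argo Workflows server's default port (2746) and the argo-workflows-server deployment name used above:

# Watch the workflow runs created by the manifest.
kubectl get workflows --watch --namespace argo-workflows

# Expose the Argo Workflows UI locally on port 2746.
kubectl port-forward --namespace argo-workflows deployment/argo-workflows-server 2746:2746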

Benchmarking

The TPC-DS test suite has been used to assess the performance of the platform.

For the complete results, see the project's Jupyter Notebook in the benchmarking section.
