This repo contains assignments for UC Berkeley's Machine Learning at Scale course. Content is organized by assignment and covers the following concepts:
Class Assignments
- Week 1: MapReduce wordcount on the command line
- Week 2: Wordcount in Python/Hadoop Streaming
- Week 4: Basics in MrJob
- Week 5: Reducer-side Inner Join
- Week 10: Basics in Spark
Homework Assignments
- HW 1: Naive Bayes spam filter in command line using Enron emails
- HW 2: Naive Bayes spam filter in Hadoop Streaming using Enron emails
- HW 3: Shopping Cart Analysis, Pairs vs Stripes, Secondary Sort, Custom Partitioning
- HW 4: Tweet clustering via KMeans in MrJob
- HW 5: Large-scale joins in MrJob, EDA and synonym detection in Google n-gram corpus (on AWS)
- HW 6: Weighted OLS using Gradient Descent, Gaussian Mixture Models
- HW 7: Distributed Shortest-Path in MrJob (on AWS) using English Wikipedia
- HW 9: Distributed PageRank in MrJob (on AWS) using English Wikipedia
- HW 10: KMeans, Ridge/Lasso Regression in MLlib
- HW 11: Logistic Regression, SVM in both base Spark and MLlib (in Zeppelin Notebook)
- HW 12: Structured walkthrough of feature hashing, one-hot encoding, and click-through prediction using Criteo dataset
- HW 13: PageRank in Spark (using Wikipedia) and click-through prediction at scale (using Criteo data)