🖼️ The MET Collection

An Apache Hadoop MapReduce job that counts the unique objects in every curatorial department of The Met Collection, built with tools available on the Google Cloud Platform.

Overview

This project runs a MapReduce job with the Hadoop BigQuery Connector to count the number of unique exhibit types in every department of The Met Collection. The data are sourced from the Objects table in The Met Public Domain Art Works dataset, which is hosted on Google BigQuery.
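
As a rough local illustration of the counting logic (not the project's Java implementation), the same per-department count can be sketched with standard Unix tools. The tab-separated department/object-name input format, the sample rows, and the function name count_unique_objects are assumptions made for this example.

```shell
# Count distinct object names per department from tab-separated
# "department<TAB>object_name" lines on stdin (assumed input format).
count_unique_objects() {
    # sort -u drops duplicate (department, object_name) pairs, then awk
    # tallies the remaining pairs per department, much like a reducer.
    LC_ALL=C sort -u |
        awk -F '\t' '{ count[$1]++ } END { for (d in count) printf "%s\t%d\n", d, count[d] }' |
        LC_ALL=C sort
}

# Hypothetical sample rows; the real job reads its rows from BigQuery.
printf 'Asian Art\tVase\nAsian Art\tVase\nAsian Art\tBowl\nEuropean Paintings\tPainting\n' |
    count_unique_objects
```

The duplicate Vase row is counted once, so the sample yields 2 for Asian Art and 1 for European Paintings.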

Maven Project Directory

This project uses Apache Maven to manage the Java app dependencies. The pom.xml file is the configuration file that declares the dependencies required to build the package (in this case the Hadoop and BigQuery clients) and relocates some of the packages. The version of the Hadoop client declared in pom.xml should match the one run by your cluster. If in doubt, run $ hadoop version from the command line of your instance (or Dataproc cluster if you are using GCP) to identify the Hadoop version. You can check the latest versions of your Java dependencies at the Maven Central Repository.
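
A small sketch of the version check, assuming the first line of the hadoop version output has the usual "Hadoop <version>" form. The helper name hadoop_version_number is made up for this example.

```shell
# Extract the version number from `hadoop version` output read on stdin.
# Assumes the first line has the form "Hadoop <version>".
hadoop_version_number() {
    awk 'NR == 1 { print $2 }'
}

# Run on the cluster node; skipped when the hadoop CLI is not installed.
if command -v hadoop >/dev/null 2>&1; then
    hadoop version | hadoop_version_number
fi
```

Compare the printed value with the Hadoop client version declared in pom.xml.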

.
├── pom.xml                                    # Configuration file for Apache Maven
├── src
│   └── main
│       └── java
│           └── met_objects                    # Main package name
│               ├── CountArtObjects.java       # Java project source code
│               └── TextArrayWritable.java     # Subclass extending ArrayWritable into a Text-type class
└── target
    └── met-object-count-0.0.1.jar             # JAR file

Running Hadoop Jobs on Google Dataproc using the Cloud SDK
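
The commands in this section refer to several shell variables. One way to define them first is sketched here. Every value is a placeholder, and the project:dataset.table form of OUTPUT_TABLE is an assumption modeled on the input table argument passed to the job.

```shell
# All values below are placeholders; replace them with your own.
export PROJECT="my-gcp-project"           # project ID, also used as the bucket name in gs://${PROJECT}
export CLUSTER_NAME="met-hadoop-cluster"  # name for the Dataproc cluster
export REGION="us-central1"               # Dataproc region
# Output table; the project:dataset.table form is assumed, not confirmed by the job.
export OUTPUT_TABLE="${PROJECT}:met_results.department_counts"
```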

Compile the Java classes and package them into a JAR on your local machine using Maven (or another Java build tool).
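
With Maven, the build step might look like the sketch below. The build_jar helper is hypothetical; it only checks that a pom.xml exists before running the standard mvn clean package goals.

```shell
# Build the JAR for the Maven project in the given directory (default: current).
build_jar() {
    project_dir="${1:-.}"
    if [ ! -f "${project_dir}/pom.xml" ]; then
        echo "no pom.xml found in ${project_dir}" >&2
        return 1
    fi
    # Runs in a subshell so the caller's working directory is unchanged.
    (cd "${project_dir}" && mvn clean package)
}
```

For example, build_jar /home/usr/my_Maven_project should leave the JAR under that project's target directory.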

Copy the JAR file to the Cloud Storage bucket of your project:

 $ gsutil cp /home/usr/my_Maven_project/target/met-object-count-0.0.1.jar gs://${PROJECT}/hadoop_job_files

Create a Dataproc cluster:

 $ gcloud dataproc clusters create ${CLUSTER_NAME} \
     --worker-machine-type n1-standard-4 \
     --num-workers 0 \
     --image-version 2.0.5-debian10 \
     --region ${REGION} \
     --max-idle=30m

Submit a Hadoop job to the Dataproc cluster:

 $ gcloud dataproc jobs submit hadoop \
     --cluster ${CLUSTER_NAME} \
     --jar gs://${PROJECT}/hadoop_job_files/met-object-count-0.0.1.jar \
     --region ${REGION} \
     -- ${PROJECT} bigquery-public-data:the_met.objects ${OUTPUT_TABLE} gs://${PROJECT}/hadoop_job_files/output   # Hadoop job arguments

Explore the results in the specified BigQuery output table.
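
One quick way to look at the output is the bq command-line tool that ships with the Cloud SDK. The preview_output_table helper is hypothetical, and no assumption is made about the output table's columns; it simply prints the first rows.

```shell
# Print the first 10 rows of a BigQuery table, e.g. "dataset.table".
preview_output_table() {
    table="$1"
    if [ -z "$table" ]; then
        echo "usage: preview_output_table <table>" >&2
        return 1
    fi
    bq head -n 10 "$table"
}
```

Call it as preview_output_table "${OUTPUT_TABLE}" once the job has finished.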

MichailParaskevopoulos/the-met-collection-hadoop
