Skip to content

πŸ–ΌοΈ An implementation of Apache Hadoop to count the unique objects in every curatorial department of The Met Collection

License

Notifications You must be signed in to change notification settings

MichailParaskevopoulos/the-met-collection-hadoop

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

31 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ–ΌοΈ The MET Collection

An implementation of Apache Hadoop to count the unique objects in every curatorial department of The Met Collection using tools available on the Google Cloud Platform.

Overview

This project executes a MapReduce job with the Hadoop BigQuery Connector to count the number of unique exhibit types in every deparment of The MET Collection. The data are sourced from the Objects table in The Met Public Domain Art Works dataset that is hosted on Google BigQuery.

Maven Project Directory

This project was built around Apache Maven to manage the Java app dependencies. The pom.xml file is a configuration file that contains the information required to build the package dependencies (in this case the Hadoop and BigQuery clients), as well as relocate some of the packages. Keep in mind, the version of the Hadoop client declared in the pom.xml file should match the one run by your cluster. If in doubt, run $ Hadoop version from the command line of your instance (or Dataproc Cluster if you are using GCP) to identify the Hadoop version. Finally, you can check the latest version of your Java dependencies at the Maven Central Repository.

.
β”œβ”€β”€ pom.xml                                    # Configuration file for Apache Maven
β”œβ”€β”€ src                   
β”œ    β”œβ”€β”€ main
β”œ        β”œβ”€β”€ java
β”œ            β”œβ”€β”€ met_objects                   # Main package name
β”œ                β”œβ”€β”€ CountArtObjects.java      # Java project source code 
β”œ                β”œβ”€β”€ TextArrayWritable.java    # Subclass extending ArrayWritable into a Text-type class
β”œβ”€β”€ target
     β”œβ”€β”€ met-object-count-0.0.1.jar            # JAR file 

Running Hadoop Jobs on Google Dataproc using the Cloud SDK

  1. Compile the Java class files in your local machine using Maven (or another Java project management tool)

  2. Copy the JAR file to the Cloud Storage bucket of your project:

     $ gsutil cp /home/usr/my_Maven_project/target/met-object-count-0.0.1.jar gs://${PROJECT}/hadoop_job_files
    
  3. Create a Dataproc cluster:

     $ gcloud dataproc clusters create ${CLUSTER_NAME} \
         --worker-machine-type n1-standard-4 \
         --num-workers 0 \
         --image-version 2.0.5-debian10 \
         --region ${REGION} \
         --max-idle=30m 
    
  4. Submit a Hadoop job to the Dataproc cluster:

     $ gcloud dataproc jobs submit hadoop \
         --cluster ${CLUSTER_NAME} \
         --jar gs://${PROJECT}/hadoop_job_files/met-object-count-0.0.1.jar \
         --region ${REGION} \
         -- ${PROJECT} bigquery-public-data:the_met.objects ${OUTPUT_TABLE} gs://${PROJECT}/hadoop_job_files/output   # Hadoop job arguments
    
  5. Explore the results in the specified BigQuery output table

About

πŸ–ΌοΈ An implementation of Apache Hadoop to count the unique objects in every curatorial department of The Met Collection

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages