
Architecture

Valentin Kuznetsov edited this page Feb 7, 2018 · 4 revisions

CMSSpark architecture

The CMSSpark architecture is shown in the figure below:

[Figure: CMSSpark architecture]

It consists of several components:

  • a wrapper script, run_spark
  • user-provided Python template code (which we call a workflow) that should implement the initialization of the Spark context and the data processing pipeline
    • dbs_aaa.py is an example Python template that aggregates data between CMS DBS and AAA records on HDFS. More examples can be found in the same location.
    • cern_monit.py is an example Python template that sends data to the CERN MONIT system
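
To make the shape of a workflow more concrete, here is a minimal sketch of what such a template might look like. It is not the code of dbs_aaa.py: the function names, the input formats, and the column names `file_name` and `dataset` are assumptions made for illustration only. It also uses `SparkSession`, the PySpark entry point that wraps the Spark context.

```python
"""Hypothetical workflow template; names, paths, and columns are illustrative."""


def build_pipeline(spark, dbs_path, aaa_path):
    """Join two record sets and count joined records per dataset.

    The column names `file_name` and `dataset` are assumptions made for
    this sketch, not the real DBS or AAA schema.
    """
    dbs = spark.read.csv(dbs_path, header=True)   # assumed CSV input
    aaa = spark.read.json(aaa_path)               # assumed JSON input
    joined = dbs.join(aaa, on="file_name", how="inner")
    return joined.groupBy("dataset").count()


def run(dbs_path, aaa_path, out_path):
    """Initialize Spark, run the pipeline, and write the result to out_path."""
    # Imported lazily so the module can be read and tested without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("workflow_sketch").getOrCreate()
    try:
        result = build_pipeline(spark, dbs_path, aaa_path)
        result.write.mode("overwrite").csv(out_path, header=True)
    finally:
        spark.stop()
```

Keeping the pipeline in its own function, separate from the Spark initialization, is a design choice of this sketch: it lets the data processing steps be exercised with a local session or a stand-in object before they are run on the cluster.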

run_spark loads the provided Python template code and runs its data processing pipeline. It stores the resulting data back on HDFS, where it can be inspected. Optionally, the end user can call the run_spark cern_monit.py bundle to put the data into the CERN MONIT system (via a Stomp AMQ call to the specified endpoint).
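
As a rough illustration of the wrapper's role, the sketch below builds a command line that hands a workflow file to `spark-submit`. This is only an assumption about how such a wrapper could work, not the actual run_spark implementation; the helper `build_spark_command` and the `--fout` workflow option are hypothetical.

```python
def build_spark_command(workflow, workflow_args=(), yarn=False):
    """Return the argument list for submitting a workflow file to Spark.

    Hypothetical helper: the real run_spark wrapper may set up its
    environment and options differently.
    """
    if not workflow.endswith(".py"):
        raise ValueError("workflow must be a Python file: %s" % workflow)
    # Run on the YARN cluster when requested, otherwise on local cores.
    master = "yarn" if yarn else "local[*]"
    return ["spark-submit", "--master", master, workflow, *workflow_args]


# Example: submit a workflow with one (assumed) output option.
cmd = build_spark_command(
    "dbs_aaa.py", ["--fout", "hdfs:///path/to/output"], yarn=True
)
# The list could then be executed with subprocess.run(cmd, check=True).
```

Returning an argument list rather than a single string avoids shell quoting issues when the command is eventually executed.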

All Python template code is based on the PySpark architecture.
