Architecture

CMSSpark architecture

The CMSSpark architecture is shown in the figure below:

[Figure: CMSSpark architecture]

It consists of several components:

- a wrapper script, run_spark
- user-provided Python template code (we call it a workflow), which should implement the initialization of the Spark context and the data processing pipeline
  - dbs_aaa.py is an example Python template that aggregates data between CMS DBS and AAA records on HDFS. More examples can be found in the same location
  - cern_monit.py is an example Python template that sends data to the CERN MONIT system

run_spark loads the provided Python template code and runs its data processing pipeline. It stores the data back to HDFS, where it can be inspected. Optionally, the end user can call the run_spark cern_monit.py bundle to put the data into the CERN MONIT system (via a Stomp AMQ call to the specified end-point).
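
The optional MONIT step can be pictured as wrapping each aggregated record in a JSON document and sending it over STOMP. The sketch below is not the code of cern_monit.py: it assumes the third-party stomp.py package, and the envelope fields (`producer`, `type`, `timestamp`, `data`) as well as the host, port, credentials, and destination parameters are hypothetical placeholders, not the actual MONIT schema or end-point.

```python
import json
import time


def make_document(record, producer, doc_type, timestamp=None):
    """Wrap one aggregated record in a JSON-serializable envelope.

    The envelope layout is an assumption made for illustration only.
    """
    if timestamp is None:
        timestamp = int(time.time() * 1000)  # milliseconds since the epoch
    return {
        "producer": producer,
        "type": doc_type,
        "timestamp": timestamp,
        "data": record,
    }


def send_documents(docs, host, port, username, password, destination):
    """Send each document as one JSON message over a STOMP connection."""
    import stomp  # third-party stomp.py package, imported lazily

    conn = stomp.Connection(host_and_ports=[(host, port)])
    conn.connect(username, password, wait=True)
    try:
        for doc in docs:
            conn.send(destination=destination,
                      body=json.dumps(doc),
                      headers={"content-type": "application/json"})
    finally:
        # Always close the connection, even if a send fails.
        conn.disconnect()
```

For example, `make_document({"dataset": "/a/b/c", "naccesses": 3}, "cmsspark", "aaa")` builds one such document, and a list of them can then be passed to `send_documents` together with the end-point details supplied by the user.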

All Python template code is based on the PySpark architecture.
