Scripts

Valentin Kuznetsov edited this page Dec 20, 2017 · 2 revisions

This page describes all scripts available in CMSSpark.

submission scripts

  • bin/run_spark is a bash script that wraps spark-submit. It sets up all the JAR (Java archive) libraries needed to submit the provided Python script to HDFS+Spark.
  • run_aggregation is a bash script to aggregate CMS data-streams into records and submit them to CERN MONIT
  • cron4aggregation is a bash script to be used in crontab to run the run_aggregation script
  • cron4dbs_condor is a bash script to be used in crontab to submit the dbs_condor.py script via run_spark; it identifies the last date of DBS+HTCondor data available on HDFS and uses it in the submission
  • cron4phedex is a bash script to be used in crontab to submit the phedex.py script via run_spark; it identifies the last date of PhEDEx data available on HDFS and uses it in the submission
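
The cron wrappers above share one pattern: find the most recent date for which data exists on HDFS, then pass that date to run_spark together with the processing script. The Python sketch below illustrates only that pattern. It is not the CMSSpark implementation (the real wrappers are bash scripts), and the directory layout `<base>/YYYY/MM/DD`, the example paths, the function names `latest_date` and `build_command`, and the `--date` option are all assumptions made for illustration.

```python
import re
from datetime import date

# Assumed layout: each day's data lives under <base>/YYYY/MM/DD.
DATE_DIR = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/?$")


def latest_date(paths):
    """Return the most recent date found among HDFS-style directory paths.

    Paths that do not end in /YYYY/MM/DD are ignored; returns None if
    no path matches.
    """
    found = []
    for path in paths:
        match = DATE_DIR.search(path)
        if match:
            year, month, day = (int(part) for part in match.groups())
            found.append(date(year, month, day))
    return max(found) if found else None


def build_command(script, day, runner="run_spark"):
    """Build the argument list for submitting `script` for a given day.

    The --date option and its YYYYMMDD format are hypothetical.
    """
    return [runner, script, "--date", day.strftime("%Y%m%d")]


if __name__ == "__main__":
    # Hypothetical directory listing, e.g. as returned by an HDFS "ls".
    listing = [
        "/data/condor/2017/12/17",
        "/data/condor/2017/12/18",
        "/data/condor/_tmp",
    ]
    last = latest_date(listing)
    print(build_command("dbs_condor.py", last))
```

Returning the command as a list rather than a single string keeps the sketch easy to test and avoids shell-quoting issues if it were ever passed to `subprocess.run`.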

processing scripts

  • aso_stats.py provides a PySpark pipeline to collect ASO statistics
  • data_aggregation.py provides the PySpark CMS popularity pipeline, which collects data from the DBS, AAA, CMSSW, EOS and JM data streams
  • dbs_aaa.py provides an example of a PySpark pipeline for DBS+AAA aggregation
  • dbs_adler.py provides an example of a PySpark pipeline for DBS LFN adler lookup
  • dbs_block_lumis.py provides an example of a PySpark pipeline for DBS block lumi aggregation
  • dbs_cmssw.py provides an example of a PySpark pipeline for DBS+CMSSW aggregation
  • dbs_condor.py provides an example of a PySpark pipeline for DBS+HTCondor aggregation
  • dbs_eos.py provides an example of a PySpark pipeline for DBS+EOS aggregation
  • dbs_jm.py provides an example of a PySpark pipeline for DBS+JobMonitoring aggregation
  • dbs_lfn.py provides an example of a PySpark pipeline for DBS LFN look-up aggregation
  • dbs_phedex.py provides an example of a PySpark pipeline for DBS+PhEDEx aggregation
  • fts_aso.py provides an example of a PySpark pipeline for FTS+ASO aggregation
  • jm_stats.py provides an example of a PySpark pipeline for JobMonitoring stats
  • phedex.py provides an example of a PySpark pipeline for PhEDEx aggregation
  • phedex_agg.py provides an example of a PySpark pipeline for post-processing PhEDEx aggregation (replaced by mergePhedex.py or its Go version, both of which are much faster)
  • wmarchive.py provides an example of a PySpark pipeline for WMArchive aggregation

core/utilities scripts

  • schema.py contains all data-stream schemas
  • spark_utils.py provides generic utilities to access data streams on HDFS and define data-stream tables
  • utils.py provides generic utilities

helper scripts

  • cern_monit.py is a helper script to submit given records to CERN MONIT via an AMQ broker
  • dates.py is a helper script to produce a series of dates
  • getCSV.py is a helper script to fetch HDFS dataframes and store them in a local area (used together with dbs_condor.py)
  • mergePhedex.py is a helper script to process the HDFS PhEDEx dataframe and produce an aggregated one
  • mergePhedex.go is a Go equivalent of mergePhedex.py which runs about 5 times faster (on a multi-core node)
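
To illustrate the kind of task a helper such as dates.py handles, the sketch below generates a series of consecutive dates using only the Python standard library. It is a guess at the idea, not the script's actual code: the function name `date_series`, its arguments, and the YYYYMMDD output format are assumptions.

```python
from datetime import date, timedelta


def date_series(start, end, fmt="%Y%m%d"):
    """Yield every day from start to end (inclusive), formatted with fmt.

    Because this is a generator, the ValueError for a reversed range is
    raised when iteration begins, not when the function is called.
    """
    if end < start:
        raise ValueError("end date precedes start date")
    current = start
    while current <= end:
        yield current.strftime(fmt)
        current += timedelta(days=1)


if __name__ == "__main__":
    # Print one formatted date per line for a three-day range.
    for day in date_series(date(2017, 12, 18), date(2017, 12, 20)):
        print(day)
```

A series like this could feed a loop that submits one job per day, for example when back-filling a range of dates.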