
Scrutiny plot


Strategy

In order to produce the scrutiny plot we perform several steps to aggregate the data:

  • collect PhEDEx snapshots for the desired period of time, e.g. 1 year
    • aggregate all PhEDEx DataFrames into a single one which provides the number of days each dataset was present on a specific site
  • collect dataset, campaign, release and era aggregated DataFrames for the desired period of time

Collection of PhEDEx DataFrames

This step is done by running the run_spark phedex.py script from CMSSpark, e.g.:

# collect daily PhEDEx snapshots for the last 346 days and write them to HDFS
hdir=hdfs:///cms
dates=`python src/python/CMSSpark/dates.py --range --format="%Y%m%d" --ndays=346`
for d in $dates; do
    # print the command, then run the Spark job for the given day
    cmd="PYTHONPATH=$PWD/src/python bin/run_spark phedex.py --yarn --fout=$hdir --date=$d"
    echo $cmd
    PYTHONPATH=$PWD/src/python bin/run_spark phedex.py --yarn --fout=$hdir --date=$d
done

It runs over PhEDEx snapshots on HDFS and collects DataFrames with the following attributes:

date, site, dataset, size, replica_date, groupid

Then, we stage the data back from HDFS to local disk:

hadoop fs -get $hdir/phedex .

Finally, we use either Python or Go code to merge all PhEDEx DataFrames into a single DataFrame:

# python script
python src/python/CMSSpark/mergePhedex.py --idir=$PWD/phedex --fout=phedex.csv --dates 20170101-20171212
# Go-based script
go run src/Go/mergePhedex.go -idir=$PWD/phedex -fout phedex.csv -dates 20170101-20171212

It produces the following DataFrame:

site,dataset,min_date,max_date,min_rdate,max_rdate,min_size,max_size,days,gid

where the min/max date, rdate and size attributes are the minimum and maximum date, replica creation date and size, respectively. The days attribute is calculated from the min/max dates, and gid is the PhEDEx group id number.
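For illustration, below is a minimal sketch of what the merge step does, assuming headerless daily CSV snapshots under a phedex/<date>/ layout; the file paths and the exact days calculation are assumptions, and the authoritative logic lives in mergePhedex.py and mergePhedex.go.

# Minimal sketch of the merge step (hypothetical helper; the real logic lives in
# src/python/CMSSpark/mergePhedex.py). It assumes headerless daily CSV snapshots
# with the columns date,site,dataset,size,replica_date,groupid under phedex/<date>/.
import csv
import glob
from datetime import datetime

columns = ['date', 'site', 'dataset', 'size', 'replica_date', 'groupid']
records = {}  # (site, dataset) -> aggregated attributes

for fname in glob.glob('phedex/*/part-*'):  # assumed file layout
    with open(fname) as istream:
        for row in csv.DictReader(istream, fieldnames=columns):
            key = (row['site'], row['dataset'])
            date, rdate, size = int(row['date']), int(row['replica_date']), int(row['size'])
            rec = records.setdefault(key, dict(min_date=date, max_date=date,
                                               min_rdate=rdate, max_rdate=rdate,
                                               min_size=size, max_size=size,
                                               gid=row['groupid']))
            rec['min_date'], rec['max_date'] = min(rec['min_date'], date), max(rec['max_date'], date)
            rec['min_rdate'], rec['max_rdate'] = min(rec['min_rdate'], rdate), max(rec['max_rdate'], rdate)
            rec['min_size'], rec['max_size'] = min(rec['min_size'], size), max(rec['max_size'], size)

# days is derived from the min/max dates of the snapshots where the dataset was seen
for (site, dataset), rec in sorted(records.items()):
    span = datetime.strptime(str(rec['max_date']), '%Y%m%d') - datetime.strptime(str(rec['min_date']), '%Y%m%d')
    days = span.days + 1
    print(','.join(str(v) for v in (site, dataset, rec['min_date'], rec['max_date'],
                                    rec['min_rdate'], rec['max_rdate'],
                                    rec['min_size'], rec['max_size'], days, rec['gid'])))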

Collection of dataset, era, campaign, release DataFrames

The next step involves production of the dataset, era, campaign and release DataFrames from HTCondor ClassAds logs merged (if necessary) with the DBS database snapshot on HDFS. To produce these DataFrames we use the run_spark dbs_condor.py combination of scripts in the following way:

PYTHONPATH=$PWD/src/python bin/run_spark dbs_condor.py --yarn --fout=$hdir --date=$d

It yields results into the $hdir/dbs_condor area on HDFS, which we can stage back to local disk with hadoop fs -get as above.

Automation

The procedure described above for collecting PhEDEx and DBS+HTCondor information is automated via a set of crontab jobs. In particular, we submit two scripts:

  • cron4phedex
  • cron4dbs_condor

They identify the last date of data available on HDFS and submit the appropriate jobs to collect the aggregated DataFrames.
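For illustration, the sketch below shows what such a cron wrapper might do; the HDFS layout (one YYYYMMDD sub-directory per day under hdfs:///cms/phedex) and the job submission command are assumptions, and the authoritative logic lives in the cron4phedex and cron4dbs_condor scripts.

# Hypothetical sketch of a cron wrapper: find the last date present on HDFS and
# collect the missing days up to yesterday. The directory layout (one YYYYMMDD
# sub-directory per day under hdfs:///cms/phedex) is an assumption.
import subprocess
from datetime import datetime, timedelta

HDIR = 'hdfs:///cms/phedex'

def last_date_on_hdfs():
    "Return the latest YYYYMMDD directory found under HDIR, or None."
    out = subprocess.check_output(['hadoop', 'fs', '-ls', HDIR]).decode()
    names = [line.split('/')[-1] for line in out.splitlines() if line.strip()]
    dates = [n for n in names if n.isdigit() and len(n) == 8]
    return max(dates) if dates else None

def main():
    last = last_date_on_hdfs()
    yesterday = datetime.now() - timedelta(days=1)
    day = datetime.strptime(last, '%Y%m%d') + timedelta(days=1) if last else yesterday
    while day.date() <= yesterday.date():
        date = day.strftime('%Y%m%d')
        # submit the collection job for the missing day (same command as in the manual loop above)
        cmd = 'PYTHONPATH=$PWD/src/python bin/run_spark phedex.py --yarn --fout=hdfs:///cms --date=%s' % date
        subprocess.check_call(cmd, shell=True)
        day += timedelta(days=1)

if __name__ == '__main__':
    main()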