
Create spark job to get user file acc times using wmarchive lfnarray belonging crab jobs #113

Open · mrceyhun wants to merge 3 commits into master from f-wma-file-access

Conversation

@mrceyhun (Contributor) commented Sep 7, 2022

This PR adds the calculation of first/last access times of datasets accessed by user jobs, using WMArchive data filtered to the CRAB* job type.

Most of the lines deal with extracting additional DBS information, but the main logic that extracts LFN files and their metadata from WMArchive data can be found in the udf_lfn_extract function; a sketch of that step is shown below.
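For reference, here is a minimal sketch of what that extraction step can look like. It is not the PR's exact code: the input path, file format, and filtering rules are assumptions for illustration only; the `LFNArray`, `data.meta_data.ts`, and `jobtype` field names follow the WMArchive schema discussed in this PR.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("wma-crab-file-access-sketch").getOrCreate()

lfn_schema = ArrayType(StructType([
    StructField("file", StringType()),
    StructField("access_ts", LongType()),
]))

@F.udf(returnType=lfn_schema)
def udf_lfn_extract(lfn_array, ts):
    """Return (file, access_ts) pairs for LFNs that look like real dataset files."""
    if not lfn_array or not ts:
        return None
    # keep only /store/... files and skip unmerged areas (this filter rule is an assumption)
    return [{"file": lfn, "access_ts": ts}
            for lfn in lfn_array
            if lfn.startswith("/store/") and not lfn.startswith("/store/unmerged/")]

# usage: read FWJR records, keep CRAB jobs, explode the extracted LFNs
fwjr = spark.read.json("hdfs:///path/to/wmarchive/fwjr/")  # placeholder path and format
crab_files = (fwjr
              .where(F.col("data.meta_data.jobtype").startswith("CRAB"))
              .withColumn("lfns", udf_lfn_extract(F.col("data.LFNArray"),
                                                  F.col("data.meta_data.ts")))
              .select(F.explode("lfns").alias("f"))
              .select("f.file", "f.access_ts"))
```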

@drkovalskyi WMArchive HDFS data goes back 18 months, i.e. to 2021-03. This means that if a file was accessed before 2021-03-01, we will not have that information. I'm running the Spark job on the full data and sending the results to ES; I'll let you know when they are visible in Kibana.

@vkuznet you're the WMArchive expert and the creator of its producer. If you have time, a review would be great.

The bash script is long, but it follows the general pattern we use for our other cron jobs, so it is straightforward on our side.

@mrceyhun self-assigned this on Sep 7, 2022
@mrceyhun force-pushed the f-wma-file-access branch 3 times, most recently from a6de041 to e684264 on September 7, 2022 19:43
@mrceyhun (Contributor, Author) commented Sep 7, 2022

Changed access time from data.wmats to data.meta_data.ts

@mrceyhun force-pushed the f-wma-file-access branch 4 times, most recently from a8e9b52 to fb01a2e on September 7, 2022 22:15
@brij01 left a comment

Thanks @mrceyhun.
The changes look good to me, and the description in the script file is very clear as well.
Awaiting review by WMArchive expert @vkuznet.

# ------------------------------------------------------------------------------------------------------- RUN SPARK JOB
# Required for Spark job in K8s
util4logi "spark job starts"
export PYTHONPATH=$script_dir/../src/python:$PYTHONPATH
Collaborator review comment:

This is a very dangerous construct: it hides the fact that this script depends on a specific directory layout of the Python sources. I would rather put this at the top of the script and check that a Python module the script requires can be imported. If the import fails, you can throw an error asking the user to set up the proper PYTHONPATH environment.
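One way to make that dependency explicit is to fail early when the required module cannot be imported. The sketch below illustrates the idea on the Python side; the module name `CMSMonitoring` is an assumption, and an equivalent check could also be done from the bash wrapper before launching spark-submit.

```python
import sys

try:
    # assumed dependency shipped under src/python in this repository
    import CMSMonitoring  # noqa: F401
except ImportError:
    sys.exit("CMSMonitoring not importable; please add <repo>/src/python to PYTHONPATH")
```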

# Define logs path for Spark imports which produce lots of info logs
LOG_DIR="$WDIR"/logs/$(date +%Y%m%d)
mkdir -p "$LOG_DIR"

Collaborator review comment:

The script should print all environment variables it uses, e.g.

echo "LOG_DIR=$LOG_DIR"
...

This will help you later in the debugging process.

@@ -0,0 +1,122 @@
#!/bin/bash
set -e
Collaborator review comment:

Please add an author section to the script.

HDFS_DBS_PHYSICS_GROUPS = f'/tmp/cmsmonit/rucio_daily_stats-{TODAY}/PHYSICS_GROUPS/part*.avro'
HDFS_DBS_ACQUISITION_ERAS = f'/tmp/cmsmonit/rucio_daily_stats-{TODAY}/ACQUISITION_ERAS/part*.avro'
HDFS_DBS_DATASET_ACCESS_TYPES = f'/tmp/cmsmonit/rucio_daily_stats-{TODAY}/DATASET_ACCESS_TYPES/part*.avro'

Collaborator review comment:

I suggest adding a dump of all global variables to stdout to help with the debugging process.
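As an illustration of that suggestion (a sketch, not part of this PR), the upper-case configuration globals such as the HDFS paths above could be printed at startup:

```python
def dump_globals():
    """Print all upper-case module-level globals (HDFS paths, dates, ...) to stdout."""
    for name, value in sorted(globals().items()):
        if name.isupper():  # convention assumed here: configuration globals are upper-case
            print(f"{name}={value}")

# call dump_globals() once at the beginning of main() so every run logs its configuration
```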

# Send data with STOMP AMQ
# =====================================================================================================================
def credentials(f_name):
if os.path.exists(f_name):
Collaborator review comment:

Please add a docstring describing the format of the input file.
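A sketch of what such a docstring could look like is below. The exact file format is an assumption here (a JSON file with AMQ broker credentials); the real format is defined by the CMS monitoring AMQ setup, not by this sketch.

```python
import json
import os

def credentials(f_name):
    """Read STOMP AMQ credentials from `f_name`.

    The file is expected to be JSON, for example (keys are illustrative assumptions):
        {"username": "...", "password": "...", "producer": "...",
         "topic": "...", "host": "...", "port": 61313}
    Returns the parsed dictionary, or an empty dict if the file does not exist.
    """
    if not os.path.exists(f_name):
        return {}
    with open(f_name) as f:
        return json.load(f)
```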

@mrceyhun force-pushed the f-wma-file-access branch 3 times, most recently from d834c44 to 3324217 on September 11, 2022 21:15
@mrceyhun (Contributor, Author) commented:

Thanks @vkuznet, I applied all the changes.

@drkovalskyi The Spark job is ready to test. Since it takes a long time to run, I could not fully test it yet; I'll do that tomorrow. There were also some problems with SWAN which slowed me down a bit.

In general:

  • Dataset creation time will come from max(CREATION_DATE) of its files.
  • Instead of filtering only CRAB3 WMArchive jobs, I included all job types in case we need them.
  • I joined them with the previous access times that come from Rucio REPLICAS. There will be separate columns for dataset last access times, such as RucioLastAccess and WmaLastAccess; a sketch of this join is shown after this list.
  • Only datasets on DISK and prod RSEs (not test/temp) are used.
  • There are many logical operations and joins across 10 tables; I hope I'm not missing anything.
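The following is a minimal sketch of how the WMArchive-derived access times could be combined with the Rucio replica access times, keeping only replicas on production DISK RSEs. The table and column names (wma_file_access, rucio_replicas, rse_type, rse_kind, ACCESSED_AT, ...) are assumptions for illustration, not the PR's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("access-times-join-sketch").getOrCreate()

wma_df = spark.table("wma_file_access")    # assumed intermediate table: (dataset, access_ts) from CRAB jobs
rucio_df = spark.table("rucio_replicas")   # assumed Rucio REPLICAS dump: (dataset, rse_id, ACCESSED_AT)
rses_df = spark.table("rucio_rses")        # assumed Rucio RSES dump: (rse_id, rse_type, rse_kind)

# per-dataset access statistics derived from WMArchive CRAB jobs
wma_times = (wma_df.groupBy("dataset")
             .agg(F.max("access_ts").alias("WmaLastAccess"),
                  F.min("access_ts").alias("WmaFirstAccess"),
                  F.count("*").alias("WmaAccessCnt")))

# keep only replicas on production DISK RSEs (column names are assumptions)
prod_disk = (rucio_df.join(rses_df, "rse_id")
             .where((F.col("rse_type") == "DISK") & (F.col("rse_kind") == "prod")))

rucio_times = (prod_disk.groupBy("dataset")
               .agg(F.max("ACCESSED_AT").alias("RucioLastAccess")))

# one row per dataset, with both Rucio and WMArchive access times plus a combined LastAccess
result = (rucio_times.join(wma_times, "dataset", "full_outer")
          .withColumn("LastAccess", F.greatest("RucioLastAccess", "WmaLastAccess")))
```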

@mrceyhun (Contributor, Author) commented Oct 4, 2022

This Spark job will need proper documentation once it is completely done. Let me explain the latest changes here:

As requested by Dima, we need the last access times and access counts of user jobs for each dataset. WMArchive provides user job information under jobtype:CRAB3. Without applying any jobstatus filter, WmaAccessCnt is calculated for each dataset. Without going into too much detail, this information is extracted through many joins and aggregations between the LFNArray files in WMArchive, the Rucio tables, and the DBS tables.
To calculate the access count properly, we also need to count accesses to child datasets. The DBS dataset_parents table provides the parent/child relationship between datasets. Because a dataset can have multiple parents, I used custom logic to compute each parent dataset's count including the counts of its child datasets. This gist gives a hint of how it is calculated: https://gist.github.com/mrceyhun/b2b13dab8bc401d7f6b0e6035866d154 ; you can check the related functions there. In short, WmaTotalAccessCnt provides the total USER accesses to a dataset including its children, while WmaAccessCnt provides the USER accesses to the dataset itself only. Calling the depth of the parent/child relationship the hierarchy level, I stopped the calculation after 5 levels to keep things efficient, though we could go deeper. For example, NANOAOD->MINIAOD->AOD->RAW is 3 levels: child->parent->grandparent->great-grandparent. A small standalone sketch of this propagation idea follows below.
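The sketch below illustrates the propagation idea only; it is not the PR's implementation (see the gist for that). It assumes `access_cnt` maps each dataset to its own access count and `parents` maps a child dataset to its (possibly multiple) parent datasets.

```python
from collections import defaultdict

def total_access_counts(access_cnt, parents, max_level=5):
    """Return dataset -> own count plus the counts of all descendants within max_level."""
    total = dict(access_cnt)   # start from each dataset's own count
    current = access_cnt       # counts attributed at the current hierarchy level
    for _ in range(max_level):
        lifted = defaultdict(int)
        for child, cnt in current.items():
            for parent in parents.get(child, []):  # a dataset can have multiple parents
                lifted[parent] += cnt
        if not lifted:
            break
        for ds, cnt in lifted.items():
            total[ds] = total.get(ds, 0) + cnt
        current = lifted
    return total

# toy usage: RAW is the grandparent of NANOAOD via MINIAOD
counts = {"NANOAOD": 3, "MINIAOD": 2, "RAW": 1}
tree = {"NANOAOD": ["MINIAOD"], "MINIAOD": ["RAW"]}
print(total_access_counts(counts, tree))  # RAW ends up with 1 + 2 + 3 = 6
```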

Coming to the access times: LastAccess provides the latest access to a dataset and includes the Rucio REPLICAS values, whereas WmaLastAccess and WmaFirstAccess come from WMArchive USER jobs. As the naming suggests, any field starting with Wma is derived only from the CRAB3 job records of WMArchive.

Last but not least, there are 2 critical filters: RseType:DISK and RseKind:prod. This means no test/temp RSE files (replicas) are included in the calculations that require the Rucio REPLICAS table, such as LastAccess.

fyi @drkovalskyi
