Random values in `DataLoader.load()` output if a stream/tier does not provide values for a certain column #518

gipert · 2023-10-12T16:40:56Z

Example: is_valid_0vbb is available in tier hit for the geds subsystem only. If loading data from spms and geds at the same time, is_valid_0vbb will randomly provide True or False (I guess because of uninitialized memory).

This is obviously very dangerous and must be fixed ASAP.

We cannot simply fix it by using default values unfortunately (NaN, for example, works only with floats), so we need to return a different data structure.

The text was updated successfully, but these errors were encountered:

jasondet · 2023-10-12T17:47:10Z

are you sure it's not just a problem related to the config files not being generated correctly? we have other parameters in the hit tier that are there for ged and not spm and did not have this problem in testing, I think, right @gracesong312 ?

gipert · 2023-10-12T20:37:04Z

Yes I am sure. It's just because in rectangular data structures you obviously need some placeholder, when the hit corresponding to a certain index does not define a column value:

hit_table name    is_valid_0vbb
1052802   V07302A True
1052802   S060    ?????

In case of floats one could use NaN, but no "missing" placeholder exists for booleans.

gracesong312 · 2023-10-12T20:37:50Z

I think I overlooked this in the original testing, here's a basic test I just ran on the legend-testdata files:

Loading energies and is_valid_hit (sipm only) and trapEmax and is_valid_0vbb (ge only) for both a sipm channel and a ge channel at the same time.
For sipm channel:
- trapEmax is 0 or a very small number (e-44)
- is_valid_0vbb is False
For ge channel:
- energies is a list, which occasionally has large numbers in it
- is_valid_hit is a list, mix of True and False
  So I'm not getting the issue with germanium parameters in sipm hits, but given that it's happening the other way around and the numbers change if I rerun the script, I assume it's just because I'm not running on enough files to see it.

Is it possible to just return None?

gipert · 2023-10-12T20:48:32Z

In my test I randomly get true or false in is_valid_0vbb for SiPM data. I can work on a MWE tomorrow.

@gracesong312 do you pre-allocate empty columns for the output table? This could explain why the values are unpredictable (because no actual value is ever written to the pre-allocated memory).

Pandas uses NumPy internally, so a boolean column cannot contain non-booleans. I would not force that column to be float in order to be able to use NaNs, I think we should use a different data structure. Discussion for next week!

gracesong312 · 2023-10-12T21:32:13Z

Yes, I allocate memory with np.empty which explains the random values.

pygama/src/pygama/flow/utils.py

Lines 137 to 144 in 663d352

 elif isinstance(tier_table[col], Array): 

 # Allocate memory for column for all channels 

 if col not in col_dict.keys(): 

 col_dict[col] = np.empty( 

 table_length, 

 dtype=tier_table[col].dtype, 

 ) 

 col_dict[col][tcm_idx] = tier_table[col].nda

jasondet · 2023-10-13T14:56:02Z

shouldn't the ged hits and spm hits be in two separate lists? otherwise we are going to have nans everywhere, right? hit_table name energy npe 1052802 V07302A 2039 nan 1052802 S060 nan 3

…

------------------------------------- Jason Detwiler Department of Physics, Box 351560 University of Washington Seattle, WA 98195 (206) 543-4054

On Thu, Oct 12, 2023 at 2:32 PM Grace Song ***@***.***> wrote: Yes, I allocate memory with np.empty which explains the random values. https://github.com/legend-exp/pygama/blob/663d352ef9d2d5b73b0c5927dad88692a5f80442/src/pygama/flow/utils.py#L137-L144 <https://urldefense.com/v3/__https://github.com/legend-exp/pygama/blob/663d352ef9d2d5b73b0c5927dad88692a5f80442/src/pygama/flow/utils.py*L137-L144__;Iw!!K-Hz7m0Vt54!gCX4qyzdSLr9lc6oeLCOU0jsUpBByOXxbbzKqJJzWnJvhvppOTBtQ0-Igj0EzICYxMc438egOj73chertl9qJYE$> — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/legend-exp/pygama/issues/518*issuecomment-1760392434__;Iw!!K-Hz7m0Vt54!gCX4qyzdSLr9lc6oeLCOU0jsUpBByOXxbbzKqJJzWnJvhvppOTBtQ0-Igj0EzICYxMc438egOj73chertEPmOLM$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AARCTXUQNNEWVH7WJLBRB7TX7BOWRANCNFSM6AAAAAA554USNE__;!!K-Hz7m0Vt54!gCX4qyzdSLr9lc6oeLCOU0jsUpBByOXxbbzKqJJzWnJvhvppOTBtQ0-Igj0EzICYxMc438egOj73cherfIjYTXM$> . You are receiving this because you commented.Message ID: ***@***.***>

gipert added bug Something isn't working flow High-level data management labels Oct 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Random values in `DataLoader.load()` output if a stream/tier does not provide values for a certain column #518

Random values in `DataLoader.load()` output if a stream/tier does not provide values for a certain column #518

gipert commented Oct 12, 2023

jasondet commented Oct 12, 2023

gipert commented Oct 12, 2023

gracesong312 commented Oct 12, 2023

gipert commented Oct 12, 2023

gracesong312 commented Oct 12, 2023

jasondet commented Oct 13, 2023 via email

Random values in DataLoader.load() output if a stream/tier does not provide values for a certain column #518

Random values in DataLoader.load() output if a stream/tier does not provide values for a certain column #518

Comments

gipert commented Oct 12, 2023

jasondet commented Oct 12, 2023

gipert commented Oct 12, 2023

gracesong312 commented Oct 12, 2023

gipert commented Oct 12, 2023

gracesong312 commented Oct 12, 2023

jasondet commented Oct 13, 2023 via email

Random values in `DataLoader.load()` output if a stream/tier does not provide values for a certain column #518

Random values in `DataLoader.load()` output if a stream/tier does not provide values for a certain column #518