Random values in DataLoader.load() output if a stream/tier does not provide values for a certain column #518

Open
gipert opened this issue Oct 12, 2023 · 6 comments
Labels: bug (Something isn't working), flow (High-level data management)

Comments

@gipert
Member

gipert commented Oct 12, 2023

Example: is_valid_0vbb is available in tier hit for the geds subsystem only. When loading data from spms and geds at the same time, is_valid_0vbb randomly comes out True or False for the spms hits (I guess because of uninitialized memory).

This is obviously very dangerous and must be fixed ASAP.

Unfortunately, we cannot simply fix it by using default values (NaN, for example, works only with floats), so we need to return a different data structure.

@gipert added the labels bug (Something isn't working) and flow (High-level data management) on Oct 12, 2023
@jasondet
Collaborator

Are you sure it's not just a problem with the config files not being generated correctly? We have other parameters in the hit tier that exist for ged and not for spm, and I think we did not see this problem in testing, right @gracesong312?

@gipert
Member Author

gipert commented Oct 12, 2023

Yes, I am sure. It's simply because a rectangular data structure needs some placeholder whenever the hit at a given index does not define a value for a column:

hit_table  name     is_valid_0vbb
1052802    V07302A  True
1052802    S060     ?????

In case of floats one could use NaN, but no "missing" placeholder exists for booleans.
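To make the placeholder problem concrete, here is a minimal NumPy sketch. The array names and values are made up for illustration and do not come from the loader; the point is only that NaN fits in a float column but has no counterpart in a plain boolean one.

```python
import numpy as np

# A float column can mark a missing entry with NaN (values are illustrative).
float_column = np.array([1.0, np.nan])
missing_float = np.isnan(float_column)

# A plain boolean column cannot: mixing bool and NaN upcasts the whole
# array to float64, so the column is no longer boolean.
mixed = np.array([True, np.nan])

# Converting NaN to bool does not give "missing" either; NaN is truthy.
nan_as_bool = bool(np.nan)

print(missing_float, mixed.dtype, nan_as_bool)
```

So with a plain NumPy bool dtype, any placeholder is indistinguishable from a real True or False.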

@gracesong312
Collaborator

I think I overlooked this in the original testing. Here's a basic test I just ran on the legend-testdata files:

  • Loading energies and is_valid_hit (sipm only) and trapEmax and is_valid_0vbb (ge only) for both a sipm channel and a ge channel at the same time.
  • For sipm channel:
    • trapEmax is 0 or a very small number (e-44)
    • is_valid_0vbb is False
  • For ge channel:
    • energies is a list, which occasionally has large numbers in it
    • is_valid_hit is a list, mix of True and False

So I'm not getting the issue with germanium parameters in sipm hits, but given that it's happening the other way around and the numbers change if I rerun the script, I assume it's just because I'm not running on enough files to see it.

Is it possible to just return None?

@gipert
Member Author

gipert commented Oct 12, 2023

In my test I randomly get True or False in is_valid_0vbb for SiPM data. I can work on an MWE tomorrow.

@gracesong312 do you pre-allocate empty columns for the output table? This could explain why the values are unpredictable (because no actual value is ever written to the pre-allocated memory).

Pandas uses NumPy internally, so a boolean column cannot contain non-booleans. I would not force that column to be float just to be able to use NaNs; I think we should use a different data structure. Discussion for next week!
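As one possible candidate for such a data structure (not something decided in this thread), pandas offers an opt-in nullable "boolean" extension dtype that keeps a validity mask alongside the values and represents missing entries as pd.NA. The sketch below uses a hypothetical two-row table mirroring the example above; whether this fits what DataLoader.load() returns is an open question.

```python
import pandas as pd

# Hypothetical table: one geds hit with a real flag, and one spms hit for
# which is_valid_0vbb is undefined.
df = pd.DataFrame(
    {
        "name": ["V07302A", "S060"],
        "is_valid_0vbb": pd.array([True, pd.NA], dtype="boolean"),
    }
)

# Missing entries are explicit, so they can be counted or filled on purpose
# instead of showing up as arbitrary True/False values.
n_missing = int(df["is_valid_0vbb"].isna().sum())
valid_only = df[df["is_valid_0vbb"].fillna(False)]

print(df.dtypes["is_valid_0vbb"], n_missing, valid_only["name"].tolist())
```

The column stays boolean rather than float, at the cost of downstream code having to handle pd.NA.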

@gracesong312
Collaborator

Yes, I allocate memory with np.empty, which explains the random values:

elif isinstance(tier_table[col], Array):
    # Allocate memory for column for all channels
    if col not in col_dict.keys():
        col_dict[col] = np.empty(
            table_length,
            dtype=tier_table[col].dtype,
        )
    col_dict[col][tcm_idx] = tier_table[col].nda
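A minimal sketch of one way to avoid exposing uninitialized memory, assuming a NumPy masked array is acceptable as the column type (this is an illustration, not the fix adopted in the project). The names table_length and tcm_idx are borrowed from the snippet above, but their values here are hypothetical, and values stands in for tier_table[col].nda.

```python
import numpy as np

table_length = 4                  # hypothetical number of output rows
tcm_idx = np.array([0, 2])        # hypothetical rows this channel fills
values = np.array([True, False])  # stand-in for tier_table[col].nda

# np.ma.masked_all allocates like np.empty but marks every entry as masked.
# Assigning real values clears the mask only at the written positions.
column = np.ma.masked_all(table_length, dtype=values.dtype)
column[tcm_idx] = values

# Rows never written stay masked and can be filled with an explicit default.
unwritten = column.mask
with_default = column.filled(False)

print(unwritten, with_default)
```

Rows that no channel writes to remain masked, so a consumer can tell "not provided" apart from a genuine False.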

@jasondet
Collaborator

jasondet commented Oct 13, 2023 via email


3 participants