-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stats calc tool #2628
base: main
Are you sure you want to change the base?
Stats calc tool #2628
Conversation
Since we should also support the processing of MCs, we might want to run the stats calc tool over multiple tels.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
pyproject.toml
Outdated
@@ -99,6 +99,7 @@ ctapipe-process = "ctapipe.tools.process:main" | |||
ctapipe-merge = "ctapipe.tools.merge:main" | |||
ctapipe-fileinfo = "ctapipe.tools.fileinfo:main" | |||
ctapipe-quickstart = "ctapipe.tools.quickstart:main" | |||
ctapipe-stats-calculation = "ctapipe.tools.stats_calculation:main" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer a verb here, like the other tools. E.g. ctapipe-calculate-pixel-statistics
@@ -0,0 +1,37 @@ | |||
StatisticsCalculatorTool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't use yaml
for anything apart from the configurations, I suggest to rename the configuration file to just tool_name.yaml
, i.e. stripping _config
.
), | ||
).tag(config=True) | ||
|
||
dl1a_column_name = CaselessStrEnum( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is DL1a/b is "official"? Also, I'd perhaps use generic input_column_name
similar to the output one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, in ctapipe we use DL1_IMAGES
and DL1_PARAMETERS
to distinguish between things that are per-pixel vs. single quantities per event.
https://ctapipe.readthedocs.io/en/latest/api/ctapipe.io.DataLevel.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also not make this an enum. In the generic tool, users should be able to chose any column that has compatible shape. Just provide a clear error when the column is not found in the input file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also be a list of columns, to compute on multiple at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed the column name and also polish the references to DL1a data by using pixel-wise image data which is more descriptive. ToolConfigurationError
is raised once the column is not found. Having list of columns seems a little bit of an overkill here, which would just make the code more complex. Maybe the aggregation config could be shared between the columns, but especially the outlier detection will be different between the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToolConfigurationError is raised once the column is not found. Having list of columns seems a little bit of an overkill here, which would just make the code more complex. Maybe the aggregation config could be shared between the columns, but especially the outlier detection will be different between the columns.
I think the case where you only want to know about a single column is quite rare, you are usually interested in multiple. So having to read all data again to compute metrics on a new column seems very limiting and a loop over columns shouldn't make the code much more complex.
dl1a_column_name: "image" | ||
output_column_name: "statistics" | ||
|
||
PixelStatisticsCalculator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to make this example a bit more complex and apply to the telescopes of a different kind (i.e. with different parameters of the PixelStatisticsCalculator)
"No faulty chunks found for telescope 'tel_id=%d'. Skipping second pass.", | ||
tel_id, | ||
) | ||
# Write the aggregated statistics and their outlier mask to the output file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think logically output writing shall be in the finish
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That requires keeping all intermediate results for each telescope in memory until finish
.
I'd write here, so only the data of a single telescope is in RAM at any given time.
stats_aggregator_type: [["type", "*", "SigmaClippingAggregator"]] | ||
chunk_shift: 1000 | ||
faulty_pixels_fraction: 0.1 | ||
outlier_detector_list: [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use the json syntax here? I think using the yaml syntax is much more readable:
outlier_detector_list:
- name: ...
apply_to: median
config:
...
- name: ...
apply_to: median
config:
...
# Get the telescope ids from the input data or use the allowed_tels configuration | ||
tel_ids = subarray.tel_ids if self.allowed_tels is None else self.allowed_tels | ||
# Read the whole dl1 images | ||
self.dl1_tables = input_data.read_telescope_events_by_id( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why load all data for all telescopes into memory when you then loop over them one by one?
rename the tool and file name only keep dl1 table of the particular telescope into RAM added tests for tool config errors rename input col name adopt yaml syntax in example config for stats calculation
This comment has been minimized.
This comment has been minimized.
Analysis Details0 IssuesCoverage and DuplicationsProject ID: cta-observatory_ctapipe_AY52EYhuvuGcMFidNyUs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently fails to read prod3 files (which have no EFFECTIVE focal length information). The tool current fails with a focal_length_choice
exception, however it seems there is no way to set the focal length choioce since the TableLoader
is not set up to be configrable.
parent=self, subarray=subarray | ||
) | ||
# Read the input data with the 'TableLoader' | ||
self.input_data = TableLoader(input_url=self.input_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be configurable, so you need to pass parent=self
in the constructor. You should not expicitly set input_url
here, but rather just make an alias for TableLoader.input_url
. That will allow other options to be passed, like focal_length_choice
when needed
exists=True, | ||
directory_ok=False, | ||
file_ok=True, | ||
).tag(config=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should remove this option, and instead use the parameter provided by TableLoader.
overwrite = Bool(help="Overwrite output file if it exists").tag(config=True) | ||
|
||
aliases = { | ||
("i", "input_url"): "StatisticsCalculatorTool.input_url", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
("i", "input_url"): "StatisticsCalculatorTool.input_url", | |
("i", "input_url"): "TableLoader.input_url", |
), | ||
} | ||
|
||
classes = classes_with_traits(PixelStatisticsCalculator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to add TableLoader
to this list, i.e.
classes = [TableLoader,] + classes_with_traits(PixelStatisticsCalculator)
|
||
def setup(self): | ||
# Check that the input and output files are not the same | ||
if self.input_url == self.output_path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will also need to change all instances of self.input_url
to be self.input_data.input_url
prod 5 files should have effective focal length |
# Iterate over the telescope ids and calculate the statistics | ||
for tel_id in self.tel_ids: | ||
# Read the whole dl1 images for one particular telescope | ||
dl1_table = self.input_data.read_telescope_events_by_id( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get a crash here:
% ctapipe-calculate-pixel-statistics -i events.dl1.h5 -o stats.h5
dl1_table = self.input_data.read_telescope_events_by_id(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/io/tableloader.py", line 1089, in read_telescope_events_by_id
tel_ids = self.subarray.get_tel_ids(telescopes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/instrument/subarray.py", line 549, in get_tel_ids
for telescope in telescopes:
TypeError: 'numpy.int16' object is not iterable
Seems to be due to passing an integer instead of a list, which is what is required by read_telescope_events_by_id
dl1_table = self.input_data.read_telescope_events_by_id( | |
dl1_table = self.input_data.read_telescope_events_by_id( | |
telescopes = [tel_id,] |
How does this work in the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more general comment: with very minor changes, this could be turned into ctapipe-calculate-stats
, i.e. the ability to compute stats for any column, not just pixel-wise ones.
- Expose TableLoader as a configurable component (needed anyhow, see above)
- minor modifications to drop assumption on data shape in
calculator.py
.
I would expect e.g. to be able to do:
ctapipe-calculate-pixel-statistics -i events-prod5.DL1.h5
--StatisticsAggregator.chunk_size=100
--StatisticsCalculatorTool.input_column_name hillas_length
-o length.h5
and get the stats on the length parameter. This is perhaps outside the scope of this PR, but should be kept in mind. It also relates to @maxnoe's comment that we could change the API to accept a mapping of columns to Aggragators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A common error is to have too small a chunk size, but this now results in a very ugly error and a full trace-back and exception, along with an UnclosedFileWarning
(bug?)
- The former (Unexpected exception) should e caught and raises as a
ToolConfigurationError
, so the user gets a nice message. And please explain in the message what parameters controls this, i.e. sayChange --StatisticsAggregator.chunk_size to decrease this
. - The latter (unclosed file) seems to be a bug to fix.
2024-11-06 15:14:43,361 ERROR [ctapipe.StatisticsCalculatorTool] (tool.run): Caught unexpected exception: The length of the provided table (853) is insufficient to meet the required statistics for a single chunk of size (2500).
2024-11-06 15:14:43,361 ERROR [ctapipe.StatisticsCalculatorTool] (tool.run): Caught unexpected exception: The length of the provided table (853) is insufficient to meet the required statistics for a single chunk of size (2500).
Traceback (most recent call last):
File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/core/tool.py", line 431, in run
self.start()
File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/tools/calculate_pixel_stats.py", line 134, in start
aggregated_stats = self.stats_calculator.first_pass(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/monitoring/calculator.py", line 169, in first_pass
aggregated_stats = aggregator(
^^^^^^^^^^^
File "/Users/kkosack/Projects/CTA/Working/ctapipe/src/ctapipe/monitoring/aggregator.py", line 86, in __call__
raise ValueError(
ValueError: The length of the provided table (853) is insufficient to meet the required statistics for a single chunk of size (2500).
2024-11-06 15:14:43,377 INFO [ctapipe.StatisticsCalculatorTool] (tool.write_provenance): Output:
/Users/kkosack/miniconda3/envs/ctapipe-0.21/lib/python3.12/site-packages/tables/file.py:113: UnclosedFileWarning: Closing remaining open file: /Users/kkosack/Projects/CTA/PipeWork/v0.21.3/events-prod5.DL1.h5
warnings.warn(UnclosedFileWarning(msg))
This PR adds a generic stats-calculation tool utilizing the
PixelStatisticsCalculator
.Related #2542