Automate multiprocessing option #971
…t good code practice lol)
…ee1152/DataProfiler into feature/multiprocessor-option
# If options.multiprocess is enabled, auto-toggle multiprocessing
auto_multiprocess_toggle = None
if self.options.multiprocess.is_enabled:
    auto_multiprocess_toggle = self._auto_multiprocess_toggle(data, 750000, 20)
Since the defaults are already set, there is no need to pass them in as well.
+1
+1
fixed
@@ -2779,6 +2779,34 @@ def _merge_null_replication_metrics(self, other: StructuredProfiler) -> dict:

         return merged_properties

+    def _auto_multiprocess_toggle(
Should we add a new function, or instead do this in suggest_pool_size, which is similar?
Also noting that this is effectively a staticmethod with respect to the class. If we keep the function, maybe it should live in profiler_utils.
An issue I see with doing this in suggest_pool_size is that it uses a different metric to measure dataset size for thresholding (estimated data size in memory, rather than number of rows). If we were to do it in there, we would have to change the logic for this (example below).
if self.options.multiprocess.is_enabled and auto_multiprocess_toggle:
    est_data_size = data[:50000].memory_usage(index=False, deep=True).sum()
    est_data_size = (est_data_size / min(50000, len(data))) * len(data)
    pool, pool_size = profiler_utils.generate_pool(
        max_pool_size=None, data_size=est_data_size, cols=len(data.columns)
    )
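For readers following the thread: the estimate in the snippet above measures the memory of the first 50000 rows and scales it up to the full row count. Here is a minimal sketch of just that arithmetic. The helper name extrapolate_data_size and the sample_rows parameter are hypothetical and not part of DataProfiler. sample_bytes stands in for the value returned by data[:sample_rows].memory_usage(index=False, deep=True).sum(), and the zero-row guard is an added assumption.

```python
def extrapolate_data_size(
    sample_bytes: float, total_rows: int, sample_rows: int = 50000
) -> float:
    """Scale the memory measured on a leading sample up to the whole dataset.

    Hypothetical helper, not DataProfiler API. `sample_bytes` is assumed to be
    the memory footprint of the first `sample_rows` rows (or of all rows when
    the dataset is shorter than the sample).
    """
    if total_rows <= 0:
        # Assumption: an empty dataset has no size to estimate; this also
        # avoids dividing by zero below.
        return 0.0
    # Only min(sample_rows, total_rows) rows were actually measured.
    rows_measured = min(sample_rows, total_rows)
    bytes_per_row = sample_bytes / rows_measured
    return bytes_per_row * total_rows
```

This estimate is expressed in bytes, while the thresholds proposed in this PR are expressed in row and column counts, which is why the two checks cannot share one comparison without changing the logic.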
@@ -2869,7 +2874,7 @@ def tqdm(level: set[int]) -> Generator[int, None, None]:

         # Generate pool and estimate datasize
         pool = None
-        if self.options.multiprocess.is_enabled:
+        if self.options.multiprocess.is_enabled and auto_multiprocess_toggle:
             est_data_size = data[:50000].memory_usage(index=False, deep=True).sum()
Outside the scope of this PR, but I just noticed this and wondered whether 50000 should be hardcoded here?
Head branch was pushed to by a user without write access
# If options.multiprocess is enabled, auto-toggle multiprocessing
auto_multiprocess_toggle = None
if self.options.multiprocess.is_enabled:
    auto_multiprocess_toggle = profiler_utils.auto_multiprocess_toggle(data)
nice -- keying on defaults
dataprofiler/version.py
Outdated
@@ -2,7 +2,7 @@

 MAJOR = 0
 MINOR = 10
-MICRO = 0
+MICRO = 1
hmm why is this changing? makes me think you need a rebase 🤔
I would recommend just reverting this
Head branch was pushed to by a user without write access
Based on analysis with @ksneab7 and @taylorfturner, the goal is to automate turning on the multiprocess option in StructuredProfiler if the number of rows exceeds 750,000 or the number of columns exceeds 20.
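The rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function body and the parameter names num_rows_threshold and num_cols_threshold are guesses, and FrameLike is a hypothetical stand-in so the example runs without pandas. Only the function name and the 750,000 / 20 defaults come from the conversation.

```python
class FrameLike:
    """Hypothetical stand-in for a DataFrame: has len() and a .columns list."""

    def __init__(self, num_rows: int, num_cols: int) -> None:
        self._num_rows = num_rows
        self.columns = list(range(num_cols))

    def __len__(self) -> int:
        return self._num_rows


def auto_multiprocess_toggle(
    data, num_rows_threshold: int = 750000, num_cols_threshold: int = 20
) -> bool:
    """Return True when the dataset looks large enough to warrant multiprocessing.

    Assumed logic: "exceeds" is read as a strict greater-than comparison, and
    either condition alone is enough to enable multiprocessing.
    """
    num_rows = len(data)
    num_cols = len(data.columns)
    return num_rows > num_rows_threshold or num_cols > num_cols_threshold
```

Because the function only needs len(data) and len(data.columns), a pandas DataFrame should work in place of FrameLike. Keeping the thresholds as defaulted parameters lets the call site pass only data, as suggested in the review.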