Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automate multiprocessing option #971

Closed
wants to merge 13 commits into from
Closed

Automate multiprocessing option #971

wants to merge 13 commits into from

Conversation

clee1152
Copy link
Contributor

Based on analysis with @ksneab7 and @taylorfturner, the goal is to automate turning on multiprocess option in StructuredProfiler if the number of rows exceeds 750,000, or if the number of columns exceeds 20.

@clee1152 clee1152 added the Work In Progress Solution is being developed label Jul 26, 2023
@taylorfturner taylorfturner enabled auto-merge (squash) July 27, 2023 20:12
# If options.multiprocess is enabled, auto-toggle multiprocessing
auto_multiprocess_toggle = None
if self.options.multiprocess.is_enabled:
auto_multiprocess_toggle = self._auto_multiprocess_toggle(data, 750000, 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the default is set, no need to pass in as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -2779,6 +2779,34 @@ def _merge_null_replication_metrics(self, other: StructuredProfiler) -> dict:

return merged_properties

def _auto_multiprocess_toggle(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do a new function or instead do it in: suggest_pool_size which is similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also notating this is a staticmethod wrt the class. If we kept the func, maybe should be in profiler_utils

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An issue I see with doing in suggest_pool_size is that it uses a different metric to measure dataset sizes for thresholding (number of rows vs. estimated data size in memory). If we were to do it in there, we would have to change the logic for this (example below).

    if self.options.multiprocess.is_enabled and auto_multiprocess_toggle:
            est_data_size = data[:50000].memory_usage(index=False, deep=True).sum()
            est_data_size = (est_data_size / min(50000, len(data))) * len(data)
            pool, pool_size = profiler_utils.generate_pool(
                max_pool_size=None, data_size=est_data_size, cols=len(data.columns)
            )

@@ -2869,7 +2874,7 @@ def tqdm(level: set[int]) -> Generator[int, None, None]:

# Generate pool and estimate datasize
pool = None
if self.options.multiprocess.is_enabled:
if self.options.multiprocess.is_enabled and auto_multiprocess_toggle:
est_data_size = data[:50000].memory_usage(index=False, deep=True).sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outside scope of this PR, but just noticed and wondered if this 50000 should hardcoded?

@tyfarnan tyfarnan self-requested a review July 28, 2023 14:05
tyfarnan
tyfarnan previously approved these changes Jul 28, 2023
auto-merge was automatically disabled July 31, 2023 13:33

Head branch was pushed to by a user without write access

@taylorfturner taylorfturner enabled auto-merge (squash) July 31, 2023 13:34
# If options.multiprocess is enabled, auto-toggle multiprocessing
auto_multiprocess_toggle = None
if self.options.multiprocess.is_enabled:
auto_multiprocess_toggle = profiler_utils.auto_multiprocess_toggle(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice -- keying on defaults

# If options.multiprocess is enabled, auto-toggle multiprocessing
auto_multiprocess_toggle = None
if self.options.multiprocess.is_enabled:
auto_multiprocess_toggle = self._auto_multiprocess_toggle(data, 750000, 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -2,7 +2,7 @@

MAJOR = 0
MINOR = 10
MICRO = 0
MICRO = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why is this changing? makes me think you need a rebase 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend just reverting this

auto-merge was automatically disabled July 31, 2023 18:25

Head branch was pushed to by a user without write access

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Work In Progress Solution is being developed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants