Automate multiprocessing option #971

clee1152 · 2023-07-26T20:35:55Z

Based on analysis with @ksneab7 and @taylorfturner, the goal is to automatically turn on the multiprocess option in StructuredProfiler when the number of rows exceeds 750,000 or the number of columns exceeds 20.
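The thresholding rule described above can be sketched as a small standalone predicate. This is only an illustration of the rule, not the code in this PR: the function name `should_use_multiprocessing` and its parameter names are hypothetical, and it takes plain row and column counts instead of a DataFrame so that it runs without pandas.

```python
def should_use_multiprocessing(
    num_rows: int,
    num_cols: int,
    row_threshold: int = 750_000,  # row cutoff quoted in the PR description
    col_threshold: int = 20,       # column cutoff quoted in the PR description
) -> bool:
    """Return True when either dimension strictly exceeds its threshold."""
    return num_rows > row_threshold or num_cols > col_threshold


# Hypothetical dataset shapes, chosen only to exercise each branch.
small = should_use_multiprocessing(num_rows=1_000, num_cols=5)
tall = should_use_multiprocessing(num_rows=750_001, num_cols=5)
wide = should_use_multiprocessing(num_rows=1_000, num_cols=21)
print(small, tall, wide)
```

Because the comparison is strict ("exceeds"), a dataset with exactly 750,000 rows and 20 columns stays single-process in this sketch.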


JGSweets · 2023-07-28T05:54:07Z

dataprofiler/profilers/profile_builder.py

+        # If options.multiprocess is enabled, auto-toggle multiprocessing
+        auto_multiprocess_toggle = None
+        if self.options.multiprocess.is_enabled:
+            auto_multiprocess_toggle = self._auto_multiprocess_toggle(data, 750000, 20)


Since the defaults are already set, there is no need to pass them in here as well.

JGSweets · 2023-07-28T06:02:48Z

dataprofiler/profilers/profile_builder.py

@@ -2779,6 +2779,34 @@ def _merge_null_replication_metrics(self, other: StructuredProfiler) -> dict:

        return merged_properties

+    def _auto_multiprocess_toggle(


Should this be a new function, or should the logic go into suggest_pool_size, which does something similar?

Also noting that this is effectively a staticmethod with respect to the class. If we keep the function, it should probably live in profiler_utils.

An issue I see with doing it in suggest_pool_size is that it thresholds on a different measure of dataset size (estimated data size in memory rather than number of rows). If we did it in there, we would have to change the logic that feeds it (example below).

        if self.options.multiprocess.is_enabled and auto_multiprocess_toggle:
            est_data_size = data[:50000].memory_usage(index=False, deep=True).sum()
            est_data_size = (est_data_size / min(50000, len(data))) * len(data)
            pool, pool_size = profiler_utils.generate_pool(
                max_pool_size=None, data_size=est_data_size, cols=len(data.columns)
            )
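The size estimate in the snippet above measures the memory of at most the first 50,000 rows and scales it linearly to the full row count. A minimal sketch of just that arithmetic follows. The helper name `extrapolate_data_size` and the `sample_cap` parameter are hypothetical, and the guard for empty data is an assumption added here; the quoted code does not include one.

```python
def extrapolate_data_size(
    sample_bytes: float,
    total_rows: int,
    sample_cap: int = 50_000,  # same cap as the literal 50000 in the quoted code
) -> float:
    """Scale the measured size of a leading sample up to the whole dataset."""
    if total_rows <= 0:
        # Assumption: treat empty data as size 0 instead of dividing by zero.
        return 0.0
    rows_sampled = min(sample_cap, total_rows)
    bytes_per_row = sample_bytes / rows_sampled
    return bytes_per_row * total_rows


# Hypothetical numbers: a 50,000-row sample measured at 4,000,000 bytes,
# taken from a dataset of 1,000,000 rows.
estimate = extrapolate_data_size(sample_bytes=4_000_000, total_rows=1_000_000)
print(estimate)
```

Exposing the cap as a parameter is one way to avoid repeating the literal 50000 in two places, though whether that is wanted is outside the scope of this PR.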


tyfarnan · 2023-07-28T14:04:40Z

dataprofiler/profilers/profile_builder.py

@@ -2869,7 +2874,7 @@ def tqdm(level: set[int]) -> Generator[int, None, None]:

        # Generate pool and estimate datasize
        pool = None
-        if self.options.multiprocess.is_enabled:
+        if self.options.multiprocess.is_enabled and auto_multiprocess_toggle:
            est_data_size = data[:50000].memory_usage(index=False, deep=True).sum()


Outside the scope of this PR, but I just noticed this and wondered whether the 50000 should be hardcoded?

taylorfturner · 2023-07-31T18:13:07Z

dataprofiler/profilers/profile_builder.py

+        # If options.multiprocess is enabled, auto-toggle multiprocessing
+        auto_multiprocess_toggle = None
+        if self.options.multiprocess.is_enabled:
+            auto_multiprocess_toggle = profiler_utils.auto_multiprocess_toggle(data)


nice -- keying on defaults

taylorfturner · 2023-07-31T18:13:13Z

dataprofiler/profilers/profile_builder.py

+        # If options.multiprocess is enabled, auto-toggle multiprocessing
+        auto_multiprocess_toggle = None
+        if self.options.multiprocess.is_enabled:
+            auto_multiprocess_toggle = self._auto_multiprocess_toggle(data, 750000, 20)


taylorfturner · 2023-07-31T18:15:10Z

dataprofiler/version.py

@@ -2,7 +2,7 @@

 MAJOR = 0
 MINOR = 10
-MICRO = 0
+MICRO = 1


hmm why is this changing? makes me think you need a rebase 🤔

I would recommend just reverting this

taylorfturner and others added 5 commits July 17, 2023 10:07

Hot Fix: .astype("bool") (#960)

6cb789a

* force as boolean * version bump

Fixed logic for suggested_pool_size.

54cb9f1

Reverted changes to suggest_pool_size

2e11bfc

Initial commit: mock and multiprocessor option logic in place (but no…

ff3891b

…t good code practice lol)

Initial commit: mock and multiprocessor option logic in place (but no…

b448c4f

…t good code practice lol)

clee1152 requested review from JGSweets, ksneab7, taylorfturner, micdavis and tyfarnan as code owners July 26, 2023 20:35

clee1152 added the Work In Progress label Jul 26, 2023

clee1152 added 5 commits July 27, 2023 15:39

Added automated toggling for multiprocessing if the option is on.

8d87fa2

Merge branch 'feature/multiprocess_option' into feature/multiprocesso…

5420389

…r-option

empty commit

20767a2

Merge branch 'feature/multiprocessor-option' of https://github.com/cl…

6465613

…ee1152/DataProfiler into feature/multiprocessor-option

Pass precommits

e5f406c

taylorfturner enabled auto-merge (squash) July 27, 2023 20:12

JGSweets reviewed Jul 28, 2023


Moved auto_multiprocess_toggle to profiler_utils.py, got rid of defau…

f9b5f96

…lt values.

tyfarnan reviewed Jul 28, 2023


tyfarnan self-requested a review July 28, 2023 14:05

tyfarnan previously approved these changes Jul 28, 2023


Sent test_auto_multiprocess_toggle to test_profiler_utils.

427a172

auto-merge was automatically disabled July 31, 2023 13:33
Head branch was pushed to by a user without write access

clee1152 dismissed tyfarnan’s stale review via 427a172 July 31, 2023 13:33

taylorfturner enabled auto-merge (squash) July 31, 2023 13:34

taylorfturner reviewed Jul 31, 2023


taylorfturner suggested changes Jul 31, 2023


Reverted rebase changes.

42f0db5

auto-merge was automatically disabled July 31, 2023 18:25
Head branch was pushed to by a user without write access

taylorfturner closed this Jul 31, 2023
