Explore solutions to memory and I/O bottlenecks #316

jpmckinney · 2019-07-19T22:42:59Z

Based on user feedback, I spent some time identifying bottlenecks in Flatten Tool. At a high level, in order of priority, they are: memory, I/O and CPU. I proposed some solutions to these bottlenecks in that document.

The memory bottleneck is the most critical. A user working with a JSON file larger than about 1.5 GB (likely even less) will exhaust their memory (assuming a typical consumer machine with 8 GB of RAM) just by running json.load, never mind Flatten Tool's other memory usage. Running time degrades severely once RAM is full.

A few hundred thousand OCDS releases with at most 100 fields per release add up to 1.5 GB. In other words, performance issues will become an increasingly common problem for OCDS users.
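A rough way to see why json.load is expensive is to compare the size of the JSON text with the memory that the parsed Python objects occupy. The sketch below is only an illustration on synthetic data: the helper names `build_sample` and `parsed_to_text_ratio` are hypothetical, the record shape is made up, and the ratio will vary with the Python version and the shape of the real data.

```python
import json
import tracemalloc


def build_sample(n_records, n_fields):
    """Build JSON text holding n_records objects of n_fields string fields each."""
    records = [
        {f"field{j}": f"value {i}-{j}" for j in range(n_fields)}
        for i in range(n_records)
    ]
    return json.dumps({"releases": records})


def parsed_to_text_ratio(text):
    """Return (peak bytes allocated while parsing) / (length of the JSON text)."""
    tracemalloc.start()
    data = json.loads(text)  # the same parser that json.load uses
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert "releases" in data  # keeps the parsed object alive until measured
    return peak / len(text)


# Synthetic sample only; real OCDS data will give a different ratio.
ratio = parsed_to_text_ratio(build_sample(n_records=2000, n_fields=20))
print(f"parsed objects use about {ratio:.1f}x the size of the JSON text")
```

If the ratio is well above 1 for real data too, a file far smaller than the available RAM can still fill it once parsed.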


duncandewhurst · 2019-08-01T16:25:41Z

Explore solutions to memory and I/O bottlenecks

jpmckinney · 2020-08-12T02:13:44Z

Closing as this is no longer an issue for me (feel free to reopen it or open a new issue). It will be fixed in a new OCDS-specific tool.

duncandewhurst · 2020-10-07T00:56:35Z

Reopening as this is causing issues where CoVE is integrated into the OC4IDS DRT.

See open-contracting/cove-oc4ids#74 and open-contracting/cove-oc4ids@686643f

Bjwebb · 2020-10-08T09:20:46Z

I would be interested in looking into this. I don't think I'm likely to get the chance as part of the current OC4IDS work, but it could be something for the next OCDS Dev sprint?

jpmckinney · 2020-10-08T15:59:16Z

For OCDS, OCP's plan is to build an OCDS-specific flattening tool. Once that's ready, we can consider using it in the Data Review Tools.

Rob mentioned ODS might spend some time independently improving its tools. If that yields a clear plan for reducing Flatten Tool's memory usage, OCP can also consider that for dev work.

drkane · 2021-01-26T12:33:39Z

This might not be the right place for this, but I wonder if the tree library might have some useful functions, e.g. https://tree.readthedocs.io/en/latest/api.html#tree.flatten_with_path

Seems to be designed with speed/performance in mind.

jpmckinney · 2021-01-26T17:45:22Z

Interesting find! Our implementations of similar methods follow the same design as that library's, and ours might actually be faster: that library is designed to handle more kinds of input and therefore makes more method calls (e.g. _yield_sorted_items), whereas we perform the loop within the same method and don't sort unless the use case requires it.

The library does other interesting things that could be relevant to future tools.
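The design described in the comment above (looping inside a single method and never sorting) can be sketched roughly as follows. This is not Flatten Tool's actual code; `flatten_with_paths` is a hypothetical name, and the sample release is made up for illustration.

```python
def flatten_with_paths(node, path=()):
    """Yield (path, value) pairs for every leaf of nested dicts and lists.

    Items are visited in their existing order; nothing is sorted.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten_with_paths(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten_with_paths(value, path + (index,))
    else:
        yield path, node


# Made-up sample data, loosely shaped like an OCDS release.
release = {"ocid": "ocds-1", "tender": {"items": [{"id": "a"}, {"id": "b"}]}}
for path, value in flatten_with_paths(release):
    print("/".join(str(part) for part in path), value)
```

Because it is a generator, the caller can consume one pair at a time instead of building the whole flattened structure in memory. Note that empty dicts and lists yield nothing in this sketch.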


jpmckinney · 2021-12-14T15:09:43Z

Re-closing my issue as this is resolved.

jpmckinney closed this as completed Aug 12, 2020

duncandewhurst reopened this Oct 7, 2020

kindly added a commit that referenced this issue Jan 28, 2021

Flattening: Reduce memory Footprint.

1cb7f93

* Use ijson
* Use openpyxl write_only mode
* Store sheet lines in an embedded btree ZODB index

#316
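The last item in the commit message above keeps sheet lines in an on-disk index rather than in RAM. The sketch below illustrates that general idea only. It does not reproduce the commit: it swaps in `sqlite3` from the standard library (which also stores its tables as B-trees) as a stand-in for ZODB, and the `DiskBackedRows` class, table name and sample rows are all hypothetical.

```python
import json
import os
import sqlite3
import tempfile


class DiskBackedRows:
    """Row store that keeps sheet rows in a SQLite file instead of in RAM."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sheet_rows "
            "(sheet TEXT, line INTEGER, cells TEXT, PRIMARY KEY (sheet, line))"
        )

    def append(self, sheet, line, cells):
        # Each row is serialised to JSON text and written to the database.
        self.conn.execute(
            "INSERT INTO sheet_rows VALUES (?, ?, ?)",
            (sheet, line, json.dumps(cells)),
        )

    def rows(self, sheet):
        # Iterating the cursor reads rows back one at a time, in line order.
        cursor = self.conn.execute(
            "SELECT cells FROM sheet_rows WHERE sheet = ? ORDER BY line", (sheet,)
        )
        for (cells,) in cursor:
            yield json.loads(cells)

    def close(self):
        self.conn.commit()
        self.conn.close()


with tempfile.TemporaryDirectory() as tmp:
    store = DiskBackedRows(os.path.join(tmp, "rows.sqlite"))
    store.append("main", 1, ["ocds-1", "tender"])  # rows may arrive out of order
    store.append("main", 0, ["ocid", "tag"])
    result = list(store.rows("main"))
    store.close()
```

The point of the design is that only the row currently being written or read needs to be in memory, at the cost of extra disk I/O.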

kindly mentioned this issue Jan 28, 2021

Flattening: Reduce Memory Footprint. #376

Merged

kindly added further commits that referenced this issue:

* 1b40acc, 4e03fc6, afa91ad, b12fe74 (Jan 28, 2021) and 4824df2 (Mar 8, 2021), each repeating the commit message of 1cb7f93, "Flattening: Reduce memory Footprint."
* 123d981 (Mar 8, 2021) and 31b9399 (Mar 9, 2021), each titled "Flattening: Add comments per review" and referencing #316

jpmckinney closed this as completed Dec 14, 2021
