Explore solutions to memory and I/O bottlenecks #316

Closed
jpmckinney opened this issue Jul 19, 2019 · 8 comments

@jpmckinney
Contributor

jpmckinney commented Jul 19, 2019

Based on user feedback, I spent some time identifying bottlenecks in Flatten Tool. At a high level, in order of priority, they are: memory, I/O and CPU. I proposed some solutions to these bottlenecks in that document.

The memory bottleneck is the most critical. A user working with a JSON file larger than about 1.5 GB (likely even less) will exhaust their memory (assuming an average consumer machine with 8 GB of RAM) just by running json.load, never mind Flatten Tool's other memory usage. Running time degrades severely once RAM is full.
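To get a feel for why the parsed data takes more room than the file itself, here is a minimal sketch that measures peak allocation during parsing with the standard library's tracemalloc. The sample data, the field names and the helper names (build_sample, peak_parse_memory) are made up for illustration, and the ratio it prints will vary with the Python version and the shape of the data.

```python
import json
import tracemalloc


def build_sample(n_releases: int) -> str:
    """Build a synthetic release package as JSON text (illustrative fields only)."""
    releases = [
        {
            "ocid": f"ocds-sample-{i:07d}",
            "id": str(i),
            "date": "2019-07-19T00:00:00Z",
            "value": {"amount": i * 1.5, "currency": "USD"},
        }
        for i in range(n_releases)
    ]
    return json.dumps({"releases": releases})


def peak_parse_memory(text: str) -> int:
    """Return the peak number of bytes allocated while json.loads parses text."""
    tracemalloc.start()
    data = json.loads(text)  # json.load(f) does the same after reading the whole file
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return peak


text = build_sample(20_000)
text_size = len(text.encode("utf-8"))
peak = peak_parse_memory(text)
print(f"JSON text: {text_size:,} bytes; peak while parsing: {peak:,} bytes")
print(f"ratio: {peak / text_size:.1f}x")
```

Each parsed dict, string and number carries per-object overhead, so the in-memory form is typically a multiple of the size on disk.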

A few hundred thousand OCDS releases with at most 100 fields per release add up to 1.5 GB. In other words, performance issues will be an increasingly common problem for OCDS users.
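As a rough sanity check on that estimate, the arithmetic below works out the implied size per field. The figure of 300,000 releases is an assumption standing in for "a few hundred thousand", and 100 fields per release is the upper bound given above.

```python
releases = 300_000          # assumption: one reading of "a few hundred thousand"
fields_per_release = 100    # upper bound stated above
total_bytes = 1.5e9         # 1.5 GB

# Average bytes per field (key, value and JSON punctuation) implied by the estimate
bytes_per_field = total_bytes / (releases * fields_per_release)
print(bytes_per_field)  # 50.0
```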

@duncandewhurst
Contributor

@jpmckinney
Contributor Author

Closing, as this is no longer an issue for me (feel free to reopen it or open a new issue). This will be fixed in a new OCDS-specific tool.

@duncandewhurst
Contributor

Reopening as this is causing issues where CoVE is integrated into the OC4IDS DRT.

See open-contracting/cove-oc4ids#74 and open-contracting/cove-oc4ids@686643f

@duncandewhurst duncandewhurst reopened this Oct 7, 2020
@Bjwebb
Member

Bjwebb commented Oct 8, 2020

I would be interested in looking into this. I don't think I'm likely to get the chance as part of the current OC4IDS work, but it could be something for the next OCDS Dev sprint?

@jpmckinney
Contributor Author

jpmckinney commented Oct 8, 2020

For OCDS, OCP's plan is to build an OCDS-specific flattening tool. Once that's ready, we can consider using it in the Data Review Tools.

Rob mentioned ODS might spend some time independently improving its tools. If that yields a clear plan for reducing Flatten Tool's memory usage, OCP can also consider that for dev work.

@drkane

drkane commented Jan 26, 2021

This might not be the right place for this, but I wonder if the tree library might have some useful functions, e.g. https://tree.readthedocs.io/en/latest/api.html#tree.flatten_with_path

Seems to be designed with speed/performance in mind.

@jpmckinney
Contributor Author

Interesting find! Our implementations of similar methods follow the same design as that library's. Ours might actually be faster, because that library is designed to handle more kinds of input and therefore makes more method calls (e.g. _yield_sorted_items), whereas we perform the loop within the same method and don't sort unless the use case requires it.
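For readers unfamiliar with the pattern, here is a minimal sketch of a path-yielding flatten that loops inline and keeps insertion order rather than sorting. It is not the code from either project, and the function name and sample data are illustrative only.

```python
from typing import Any, Iterator, Tuple

Path = Tuple[Any, ...]


def flatten_with_path(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, leaf) pairs, walking dicts and lists in their existing order."""
    if isinstance(node, dict):
        # Loop inline; no helper call and no sorting of keys
        for key, value in node.items():
            yield from flatten_with_path(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten_with_path(value, path + (index,))
    else:
        yield path, node


release = {
    "ocid": "ocds-sample-1",
    "tender": {"items": [{"id": "a"}, {"id": "b"}]},
}
rows = list(flatten_with_path(release))
for row in rows:
    print(row)
```

Because it is a generator, callers can consume one (path, value) pair at a time instead of building the full flattened structure in memory. Note that empty dicts and lists yield nothing in this sketch.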

The library does other interesting things that could be relevant to future tools.

kindly added a commit that referenced this issue Jan 28, 2021
* Use ijson
* Use openpyxl write_only mode
* Store sheet lines in an embedded btree ZODB index

#316
kindly added a commit that referenced this issue Mar 8, 2021
* Use ijson
* Use openpyxl write_only mode
* Store sheet lines in an embedded btree ZODB index

#316
kindly added a commit that referenced this issue Mar 8, 2021
kindly added a commit that referenced this issue Mar 9, 2021
@jpmckinney
Contributor Author

Re-closing my issue as this is resolved.

4 participants