Explore solutions to memory and I/O bottlenecks #316
Comments
Closing as this is no longer an issue for me (feel free to re-open a new issue). Will be fixed in a new OCDS-specific tool.
Reopening as this is causing issues where CoVE is integrated into the OC4IDS DRT. See open-contracting/cove-oc4ids#74 and open-contracting/cove-oc4ids@686643f
I would be interested in looking into this. I don't think I'm likely to get chance as part of the current OC4IDS work, but it could be something for the next OCDS Dev sprint?
For OCDS, OCP's plan is to build an OCDS-specific flattening tool. Once that's ready, we can consider using it in the Data Review Tools. Rob mentioned ODS might spend some time independently improving its tools. If that yields a clear plan for reducing Flatten Tool's memory usage, OCP can also consider that for dev work.
Might not be the right place for this, but I wonder if the tree library might have some useful functions. eg: https://tree.readthedocs.io/en/latest/api.html#tree.flatten_with_path Seems to be designed with speed/performance in mind.
Interesting find! Our implementations of similar methods follow the same design as in that library, and ours might actually be faster because that library is designed to handle more inputs and therefore has more method calls e.g. The library does other interesting things that could be relevant to future tools.
* Use ijson * Use pyopenxl write_only mode * Store sheet lines in an embedded btree ZODB index #316
Re-closing my issue as this is resolved.
Based on user feedback, I spent some time identifying bottlenecks in Flatten Tool. At a high level, in order of priority, they are: memory, I/O and CPU. I proposed some solutions to these bottlenecks in that document.
The memory bottleneck is the most critical. A user working with a JSON file larger than about 1.5 GB (likely even less) will exhaust their memory (assuming a typical consumer machine with 8 GB of RAM) just by running json.load, never mind Flatten Tool's other memory usage. Running time degrades severely once RAM is full. A few hundred thousand OCDS releases with at most 100 fields per release add up to 1.5 GB. In other words, performance issues will be an increasingly common problem for OCDS users.
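To make the memory point concrete, here is a rough sketch rather than a benchmark of Flatten Tool itself. It parses a small synthetic package with json.loads (the in-memory counterpart of json.load) and uses the standard-library tracemalloc module to report peak memory next to the size of the JSON text. The field names, the sample size and the 50-bytes-per-field average used in the size estimate are all assumptions for illustration. With that average, 300,000 releases of 100 fields each come to 1.5 GB.

```python
import json
import tracemalloc


def estimate_bytes(releases, fields_per_release, bytes_per_field):
    """Back-of-envelope size of a JSON file, in bytes."""
    return releases * fields_per_release * bytes_per_field


def measure_load(text):
    """Parse text with json.loads; return (data, peak bytes allocated while parsing)."""
    tracemalloc.start()
    try:
        data = json.loads(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return data, peak


# Synthetic stand-in for an OCDS release package (field names are made up).
releases = [
    {f"field{i}": f"value-{n}-{i}" for i in range(100)}
    for n in range(1000)
]
text = json.dumps({"releases": releases})

data, peak = measure_load(text)
print(f"JSON text: {len(text):,} bytes")
print(f"Peak memory while parsing: {peak:,} bytes")

# Assumed average of 50 bytes per field (key, value and punctuation).
print(f"Estimated file size: {estimate_bytes(300_000, 100, 50) / 1e9} GB")
```

The exact figures depend on the Python version and on the shape of the data, so treat the output as an order-of-magnitude check only.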