-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trying to parse the root file using rust and asking some questions. #1261
Comments
This is an interesting, though large, undertaking. We talked about "compiled Uproot" options at last year's PyHEP.dev: HSF/PyHEP.dev-workshops#15. In 2018, I knew that @cbourjau was working on a ROOT file reader in Rust, under alice-rs/root-io, and it looks like that work was still active last year. If you want to accelerate hotspots in Uproot with Rust, it will take some thinking to figure out where we can insert calls into compiled code, get meaningful results back to Python in a format Python can use, and have the whole process actually be faster. I'm not against this, but we'd have to be judicious about it or we might not have any bottom-line speed-up to show for it. One place where a similar speed-up was made was in complex TTree interpretation. TTrees with simple data types are just cast as NumPy or Awkward arrays (doesn't even need a traversal over the data, compiled or otherwise), but complex types used to be auto-generated Python code and are now auto-generated AwkwardForth code. AwkwardForth is another interpreted language (to allow for runtime code generation), but a faster one than Python (2102.13516 and 2303.02202). All of the past focus on performance in Uproot addressed the bulk TTree contents, the part of the problem that scales as
I can see how a complete rewrite of Uproot in Rust would work, using PyO3 to expose the same Python interface. That kind of project would be in the style of flake8 → ruff, pip → uv, etc., and it would be a good idea if it were not such a massive undertaking (as well as making it more difficult to add small quality-of-life features: Rust code is harder to change than Python code). It's harder to see how hotspot-fixes would work. Conda + libmamba is an example of that, in which only one part of conda, the expensive dependency resolution, was replaced. (Micromamba and pixi are different stories.) But there doesn't seem to be a good factorization point here: Python-Uproot will need Python classes to operate. Even if they can be interpreted much more quickly in Rust, they still have to be exposed as Python data to function. Here's an idea for an isolated project that could be useful: the uproot.num_entries function does the absolute minimum needed to open a ROOT file containing a TTree and then report the number of entries in that TTree. It's something that we often need to do for large datasets, so it needs a fast-path. It's not tightly coupled to the rest of Uproot; it could be replaced by a Rust function without upsetting much. The implementation is here: uproot5/src/uproot/models/TTree.py Lines 944 to 973 in 90039ea
What do you think of using this as a starter project? You'll need to learn how to interpret TFile headers, TDirectory data, TKeys, and only a little TTree metadata (by examining Uproot, UnROOT.jl, groot (Go), and asking questions). |
@jpivarski I want to convert the parsed data into pola-rs format because I want to process and filter the data with the help of pola-rs functionality. Because sometimes the root file is very big and not can't be parsed at once, it can be processed with the help of rust's iterative parsing, this is my primary idea, what do you think, any good suggestions? |
alice-rs is build using |
@cbourjau Okay, thank you for your help. |
I've looked into this a bit after finding @cbourjau 's work on alice-rs. There is another project called oxyroot by @m-dupont which I have used with varying levels of success (but it usually works fine). The major difficulty is of course the streamers - these are very difficult to implement in the same way uproot or unroot.jl does because Rust is not dynamically typed (among other reasons). It's also not that easy to write libraries like this because, as far as I can tell, the actual format of a ROOT file is poorly documented. Even the docs (click the dropdown near the top that says "ROOT file data format specification") doesn't really tell you much and in fact includes some errors (there is an undocumented two-byte gap between |
We should also mention hep/groot (implementation in Go). The code quality of this project is excellent, and it's where I learned enough about the ROOT format to write Uproot. It's true that expressing data with TStreamers is inherently dynamic. The question of static types versus dynamic types is fundamental; it can't be brushed away with some tool. You either know enough about the data types at compile time to generate specialized code or you don't know until runtime. Uproot and UnROOT effectively do things like JIT-compilation (although Uproot's AwkwardForth is more of a fast interpreter than true JIT-compilation), and many earlier projects involved a phase in which you point a code-generator at a file and made some code to compile. ROOT itself uses JIT-compilation, which is why ROOT I/O can't be packaged separately from LLVM. Without dynamic TStreamers, it wouldn't be possible to encode arbitrary data types. That's essential for some applications, though end-stage analysis workflows have been moving toward simple data types like NanoAOD and PHYSLITE. A ROOT I/O project has to define some boundaries. Perhaps it can focus on NanoAOD/PHYSLITE-like files exclusively—then it doesn't need arbitrary data types, just the versions of TTree, TBranch, etc. from the last few years. At the start of the Uproot project (2017), I compiled a set of ROOT versions up to 5 years old (so, starting in 2012):
and found that this only spanned a few versions of the TTree class:
So Uproot now hard-codes them, just to avoid the time needed to read TStreamers and generate code for the common cases. A Rust-based implementation could target these TTree (and TBranch, ...) versions and just raise an error when an unrecognized class version is encountered. The rate at which these things are added is not high. |
Hard coding them seems like the way to go, I agree. I believe alice-rs had the interesting approach of using a separate script to generate rust code using a ROOT file as input and some fancy macros, but it's a bit buggy and as mentioned not under active development anymore (but a great place to learn a bit!). I've been trying my hand at making my own parser, if it ever comes to fruition I'll let you all know about it haha |
It depends on the use case, but anything parsed with a dynamic streamer is essentially an object without any member functions. After all, if you were to know enough about the layout to implement member functions, you could just as well parse the data. Thus, I found for myself the streamers to be at best useful as hints for the layout, but ultimately, you need to know something about the layout to do anything useful with the parsed data anyhow. Also, if cannot recommend nom enough for writing binary parsing code in Rust! It was a joy to use! Alice-rs got a little of the rails. with the later commits, I must admit :D. I was trying some async stuff with the goal of running an analysis on Alice Open Data in the browser using WASM. But it made everything rather complicated... |
Mostly. There's also a case in which you know about the class but you don't know a particular version of that class. For instance, suppose a new TTree class version comes out tomorrow, and it adds one more data member, Uproot does this by splitting member functions and deserialization routines into "behaviors" and "models", which are essentially traits and concrete classes. I wonder if it would be possible to write dynamic deserialization routines in Rust, which fill a dynamic structure such as a |
In Rust you can do something like this with a |
The Models in Uproot are called "models" because they don't reproduce the OOP structure from C++ in Python, even though both are OOP languages. (It's because I didn't want the Python classes to have the same inheritance structure as the C++ classes. In earlier versions, it led to accidental inheritance of TH1 methods into TH2, for instance.) In Uproot, the C++ superclasses of a modeled C++ class are in a Python list. It's a list to allow for multiple inheritance—distant ancestors are nested in these lists. Thus, we have a whole tree of method-resolution-order. That technique would port to Rust easily, though it means dynamically navigating a list for each access, rather than immediate access. Since the Rust implementation is motivated by speed of metadata handling, all of these little traversals would add up. It's probably best to make the Rust implementation static, based on the few class-version combinations that one would find in NanoAOD/PHYSLITE files. Remember that an uproot.num_entries implementation would be a limited-scope first project that would bring immediate value. Optimizing the time to get the number of TTree entries from a file is useful for setting partition boundaries in a Dask job or computing a quantity that would be a denominator or normalizing factor in some calculation. The Speaking of which, for large-scale data processing, we've been finding that it would be better if Uproot were split into two parts: one to crawl through the metadata and get TBasket positions and another to read, decompress, and interpret TBaskets in bulk. Tiled-Uproot is intended to put a database in between the two steps, so that a collection of ROOT files can be indexed for faster processing later. The uproot.num_entries project described above is 90% of the way toward the first half of a split Uproot. The second half is not dependent on metadata handling (and @lgray has been exploring options to handle this half exclusively in GPUs). |
Okay, thank you very much for your suggestions. My original intention was to use Rust to parse root files, then encapsulate them into Python libraries, and combine them with Python's PyTorch for machine learning. That's why I had this idea.好的,非常感谢大家的建议,我的初衷是使用rust解析root文件,然后封装为python库,与python的pytorch结合,进行机器学习,因此有这样一个想法。 |
I found that python parsing the root file is a bit slow, I would like to try to use rust to parse it, and then give it back to python bindings, but I don't know how to start parsing the root file, what is the recommended information. (I'm not a deep root scholar)
The text was updated successfully, but these errors were encountered: