
feat: async support #44

Draft
wants to merge 2 commits into master

Conversation

@dignifiedquire commented Jan 10, 2023

Took a first stab at async support, just encoding for now.

A big issue I am running into right now is that I want to use the Encoder to write to a network stream without much buffering, but unfortunately that requires Read + Seek, neither of which is available on an outgoing stream connection.

Any thoughts on how/how hard it would be to build an encoder that does not require reading back any content from the underlying target?

Ref #43

@dignifiedquire dignifiedquire changed the title feat: first pass at async encoding feat: async support Jan 10, 2023
@oconnor663 (Owner) commented Jan 10, 2023

Apologies, I probably won't have time to read through all the code today.

> Any thoughts on how/how hard it would be to build an encoder that does not require reading back any content from the underlying target?

The fundamental challenge here is that all the nodes in the Bao stream are laid out in pre-order. So the very first thing that needs to hit the wire is effectively the root hash of the entire input, and that means you need to process the entire input before even a single byte of encoded output is ready. (Well, the first 8 bytes are just the length, so it's actually the ninth byte that requires the entire input.)
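To make that constraint concrete, here is a toy sketch in plain Rust. It uses std's `DefaultHasher` rather than BLAKE3, and `preorder_encode` is a made-up illustration, not the bao crate's API: in a pre-order layout the root node is emitted before any leaf, but its value depends on every leaf, so the encoder must consume the whole input before it can write the first tree byte.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for a chunk/parent hash; real Bao uses BLAKE3.
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Hypothetical pre-order encoder over a flat list of chunks:
// the output is [root, leaf_0, leaf_1, ...]. The root is computed
// from every leaf hash, so nothing can be emitted until the whole
// input has been processed.
fn preorder_encode(chunks: &[&[u8]]) -> Vec<u64> {
    let leaves: Vec<u64> = chunks.iter().map(|c| toy_hash(c)).collect();
    let mut root_input = Vec::new();
    for leaf in &leaves {
        root_input.extend_from_slice(&leaf.to_le_bytes());
    }
    let root = toy_hash(&root_input);
    // Pre-order: root first, then the leaves.
    let mut out = vec![root];
    out.extend(leaves);
    out
}
```

Changing even the last input chunk changes the first emitted node, which is exactly what rules out a low-buffer, no-seek streaming encoder for this layout.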

The intended way to work around this is bao slice and bao decode-slice. The assumption there is that you run the entire bao encode locally and save the result on disk. If you save the "combined" encoded file, you can just serve that verbatim. If you save the "outboard" encoded file, to avoid the space overhead of an entire copy of the original input, then you can use bao slice together with the original and the outboard encoding to reproduce the "combined" encoding. That does work in a streaming-friendly way, and it doesn't require seek support in the output.

Currently Bao always saves the entire hash tree, which is ~6% the size of the original input. (So a "combined" encoding is 106% of the input size, and an "outboard" encoding is 6%.) But a big TODO for me is to support configurable "chunk group" sizes, where Bao would omit the lower levels of the tree and recompute those on the fly as needed. Then you could tune the size of the outboard tree, saving space on disk in exchange for needing to buffer more input during decoding and seeking.
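As a back-of-the-envelope check on those numbers (a sketch assuming BLAKE3's documented geometry of 1024-byte chunks and 64-byte parent nodes; `outboard_overhead` is a hypothetical helper, not a bao API):

```rust
// Each 1024-byte chunk is covered by roughly one 64-byte parent node
// (two 32-byte child hashes), so the full tree costs about 64/1024 of
// the input size. With chunk groups of `group_chunks` chunks, the
// lowest tree levels are omitted and roughly one parent node survives
// per group.
fn outboard_overhead(group_chunks: u64) -> f64 {
    const CHUNK_SIZE: f64 = 1024.0;
    const PARENT_SIZE: f64 = 64.0;
    PARENT_SIZE / (CHUNK_SIZE * group_chunks as f64)
}
```

With a group size of one chunk this gives 64/1024 = 6.25%, matching the "~6%" above; a group of 16 chunks would shrink the outboard file to under 0.4%, at the cost of rehashing a whole group to verify any chunk in it.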

@dvc94ch (Contributor) commented Jan 10, 2023

So the pre-order/post-order thing is an artifact of doing a two-pass encoding of the bao format using two buffers.

For my purposes I wrote a hasher that produces a Vec<Insertion> and writes the contents atomically into a KV store, so that pre-/post-order traversal can be performed efficiently. The main reason for not using the bao crate was that decoding all slices of the file should be able to reproduce the hash tree without having to rehash the data. This allows peers to provide the slices they have without having to first sync the entire file and rehash its contents.

See https://github.com/dvc94ch/blake-tree/blob/master/core/src/hasher.rs and https://github.com/dvc94ch/blake-tree/blob/master/core/src/tree.rs for the details

@dvc94ch (Contributor) commented Jan 10, 2023

But with regard to this PR, I have two questions:

  1. Does making it async provide a measurable difference in any benchmark?
  2. Is this work intended to be used in iroh? That would be an exciting break from existing IPFS orthodoxy.

@dignifiedquire (Author) commented

> 1. Does making it async provide a measurable difference in any benchmark?

It's less about specific perf improvements and more about easier integration into existing pipelines/code.

> 2. Is this work intended to be used in iroh?

Eventually, yes, that is the goal, or at least something like this.
