
Stream implementation #9

Open
LaurensRietveld opened this issue Mar 14, 2017 · 29 comments

Comments

@LaurensRietveld
Collaborator

Have you ever considered a stream implementation, instead of a callback? Any interest in that direction?

@RubenVerborgh
Member

RubenVerborgh commented Mar 14, 2017

Node streams for objects come with a terrible performance bottleneck (see for instance rdfjs/N3.js#6), but iteration by itself would make sense. However, there is a performance penalty associated with crossing the JS/C++ barrier, so the current system of limit/offset is probably as efficient as it gets.

@RubenVerborgh
Member

…but it would make sense to implement a JavaScript-land iterator interface on top of the limit/offset system, for instance such that triples are fetched in groups of 100.

@LaurensRietveld
Collaborator Author

Interesting discussion there; I haven't had such a horrible experience with streams myself. Things have probably changed quite a bit in Node.js land since that discussion. I reran your perf script on a file of mine, and these are the results:

The N3 parser perf script in master (uses N3.Parser):

$ node N3Parser-perf.js input.nt
- Parsing file /home/lrd900/triply/N3.js/perf/input.nt: 48275.659ms
* Triples parsed: 25650502
* Memory usage: 64MB

A script I quickly wrote that uses fs.createReadStream(filename).pipe(N3.StreamParser({documentIRI: base})):

$ node N3Parser-stream-perf.js input.nt
- Parsing file /home/lrd900/triply/N3.js/perf/input.nt: 54330.664ms
* Triples parsed: 25650502
* Memory usage: 63MB

Ran on node v7.7.2.
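
For reference, the stream script was roughly along these lines (a quick sketch, not the exact code I ran; the option names and memory reporting are approximate):

// Rough sketch of the stream-based perf test, not the exact script.
var fs = require('fs'),
    N3 = require('n3');

var filename = process.argv[2],
    base = 'file://' + filename,   // placeholder base IRI
    count = 0;

console.time('- Parsing file ' + filename);
fs.createReadStream(filename)
  .pipe(N3.StreamParser({ documentIRI: base }))
  .on('data', function () { count++; })
  .on('end', function () {
    console.timeEnd('- Parsing file ' + filename);
    console.log('* Triples parsed: ' + count);
    console.log('* Memory usage: ' +
      Math.round(process.memoryUsage().rss / 1024 / 1024) + 'MB');
  });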

PS: an iterator would be great to have as well, of course ;)

@RubenVerborgh
Member

Not bad, not bad at all… I wonder how much better RubenVerborgh/AsyncIterator performs on that.
Native iterators are to be tested as well, definitely.

In any case, a more than appropriate feature request.

@LaurensRietveld
Collaborator Author

What would you think about the implementation details, Ruben? I was thinking about:

  • Adding a pause and resume function to the C++ code that stops/resumes the iterator.
  • Sending each statement via a callback to Node.js
  • Gluing everything together in a stream and/or something like an async iterator on the JS side.

Any thoughts?

@RubenVerborgh
Member

RubenVerborgh commented Sep 23, 2017

Adding a pause and resume function to the C++ code that stops/resumes the iterator.
Sending each statement via a callback to Node.js

I wouldn't do that; you want to reduce the number of times you cross the C++/JavaScript barrier for performance reasons.

Glue everything together in either a stream and/or something like an async iterator on the JS side.

I would implement all of this in JS-land, with a stream/iterator that communicates with the C code in batches of N triples (with N for instance 100).
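
Concretely, I'm thinking of something in this direction (just a sketch; the hdtDocument.search(subject, predicate, object, options, callback) signature and the names below are assumptions for illustration, not the final API):

// Sketch: a Readable object stream that fetches triples in batches of N
// through the existing limit/offset mechanism. The hdtDocument.search
// callback signature is assumed here; the actual binding may differ.
var Readable = require('stream').Readable;

function createTripleStream(hdtDocument, subject, predicate, object, batchSize) {
  batchSize = batchSize || 100;
  var offset = 0, fetching = false, exhausted = false;
  var stream = new Readable({ objectMode: true });

  stream._read = function () {
    if (fetching || exhausted) return;
    fetching = true;
    hdtDocument.search(subject, predicate, object,
      { offset: offset, limit: batchSize },
      function (error, triples) {
        fetching = false;
        if (error) return stream.emit('error', error);
        offset += triples.length;
        // A batch smaller than batchSize means there are no further matches.
        if (triples.length < batchSize) exhausted = true;
        triples.forEach(function (triple) { stream.push(triple); });
        if (exhausted) stream.push(null);
      });
  };
  return stream;
}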

@LaurensRietveld
Collaborator Author

I've already implemented a stream wrapper as you describe, and noticed a performance penalty there as well (i.e., fetching 1,000 query results is cheaper than fetching 10×100 query results).
I guess we have to quantify the performance cost of both approaches; I'll see whether I can cook something up.

@RubenVerborgh
Member

RubenVerborgh commented Sep 24, 2017

We probably should have an optional bufferSize parameter with a sensible default. Maybe 1024 or so is more sensible than 100 (but for TPF I'd like to set it to 100 explicitly to avoid unneeded work).

Also note optimizations such as this one that minimize the number of JavaScript objects being allocated. If multiple triples in the same batch share the same subject (or predicate or object), then they will be the exact same in-memory string. We should carefully think about these as well.

@LaurensRietveld
Collaborator Author

A quick experiment where I 1) fetch 1 million query results in one go (i.e. 1 million results in a single callback), and 2) fetch 1 million results in 100 batches (i.e. 10,000 results per callback):

$ node test 1
fetching 1000000 query results in 1 loops
done: 1377.444ms

$ node test 100
fetching 1000000 query results in 100 loops
done: 5852.993ms

The difference in performance greatly depends on the query. A null, null, null query only has a small performance penalty when divided over multiple callbacks. The above result is from a null, <p>, null query.
There seems to be a performance penalty with an offset: querying a single triple (with a bound predicate in the query) is a lot (100×) slower with an offset of 1,000,000 compared to an offset of 1.

To me, it seems that implementing this in batches (all on the JS side) comes at a cost.
A middle ground would be to still implement the pause/resume functionality in the C++ code (which should avoid calculating the offset for each chunk), while still sending the results in bulk (indeed, something like 1024 sounds sensible). That way, we'd still be able to use your optimization as well to some extent.
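
For completeness, the test script above is roughly of this shape (a reconstructed sketch; the hdt module calls, query pattern, and file name are placeholders rather than the exact code):

// Sketch of the batching experiment: fetch TOTAL results either in one call
// or spread over `loops` consecutive offset/limit calls. The hdt.fromFile and
// hdtDocument.search signatures are assumed here for illustration.
var hdt = require('hdt');

var TOTAL = 1000000,
    loops = parseInt(process.argv[2], 10) || 1,
    limit = TOTAL / loops;

hdt.fromFile('input.hdt', function (error, hdtDocument) {
  if (error) throw error;
  console.log('fetching ' + TOTAL + ' query results in ' + loops + ' loops');
  console.time('done');
  var offset = 0;
  (function next() {
    if (offset >= TOTAL) return console.timeEnd('done');
    hdtDocument.search(null, 'http://example.org/p', null,
      { offset: offset, limit: limit },
      function (error, triples) {
        if (error) throw error;
        // triples would be processed here
        offset += limit;
        next();
      });
  })();
});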

@RubenVerborgh
Member

RubenVerborgh commented Sep 24, 2017

To me, it seems that implementing this in batches (all on the JS side) comes at a cost.

Yes, but everything depends on the access pattern! What if only the first 10 results are ever used?

For the scenario where we know that a) all triples will be read and b) they all fit in memory, my assumption would indeed be that fetching all is more performant.

Can you try with different buffer sizes to see the effect? I'm thinking that around 1024 is a sensible default. Note that we could incrementally adapt the buffer size etc.

@RubenVerborgh
Member

RubenVerborgh commented Sep 24, 2017

Also, the ?s <p> ?o pattern is a very exotic one; you won't see it in a lot of SPARQL queries (and if you do, it is unlikely to be the first to be resolved).

@LaurensRietveld
Collaborator Author

Yes, but everything depends on the access pattern! What if only the first 10 results are ever used?

Agreed, it depends on the access pattern. What is your use case for only fetching the first 10 results? If you know you're interested in 10, you can just use the callback mechanism and add a limit of 10, right? In a stream scenario you're often interested in all the results (but without the memory costs, of course).

For the scenario where we know that a) all triples will be read and b) they all fit in memory, my assumption would indeed be that fetching all is more performant.

It's b) that I'd like to fix here with a stream implementation. Maybe I wasn't clear, but I think we can get the best of both worlds (low memory usage and fast result streaming) by:

  • Adding a pause/resume function to the C++ code
  • Sending results to JS in batches of size bufferSize (optionally with a dynamic buffer/batch size)
  • Making a stream implementation in JS that pauses/resumes the iterator function in C++ to avoid backpressure.

Can you try with different buffer sizes to see the effect? I'm thinking that around 1024 is a sensible default. Note that we could incrementally adapt the buffer size etc.

That would only get worse, right? My example was 1 million triples divided over 100 chunks, i.e. a buffer size of 10,000. Lowering that even more would greatly degrade performance (it takes ~50 seconds). Or is there something lost in translation here? ;)

Also, the ?s <p> ?o pattern is a very exotic one.

True, with ?s ?p <o> we get a smaller effect, but still quite large:

$ node test 1
fetching 1000000 query results in 1 loops
done: 2124.442ms

$ node test 100
fetching 1000000 query results in 100 loops
done: 1401.727ms

@RubenVerborgh
Member

What is your use case for only fetching the first 10 results? If you know you're interested in 10, you can just use the callback mechanism and add a limit of 10, right? In a stream scenario you're often interested in all the results (but without the memory costs, of course).

Nah, a stream means that I can stop whenever I want. Maybe the 10th result has a characteristic that I'm interested in. Maybe my downstream stops pulling after 10 results. In general, a consumer does not always know in advance how many results to pull.

It's b) that I'd like to fix here with a stream implementation. Maybe I wasn't clear, but I think we can get the best of both worlds (low memory usage and fast result streaming) by:

Adding a pause/resume function to the C++ code

That means maintaining state though, and carefully cleaning up. You should especially be careful with keeping JavaScript objects alive (such as the strings I mentioned above as an optimization).

Making a stream implementation in JS that pauses/resumes the iterator function in C++ to avoid backpressure.

Indeed, avoiding backpressure is key here; but backpressure also exists when I'm slowly pulling 10 results and thousands are already in memory.

That would only get worse, right? My example was 1 million triples divided over 100 chunks, i.e. a buffer size of 10,000.

My bad, I misread. Serious performance penalty indeed. I'm surprised offsetting is not cheaper. Are we sure the problem is offsetting, BTW? What is the overhead of fetching the first 10,000 results 100 times versus fetching 100 consecutive pages of 10,000 results?

True, with ?s ?p <o> we get a smaller effect, but still quite large:

Wait, isn't that result better?

@LaurensRietveld
Collaborator Author

Nah, a stream means that I can stop whenever I want. Maybe the 10th result has a characteristic that I'm interested in. Maybe my downstream stops pulling after 10 results. In general, a consumer does not always know in advance how many results to pull.

Fair enough. Still, we can simply optimize the buffer size in that case. A buffer size of 1024 still sounds reasonable, as the performance difference between fetching 10 or 1024 results is negligible. Ideally, with a dynamic buffer this can be improved.

That means maintaining state though, and carefully cleaning up. You should especially be careful with keeping JavaScript objects alive (such as the strings I mentioned above as an optimization).

Good points

Indeed, avoiding backpressure is key here; but backpressure also exists when I'm slowly pulling 10 results and thousands are already in memory.

That's right, but I guess we have to find a middle ground between memory usage and performance here. And a buffer of 1024 still sounds reasonable, right?

My bad, I misread. Serious performance penalty indeed. I'm surprised offsetting is not cheaper. Are we sure the problem is offsetting, BTW? What is the overhead of fetching the first 10,000 results 100 times versus fetching 100 consecutive pages of 10,000 results?

Good point. The results are 1) 950ms for fetching the first 10,000 results 100 times, vs. 2) 5.7 seconds for fetching 100 consecutive pages of 10,000 results (using the <p> query here).

Wait, isn't that result better?

Ha! Completely misread, you're right. No clue why a chunked query would perform better than a single large query. You got any ideas?

@RubenVerborgh
Member

RubenVerborgh commented Sep 24, 2017

That's right, but I guess we have to find a middle ground between memory usage and performance here. And a buffer of 1024 still sounds reasonable, right?

+1

Good point. The results are 1) 950ms for fetching the first 10,000 results 100 times, vs. 2) 5.7 seconds for fetching 100 consecutive pages of 10,000 results (using the <p> query here).

I wonder if there's anything within hdt-cpp we can do for that. Intuitively, I'm surprised that offsetting is so expensive. The <p> query is a bad indicator though, as this is literally the very worst query for HDT.

No clue why a chunked query would perform better than a single large query. You got any ideas?

I'm not too surprised; I've seen similar results with the N3.js parser parsing a streaming file versus loading the entire file in memory first. There is a threshold at which streaming gets faster. I think the block of memory is simply becoming too large and inefficient. This is good evidence for streaming in blocks. We can even use this to find a good buffer size.

@LaurensRietveld
Collaborator Author

I wonder if there's anything within hdt-cpp we can do for that. Intuitively, I'm surprised that offsetting is so expensive. The <p> query is a bad indicator though, as this is literally the very worst query for HDT.

IMO that's a good reason to have a well-performing stream implementation, so we're not making a badly performing query even worse.
I'll ask Javier why offsetting this query is so expensive. Perhaps there is a way to avoid the pause/resume functionality and still avoid the cost of offsetting.

I'm not too surprised; I've seen similar results with the N3.js parser parsing a streaming file versus loading the entire file in memory first. There is a threshold at which streaming gets faster. I think the block of memory is simply becoming too large and inefficient. This is good evidence for streaming in blocks. We can even use this to find a good buffer size.

That's an interesting observation. Some quick experiments showed that a chunk size of 2000 is the pivot point (on Node v8.4.0).

@LaurensRietveld
Collaborator Author

An addition: the <p> query is slow with offset values because that pattern does not allow the skip operator to be used (see https://github.com/RubenVerborgh/HDT-Node/blob/master/lib/HdtDocument.cc#L155).

@RubenVerborgh
Member

RubenVerborgh commented Sep 24, 2017

So we might even need to treat the <p> case differently from the others then.

Regarding the pivot point, it would be good to set up an experiment to determine this, as it can indeed evolve between Node versions. In general, a performance suite would be good, also w.r.t. changes to hdt-cpp.

@LaurensRietveld
Collaborator Author

Oops, I tested this without updating the git submodule. The new hdt-cpp version does support offsetting the <p> query, i.e. performance is now similar to the <o> query.
I talked to Javier, and it seems the only query that does not support offsetting is an <s> ? <o> query, though that one comes with little overhead (and a commit to hdt-cpp is pending to at least support the goTo operator for that pattern as well).

So it seems there is no need to modify the C++ code for a streaming interface, and we can simply put all the streaming logic in the JS code.

@RubenVerborgh
Member

Perfect, happy to hear that!

My colleague @joachimvh is testing the performance of streams, iterators, and ES6 native iterators for another project; his findings will be useful to make a decision here as well.

@LaurensRietveld
Collaborator Author

@joachimvh I'm interested: did you manage to do an analysis of streams, iterators, and ES6 native iterators? Did you observe anything interesting?

@RubenVerborgh
Member

I remember @joachimvh working on this. What were our results? Can we submit them to the WWW dev track?

@joachimvh
Member

I only did some minor tests before getting distracted by something else, but first results seemed to indicate:

Fastest is the standard Node.js stream if you have a datatype that can easily be put into a Buffer, such as integers. With other datatypes, the problem is that if you still try to use a Buffer, you somehow have to put the data in there (with JSON.stringify or a custom serialization function, maybe), which drastically slows things down.

AsyncIterator and sync generator functions had pretty much the same performance: a bit slower than the Buffer stream implementation for integers, but their cost didn't increase for more complex objects.

A lot slower still were the Node.js object streams (i.e. objectMode = true), and even slower were the async generator functions (tested with both the Babel and TypeScript transpilers).

But again, these tests were not that extensive, which is why I didn't want to put exact numbers on them (but they might give an indication already).
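
To give an idea of the kind of micro-benchmark involved (a minimal sketch, not the actual test code), comparing a sync generator with an object-mode stream pushing the same plain objects looks roughly like this:

// Minimal sketch: sync generator vs. Node.js object-mode stream,
// both producing N plain objects.
const { Readable } = require('stream');
const N = 1000000;

// Sync generator: items are pulled in a plain for..of loop.
function* generate() {
  for (let i = 0; i < N; i++) yield { value: i };
}
console.time('sync generator');
let pulled = 0;
for (const item of generate()) pulled++;
console.timeEnd('sync generator');

// Object-mode stream: the same items pushed through the stream machinery.
let i = 0;
const objectStream = new Readable({
  objectMode: true,
  read() { this.push(i < N ? { value: i++ } : null); }
});
console.time('object stream');
let streamed = 0;
objectStream.on('data', () => streamed++);
objectStream.on('end', () => console.timeEnd('object stream'));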

@LaurensRietveld
Collaborator Author

Thanks. I recently read this blog post about async iterators that might explain the bad performance of your Node object streams: https://medium.com/netscape/async-iterators-these-promises-are-killing-my-performance-4767df03d85b
The conclusion, as I interpret it, is that the smaller the unit of data in an iterator or stream, the larger the overhead. I.e., for HDT-Node, we shouldn't implement an iterator or stream that sends each individual triple via the iterator or object stream. Instead, for better performance, we should simply return the whole 'window' of results (via either a stream or an iterator).

@RubenVerborgh
Member

@joachimvh That's good news for the generators. Any need still for https://www.npmjs.com/package/asynciterator then?

@joachimvh
Member

Well, they are still sync generators, so you lose the whole async advantage of asynciterator there. The moment you start having streams where you need to wait on data, this would be a problem (or at least complicate things).

@RubenVerborgh
Member

Ah yes, I misread, indeed. So we have (from fastest to slowest):

  1. Node.js Buffer streams
  2. sync generator / AsyncIterator
  3. Node.js object streams
  4. async generator

Let's try to finish your tests somehow so we can keep track of async generator performance as engines start to support it natively. But for the moment, I think we should stick with AsyncIterator then.
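
For concreteness, a rough sketch of what an AsyncIterator-based interface could look like (the hdtDocument.search(subject, predicate, object, options, callback) signature is again an assumption, and the class name is illustrative):

// Sketch: an AsyncIterator that pulls triples from an HDT document in batches
// through the limit/offset mechanism. The search signature is assumed, and
// bufferSize defaults to the value discussed above.
var BufferedIterator = require('asynciterator').BufferedIterator;

class HdtTripleIterator extends BufferedIterator {
  constructor(hdtDocument, subject, predicate, object, bufferSize) {
    super();
    this._hdtDocument = hdtDocument;
    this._pattern = [subject, predicate, object];
    this._bufferSize = bufferSize || 1024;
    this._offset = 0;
  }

  _read(count, done) {
    var self = this, pattern = this._pattern;
    this._hdtDocument.search(pattern[0], pattern[1], pattern[2],
      { offset: this._offset, limit: this._bufferSize },
      function (error, triples) {
        if (error) { self.emit('error', error); return done(); }
        triples.forEach(function (triple) { self._push(triple); });
        self._offset += triples.length;
        // Fewer triples than the buffer size means the matches are exhausted.
        if (triples.length < self._bufferSize) self.close();
        done();
      });
  }
}

// Usage: new HdtTripleIterator(doc, null, 'http://example.org/p', null)
//   .on('data', function (triple) { /* ... */ });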

@mielvds
Collaborator

mielvds commented Dec 5, 2018

Sorry for digging this up; I came across this issue by chance. What could be interesting is transferring (parts of) the dictionary to JS first (if that is possible). Then all data could be stored in a Buffer, and the buffer could be much larger.

@RubenVerborgh
Member

@mielvds That would be less data to transfer, but creation of objects in JS land might (or might not!) be more expensive.
