chore: make the msg hash time-sortable #3034
Comments
Q: will we use time+hash at the postgres level only?
I would suggest changing the hash algorithm for Waku overall (i.e. a spec change).
The original purpose is to enhance postgres performance. However, store clients might need to provide a list of hashes with that format in "queries by hash". (cc @richard-ramos)
Considering that Status has launched, IMO nwaku should support both message hash formats during a transition period, to ensure that users who have not updated to a newer Waku version will still be able to retrieve messages from the store node using the 'old' message hash; by the time the old message hash format is removed entirely from nwaku, most Status users will have already installed a newer version that uses the new format. How would this be solved? Would it mean that we'd have a …
Sounds like an interesting challenge, and I'm looking forward to seeing how it is implemented, to reproduce it in go-waku :)
I disagree; you'll have to explain, yet again, what the benefit is here. Messages cannot be identified more uniquely than by their hashes, and all messages already have timestamps.
You have a good point. We need to weigh the pros and cons carefully, and I'm happy to be convinced otherwise, as the impact is rather big.
The main benefit is that we'd have a key for each message that provides not only a globally unique index but also a global sorting order - this changes the abstraction of Waku Store from a key-value store to an ordered key-value store. I think Waku can benefit from such a data model, as we do indeed have a time dimension implicit in Waku routing (i.e. Waku is not just a random collection of keys and values, but a time series of such key-values). The fact that, for example, we use the message hash as a …
A symptom (not the cause) of the possibly deficient data model without natural key ordering is that we've identified the need to include the timestamp alongside the key for every store query on message hash (including cursors) to achieve reasonable performance. In fact, the alternative suggestion might be to change the database key from …
What if this design is wrong? In query results we could sort by hash and use hash as cursor OR sort by time and use a timestamp as cursor. It's a matter of what order we want to use.
Not true: we have total ordering with hashes and partial ordering with timestamps.
💯 Postgres-related, and has nothing to do with the Store protocol. The principled approach would be to change postgres. I can understand that using postgres gives us short-term benefits, BUT it already works OKish - benefit achieved, let's move on. IMO all the benchmarking and tuning is 100% wasted effort long term.
But I think you're highlighting exactly why the design of cursors/message hashes may be deficient: we want a cursor that is unique - only achievable with hash - and a cursor for which "before" and "after" means "happening before" and "happening after" - only achievable with timestamp. Getting both features only comes with a timestamp + hash combination.
Not sure I follow - we have total ordering with timestamp + hash and partial ordering with timestamp, or am I misinterpreting your point?
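To illustrate the two orderings with a toy example (my own sketch, not from the thread; the timestamps and hashes are made up):

```python
# Hypothetical illustration: timestamps alone give only a partial order
# (ties are possible), while (timestamp, hash) pairs compare totally.
msgs = [
    (1700000001, "c3ab"),
    (1700000001, "0f12"),  # same timestamp, different hash
    (1700000000, "9e77"),
]

# Sorting by timestamp alone leaves tied messages in input order,
# i.e. their relative order is arbitrary with respect to content.
by_time = sorted(msgs, key=lambda m: m[0])

# Sorting the (timestamp, hash) pairs breaks ties deterministically.
by_time_hash = sorted(msgs)

print(by_time_hash)
# -> [(1700000000, '9e77'), (1700000001, '0f12'), (1700000001, 'c3ab')]
```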
Of course, if we have already achieved "good enough" for message hash lookups, I would agree that the cost of making hashes sortable is too high.
I agree that this is only worth it if the benefits from this change are sufficiently high. @Ivansete-status and @richard-ramos may understand better how big a difference this would make, but as a principle I really like the idea of time-sortable hashes.
Yes! So the main issue here is how to maintain backward compatibility for some time. The question is: where do we need consensus between protocols on the message hash calculation? Please correct me if I'm wrong, but I have the idea that the message hash is computed locally by each node after receiving a message. Having a query protocol for it is an option. Other options I thought of are: …
Probably I'm missing lots of details and cases; happy to hear opinions and corrections :)
Why?
Which to me indicates that we need a cursor in the form of a timestamp, not a hash.
I thought your point was that we don't have ordering ATM - maybe I'm the one who didn't understand? I'm not against changing the Store protocol; I just don't understand why we care about postgres not being performant or suited to our use case.
After discussion in the Waku-PM call, we decided to re-evaluate the necessity of this change (due to the high impact of a breaking change) against the possible improvements. We're currently tackling the store performance issue from multiple angles (understanding nwaku API performance, selecting DB technology, query-timing analyses, etc.), including getting extra-team help from an expert. We should also expect a dramatic change in Store requirements when: …
The next step for this issue specifically is to confirm with the external DB expert what the potential performance gain could be, but to deprioritise it if it's not urgent and we have reached "good enough" with the current postgresql configuration.
Well, because timestamps are not unique, and a cursor is used as the traversal index over the decentralised store service. Not exactly related, but for a proper decentralised design, cursors should also, eventually, work across multiple store nodes: you should be able to ask for the same page of history from any store node in the network. Since multiple messages can have the same timestamp, you cannot traverse a page of history without a unique cursor. If, for example, you have 101 messages with the same timestamp (quite possible, given that timestamps should be coarse), reading that page with nothing but the timestamp as cursor, at 100 messages per page, will result in an infinite loop of reading the same page over and over again.
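The 101-message scenario can be sketched in a few lines (a hypothetical in-memory model, not nwaku code):

```python
# Hypothetical sketch: 101 messages share one coarse timestamp; page size is 100.
PAGE = 100
msgs = sorted((1_700_000_000, f"{i:02x}") for i in range(101))  # (timestamp, hash)

def page_by_timestamp(cursor_ts):
    # Resume inclusively at cursor_ts (an exclusive resume would silently
    # drop the remaining same-timestamp messages instead of looping).
    return [m for m in msgs if m[0] >= cursor_ts][:PAGE]

def page_by_ts_hash(cursor):
    # (timestamp, hash) pairs compare totally, so the cursor always advances.
    return [m for m in msgs if m > cursor][:PAGE]

p1 = page_by_timestamp(0)
p2 = page_by_timestamp(p1[-1][0])   # same 100 messages again: infinite loop
q1 = page_by_ts_hash((0, ""))
q2 = page_by_ts_hash(q1[-1])        # the remaining 1 message is reached
```

With the timestamp-only cursor the second read returns the exact same page; with the composite cursor the 101st message is reached on the second read.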
We don't have ordering of the keys without input from the timestamp field inside the value. In other words, we cannot traverse the keys, which seems necessary for the paging concept. Read differently, our key-value store is meant to be traversed, since it represents history. No matter what technology we use, the steps to reading a page seem roughly to be (assuming forward reads): …
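A minimal sketch of what forward page reads look like once the keys themselves sort by time (my own illustration; the key layout is invented):

```python
import bisect

# Toy ordered key-value store: keys are time-sortable hashes, so key order
# is history order and paging becomes a plain ordered scan over the keys.
store = {f"{t:012d}" + "ab" * 10: f"payload-{t}" for t in range(5)}
keys = sorted(store)  # an ordered store would maintain this order natively

def read_page(cursor_key, page_size):
    start = bisect.bisect_right(keys, cursor_key)       # 1. seek past cursor
    page_keys = keys[start:start + page_size]           # 2. scan forward
    next_cursor = page_keys[-1] if page_keys else None  # 3. last key = cursor
    return [(k, store[k]) for k in page_keys], next_cursor

page1, cursor = read_page("", 2)
page2, cursor = read_page(cursor, 2)
```

Note that no separate timestamp lookup is needed to order or resume the scan; the cursor is just a key.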
Note that this proposal is exactly to prevent us from changing the Store protocol by adding timestamp qualifiers to indices - cursors/message hashes should be self-sufficient indices as is. The question is whether the index concept as already "understood" by the Store protocol truly matches what's implemented. It should not primarily be a question of technology - it's the fact that fast lookup of a key doesn't seem enough, because you always need to follow that by a separate timestamp query to understand the ordering of the keys. This may be a fundamental data model issue.
This is a very good question and, indeed, switching technology may solve most or all of our data model issues. The question is whether the technology change is a workaround for a deficient data model or whether changing the data model is a workaround for a limited technology (i.e. postgresql). For now (see my previous comment) we'll likely deprioritise this change exactly to prevent unnecessary effort. 😃
Yes good points. We do need unique cursors, I understand.
My point was that we could page by hash and not by time. I guess it's not super useful in our case.
In my mind, an ideal Waku messages DB has all the message fields normalized and a minimum of 2 search indices: hash and time. For queries by hash it's easy: just use the hash index to find the data. For time queries and pagination, first use the time index to find the hash cursor for that page, then return the data plus cursor; repeat for the other pages. For content topics and pubsub topics, other indices can be created and joined to search the DB. I just don't understand how doing something so simple can be so inefficient.
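The hash-plus-time-index layout described above can be modelled in a few lines (a toy in-memory version, not the actual nwaku schema):

```python
from bisect import bisect_left, bisect_right

# Toy model of the two-index layout: a primary hash index plus a time index.
by_hash = {}     # hash -> message row
time_index = []  # sorted list of (timestamp, hash)

def insert(ts, h, payload):
    by_hash[h] = {"timestamp": ts, "hash": h, "payload": payload}
    time_index.insert(bisect_right(time_index, (ts, h)), (ts, h))

def query_by_hash(h):
    # Queries by hash use only the hash index.
    return by_hash.get(h)

def query_by_time(start, end, limit):
    # Use the time index to pick the page, then the hash index for the rows.
    lo = bisect_left(time_index, (start, ""))
    page = [th for th in time_index[lo:] if th[0] <= end][:limit]
    rows = [by_hash[h] for _, h in page]
    cursor = page[-1] if page else None  # (timestamp, hash) pair as cursor
    return rows, cursor

insert(10, "aa", "m1"); insert(10, "bb", "m2"); insert(11, "cc", "m3")
rows, cursor = query_by_time(10, 11, limit=2)
print(cursor)  # -> (10, 'bb')
```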
Background
This is inspired by @jm-clius 🙌
We are currently having performance issues in PostgreSQL. One of the reasons is that the queries have become somewhat complex, especially when dealing with message hashes.
We want to have time-sortable message hashes.
Details
The message hash is currently obtained from `nwaku/waku/waku_core/message/digest.nim`, line 38 (commit 148a48a).
We need to keep the same logic to calculate that random identifier, but add time information in the first 6 bytes of the `array[32, byte]`. For example, if we consider the following msg hash:
a07e281d525a8dc3541d58539b8c7e0136baabc64cedd44d9f861529c0d8b8e4
The new msg hash type should be formed so that its hexadecimal representation is:
2406122310288dc3541d58539b8c7e0136baabc64cedd44d9f861529c0d8b8e4
Then, in the hex form of the msg hash, we can see the timestamp `YYMMDDHHMMSS`. With that, we will be able to simplify the store queries a little.
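Judging from the example, the first 6 bytes appear to be the timestamp components encoded so that each byte's hex digits spell the decimal value (BCD-style). A sketch of that transformation (my interpretation of the example above, not code from nwaku; `make_time_sortable_hash` is a hypothetical name):

```python
from datetime import datetime, timezone

def make_time_sortable_hash(msg_hash: bytes, ts: datetime) -> bytes:
    """Overwrite the first 6 bytes of a 32-byte hash with YYMMDDHHMMSS,
    one decimal-coded byte per component, so the hex dump reads as the date."""
    assert len(msg_hash) == 32
    parts = [ts.year % 100, ts.month, ts.day, ts.hour, ts.minute, ts.second]
    # e.g. minute 10 -> byte 0x10, so hex printing shows the digits "10"
    prefix = bytes(int(f"{p:02d}", 16) for p in parts)
    return prefix + msg_hash[6:]

original = bytes.fromhex(
    "a07e281d525a8dc3541d58539b8c7e0136baabc64cedd44d9f861529c0d8b8e4")
ts = datetime(2024, 6, 12, 23, 10, 28, tzinfo=timezone.utc)
print(make_time_sortable_hash(original, ts).hex())
# -> 2406122310288dc3541d58539b8c7e0136baabc64cedd44d9f861529c0d8b8e4
```

Note that this decimal-coded prefix still sorts correctly as raw bytes within a century, since larger date components map to larger byte values.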