
feat(data/hashmap): initial implementation #812

Closed
wants to merge 1 commit into from

Conversation

@spl (Collaborator) commented Mar 11, 2019

I would like to see this (or something like it) eventually replace the existing data/hash_map.lean. This is just the beginning: there are plenty more definitions and theorems to be added, but I think the foundation is good and ready for review.

Some thoughts:

  • I think the name hashmap fits the current naming scheme of the core Lean library and mathlib better than hash_map does.
  • I've included notes on alternative approaches to structuring hashmap that I explored. While it's not exactly documentation on the existing work, I think it does provide a useful perspective.
  • I changed list.nodupkeys_join to simplify proofs. With the existing theorem, there end up being several extra applications of pairwise_map, both in list.nodupkeys_join itself and in theorems that use it.

  • reviewed and applied the coding style: coding, naming
  • make sure definitions and lemmas are put in the right files
  • make sure definitions and lemmas are not redundant

@cipher1024 cipher1024 requested a review from digama0 March 11, 2019 20:14
@cipher1024 cipher1024 self-assigned this Mar 11, 2019
@cipher1024 (Collaborator)

Why does it promise to be better than the existing implementation?

@spl (Collaborator, Author) commented Mar 12, 2019

Why does it promise to be better than the existing implementation?

  • Simpler:
    • doesn't involve bucket_array, which I couldn't find a use for
    • doesn't have valid.len : bkts.as_list.length = sz, which doesn't seem necessary
    • doesn't have the hash_map.nbuckets : ℕ+: the pnat here isn't necessary
  • Better definitions and proofs (at least, I hope; see the sketch after this list):
    • builds on the definitions, theorems, and concepts of list.sigma and alist
    • uses hash: α → fin n instead of hash_fn: α → nat
    • doesn't have all of the “aux” proofs, since many things are now common with list.sigma/alist
    • unbundles the hash map size
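
To make this concrete, here is a minimal sketch of the kind of fixed-size structure being described; the names and fields are illustrative, not the PR's actual code:

```lean
universes u v

-- Illustrative sketch only (not the PR's code): n buckets of
-- association lists, with a hash function landing directly in fin n,
-- so bucket indexing needs no separate range proof.
structure hashmap {α : Type u} (n : ℕ) (β : α → Type v) :=
(hash : α → fin n)
(buckets : array n (list (sigma β)))
(nodup : ∀ i, (buckets.read i).nodupkeys)
```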

@digama0 (Member) commented Mar 12, 2019

While I'm not completely against the idea of a rewrite (I wrote this library a long time ago and the proofs are more convoluted than they would be if I wrote them today), it was actually written with performance and correctness in mind, and I would want some strong evidence that the replacement is better on all fronts first.

Why does it promise to be better than the existing implementation?

  • doesn't involve bucket_array, which I couldn't find a use for

It's just an internal implementation thing. I see you did basically the same thing without the name.

  • doesn't have valid.len : bkts.as_list.length = sz, which doesn't seem necessary

It's definitely necessary. It is there so that hash_map.size is O(1).

  • doesn't have the hash_map.nbuckets : ℕ+: the pnat here isn't necessary

You can't do the mod without it. I'll get back to this with the other points.

  • builds on the definitions, theorems, and concepts of list.sigma and alist
  • doesn't have all of the “aux” proofs, since many things are now common with list.sigma/alist

I'm fine with this. Obviously this wasn't around when I wrote it, and incorporating it is a good way to tie the library together.

  • uses hash: α → fin n instead of hash_fn: α → nat
  • unbundles the hash map size

This is a problem. I notice that you didn't write the most important function, insert. What is the type of insert? You will end up forcing the map to never reallocate, which is very bad for performance.

I think you have identified that there are two data structures involved here: a fixed-size hashmap, and a growable hashmap built on the fixed one, much like buffer builds on array. As an internal component of a growable hashmap, the fixed-size hashmap should probably have the size and the hash function external, so that they don't need to take up space in the structure. The growable hashmap should be completely bundled, with only the key/value types visible. The growable hashmap does need the bucket size to be positive, because otherwise it would have to supply an A → fin 0 hash function to the fixed hashmap, which is impossible when A is inhabited (fin 0 is empty).
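
A rough sketch of that two-layer arrangement (all names hypothetical, not from the PR): the fixed-size core takes its size and hash function from outside, while the growable wrapper bundles everything except the key and value types:

```lean
universes u v

-- Fixed-size core (hypothetical): only the buckets live in the
-- structure; the size n and the hash function are parameters supplied
-- from outside, so they take up no space here.
structure bucket_map {α : Type u} (n : ℕ) (β : α → Type v) :=
(buckets : array n (list (sigma β)))

-- Growable wrapper (hypothetical): fully bundled, exposing only the
-- key/value types. nbuckets must stay positive, since an inhabited α
-- admits no function α → fin 0.
structure growable_hashmap (α : Type u) (β : α → Type v) :=
(nbuckets : ℕ)
(pos : 0 < nbuckets)
(hash : α → fin nbuckets)
(size : ℕ)
(core : bucket_map nbuckets β)
```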

@digama0 (Member) commented Mar 12, 2019

An easy solution to the bucket numbering problem is to require that the number of buckets is always a power of 2, i.e. 2^n for some n : ℕ. This works with the automatic resizing algorithm because it always doubles the length, and it also makes the modulo operation more efficient (it's just a bit mask).
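
An illustrative sketch (not code from the PR) of why powers of two are convenient here: positivity of 2^n is free, so a plain nat suffices for the size parameter, and at the machine level the mod is a bit mask:

```lean
-- 2^n is positive for every n : ℕ, so no pnat is needed to justify
-- the mod. At the bit level, h % 2^n = h AND (2^n - 1), which is why
-- a compiler can turn this reduction into a single mask instruction.
def bucket_index (n h : ℕ) : fin (2^n) :=
⟨h % 2^n, nat.mod_lt _ (nat.pos_pow_of_pos n dec_trivial)⟩
```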

@spl (Collaborator, Author) commented Mar 12, 2019

Thanks for the in-depth response, @digama0.

While I'm not completely against the idea of a rewrite (I wrote this library a long time ago and the proofs are more convoluted than they would be if I wrote them today), it was actually written with performance and correctness in mind, and I would want some strong evidence that the replacement is better on all fronts first.

  • doesn't involve bucket_array, which I couldn't find a use for

It's just an internal implementation thing. I see you did basically the same thing without the name.

Of course. There are quite a few lines spent on bucket_array in hash_map.lean, and I think removing them helped to simplify things.

  • doesn't have valid.len : bkts.as_list.length = sz, which doesn't seem necessary

It's definitely necessary. It is there so that hash_map.size is O(1).

Why is it important that hash_map.size be O(1)? That's something that should definitely be documented.

  • doesn't have the hash_map.nbuckets : ℕ+: the pnat here isn't necessary

You can't do the mod without it. I'll get back to this with the other points.

I'm not sure what you mean. I have a wrapper that allows you to use it, but I don't see why it's necessary for the structure. I'm probably missing something.

```lean
-- Wrap an arbitrary ℕ-valued hash function into fin n by reducing
-- mod n; n.property : 0 < ↑n supplies the bound for nat.mod_lt.
-- (Assumes `universe v` and the PR's hashmap/mk_empty are in scope.)
def fin.lift_mod_pnat (n : ℕ+) {α} (f : α → ℕ) (a : α) : fin n :=
⟨f a % n, nat.mod_lt _ n.property⟩

def mk_empty_pos (n : ℕ+) {α} (β : α → Type v) (f : α → ℕ) : hashmap n β :=
mk_empty β (fin.lift_mod_pnat n f)
```

  • builds on the definitions, theorems, and concepts of list.sigma and alist
  • doesn't have all of the “aux” proofs, since many things are now common with list.sigma/alist

I'm fine with this. Obviously this wasn't around when I wrote it, and incorporating it is a good way to tie the library together.

Yep.

  • uses hash: α → fin n instead of hash_fn: α → nat
  • unbundles the hash map size

This is a problem. I notice that you didn't write the most important function, insert. What is the type of insert? You will end up forcing the map to never reallocate, which is very bad for performance.

I don't follow. I have written insert in lean-finmap, but I haven't ported it over here yet. Why do you need to reallocate? Are you referring to the length of the association lists?

I think you have identified that there are two data structures involved here: a fixed-size hashmap, and a growable hashmap built on the fixed one, much like buffer builds on array. As an internal component of a growable hashmap, the fixed-size hashmap should probably have the size and the hash function external, so that they don't need to take up space in the structure. The growable hashmap should be completely bundled, with only the key/value types visible. The growable hashmap does need the bucket size to be positive, because otherwise it would have to supply an A → fin 0 hash function to the fixed hashmap, which is impossible when A is inhabited (fin 0 is empty).

Ah, I see what you mean. That makes sense, and I'd be happy to go with two hash maps. So, consider this one the fixed-size version, and we can build the dynamically sized hash map on top of it.

@spl (Collaborator, Author) commented Mar 12, 2019

An easy solution to the bucket numbering problem is to require that the number of buckets is always a power of 2, i.e. 2^n for some n : ℕ. This works with the automatic resizing algorithm because it always doubles the length, and it also makes the modulo operation more efficient (it's just a bit mask).

I get the efficiency of using powers of 2 for the array size, but I'm not sure what the “bucket numbering problem” is.

@spl (Collaborator, Author) commented Mar 12, 2019

As an internal component of a growable hashmap, the fixed-size hashmap should probably have the size and the hash function external, so that they don't need to take up space in the structure. The growable hashmap should be completely bundled, with only the key/value types visible.

I think I'm with you on the difference between the fixed- and dynamically-sized hash map sizes. In the former, the size never changes, which means it's really a parameter, not a changeable value. In the latter, the size changes and, therefore, the value should be kept.

As for the hash function, I'm not sure. I have vacillated over whether it should be a parameter or a field. Since it doesn't change, a parameter feels more appropriate; however, I left it as a field for convenience's sake. I'd be happy to make it a parameter, though. But I'm not clear on why it should be treated differently for fixed- and dynamically-sized hash maps.

@digama0 (Member) commented Mar 12, 2019

I get the efficiency of using powers of 2 for the array size, but I'm not sure what the “bucket numbering problem” is.

I'm referring to the inconvenience of dealing with pnats. If the bucket counts are powers of two, the size parameter can be a nat, and the bucket array is always nonempty.

As for the hash function, I'm not sure. I have vacillated over whether it should be a parameter or a field. Since it doesn't change, a parameter feels more appropriate; however, I left it as a field for convenience's sake. I'd be happy to make it a parameter, though. But I'm not clear on why it should be treated differently for fixed- and dynamically-sized hash maps.

You could have the function external for the growable hashmap as well, but you would also have to externalize the validity predicate, which would make it less ergonomic. Leaving stuff out of the structure makes it easier to reason about, but less encapsulated. So the user-facing version (the growable hashmap) should be encapsulated, while the internal version (the fixed hashmap) should make its assumptions clear. Additionally, externalizing the function (and leaving it out of the type of the hashmap too) means that you aren't storing unnecessary copies of it in the growable hashmap and the fixed hashmap, which is good both because it saves space and because it avoids an unnecessary coherence condition asserting the relation between the two functions.
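
To illustrate the coherence point with a hypothetical bad design (nothing from the PR; all names invented): if both layers stored their own copy of the hash function, the bundled wrapper would need an extra proof field relating the two:

```lean
universes u v

-- Hypothetical: the fixed layer stores an internal hash...
structure fixed_map {α : Type u} (n : ℕ) (β : α → Type v) :=
(hash : α → fin n)
(buckets : array n (list (sigma β)))

-- ...so a bundled layer that also stores one is forced to carry a
-- coherence proof tying the two copies together. Externalizing the
-- function from fixed_map makes this field (and the space for the
-- duplicate closure) disappear.
structure bundled_map (α : Type u) (β : α → Type v) :=
(hash : α → ℕ)
(n : ℕ+)
(core : fixed_map n β)
(coherent : ∀ a, (core.hash a : ℕ) = hash a % n)
```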

Why is it important that hash_map.size be O(1)? That's something that should definitely be documented.

Probably it should be documented. But as for why to expect it, this is something I would expect of any decent hashmap implementation for a general purpose programming language. It's cheap to keep track of the information, and the alternative is crazy expensive (it's the same cost as a fold over the whole map) so it seems like a good investment. But if we have two versions of the data structure for internal and external use, then I'm fine with this only being in the user-facing data structure.
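
A quick sketch of the cost difference (illustrative only, using mathlib-style names):

```lean
-- Without a stored size, computing it is a fold over every bucket,
-- i.e. linear in the number of buckets plus entries:
def slow_size {n : ℕ} {α : Type*} (buckets : array n (list α)) : ℕ :=
buckets.foldl 0 (λ l acc, acc + l.length)
-- With a stored size field, hash_map.size is a single projection: O(1).
```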

@spl (Collaborator, Author) commented Mar 12, 2019

Why is it important that hash_map.size be O(1)?

this is something I would expect of any decent hashmap implementation for a general purpose programming language. It's cheap to keep track of the information, and the alternative is crazy expensive (it's the same cost as a fold over the whole map) so it seems like a good investment. But if we have two versions of the data structure for internal and external use, then I'm fine with this only being in the user-facing data structure.

I'm not convinced that storing the size is needed for a hash map in which the size is always determined by its structure and never used by the hash map itself. You could also make the argument that each of list, alist, finset, finmap, etc. should also keep track of its length/size, but none of them does. If you want to efficiently keep track of size, you can do it outside of the data type. It seems to add unnecessary complication to the structure itself.

On the other hand, with a hash map whose structure depends on its size, such as the dynamically-sized hash map, in which the size needs to be calculated for each insert operation (as an example), it does seem reasonable to store the size.

@digama0 (Member) commented Mar 12, 2019

I'm not convinced that storing the size is needed for a hash map in which the size is always determined by its structure and never used by the hash map itself. You could also make the argument that each of list, alist, finset, finmap, etc. should also keep track of its length/size, but none of them does.

All of these examples are built on list, so your question boils down to "why doesn't list track sizes". I don't think list is actually a very good data structure for general programming use, but its usefulness in writing specifications and proving theorems cannot be denied. For storing lists of things, though, it is far from the best available data structure. I would recommend buffer as a drop-in replacement, and it does actually store the length of the list.
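
For reference, buffer in Lean 3 core is essentially a sigma type that bundles the length with the data (reproduced here as a sketch):

```lean
universe u

-- Lean 3 core's buffer, for reference: the length n is bundled with
-- the array, so reporting the size is just the first projection.
def buffer (α : Type u) := Σ n : ℕ, array n α

def buffer.size {α : Type u} (b : buffer α) : ℕ := b.1
```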

never used by the hash map itself

What do you mean "used"? Data is used if it is accessed by methods from the public API, and size is definitely part of the public API. As another data point, ordmap and rb_map both store the size of the tree alongside the data, for no other reason than to make size reporting fast.

If you want to efficiently keep track of size, you can do it outside of the data type. It seems to add unnecessary complication to the structure itself.

Are you saying that in order for me (the user) to get the size of a hashmap I have to wrap it with a number which tracks the size and prove equality myself? That's ludicrous. Show me a hashmap API in any programming language where size is not O(1). Yes it makes the data structure more complicated, but if you think that's the optimization criterion then you are mistaken. This data structure is all about "I did the work so you don't have to". Don't cut things out of the design that were put there for a reason.

@spl (Collaborator, Author) commented Mar 12, 2019

Show me a hashmap API in any programming language where size is not O(1).

Haskell

@digama0 (Member) commented Mar 12, 2019

Well okay then, color me surprised. The Haskell implementation is pretty sophisticated too, so it's not for simplicity reasons. Maybe they don't think it gets used enough to be worth it? I'm still not convinced though. Too bad it doesn't say why it was done this way.

@spl (Collaborator, Author) commented Mar 12, 2019

The Haskell implementation is pretty sophisticated too, so it's not for simplicity reasons. Maybe they don't think it gets used enough to be worth it?

I think performance is the key. There's some discussion at haskell-unordered-containers/unordered-containers#138.

@digama0 (Member) commented Mar 12, 2019

Thanks for the link. Note that the Haskell implementation is recursive, so tracking the size there would put it in many places, whereas we only have to store it once. If the performance overhead becomes too much we can split it off, but if anything the discussion there indicates that this is actually an unusual choice for Haskell. We don't have nearly as many compiler optimizations that would make the relative performance worse. In particular, find is exactly as fast with or without size, and that's the primary use case.

Anyway, stop trying to remove functionality. As I said, this won't be merged unless it's better in every way, no regressions allowed.

@cipher1024 (Collaborator)

What is the next step here?

@kim-em (Collaborator) commented Apr 1, 2019

My understanding of the above is that this implementation is intended as a fixed-size hash map, and there is still going to be a dynamically-sized analogue (which bundles .size), which hasn't been implemented yet. I think Mario is saying that there's no point merging this part until the user-facing dynamically-sized hash map is written as well.

Is that correct?

@spl (Collaborator, Author) commented Apr 2, 2019

I currently have no plans to continue this work. Please feel free to do with it what you will.

@cipher1024 (Collaborator)

@spl What changed your mind about this PR?

@kim-em kim-em closed this Apr 15, 2019
@spl spl deleted the hashmap branch April 16, 2019 12:07