
feat(data/hashmap): initial implementation #812

Closed
wants to merge 1 commit into from

Conversation

@spl (Collaborator) commented Mar 11, 2019

I would like to see this (or something like it) eventually replace the existing data/hash_map.lean. This is just the beginning: there are plenty more definitions and theorems to be added, but I think the foundation is good and ready for review.

Some thoughts:

  • I think the name hashmap fits the current naming scheme of the core Lean library and mathlib better than hash_map does.
  • I've included notes on alternative approaches to structuring hashmap that I explored. While it's not exactly documentation on the existing work, I think it does provide a useful perspective.
  • I changed list.nodupkeys_join to simplify proofs. With the existing theorem, there end up being several extra applications of pairwise_map, both in list.nodupkeys_join itself and in theorems that use it.

  • reviewed and applied the coding style: coding, naming
  • make sure definitions and lemmas are put in the right files
  • make sure definitions and lemmas are not redundant

@cipher1024 cipher1024 requested a review from digama0 March 11, 2019 20:14
@cipher1024 cipher1024 self-assigned this Mar 11, 2019
@cipher1024 (Collaborator)

Why does it promise to be better than the existing implementation?

@spl (Collaborator, Author) commented Mar 12, 2019

Why does it promise to be better than the existing implementation?

  • Simpler:
    • doesn't involve bucket_array, which I couldn't find a use for
    • doesn't have valid.len : bkts.as_list.length = sz, which doesn't seem necessary
    • doesn't have the hash_map.nbuckets : ℕ+: the pnat here isn't necessary
  • Better definitions and proofs (at least, I hope; see the sketch after this list):
    • builds on the definitions, theorems, and concepts of list.sigma and alist
    • uses hash: α → fin n instead of hash_fn: α → nat
    • doesn't have all of the “aux” proofs, since many things are now common with list.sigma/alist
    • unbundles the hash map size
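
To make this concrete, here is a minimal sketch of the kind of fixed-size structure being described; the names and fields are illustrative, not the PR's actual code:

```lean
universes u v

-- Illustrative sketch only (not the PR's code): n buckets of
-- association lists, with a hash function landing directly in fin n,
-- so bucket indexing needs no separate range proof.
structure hashmap {α : Type u} (n : ℕ) (β : α → Type v) :=
(hash : α → fin n)
(buckets : array n (list (sigma β)))
(nodup : ∀ i, (buckets.read i).nodupkeys)
```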

@digama0 (Member) commented Mar 12, 2019

While I'm not completely against the idea of a rewrite (I wrote this library a long time ago and the proofs are more convoluted than they would be if I wrote them today), it was actually written with performance and correctness in mind, and I would want some strong evidence that the replacement is better on all fronts first.

Why does it promise to be better than the existing implementation?

  • doesn't involve bucket_array, which I couldn't find a use for

It's just an internal implementation thing. I see you did basically the same thing without the name.

  • doesn't have valid.len : bkts.as_list.length = sz, which doesn't seem necessary

It's definitely necessary. It is there so that hash_map.size is O(1).

  • doesn't have the hash_map.nbuckets : ℕ+: the pnat here isn't necessary

You can't do the mod without it. I'll get back to this with the other points.

  • builds on the definitions, theorems, and concepts of list.sigma and alist
  • doesn't have all of the “aux” proofs, since many things are now common with list.sigma/alist

I'm fine with this. Obviously this wasn't around when I wrote it, and incorporating it is a good way to tie the library together.

  • uses hash: α → fin n instead of hash_fn: α → nat
  • unbundles the hash map size

This is a problem. I notice that you didn't write the most important function, insert. What is the type of insert? You will end up forcing the map to never reallocate, which is very bad for performance.

I think you have identified that there are two data structures involved here: a fixed-size hashmap, and a growable hashmap built on the fixed one, much like buffer builds on array. As an internal component of a growable hashmap, the fixed-size hashmap should probably have the size and the hash function external, so that they don't need to take up space in the structure. The growable hashmap should be completely bundled, with only the key/value types visible. The growable hashmap does need the bucket size to be positive, because otherwise it would have to supply an A → fin 0 hash function to the fixed hashmap, which is impossible when A is inhabited (fin 0 is empty).
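
A rough sketch of that two-layer arrangement (all names hypothetical, not from the PR): the fixed-size core takes its size and hash function from outside, while the growable wrapper bundles everything except the key and value types:

```lean
universes u v

-- Fixed-size core (hypothetical): only the buckets live in the
-- structure; the size n and the hash function are parameters supplied
-- from outside, so they take up no space here.
structure bucket_map {α : Type u} (n : ℕ) (β : α → Type v) :=
(buckets : array n (list (sigma β)))

-- Growable wrapper (hypothetical): fully bundled, exposing only the
-- key/value types. nbuckets must stay positive, since an inhabited α
-- admits no function α → fin 0.
structure growable_hashmap (α : Type u) (β : α → Type v) :=
(nbuckets : ℕ)
(pos : 0 < nbuckets)
(hash : α → fin nbuckets)
(size : ℕ)
(core : bucket_map nbuckets β)
```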

@digama0 (Member) commented Mar 12, 2019

An easy solution to the bucket numbering problem is to require that the number of buckets is always a power of 2, i.e. 2^n for some n : ℕ. This works with the automatic resizing algorithm because it always doubles the length, and it also makes the modulo operation more efficient (it's just a bit mask).
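
An illustrative sketch (not code from the PR) of why powers of two are convenient here: positivity of 2^n is free, so a plain nat suffices for the size parameter, and at the machine level the mod is a bit mask:

```lean
-- 2^n is positive for every n : ℕ, so no pnat is needed to justify
-- the mod. At the bit level, h % 2^n = h AND (2^n - 1), which is why
-- a compiler can turn this reduction into a single mask instruction.
def bucket_index (n h : ℕ) : fin (2^n) :=
⟨h % 2^n, nat.mod_lt _ (nat.pos_pow_of_pos n dec_trivial)⟩
```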

@spl (Collaborator, Author) commented Mar 12, 2019

Thanks for the in-depth response, @digama0.

While I'm not completely against the idea of a rewrite (I wrote this library a long time ago and the proofs are more convoluted than they would be if I wrote them today), it was actually written with performance and correctness in mind, and I would want some strong evidence that the replacement is better on all fronts first.

  • doesn't involve bucket_array, which I couldn't find a use for

It's just an internal implementation thing. I see you did basically the same thing without the name.

Of course. There are quite a few lines spent on bucket_array in hash_map.lean, and I think removing them helped to simplify things.

  • doesn't have valid.len : bkts.as_list.length = sz, which doesn't seem necessary

It's definitely necessary. It is there so that hash_map.size is O(1).

Why is it important that hash_map.size be O(1)? That's something that should definitely be documented.

  • doesn't have the hash_map.nbuckets : ℕ+: the pnat here isn't necessary

You can't do the mod without it. I'll get back to this with the other points.

I'm not sure what you mean. I have a wrapper that allows you to use it, but I don't see why it's necessary for the structure. I'm probably missing something.

```lean
-- Wrap an arbitrary ℕ-valued hash function into fin n by reducing
-- mod n; n.property : 0 < ↑n supplies the bound for nat.mod_lt.
-- (Assumes `universe v` and the PR's hashmap/mk_empty are in scope.)
def fin.lift_mod_pnat (n : ℕ+) {α} (f : α → ℕ) (a : α) : fin n :=
⟨f a % n, nat.mod_lt _ n.property⟩

def mk_empty_pos (n : ℕ+) {α} (β : α → Type v) (f : α → ℕ) : hashmap n β :=
mk_empty β (fin.lift_mod_pnat n f)
```

  • builds on the definitions, theorems, and concepts of list.sigma and alist
  • doesn't have all of the “aux” proofs, since many things are now common with list.sigma/alist

I'm fine with this. Obviously this wasn't around when I wrote it, and incorporating it is a good way to tie the library together.

Yep.

  • uses hash: α → fin n instead of hash_fn: α → nat
  • unbundles the hash map size

This is a problem. I notice that you didn't write the most important function, insert. What is the type of insert? You will end up forcing the map to never reallocate, which is very bad for performance.

I don't follow. I have written insert in lean-finmap, but I haven't ported it over here yet. Why do you need to reallocate? Are you referring to the length of the association lists?

I think you have identified that there are two data structures involved here: a fixed-size hashmap, and a growable hashmap built on the fixed one, much like buffer builds on array. As an internal component of a growable hashmap, the fixed-size hashmap should probably have the size and the hash function external, so that they don't need to take up space in the structure. The growable hashmap should be completely bundled, with only the key/value types visible. The growable hashmap does need the bucket size to be positive, because otherwise it would have to supply an A → fin 0 hash function to the fixed hashmap, which is impossible when A is inhabited (fin 0 is empty).

Ah, I see what you mean. That makes sense, and I'd be happy to go with two hash maps. So, consider this one the fixed-size version, and we can build the dynamically sized hash map on top of it.

@spl (Collaborator, Author) commented Mar 12, 2019

An easy solution to the bucket numbering problem is to require that the number of buckets is always a power of 2, i.e. 2^n for some n : ℕ. This works with the automatic resizing algorithm because it always doubles the length, and it also makes the modulo operation more efficient (it's just a bit mask).

I get the efficiency of using powers of 2 for the array size, but I'm not sure what the “bucket numbering problem” is.

@spl (Collaborator, Author) commented Mar 12, 2019

As an internal component of a growable hashmap, the fixed-size hashmap should probably have the size and the hash function external, so that they don't need to take up space in the structure. The growable hashmap should be completely bundled, with only the key/value types visible.

I think I'm with you on the difference between the fixed- and dynamically-sized hash map sizes. In the former, the size never changes, which means it's really a parameter, not a changeable value. In the latter, the size changes and, therefore, the value should be kept.

As for the hash function, I'm not sure. I have vacillated over whether it should be a parameter or a field. Since it doesn't change, a parameter feels more appropriate; however, I left it as a field for convenience's sake. I'd be happy to make it a parameter, though. But I'm not clear on why it should be treated differently for fixed- and dynamically-sized hash maps.

@digama0 (Member) commented Mar 12, 2019

I get the efficiency of using powers of 2 for the array size, but I'm not sure what the “bucket numbering problem” is.

I'm referring to the inconvenience of dealing with pnats. If the bucket counts are powers of two, the size parameter can be a nat, and the bucket array is always nonempty.

As for the hash function, I'm not sure. I have vacillated over whether it should be a parameter or a field. Since it doesn't change, a parameter feels more appropriate; however, I left it as a field for convenience's sake. I'd be happy to make it a parameter, though. But I'm not clear on why it should be treated differently for fixed- and dynamically-sized hash maps.

You could have the function external for the growable hashmap as well, but you would also have to externalize the validity predicate, which would make it less ergonomic. Leaving stuff out of the structure makes it easier to reason about, but less encapsulated. So the user-facing version (the growable hashmap) should be encapsulated, while the internal version (the fixed hashmap) should make its assumptions clear. Additionally, externalizing the function (and leaving it out of the type of the hashmap too) means that you aren't storing unnecessary copies of it in the growable hashmap and the fixed hashmap, which is good both because it saves space and because it avoids an unnecessary coherence condition asserting the relation between the two functions.
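
To illustrate the coherence point with a hypothetical bad design (nothing from the PR; all names invented): if both layers stored their own copy of the hash function, the bundled wrapper would need an extra proof field relating the two:

```lean
universes u v

-- Hypothetical: the fixed layer stores an internal hash...
structure fixed_map {α : Type u} (n : ℕ) (β : α → Type v) :=
(hash : α → fin n)
(buckets : array n (list (sigma β)))

-- ...so a bundled layer that also stores one is forced to carry a
-- coherence proof tying the two copies together. Externalizing the
-- function from fixed_map makes this field (and the space for the
-- duplicate closure) disappear.
structure bundled_map (α : Type u) (β : α → Type v) :=
(hash : α → ℕ)
(n : ℕ+)
(core : fixed_map n β)
(coherent : ∀ a, (core.hash a : ℕ) = hash a % n)
```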

Why is it important that hash_map.size be O(1)? That's something that should definitely be documented.

Probably it should be documented. But as for why to expect it, this is something I would expect of any decent hashmap implementation for a general purpose programming language. It's cheap to keep track of the information, and the alternative is crazy expensive (it's the same cost as a fold over the whole map) so it seems like a good investment. But if we have two versions of the data structure for internal and external use, then I'm fine with this only being in the user-facing data structure.
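
A quick sketch of the cost difference (illustrative only, using mathlib-style names):

```lean
-- Without a stored size, computing it is a fold over every bucket,
-- i.e. linear in the number of buckets plus entries:
def slow_size {n : ℕ} {α : Type*} (buckets : array n (list α)) : ℕ :=
buckets.foldl 0 (λ l acc, acc + l.length)
-- With a stored size field, hash_map.size is a single projection: O(1).
```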

@spl (Collaborator, Author) commented Mar 12, 2019

Why is it important that hash_map.size be O(1)?

this is something I would expect of any decent hashmap implementation for a general purpose programming language. It's cheap to keep track of the information, and the alternative is crazy expensive (it's the same cost as a fold over the whole map) so it seems like a good investment. But if we have two versions of the data structure for internal and external use, then I'm fine with this only being in the user-facing data structure.

I'm not convinced that storing the size is needed for a hash map in which the size is always determined by its structure and never used by the hash map itself. You could also make the argument that each of list, alist, finset, finmap, etc. should also keep track of its length/size, but none of them does. If you want to efficiently keep track of size, you can do it outside of the data type. It seems to add unnecessary complication to the structure itself.

On the other hand, with a hash map whose structure depends on its size, such as the dynamically-sized hash map, in which the size needs to be calculated for each insert operation (as an example), it does seem reasonable to store the size.

@digama0 (Member) commented Mar 12, 2019

I'm not convinced that storing the size is needed for a hash map in which the size is always determined by its structure and never used by the hash map itself. You could also make the argument that each of list, alist, finset, finmap, etc. should also keep track of its length/size, but none of them does.

All of these examples are built on list, so your question boils down to "why doesn't list track sizes". I don't think list is actually a very good data structure for general programming use, but its usefulness in writing specifications and proving theorems cannot be denied. For storing lists of things, though, it is far from the best available data structure. I would recommend buffer as a drop-in replacement, and it does actually store the length of the list.
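
For reference, buffer in Lean 3 core is essentially a sigma type that bundles the length with the data (reproduced here as a sketch):

```lean
universe u

-- Lean 3 core's buffer, for reference: the length n is bundled with
-- the array, so reporting the size is just the first projection.
def buffer (α : Type u) := Σ n : ℕ, array n α

def buffer.size {α : Type u} (b : buffer α) : ℕ := b.1
```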

never used by the hash map itself

What do you mean "used"? Data is used if it is accessed by methods from the public API, and size is definitely part of the public API. As another data point, ordmap and rb_map both store the size of the tree alongside the data, for no other reason than to make size reporting fast.

If you want to efficiently keep track of size, you can do it outside of the data type. It seems to add unnecessary complication to the structure itself.

Are you saying that in order for me (the user) to get the size of a hashmap I have to wrap it with a number which tracks the size and prove equality myself? That's ludicrous. Show me a hashmap API in any programming language where size is not O(1). Yes it makes the data structure more complicated, but if you think that's the optimization criterion then you are mistaken. This data structure is all about "I did the work so you don't have to". Don't cut things out of the design that were put there for a reason.

@spl (Collaborator, Author) commented Mar 12, 2019

Show me a hashmap API in any programming language where size is not O(1).

Haskell

@digama0 (Member) commented Mar 12, 2019

Well okay then, color me surprised. The Haskell implementation is pretty sophisticated too, so it's not for simplicity reasons. Maybe they don't think it gets used enough to be worth it? I'm still not convinced though. Too bad it doesn't say why it was done this way.

@spl (Collaborator, Author) commented Mar 12, 2019

The Haskell implementation is pretty sophisticated too, so it's not for simplicity reasons. Maybe they don't think it gets used enough to be worth it?

I think performance is the key. There's some discussion at haskell-unordered-containers/unordered-containers#138.

@digama0 (Member) commented Mar 12, 2019

Thanks for the link. Note that the Haskell implementation is recursive, so tracking the size there would put it in many places, whereas we only have to store it once. If the performance overhead becomes too much we can split it off, but if anything the discussion there indicates that this is actually an unusual choice for Haskell. We don't have nearly as many compiler optimizations that would make the relative performance worse. In particular, find is exactly as fast with or without size, and that's the primary use case.

Anyway, stop trying to remove functionality. As I said, this won't be merged unless it's better in every way, no regressions allowed.

@cipher1024 (Collaborator)

What is the next step here?

@kim-em (Collaborator) commented Apr 1, 2019

My understanding of the above is that this implementation is intended as a fixed-size hash map, and there is still going to be a dynamically-sized analogue (which bundles .size), which hasn't been implemented yet. I think Mario is saying that there's no point merging this part until the user-facing dynamically-sized hash map is written as well.

Is that correct?

@spl (Collaborator, Author) commented Apr 2, 2019

I currently have no plans to continue this work. Please feel free to do with it what you will.

@cipher1024 (Collaborator)

@spl What changed your mind about this PR?

@kim-em kim-em closed this Apr 15, 2019
@spl spl deleted the hashmap branch April 16, 2019 12:07