Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed uuidtools #344

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

iainbeeston
Copy link
Contributor

Right now, we cache schemas, and the key we use to cache them is either the URL we got them from, or if there's no URL, we generate a UUID from the json text of the schema. We're basically using a v5 UUID as a hashing function. What's more, json-schema includes it's own UUID library for doing this.

I've swapped that out, for generating a SHA1 hash instead. It should be fast, although in my own benchmarks, the speed is comparable to the current implementation (real world speed may depend on the size of the schemas being cached). At the very least, I believe it's clearer to use SHA1 (rather than UUID) and it also means we don't need an external ruby file copied into the project.

@iainbeeston
Copy link
Contributor Author

@RST-J Do you have any thoughts about this? It will allow us to remove another dependency (although this one is included inside json-schema itself)

@@ -54,6 +55,11 @@ def self.file_uri(uri)
Addressable::URI.convert_path(parsed_uri.path)
end

def self.fake_uri(string)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to name this hash_uri

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should rather have hash in the name than fake. But actually I would not call it hash_uri because the method itself does not hash URIs but is used to hash a schema. So maybe hash_schema would be a better name. But now it actually does not care about whether the input is a schema or not, it takes a string and returns a URI created from the hash of that string. So maybe uri_from_sha1 oder uri_from_sha1_hash or something like that would be more precise. But also more verbose. I'm personally getting more and more a fan of being as precise as possible with naming methods as it eases reasoning about code (we likely have some worse candidates here). What do you think?

@RST-J
Copy link
Contributor

RST-J commented Oct 5, 2016

Removing the dependency is a good idea I think, we didn't use anything else from that and your replacement fits the purpose.
In terms of speed, MD5 is faster, so maybe we could take just this? I mean, there is no security concern here (or am I missing something?). But there is a comment in the removed library which suggests not to do that:

UUID generation using SHA1. Recommended over create_md5.

As I dont't know why and there is no reason given, I'm not sure whether this performance improvement is worthwhile or not. What do you know and think about that?

@iainbeeston
Copy link
Contributor Author

I'm actually starting to have doubts about this now.

Perhaps we should be using String.hash, which would be an order of magnitude faster. So far as I'm aware String.hash should realistically not have collisions (although I'm struggling to find good evidence of that)

Also, it might be better to generate not just a hex string but a uri with a fake scheme that is immediately identifiable as an internally calculated value (something like "ref://af562bbc" or "internal://539daf54dd"?)

@iainbeeston
Copy link
Contributor Author

Incidentally, @mjc did suggest using String.hash in a previous pull request (any credit for the idea should go to him!)

@iainbeeston
Copy link
Contributor Author

I've now changed my mind entirely - I think we should simply use String.hash.to_s(16) (with an appropriate scheme to make it look like a URI).

String.hash in MRI ruby uses siphash-2-4, which should be good enough for our needs and much faster than SHA1

@RST-J
Copy link
Contributor

RST-J commented Nov 8, 2016

I think, using a custom scheme is a good idea. It prevents accidental collisions, no matter how improbable they are.
And, if String.hash turns out to be not collision resistant enough, again no matter how improbable, one can always circumvent the immediate problem by assigning an URI and we can mitigate the problem if it ever gets reported. As this is internal, I assume no one should rely on the details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants