Improve caching by comparing file hashes as fallback for mtime and size #3821
Conversation
Thanks, this is a nice feature!
```python
with open(path, "rb") as fp:
    data = fp.read()
return hashlib.sha256(data).hexdigest()
```
nit: I think it should be fine to use sha1 here, which is about 1.4x faster for me
sha256 works well for mypy. I would stick with it in this case.
https://github.com/python/mypy/blob/v1.5.1/mypy/util.py#L501-L510
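For context, the relative speed of the two algorithms can be checked with a quick micro-benchmark like the one below (a sketch, not part of the PR; timings vary by platform and OpenSSL build, so the ~1.4x figure is not universal):

```python
import hashlib
import timeit

# Hash 1 MiB of data repeatedly with each algorithm and compare wall time.
data = b"x" * (1 << 20)

sha1_time = timeit.timeit(lambda: hashlib.sha1(data).hexdigest(), number=100)
sha256_time = timeit.timeit(lambda: hashlib.sha256(data).hexdigest(), number=100)

print(f"sha1:   {sha1_time:.3f}s")
print(f"sha256: {sha256_time:.3f}s")
```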
* Skip hashing if file sizes don't match
* Use is_changed in filtered_cached
* Combine update and write
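The "Combine update and write" change could look roughly like this (a hypothetical sketch; the helper name, signature, and on-disk format are illustrative, not Black's actual implementation):

```python
import pickle
from pathlib import Path


def write_cache(cache_file: Path, cache: dict, formatted: list) -> None:
    """Record freshly formatted files and persist the cache in one step,
    instead of separate update() and write() calls."""
    for src in formatted:
        st = src.stat()
        # Store the metadata used for future change detection.
        cache[str(src.resolve())] = (st.st_mtime, st.st_size)
    with open(cache_file, "wb") as fp:
        pickle.dump(cache, fp)
```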
Thanks for the updates! Looks good, but I'd like for Jelle to take a look before merging
```python
) as write_cache:
    cmd = [str(src), "--diff"]
    if color:
        cmd.append("--color")
    invokeBlack(cmd)
    cache_file = get_cache_file(mode)
    assert cache_file.exists() is False
read_cache.assert_called_once()
```
This is a change in behaviour from eagerly reading the cache. Doesn't seem like a big deal
Yeah, this was a result of initializing the cache with `cache = Cache.read(mode)`. Technically it did one unnecessary read in some cases. Changing it wasn't too difficult. I just pushed 85b4a91 to delay the cache read.
Co-authored-by: Shantanu <[email protected]>
Hmm, I'm not sure I like this impl of delayed cache read; it seems too easy to forget to call `cache.read()`. I think I'd prefer one of:
a) make `file_data` a property that reads the cache file if `self._file_data` is not set
b) initialise `file_data` to None, so we get errors if we try to use the Cache without `read`-ing first
c) revert to before the last commit
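Option (a) could be sketched as follows (a hypothetical illustration; the class shape, attribute names, and pickle format are assumptions, not Black's actual code):

```python
import pickle
from pathlib import Path
from typing import Optional


class Cache:
    """file_data is a property that reads the cache file on first access,
    so callers cannot forget an explicit read() call."""

    def __init__(self, cache_file: Path) -> None:
        self.cache_file = cache_file
        self._file_data: Optional[dict] = None

    @property
    def file_data(self) -> dict:
        if self._file_data is None:
            # Lazily populate from disk on first access.
            self._file_data = self._read()
        return self._file_data

    def _read(self) -> dict:
        if not self.cache_file.exists():
            return {}
        with open(self.cache_file, "rb") as fp:
            return pickle.load(fp)
```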
True, even though that is how the current implementation (on main) does it.
I don't like (a), so I reverted the commit. Long term it might make sense to use the cache in
@JelleZijlstra this looks good to me, but I'd like to wait until you have a chance to take a look!
```python
return hashlib.sha256(data).hexdigest()

@staticmethod
def get_file_data(path: Path) -> FileData:
```
Why not a global function? staticmethods often feel a bit useless.
I like it here as it helps to group these methods nicely. Obviously personal preference. Though, if you want me to change it, I can do that too.
Let's just leave it as is, thanks!
Hmm, we'll see if #3843 fixes CI
Any indication when the next release will be? I wouldn't ask normally, but the change will improve CI times drastically for larger projects when combined with caching. Would love to start using it.
It's been about two months, so I'll start the release process over the next few days.
Description
Rewrite and improve the caching implementation to use file hashes as a fallback for mtime and size comparisons. Especially on CI systems, comparing based on mtime alone makes caching practically useless: with each new run and git checkout the mtime changes, so the cache misses even if the file content didn't change.
This PR adds a fallback that compares file hashes when the mtime has changed. It's not as fast, but still much faster than formatting the file outright. This approach is used by other tools as well, such as mypy.
https://github.com/python/mypy/blob/v1.4.1/mypy/fswatcher.py#L80-L88
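The fallback described above can be sketched roughly as follows (hypothetical names: `FileData`, `hash_digest`, and `is_changed` here are illustrative, not Black's exact API):

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileData:
    st_mtime: float
    st_size: int
    hash: str


def hash_digest(path: Path) -> str:
    with open(path, "rb") as fp:
        return hashlib.sha256(fp.read()).hexdigest()


def is_changed(path: Path, old: FileData) -> bool:
    st = path.stat()
    if st.st_size != old.st_size:
        return True   # size differs: definitely changed, no hash needed
    if st.st_mtime == old.st_mtime:
        return False  # size and mtime both match: assume unchanged
    # mtime differs (e.g. fresh git checkout in CI): fall back to hashing
    return hash_digest(path) != old.hash
```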
For the initial caching implementation comparing hashes was dismissed because the benefit was seen as not worth it (at least at first) #109 (comment). However, as mentioned above and later in the issue #109 (comment), it's necessary for CI systems.
A quick performance comparison for https://github.com/home-assistant/core, roughly 10,000 files, run with pre-commit and GitHub Actions.
23.7.0: ~3:30 - 4:00 min

Checklist - did you ...
Add an entry in CHANGES.md if necessary?