Improve caching by comparing file hashes as fallback for mtime and size #3821
Conversation
Thanks, this is a nice feature!
```python
with open(path, "rb") as fp:
    data = fp.read()
return hashlib.sha256(data).hexdigest()
```
nit: I think it should be fine to use sha1 here, which is about 1.4x faster for me
sha256 works well for mypy. I would stick with it in this case.
https://github.com/python/mypy/blob/v1.5.1/mypy/util.py#L501-L510
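For context, the relative speed of the two algorithms can be checked with a quick micro-benchmark like the one below (a sketch, not part of the PR; timings vary by platform and OpenSSL build, so the ~1.4x figure is not universal):

```python
import hashlib
import timeit

# Hash 1 MiB of data repeatedly with each algorithm and compare wall time.
data = b"x" * (1 << 20)

sha1_time = timeit.timeit(lambda: hashlib.sha1(data).hexdigest(), number=100)
sha256_time = timeit.timeit(lambda: hashlib.sha256(data).hexdigest(), number=100)

print(f"sha1:   {sha1_time:.3f}s")
print(f"sha256: {sha256_time:.3f}s")
```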
* Skip hashing if file sizes don't match
* Use is_changed in filtered_cached
* Combine update and write
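The "Combine update and write" change could look roughly like this (a hypothetical sketch; the helper name, signature, and on-disk format are illustrative, not Black's actual implementation):

```python
import pickle
from pathlib import Path


def write_cache(cache_file: Path, cache: dict, formatted: list) -> None:
    """Record freshly formatted files and persist the cache in one step,
    instead of separate update() and write() calls."""
    for src in formatted:
        st = src.stat()
        # Store the metadata used for future change detection.
        cache[str(src.resolve())] = (st.st_mtime, st.st_size)
    with open(cache_file, "wb") as fp:
        pickle.dump(cache, fp)
```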
Thanks for the updates! Looks good, but I'd like for Jelle to take a look before merging
```python
) as write_cache:
    cmd = [str(src), "--diff"]
    if color:
        cmd.append("--color")
    invokeBlack(cmd)
    cache_file = get_cache_file(mode)
    assert cache_file.exists() is False
read_cache.assert_called_once()
```
This is a change in behaviour from eagerly reading the cache. Doesn't seem like a big deal
Yeah, this was a result of initializing the cache with `cache = Cache.read(mode)`. Technically it did one unnecessary read in some cases. Changing it wasn't too difficult. I just pushed 85b4a91 to delay the cache read.
Co-authored-by: Shantanu <[email protected]>
Hmm, I'm not sure I like this impl of delayed cache read; it seems too easy to forget to call `cache.read()`. I think I'd prefer one of:
a) make `file_data` a property that reads the cache file if `self._file_data` is not set
b) initialise `file_data` to None, so we get errors if we try to use the Cache without `read`-ing first
c) revert to before the last commit
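Option (a) could be sketched as follows (a hypothetical illustration; the class shape, attribute names, and pickle format are assumptions, not Black's actual code):

```python
import pickle
from pathlib import Path
from typing import Optional


class Cache:
    """file_data is a property that reads the cache file on first access,
    so callers cannot forget an explicit read() call."""

    def __init__(self, cache_file: Path) -> None:
        self.cache_file = cache_file
        self._file_data: Optional[dict] = None

    @property
    def file_data(self) -> dict:
        if self._file_data is None:
            # Lazily populate from disk on first access.
            self._file_data = self._read()
        return self._file_data

    def _read(self) -> dict:
        if not self.cache_file.exists():
            return {}
        with open(self.cache_file, "rb") as fp:
            return pickle.load(fp)
```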
True, even though that is how the current implementation (on main) does it.
I don't like (a), so I reverted the commit. Long term it might make sense to use the cache in
@JelleZijlstra this looks good to me, but I'd like to wait until you have a chance to take a look!
```python
return hashlib.sha256(data).hexdigest()

@staticmethod
def get_file_data(path: Path) -> FileData:
```
Why not a global function? staticmethods often feel a bit useless.
I like it here as it helps to group these methods nicely. Obviously personal preference. Though, if you want me to change it, I can do that too.
Let's just leave it as is, thanks!
Hmm, we'll see if #3843 fixes CI
Any indication when the next release will be? I wouldn't ask normally, but the change will improve CI times drastically for larger projects when combined with caching. Would love to start using it.
It's been about two months, so I'll start the release process over the next few days.
Description
Rewrite and improve the caching implementation to use file hashes as a fallback for mtime and size comparisons. Especially on CI systems, comparing based on mtime alone makes caching practically useless: with each new run and git checkout the mtime changes, so the cache misses even if the file content didn't change.
This PR adds a fallback that compares file hashes when the mtime has changed. It's not as fast, but still much faster than formatting the file outright. This approach is used by other tools as well, such as mypy.
https://github.com/python/mypy/blob/v1.4.1/mypy/fswatcher.py#L80-L88
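The fallback described above can be sketched roughly as follows (hypothetical names: `FileData`, `hash_digest`, and `is_changed` here are illustrative, not Black's exact API):

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileData:
    st_mtime: float
    st_size: int
    hash: str


def hash_digest(path: Path) -> str:
    with open(path, "rb") as fp:
        return hashlib.sha256(fp.read()).hexdigest()


def is_changed(path: Path, old: FileData) -> bool:
    st = path.stat()
    if st.st_size != old.st_size:
        return True   # size differs: definitely changed, no hash needed
    if st.st_mtime == old.st_mtime:
        return False  # size and mtime both match: assume unchanged
    # mtime differs (e.g. fresh git checkout in CI): fall back to hashing
    return hash_digest(path) != old.hash
```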
For the initial caching implementation comparing hashes was dismissed because the benefit was seen as not worth it (at least at first) #109 (comment). However, as mentioned above and later in the issue #109 (comment), it's necessary for CI systems.
A quick performance comparison for https://github.com/home-assistant/core, roughly 10,000 files, run with pre-commit and GitHub Actions.
23.7.0: ~3:30 - 4:00 min

Checklist - did you ...
Add an entry in CHANGES.md if necessary?