Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ignoring input files for symlink reasons #4222

Merged
merged 5 commits into from
Feb 12, 2024

Conversation

hauntsaninja
Copy link
Collaborator

@hauntsaninja hauntsaninja commented Feb 11, 2024

This relates to #4015, #4161 and the behaviour of os.getcwd()

Black is a big user of pathlib and as such loves doing .resolve(), since for a long time it was the only good way of getting an absolute path in pathlib. However, this has two problems:

The first minor problem is performance, e.g. in #3751 I (safely) got rid of a bunch of .resolve() which made Black 40% faster on cached runs.

The second more important problem is that always resolving symlinks results in unintuitive exclusion behaviour. For instance, a gitignored symlink should never alter formatting of your actual code. This kind of thing was reported by users a few times.

In #3846, I improved the exclusion rule logic for symlinks in gen_python_files and everything was good.

But gen_python_files isn't enough, there's also get_sources, which handles user specified paths directly (instead of files Black discovers). So in #4015, I made a very similar change to #3846 for get_sources, and this is where some problems began.

The core issue was the line:

root_relative_path = path.absolute().relative_to(root).as_posix()

The first issue is that despite root being computed from user inputs, we call .resolve() while computing root (likely unecessarily). Which means that path may not actually be relative to root. So I started off this PR trying to fix that, when I ran into the second issue. Which is that os.getcwd() (as called by os.path.abspath or Path.absolute or Path.cwd) also often resolves symlinks!

>>> import os
>>> os.environ.get("PWD")
'/Users/shantanu/dev/black/symlink/bug'
>>> os.getcwd()
'/Users/shantanu/dev/black/actual/bug'

This also meant that the breakage often would not show up when input relative paths.

This doesn't affect gen_python_files / #3846 because things are known to be relative to root and always absolute.

Anyway, it looks like #4161 fixed the crash by just swallowing the error and ignoring the file. Instead, we should just try to compute the actual relative path. I think this PR should be quite safe, but we could also consider reverting some of the previous changes; the associated issues weren't too popular.

At the same time, I think there's still behaviour that can be improved and I kind of want to make larger changes, but maybe I'll save that for if we do something like #3952

Fixes #4205, fixes #4209, original report in #4077

This relates to psf#4015, psf#4161 and the behaviour of os.getcwd()

Black is a big user of pathlib and as such loves doing `.resolve()`,
since for a long time it was the only good way of getting an absolute
path in pathlib. However, this has two problems:

The first minor problem is performance, e.g. in psf#3751 I (safely) got rid
of a bunch of `.resolve()` which made Black 40% faster on cached runs.

The second more important problem is that always resolving symlinks
results in unintuitive exclusion behaviour. For instance, a gitignored
symlink should never alter formatting of your actual code. This kind of
thing was reported by users a few times.

In psf#3846, I improved the exclusion rule logic for symlinks in
`gen_python_files` and everything was good.

But `gen_python_files` isn't enough, there's also `get_sources`, which
handles user specified paths directly (instead of files Black
discovers). So in psf#4015, I made a very similar change to psf#3846 for
`get_sources`, and this is where some problems began.

The core issue was the line:
```
root_relative_path = path.absolute().relative_to(root).as_posix()
```
The first issue is that despite root being computed from user inputs, we
call `.resolve()` while computing it (likely unecessarily). Which means
that `path` may not actually be relative to `root`. So I started off
this PR trying to fix that, when I ran into the second issue. Which is
that `os.getcwd()` (as called by `os.path.abspath` or `Path.absolute` or
`Path.cwd`) also often resolves symlinks!
```
>>> import os
>>> os.environ.get("PWD")
'/Users/shantanu/dev/black/symlink/bug'
>>> os.getcwd()
'/Users/shantanu/dev/black/actual/bug'
```
This also meant that the breakage often would not show up when input
relative paths.

This doesn't affect `gen_python_files` / psf#3846 because things are always
absolute and known to be relative to `root`.

Anyway, it looks like psf#4161 fixed the crash by just swallowing the error
and ignoring the file. Instead, we should just try to compute the actual
relative path. I think this PR should be quite safe, but we could also
consider reverting some of the previous changes; the associated issues
weren't too popular.

At the same time, I think there's still behaviour that can be improved
and I kind of want to make larger changes, but maybe I'll save that for
if we do something like psf#3952

Hopefully fixes psf#4205, fixes psf#4209, actual fix for psf#4077
@@ -734,6 +734,7 @@ def get_sources(
"""Compute the set of files to be formatted."""
sources: Set[Path] = set()

assert root.is_absolute(), f"INTERNAL ERROR: `root` must be absolute but is {root}"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a similar assert in gen_python_files (which this usually calls)

root_relative_path = get_root_relative_path(path, root, report)

if root_relative_path is None:
if resolves_outside_root_or_cannot_stat(path, root, report):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you may have guessed from the name, the difference between this and get_root_relative_path (from #4161) is the .resolve()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helped me located where the filtering happens and I just added sources.add(path) before the old "continue"in line 755/756 and my problem was solved.
Tho this, if you need, I can still test it against my case where my home folder ~ was set to a symlink like /homes/user -> /external/homes/user. Please don't hesitate to tell me if you need any help! Thanks for the great work again!

return path.absolute().relative_to(root)
except ValueError:
pass
root_parent = next((p for p in path.parents if _cached_resolve(p) == root), None)
Copy link
Collaborator Author

@hauntsaninja hauntsaninja Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added _cached_resolve mainly in case this loop x many input files is expensive

@@ -339,7 +348,8 @@ def gen_python_files(

assert root.is_absolute(), f"INTERNAL ERROR: `root` must be absolute but is {root}"
for child in paths:
root_relative_path = child.absolute().relative_to(root).as_posix()
assert child.is_absolute()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not new, this should be guaranteed by path = root / (path.resolve().relative_to(root)) in get_sources

Copy link

github-actions bot commented Feb 11, 2024

diff-shades reports zero changes comparing this PR (43aad90) to main (a201003).


What is this? | Workflow run | diff-shades documentation

src=["-"],
expected=[],
stdin_filename=stdin_filename,
)

@patch("black.find_project_root", lambda *args: (THIS_DIR.resolve(), None))
def test_get_sources_with_stdin(self) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really related, but while I was here I removed some monkeypatching since it's not needed. We probably do a little too much mocking, so the new tests touch the file system

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we should prefer to avoid mocking.

@PabloLION
Copy link

(fake@)hauntsaninja, Thank you for your continuous and beautiful work, I was waiting for this for a while, and I thought #4221 solves this problem. But sadly, no.
An idea: How about adding a work around flag like --experimental-force-ignore-symlink-check, since it's blocking many including me myself? If approved, I'll be glad to PR on this.

@hauntsaninja
Copy link
Collaborator Author

PabloLION, thanks for taking a look at this PR. Does this PR solve the thing blocking you? If it does, we can make a release relatively quickly once it's merged.

@PabloLION
Copy link

I replied between the code #4222 (comment)
please let me know if assist is needed.

def test_get_sources_with_stdin_symlink_outside_root(
self,
) -> None:
path = THIS_DIR / "data" / "include_exclude_tests"
stdin_filename = str(path / "b/exclude/a.py")
outside_root_symlink = Path("/target_directory/a.py")
root = Path("target_dir/").resolve().absolute()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target_dir/ or target_directory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target_dir is right. Rename doesn't affect semantics, I did it to make the test clearer, the relevant difference is /target_directory is absolute and so is outside root

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it would have been slightly clearer if you used two names that didn't look like one was a variant of the other, e.g. target_dir1 and target_dir2. What you have now is fine too though.

src=["-"],
expected=[],
stdin_filename=stdin_filename,
)

@patch("black.find_project_root", lambda *args: (THIS_DIR.resolve(), None))
def test_get_sources_with_stdin(self) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we should prefer to avoid mocking.

Copy link
Collaborator

@JelleZijlstra JelleZijlstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks. Some optional nits.

src/black/files.py Outdated Show resolved Hide resolved
Co-authored-by: Jelle Zijlstra <[email protected]>
@hauntsaninja hauntsaninja merged commit 23dfc5b into psf:main Feb 12, 2024
46 checks passed
@hauntsaninja hauntsaninja deleted the normalise-simplify branch February 12, 2024 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants