Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move src/parser.c to git LFS #273

Open
tamasvajk opened this issue Jan 11, 2023 · 24 comments
Open

Move src/parser.c to git LFS #273

tamasvajk opened this issue Jan 11, 2023 · 24 comments

Comments

@tamasvajk
Copy link
Collaborator

I started receiving the below warning on pushes:

remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File src/parser.c is 50.90 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.

I'm not sure what would be the impact of moving to LFS. Maybe some documentation would need to be updated.

@Sjord
Copy link
Collaborator

Sjord commented Jan 11, 2023

We could reconsider whether parser.c should be committed at all. It is auto-generated with tree-sitter generate, so I am not sure it should be in the repository.

@tamasvajk
Copy link
Collaborator Author

Not committing it would be even better. But I was under the impression that this is a de facto requirement of a tree sitter grammar repo, because this way all you have to do is check out the grammar’s git repository into a well-known location1. I'm not sure if this is a hard requirement or if we can deviate from it.

Footnotes

  1. https://dcreager.net/tree-sitter/getting-started/#installing-a-grammar

@tamasvajk
Copy link
Collaborator Author

tamasvajk commented Jan 11, 2023

@damieng I just checked the recent size increases of src/parser.c. It looks like 168390c increased it from 30 to 50MB. It's most probably because of the [$.assignment_expression, $._expression], conflict. If I remove it, this is the message I get from tree-sitter:

❯ tree-sitter generate                 
Unresolved conflict for symbol sequence:

  '*'  _lvalue_expression  •  '='  …

Possible interpretations:

  1:  '*'  (_expression  _lvalue_expression)  •  '='  …
  2:  '*'  (assignment_expression  _lvalue_expression  •  assignment_operator  _expression)  (precedence: 0, associativity: Right)

Possible resolutions:

  1:  Specify a higher precedence in `assignment_expression` than in the other rules.
  2:  Specify a higher precedence in `_expression` than in the other rules.
  3:  Specify a left or right associativity in `_expression`
  4:  Add a conflict for these rules: `assignment_expression`, `_expression`

Why is there a conflict? Why isn't '*' _lvalue_expression reduced to _pointer_indirection_expression which has higher precedence than assignment_expression?

EDIT: For future reference, I asked the same question in tree-sitter/tree-sitter#2024.

@damieng
Copy link
Contributor

damieng commented Jan 11, 2023

I don't know enough about how tree-sitter resolves these precedence rules to be helpful here unfortunately (every time I read the docs I feel like I've got a good handle on it but then once I dig into our non-trivial rule set it escapes me)

Whatever we do here it might be worth adding some kind of file size echo to a text file so that we can see at a glance on PR's how a change affects large generated file sizes.

@sogaiu
Copy link

sogaiu commented Jan 12, 2023

Re: parser.c and friends being part of the repository

I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src (src/parser.c and friends) fetched from grammar repositories.

In nvim-treesitter's case I think there may be logic to call tree-sitter generate under certain circumstances but that means a user needs to have the tree-sitter cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.

In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter's generate subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696

@theHamsta
Copy link

Re: parser.c and friends being part of the repository

I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src/parser.c and friends fetched from grammar repositories.

In nvim-treesitter's case I think there may be logic to call tree-sitter generate under certain circumstances but that means a user needs to have the tree-sitter cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.

In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter's generate subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696

We mostly use tar-balls of individual commits and download them via curl. With LFS, I don't know whether src/parser.c will end up in the tar archive (https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/shell_command_selectors.lua#L215-L216). If src/parser.c is missing, then we would mark the parser as requires_generate_from_grammar and users then need to have the tree-sitter-cli installed, which is already the case for some parsers.

@sogaiu
Copy link

sogaiu commented Jan 12, 2023

It looks like difftastic uses tree-sitter-c-sharp and assumes src/parser.c is available.

Not really sure though...but I would be surprised if @Wilfred didn't know :)

@sogaiu
Copy link

sogaiu commented Jan 12, 2023

I'm not sure how cursorless does its building, but I think it also transitively(?) uses tree-sitter-c-sharp.

There's a custom web-tree-sitter mentioned so may be Emscripten is involved in compiling parser.c and friends to produce .wasm.

May be @pokey knows the details.

@pokey
Copy link

pokey commented Jan 12, 2023

Here's where we build the wasm:

https://github.com/cursorless-dev/vscode-parse-tree/blob/main/Makefile#L43

So, as long as parser.c is in the npm package, I think we're ok. Fwiw we do intend to migrate all our tree-sitter language dependencies to use GitHub commit sha's instead of npm versions, because the npm versions tend to be stale. I believe that means we'd run into some issues if you stopped including parser.c in the repo

That being said, I think we could probably add a step to generate parser.c before we compile the wasm, so should be ok.

But I do believe every other tree-sitter grammar repo keeps the generated parser.c in source control, so it may come as a surprise to other downstream consumers if it does not

@damieng
Copy link
Contributor

damieng commented Jan 12, 2023

I think this should be a discussion across tree-sitter grammars in general. Having tree-sitter-c-sharp be different/unusual from other grammars in this respect will not help adoption.

Unfortunately we don't really have a broad tree-sitter discussion channel. Perhaps we should start one.

I've create a Discord one if anyone is interested https://discord.gg/KrRa5ATb

@pokey
Copy link

pokey commented Jan 13, 2023

Agreed. A lot of discussion seems to happen on https://github.com/tree-sitter/tree-sitter/discussions, so it's probably worth asking there. Agreed a Discord server could be useful, though. I just asked to see if there already was one (tree-sitter/tree-sitter#2027)

@theHamsta
Copy link

grafik

There is an option to include Git LFS in archives, which would maybe a possibility for everyone using tree-sitter-c-sharp via tarball. When anyone is using Git, the Git LFS requirement would propagate to the consuming repo.

@tamasvajk
Copy link
Collaborator Author

I found this related discussion: tree-sitter/tree-sitter#1243.

@sogaiu
Copy link

sogaiu commented Jan 26, 2023

I tried tree-sitter generate for tree-sitter-c-sharp and it really takes quite a while doesn't it:

$ time tree-sitter generate

real	4m2.854s
user	3m52.457s
sys	0m10.372s

Memory usage exceeded 50GB here too...

@damieng
Copy link
Contributor

damieng commented Jan 26, 2023

Yes, it has grown quite considerably lately. We're going to have to make some trade-offs between handling every single scenario possible in C# vs performance I suspect.

@sogaiu
Copy link

sogaiu commented Feb 14, 2023

Somewhat related, I noticed this issue at the tree-sitter-swift repository: alex-pinkus/tree-sitter-swift#149

@damieng
Copy link
Contributor

damieng commented Feb 14, 2023

That's a pretty interesting solution, I'd be up for doing that here. I wonder if we can proactively reach out to dependencies we know will be affected.

@sogaiu
Copy link

sogaiu commented Feb 14, 2023

The issue of it being non-trivial to contact users of one's grammar seems to be the sort of thing that many grammars will / do face.

I wonder if there is a good approach to this.

We started a sticky issue where we have began to announce potential upcoming changes and asked the users we know about to subscribe to it and let us know if things we're planning on could be a problem: sogaiu/tree-sitter-clojure#33

We tried to ascertain who was using us and then went around to the folks we came up with, but later it turned out there were other folks.

From the perspective of a "user" (e.g. nvim-treesitter, Emacs 29+, difftastic, helix-editor, etc.) though, I would imagine that there would be a preference to not have different methods of being up-to-date about upcoming changes. Some kind of unified way seems like a better deal.

I wonder if there's something that can be done at the tree-sitter repository...like a dedicated thread(?) / discussion per grammar? (I haven't found the Discussions UI / UX to be that great though -- stuff seems to get easily hidden and hard to search -- though possibly that has changed over time.)

@damieng
Copy link
Contributor

damieng commented Feb 14, 2023

I wonder then if we flip what tree-sitter-swift did on its head and instead move development to a new dev branch that PR's need to be made against that has no generated files.

We'd then adopt the tree-sitter-swift approach that when a release is tagged we generate the files and merge them into master (rather than with-generated-files)

That way everyone who is relying on master and the generated artefacts today is unaffected other than lagging behind active development.

Given there's only a small handful of contributors to the repo the work in us changing from developing against master to developing against dev or develop would be quite small.

@Thaodan
Copy link

Thaodan commented Jul 7, 2024

Why not the other way around? Unless the user knows they can expect generated files to be inside the repository they don't expect it. Also updating a generated file adds major noise to the commit history.

@Thaodan
Copy link

Thaodan commented Jul 7, 2024

Improving the memory used to generate this grammar would be great. I'm at 26.5GB right now for some architectures I don't even have the memory to generate the grammar (Arm and PPC).

@damieng
Copy link
Contributor

damieng commented Jul 8, 2024

We'd love to be able to improve the memory usage but the only options I'm aware of there are either:

  1. Reduce the amount of C# syntax we support
  2. Make changes to the core tree-sitter engine

We really don't want to do option 1 as it would be a breaking change and make this library less useful. As for option 2 I wouldn't know where to start and I suspect the only people who do are really busy on other things.

@Thaodan
Copy link

Thaodan commented Jul 14, 2024

I get the point, I wanted to highlight the amount of memory required to generate the current grammar.

If it only would be possible to generate parts of the grammar in a separate process.

@damieng
Copy link
Contributor

damieng commented Jul 14, 2024

It's definitely something that needs to be done but that would happen in tree-sitter itself not the C# grammar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants