Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Primary branch of development is depended on by users #43

Closed
sogaiu opened this issue Feb 15, 2023 · 17 comments
Closed

Primary branch of development is depended on by users #43

sogaiu opened this issue Feb 15, 2023 · 17 comments

Comments

@sogaiu
Copy link
Owner

sogaiu commented Feb 15, 2023

From our research, it appears that our primary branch of development is being depended on by a number of users. It's become a sort of de-facto release branch.

I don't think we ever had a stated policy regarding our branches so perhaps that's not too surprising -- though even if we did, it's not clear how much difference that would have made :)

Possibly related to this is the inclusion of generated source in this branch. I've written up some notes on that here.

I noticed this issue at the tree-sitter-swift repository regarding inclusion of generated source and passed that along to an issue at tree-sitter-sharp. There was an interesting idea mentioned in a response by damieng:

I wonder then if we flip what tree-sitter-swift did on its head and instead move development to a new dev branch that PR's need to be made against that has no generated files.

We'd then adopt the tree-sitter-swift approach that when a release is tagged we generate the files and merge them into master (rather than with-generated-files)

That way everyone who is relying on master and the generated artefacts today is unaffected other than lagging behind active development.

Given there's only a small handful of contributors to the repo the work in us changing from developing against master to developing against dev or develop would be quite small.

May be this sort of thing is worth considering. I would likely opt for manually doing things though -- at least initially (I think tree-sitter-swift may have automation).

@dannyfreeman
Copy link
Collaborator

How would we maintain a separate branch without the generated files? Everytime we merge back from master->development we would have to create a new commit deleting the generated files. I think it would be fine to have a development branch, just not sure what we gain by removing the generated C source. I feel like leaving the C source in the dev branch would be fine

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 16, 2023

I think the topic of generated files has come up a few times at least at the tree-sitter repository. Here's one of the more notable ones: tree-sitter/tree-sitter#730 - regarding the "why", here are some points mentioned:

they hinder resolving diffs (even when resolved, it would need to be re-resolved in the future) across branches.

they hinder concurrent development (harder to solve diffs when more people work on code at the same time).

they hinder development by adding responsibility to developers (more things a dev must be aware of while working on something) when it can actually be deferred.

So some obvious questions are:

  • are all / some of these points correct?
  • do these points matter in our case?

I'm not sure :)

Here's some more from the tree-sitter-swift folks: https://github.com/alex-pinkus/tree-sitter-swift#where-is-your-parserc

This repository currently omits most of the code that is autogenerated during a build. This means, for instance, that grammar.json and parser.c are both only available following a build. It also significantly reduces noise during diffs.

The side benefit of not checking in parser.c is that you can guarantee backwards compatibility. Parsers generated by the tree-sitter CLI aren't always backwards compatible. If you need a parser, generate it yourself using the CLI; all the information to do so is available in this package. By doing that, you'll also know for sure that your parser version and your library version are compatible.

On a related note, there was this issue at the tree-sitter-swift repository created by someone involved with helix-editor.

Regarding:

Everytime we merge back from master->development

Naively (perhaps?), I was thinking that stuff might go from development to master, but not the other way. May be I'm missing something...

In the development -> master direction, I think it depends on how stuff is "transferred". It may be as you say though that there is some kind of friction that is a cost to be considered.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 17, 2023

I came across a tip about marking some files in src to signal to GitHub that they are generated via .gitattributes.

Apparently this:

excludes the files from the repository's language calculation (the colored language bar on the repo's main page) and prevents changes to those files from being shown in pull requests on GitHub

That seemed interesting.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 17, 2023

I also found this issue from about a year ago that touched a bit on security and generated files.

The idea mentions a "central parser registry" -- I was thinking the other day that it could be handy for there to be a single site that could host per-parser discussion / announcement sections for something like what we're doing for the "potential changes announcement".

I think users (e.g. nvim-treesitter, Emacs 29+, etc.) would find it much more to their liking to interface with one site rather than many different places (e.g. like we have our thing, and someone else might do some other thing, etc.).

If enough of these sorts of ideas are mentioned, perhaps someone will start to think it's worth running a single location that addresses some of these points...(I considered GitHub discussions for this type of thing but I don't think it works very well from a UI / UX perspective.)

@dannyfreeman
Copy link
Collaborator

Everytime we merge back from master->development

Naively (perhaps?), I was thinking that stuff might go from development to master, but not the other way. May be I'm missing something...

In the development -> master direction, I think it depends on how stuff is "transferred". It may be as you say though that there is some kind of friction that is a cost to be considered.

I think merging from master->development only applies if we need to do some kind of "hotfix" to master, which I don't think is relevant. Nothing is so important here that it needs to skip the development branch. Sorry if that was confusing!

@dannyfreeman
Copy link
Collaborator

If enough of these sorts of ideas are mentioned, perhaps someone will start to think it's worth running a single location that addresses some of these points...(I considered GitHub discussions for this type of thing but I don't think it works very well from a UI / UX perspective.)

Right now we have the pinned issue and anyone is free to submit normal github issues, which seems good enough for these purposes. Until something like that registry exists I think we're doing pretty good with what we have.

@dannyfreeman
Copy link
Collaborator

dannyfreeman commented Feb 24, 2023

As for the swift parser.c handling, I wonder how things like Emacs handle this. The default grammar installer now downloads a git repository and assumes the parser.c is checked in. I bet it would fail for this parser. Going against convention seems like it will making everyone else's tooling more difficult to write and maintain.

That's kind of the worst part about this checked in generated C code, everyone is dependent on it right now, and it's going to take a big coordinated effort to move away from it, and I'm not sure it will be worth the trouble for the entire community to try it. A lot of things will break for users if that happens.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 25, 2023

I looked at one of the comments in the issue I linked at tree-sitter-swift's repository and it looks like there is a dedicated branch that provides the generated files. Here is the comment (has a link to the branch too).

As I mentioned earlier (last bit), if we were to try something similar (which I'm not sure is worth it), I would prefer to do this manually. I'm not inclined to increasing reliance on this particular platform ATM.

Anyway, for me, bringing some of these things up is more about increasing awareness of what's being tried and why. For this particular issue of generated files in a repository currently I agree with what you mentioned:

Going against convention seems like it will making everyone else's tooling more difficult to write and maintain.

If the generated files in the repository haven't been bothering us, it seems that sticking with what we've been doing is fine for the moment.

Regarding the original point about our primary branch of development being depended on, this also seems like it would be impractical to get others to change away from.

Having a "development" branch seems convenient to me as it can signal to other folks (including us) where things might be going. What we've been doing seems to be creating separate branches for individual bits of work and then merging these to the primary branch. Folks becoming aware of these separate branches seems impractical, but one "development" branch doesn't seem too bad.

@dannyfreeman
Copy link
Collaborator

If the generated files in the repository haven't been bothering us, it seems that sticking with what we've been doing is fine for the moment.

It never really bothered me when working on the namespace tokens, and I don't see it becoming an issue anytime soon. I even managed a very lengthy rebase where I regenerated the parser.c output for every commit without much issue.

I'd say having a development branch is fine and won't cause any problems between us. We can even open PRs against the development branch itself if we feel it's necessary (I probably will because I rewrite commit history on my branches very frequently).

If we run into problems with the development branch I don't think we will have much issue fixing them between the two of us.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 27, 2023

Thanks for your comments.

I've gone ahead and made a dev branch.

I've started the branch with what was described around here regarding using a wrapper function so that we express our regular expressions using literal strings (which can be concatenated if necessary) instead of regex literals.

Other commits include:

  • using 0.20.7 to generate src content -- I used --no-bindings and --api 13 in the generate subcommand invocation
  • removal of *ependencies info from package.json as described here
  • removal of scripts from package.json -- this is meant to be partially along the lines of not being misleading
  • addition of bb.edn with supporting files along the lines described here and here
  • an attempt to address Does not handle metadata that is a tagged literal #46 as described here

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 28, 2023

May be it makes sense to mention the dev branch in #33 and possibly in the README [1].

Things that might be worth mentioning might include:

  1. get some sense of upcoming changes by examining commits but also trying things out
  2. don't be surprised at breakage on the dev branch

Also, I suspect we don't want to accumulate too much in the dev branch (for too long) without having the parts we want to keep make it to our primary branch.


[1] Thinking to rework the README in light of the changes on dev.

@dannyfreeman
Copy link
Collaborator

I personally don't think the permanent issue is worth mentioning the dev branch in, at least not right now.

I think we just use it to say we are about to merge changes into the master branch from the dev branch when the time is ready. At that time we can say, try the changes out on the dev branch if you are interested. Does that make sense? No need to ping everyone when there isn't really anything for them act on.

@sogaiu
Copy link
Owner Author

sogaiu commented Mar 1, 2023

Thanks for the feedback.

I think we just use it to say we are about to merge changes into the master branch from the dev branch when the time is ready. At that time we can say, try the changes out on the dev branch if you are interested.

Ok, let's do it that way.

Over time the existence and purpose of the dev branch should percolate and the content of the updated README mentioned in #47 might help a bit along those lines too.

@sogaiu
Copy link
Owner Author

sogaiu commented Mar 1, 2023

Since we now have a dev branch, perhaps we can view this issue as resolved :)

@sogaiu sogaiu closed this as completed Mar 1, 2023
@sogaiu
Copy link
Owner Author

sogaiu commented Mar 12, 2023

I personally don't think the permanent issue is worth mentioning the dev branch in, at least not right now.

I think we just use it to say we are about to merge changes into the master branch from the dev branch when the time is ready. At that time we can say, try the changes out on the dev branch if you are interested. Does that make sense? No need to ping everyone when there isn't really anything for them act on.

These ideas turned out to be unexpectedly useful today -- I ended up making a pre-0.0.12 branch because some the dev branch had bits on it that I didn't want to preserve (because of the Babashka tasks stuff being split off) and it was easier to create a new branch and add to existing issues than to change what was on the dev branch and go back and edit past comments.

I think it might possibly be more useful to have a pre-<version> branch going forward than to have a dedicated dev branch. Not sure, but rather than commit to something, I removed mention of a policy about the dev branch from our upcoming README -- this is to retain more flexibility.

@sogaiu
Copy link
Owner Author

sogaiu commented Mar 13, 2023

Regarding:

our primary branch of development is being depended on by a number of users. It's become a sort of de-facto release branch.

Another way to view the current situation is that some of our "artifacts" (src/parser.c and friends) are in some sense "binaries" and that more than a few of our users depend on this repository just to get these latter bits.

So the repository is serving more than one purpose -- as a home for the "source" (i.e. grammar.js) and a home for "binaries" (i.e. src/parser.c and friends).

Perhaps in Clojure terms, this might be expressed as a form of complecting.

On a more serious note though, this makes a grammar repository with generated files in it more of a juicy target security-wise. This point was elaborated on here in 2022-02 by theHamsta (of the nvim-treesitter project):

The above scenario is not ideal, since we effectively have to trust all those 98 repositories. If we consume grammar.js directly, we have to trust that no malicious JS code is executed. Same holds for installing dependencies via npm. It's also to hide obfuscated code in one of the generated parser.c or manipulate one of the included headers in some way which could trick some CI that tries to verify a generated source file. We have the chance to review the changes that are performed, since we have to approve every version bump generated by a GH actions workflow. But it would still be easy to sneak in malicious code via a npm dependency or in the per-generated C files that we currently don't very, especially given the high number of parsers we are distributing.

I had missed his point about npm dependencies when I had read the above text before, but now I'm glad we're moving in the direction of removing our dependency on it.

Our grammar.js is very straight-forward and I think it would be rather challenging to get it to do questionable stuff (at least not without being detected). From what I understand, other folks' grammar.js files can get "fancier" though, for example, by requireing other files.

I wasn't too concerned about having src/parser.c and friends in the repository before but now I'd be happier for tree-sitter grammar repositories to be moving away from having those bits living in the repositories. I doubt very many people scrutinize how src/parser.c changes as it can get quite large. Perhaps there could be some kind of linting / scanning tool to check for weirdness?

I think the Rust and Node.js bindings are also potential targets so it might make sense to consider phasing those out (I've added instructions in the upcoming README regarding how to generate them so people who need to do so can, in theory).

Having stated these things, I imagine there are plenty of other juicier looking tree-sitter grammar repositories compared to tree-sitter-clojure. In this case, perhaps there is more safety in smaller numbers...


On a side note, I started thinking it might be useful for there to be a C program that could generate src/parser.c and friends from grammar.js or grammar.json. The primary benefit of this would be that "consumers" of grammars such as Emacs 29+ would be in a better position to generate src/parser.c and friends.

I believe the current situation is not ideal (at least not for Emacs 29+), because to use the tree-sitter cli's generate subcommand, one needs to somehow arrange for the cli to be available and ATM it seems like one needs Rust tooling. I doubt Emacs will add Rust tooling as a dependency.

I wonder if this sort of thing couldn't be applied usefully in this situation.

@sogaiu sogaiu reopened this Mar 13, 2023
@sogaiu
Copy link
Owner Author

sogaiu commented May 8, 2023

I'm going to close this for the moment, we can reopen this issue later if necessary.

@sogaiu sogaiu closed this as completed May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants