What if canonical source representation was binary encoded like wasm itself #2

Open
Gozala opened this issue Jan 15, 2018 · 5 comments


@Gozala

Gozala commented Jan 15, 2018

Hi,

I just came across forest-lang as I was exploring some of my own programming language ideas. The idea of decoupling syntax choice from language choice especially resonates with me; at the end of the day it's no more important than the choice of an editor or a palette for syntax highlighting. In fact John McCarthy envisioned Lisp as such and expected that different representations could be used depending on the problem domain a program was aimed at. Sadly that vision did not materialize.

Another angle from which I have been thinking about this is a language where modules could be saved / hosted directly in a content-addressable distributed network like https://ipfs.io/. From that angle it would make so much sense for the language representation to be agnostic of spacing or definition order, to make content addressing free of human factors like a sense of aesthetics. For example, unfortunately, exactly the same JSON data could end up having a different content address depending on how it was formatted or sorted.
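
The JSON point can be shown with a minimal Python sketch. It assumes a plain hex SHA-256 digest as a stand-in for a content address (IPFS itself uses a different identifier format), and `content_address` and `canonical_json` are hypothetical helper names, not part of any existing tool.

```python
import hashlib
import json

def content_address(data: bytes) -> str:
    """Assumed stand-in for a content address: hex SHA-256 of the raw bytes."""
    return hashlib.sha256(data).hexdigest()

def canonical_json(value) -> bytes:
    """One possible canonical form: sorted keys, no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")

# The same data, formatted and ordered in two different ways.
pretty = b'{\n  "name": "forest",\n  "version": 1\n}'
compact = b'{"version":1,"name":"forest"}'

same_data = json.loads(pretty) == json.loads(compact)
raw_addresses_differ = content_address(pretty) != content_address(compact)
canonical_addresses_match = (
    content_address(canonical_json(json.loads(pretty)))
    == content_address(canonical_json(json.loads(compact)))
)
```

Hashing the bytes as written gives two addresses for one value, while hashing a canonical encoding gives one. A representation with no spacing or ordering freedom would get the second property without a separate canonicalization step.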

In my own exploration I ended up concluding that choosing a binary representation over a textual one would organically make syntax a user choice, and alternative presentations of it could be created. I thought I'd throw this out here, especially since the concept of forest-lang seems to align closely with what I'd like from the language.

@Widdershin
Member

Hey, just saw this!

Thanks for taking the time to make this issue.

First of all, I really like this idea, and it aligns very well with what I'm going for.

In my brain, the canonical syntax would be something that the project can choose. You could create a binary syntax, and then use it as your canonical representation.

One reason why I wouldn't necessarily want to always require a binary representation as canon is that I want the canon source to be the only actual code files checked into the repository, with the user's preferred syntax available through a FUSE virtual directory.

The intention here is to support stuff like git diff and pull requests, which would not work well with a straight binary representation. I want to teach git and github how to project code into the desired syntax, but I still want it to work in the fallback case.

Is that a problem you've considered for your use case?

Storing packages in IPFS is an amazing idea that I hadn't considered. I was probably just going to piggyback off npm in the short term, as they have really great tools, but longer term if we look into hosting infrastructure for a package manager I love that idea.

I'm interested to see what you're working on as well, could you send me a link?

@Gozala
Author

Gozala commented Jan 25, 2018

Thanks for responding @Widdershin, I'll respond inline below:

> Hey, just saw this!
>
> Thanks for taking the time to make this issue.

Of course! Forest seems well aligned with what I'd like to get myself, so starting a discussion seemed like the right thing to do; if there is enough overlap in goals we could possibly share our efforts.

> First of all, I really like this idea, and it aligns very well with what I'm going for.

👍

> In my brain, the canonical syntax would be something that the project can choose. You could create a binary syntax, and then use it as your canonical representation.

Here is my argument:

  1. If there is a human-readable canonical representation, it will (unintentionally) be a first-class citizen and all the other representations will be second class. In my opinion the only way to truly make syntax a choice is to develop it separately, as that would ensure that alternate syntaxes have exactly the same base to start from. In fact an alternative syntax would probably just start as a fork of a pre-existing one.
  2. Getting rid of the actual human-readable syntax makes the work put into a parser unnecessary, or rather moves that work into a separate effort. In other words, you can bootstrap really easily. A very simple human-readable syntax, e.g. one based on Lisp s-expressions, could be added with tiny effort and, again, separately from the language itself.
  3. If you remove human readability from the requirements, there are a lot of IMO more important things you can optimize the representation for: the WASM streaming parser is a good illustration of one such optimization. Making human-targeted changes be no change is another. As I mentioned, I'd love to optimize the representation for hashability and possibly even store the name mapping separately from the actual code. For instance, if your library did a bunch of renaming, its users wouldn't need to change anything, as their code would refer to things by hash; updating the dependency would automatically reflect the new names in the user's code without any changes. It could also allow users to override the names they wish to use.
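
The renaming idea in point 3 can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the list-based encoding of definitions, the truncated SHA-256 used as a reference, and the helper names `definition_hash` and `render`. None of it is Forest's actual representation.

```python
import hashlib
import json

def definition_hash(body) -> str:
    """Hash a name-free definition body (assumed encoding: canonical JSON)."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]

# A library definition stored without any name; parameters are positional.
add_body = ["lambda", 2, ["prim.add", ["param", 0], ["param", 1]]]
add_ref = definition_hash(add_body)

# User code refers to the library function by hash, never by name.
user_body = ["call", ["ref", add_ref], ["lit", 1], ["lit", 2]]

def render(node, names):
    """Project a body into s-expression-like text using a separate name table."""
    if isinstance(node, list):
        if node and node[0] == "ref":
            return names.get(node[1], "#" + node[1])
        return "(" + " ".join(render(child, names) for child in node) + ")"
    return str(node)

names_v1 = {add_ref: "add"}
names_v2 = {add_ref: "plus"}               # the library renamed its function
names_mine = {**names_v2, add_ref: "sum"}  # a user-level override

before = render(user_body, names_v1)
after = render(user_body, names_v2)
mine = render(user_body, names_mine)
```

The stored `user_body` is identical in all three cases; only the name table changes, so a rename in the library shows up in the projected text without touching the user's code or its hash.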

> One reason why I wouldn't necessarily want to always require a binary representation as canon is that I want the canon source to be the only actual code files checked into the repository, with the user's preferred syntax available through a FUSE virtual directory.

I don't think the use of a binary representation makes it any more difficult here; in fact it makes it easier, as an alternative syntax would not need to parse the canonical source and then translate it. It would just read the binary AST representation and project it into the alternative syntax. In other words, it would avoid the parse phase and consequently be free of a dependency on Haskell, or whatever language the parser might be rewritten in.
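
A rough Python sketch of that flow, assuming a made-up binary layout (a tag byte, then either a little-endian 32-bit integer or an opcode byte followed by two child nodes). The layout, the opcode table, and all function names are hypothetical. The binary form is still decoded into a tree, but no text parser is involved, and each syntax is just a printer over that tree.

```python
import struct

LIT, BINOP = 0, 1          # hypothetical node tags
OPS = {0: "+", 1: "*"}     # hypothetical opcode table

def encode(node) -> bytes:
    """Encode ('lit', n) or ('binop', op, left, right) in prefix order."""
    if node[0] == "lit":
        return struct.pack("<Bi", LIT, node[1])
    _, op, left, right = node
    return struct.pack("<BB", BINOP, op) + encode(left) + encode(right)

def decode(buf: bytes, pos: int = 0):
    """Read one node starting at pos; return (node, next_pos)."""
    if buf[pos] == LIT:
        (value,) = struct.unpack_from("<i", buf, pos + 1)
        return ("lit", value), pos + 5
    op = buf[pos + 1]
    left, pos = decode(buf, pos + 2)
    right, pos = decode(buf, pos)
    return ("binop", op, left, right), pos

def to_sexpr(node) -> str:
    """Project the tree into a Lisp-like prefix syntax."""
    if node[0] == "lit":
        return str(node[1])
    _, op, left, right = node
    return f"({OPS[op]} {to_sexpr(left)} {to_sexpr(right)})"

def to_infix(node) -> str:
    """Project the same tree into a fully parenthesized infix syntax."""
    if node[0] == "lit":
        return str(node[1])
    _, op, left, right = node
    return f"({to_infix(left)} {OPS[op]} {to_infix(right)})"

tree = ("binop", 0, ("lit", 1), ("binop", 1, ("lit", 2), ("lit", 3)))
blob = encode(tree)
decoded, end = decode(blob)
```

Here `to_sexpr(decoded)` and `to_infix(decoded)` are two projections of the same bytes, and neither printer depends on the other or on a canonical text form.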

> The intention here is to support stuff like git diff and pull requests, which would not work well with a straight binary representation. I want to teach git and github how to project code into the desired syntax, but I still want it to work in the fallback case.

I agree that having human-readable syntax and files is important for tools like git, which is really unfortunate, and it seems like it would be really difficult to move past human-readable files without reinventing the toolchain.

On the other hand, I suspect that users would either choose to use the canonical representation so that existing tools like git work as expected, or they would just keep the syntax they want to use in the source control system so they can see diffs in a syntax they understand.

In other words it's a 🐔 🥚 problem: you can't expect syntax to truly be a user choice unless the tools people use fundamentally support this. Sadly git does not, and there is no way around that. I think what would make sense is to develop a forest-diff tool that would produce an AST diff that one could pipe into a chosen syntax and thereby project changes in that syntax. forest-diff --syntax my-forst-syntax could then be configured as a git difftool and hence provide somewhat reasonable support.
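
What such an AST diff might look like, as a small Python sketch. No forest-diff tool exists in this thread; the tuple-based AST, the path notation, and the names `ast_diff`, `render_diff`, `sexpr` and `infix` are all assumptions for illustration.

```python
SYMBOLS = {"add": "+", "mul": "*"}  # hypothetical operator names

def ast_diff(old, new, path=()):
    """Yield (path, old_subtree, new_subtree) for each differing subtree."""
    if old == new:
        return
    both_nodes = isinstance(old, tuple) and isinstance(new, tuple)
    if both_nodes and old[0] == new[0] and len(old) == len(new):
        # Same kind of node: descend and report only the children that changed.
        for index, (a, b) in enumerate(zip(old[1:], new[1:]), start=1):
            yield from ast_diff(a, b, path + (index,))
    else:
        yield path, old, new

def sexpr(node) -> str:
    """Prefix projection; leaves are plain integers."""
    if isinstance(node, int):
        return str(node)
    return f"({SYMBOLS[node[0]]} {sexpr(node[1])} {sexpr(node[2])})"

def infix(node) -> str:
    """Infix projection of the same tree."""
    if isinstance(node, int):
        return str(node)
    return f"({infix(node[1])} {SYMBOLS[node[0]]} {infix(node[2])})"

def render_diff(old, new, printer):
    """Render each changed subtree through the reader's chosen syntax."""
    lines = []
    for path, before, after in ast_diff(old, new):
        where = "/".join(map(str, path)) or "root"
        lines.append(f"@{where} - {printer(before)}")
        lines.append(f"@{where} + {printer(after)}")
    return lines

old_tree = ("add", 1, ("mul", 2, 3))
new_tree = ("add", 1, ("add", 2, 3))
as_sexpr = render_diff(old_tree, new_tree, sexpr)
as_infix = render_diff(old_tree, new_tree, infix)
```

The diff itself is computed once on the tree; `printer` is the only syntax-specific part, so the same change can be shown to each reader in their own syntax, and a formatting-only edit produces no diff at all.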

Sure, that would not work across all tools, but that kind of supports the "no second-class syntax" goal: a tool that doesn't work would fail equally for every syntax, and fixing support for a tool would fix it across all syntax flavors.

> Is that a problem you've considered for your use case?

I'm kind of hoping to move from "human-readable files" to "content-addressable code", and I'm not entirely sure how git fits there. I think it's worth mentioning https://ipld.io/ here, which I believe is inspired by the Plan 9 file system. What's interesting about it is that it allows bridging different content-addressable systems and already supports git and IPFS, so I was hoping it could allow bridging this theoretical language with git. In fact it would allow presenting individual language entities as separate entries (think of navigating a package like a filesystem where modules are directories, functions are files and imports are symlinks).
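
The filesystem analogy can be sketched with a tiny in-memory content-addressed store in Python. This is not IPLD's data model or API; the `store` dictionary, the JSON-plus-SHA-256 addressing, and the `put` / `resolve` helpers are assumptions chosen to keep the example self-contained.

```python
import hashlib
import json

store = {}  # hypothetical in-memory stand-in for a content-addressed network

def put(obj) -> str:
    """Store obj under the SHA-256 of its canonical JSON and return that key."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = obj
    return key

def resolve(root: str, path: str):
    """Follow a 'module/function' path by hopping from link to link."""
    obj = store[root]
    for part in path.split("/"):
        obj = store[obj["entries"][part]]
    return obj

# Functions play the role of files.
double = put({"kind": "function", "body": ["mul", ["param", 0], 2]})
main = put({"kind": "function", "body": ["call", ["ref", double], 21]})

# Modules play the role of directories; an import is just another link
# to the same hash, much like a symlink.
math_mod = put({"kind": "module", "entries": {"double": double}})
app_mod = put({"kind": "module", "entries": {"main": main, "double": double}})
package = put({"kind": "package", "entries": {"math": math_mod, "app": app_mod}})
```

With this layout `resolve(package, "math/double")` and `resolve(package, "app/double")` reach the same stored function, because the import in `app` is a link to the definition's hash rather than a copy of it.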

> Storing packages in IPFS is an amazing idea that I hadn't considered. I was probably just going to piggyback off npm in the short term, as they have really great tools, but longer term if we look into hosting infrastructure for a package manager I love that idea.
>
> I'm interested to see what you're working on as well, could you send me a link?

Right now it's just a set of notes describing my wish list for this and some ideas for how those could be realized. I could possibly compile them into a document describing this hypothetical language that I'd like to build.

Here is also a somewhat abstract vision that touches on some of the topics above:

I envision something like the Mathematica Notebook interface in the browser, where you describe the problem and solution in markdown and have code blocks in whichever syntax best fits the problem domain (I have an early experiment with this kind of interface at https://gozala.github.io/allusion/). That document is essentially your package, and code blocks are saved in that binary AST representation so you can switch syntax live, in place. You could reference other such documents / packages as you would refer to other papers, except they are content-addressed and therefore the reader could navigate to them as well. I would imagine there being a canonical package registry along with other domain-specific ones, as a registry is essentially just a set of content addresses (I imagine it would be stored in the IPFS network).

The textual representation of the code is going to be just one of the projections, and I expect it to become less relevant over time. I am interested in having visual representations in the vein of flow-based programming, which I think will be more relevant given the explosion of new mediums (tablets, VR, AR) where keyboard input is inconvenient. Most recently I discovered http://www.luna-lang.org/, which is more or less how I imagined it. There is also https://noflojs.org/, but I think it's far less interesting given the lack of a type system, which is essential IMO.

I think wasm as a compile target is a natural choice. I want inferred static types and automatic, deterministic memory management, but via a Rust-like ownership system rather than RC, which can be built on top; in fact that's what RC is in Rust. I hope to build a higher-level layer with immutable data structures on top, so you could have something like the Elm language where you don't deal with memory management, since you only work with immutable data and have no way of creating cycles.

I am leaning towards the type system found in [Carp](https://github.com/carp-lang/Carp/blob/master/docs/Presentation.md), which is fully inferred, and where a choice is ambiguous for the compiler it just asks you to be more specific and say which of the compatible entities you meant (I think that presents some really compelling opportunities in a visual editor). I am also somewhat inspired by the Pony language, which marries ADTs and the actor model; throwing an ownership system and content-addressing of program constructs into the mix would allow treating any function as an actor, but that is far from fleshed out in my mind.

Sadly I lack expertise in many of the areas needed to pull this off; the most pressing gap is in type theory, which is what I'm mostly trying to learn about now. I did some experiments in generating wasm modules by adapting a Scheme compiler based on https://github.com/namin/inc, where I created a JS API to build up an AST from which the compiler would generate a wasm module with binaryen. The idea was that the AST could then be encoded into a binary format with either the flatbuffers or protocol-buffers library (which should make creating a syntax easy, as both libraries have pretty wide language support, and would essentially eliminate the need to write a binary representation parser / encoder, as both libraries just generate one from a schema definition). But as I was exploring this I realized that the type checker would significantly affect the generated wasm code, so I'm trying to learn enough type theory to be able to write a type checker (any references would be more than welcome).

I would also very much like to team up with someone more experienced in this 😅 Given some overlap in goals with Forest, I thought I'd see if we could converge; if nothing else, I could probably get some informative feedback.

Thanks & sorry it ended up quite long

@Widdershin
Member

> If there is a human-readable canonical representation, it will (unintentionally) be a first-class citizen and all the other representations will be second class. In my opinion the only way to truly make syntax a choice is to develop it separately, as that would ensure that alternate syntaxes have exactly the same base to start from. In fact an alternative syntax would probably just start as a fork of a pre-existing one.

Sorry, I think I was not clear. There will be no blessed canonical syntax for Forest as a language. No one syntax will be held above others. I think that a forest --init would help you choose a default syntax, or that possibly a configuration file on your computer would contain your preferred canonical and other syntaxes.

As much as possible, syntax is a userland concern. The syntax currently in development will eventually be a package, as will every other syntax.

I'm currently designing with the assumption that the user will be using source control similar to git, SVN or Mercurial. This is because I use source control for all of my work, personally and professionally.

From this assumption, I conclude that we should have exactly one representation of the source code on disk, in whatever syntax the collaborators choose. If you want, you can choose to use a minimal binary representation of the source code.

I would be open to storing packages in a bytecode syntax with name mappings, but when it comes to source code, I think that we're not yet ready to move away from human readable files in source control.

I agree that there is value in moving fully away from textual representation in source control, but I think this would hamper adoption. I think that a progressive enhancement strategy is more pragmatic. If people start using Forest in part because it doesn't require a huge workflow change, but they see that there is power in representations other than text, then that's a huge win. Once we're there, we can think about abandoning text on disk altogether.

> I think what would make sense is to develop a forest-diff tool that would produce an AST diff that one could pipe into a chosen syntax and thereby project changes in that syntax. forest-diff --syntax my-forst-syntax could then be configured as a git difftool and hence provide somewhat reasonable support.

Yep, totally agree that we should build these tools, along with web versions for code snippets and pull requests.

> Making human-targeted changes be no change is another. As I mentioned, I'd love to optimize the representation for hashability and possibly even store the name mapping separately from the actual code. For instance, if your library did a bunch of renaming, its users wouldn't need to change anything, as their code would refer to things by hash; updating the dependency would automatically reflect the new names in the user's code without any changes. It could also allow users to override the names they wish to use.

This is a feature I have been planning on but have yet to document. I want libraries to support translation into different languages, so storing names separately makes sense. I like the idea of using code hashes as the name in the representation, I hadn't considered that.

This feature ties in strongly with some plans I have for the type system, but I will write that up another time.

> In other words, it would avoid the parse phase and consequently be free of a dependency on Haskell, or whatever language the parser might be rewritten in.

I'm not sure I understand. How will the representation be reprinted without first parsing into a common data structure? Would the printer take the bytecode?

For the record, my current plan is to eventually reimplement the Forest compiler and syntaxes in Forest, so they can more easily be used in a web browser.

> I would also very much like to team up with someone more experienced in this 😅 Given some overlap in goals with Forest, I thought I'd see if we could converge; if nothing else, I could probably get some informative feedback.

I'm not sure that I could say I have any experience in this field. This is the first traditional compiler I've written, and I'm also quite new to Haskell. However, I'd still love to collaborate, even if it's just feedback and bouncing ideas off one another for the moment.

> Thanks & sorry it ended up quite long

No worries, your enthusiasm is infectious! Thanks for taking the time to respond 😄

@Gozala
Author

Gozala commented Jan 3, 2019

I have recently discovered http://unisonweb.org/, which might be interesting as they use Abstract Binding Trees for language representation. There are also a lot of other overlaps with the goals of Forest.

@fkohlgrueber

I just discovered this project and this great conversation in particular and wanted to let you know that I enjoyed reading it. There's a lot of ideas in it that I'd like to see come to life!

There's a community of people interested in projects and ideas like this at https://futureofcoding.org/community and if you like, you can join and discuss with us.

I'm looking forward to seeing where this is going, thanks for sharing!
