
Splitting the Repo (Discussion) please don't delete. #217

oblodgett opened this issue Feb 28, 2017 · 24 comments

@oblodgett
Member

oblodgett commented Feb 28, 2017

I would like to try something different: have a discussion under this issue about splitting the repo into areas of responsibility (UI, API, Indexer, Loader).

This way everyone can give well-thought-out answers and isn't put on the spot in a phone call. Please don't delete this until after 3/10/2017. The aim of this issue is to come to a collective conclusion on whether we split or not.

Please state whether you are for or against, and your reasons.

@oblodgett
Member Author

Since I am for splitting, here are my reasons for doing so.

  1. Git does not scale well (as much as we like it):
    a. Number of files
    b. Number of commits
    c. Number of refs (branches, tags, etc.)
    d. Size of files

As these grow, operations get slower and the .git directory keeps growing, which makes cloning the repo slower. I once worked on a repo that was nearly 1 GB in size; it took 10+ minutes to clone from GitHub.
Right now agr is 170 MB and takes 26 seconds to clone on a 15 Mb connection.
Continuous integration clones the repo and checks out HEAD on the branch it watches, storing all the old versions as the history of past builds.
We have already run out of space on the main drive of the dev server due to clones in home directories. With CI and developers working on /vol1, it's only a matter of time there too. Even if the decision is made to keep one big repo, we will need to remove the data files from it and scrub (rewrite) the git history to reclaim that space.

  2. Areas of responsibility. As this repo grows, there is going to be a lot of code in each of the areas I have listed. It is typically easier to work on a repo that only contains code relevant to what you are working on: the directories are straightforward, and there is no clutter from all the other things the repo does that don't pertain to your work. We will conceptually be managing four projects in one repo. Splitting gives each repo its own issue tracking (granted, some issues will have to be duplicated) and its own wiki. This concern would go away once we get JIRA or another work-management system in place. Right now the README file is getting large, and it's not clear what you need to do to get everything up and running.

  3. Proper project setup. Typically, when you set up a (Python, Ruby, JS, etc.) project, there is a file-structure convention for where things are located. We have both JS and Python going on in the same repo: a JS application, a Flask application, and a Python command-line application all living in the same space and sharing NO code with each other. The convention I see on GitHub is that projects are at least split along programming-language boundaries.

@adamjohnwright
Member

I agree with all the points listed above.

I would like to add that I believe splitting the repos would also make it easier to write proper documentation. It would be relatively clear if a part of the project was not documented correctly, because its documentation would not be mixed in with the rest of the project. I am sure it is possible to write accurate and complete documentation with a single repo as well; I just think that breaking it apart would make it clear where to read about the part you are working with. The README is a good example: if we split out the indexer, its README would be the obvious place to look to learn how to use the tool.

The second point I would like to add is that the AGR is supposed to be creating reusable code. I think it would be significantly more likely for a tool to be usable by another project if we split the tools into their own repos. Someone coming into the AGR organization will likely have a harder time navigating our work if we have it all in one monolithic repo.

These two points aside, I can see how a single repo could have its advantages. It would be good for people to list the downsides of splitting as well, so we can make a fully informed decision.

@cmpich
Contributor

cmpich commented Feb 28, 2017

I see the pros of splitting the repository. If designed well, the UI and the backend could be two independent 'apps'.
My point for keeping a single repository is: we still have too much interaction between the UI and the backend. The so-called API between them, a REST API, is still very much in flux as we add new data types and mature the basic gene detail page. If I make changes to the API, I might want to make the UI changes at the same time. That is how we work at ZFIN: we try to keep the complete path, front to back, in mind, and everybody is involved in both the front end and the back end. Having two repos means you have to check out twice and check in twice, creating two branches, two pull requests, and two production branches. It doubles those efforts. Yes, there are cases where a front-end change does not require a backend rebuild, but it's harder to keep the two repos in sync: you don't know which version of the backend repo goes with which version of the UI repo!

Regarding Olin's first point: here too, if my change affects both the front end and the backend, I have to clone both repos, and splitting has not gained anything, since you still have to download all the files, including history, from the two places that make up the full application. Once we are in a position where most changes to one do not affect the other, you would gain from the split.

Another con: it's harder to write integration tests. Each repo can test its own functionality, but the underlying assumption is the agreed-upon API, and that API may not be that well defined (including all the edge cases, such as empty vs. null).

One pro: you are forced to think harder about the API with which the UI talks to the backend.

@oblodgett
Member Author

Looking at all the commits in the repo, I tried to figure out how much impact splitting would have from an administrative point of view, with respect to duplicate entries.

Total commits: 609
Mixed commits: 20

Mixed commits are those where .py files and .js or .css files appear in the same commit.

dcf5c3b - Travis Sheppard Sep 27, 2016 js => 4 py => 1
221c5b9 - Travis Sheppard Sep 28, 2016 css => 4 js => 13 py => 1
86637ea - Travis Sheppard Sep 28, 2016 js => 1 py => 1
e628e3a - Travis Sheppard Sep 29, 2016 js => 2 py => 1
07df9c0 - Travis Sheppard Oct 24, 2016 js => 10 py => 1
9875168 - Travis Sheppard Oct 25, 2016 js => 1 py => 1
92e3f6c - Travis Sheppard Sep 26, 2016 js => 1 py => 1
881cb31 - Travis Sheppard Oct 27, 2016 css => 1 js => 6 py => 1
1f81832 - Travis Sheppard Oct 28, 2016 js => 1 py => 1
69c5abc - Travis Sheppard Oct 28, 2016 js => 3 py => 1
4fe1117 - Travis Sheppard Oct 31, 2016 js => 1 py => 1
40b63c9 - Travis Sheppard Nov 1, 2016 js => 5 py => 1
f05863c - Travis Sheppard Nov 1, 2016 js => 1 py => 1
2e07668 - ragingsquirrel3 Jan 16, 2017 css => 1 js => 3 py => 1
b0b59b1 - paaatrick 28 days ago css => 1 js => 3 py => 1
4ff1739 - Travis Sheppard 25 days ago js => 8 py => 2
c6eed4e - kevinschaper 20 days ago js => 3 py => 13
c625cf4 - kevinschaper 18 days ago js => 2 py => 2
f229a8e - kevinschaper 6 days ago js => 1 py => 1
3cf731e - kevinschaper 5 days ago js => 3 py => 3

Most of Travis's commits were during the prototype phase, and only involved server.py.
Kevin's commits included mapping.py, which is effectively our API at the moment.

@paaatrick
Contributor

My two thoughts:

  1. I can see (I think) how this simplifies the backend development workflow. You run your Flask application, request your API endpoints, make sure the JSON response looks good, and call it a day. No messing around with npm or JavaScript nonsense. But what does this world look like for a frontend developer? Do I need to check out the backend repo and run it? Am I somehow pointing my local development site to a shared backend? What version is it running, and is it always available? (To be clear, I'm not looking for answers to these questions here; I'm looking for the person(s) splitting the repo to solve these problems and document the solution.)

  2. This has already been briefly mentioned, but let me put a finer point on it: I can't see how we can keep using GitHub for issue tracking with separate repos. Sure, developers will have a pretty good idea of which repo to put issues in. Even so, there will be mistakes, or issues that at first look like frontend issues but turn out to be backend, or vice versa. And then there are curators! How would they ever know which repo to use for issues?

@oblodgett
Member Author

More perspective... coming from a Java/JBoss server environment, I didn't understand Flask at first; I couldn't just do the things I was accustomed to with JBoss. I tried to understand Flask in terms of JBoss, and here is how I would explain it.

It would be like building one WAR file with JBoss and your code inside it. You would start it with java -jar myapp.war. If you wanted to change the code, you would have to stop the running WAR, rebuild it, and start it again via java -jar myapp.war. The WAR contains the REST endpoints and the connections to the DB and ES.

From a frontend developer's perspective, I rebuild my JS file into the directory being served by the running Flask server, hit refresh in the browser, and it works just like deploying a WAR file to a JBoss server. I agree with @paaatrick that splitting the repo might make this process more cumbersome for the frontend devs; a good process and documentation should keep development no different than it is now.

On the backend side, Flask does watch some of the code directories and will reload when it sees changes, but if there is an error in development mode, the server just crashes, i.e. stops running. In production mode, the errors go to the browser or to log files and the server keeps running.

Yesterday, however, the code changed under the running server, it hung, and for whatever reason the kill process did not stop it; the start process then failed because there was already a server running on that port. In the JBoss realm, the server sits outside of the CI process and just acts as a resource for deployed code, whereas the Flask server IS the deployed code. JBoss is an application server that can serve many applications; Flask is the application that happens to start a server.

At MGI our Flask install is split into 3 repos: Flask with the server code / API pages etc.; a models and libraries repo (the DB connector API) that other Python projects could make use of in the future; and a JS repo for all the JavaScript used on the site. Loads are in repos totally independent of this stack.

JS <- Flask Server <- DB Connector / Models <- Database

In our case, when the models change the server has to be restarted, so those two repos (server / models) really should be one.

I'm sorry if the group already knows all of this; I had to wrap my head around it when we started using Flask at MGI, and I'm just sharing what I learned.

@christabone
Contributor

Also just wanted to chime in regarding issues/tickets real quick.

There has been talk of an over-arching issue tracking system (a la JIRA) to be implemented on top of GitHub and everything else. We could direct curators / PIs / anyone to submit tickets which would then be associated with GitHub issues by developers. In terms of project management, it feels like a necessity at some point (and I believe this is the direction we're heading with AGR? Please feel free to correct me if I'm mistaken).

@LucyHut
Contributor

LucyHut commented Mar 1, 2017

The three points Olin listed in favor of splitting the agr repo into areas of responsibility are valid and straightforward, but a little simplistic from a frontend developer's point of view. As a frontend developer, splitting agr into multiple repos would add another layer of complexity to the process, making it harder to set up a fully functional instance of agr in a dev environment.

Patrick already made some good points on what life would be like for frontend developers if we adopted the multiple-repos solution. I would add that even if I cloned all these repos into my development environment, I could still be faced with the overhead of software dependencies and version conflicts.

To Olin's first point on running out of space as the number and size of files grow: it looks like the size of the repo grows more locally than remotely, since the index files are created locally on the dev/prod server, so we could still have that issue even if we were running multiple repos. As for the reduced speed of cloning a large repo, GitHub also gives developers the option to clone a single branch within a repo, so you don't have to clone the entire thing.

@stuartmiyasato

There has been talk of an over-arching issue tracking system (a la JIRA) to be implemented on top of GitHub and everything else. We could direct curators / PIs / anyone to submit tickets which would then be associated with GitHub issues by developers. In terms of project management, it feels like a necessity at some point (and I believe this is the direction we're heading with AGR? Please feel free to correct me if I'm mistaken).

Just wanted to tag @kltm and see if he has any comments about this, given that the GO project has just done the exact opposite -- they are dropping JIRA to move exclusively to GitHub for tracking! (Now speaking in the third person since as of today I am no longer officially affiliated with GO...) :)

@kltm

kltm commented Mar 1, 2017

Also tagging @cmungall.
We tried this with JIRA in another project, and it did not go very well; moreover, nobody used it. One of the issues is the sync and lag between the two systems. Having a single definitive resource is very useful, and if that ended up being JIRA, it rather defeats the purpose of using GitHub.
If people are interested in higher-level management, it might be worth really exploring what GitHub Projects offers, or the use of more "overlay"-type systems like waffle.io.
What exactly are the use cases here, other than to have more "management"?

@oblodgett
Member Author

This topic is kind of going in different directions, but I wonder what the GO group's reasons are for using JIRA, whether JIRA lends itself better to larger or smaller groups, and if so, how AGR compares, considering there are a lot of people involved.

@kltm

kltm commented Mar 2, 2017

Well, the GO group has experimented with JIRA at least two times that I participated in.

The first, and most successful, instance was as an email-controlled issue tracker for the GO Helpdesk. The interface was a bit clunky and hard to script, but it did its job very well (clunkiness aside), with the only downsides being maintenance and overhead.

The other instance was run as an internal management tool, as an attempt to coordinate and manage projects at a higher level. This ended up being...not a good match. There was also the issue of coordinating between the actual issues (say GitHub or Trello) and JIRA.

There is also a third JIRA experience with a non-GO project, where it was to be used as a "managerial wrapper" around GitHub. While it sorta worked, there was little interest in following through, due to the reasons above (#217 (comment)) and a number of rough edges that proved hard to sand off.

I don't think it's so much an issue of a large or small group, but more about management style and needs. I also think that at the end of the day there needs to be a single definitive source for actions and history, no matter what is chosen. For us, GitHub is working quite well as both a management and VC system. Also, in my experience, every time there have been multiple ticketing/management systems that are used for different things within an org, there ends up being confusion and dangling ends. YMMV :)

@ragingsquirrel3
Contributor

I'm pretty against this, based on my experience. We have a split for SGD, and it causes lots of pain. While in some cases a frontend or backend developer can more easily focus on their portion of the stack, most of the time a given feature requires code across the stack, and keeping the FE/BE changes on a single branch is a great simplification. With two repos, you have to coordinate which branch goes with which repo, manage two runtime environments, and deal with multiple configurations and deployments.

tl;dr 1 repo is simpler and better in this case.

@oblodgett
Member Author

So I understand managing one branch across multiple repos. Here at MGI, in the last software release we had one branch across 30 different repos, and in the current project we are up to one branch across 14 repos. In our case it is a little more difficult to manage, but worth it in the long run, because we didn't touch the full stack, only 30 out of the 175 repos in the stack.

Help me understand the two runtime environments. Right now, without the split, we have the npm environment and the Python environment with all the files mixed together. This proposal is to split those two environments into separate repos, so that each repo only has to manage one runtime environment. Maybe we differ on the definition of an environment?

While the project is small, I understand that it's easier to make full-stack commits. However, based on the report I ran earlier in this issue, that happens in only 1 in 30 commits, and 65% of those were done by Travis; as this project grows, we can already see that the trend for "mixed commits" will be less and less.

I also don't understand the multiple configurations. Every piece will need a config, as it does now; however, in splitting the repo, the configuration for each repo is simplified. The frontend, for instance, no longer needs to know about the ES index or ES host. The makefiles have become a lot simpler: each repo only needs a "make" and a "make run".

So help me understand where the "lots of pain" comes from. My aim in working on this split is to make it easier for developers to develop, with less knowledge needed to set up the repo and more time spent coding.

@ragingsquirrel3
Contributor

I didn't mean to be so brash in my previous response. Let me respond further in the frame of "here are some pros and cons of working with split repos, in my experience," and if the group decides one way or the other, that's fine.

In the SGD split repos (2 repos), there is a lot of duplication. The frontend part doesn't have a database, but still needs a version of Python, a version of Pyramid, and other server-side pieces to be able to render a basic webpage, in addition to the JS. The backend must repeat some of those dependencies in order to run its own server. In this case, "2 runtime environments" means 2 servers running and 2 sets of Python dependencies (in addition to the JS dependencies). I'm sure there's a better way to do it, but as it stands, that is a source of much unnecessary work.

For the AGR code, I don't really see how the frontend could have 0 Python dependencies, unless it was converted to run a Node.js server.

I understand the goal of making it easier. However, in our case, it adds a lot of mental overhead. The backend runs as an independent server with JSON endpoints, and the frontend reads from them. It can point to a local backend instance, or the production version. When developing, I am always asking "Which version of the backend do I want? Which one am I using? Does that version have the feature I need for this frontend branch?" I would prefer for a git branch to handle that logic as much as possible. On the other hand, the split version has some benefits, avoiding the headache of "Do I need to reindex? Do I need to update my data for this feature I'm working on?"

I don't think the limited number of "split commits" is a good reason to split. That has more to do with the frequency of commits than with workflow. It's really nice to keep related features on the same branch of a single repo. Multiple people can work on the same branch and commit to different parts of the application without ever making a split commit. Once a feature is ready, there's 1 merge, 1 deployment.

At some point, the size of an application will certainly merit the effort of splitting. This application, however, is still pretty simple and I just don't think it's worth the complexity.

The single application requires a little more knowledge from everyone, but I would rather compensate by having a single set of documentation and trying to simplify the setup scripts. This has the additional benefit of increasing the number of people who know how to install and run the full application. If we split, it will be interesting to see how many people will know how to run the entire application.

@oblodgett
Member Author

oblodgett commented Mar 9, 2017

So... I have been working on a split repo branch (issue_231) for the last few days.

About the multiple Python environments: I was ready to scrap this idea if this didn't work, because I agree with you, maintaining both doesn't make sense. Why does the JS need Python? However, I found a way to remove the Python requirement completely from the UI, so you don't have to maintain Python there at all. It runs with npm start, which runs webpack-dev-server --history-api-fallback --hot --inline and serves the webpack bundle from port 2992. The next issue I had to overcome was that the UI goes to /api for all calls to the "API". I found an option for webpack.config.babel.js where you can proxy certain parts of the request to another host, so localhost:2992/api points to API_URL, which defaults to localhost but could be changed to point to dev or staging, for instance. With this setup the UI runs without Python. You do have to start two servers now, one for the UI and one for the API, but one nice thing is that the API server now only logs requests for API calls, not everything; the API server starts on localhost:5000.
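
Roughly, the proxy part of webpack.config.babel.js looks something like this simplified sketch (the exact option names follow webpack-dev-server's proxy support, and reading API_URL from an environment variable is an assumption here; see the branch for the real config):

```js
// webpack.config.babel.js (sketch): proxy /api from the dev server to the Flask API.
// Assumption: API_URL comes from the environment and falls back to the local Flask port.
const API_URL = process.env.API_URL || 'http://localhost:5000';

export default {
  // ...entry, output, and loaders as already configured...
  devServer: {
    port: 2992,
    hot: true,
    inline: true,
    historyApiFallback: true,
    proxy: {
      // Any request to /api is forwarded to the API server, so the UI needs no Python.
      '/api': {
        target: API_URL,
        changeOrigin: true,
      },
    },
  },
};
```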

Where to point for the data is always a trade-off: "Where's my data?" versus "I just need to get this bug fixed, and I'm glad I don't need the whole stack."

I think splitting early will be better than trying to split later, when it's orders of magnitude harder. But I guess this might just be a subjective judgment.

To your last point: I develop locally and like running the whole stack locally. My laptop (16 GB) doesn't like it, but I find I am most efficient this way, since I can use all the local tools installed on my laptop to develop code. So I realize it's vitally important that a full local setup be no more difficult than setting this up on a server or with multiple repos. What I have done is split the documentation out for each piece, and the setups have become much simpler. However, in doing that I realized it makes each individual setup easy but loses the larger picture of how the whole system works. So another idea is to have a "root" repo that holds the documentation for a full setup and diagrams of how the system pieces together, and maybe some helper scripts to set up the full stack.

The default configs will be for a local setup, so if you want to run everything, you would clone the main repo and then run the setup script, which would clone the other repos, run all the make installs and make builds, run the indexer, download all the files from S3, etc.
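
Just to illustrate the idea, such a helper script could be a short Node script along these lines (purely a sketch; the repo names and make targets here are hypothetical placeholders, not the actual layout):

```js
#!/usr/bin/env node
// setup.js: hypothetical helper for a "root" repo that bootstraps the full stack.
// The repo names and make targets below are illustrative only.
const { execSync } = require('child_process');

const org = 'https://github.com/alliance-genome';
const repos = ['agr_api', 'agr_ui', 'agr_indexer', 'agr_test']; // hypothetical names

for (const repo of repos) {
  // Clone each piece of the stack next to this root repo, then build it.
  execSync(`git clone ${org}/${repo}.git`, { stdio: 'inherit' });
  execSync('make install && make build', { cwd: repo, stdio: 'inherit' });
}
```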

@kevinschaper has been suggesting testing, and I agree. Let's say you are working on a feature in the API, you push code, and everything is good, but it breaks something in the UI. We need integration testing somewhere in the pipeline, preferably on the pull request, that would catch this and make sure nothing broke.

So the branch is available for anyone to checkout: https://github.com/alliance-genome/agr/tree/issue_231

Please let me know what you think. api, webapp, indexer, and test would go into their own repos as the next step.

@kevinschaper
Contributor

I still prefer the testing setup we have with a single repo, because you get feedback from GitHub at the commit / pull-request level that the feature being implemented by the branch isn't done yet, and it's a big red box that stops that code from moving forward.

It might mean that a back-end person hands the branch off to a front-end person, but it seems better than it getting pulled into the api repo only to have an integration test tell you afterwards that "the UI repo and the api repo stopped working together as of [latest commit hash for each repo]".

At that point we're in the "our stuff is broken and somebody needs to fix it" state, rather than the "I have a new feature and nobody will pull it until I fix the tests" state.

@oblodgett
Member Author

Oh... I don't mean to take tests out of each repo; sorry, I know that's what it looks like right now. "tests/*" needs to get moved into api/tests/. I think each repo needs its own "make tests", but there also needs to be a larger tests repo where I would see the integration tests going, testing the whole stack.

@kevinschaper
Contributor

Sorry, that wasn't based on looking at the branch. Is there a practical way for each commit to any repo to run tests against all of them, so that you'll know if your API change broke an integration test in the UI?

@oblodgett
Member Author

I don't know of a way to do that at the commit level, other than the Travis stuff, but even without a split I don't know that you could get Travis to test the whole stack.

I guess we would need to define what the whole stack means, though. In the data-warehouse world, does that mean loading all the data fresh, creating a new index, setting up the UI and API, and then running all the integration tests? That could take a long time, just to catch a typo on a line somewhere.

So... we could, through GoCD, set up pipelines for each developer to control. This might get a little out of hand, but part of that pipeline would be to test the full stack rather than waiting until after the pull request to find out.

@kevinschaper
Contributor

Good point, yeah; ultimately, once you get to "load all of the data and run a full-stack test", it gets a lot less practical. It looks like you can have real (not mocked) data stores in Travis, so we could bring up Elasticsearch, but I imagine it would require subsetting the data so that it doesn't take an hour or more to test each commit.

@oblodgett
Member Author

Maybe this needs to happen at the GoCD level, and we set up notifications if those tests fail. It's less ideal, but we would still know about failures, and integration bugs would at least get fixed when they happened. Maybe we only do a full-stack test with data once a day?

@ragingsquirrel3
Contributor

I'm pretty sure you can have Travis test Python and Node. For the SGD split-repo testing scenario, the frontend tests assume there is a backend running somewhere that is publicly available to Travis CI, an assumption that causes problems when testing new features. Again, I'm sure there's a better way, but this is a problem we have encountered with our split repos.

Just to clarify how the current application webpack server works:

The JS doesn't need Python, but the interface uses Python to at least render an HTML page. In the current AGR repo, yes, you can run npm start and it will serve the JavaScript from port 2992 and do hot updates. However, this is only for dev mode, and it doesn't serve the HTML template itself. You could make webpack serve a static HTML file pretty easily, rely on the webpack dev server, and get rid of Python on the frontend. The problem is that the webpack dev server is only for development and isn't supported for production. In production, you need something else to render the HTML. If the HTML needs to be rendered with some sort of scripting language, then to get rid of Python we would have to use Node.js (or some other language). Or we could assume the HTML is totally static.
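
For what it's worth, one common way to take the "totally static HTML" route is to emit index.html at build time; a minimal sketch, assuming the html-webpack-plugin package (the template path is illustrative, and this may not be what the issue_231 branch actually does):

```js
// webpack.config.babel.js (sketch): emit a static index.html at build time,
// so production only needs a static file server plus a proxy for /api.
import HtmlWebpackPlugin from 'html-webpack-plugin';

export default {
  // ...entry, output, and loaders as already configured...
  plugins: [
    new HtmlWebpackPlugin({
      template: 'src/index.html', // illustrative path to a plain HTML shell
      inject: 'body',             // script tags for the built bundles are injected automatically
    }),
  ],
};
```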

@oblodgett
Member Author

I don't know if you looked at the branch, but the API / Flask does not serve webpack anymore; the index page was moved into the UI and is no longer a template. So yes, npm start serves it in dev mode, and yes, the question is what to do for a production install: the HTML was moved to be totally static and controlled by the UI.

In production the setup is a bit more complicated, but developers don't need to worry about it. We would need to front the JS / HTML / index page with, say, nginx, and then have nginx proxy /api requests over to the Flask server; the Flask server connects to the AWS ES instance, and we make them all play in the sandbox together.
