Monday, November 8, 2010

Github: too many forks?

I recently created a Github account (check me out!) and have become familiar with the site because it hosts a lot of Django-based modules.  At first blush it seems like a programmer utopia:  the code is open source and easily viewable, forking is encouraged and there is a somewhat intuitive interface for following forks, wikis and issue trackers are automatically created for each project, and, thanks to git, forks are (usually) easy to merge.  There are a lot of users, a lot of activity, and a lot of prolific and talented programmers on Github.

So what's not to love?

Really, Github does seem to be the cat's meow of the open source world at the moment, with a nod to BitBucket, where Mercurial-based projects are hosted.  As I've said, I've only recently begun to use Github, and I've discovered it has some significant oddities for someone just joining the ecosystem.  I'm surprised there hasn't been more discussion of this; a quick search of the interwebs only really turned up this post by Andrew Wilkinson.

I'll summarize his points and then add some spice.
  • Coders are "rock stars" who are emphasized over their projects, and the interface is designed from a contributor's point of view.  He later points out that projects of any decent size typically have more users than contributors, so the interface is a bit quirky for someone who just wants to use the project.
  • Each fork gets its own issues and wiki, making it confusing where to discuss the project.
  • Determining which fork to use is not trivial.
I don't have much of an issue with the first point, the interface, because it sounds like Github is working on it; they've even made some recent improvements on this front.

The second is definitely a problem, but I think it is a necessary one.  A fork potentially has its own features and bugs, so these need to be tracked somewhere.  However, anything not specific to the fork should go in the wiki/issue tracker/whatever of the main project.  I think this would be less of an issue if the next point were fixed.

Finally, I come to the heart of my Github confusion:  which fork does one use?  By 'use' I mean either fork and contribute to or, as a user, download and install.  The developers have said they are working on a way to better identify the "main" project, but the solution is yet to be seen.

I'll show how I find the "main" project, and then examine the issue itself.  There are two "find main" methods that probably need to be used together.  The Network graph (read up on it here; you really need to understand these graphs to understand my later figures) is not the place I start because it is relative to the current project, though it will be used in a bit.  What I mean by "relative to the current project" is that if you're looking at a project that is, say, a fork of the original project, all the commits for the original up to the point where the fork occurred are put into the forked project's timeline.  Thus, I first try to find the "grandfather," or original project, that started the chain.  I do this by following the "forked from" links until I reach the one that is not a fork of another.

Currently looking at dcramer's version of the project, a fork of robhudson's, which happens to be the grandfather.
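
For the programmatically inclined, the "follow the links until you hit the top" step can be sketched in a few lines of Python.  This is only an illustration under assumptions: the FORKED_FROM table is hand-written data standing in for the "forked from" links on each project page, the "someuser" entry is made up, and find_grandfather is my own name for the helper, not anything Github provides.

```python
from typing import Dict, Optional

# Hypothetical, hand-written stand-in for the "forked from" links.
# Each project maps to the project it was forked from, or None if it
# is not a fork. The "someuser" entry is invented for illustration.
FORKED_FROM: Dict[str, Optional[str]] = {
    "robhudson/django-debug-toolbar": None,
    "dcramer/django-debug-toolbar": "robhudson/django-debug-toolbar",
    "someuser/django-debug-toolbar": "dcramer/django-debug-toolbar",
}


def find_grandfather(repo: str, forked_from: Dict[str, Optional[str]]) -> str:
    """Follow "forked from" links until reaching a project that is not a fork."""
    seen = set()
    while forked_from.get(repo) is not None:
        if repo in seen:
            # Guard against bad data that loops back on itself.
            raise ValueError("cycle in fork links at " + repo)
        seen.add(repo)
        repo = forked_from[repo]
    return repo


grandfather = find_grandfather("someuser/django-debug-toolbar", FORKED_FROM)
print(grandfather)
```

Starting from any fork in the chain, the loop should land on the same project, which is exactly what clicking through the links by hand does.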




After getting to the grandfather, I then look at the Network graph, which I consider clearer now because all forks show only their own commits.  Typically the grandfather is the "main" version of the project, but occasionally a grandfather will become dormant and a fork will become the main line of development.  Look to see whether the grandfather is still being updated and forked, and whether other forks are merging back into it.  If not, see if another fork has taken over this position.  If not, find the fork with the bug fixes and improvements that seem best, as it will probably become the "main" version as other people reach the same conclusion as you (hopefully).  Another thing that may be helpful is checking the number of watchers of a project/fork; this can give you an idea of its popularity.
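
Here is a rough Python sketch of that decision process, mostly to show how fuzzy it is.  Everything in it is an assumption on my part: the Fork record and its fields, the 180-day staleness cutoff, the sample numbers, and especially the use of watcher count as a stand-in for "the fork whose fixes seem best," which in reality is a judgment call no script can make for you.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Fork:
    """Hypothetical record of what you would read off a project page."""
    owner: str
    last_commit: date
    watchers: int
    is_grandfather: bool = False


def pick_main(forks: List[Fork], today: date, stale_after_days: int = 180) -> Fork:
    """Guess the "main" fork: an active grandfather wins, else the most-watched active fork."""
    cutoff = today - timedelta(days=stale_after_days)
    grandfather = next(f for f in forks if f.is_grandfather)
    # Step 1: a grandfather that is still being updated is the main version.
    if grandfather.last_commit >= cutoff:
        return grandfather
    # Step 2: otherwise look for recently updated forks and use watchers
    # as a crude popularity proxy (an assumption, not a real quality measure).
    active = [f for f in forks if f.last_commit >= cutoff]
    if active:
        return max(active, key=lambda f: f.watchers)
    # Step 3: nothing is active, so fall back to the original project.
    return grandfather


# Made-up sample data for illustration only.
forks = [
    Fork("original", date(2010, 1, 1), 40, is_grandfather=True),
    Fork("fork_a", date(2010, 10, 1), 5),
    Fork("fork_b", date(2010, 9, 1), 12),
]
choice = pick_main(forks, today=date(2010, 11, 8))
print(choice.owner)
```

With this sample data the grandfather is stale, so the sketch falls through to the most-watched active fork.  Change the cutoff or the proxy and you may get a different answer, which is rather the point.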

Okay, so hopefully you can see this is about as clear as mud and a rather inexact science.  Wilkinson's "rock star" description is apt:  projects are identified first by the coder--robhudson's django-debug-toolbar--rather than by the project itself.  Admittedly this makes sense based on how git and forks work, but it leaves the interface muddied.  Whom do I trust?  robhudson or dcramer?  Side note: I'm glad people generally use their names or sensible nicknames as identifiers; if "l33tskillz393" had a fork I don't think I'd even give it the time of day.

To clarify further, let's look at some pictures of the Network graphs for a couple of projects I've looked at recently.

(django-pagination)

What I have identified as the grandfather and main branch is the line on the top.  There are more forks not shown here, but none below hgrimelid's have any "recent" commits.  It is pretty clear here that the grandfather branch is the "main" version of the project:  past forks have either died or merged their changes back in (merges can be seen on the blue and neon green lines in the upper left).  There are some recent forks off the latest grandfather commit, possibly with important bug fixes or features, which makes the decision a bit less clear.  Go with the main branch and assume important changes will be merged in a future version, or go with a fork and hope it doesn't turn into a dead end?

Let's look at another with a slightly different situation.

(django-sorting)
The grandfather is again on top.  But this time the grandfather looks a bit outdated: no updates in months!  It looks like a very active project though, with plenty of forks--and forks of forks!--being made and updated.  I do not think there is a clear choice.

Github has highlighted an unforeseen problem with distributed version control when the participants aren't under some guiding light, such as working for the same company.  The traditional model of a project is that there is some entity--a person, a committee, a company--that determines a project's versions, features, etc.  Distributed source control may be used to develop the project, but at some point someone says "this is the next version" and everyone trusts that authority.  This can be seen commercially in, say, how Microsoft releases new versions of Windows every so often.  In the more complicated world of distributed version control, look at what git was created to develop in the first place: the Linux kernel.  It gets all kinds of forks and such, but in the end the idea is that they get merged back into the mainline kernel, Torvalds baptizes it, and distributions push this authoritative version out to users.  In this traditional model, users and programmers really only care about the project as a whole, not the forks that went into it.

Github flips this on its head, I assume because of the "rock star" approach.  Each programmer is given equal stage and there is no definitive project.  Without an authority people go on their merry way and we get these spider webs of Network graphs.

I'm going to eagerly await Github's "find the main branch" solution.  My admittedly rather warm-and-fuzzy suggestion is to strongly encourage merging forks.  Traditionally forks have been a big deal because they split a project's development due to legal reasons or vision disagreements or whatever.  The assumption there is that no other recourse and no reconciliation is possible.  Perhaps this trained behavior is one reason for the lack of merges?  Anyway, traditional forks usually make significant changes to a project that are not necessarily meant to play nice with the original project.  In the Github world forks are THE way to update projects.  Thus forks are not splits; they are the way you do even small-scale things like fixing bugs and adding minor features.  These are things that should bubble up to the main project, not languish in a soon-to-be-forgotten fork.  It seems from my own experience that people fork, fix the bugs they need for their personal use, and then forget about the project altogether.  If you look at the figures I've provided you see an abundance of forks, but merges are rare.

If Github could convince all the forkers to be mergers it would be a much happier place.
