Monday, November 29, 2010

Accessing guest's Apache server from the host in Virtualbox

To clarify the title of this post: I am running the latest Virtualbox (3.2.6) on a Windows 7 host with a Linux (Kubuntu) guest.  I am using the guest as a development box for a website I am working on, and I would like to access it from Win7 to test with Internet Explorer (yuck).  This is something I've been wanting to get working for a while but never really looked into.  Today I decided to break down and investigate.  Turns out it is a fairly simple process.

Option 1 - Port Forwarding

I first found this blog post that discusses setting up the proper port settings for Virtualbox.  I tried this and ran into some errors, including the one from the first comment.  Just copying and pasting from the site I started up VB and immediately got the errors.  I realized this is because I am using one of the Intel network adapters, not PCNet as in the example.  The secret here is figuring out the appropriate device name to use in the paths given by the post.  The way I found these was to look at the VBox.log file for the appropriate machine.  In it you will find many lines, but you're looking for level 2 devices that look something like this:

[/Devices/e1000/] (level 2)

In this case e1000 is the device for the Intel network adapters.  So, adjusting the code from the blog, you would run each of these at the command prompt:

VBoxManage setextradata YourGuestName "VBoxInternal/Devices/e1000/0/LUN#0/Config/apache/HostPort" 8888
VBoxManage setextradata YourGuestName "VBoxInternal/Devices/e1000/0/LUN#0/Config/apache/GuestPort" 80
VBoxManage setextradata YourGuestName "VBoxInternal/Devices/e1000/0/LUN#0/Config/apache/Protocol" TCP
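
The two steps above (picking the device name out of VBox.log and filling it into the three paths) can be sketched in a few lines of Python.  This is only an illustration: the function names are mine, the sample log text is made up apart from the "(level 2)" line format shown above, and it uses the pcnet device from the original blog's example.  The script only builds the command lines; it does not run VBoxManage.

```python
import re

# Matches lines such as "[/Devices/pcnet/] (level 2)" in VBox.log.
DEVICE_LINE = re.compile(r"\[/Devices/([^/\]]+)/\] \(level 2\)")


def find_level2_devices(log_text):
    """Return the device names from level 2 entries, in order of appearance."""
    return DEVICE_LINE.findall(log_text)


def forwarding_commands(vm_name, device, rule, host_port, guest_port):
    """Build the three setextradata command lines for one TCP forwarding rule."""
    base = f"VBoxInternal/Devices/{device}/0/LUN#0/Config/{rule}"
    settings = [("HostPort", host_port), ("GuestPort", guest_port), ("Protocol", "TCP")]
    return [
        f'VBoxManage setextradata {vm_name} "{base}/{key}" {value}'
        for key, value in settings
    ]


# Made-up log excerpt; only the "(level 2)" line format comes from the post.
sample_log = """\
[/Devices/] (level 1)
[/Devices/pcnet/] (level 2)
[/Devices/pcnet/0/] (level 3)
"""

devices = find_level2_devices(sample_log)
commands = forwarding_commands("YourGuestName", devices[0], "apache", 8888, 80)
for line in commands:
    print(line)
```

A real log lists many level 2 devices, so you still have to pick the one that belongs to your network adapter; the device name is just a parameter here.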

One thing to be careful of here is the errors already mentioned, which will prevent you from starting the VM at all.  The most common is VERR_PDM_DEVICE_NOT_FOUND, which means the device in the path does not match the current settings: for example, using pcnet in the commands above while an Intel adapter is selected in the settings, or mistyping the device name in the path.  The fix is to close VirtualBox completely (the VMs and the program itself) and edit the XML configuration file for the machine.  Near the top it will have <ExtraDataItem> entries corresponding to the settings you changed above.  Delete these three lines, start up VB, and your VM should be able to start again.
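
If you would rather not hand-edit the file, the cleanup can be sketched as a simple line filter.  This is a rough sketch under assumptions: it assumes each <ExtraDataItem> sits on its own line and looks like the sample below (the exact layout of your machine's XML file may differ), the function name is mine, and the GUI entry is only there to show that unrelated items are kept.  Work on a backup copy of the file, with VirtualBox closed.

```python
def strip_forwarding_items(xml_text, rule="apache"):
    """Drop <ExtraDataItem> lines that belong to the given forwarding rule."""
    marker = f"/Config/{rule}/"
    kept = [
        line
        for line in xml_text.splitlines(keepends=True)
        if not ("<ExtraDataItem" in line and marker in line)
    ]
    return "".join(kept)


# Assumed layout of the relevant part of the machine's XML file.
sample = """\
<ExtraData>
  <ExtraDataItem name="GUI/LastCloseAction" value="powerOff"/>
  <ExtraDataItem name="VBoxInternal/Devices/pcnet/0/LUN#0/Config/apache/HostPort" value="8888"/>
  <ExtraDataItem name="VBoxInternal/Devices/pcnet/0/LUN#0/Config/apache/GuestPort" value="80"/>
  <ExtraDataItem name="VBoxInternal/Devices/pcnet/0/LUN#0/Config/apache/Protocol" value="TCP"/>
</ExtraData>
"""

cleaned = strip_forwarding_items(sample)
print(cleaned)
```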

The blog goes on to say you should then use localhost:8888 to access the web server but I never got this to work.  I admit I did not try very hard, so the solution here could be a simple fix.  I'm guessing it is some issue with Windows 7 networking handling the localhost, which is not something I felt like messing with, so I found a different solution.
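
One way to narrow down a problem like this is to check whether anything accepts connections on the forwarded port at all, before blaming the browser or Apache.  Below is a minimal sketch using Python's standard socket module; the function name is mine, and the host and port are the ones from this post.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("forwarded port reachable:", port_is_open("localhost", 8888))
```

If this prints False while the VM is running, the forwarding rule itself is not in effect; if it prints True, the problem is more likely on the Apache or browser side.  The same check works for the guest's own address and port 80.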

Option 2 - Bridged Adapter

My guest's network adapter was set to connect via NAT.  After changing it to bridged, I was able to browse to the guest's IP from my host (192.168.1.5 in this case) and everything worked fine.  Much simpler, and no messing with VBoxManage.

Option 3 - Host-only Adapter

The Host-only Adapter will also work, but it will not give your guest outside internet access.  The plus side is that no physical network device is required, but for me it is hard to imagine a computer without a network device in this day and age.
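
I made these changes in the settings dialog, but if you prefer to script them, the adapter mode can also be switched with VBoxManage modifyvm while the VM is powered off.  The sketch below only builds the command line and does not run it.  The function name is mine and the adapter names in the usage lines are placeholders: use whatever names your own host reports.

```python
def nic_mode_command(vm_name, mode, host_adapter=None, nic=1):
    """Build a 'VBoxManage modifyvm' command line that switches one adapter's mode."""
    adapter_flag = {"bridged": "--bridgeadapter", "hostonly": "--hostonlyadapter"}
    if mode != "nat" and mode not in adapter_flag:
        raise ValueError(f"unsupported mode: {mode}")
    parts = ["VBoxManage", "modifyvm", vm_name, f"--nic{nic}", mode]
    if mode in adapter_flag:
        # Bridged and host-only modes must name the host interface to attach to.
        if not host_adapter:
            raise ValueError(f"{mode} mode needs a host adapter name")
        parts += [f"{adapter_flag[mode]}{nic}", f'"{host_adapter}"']
    return " ".join(parts)


# Placeholder adapter names; substitute the ones on your host.
print(nic_mode_command("YourGuestName", "bridged", "Your Host Adapter Name"))
print(nic_mode_command("YourGuestName", "hostonly", "VirtualBox Host-Only Ethernet Adapter"))
print(nic_mode_command("YourGuestName", "nat"))
```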



Overall Option 2 is the best as I see it: simple and effective.

Monday, November 8, 2010

Github: too many forks?

I recently created a Github account (check me out!) and have become familiar with the site because it hosts a lot of Django-based modules.  At first blush it seems like a programmer utopia:  the code is open source and easily viewable, forking is encouraged and there is a somewhat intuitive interface for following forks, wikis and issue trackers are automatically created for each project, and, thanks to git, forks are (usually) easy to merge.  There are a lot of users, a lot of activity, and a lot of prolific and talented programmers on Github.

So what's not to love?

Really, Github does seem to be the cat's meow of the open source world at the moment, with a nod to BitBucket, where Mercurial-based projects are hosted.  As I've said, I've only recently begun to use Github, and I've discovered it has some significant oddities for someone just joining the ecosystem.  I'm surprised there hasn't been more discussion of this; a quick search of the interwebs only really turned up this post by Andrew Wilkinson.

I'll summarize his points and then add some spice.
  • Coders are "rock stars" who are emphasized over their projects, and the interface is designed from a contributor's point of view.  He later points out that projects of any decent size typically have more users than contributors, so the interface is a bit quirky for someone who just wants to use the project.
  • Each fork gets its own issues and wiki, making it confusing where to discuss the project.
  • Determining which fork to use is not trivial.
The interface problems in the first point don't bother me much because it sounds like Github is working on them; they've even made some recent improvements on this front.

The second is definitely a problem, but I think it is a necessary one.  A fork potentially has its own features and bugs, so these need to be put somewhere.  However, anything not specific to the fork should be put in the wiki/issue tracker/whatever of the main project.  I think this would be less of an issue if the next point were fixed.

Finally, I come to the heart of my Github confusion:  which fork does one use?  By 'use' I mean either fork and contribute to or, as a user, download and install.  The developers have said they are working on a way to better identify the "main" project, but the solution is yet to be seen.

I'll show how I find the "main" project, and then examine the issue itself.  There are two "find main" methods that probably need to be used together.  I don't start with the Network graph (read up on it here; you really need to understand these graphs to understand my later figures) because it is relative to the current project, though I will use it in a bit.  What I mean by "relative to the current project" is that if you're looking at a fork of the original project, all of the original's commits up to the point where the fork occurred are put into the fork's timeline.  Thus, I first try to find the "grandfather": the original project that started the chain.  I do this by following the "forked from" links until I reach a project that is not a fork of another.

In this example I am looking at dcramer's version of the project, a fork of robhudson's, which happens to be the grandfather.
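
Following the "forked from" links is just a walk up a chain of parents, which can be sketched as below.  The mapping is typed in by hand for illustration (the second entry is a hypothetical fork of a fork), and the function name is mine; nothing here talks to Github.

```python
def find_grandfather(repo, forked_from):
    """Follow 'forked from' links upward until reaching a repo that is not a fork."""
    seen = set()
    while repo in forked_from:
        if repo in seen:
            raise ValueError(f"cycle in fork links at {repo}")
        seen.add(repo)
        repo = forked_from[repo]
    return repo


# Each key is a fork; its value is the project it was forked from.
forked_from = {
    "dcramer/django-debug-toolbar": "robhudson/django-debug-toolbar",
    "someuser/django-debug-toolbar": "dcramer/django-debug-toolbar",  # hypothetical
}

print(find_grandfather("someuser/django-debug-toolbar", forked_from))
```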




After getting to the grandfather, I then look at the Network graph, which I consider clearer now because all forks show only their own commits.  Typically the grandfather is the "main" version of the project, but occasionally a grandfather will become dormant and a fork will become the main line of development.  Look to see whether the grandfather is still being updated and forked, and whether other forks are merging back into it.  If not, see if another fork has taken over this position.  If not, find the fork with the bug fixes and improvements that seem best, as it will probably become the "main" version as other people reach the same conclusion as you (hopefully).  It may also help to check the number of watchers of a project or fork, which can give you an idea of its popularity.
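
To make the checklist concrete, here is a rough sketch of it as a scoring rule.  Everything in it is an assumption for illustration: the Fork fields, the 180-day staleness cutoff, the tie-breaking order, and the sample owners and numbers are all made up, and judging which fork's fixes "seem best" still takes a human.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Fork:
    owner: str
    last_commit: date
    watchers: int
    merges_received: int  # how many other forks have merged back into this one


def pick_main(grandfather, forks, today, stale_after_days=180):
    """Prefer the grandfather while it is active; otherwise the liveliest fork."""
    def is_active(fork):
        return (today - fork.last_commit).days <= stale_after_days

    if is_active(grandfather):
        return grandfather
    active = [f for f in forks if is_active(f)]
    if not active:
        return grandfather  # nothing is active, so fall back to the original
    # A fork that others merge into has "taken over"; watchers break ties.
    return max(active, key=lambda f: (f.merges_received, f.watchers))


# Hypothetical project: a dormant grandfather and two active forks.
today = date(2010, 11, 8)
grandfather = Fork("original", date(2010, 3, 1), watchers=40, merges_received=0)
forks = [
    Fork("alice", date(2010, 10, 20), watchers=5, merges_received=2),
    Fork("bob", date(2010, 11, 1), watchers=12, merges_received=0),
]
print(pick_main(grandfather, forks, today).owner)
```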

Okay, so hopefully you can see this is about as clear as mud and a rather inexact science.  Wilkinson's "rock star" description is apt:  projects are identified first by the coder--robhudson's django-debug-toolbar--rather than by the project itself.  Admittedly this makes sense based on how git and forks work, but it leaves the interface muddied.  Whom do I trust?  robhudson or dcramer?  Side note: I'm glad people generally use their names or sensible nicknames as identifiers; if "l33tskillz393" had a fork I don't think I'd even give it the time of day.

To further clarify, let's look at some pictures of the Network graphs for a couple of projects I've looked at recently.

(django-pagination)

What I have identified as the grandfather and main branch is the line on the top.  There are more forks not shown here, but none below hgrimelid's have any "recent" commits.  It is pretty clear here that the grandfather branch is the "main" version of the project:  past forks have either died or merged their changes back in (merges can be seen on the blue and neon green lines in the upper left).  There are some recent forks off the latest grandfather commit, possibly with important bug fixes or features, which makes the decision a bit less clear.  Go with the main branch and assume important changes will be merged in a future version, or go with a fork and hope it doesn't turn into a dead end?

Let's look at another with a slightly different situation.

(django-sorting)
The grandfather is again on top.  But this time the grandfather looks a bit outdated: no updates in months!  It looks like a very active project though, with plenty of forks--and forks of forks!--being made and updated.  I do not think there is a clear choice.

Github has highlighted an unforeseen problem with distributed version control when the participants aren't under some guiding light, such as working for the same company.  The traditional model of a project is that there is some entity--a person, a committee, a company--that determines a project's versions, features, etc.  Distributed source control may be used to develop the project, but at some point someone says "this is the next version" and everyone trusts that authority.  This can be seen commercially in, say, how Microsoft releases new versions of Windows every so often.  In the more complicated world of distributed version control, look at the project git was created to develop in the first place: the Linux kernel.  It gets all kinds of forks and such, but in the end the idea is that they get merged back into the mainline kernel, Torvalds baptizes it, and distributions push this authoritative version out to users.  In this traditional model, users and programmers really only care about the project as a whole, not the forks that went into it.

Github flips this on its head, I assume because of the "rock star" approach.  Each programmer is given equal stage and there is no definitive project.  Without an authority people go on their merry way and we get these spider webs of Network graphs.

I'm going to eagerly await Github's "find the main branch" solution.  My admittedly rather warm-and-fuzzy suggestion is to strongly encourage merging forks.  Traditionally forks have been a big deal because they split a project's development due to legal reasons or vision disagreements or whatever.  But the assumption is there is no other recourse and no reconciliation.  Perhaps this trained behavior is one reason for the lack of merges?  Anyway, forks usually make significant changes to a project that are not necessarily meant to play nice with the original project.  In the Github world forks are THE way to update projects.  Thus forks are not splits, they are the way you do even small-scale things like fix bugs and add minor features.  These are things that should bubble up to the main project, not languish in a soon-to-be-forgotten fork.  It seems from my own experience that people fork, fix the bugs they need for their personal use, and then forget about the project altogether.  If you look at the figures I've provided you see an abundance of forks, but merges are rare.

If Github could convince all the forkers to be mergers it would be a much happier place.