Monday, April 4, 2011

Ajax file uploads and CSRF (in Django 1.3, or possibly other frameworks)

To begin, this is an update of my old post AJAX Uploads in Django (with a little help from jQuery). This guide is specific to Django, but my version of the file uploader can (theoretically, it is untested) be used with other web frameworks that use CSRF, like Ruby on Rails. You should be able to follow along with the guide and make adjustments as appropriate for your framework.

Required Software

  • My version of Valum's file upload
  • Python 2.6+
  • Django 1.3+

If you are on an older version of Python and/or Django, reading the prior version of this post, and especially this Stack Overflow question of mine, may help in adjusting the code. The only part that requires updated Python and Django is the save_upload function. The code uses buffered readers/writers and the 'with' keyword from Python 2.6+ (I suspect these parts could easily be changed) and reads from the raw HttpRequest, which comes with Django 1.3+. The Stack Overflow question has code I tried before moving up to requiring these newer software versions. It worked for small uploads below CD ISO size (700MB) and could probably be fixed to work with all uploads; I just found the Django 1.3+ solution easier and quicker at the time.
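
If you are stuck on Python 2.5, the buffered-writer/'with' portion can probably be swapped for a plain file object with try/finally; here is an untested sketch of that substitution (the Django 1.3+ requirement for reading the raw request still stands):

# untested sketch: plain-file fallback for Python < 2.6, replacing the
# BufferedWriter/'with' construction used in save_upload later in this post
dest = open( filename, "wb" )
try:
  if raw_data:  # "advanced" mode: uploaded is the HttpRequest itself
    chunk = uploaded.read( 1024 )
    while chunk:
      dest.write( chunk )
      chunk = uploaded.read( 1024 )
  else:         # "basic" mode: uploaded is a Django UploadedFile
    for chunk in uploaded.chunks( ):
      dest.write( chunk )
finally:
  dest.close( )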

Overview

Ajax Upload handles the client side very seamlessly and presents only one challenge to the programmer: it passes the file either as the raw request, for the "advanced" mode, or as a traditional form file for the "basic" mode. Thus, on the Django side, the receiving function must be written to process both cases. As the old post discusses, reading this raw request was a bit of trouble, and that is why I went with Django 1.3 as a requirement for my code.

Setup and Settings

First is to get AJAX Upload installed by downloading the latest version from my Github repo. This fork of Valum's original includes my changes as well as improvements from other forks that I need. As of this writing, I have made FileUploader correctly aware of FileUploaderBasic's 'multiple' parameter and included David Palm's onAllComplete trigger. Once downloaded, grab fileuploader.js and fileuploader.css out of the client folder and place them wherever is appropriate for your setup. Finally, link them in your HTML via your Django templates.

The Web (Client) Side

HTML

This is the HTML code that will house the upload button/drag area, so place it appropriately.

<div id="file-uploader">       
    <noscript>          
        <p>Please enable JavaScript to use file uploader.</p>
    </noscript>         
</div>

Javascript

You probably want to dump this in the same HTML/template file as the above, but it is up to you of course.

var uploader = new qq.FileUploader( {
    action: "{% url ajax_upload %}",
    element: $('#file-uploader')[0],
    multiple: true,
    onComplete: function( id, fileName, responseJSON ) {
      if( responseJSON.success )
        alert( "success!" ) ;
      else
        alert( "upload failed!" ) ;
    },
    onAllComplete: function( uploads ) {
      // uploads is an array of maps
      // the maps look like this: { file: FileObject, response: JSONServerResponse }
      alert( "All complete!" ) ;
    },
    params: {
      'csrf_token': '{{ csrf_token }}',
      'csrf_name': 'csrfmiddlewaretoken',
      'csrf_xname': 'X-CSRFToken',
    },
} ) ;

Now, let's make some sense of that.

  • It is probably simplest to use the url template tag to fill in the action as I did above, but it could also be a hard-coded URL as a string. It is set here to match the URL config covered later in this guide.
  • The multiple option is not discussed in any of Valum's documentation that I found. It determines whether the uploader supports selecting/dragging multiple files for upload at a time: true allows multiple files, while false limits it to one at a time. In Valum's version, this option is available to FileUploaderBasic, but not FileUploader, which is the class most people use. For my repo I chose to update FileUploader to be aware of the multiple option.
  • The onAllComplete callback is an addition in my repo over Valum's, taken from David Palm's fork. It is called whenever the queue of uploads becomes empty. For example, if you drag/select 4 uploads, it will fire once all 4 have finished. If you then drag/select 2 more files for upload, it will fire again when those 2 are completed.
  • The params are set up so the uploader can interact with Django's CSRF framework properly. csrf_token is obviously the token itself, while csrf_name is the name of the form input Django expects for form submissions and csrf_xname is the HTTP header it reads for AJAX requests (see the sketch just after this list). Why bother with the last two parameters? Well, theoretically my version of the file uploader should work with other frameworks, which may expect different names for these. For example, Ruby on Rails will expect 'X-CSRF-Token' for AJAX requests and 'authenticity_token' for forms (I think).
  • jQuery is used to grab the appropriate part of the div. If you are not using jQuery use whatever method is appropriate for your system to get the file-uploader DOM element. Using regular Javascript you could do document.getElementById('file-uploader'), as Valum uses in the examples on his site.
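
For the curious, here is roughly what Django 1.3's CsrfViewMiddleware does to find the token on a protected POST (a paraphrase for illustration, not the actual source); this is why the uploader must supply either the csrfmiddlewaretoken form field or the X-CSRFToken header:

# paraphrased sketch of Django 1.3's token lookup in CsrfViewMiddleware
request_csrf_token = request.POST.get( 'csrfmiddlewaretoken', '' )
if request_csrf_token == '':
  # AJAX clients may send the token in a header instead; the
  # 'X-CSRFToken' header appears in request.META as HTTP_X_CSRFTOKEN
  request_csrf_token = request.META.get( 'HTTP_X_CSRFTOKEN', '' )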

The Server (Django) Side

Django URLs

It is best to have two views for this setup to work: one to display the upload page and one to process the upload file. The URLs need to be set in urls.py of course.

url( r'^project/ajax_upload/$', ajax_upload, name="ajax_upload" ),
url( r'^project/$', upload_page, name="upload_page" ),

Note that these may require some adjustments depending on how your urls.py is coded.
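
If it helps, a minimal Django 1.3 urls.py wrapping those patterns might look like the following sketch (the views import path is an assumption; adjust it to your project):

from django.conf.urls.defaults import patterns, url
from myproject.views import ajax_upload, upload_page  # hypothetical module path

urlpatterns = patterns( '',
  url( r'^project/ajax_upload/$', ajax_upload, name="ajax_upload" ),
  url( r'^project/$', upload_page, name="upload_page" ),
)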

Views

First is the upload_page view, which is going to display the page with which the user interacts. This is a simple skeleton; add whatever your template needs.

from django.middleware.csrf import get_token
from django.shortcuts import render_to_response
from django.template import RequestContext
def upload_page( request ):
  ctx = RequestContext( request, {
    'csrf_token': get_token( request ),
  } )
  return render_to_response( 'upload_page.html', ctx )

Including the csrf_token in the context is very important, as the earlier Javascript depends on having this variable available. Django does not hand you the token automatically here, so get_token is the official way to fetch it; calling it also ensures Django sets the CSRF cookie on the response.

Next is the view to handle the upload. Remember that this code must handle two situations: the case of an AJAX-style upload for the "advanced" mode and a form upload for the "basic" mode. I split this code into two functions: one to actually save the upload and one to serve as the view.

def save_upload( uploaded, filename, raw_data ):
  ''' 
  raw_data: if True, uploaded is an HttpRequest object with the file being
            the raw post data 
            if False, uploaded has been submitted via the basic form
            submission and is a regular Django UploadedFile in request.FILES
  '''
  try:
    from io import FileIO, BufferedWriter
    with BufferedWriter( FileIO( filename, "wb" ) ) as dest:
      # if the "advanced" upload, read directly from the HTTP request 
      # with the Django 1.3 functionality
      if raw_data:
        foo = uploaded.read( 1024 )
        while foo:
          dest.write( foo )
          foo = uploaded.read( 1024 ) 
      # if not raw, it was a form upload so read in the normal Django chunks fashion
      else:
        for c in uploaded.chunks( ):
          dest.write( c )
      # got through saving the upload, report success
      return True
  except IOError:
    # could not open the file most likely
    pass
  return False

from django.http import HttpResponse, HttpResponseBadRequest, Http404
def ajax_upload( request ):
  if request.method == "POST":    
    if request.is_ajax( ):
      # the file is stored raw in the request
      upload = request
      is_raw = True
      # AJAX Upload will pass the filename in the querystring if it is the "advanced" ajax upload
      try:
        filename = request.GET[ 'qqfile' ]
      except KeyError: 
        return HttpResponseBadRequest( "AJAX request not valid" )
    # not an ajax upload, so it was the "basic" iframe version with submission via form
    else:
      is_raw = False
      if len( request.FILES ) == 1:
        # FILES is a dictionary in Django but Ajax Upload gives the uploaded file an
        # ID based on a random number, so it cannot be guessed here in the code.
        # Rather than editing Ajax Upload to pass the ID in the querystring,
        # observe that each upload is a separate request,
        # so FILES should only have one entry.
        # Thus, we can just grab the first (and only) value in the dict.
        upload = request.FILES.values( )[ 0 ]
      else:
        raise Http404( "Bad Upload" )
      filename = upload.name
    
    # save the file
    success = save_upload( upload, filename, is_raw )

    # let Ajax Upload know whether we saved it or not
    import json
    ret_json = { 'success': success, }
    return HttpResponse( json.dumps( ret_json ) )
  return HttpResponseBadRequest( "Only POST requests are accepted" )

The first thing you probably want to edit here is the use of filename in either ajax_upload or save_upload. The saving function as it stands assumes filename is a path. In my actual usage, I combine filename with a constant from settings.py that represents the path where uploads should be saved. So, at the beginning of save_upload you could have something like filename = settings.UPLOAD_STORAGE_DIR + filename, where UPLOAD_STORAGE_DIR is set to something like "/data/uploads/" (see the sketch below). Or, of course, you could skip the constant and hard-code your path string, but that's bad, right?
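
As a sketch of that idea (UPLOAD_STORAGE_DIR is a made-up settings name, and the os.path.basename call is my own addition to keep a client-supplied filename from escaping the upload directory), the beginning of save_upload could become:

import os
from django.conf import settings

def save_upload( uploaded, filename, raw_data ):
  # strip any directory components the client may have sent, so a
  # filename like "../../etc/passwd" cannot escape the upload dir
  filename = os.path.basename( filename )
  filename = os.path.join( settings.UPLOAD_STORAGE_DIR, filename )
  # ... rest of save_upload unchanged ...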

And that's it, go have some fun!

I have had many people ask me via comments or email about providing a demo of this system. From a user standpoint it looks/works no different from the demo on Valum's site. As of right now I cannot provide my own demo because my web host does not provide a Django environment. I'm trying to work with them on getting it available though. I will also work on getting my code in as the Django example with the uploader code in my github repo. When either of those happen I will update this post.

Thanks to everyone who commented on the last post, they helped immensely in creating my github repo and in fixing bugs on the post itself. If you find any mistakes here please comment or contact me directly (contact info can be found on my site).

Sunday, December 19, 2010

The Rights and Wrongs of Dynamic Pages

I was recently reading this article over at The Economist. The content of the article aside, it made one other thing come to mind: with great power comes great responsibility.

As I began reading the article, and thus scrolling, all of a sudden my screen began to look like a cluttered mess. A full-screen-width bar dropped down from the top with please-for-the-love-of-God-share-this-article-on-all-your-social-networking-sites buttons and a search box. Besides being redundant, since those features are already embedded on the page, it was distracting. At about the same time a square box slid up from the bottom telling me I needed to subscribe to the magazine, again covering up the content of the article. And again, there is already an advertisement-like area near the top of the page offering four free issues and telling you to subscribe.

The passive versions of these features I am fine with, but the two slide-in boxes are too much because they appear after you have already begun to read, distracting your attention and covering up the content you're there to see in the first place. It is a very in-your-face type of pressure that most people do not appreciate, much like extremely loud commercials.

Okay, so if The Economist is the Comcast of internet news, what's an example of dynamic pages done right in that area? I think the New York Times does it right. Go read an article (this one I chose at random) or simply scroll through it. Nothing pops up to annoy you; everything happens on page load. The one exception: when you reach the bottom of the article, a box slides in--and not over the content you're reading!--letting you know of related articles you may be interested in. This is actually helpful rather than self-serving like The Economist's dynamic content.

So, while we are all enamored with the eye candy of modern Ajax development remember to take a critical eye to it and note those who are using it well. I think I'll go read some more NYT.

Wednesday, December 1, 2010

AJAX Uploads in Django (with a little help from jQuery)

*** April 4, 2011: This post is a bit outdated and does not work with Django 1.3 final's stricter CSRF enforcement. I have a new post that is much more up-to-date, cleaner, and easier to follow, especially because it uses the github repo I created that holds the changes that need to be made to the file uploader's javascript.***

Part of my current project is creating an area where files need to be uploaded in a snazzier way than the normal "browse for a single file" sort of forms. The idea is to have an upload button with the same native OS file dialog, but one that allows multiple files to be selected. Additionally, the even snappier part is that HTML5's drag-and-drop functionality should also be available where supported.

After some searching I found a couple of sites that pointed me in the right direction.

  • AJAX Upload handles the client-side end of the upload process, including the required multiple-select and drag-and-drop (for browsers that support it). It even handles graceful fallback to iframe uploads for browsers that do not support those advanced features (Opera, IE, etc.).
  • On the Django side, I found this site, which gives some pointers to get it working with Django 1.2's CSRF token. The problem I encountered is that simply passing the CSRF token to Ajax Upload via its params will not work, because Ajax Upload sends it in the querystring and Django expects it as POST data.

Because neither of those gave me the whole picture, I had to piece things together on my own. Here is the straight skinny on my findings for getting Ajax Upload to work with Django.

Overview

Ajax Upload handles the client side very seamlessly and presents only one challenge to the programmer: it passes the file either as the raw request, for the "advanced" mode, or as a traditional form file for the "basic" mode. Thus, on the Django side, the receiving function must be written to process both cases. In the raw request version, reading the request without Django blowing up memory was a bit of a challenge. I first tried reading the data into Django's SimpleUploadedFile. It is an in-memory class, though, so it runs into issues with large files. Next I tried reading/writing the data through Python functions, which had similar problems. The "works all the time" solution requires Django 1.3, which is to use its new "http request file-like interface" to read from the request. If you're using Django 1.2 and figure out another way to read the request data for all file sizes please comment! I discuss these solutions a little more in depth at this Stack Overflow question of mine.

Setup

First is to get AJAX Upload installed by placing the JS and CSS files wherever appropriate and linking to them in your Django templates. Also make sure you have set up Django's file upload handlers. Next is setting it up on the site.
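
Django's defaults are normally fine here; for reference, the stock setting looks like this in settings.py, and you only need to override it if you want custom handlers:

# Django's default upload handlers: small files are kept in memory,
# large ones are streamed to a temporary file on disk
FILE_UPLOAD_HANDLERS = (
    "django.core.files.uploadhandler.MemoryFileUploadHandler",
    "django.core.files.uploadhandler.TemporaryFileUploadHandler",
)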

The Web (Client) Side

HTML

This is the HTML code that will house the upload button/drag area, so place it appropriately.

<div id="file-uploader">       
    <noscript>          
        <p>Please enable JavaScript to use file uploader.</p>
        <!-- or put a simple form for upload here -->
    </noscript>         
</div>

Javascript

You probably want to dump this in the same HTML/template file, but it is up to you.

function generate_ajax_uploader( $url, $csrf_token, $success_func )
{
  var uploader = new qq.FileUploader( {
    action: $url,
    element: $('#file-uploader')[0],
    onComplete: function( id, fileName, responseJSON ) {
      /* you probably want to handle the case when responseJSON.success is false,
         which happens when the Django view could not save the file */
      if( responseJSON.success )
        $success_func( responseJSON ) ;
    },
    params: {
      'csrfmiddlewaretoken': $csrf_token, /* MUST call it csrfmiddlewaretoken to work with my later changes to Ajax Upload */
    },
  } ) ;
}

A little explanation is probably needed here.

  • I have wrapped the code inside of a function that generates the uploader because I use it on a couple different pages that require different URLs. You can easily strip off the function part and simply place it in a regular <script> block for use on a single page.
  • It is probably simplest to use the url template tag to fill in the action, which is the URL that receives the ajax data and does the server-side processing. I use the tag to construct the $url parameter to this function, but if you are removing the function wrapper, put the tag directly in action: or hard-code the URL you want as a string.
  • I use the onComplete callback to pass the returned json to another "success function" that parses the json and adds information to a table on the page. Again, this is not necessary, but I thought it would be useful to show how this could work. The upload plugin itself will say whether the file upload was a success based on the returned json.
  • jQuery is used to grab the appropriate part of the div. If you are not using jQuery use whatever method is appropriate for your system to get the file-uploader DOM element. Using regular Javascript you could do document.getElementById('file-uploader'), as Valum uses in the examples on his site.

Ajax Upload Modifications

I found the easiest way to get the CSRF token piece going was to modify Ajax Upload itself, unfortunately (I hate editing libraries, since the edits must be re-applied at each new release). Around line 1100 of fileuploader.js you will find the line "var form = ..." within UploadHandlerForm's _createForm method. Replace this line with the following:

var form = null ; 
if( params.csrfmiddlewaretoken )
{
  var csrf = '<div style="display:none"><input type="hidden" name="csrfmiddlewaretoken" value="' + params.csrfmiddlewaretoken + '" /></div>' ;
  form = qq.toElement('<form method="post" enctype="multipart/form-data">' + csrf + '</form>');
  delete params.csrfmiddlewaretoken ;
}
else
  form = qq.toElement('<form method="post" enctype="multipart/form-data"></form>');

All this code does is search for the CSRF token, and if it is present insert it into the form in the way Django expects to receive it.

The Server (Django) Side

Django URLs

It is best to have two views for this setup to work: one to display the upload page and one to process the upload file. First, the URLs:

url( r'^project/ajax_upload/$', ajax_upload, name="ajax_upload" ),
url( r'^project/$', upload_page, name="upload_page" ),

Views

First is the upload_page view, which is going to display the page with which the user interacts. This is a simple skeleton; add whatever your template needs.

from django.middleware.csrf import get_token
from django.shortcuts import render_to_response
from django.template import RequestContext
def upload_page( request ):
  ctx = RequestContext( request, {
    'csrf_token': get_token( request ),
  } )
  return render_to_response( 'upload_page.html', ctx )

Next is the view to handle the upload. Remember that this code must handle two situations: the case of an AJAX-style upload for the "advanced" mode and a form upload for the "basic" mode.

def save_upload( uploaded, filename, raw_data ):
  ''' raw_data: if True, uploaded is an HttpRequest object with raw post data
      as the file, rather than a Django UploadedFile from request.FILES '''
  try:
    from io import FileIO, BufferedWriter
    with BufferedWriter( FileIO( filename, "wb" ) ) as dest:
      # if the "advanced" upload, read directly from the HTTP request 
      # with the Django 1.3 functionality
      if raw_data:
        foo = uploaded.read( 1024 )
        while foo:
          dest.write( foo )
          foo = uploaded.read( 1024 ) 
      # if not raw, it was a form upload so read in the normal Django chunks fashion
      else:
        for c in uploaded.chunks( ):
          dest.write( c )
      # got through saving the upload, report success
      return True
  except IOError:
    # could not open the file most likely
    return False

from django.http import HttpResponse, HttpResponseBadRequest, Http404
def ajax_upload( request ):
  if request.method == "POST":    
    # AJAX Upload will pass the filename in the querystring if it is the "advanced" ajax upload
    if request.is_ajax( ):
      # the file is stored raw in the request
      upload = request
      is_raw = True
      try:
        filename = request.GET[ 'qqfile' ]
      except KeyError: 
        return HttpResponseBadRequest( "AJAX request not valid" )
    # not an ajax upload, so it was the "basic" iframe version with submission via form
    else:
      is_raw = False
      if len( request.FILES ) == 1:
        # FILES is a dictionary in Django but Ajax Upload gives the uploaded file an
        # ID based on a random number, so it cannot be guessed here in the code.
        # Rather than editing Ajax Upload to pass the ID in the querystring, note that
        # each upload is a separate request so FILES should only have one entry.
        # Thus, we can just grab the first (and only) value in the dict.
        upload = request.FILES.values( )[ 0 ]
      else:
        raise Http404( "Bad Upload" )
      filename = upload.name
    
    # save the file
    success = save_upload( upload, filename, is_raw )

    # let Ajax Upload know whether we saved it or not
    import json
    ret_json = { 'success': success, }
    return HttpResponse( json.dumps( ret_json ) )
  return HttpResponseBadRequest( "Only POST requests are accepted" )

And that's it, go have some fun!

***Edit: A few errors in the source have been fixed, thank you for your comments!***

Monday, November 29, 2010

Accessing guest's Apache server from the host in Virtualbox

To clarify the title of this post, I am running the latest Virtualbox (3.2.6) on a Windows 7 host with a linux (Kubuntu) guest.  I am using the guest as a development box for a website I am working on and would like to access it from Win7 to test with Internet Explorer (yuck).  This is something I've been wanting to get to work for a while but never really looked into it. Today I decided to break down and investigate.  Turns out it is a fairly simple process.

Option 1 - Port Forwarding

I first found this blog post that discusses setting up the proper port settings for Virtualbox.  I tried this and ran into some errors, including the one from the first comment.  Just copying and pasting from the site I started up VB and immediately got the errors.  I realized this is because I am using one of the Intel network adapters, not PCNet as in the example.  The secret here is figuring out the appropriate device name to use in the paths given by the post.  The way I found these was to look at the VBox.log file for the appropriate machine.  In it you will find many lines, but you're looking for level 2 devices that look something like this:

[/Devices/i8254/] (level 2)

In this case it is the device for the Intel network adapters.  So, adjusting the code from the blog, you would run each of these at the command prompt...

VBoxManage setextradata  YourGuestName "VBoxInternal/Devices/i8254/0/LUN#0/Config/apache/HostPort" 8888
VBoxManage setextradata YourGuestName "VBoxInternal/Devices/i8254/0/LUN#0/Config/apache/GuestPort" 80
VBoxManage setextradata YourGuestName "VBoxInternal/Devices/i8254/0/LUN#0/Config/apache/Protocol" TCPShutdown

One thing to be careful of here is the already mentioned errors.  These will prevent you from starting up the VM at all.  The most common error is verr_pdm_device_not_found, which stems from having the incorrect device for the current settings: using pcnet in the above code but having an Intel adapter chosen in the settings, or having an incorrect name in the path, say you missed the 'i' in i8254.  The way to fix this is to close out of VirtualBox--VMs and the program itself--and edit the xml configuration file for the machine.  It will have <ExtraDataItem> entries near the top corresponding to the settings you changed above.  Delete these three lines, start up VB, and your VM should now be able to start.

The blog goes on to say you should then use localhost:8888 to access the web server but I never got this to work.  I admit I did not try very hard, so the solution here could be a simple fix.  I'm guessing it is some issue with Windows 7 networking handling the localhost, which is not something I felt like messing with, so I found a different solution.

Option 2 - Bridged Adapter

The network settings on my guest were set to connect via NAT.  Changing to bridged, I was then able to browse to the guest's IP from my host (192.168.1.5 in this case) and everything worked fine.  Much simpler and no messing with VBoxManage.

Option 3 - Host-only Adapter

The Host-only Adapter will also work, but it will not give outside internet access to your guest.  The plus side is that no physical network device is required, but for me it is hard to imagine a computer without a network device in this day and age.



Overall Option 2 is the best as I see it: simple and effective.

Monday, November 8, 2010

Github: too many forks?

I recently created a Github account (check me out!) and have become familiar with the site because it hosts a lot of Django-based modules.  At first blush it seems like a programmer utopia:  the code is open source and easily viewable, forking is encouraged and there is a somewhat intuitive interface for following them, wikis and issues are automatically created for each project, and, thanks to git, forks are (usually) easy to merge.  There's a lot of users, a lot of activity, and a lot of prolific and talented programmers on github. 

So what's not to love?

Really, Github does seem to be the cat's meow of the open source world at the moment, with a nod to BitBucket, where Mercurial-based projects are hosted.  As I've said, I've only recently begun to use Github, and I've discovered it has some significant oddities for someone just joining the ecosystem.  I'm surprised there hasn't been more discussion of this; a quick search of the interwebs only really turned up this post by Andrew Wilkinson.

I'll summarize his points and then add some spice.
  • Coders are "rock stars" that are emphasized over their projects, and the interface is designed from a contributor's point of view.  He later points out that projects of any decent size typically have more users than contributors, so the interface is a bit quirky for someone just wanting to use the project.
  • Each fork gets its own issues and wiki, making it confusing where to discuss the project.
  • Determining which fork to use is not trivial.

I don't have much of an issue with the first point because it sounds like Github is working on it; they've even made some recent improvements on this front.

The second is definitely a problem, but I think a necessary one.  A fork potentially has its own features and bugs, so these need to be put somewhere.  However, anything not specific to the fork should be put in the wiki/issue tracker/whatever of the main project.  I think this would be less of an issue if the next point were fixed.

Finally, I come to the heart of my Github confusion:  which fork does one use?  By 'use' I mean either fork and contribute to or, as a user, download and install.  The developers have said they are working on a way to better identify the "main" project, but the solution is yet to be seen.

I'll show how I find the "main" project, and then examine the issue itself.  There are two "find main" methods that probably need to be used together.  The Network graph (read up on it here; you really need to understand these graphs to understand my later figures) is not the place I start, because it is relative to the current project; it will be used in a bit.  What I mean by "relative to the current project" is, if you're looking at a project that is, say, the fork of the original project, all the commits for the original up to the point that the fork occurred are put into the forked project's timeline.  Thus, I first try to find the "grandfather" or original project that started the chain.  I do this by following the "forked from" links until I get to the one that is not a fork of another.

Currently looking at dcramer's version of the project, a fork of robhudson's, which happens to be the grandfather.
After getting to the grandfather, I then look at the Network graph, which I consider clearer now because all forks show only their own commits.  Typically the grandfather is the "main" version of the project, but occasionally a grandfather will become dormant and a fork will become the main line of development.  Look to see whether the grandfather is still being updated and forked, and whether other forks are merging back.  If not, see if another fork has taken over this position.  If not, find the fork with the bug fixes and improvements that seem best, as it will probably become the "main" version as other people reach the same conclusion as you (hopefully).  Another thing that may be helpful is checking the number of watchers of a project/fork; this can give you an idea of its popularity.

Okay, so hopefully you can see this is about as clear as mud and a rather inexact science.  Wilkinson's "rock star" description is apt: projects are identified first by the coder--robhudson's django-debug-toolbar--rather than the project itself.  Admittedly this makes sense based on how git and forks work, but it leaves the interface muddied.  Whom do I trust?  robhudson or dcramer?  Side note: I'm glad people generally use their names or sensible nicknames as identifiers; if "l33tskillz393" had a fork I don't think I'd even give it the time of day.

To further clarify, let's look at some pictures of the Network graphs for a couple of projects I've looked at recently.

(django-pagination)

What I have identified as the grandfather and main branch is the line on the top.  There are more forks not shown here, but none below hgrimelid's have any "recent" commits.  It is pretty clear here that the grandfather branch is the "main" version of the project: past forks have either died or merged their changes back in (merges can be seen on the blue and neon green lines in the upper left).  There are some recent forks off the latest grandfather commit, possibly with important bug fixes or features, which makes the decision a bit less clear.  Go with the main branch and assume important changes will be merged in a future version, or go with a fork and hope it doesn't turn into a dead end?

Let's look at another with a slightly different situation.

(django-sorting)

The grandfather is again on top.  But this time the grandfather looks a bit outdated, no updates in months!  Looks like a very active project though, plenty of forks--and forks of forks!--being made and updated.  I do not think there is a clear choice.

Github has highlighted an unforeseen problem with distributed version control when the participants aren't under some guiding light, such as working for the same company.  The traditional model of a project is that there is some entity--a person, a committee, a company--that determines a project's versions, features, etc.  Distributed source control may be used to develop the project, but at some point someone says "this is the next version" and everyone trusts that authority.  This can be seen commercially in, say, how Microsoft releases new versions of Windows every so often.  In the more complicated world of distributed version control, look at what git was created to develop in the first place: the Linux kernel.  The kernel gets all kinds of forks and such, but in the end the idea is that they get merged back into the mainline, Torvalds baptizes it, and distributions push this authoritative version out to users.  In this traditional model, users and programmers really only care about the project as a whole, not the forks that went into it.

Github flips this on its head, I assume because of the "rock star" approach.  Each programmer is given equal stage and there is no definitive project.  Without an authority people go on their merry way and we get these spider webs of Network graphs.

I'm going to eagerly await Github's "find the main branch" solution.  My admittedly rather warm-and-fuzzy suggestion is to strongly encourage merging forks.  Traditionally forks have been a big deal because they split a project's development due to legal reasons or vision disagreements or whatever.  But the assumption is there is no other recourse and no reconciliation.  Perhaps this trained behavior is one reason for the lack of merges?  Anyway, forks usually make significant changes to a project that are not necessarily meant to play nice with the original project.  In the Github world forks are THE way to update projects.  Thus forks are not splits, they are the way you do even small-scale things like fix bugs and add minor features.  These are things that should bubble up to the main project, not languish in a soon-to-be-forgotten fork.  It seems from my own experience that people fork, fix the bugs they need for their personal use, and then forget about the project altogether.  If you look at the figures I've provided you see an abundance of forks, but merges are rare.

If Github could convince all the forkers to be mergers it would be a much happier place.

Thursday, June 10, 2010

Traders don't know jack (about technology)

As I was driving home today I caught a story on NPR about the practice of high frequency trading, or HFT. For those of you who don't know, HFT in a nutshell is some (supposedly, we'll get to this in a second) clever guys writing custom software that makes stock trades in microseconds based on trading volumes and such. The idea is to make fractions of a penny billions of times. It is an interesting topic and for more you should listen to the story.

What caught my ear, though, is that the story begins talking about the new server farm the NYSE is building in New Jersey that will handle all its transactions. It then goes on to talk about how the HFT people are falling over themselves trying to get office space as close to the NYSE building as possible or, even better, inside the building itself. The explanation is that the closer these guys get to the NYSE servers, the faster their programs can communicate with the NYSE, leading to better informed trades.

If you're a computer scientist something should sound amiss there. Closer to the NYSE servers makes your trades go faster? Buzzzz, wrong! This would most definitely be true if you had a direct connection to the NYSE, but I am assuming these guys don't; they're using the internet like the rest of us. This means their information packets have to go from their computers, out to their service provider to get routed, bounce around the interwebs, and finally come shooting back to the NYSE, which is only a few feet away from the original computer. Cell phones work in a similar manner, so if you're not a tech person and want a more concrete example, get someone in the same room as you and call their cell phone from yours. Note the delay between when the person says something and when you hear it coming through the phone. It takes time for the voice to travel from the phone to a tower, go through some routing and other towers, and then to the other phone, which is analogous to how internet traffic bounces around.

So, Wall Street proves its stupidity yet again. I think I'll invest with the guys that are building their office next to their ISP.

Monday, May 10, 2010

Simple Page Navigation -- why is it so hard?

I am a big fan of Men's Health but have come to really loathe their online articles. I began reading this article, and the first page required me to scroll a bit to read the text. One would think the "next page" sort of links would be at the bottom, thanks to the unofficial convention amongst major web sites. This is not the case, however: the next/previous buttons are at the TOP of the article, requiring you to scroll down to read the page (this is okay in my book despite the no-scrolling zealotry) and then scroll back to the top to move to the next page. Terrible HCI design. There needs to be plenty of room for creativity on the web, but I'm afraid the art departments are still winning the battle over the (probably non-existent) user experience folks.

I've been noticing poor navigation on plenty of other sites too so I'm not trying to pick on MH. The worst offenders seem to be "top 10" sort of lists. If I get the time, I'd like to dive into this a little more.