A Web service for random GitHub usernames, via Google BigQuery, R, and CouchDB

In the course of building some much-needed testing infrastructure for total-impact, I found I needed a source of random GitHub usernames. A forum post directed me to the very cool GitHub Archive project, which pushes its extensive collection of GitHub data to Google BigQuery. BigQuery in turn lets you write SQL-style queries on ginormous datasets like this one. After a quick BigQuery signup and a look at the schema, I had a list of One Million Usernames. Sweet.
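
For the curious, here's a rough sketch of pulling that list programmatically. It uses today's google-cloud-bigquery client (which postdates this post) and the legacy-SQL table reference from Michael's comment below; treat it as illustrative, not gospel.

    from google.cloud import bigquery

    # Assumes an authenticated Google Cloud project; the table reference
    # is the legacy-SQL one suggested in the comments.
    client = bigquery.Client()
    sql = (
        "SELECT actor FROM [githubarchive:github.timeline] "
        "WHERE actor IS NOT NULL AND actor != '' "
        "GROUP BY actor LIMIT 1000000"
    )
    job = client.query(sql, job_config=bigquery.QueryJobConfig(use_legacy_sql=True))
    usernames = [row.actor for row in job.result()]
    print(len(usernames))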

Unfortunately, BigQuery isn’t really set up to do lots of fast lookups on the same query (update: Or Is It?), which is what I needed. It does, though, let you download CSV, which I did. From there, the list of names went into R (here’s the code), where I got rid of duplicates and (with the help of this great post) uploaded the usernames to Cloudant, a cloud-based CouchDB service. Since CouchDB communicates entirely over HTTP, this essentially gives the dataset a RESTful API for free.
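
The dedupe-and-upload step is simple enough that a Python version fits in a few lines (the post itself used R for this). This sketch assumes a usernames.csv export from BigQuery and a hypothetical Cloudant database URL, and uses CouchDB's _bulk_docs endpoint to load documents in batches:

    import csv
    import requests

    DB_URL = "https://example.cloudant.com/github_usernames"  # hypothetical

    # Read the BigQuery CSV export and drop duplicates with a set.
    with open("usernames.csv") as f:
        names = {row[0].strip() for row in csv.reader(f) if row and row[0].strip()}

    # Each username becomes a document id; _bulk_docs loads one batch per POST.
    docs = [{"_id": name} for name in sorted(names)]
    for i in range(0, len(docs), 10000):
        resp = requests.post(DB_URL + "/_bulk_docs", json={"docs": docs[i:i + 10000]})
        resp.raise_for_status()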

Once the data was in Couch, writing a thin Python wrapper around the HTTP call was a piece of cake; essentially, all you have to do is query Couch’s _all_docs endpoint for the document id nearest to a randomly generated string (sketched below). All in all, a lovely afternoon’s work and a great example of how open APIs, cloud-based services, and open-source software can make slinging big data easy enough that even a grad student can do it :)
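
Here's roughly what that wrapper looks like, assuming the same hypothetical database URL as above and that each document id is a username:

    import random
    import string
    import requests

    DB_URL = "https://example.cloudant.com/github_usernames"  # hypothetical

    def random_username():
        # Pick a random lowercase key, then ask _all_docs for the first
        # document id at or after it (startkey is a JSON-encoded string).
        key = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
        params = {"startkey": '"%s"' % key, "limit": 1}
        rows = requests.get(DB_URL + "/_all_docs", params=params).json()["rows"]
        if not rows:
            # The random key sorted past every id; wrap around to the first doc.
            rows = requests.get(DB_URL + "/_all_docs", params={"limit": 1}).json()["rows"]
        return rows[0]["id"]

    print(random_username())

(Strictly speaking, this biases toward names that sit just after big alphabetical gaps, but for test fixtures it's plenty random.)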

4 Comments

  1. Michael
    Posted July 23, 2012 at 4:26 pm | Permalink

    I liked this post a lot – it’s an interesting use of the GitHub data.

    I slightly disagree with your statement that “BigQuery isn’t really set up to do lots of fast lookups on the same query.” I think that it’s more accurate to say that BigQuery is optimized for queries that require full table scans (such as numerical aggregations) versus ‘needle in a haystack’ searches for a single record in a dataset. However, depending on the amount of data you are processing, you can run a large number of very, very fast queries (perhaps the same query) concurrently, over huge datasets.

    For a simple query that returns a collection of unique GitHub usernames, you could run something like:

    SELECT actor FROM [githubarchive:github.timeline] WHERE actor IS NOT NULL AND actor != '' GROUP BY actor LIMIT 1000000;

  2. jason
    Posted July 25, 2012 at 10:12 pm | Permalink

    Michael: Thanks, interesting to learn more. You just exceeded the sum total of my BigQuery knowledge in one blog comment :) Edited the post.

  3. Scott Chamberlain
    Posted June 30, 2013 at 10:07 pm | Permalink

    wrt CouchDB from R: I started an R package to interact with CouchDB; it’s still early days: https://github.com/schamberlain/sofa

    Best, Scott

  4. jason
    Posted June 30, 2013 at 10:35 pm | Permalink

    Very cool, Scott! Quite useful.
