In the course of building some much-needed testing infrastructure for total-impact, I found I needed a source of random GitHub usernames. A forum post directed me to the very cool GitHub Archive project, which pushes its extensive collection of GitHub data to Google BigQuery. BigQuery in turn lets you write SQL-style queries on ginormous datasets like this one. After a quick BigQuery signup and look at the schema, I had a list of One Million Usernames. Sweet.
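The query itself is short. Here's a sketch of roughly what it looks like — note that the table and column names (`githubarchive:github.timeline`, `actor`) are my assumptions based on the old GitHub Archive schema, and the client call assumes the google-cloud-bigquery library:

```python
# Sketch: pull distinct GitHub usernames from the GitHub Archive
# dataset on BigQuery. The table and column names here are assumptions
# based on the old githubarchive "timeline" schema.
USERNAME_SQL = """
SELECT actor
FROM [githubarchive:github.timeline]
GROUP BY actor
LIMIT 1000000
"""

def fetch_usernames(client):
    """Run the query with a google-cloud-bigquery Client.

    The bracketed table syntax above is legacy SQL, so the job
    config has to say so explicitly.
    """
    from google.cloud import bigquery
    job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
    return [row[0] for row in client.query(USERNAME_SQL, job_config=job_config)]
```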
Unfortunately, BigQuery isn’t really set up to do lots of fast lookups against the same query (update: Or Is It?), which is what I needed. It does, though, let you download CSV, which I did. From there, the list of names went into R (here’s the code), where I got rid of duplicates and (with the help of this great post) uploaded the usernames to Cloudant, a cloud-based CouchDB service. Since CouchDB communicates entirely over HTTP, this essentially gives the dataset a RESTful API for free.
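The upload step really is just HTTP: CouchDB accepts batches of documents at its _bulk_docs endpoint. A minimal Python sketch of that step — the database URL is hypothetical, and using the username itself as the document _id is my assumption (it has the nice side effect of deduplicating for free):

```python
import json
import urllib.request

def bulk_docs_payload(usernames):
    """Build a CouchDB _bulk_docs payload, one doc per unique username.

    Using the username as the _id is an assumption on my part; since
    _ids must be unique, it also handles deduplication.
    """
    unique = sorted(set(usernames))
    return {"docs": [{"_id": name} for name in unique]}

def upload(db_url, usernames):
    """POST the batch to {db_url}/_bulk_docs (db_url is hypothetical)."""
    body = json.dumps(bulk_docs_payload(usernames)).encode("utf-8")
    req = urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```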
Once the data was in Couch, writing a thin Python wrapper around the HTTP call was a piece of cake; essentially, all you have to do is query Couch’s _all_docs endpoint looking for the document id nearest to a randomly-generated string. All in all, a lovely afternoon’s work and a great example of how open APIs, cloud-based services, and open-source software can make slinging big data easy enough that even a grad student can do it :)
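The lookup trick above can be sketched in a few lines: hit _all_docs with a random startkey and limit=1, and take whatever id comes back. The database URL here is hypothetical, and note that _all_docs expects startkey to be a JSON-encoded string:

```python
import json
import random
import string
import urllib.parse
import urllib.request

def random_startkey(length=7):
    """A random lowercase string, to land somewhere in the id space."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def build_lookup_url(db_url, key):
    """_all_docs wants startkey as a JSON string, hence the json.dumps."""
    params = urllib.parse.urlencode({"startkey": json.dumps(key), "limit": 1})
    return db_url.rstrip("/") + "/_all_docs?" + params

def random_username(db_url):
    """Fetch the doc id at or just after the random key."""
    with urllib.request.urlopen(build_lookup_url(db_url, random_startkey())) as resp:
        rows = json.load(resp)["rows"]
        # If the random key sorts past every id, roll again.
        return rows[0]["id"] if rows else random_username(db_url)
```

One caveat of this approach: because the usernames aren't evenly spread through the key space, the sampling isn't perfectly uniform — close enough for generating test fixtures, though.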