In the course of building some much-needed testing infrastructure for total-impact, I found I needed a source of random GitHub usernames. A forum post directed me to the very cool GitHub Archive project, which pushes its extensive collection of GitHub data to Google BigQuery. BigQuery in turn lets you write SQL-style queries on ginormous datasets like this one. After a quick BigQuery signup and look at the schema, I had a list of One Million Usernames. Sweet.
Unfortunately, BigQuery isn’t really setup to do lots of fast lookups on the same query (update: Or Is It?), which is what I needed. It does, though, let you download CSV, which I did. From, there the list of names went into R (here’s the code), where I got rid of duplicates and (with the help of this great post) uploaded the usernames to Cloudant, a cloud-based CouchDB service. Since CouchDB communicates entirely over HTTP, this essentially gives the dataset a RESTful API for free.
Once the data was in Couch, writing a thin Python wrapper around the HTTP call was a piece of cake; essentially, all you have to do is query Couch’s all_docs endpoint looking for the document id nearest to a randomly-generated string. All in all, a lovely afternoon’s work and a great example of how open APIs, cloud-based services, and open-source software can make slinging big data easy enough that even a grad student can do it :)
2 Comments
I liked this post a lot – it’s an interesting use of the GitHub data.
I slightly disagree with your statement that “BigQuery isn’t really setup to do lots of fast lookups on the same query.” I think that it’s more accurate to say that BigQuery is optimized for queries that require full table scans (such as numerical aggregations) versus ‘needle in a haystack’ type searches for a single record in a dataset. However, depending on the amount of data you are processing, you can run a large number of very, very fast queries (perhaps the same query) concurrently – over huge datasets.
For a simple query that returns a collection of unique Github usernames, you could run something like:
SELECT actor FROM githubarchive:github.timeline WHERE actor IS NOT NULL and actor != ” GROUP BY actor LIMIT 1000000;
Michael: Thanks, interesting to learn more. You just exceeded the sum total of my BigQuery knowledge in one blog comment :) Edited the post.