A Web service for random GitHub usernames, via Google BigQuery, R, and CouchDB

In the course of building some much-needed testing infrastructure for total-impact, I found I needed a source of random GitHub usernames. A forum post directed me to the very cool GitHub Archive project, which pushes its extensive collection of GitHub data to Google BigQuery. BigQuery in turn lets you write SQL-style queries on ginormous datasets like this one. After a quick BigQuery signup and look at the schema, I had a list of One Million Usernames. Sweet.

Unfortunately, BigQuery isn’t really set up to do lots of fast lookups on the same query (update: Or Is It?), which is what I needed. It does, though, let you download CSV, which I did. From there, the list of names went into R (here’s the code), where I got rid of duplicates and (with the help of this great post) uploaded the usernames to Cloudant, a cloud-based CouchDB service. Since CouchDB communicates entirely over HTTP, this essentially gives the dataset a RESTful API for free.
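For readers who want to see the shape of the cleanup-and-upload step, here is a minimal sketch. It is a Python stand-in, not the R code linked above: it removes duplicate usernames and builds the JSON body that CouchDB's _bulk_docs endpoint accepts. Using each username as the document _id is an assumption made for illustration; the post does not say how the documents were keyed.

```python
import json


def dedupe(usernames):
    """Return usernames with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for name in usernames:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


def bulk_docs_body(usernames):
    """Build the JSON body for a POST to /<db>/_bulk_docs.

    Assumption: each username becomes the document _id, so the database
    can later be searched by id.
    """
    docs = [{"_id": name} for name in dedupe(usernames)]
    return json.dumps({"docs": docs})


if __name__ == "__main__":
    print(bulk_docs_body(["octocat", "torvalds", "octocat"]))
```

The resulting string would be sent as the body of an HTTP POST with a JSON content type; the actual request and credentials are left out here.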

Once the data was in Couch, writing a thin Python wrapper around the HTTP call was a piece of cake; essentially, all you have to do is query Couch’s _all_docs endpoint looking for the document id nearest to a randomly-generated string. All in all, a lovely afternoon’s work and a great example of how open APIs, cloud-based services, and open-source software can make slinging big data easy enough that even a grad student can do it :)
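The lookup described above can be sketched roughly as follows. This is not the original wrapper: the function names and the example host are hypothetical, and the sketch assumes usernames are stored as document ids. It builds an _all_docs URL with a random startkey and limit=1, and includes a purely local stand-in (nearest_id) that mimics what the server would return, using Python's string ordering, which may differ from CouchDB's own collation.

```python
import bisect
import json
import random
import string
from urllib.parse import urlencode


def random_key(length=8, rng=random):
    """Generate a random lowercase alphanumeric string to use as a start key."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def all_docs_url(db_url, key):
    """URL asking _all_docs for the first document id at or after `key`.

    CouchDB expects startkey to be JSON-encoded, hence json.dumps.
    """
    query = urlencode({"startkey": json.dumps(key), "limit": 1})
    return f"{db_url}/_all_docs?{query}"


def nearest_id(sorted_ids, key):
    """Local stand-in for the server: first id >= key, wrapping to the start."""
    i = bisect.bisect_left(sorted_ids, key)
    return sorted_ids[i % len(sorted_ids)]


if __name__ == "__main__":
    # Hypothetical database URL, for illustration only.
    print(all_docs_url("https://example.cloudant.com/usernames", random_key()))
    print(nearest_id(["alice", "bob", "carol"], random_key()))
```

One caveat with this design: ids that follow a large gap in the key space are picked more often than ids packed closely together, so the sampling is quick but not perfectly uniform.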

An open response to Taylor and Francis

Update: they said no. Sad, but…not shocked.

Thanks for inviting me to contribute an “altmetrics” entry to the upcoming Fourth Edition of your Encyclopedia of Library and Information Sciences. I’d be delighted to do so, under the condition that I would be able to retain the copyright to my work. Failing that, another option would be a waiver permitting me to 1) self-archive the entry on my website, and 2) post it as an article on Wikipedia.

Of course this latter resource serves a much different audience than the ELIS, but it is an important one that nicely complements more traditional subject-specific encyclopedias. In a scholarly publishing world that increasingly values openness and societal value, I think this kind of publishing agreement is not only sensible, but also responsible and ultimately necessary.

Thanks again for the invitation, and I look forward to (hopefully) being able to contribute!
Best,
Jason

Toward a second revolution: Altmetrics, total-impact, and the decoupled journal [video]

Here’s video of my talk for Purdue University libraries (with thanks to them for filming and uploading it). I discuss how social media is transforming scholarly communication, how we can measure it with altmetrics, and how these metrics will inform algorithmic filters to fundamentally transform the way we communicate research. For a (much) faster version, you can also check out the slides.

total-impact awarded $125k sloan grant

[Reposted from our total-impact blog.  Heather Piwowar and I are co-PIs.]

We just heard: total-impact has been awarded $125,000 by the Sloan Foundation! What does this mean for users?  By April 1, 2013, we plan to hit important milestones in three areas:

Product:

  • addition of over a dozen new information sources to total-impact.org, particularly data repositories
  • 60 github watchers, 20 forks
  • substantial innovation in user interfaces for and visualizations of altmetric data

Use:

  • 50k visits to total-impact.org, 30k unique visitors
  • at least 100 scholars embedding or linking to TI reports on their CV
  • at least 25 TI reports included in annual review or tenure & promotion packages
  • 15 publishers/repositories embedding total-impact data on articles/datasets
  • 5 in-process or published research studies based on TI data

Sustainability: A sustainable business plan and organizational model for a mission-driven TI organization

For more detail, see the grant proposal. We are so excited!  Thanks to Josh Greenberg, program director.  You won’t regret it :) As always, let us know if you’ve got thoughts or ideas on how we can best make these goals happen.  Now let’s go change the world!

***

Want to follow total-impact?  check out (thanks to Heather for compiling this list):

Twitter and the new scholarly ecosystem

This is a copy of a guest post I wrote for the LSE Impact of Social Sciences blog:

In 1990, Tim Berners-Lee created the Web as a tool for scholarly communication at CERN. In the two decades since, his creation has gone on to transform practically every enterprise imaginable–except, somehow, scholarly communication. Here, instead, we lurch ponderously through the time-sanctified dance of dissemination, 17th-century style. The article reigns. Scholars continue to wad the vibrant, diverse results of their creativity and expertise–figures, datasets, programs, abstracts, annotations, claims, reviews, comments, collections, workflows, discussions, and arguments–into publishers’ slow molds to be cast into articles: static, leaden information ingots.

Growing numbers of scholars, though, are realizing that this approach is no longer the best we can do. We’re defrosting our digital libraries, moving over a million personal reference collections online to services like Zotero and Mendeley (and in the process making the open reference list a new kind of publication). Scholars are flocking to scholarly blogs to post ideas, collaborate with colleagues, and discuss literature, often creating a sort of peer-review after publication. Emboldened by national mandates and notable successes, we’re beginning to publish reusable datasets as first-class citizens in the scholarly conversation. We’re sharing our software as publications and on the Web. The journal was the first revolution in scholarly communication; we’re on the brink of a second, driven by the new diversity, speed, and accessibility of the Web.

The poster child for this Scholcomm Spring is Twitter. There’s been terrific interest in scholars using Twitter to discuss and cite literature, for teaching, to enrich conferences, or less formally as a “global faculty lounge.” We recently finished a large study to get better data on these uses.

Instead of asking for self-identified scholars on Twitter, we started out with a list of around 9,000 scholars from five US and UK universities, then searched for their names on the Twitter API. After manually confirming all the matches, we downloaded all the tweets each scholar had made and coded the content of these. The graphic below has some details of our findings (click for full-size image), but here’s a summary:

  1. Twitter adoption is broad-based: scholars from different fields and career stages are taking to Twitter at about the same rate.
  2. Scholars are using Twitter as a scholarly medium, making announcements, linking to articles, even engaging in discussions about methods and literature. But the majority of most scholars’ tweets are personal, underscoring Twitter as a space of context collapse, where users manage multiple identities.
  3. Only about 1 in 40 scholars has an actively-updated Twitter account. This may seem small, but keep in mind that Twitter’s only 5 years old; email was still a scholarly novelty 15 years after its creation. Taking the long view, the current count of scholars using Twitter is probably less important than its continued growth, which we see clearly.

Results like these are encouraging for those of us who see social media and related environments as the natural next frontier for communicating scholarship. It seems that scholars, without waiting for approval from the mandarins of the publishing industry, are beginning to explore and colonize the Web’s wide-open spaces.

But perhaps the most exciting thing about this nascent scholarly Great Migration is that the new, online tools of scholarship begin to give public substance to the formerly ephemeral roots of scholarship: the discussions never transcribed, the annotations never shared, the introductions never acknowledged, the manuscripts saved and reread but never cited. These backstage activities are now increasingly tagged, cataloged, and archived on blogs, Mendeley, Twitter, and elsewhere. As more scholars move more of their workflows to the public Web, we are assembling a vast registry of intellectual transactions–a web of ideas and their uses whose timeliness, speed, and precision make the traditional citation network look primitive.

I’ve been involved in early efforts to understand and use these new data sources to inform alternative metrics of impact, or “altmetrics.” Altmetrics could be used in evaluating scholars or institutions, complementing unidimensional citation counts with a rich array of indicators revealing diverse impacts on multiple populations. They could also inform new, real-time filters for scholars burdened by information overload: imagine a system that gathers and analyzes the bookmarks, pageviews, tweets, and blog posts from your online networks, using your interactions with them to learn and display each day’s most important articles or posts.

Even better, what if every scholar in the world had such a system? We might do away with journals entirely. The Web can disseminate and archive products for nearly free. The slow, back-room machinations of closed peer review could be replaced by an open, accountable, distributed system that simply listens in to expert communities’ natural reactions to new work–the same way Google efficiently ranks the Web by listening in to the crowdsourced “review” of the hyperlink network.

Of course, this particular vision may not pan out. And although the current signs point toward more growth, scholars might get tired of Twitter. But to hang our hopes on a particular vision or tool is to miss what’s truly revolutionary about this moment. The journal monoculture, long the only viable approach to scholarly communication, is beginning to yield at its fringes to a more diverse, vibrant, online ecosystem of scholarly expression. This new ecosystem promises to change not only the way we express scholarship, but the way we measure, assess, and consume it.

Open Access: 3 koans

1.

The teacher was sitting one day beneath a cherry tree, regarding the birds as they ate its fruit. A student approached the teacher and spoke: “Master,  I am afraid that if I make my research notes open, others will steal my good ideas.”

Instead of answering the student, the master turned and cursed the cherry tree: “You foolish tree! You labor to produce sweet cherries, only to have them stolen by these birds!”

The student was surprised at his teacher’s lack of wisdom, and rose to correct him: “But Master, surely you see that in taking the cherries, the birds also spread the tree’s seeds!”

At that moment the student was enlightened.

2.

Once, a student travelled a long way to speak with the teacher. “Teacher,” the student said, “I have heard your teachings and made all I create freely accessible to all. What is more, I have given it a non-commercial license, to ensure it will not be abused by evil, for-profit companies.”

The teacher responded, “go to the well in the middle of this town, draw out a cup of water, and bring it back here.” The student was surprised at this request, but followed the teacher’s instructions.

When the student returned, the teacher asked, “while you were waiting to draw water, what did you see?”

The student replied, “I saw a farmer getting water to give his livestock, a baker getting water to make bread, and a shopkeeper getting water to wash her windows. All three were prosperous and happy.”

“Very good,” said the teacher. “Now, taste the water. Does it slake your thirst, or not?”  The student tasted the water, and was enlightened.

3.

Once two students were in the midst of an argument as they sat down to eat with the teacher.

One student said, “I believe that true openness means copyleft: we must require anyone using our work to make it freely available in turn.”  The second student disagreed, saying, “No, openness is about making things easy to share and reuse; we should embrace the least restrictive licenses available.”

Before the argument could continue, the teacher interrupted. “Students,” the teacher asked,  “you see this pot of good food in front of us. Before we eat, I am curious: should it be called soup, or stew?”

“Master, it is stew,” answered the first student.  “No,” retorted the second, “this is too thin; it is soup.”  The master said nothing, so the students continued to argue over this; neither wanted to admit to being wrong in front of the teacher.

After some time, the first student turned and said, “Master, I do not think we can agree on this matter. Soup or stew, it is food, and we are all very hungry; may we at least serve the food and eat, that we may argue with our bellies full?”

Upon saying this, the student was enlightened.

my.altmetrics.org: alt-metrics for your CV

The PLoS alt-metrics study is moving well; we’ve transitioned to an open notebook built on GitHub (which is awesome and a good topic for another post) and findings are starting to emerge.

I’m starting to think about next steps, and to me the obvious one is to build a frontend for our crawler–giving working scholars and funders the opportunity to try out alt-metrics for themselves.

I’m posting some rough mockups (click ‘em to embiggen) of what a public alt-metrics machine might look like. This would be a place where people could upload a list of DOIs (or some other ID) and get a page that’ll let them track, visualise, and analyse the impact of their work in a broader and faster way than citations alone allow–sort of a Google Analytics for your CV.

I’m thinking it’d be cool to also have a way to embed results in another webpage, so you could make your actual CV show alt-metrics in real time. But there are a lot of directions to take this. If you’ve got suggestions, I’d love to hear!

Here’s the homepage. Pretty basic–just a place to upload identifiers:

 

And here’s the results page where the real action happens. This shows a result from a pretty large set of articles, like a lab or funder might input:

Has journal commenting failed?

It’s a great idea: take all the insights, suggestions, and criticisms on scholarly articles, the comments shared in journal clubs and scribbled in margins the world over, and make them accessible to everyone. Attach them to the article itself; make it a conversation, not an artifact. We have blog commenting, video commenting–why not article commenting?

That’s sounded good to a lot of publishers, and over the last five years, we’ve seen article commenting systems become pretty popular. But there’s a growing sense that article commenting isn’t working.

The bad

Gotzsche et al. (2010) look at author replies to BMJ’s “rapid response” comments. We’d hope the chance to interact with authors would be a big plus for article commenting; however, they found that even when comments could “invalidate research or reduce… reliability,” over half the time authors couldn’t be bothered to respond.

In another study, Schriger et al. (in press; thanks Bora) examine the prevalence of commenting systems in top medical journals.  They report that the percentage of journals offering rapid review has dropped from 12% in 2005 to 8% in 2009, and that fully half the journals sampled had commenting systems lying idle, completely unused by anyone. The authors conclude, “postpublication critique of articles in online pages provided by the journal does not seem to be taking hold.”

Finally, I collected data on PLoS comments as part of a larger investigation of alt-metrics. As evident from the graphic, the number of articles with comments has held more or less steady as the total articles published has grown: again, not a pretty picture for those of us excited about article commenting.

The good

I’m not ready to give up on comments yet, though, because I think there’s a different way to see these findings. The question shouldn’t be “have comments failed,” but “are they succeeding somewhere, and why?”  After all, we’re still in the very early stages of this thing; change in scholarly communication so far has happened on a scale of centuries.

Active, widespread commenting would be a radical change in how scholars communicate, and as with all fundamental shifts, we can assume most early efforts will be failures. In the 1900s, way more automobile manufacturers went broke building lousy cars than flourished making good ones. So in looking at comment ecosystems, we shouldn’t be stuck ogling the crowd of inevitable false starts–we should be trying to spot the nascent Model T.

And when we do see venues where comments are disproportionately successful, we should be trying to figure out what they’re doing right. While half the sample of the Schriger et al. study are stuck without a single commented article, BMJ, CMAJ, and Ann. Intern. Med. all have comments on 50-76%. How are they different? The BMJ articles sampled by Gotzsche et al. had a mean of 4.9 responses each, which is pretty respectable. Why are these here, but not elsewhere?

In the case of PLoS, we can see that even journals from the same publisher and on the same platform show widely different commenting rates. Is it the editors, the nature of the field, or something else that’s making PLoS Biology’s comment rate climb as PLoS Genetics’ holds steady and PLoS ONE’s drops?  This is a great opportunity for research that will help commenting evolve further.

The future

So I think that while we see cases where journal commenting is beginning to succeed, we should continue to put resources behind spreading that success. This said, I have to admit I’m doubtful that publisher-hosted commenting is the future.

Today we have two scholarly communication ecosystems: the formal, peer-reviewed one, and the shadow system encompassing everything from scribbled marginalia, to chats in the lab, to peer reviews themselves. Sooner or later, I believe the shadow ecosystem will migrate to the web; a detailed argument for why is a different post, but there are too many advantages. It’ll happen. The advance guard is already conversing, learning, and collaborating on Zotero, Mendeley, CiteULike, blogs, Twitter, and so on.

Publisher-hosted article commenting is the formal system’s bid to gain a foothold in the informal system as it moves online. And it’s a smart bid, because as the shadow system sheds its ephemerality, it’s going to become increasingly important to how we measure and do scholarship.

But the problem is that journal-based comments are as siloed as the articles they comment on; there’s limited exposure, and no community. Scholars will want to have their conversations with their people, in their ways, in their places.  Today, that mostly means Twitter and blogs (as we saw in #arsenicLife); in the future, it may also be scholar-specific services like The Third Reviewer, COASPedia, or VIVO.

So while I support article commenting as it now exists, I think the challenge of the future won’t be moving the shadow communication system online–it’ll be aggregating it so it can be consumed, measured, and filtered efficiently and meaningfully. I think alt-metrics will play a part in that, but again, that’s another post :)

References:

Here’s the dataset and R code for the PLoS graphics; I hope to be releasing the full data next week.

Gotzsche, P. C., Delamothe, T., Godlee, F., & Lundh, A. (2010). Adequacy of authors’ replies to criticism raised in electronic letters to the editor: cohort study. BMJ, 341(aug10 2), c3926-c3926. doi:10.1136/bmj.c3926

Schriger, D. L., Chehrazi, A. C., Merchant, R. M., & Altman, D. G. (In press). Use of the Internet by Print Medical Journals in 2003 to 2009: A Longitudinal Observational Study. Annals of Emergency Medicine, In Press, Corrected Proof. doi:10.1016/j.annemergmed.2010.10.008

MEDLINE literature growth chart

We all know the volume of scientific literature is growing.  I went looking for an infographic showing this, but wasn’t satisfied with what I found, so I made one, based on the publication dates of articles in MEDLINE.

I got the data by searching PubMed with the query
("[year]"[Publication Date])
where [year] was each year from 1950 to 2009. Then I charted the results in R and resized them in Photoshop.
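As an illustration of the per-year querying, here is a small sketch that generates the sixty search terms and, for anyone who would rather script the counts than run the searches by hand, a matching NCBI E-utilities URL for each. The post does not say how the searches were actually run, so the use of E-utilities (and the helper names) is an assumption, not the original workflow.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def year_query(year):
    """PubMed search term for everything published in a given year."""
    return f'("{year}"[Publication Date])'


def count_url(year):
    """E-utilities URL asking only for the number of matching records."""
    params = {"db": "pubmed", "term": year_query(year), "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"


years = range(1950, 2010)  # 1950 through 2009 inclusive
queries = [year_query(y) for y in years]

if __name__ == "__main__":
    for y in (1950, 2009):
        print(year_query(y), count_url(y))
```

Fetching each URL and writing out year and count as two columns would give a table of the kind the R script below reads.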

The data, R code, and images are all CC0 (public domain), and can be used wherever and for whatever you fancy.

small version of graphic

num-medline-articles-published-by-year.txt

# Setup.
pub <- read.table("path_to_data_file", header=TRUE)
par(cex=2.2) # controls the relative size of the text
mainTitle <- "MEDLINE-indexed articles\npublished per year"

# Make the plot.
# See http://www.harding.edu/fmccown/r/ for a nice intro on plot options.
plot(pub, main=mainTitle, ylab='', xlab='', type="l", axes=FALSE,
     col='red', lwd=6, ylim=c(0,1000000))

# Label the axes.
axis(1, at=seq(1950, 2010, 10), labels=seq(1950, 2010, 10))
labs <- c('','','200k','','400k','','600k','','800k','','1M') # quick and dirty labels...
axis(2, at=seq(0, 1000000, 100000), labels=labs, las=2) # las=2 draws labels perpendicular to the axis (horizontal on the y axis)

Scientometrics 2.0

I’m excited that I’ve had two papers accepted this week: “Scientometrics 2.0: Toward new metrics of scholarly impact on the social Web,” with Brad Hemminger, and “How and why scholars cite on Twitter” (online soon) with Kaitlin Costello.

What’s special about these two papers is that they are the start of a research project that I hope will become my dissertation, an idea I’m somewhat reluctantly calling “scientometrics 2.0.” (do we really need more 2.0s?) Scientometrics is

…the science of measuring and analysing science. In practice, scientometrics is often done using bibliometrics which is a measurement of the impact of (scientific) publications. (Wikipedia)

My idea is that we should be looking beyond this, and starting to mine Web 2.0 sources for signals of scholarly impact. There are a few big advantages to this approach:

  1. It’s much faster.  Once a scholarly article is published, it takes years for citations to that article to accumulate.  But it can take just days for, say, Diggs or tweets to show up: in our Twitter sample we found that nearly half the links to peer-reviewed articles appeared within a week of those articles’ publication.  This speed could be harnessed to make real-time, personal filters that inform scholars what’s groundbreaking across a broad set of fields. As the velocity and volume of science grow, this could be very valuable.
  2. If I cite something, it probably had an impact in my work.  But what kind of impact?  What if I read it and talked about it, and it informed my general thinking–but not enough to cite?  Just looking at citations, we’re missing many other kinds of impact.  Ten years ago, this was the best we could do.  But today, scholars are using online tools like CiteULike, Mendeley, and Zotero to manage their libraries; Faculty of 1000 to review articles;  and Twitter, FriendFeed, and ResearchBlogging.org to discuss them.  Tools like these–and importantly, the open APIs many of them offer–allow us to lift the curtain and observe scholars in their native habitat.  Scientometrics 2.0 offers a chance for us to develop a richer, more nuanced picture of scholarly impact.
  3. Finally, this approach allows us to break the centuries-old monopoly of the peer-reviewed article or monograph on scientific communication.  We can measure reactions not just to these articles, but also to blog posts, datasets, or videos.  If a certain blog post in your field is generating lots of buzz, there’s a good chance it’s worth your time.  Scientometrics 2.0 can support a sort of informal, “soft peer-review” that works for free, on everything.

At first, this approach will mostly be used for relatively “pure” academic study–learning more about how scholars communicate and how impact is transmitted.  Soon, however, young scholars will start making a case to tenure and promotion committees that their heavily tweeted or bookmarked article should count in their favor. Ultimately, I think we’ll see tools that leverage this information to help direct scholars to the most important and relevant work for them, kind of a PostRank for academics.

Of course, there are some obstacles to this.  The most important one for now is getting people to trust that these alternative sources really mean anything.  Who cares if an article is tweeted a lot?  Won’t people game this?  What about scholars who don’t use social media (a majority, for now)?  These questions have answers, but they need to be taken seriously (see the articles for more detailed discussions).

Ultimately, scientometrics 2.0 is going to have to be something we investigate very carefully, and in the proper context.  However, in that context I think it has the potential to be quite valuable, and I’m excited about working toward this in the next several years.

(Note: for a bunch of relevant citations, see the first article.)