On Unpublished Software

09Feb12


I ran across this post at The Tree of Life entitled “Interesting new metagenomics paper w/ one big big big caveat – critical software not available”.

The long and short of it? A paper appears in Science with a fancy new methodology, but without the software anyone else would need to use that methodology. The blog author is understandably annoyed. But I have some sympathy with the authors of the paper itself, much as I prefer the code for an analysis to be available at publication. My thoughts after the jump.

First, this post was born as a comment on the original blog post. For some reason, Blogger hates my WordPress login.

Second, I understand the spirit of the post. In an ideal world, the software would be available. From a stable URL. For every single platform under the sun. Or maybe even a web interface, if the project was feeling particularly fancy. This is How It Should Be ™.

That being said, there are reasons that doesn’t always happen. I’ve had one collaborator essentially decide not to give out code because further analysis – that would appear in future papers – was already baked in, and they weren’t going to go through their code line by line to make sure some poor grad student’s project didn’t get scooped by someone reading the code carefully, or that in removing that stuff, they didn’t manage to otherwise break the software.

But let’s leave that aside for the moment, and say – as the original published paper did – that the purpose of all this is a new technique, with new software, that we’re hoping people use. There are still reasons for the paper to come out and the software not be available yet – legitimate reasons. The development of software and its use in science, while very closely linked, are actually separate processes that need not progress at the same pace. Some issues that have happened to me:

Beautiful software, ready to go, sitting idle for months waiting for the right numbers to come in to make it usable.

Output from software that is interesting science on its own, but the software isn’t ready for primetime. Maybe it’s got a hideous command line interface with a dozen opaque arguments that appear in no logical order. Maybe the quick and dirty solution that produced interesting results for several datasets is inefficient enough that it needs more memory than something like it should. Maybe the documentation is a series of ad-hoc scribbles on a whiteboard. Or maybe it works on your machine, works on a student’s machine, but utterly breaks the first time you try it on a colleague’s. Or perhaps you simply want to make it better, and while the science is ready to go and won’t improve by festering for six months, in six months you could have a GUI. Or better performance. Or cross-platform software. Or all of the above.

I can understand the blogger’s frustration. And papers that do this should absolutely both provide enough methods detail that you could write your own software if you had the inclination, and focus on those methods, not mysterious code you don’t have access to. But when I read a paper where there’s clearly software to be had, but it’s not available yet, my first thought is “What went wrong?”



4 Responses to “On Unpublished Software”

  1. Justin Rising

    “I’ve had one collaborator essentially decide not to give out code because further analysis – that would appear in future papers – was already baked in, and they weren’t going to go through their code line by line to make sure some poor grad student’s project didn’t get scooped by someone reading the code carefully, or that in removing that stuff, they didn’t manage to otherwise break the software.”

    This is the perfect example of something that’s easy to avoid by properly using a version control system (e.g., svn, git). Check in the version that went into the original paper, make a tag/branch for it, and then you can make whatever modifications you want while still always having access to something that’s suitable for release.

  2. I read Jonathan’s post too and had no sympathy for the authors. Neither, though, did I have antipathy toward them – they are not the issue. What’s at fault here are the journals and the reviewers. Until they enforce some standards, authors will get away with bad practice. How a reviewer can declare the findings of a study to be valid without access to the analyses used is completely beyond my comprehension.

  3. RSingh

    The practice of not including the code seriously hampers the dissemination of the published work.

    Researchers also have limited time and the lack of code is an impediment to learning or utilising the published technique.

    It should be made mandatory that the code is included, either within the paper or reference to a link on a website.

    • I’m not sure I agree. While I do appreciate when researchers make their code public, I’ve never actually found myself needing to rely on someone else’s code for a research project. If a paper relies on and focuses entirely on the software, I might understand, but I’ve always argued that the baseline level of presentation should be that the code can be rebuilt, not merely that it exists. This is, if for no other reason, important to prevent the propagation of inefficient algorithm design, coding errors, or other design flaws.

      Indeed, the one time I tried using someone else’s code, the amount of trouble I went to trying to shoehorn their software into my question entirely offset the time saved not coding it myself.

