This work was presented as a poster at the American Auditory Society’s Annual Scientific and Technology Conference, 2-4 March 2017, Scottsdale, AZ. You can download a copy of the poster here.
The National Institutes of Health have recently undertaken steps to enhance rigor and reproducibility of biomedical research. For example, grant applications are now required to explicitly address how the proposed research will achieve robust and unbiased results. Reproducibility is an integral part of science as it allows for research findings to be either validated or refuted and lets scientists successfully build upon previous research. Only when multiple studies, conducted by different scientists, demonstrate similar results can we obtain a reasonable approximation of how the world works. In this blog post I will review some useful tools and discuss some good practices that support transparency and reproducibility.
To ensure the reproducibility of your research, you just need to follow these five simple rules:
- Pre-register your study
- Document everything you do
- Don’t do anything by hand, script everything
- Use a version control system
- Provide open access to all publications, scripts, and data
Following these five rules will not only help other researchers validate or refute your findings, it will also make your research less error-prone, enhance your efficiency and productivity, and make your collaborations more effective. Moreover, it sends a clear signal of transparency and trustworthiness.
So how should we go about following these rules? In the sections below I will discuss some good practices and provide some useful tools that will help enhance the transparency and reproducibility of your research. I will also point you to some useful training resources.
Pre-register your study
A few years ago, a group of scientists wrote an open letter to the Guardian newspaper advocating for study pre-registration in scientific publishing. The idea is that your study design and analysis details are peer reviewed before you start data collection. This helps distinguish exploratory from confirmatory analyses, guards against p-hacking or cherry-picking of data and analyses, and overcomes publication bias.
For a list of journals that currently offer pre-registration, have a look here. Most hearing science journals do not yet offer pre-registration. Let’s convince them!
Even if your journal of choice does not give you the option to submit your study design and analysis plan for peer review, you can still create a registered report via:
Document, document, document
It is incredibly important to carefully document every step of the research process. Keep track of precisely why, when, and how you conducted your research. Instead of using a paper notebook or a collection of word documents, use an electronic lab notebook. This allows you to store all your research output in one place, access your notebook wherever you are, search entries, and share your notes with your collaborators.
There are a bunch of different electronic lab notebooks out there. Some are free, others are quite costly. Some allow you to store large volumes of data, others are more basic. Find one that best suits your needs. Some electronic lab notebooks that may be worth exploring (in no particular order) are:
- Open Science Framework
- Microsoft OneNote
It is also incredibly important to provide detailed documentation for your scripts and your data. If either you or someone else ever wants to re-use it, they’ll need all the help they can get to make sense of your scripts and data files.
Avoid manual data processing
In order for research to be fully reproducible, you have to script absolutely everything. If you manipulate your data by hand at any point, it is difficult to replicate exactly what you did at a later stage. And it may be practically impossible for anyone else to reproduce your results.
Use R (instead of SPSS) for your statistical analyses. You can download both R and RStudio - a graphical user interface for R - for free. Software Carpentry offers a short course that walks you through the basics of Bash. Interested in learning Python? There are a lot of courses and books out there, such as Learn Python the hard way or MIT’s Introduction to computer science and programming using Python.
In general, some great websites that provide online programming courses are:
Feeling tired just thinking about the idea of having to spend hours working your way through an online course? The best way to learn is just to get your hands dirty. So for your next data analysis, just open up RStudio, or the Bash Shell and figure things out as you go along. As always, Google is your friend, as is stackoverflow.
Automatically embed stats in your manuscripts
The next step in making your research reproducible, is to integrate your data, code, and manuscript. That way you can trace back the results you report in your paper to the underlying code and data. Automatically embedding your statistical output in your manuscripts will avoid the risk of any copy-paste errors when reporting your results. Another advantage is that you can automatically update your graphs and statistics after you have re-analysed your data.
There are some nifty solutions out there to automatically embed statistical output in your manuscripts. For example, if you are a Microsoft Word user, you can download StatTag, a free plug-in that allows you to embed statistics, tables, and figures into Word and update your results with one button press. Alternatively, you can write dynamic documents directly from RStudio, using R Markdown (which in turn uses the knitr and pandoc R packages). Have a look at some of the things that are possible with R Markdown here.
Getting started with R Markdown is fairly straightforward. Here are some useful resources to help you get started:
- R Markdown cheat sheet
Formatting is of course very important when preparing manuscripts for publication. Instead of using Microsoft Word, you could use LaTeX and integrate your statistical output using Sweave and knitr. However, even if you’re not a LaTeX user, you can prepare nicely formatted manuscripts in RStudio. For information, have a look at this tutorial and consider using the rticles R package.
In addition to generating reports and manuscripts, it is possible to create interactive web applications in RStudio using shiny. You can show your data on the web and allow people to play with a set of parameters to see what effect they have on your results. For some fancy examples, have a look at this showcase.
Use a version control system
Have you ever had multiple versions of a certain file floating around on your computer, having forgotten which version was the ‘right’ one? This can be particularly problematic if months or years down the line someone asks if you’d be happy to share the code you used in your paper. The solution to this issue is to use a version control system, such as Git. You can use it to keep track of changes in your documents and scripts by storing revisions of your files in a centralized repository. Using a version control system you can easily compare, restore, and merge different versions of files. What is particularly nice is that you can undo changes you made to a given script and its dependent files simultaneously. Rather than having to try to remember which files were changed in conjunction with one another, you can undo the changes to these files all at once.
To start using Git as your version control system, you need to do two things:
Git itself serves as the local version control system on your computer. GitHub and GitLab provide centralized repositories where you can manage and share your files with your collaborators or the world.
Repositories on GitHub are in principle public, which means that anyone can look at and download your code (people can’t make changes to your code unless they have your permission). If you are not ready to share your work, you can make your repository private. Usually you have to pay to host private repositories on GitHub. However, if you have a university email address, you can request an educational discount and have as many private repositories as you want, for free. Instead of using GitHub, you can also create an account with GitLab. You can have an unlimited number of private repositories on their site.
There is a bit of a learning curve to getting started with Git, but it is well worth it! Here are some useful resources to get you on your way:
Publish open access
There has been a recent push for open access publishing. An increasing number of funding agencies, such as the NIH, the Research Councils UK, and the Wellcome Trust have established policies for open access publishing.
There are many advantages to making your publications publicly available. It drives innovation on a global scale, gives clinicians access to the latest research findings, and gives you increased visibility and citations. For an excellent overview of why open access publishing is so important, have a look at this video rom “Piled Higher and Deeper”, by Jorge Cham (phdcomics.com).
There are a few different options to provide open access to your publications.
- Pay an article processing charge
- Deposit paper in an open access repository after an embargo period
- Submit the post-print version of the paper to an online repository
- Post PDFs of your publications on your personal website
Provide open access to your scripts
And important part of enhancing the transparency and reproducibility of your research is to share your code, along with some (example) data and documentation. This enables others to exactly replicate your analyses. Your code can then also be used in future research. It may even lead to some citations!
You can of course share your code and supporting documents as a bunch of individual files. However, a nice way to distribute your code is to create an R package to go along with your publication. Everything will be nicely packaged up together. All people need to do is install your package and they can play with your code and data in R. You can create an R package in RStudio using the devtools package. For some more information on how to go about it, have a look at this tutorial.
If you’re using a version control system, like GitHub, the easiest way to share your scripts or R package is to link to your GitHub repository in your publication. You can even make your repository citable by assigning it a doi. You can learn more about how to do that here.
It might even be worth writing a paper specifically about your analysis methods and code in journals, such as:
Provide open access to your data
Last but not least, in addition to making your publications and code open access, it is important to provide open access to your de-identified data. This enhances the transparency of your research, helps preserve your data for the future, and enables reuse (including meta-analysis) of your data. It is of course important to keep HIPAA compliance and any intellectual property rights in mind when deciding how to go about sharing your data.
When you archive your data, it is important to include detailed documentation so that you and others can make sense of all the different parts of your data set. In a few months or years, it is quite likely that you won’t remember what all the different fields in your excel file refer to.
Deposit your data and documentation in an online repository, such as:
- Harvard Data Verse
- Dryad Digital Repository
- Open Science Framework
- Dat Project
- A repository hosted by your institution, funder, or journal
Acknowledgments: Many thanks to Pamela Souza for making this work possible. Thanks to José Joaquín Atria and Sriram Boothalingam for inspiring discussions on the topic. Work supported by NIH.