Why not share our data with other scientists? To stimulate multidisciplinary research? To accelerate scientific output by allowing more eyes to look at the data? To have fun together? In a comment published last week in the New England Journal of Medicine, the authors argued that data from randomized clinical trials need not be freely accessible at the time of first publication. They stated that the “data owners” should be allowed a delay of up to five years after the first publication. In the NRC of last Saturday, Rosanne Hertzberger writes about the lack of an open culture in science and mentions the comment in this context.
Complexity of the data is the first argument against data sharing given in the comment. What do the “data owners” mean by complexity of the data? Complexity of study design, sampling scheme and structure? Or the underlying biology represented by the data? Or both?
As a biostatistician I have often encountered these issues around data sharing: in biomedical projects, where my colleagues would like to add data from another study to improve the power, and in my own research projects, where I like to illustrate new methods with exciting datasets.
Statistical research aims to develop methods that efficiently analyze data to infer relationships between variables in the whole population from which the study subjects were sampled. Statistical inference can be quite challenging because of non-random sampling, measurement error, the model assumptions that have to be made, etc.
In a perfect world, the new methodology and the results of analyses of new data would be published together. Typically, however, data are first analyzed with existing but imperfect methodology and published before the methodological paper. Sometimes a methodological paper is published first but with a less interesting dataset. Statisticians might even have to squeeze the data to fit the statistical problem of the not-yet-available new dataset.
One of my research topics is statistical methodology for family data. Over the last 10 years, I have worked very hard to start collaborations with many researchers who collect data from families. I wrote a proposal for a personal grant to develop new methods for family data and included an impressive table of family studies with relevant old and new datasets. The life scientists on the committee wondered whether sufficient data were available for my project…
What does the committee mean? Of course there is sufficient data.
I did not get the fellowship.
The committee wrote about me: We are not convinced that she is able to deal with the data???!!
Why not? Is the data too complex for me?
Shockingly, the argument used by the committee is similar to the one used against data sharing in the comment. The committee, however, cannot have meant the complexity of the data structure and study design. Obviously, as an experienced statistician I know much more about how to deal with these issues than the life science committee members themselves.
So what did they mean? Did they mean that I do not understand the biology represented by the data? Or the real problems in health care?
That might be. I am very much interested in biology, but I am not a biologist. I am very much interested in clinical applications of my methodology, but I am not a clinician. However, I always discuss the relevance of the methodology to be developed with the data owners. I share the results of analyzing the data with my new method with the “data owners” before publication. On most of my papers the “data owners” are coauthors. So why worry?
The committee members did not ask me about the role of data in statistical research, and the role of data and statistical research in life sciences. They assumed that they know. Do they?
Just as complexity of data structure cannot be an argument against sharing data with statisticians, complexity of biology cannot be an argument against sharing data with other life scientists.
I think that most (bio)medical researchers simply do not want to share “their” data. It is their little baby.
Besides the problem that not sharing data limits what the data can contribute to science, there is one more problem. By being busy sitting on your data, you are less aware of opportunities in other datasets where you could contribute your specific expertise. You have no idea how much data there is in this world.
For sure, comments such as “complexity of data” and “a statistician is not able to deal with the data” are very weak arguments. Scientists should be able to explain the complexity of the data to statisticians; otherwise they should not have started the study in the first place. In the first publication (preferably in an open access journal), scientists should explain the complexity of the data to the scientific world and give other researchers access to the data. I agree with Rosanne: The more people work on your data the better!!