A screening program tailored to my risk profile. Receiving only treatments from which I will benefit. That is what we all want. Can big data help to achieve this goal? Yes, it can. The reason is that big data allow the identification of homogeneous patient groups. Within a group, patients will benefit from the same specific treatment. Across groups, patients will probably need different treatments.

Within the research domain, studies are often underpowered, and underpowered studies yield inconclusive results. What is the reason for this? Studies aim to identify biological mechanisms or to assess the effect of a treatment. When effect sizes are overestimated a priori, the planned number of subjects will be too small. Can big data help here?

Yes: information from existing data resources improves study design, because effect sizes can be estimated more realistically in advance.
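To make the link between effect size and study size concrete, here is a minimal sketch in Python. It assumes the simplest possible design: a two-sided comparison of means between two equally sized groups, with the textbook normal approximation. The function names, the 5% significance level, the 80% power target and the two effect sizes are illustrative assumptions, not figures from any particular study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (normal approximation).

    effect_size is the standardized mean difference between the two groups.
    Formula: n = 2 * (z_{1-alpha/2} + z_{power})^2 / effect_size^2
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of the same test when the true effect is effect_size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


# Hypothetical numbers: the design assumes a larger effect than truly exists.
assumed_effect = 0.5  # optimistic guess used when planning the study
true_effect = 0.3     # assumed "real" effect, e.g. as suggested by existing data

n_planned = subjects_per_group(assumed_effect)
n_needed = subjects_per_group(true_effect)
actual_power = approximate_power(true_effect, n_planned)

print(f"Planned subjects per group: {n_planned}")
print(f"Needed subjects per group:  {n_needed}")
print(f"Power actually achieved:    {actual_power:.2f}")
```

Under these assumed numbers, the study is planned with 63 subjects per group where about 175 would be needed, and its power drops from the intended 80% to roughly 40%. Because the required sample size grows with the inverse square of the effect size, even a modest overestimate has a large impact.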

Is the Health Sector ready for big data?

A report recently published in the Netherlands listed several challenges that need to be addressed to enable big data applications in health care: 1) new technology for collecting, sharing, analyzing and storing data, 2) standardization of data, 3) access to the data and 4) privacy. No doubt these innovations are important and necessary. But how do we get from data to risk profiles? We use quantitative methods, specifically machine learning and statistical tools.

Apparently, it is assumed that the innovations required to scale machine learning and statistical techniques up to big data can be left to other application areas, such as consumer data and astronomy. Is this fair? The answer is NO: health data will be the largest of all once the wealth of data from devices that continuously measure biological processes and lifestyle is used. And there are more reasons.

Wrong predictions from consumer data might be annoying, such as receiving an email advertising beef when you are a vegetarian. Wrong predictions of disease status, by contrast, affect our lives. Validation of statistical methods is thus vital. So far, not much research has been performed on these methods in the era of big data.

In MIMOmics we develop methods to integrate multiple omics data sets for biological insight and for risk prediction. We address biostatistical issues such as missing data and heterogeneity across studies. If ignored, these issues can bias the results and therefore lead to wrong conclusions. Careful modelling is required. The MIMOmics methods now need to be taken to the next level: the integration of even more types of data, such as omics, spatial data, electronic health records, images and data from devices.

The health sector has to step up and provide methods for big data health research. If we do not invest in methodology for data analysis, we will end up with nothing more than an expensive infrastructure full of big data. What a loss of opportunities!

Jeanine J. Houwing-Duistermaat
Professor in data analytics and statistics