How big data has created a big crisis in science

 

The reproducibility crisis is driven in part by invalid statistical analyses that stem from data-driven hypotheses -- the opposite of how things are traditionally done.

The Conversation

There’s an increasing concern among scholars that, in many areas of science, famous published results tend to be impossible to reproduce.

This crisis can be severe. For example, in 2011, Bayer HealthCare reviewed 67 in-house projects and found that it could replicate fewer than 25 percent of them. Furthermore, over two-thirds of the projects had major inconsistencies. More recently, in November, an investigation of 28 major psychology papers found that only half could be replicated.

Similar findings are reported across other fields, including medicine and economics. These striking results call the credibility of all scientists into question.

What is causing this big problem? There are many contributing factors. As a statistician, I see huge issues with the way science is done in the era of big data. The reproducibility crisis is driven in part by invalid statistical analyses that stem from data-driven hypotheses -- the opposite of how things are traditionally done.

Scientific method

In a classical experiment, the statistician and scientist first together frame a hypothesis. Then scientists conduct experiments to collect data, which are subsequently analyzed by statisticians.

A famous example of this process is the “lady tasting tea” story. Back in the 1920s, at a party of academics, a woman claimed to be able to tell by taste whether the tea or the milk had been added to a cup first. Statistician Ronald Fisher doubted that she had any such talent. He hypothesized that, out of eight cups of tea, prepared such that four cups had milk added first and the other four cups had tea added first, the number of correct guesses would follow a probability model called the hypergeometric distribution.

Such an experiment was done with eight cups of tea sent to the lady in a random order – and, according to legend, she categorized all eight correctly. This was strong evidence against Fisher’s hypothesis. The chance that the lady would get all of the answers correct through random guessing was an extremely low 1.4 percent (1 in 70).
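For readers who want to check the 1.4 percent figure, here is a minimal Python sketch of the hypergeometric calculation. It assumes the lady knows that exactly four cups had milk added first, so a pure guess amounts to picking four of the eight cups at random. The function name prob_k_correct is made up for this illustration.

```python
from math import comb

def prob_k_correct(k: int, cups: int = 8, milk_first: int = 4) -> float:
    """Hypergeometric probability that a pure guesser, who labels exactly
    `milk_first` of the `cups` as milk-first, gets k of those labels right."""
    ways_right = comb(milk_first, k)                      # correct picks
    ways_wrong = comb(cups - milk_first, milk_first - k)  # incorrect picks
    return ways_right * ways_wrong / comb(cups, milk_first)

if __name__ == "__main__":
    # All four milk-first cups identified means all eight cups are sorted correctly.
    print(f"P(all correct by guessing) = {prob_k_correct(4):.4f}")
```

There are 70 ways to choose four cups out of eight and only one of them is entirely right, which is where the 1-in-70 chance comes from.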

That process -- hypothesize, then gather data, then analyze -- is rare in the big data era. Today’s technology can collect huge amounts of data, on the order of 2.5 exabytes a day.

While this is a good thing, science often develops at a much slower speed, and so researchers may not know how to specify the right hypothesis before analyzing the data. For example, scientists can now collect tens of thousands of gene expressions from people, but it is very hard to decide whether one should include or exclude a particular gene in the hypothesis. In this case, it is appealing to form the hypothesis based on the data. While such hypotheses may appear compelling, conventional inferences from these hypotheses are generally invalid. This is because, in contrast to the “lady tasting tea” process, the order of building the hypothesis and seeing the data has been reversed.

Data problems

Why can this reversal cause a big problem? Let’s consider a big data version of the tea lady -- a “100 ladies tasting tea” example.

Suppose there are 100 ladies who cannot tell the difference between the cups, but who each take a guess after tasting all eight. There’s actually about a 76 percent chance that at least one lady would luckily guess all of the cups correctly.
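The chance of at least one lucky lady can be worked out in a couple of lines. This sketch assumes the 100 ladies guess independently and that each has the same 1-in-70 chance of a perfect score; the function name is hypothetical.

```python
from math import comb

def prob_at_least_one_perfect(n_tasters: int = 100,
                              p_single: float = 1 / comb(8, 4)) -> float:
    """Chance that at least one of n independent guessers sorts every cup correctly."""
    # Complement rule: 1 minus the chance that every single taster fails.
    return 1 - (1 - p_single) ** n_tasters

if __name__ == "__main__":
    print(f"exact p = 1/70:    {prob_at_least_one_perfect():.3f}")
    print(f"rounded p = 0.014: {prob_at_least_one_perfect(p_single=0.014):.3f}")
```

With the exact per-lady probability of 1/70 the answer is about 76.3 percent; plugging in the rounded value of 1.4 percent instead gives about 75.6 percent. Either way, a "perfect" taster is more likely than not to turn up by chance alone.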

Now, if a scientist saw some lady with the surprising outcome of all cups correct and ran a statistical analysis for her with the same hypergeometric distribution above, then they might conclude that this lady had the ability to tell the difference between each cup. But this result isn’t reproducible. If the same lady did the experiment again she would very likely sort the cups wrongly -- not getting as lucky as her first time -- since she couldn’t really tell the difference between them.

This small example illustrates how scientists can “luckily” see interesting but spurious signals from a dataset. They may formulate hypotheses after these signals, then use the same dataset to draw the conclusions, claiming these signals are real. It may be a while before they discover that their conclusions are not reproducible. This problem is particularly common in big data analysis: because the datasets are so large, some spurious signals are likely to “luckily” occur just by chance.
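A small simulation makes the same point. The sketch below is only an illustration under stated assumptions: every taster guesses at random, a "study" tests 100 of them, and whenever someone scores perfectly that person is retested once on a fresh set of cups. The function names, the number of simulated studies and the random seed are all arbitrary choices made for this example.

```python
import random

MILK_FIRST = {0, 1, 2, 3}  # which of the 8 cups truly had milk added first

def perfect_score(rng: random.Random) -> bool:
    """One guessing taster labels 4 of the 8 cups as milk-first at random."""
    guess = set(rng.sample(range(8), 4))
    return guess == MILK_FIRST

def simulate(n_studies: int = 2000, n_tasters: int = 100, seed: int = 0):
    """Return (share of studies with a 'discovery', share of discoveries that replicate)."""
    rng = random.Random(seed)
    discoveries = 0
    replications = 0
    for _ in range(n_studies):
        # Data-driven hypothesis: look for any taster with a perfect score.
        if any(perfect_score(rng) for _ in range(n_tasters)):
            discoveries += 1
            # Replication attempt: the selected taster tries again on new cups.
            if perfect_score(rng):
                replications += 1
    replication_rate = replications / discoveries if discoveries else 0.0
    return discoveries / n_studies, replication_rate

if __name__ == "__main__":
    found, replicated = simulate()
    print(f"studies with a 'gifted' taster: {found:.1%}")
    print(f"of those, replicated on retest: {replicated:.1%}")
```

The first number should land near three in four, and the second near the 1-in-70 chance of a lucky guess: most studies "find" a gifted taster, and almost none of those findings survive a second test.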

What’s worse, this process may allow scientists to manipulate the data to produce the most publishable result. Statisticians joke about such a practice: “If we torture data hard enough, they will tell you something.” However, is this “something” valid and reproducible? Probably not.

Stronger analyses

How can scientists avoid the above problem and achieve reproducible results in big data analysis? The answer is simple: Be more careful.

If scientists want reproducible results from data-driven hypotheses, then they need to carefully take the data-driven process into account in the analysis. Statisticians need to design new procedures that provide valid inferences. There are a few already underway.
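As one simple illustration of taking the selection into account, here is a sketch using the Bonferroni correction, a long-established adjustment for multiple comparisons. It is not necessarily one of the newer procedures referred to above, and the 5 percent significance level is an assumption made for this example. The idea is that when the most striking of n results is singled out, its p-value is compared against the significance level divided by n.

```python
from math import comb

def bonferroni_significant(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """Declare a result significant only if it clears alpha / n_tests."""
    return p_value <= alpha / n_tests

# p-value of a perfect score under pure guessing (1 in 70).
p_perfect = 1 / comb(8, 4)

if __name__ == "__main__":
    # One lady, hypothesis fixed in advance: a perfect score is convincing.
    print(bonferroni_significant(p_perfect, n_tests=1))
    # Best of 100 ladies, picked after seeing the data: no longer convincing.
    print(bonferroni_significant(p_perfect, n_tests=100))
```

Under these assumptions, the same perfect score counts as evidence when a single lady was tested on a hypothesis fixed in advance, but not when she was picked out of 100 after the fact.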

Statistics is about the optimal way to extract information from data. By this nature, it is a field that evolves with the evolution of data. The problems of the big data era are just one example of such evolution. I think that scientists should embrace these changes, as they will lead to opportunities to develop novel statistical techniques, which will in turn provide valid and interesting scientific discoveries.

This article was first posted on The Conversation.
