David C. Bailey
Abstract
Judging the significance and reproducibility of quantitative research requires a good understanding of relevant uncertainties, but it is often unclear how well these have been evaluated and what they imply. Reported scientific uncertainties were studied by analysing 41 000 measurements of 3200 quantities from medicine, nuclear and particle physics, and interlaboratory comparisons ranging from chemistry to toxicology. Outliers are common, with 5σ disagreements up to five orders of magnitude more frequent than naively expected. Uncertainty-normalized differences between multiple measurements of the same quantity are consistent with heavy-tailed Student’s t-distributions that are often almost Cauchy, far from a Gaussian Normal bell curve. Medical research uncertainties are generally as well evaluated as those in physics, but physics uncertainty improves more rapidly, making feasible simple significance criteria such as the 5σ discovery convention in particle physics. Contributions to measurement uncertainty from mistakes and unknown problems are not completely unpredictable. Such errors appear to have power-law distributions consistent with how designed complex systems fail, and how unknown systematic errors are constrained by researchers. This better understanding may help improve analysis and meta-analysis of data, and help scientists and the public have more realistic expectations of what scientific results imply.
1. Introduction
What do reported uncertainties actually tell us about the accuracy of scientific measurements and the likelihood that different measurements will disagree? No scientist expects different research studies to always agree, but the frequent failure of published research to be confirmed has generated much concern about scientific reproducibility [1,2].
When scientists investigate many quantities in very large amounts of data, interesting but ultimately false results may occur by chance and are often published. In particle physics, bitter experience with frequent failures to confirm such results eventually led to an ad hoc ‘5-sigma’ discovery criterion [3–6], i.e. a ‘discovery’ is only taken seriously if the estimated probability for observing the result without new physics is less than the chance of a single sample from a Normal distribution being more than five standard deviations (5σ) from the mean.
In other fields, arguments that most novel discoveries are false [7] have caused increased emphasis on reporting the value and uncertainty of measured quantities, not just whether the value is statistically different from zero [8,9]. Research confirmation is then judged by how well independent studies agree according to their reported uncertainties, so assessing reproducibility requires accurate evaluation and realistic understanding of these uncertainties. This understanding is also required when analysing data, combining studies in meta-analyses, or making scientific, business or policy judgements based on research. The experience of research fields like physics, in which values and uncertainties have long been regularly reported, may provide some guidance on what reproducibility can reasonably be expected [10].
Most recent investigations into reproducibility focus on how often observed effects disappear in subsequent research, revealing strong selection bias in published results. Removing such bias is extremely important, but may not reduce the absolute number of false discoveries since not publishing non-significant results does not make the ‘discoveries’ go away. Controlling the rate of false discoveries depends on establishing criteria that reflect real measurement uncertainties, especially the likelihood of extreme fluctuations and outliers [11].
Outliers are observations that disagree by an abnormal amount with other measurements of the same quantity. Despite every scientist knowing that the rate of outliers is always greater than naively expected, there is no widely accepted heuristic for estimating the size or shape of these long tails. These estimates are often assumed to be approximately Normal (Gaussian), but it is easy to find examples where this is clearly untrue [12–14].
To examine the accuracy of reported uncertainties, this paper reviews multiple published measurements of many different quantities, looking at the differences between measurements of each quantity normalized by their reported uncertainties. Previous similar studies [15–20] reported on only a few hundred to a few thousand measurements, mostly in subatomic physics. This study reports on how well multiple measurements of the same quantity agree, and hence what are reasonable expectations for the reproducibility of published scientific measurements. Of particular interest is the frequency of large disagreements which usually reflect unexpected systematic effects.
1.1. Systematic effects
Sources of uncertainty are often categorized as statistical or systematic, and their methods of evaluation classified as Type A or B [21]. Type A evaluations are based on observed frequency distributions, whereas Type B evaluations use other methods. Statistical uncertainties are always evaluated from primary data using Type A methods, and can, in principle, be made arbitrarily small by repeated measurement or large enough sample size.
Uncertainties due to systematic effects may be evaluated by either Type A or B methods, and fall into several overlapping classes [22]. Class 1 systematics, which include many calibration and background uncertainties, are evaluated by Type A methods using ancillary data. Class 2 systematics are almost everything else that might bias a measurement, and are caused by a lack of knowledge or uncertainty in the measurement model, such as the reading error of an instrument or the uncertainties in Monte Carlo estimates of corrections to the measurement. Class 3 systematics are theoretical uncertainties in the interpretation of a measurement. For example, determining the proton radius using the Lamb shift of muonic hydrogen requires over 20 theoretical corrections [23] that are potential sources of uncertainty in the proton radius, even if the actual measurement of the Lamb shift is perfect. The uncertainties associated with Classes 2 and 3 systematic effects cannot be made arbitrarily small by simply getting more data.
When considering the likelihood of extreme fluctuations in measurements, mistakes and ‘unknown unknowns’ are particularly important, but they are usually assumed to be statistically intractable and are not often considered in traditional uncertainty analysis. Mistakes are ‘unknown knowns’, i.e. something that is thought to be known but is not, and it is believed that good scientists should not make mistakes.
‘Unknown unknowns’ are factors that affect a measurement but are unknown and unanticipated based on past experience and knowledge [24]. For example, during the first 5 years of operation of LEP (the Large Electron Positron collider), the effect of local railway traffic on measurements of the Z^{0} boson mass was an ‘unknown unknown’ that no one thought about. Then improved monitoring revealed unexpected variations in the accelerator magnetic field, and after much investigation these variations were found to be caused by electric rail line ground leakage currents flowing through the LEP vacuum pipe [25].
In general, systematic effects are challenging to estimate [12,21,22,26–29], but can be partially constrained by researchers making multiple internal and external consistency checks: is the result compatible with previous data or theoretical expectations? Is the same result obtained for different times, places, assumptions, instruments or subgroups? As described by Dorsey [30], p. 11, scientists ‘change every condition that seems by any chance likely to affect the result, and some that do not, in every case pushing the change well beyond any that seems at all likely’. If an inconsistency is observed and its cause understood, the problem can often be fixed and new data taken, or the effect monitored and corrections made. If the cause cannot be identified, however, then the observed dispersion of values must be included in the uncertainty.
The existence of unknown systematic effects or mistakes may be revealed by consistency checks [31], but small unknown systematics and mistakes are unlikely to be noticed if they do not affect the measurement by more than the expected uncertainty. Even large problems can be missed by chance (see §3.4) or if the conditions changed between consistency checks do not alter the size of the systematic effect. The power of consistency checks is limited by the impossibility of completely changing all apparatus, methods, theory and researchers between measurements, so one can never be certain that all significant systematic effects have been identified.
2. Methods
2.1. Data
Quantities were only included in this study if they are significant enough to have warranted at least five independent measurements with clearly stated uncertainties.
Medical and health research data were extracted from some of the many meta-analyses published by the Cochrane Collaboration [32]; a total of 5580 measurements of 310 quantities generating 99 433 comparison pairs were included. Particle physics data (8469 measurements, 864 quantities and 53 988 pairs) were retrieved from the Review of Particle Physics [33,34]. Nuclear physics data (12 380 measurements, 1437 quantities and 66 677 pairs) were obtained from the Table of Radionuclides [35].
Most nuclear and particle physics measurements have prior experimental or theoretical expectations which may influence results from nominally independent experiments [20,36], and medical research has similar biases [7], so this study also includes a large sample of interlaboratory studies that do not have precise prior expectations for their results. In these studies, multiple independent laboratories measure the same quantity and compare results. For example, the same mass standard might be measured by national laboratories in different countries, or an unknown archaeological sample might be divided and distributed to many laboratories, with each laboratory reporting back its carbon-14 measurement of the sample’s age. None of the laboratories knows the expected value for the quantity nor the results from other laboratories, so there should be no expectation, selection or publication biases. These Interlab studies (14 097 measurements, 617 quantities and 965 416 pairs) were selected from a wide range of sources in fields such as analytical chemistry, environmental sciences, metrology and toxicology. The measurements ranged from genetic contamination of food to high-precision comparison of fundamental physical standards, and were carried out by a mix of national, university and commercial laboratories.
All quantities analysed are listed in the electronic supplementary materials [37].
2.2. Data selection and collection
Data were entered using a variety of semi-automatic scripts, optical character recognition and manual methods. No attempt was made to recalculate past results based on current knowledge, or to remove results that were later retracted or amended, since the original paper was the best result at the time it was published. When the Review of Particle Physics [34] noted that earlier data had been dropped, the missing results were retrieved from previous editions [38].
To ensure that measurements were as independent as possible, measurements were excluded if they were obviously not independent of other data already included. Because relationships between measurements are often obscure, however, there undoubtedly remain many correlations between the published results used.
Medical and health data were selected from the 8105 reviews in the Cochrane database [32] as of 25 September 2013. Data were analysed from 221 Intervention Reviews whose abstract mentioned at least six trials with more than 5000 total participants, and which had at least one analysis with five or more studies with greater than 3500 total participants. The average heterogeneity inconsistency index (I^{2}≡1−d.f./χ^{2}) [36,39] is about 40% for the analyses reported here. Because analyses within a review may be correlated, only a maximum of three analyses and five comparison groups were included from any one review. About 80% of the Cochrane results are the ratio of intervention and control binomial probabilities, e.g. mortality rates for a drug and a placebo. Such ratios are not Normal [40], so they were converted to differences that should be Normal in the Gaussian limit, i.e. when the group size n and probability p are such that n, np and (1−p)n are all ≫1, so the binomial distribution converges towards a Gaussian distribution. (The median observed values for these data were n=100, p=0.16.) The 68.3% binomial probability confidence interval was calculated for both the intervention and control groups to determine the uncertainties.
2.3. Uncertainty evaluation
Measurements with uncertainties are typically reported as x±u, which means that the interval x−u to x+u contains with some defined probability ‘the values that could reasonably be attributed to the measurand’ [21]. Most frequently, uncertainty intervals are given as ±ku_{S}, where k is the coverage factor and u_{S} is the ‘standard uncertainty’, i.e. the uncertainty of a measurement expressed as the standard deviation of the expected dispersion of values. Uncertainties in particle physics and medicine are often instead reported as the bounds of either 68.3% or 95% confidence intervals, which for a Normal distribution are equivalent to the k=1 and 2 standard uncertainty intervals.
For this study, all uncertainties were converted to nominal 68.3% confidence interval uncertainties. The vast majority of measurements reported simple single uncertainties, but if more than a single uncertainty was reported, e.g. ‘statistical’ and ‘systematic’, they were added in quadrature.
2.4. Normalized differences
All measurements, x_{i}±u_{i}, of a given quantity were combined in all possible pairs and the difference between the two measurements of each pair calculated in units of their combined uncertainty u_{ij}:
z_{ij} = \frac{|x_{i} - x_{j}|}{u_{ij}}, \qquad u_{ij} = \sqrt{u_{i}^{2} + u_{j}^{2}} \qquad (2.1)
The dispersion of z_{ij} values can be used to judge whether independent measurements of a quantity are ‘compatible’ [41]. A feature of z as a metric for measurement agreement is that it does not require a reference value for the quantity. (The challenges and effects of using reference values are discussed in §3.3.)
The uncertainties in equation (2.1) are combined in quadrature, as expected for standard uncertainties of independent measurements. (The effects of any lack of independence are discussed in §3.2.)
Uncertainties based on confidence intervals may not be symmetric about the reported value, which is the case for about 13% of particle, 6% of medical, 0.3% of nuclear and 0.06% of interlab measurements. Following common (albeit imperfect) practice [42], if the reported plus and minus uncertainties were asymmetric, z_{ij} was calculated from equation (2.1) using the uncertainty for the side towards the other member of the comparison pair. For example, if x_{1} = 80^{+3}_{−2}, x_{2} = 100^{+5}_{−4} and x_{3} = 126^{+15}_{−12}, then z_{12} = (100−80)/\sqrt{3^{2}+4^{2}} and z_{23} = (126−100)/\sqrt{5^{2}+12^{2}}.
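The side-facing rule for asymmetric uncertainties can be illustrated with a minimal Python sketch. The `Measurement` type and the function names are hypothetical, introduced only for illustration and not taken from the analysis code of this study; the three measurements are the worked example above (80 with +3/−2, 100 with +5/−4, 126 with +15/−12).

```python
import math
from itertools import combinations
from typing import NamedTuple


class Measurement(NamedTuple):
    """Hypothetical container for one reported result (illustration only)."""
    value: float
    u_plus: float   # upper (plus) 68.3% uncertainty
    u_minus: float  # lower (minus) 68.3% uncertainty


def z_pair(a: Measurement, b: Measurement) -> float:
    """Equation (2.1), using for each measurement the uncertainty on the
    side that faces the other member of the pair."""
    lo, hi = (a, b) if a.value <= b.value else (b, a)
    # The lower value faces upward (u_plus); the higher value faces downward (u_minus).
    return (hi.value - lo.value) / math.hypot(lo.u_plus, hi.u_minus)


def all_pairs(measurements):
    """z for every unordered pair of measurements of one quantity."""
    return [z_pair(a, b) for a, b in combinations(measurements, 2)]


x1 = Measurement(80.0, 3.0, 2.0)
x2 = Measurement(100.0, 5.0, 4.0)
x3 = Measurement(126.0, 15.0, 12.0)
z12, z13, z23 = all_pairs([x1, x2, x3])  # z12 = 20/5, z23 = 26/13
```

For symmetric uncertainties `u_plus` and `u_minus` are equal and the function reduces to the plain quadrature form of equation (2.1).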
The distributions of the z_{ij} differences are histogrammed in figure 1, with each pair weighted such that the total weight for a quantity is the number of measurements of that quantity. For example, if a quantity has 10 measurements, there are 45 possible pairs, and each entry has a weight of 10/45. (Other weighting schemes are discussed in §3.3.) The final frequency distribution within each research area is then normalized so that its total observed probability adds up to 1. If the measurement uncertainties are well evaluated and correspond to Normally distributed probabilities for x, then z is expected to be Normally distributed with a standard deviation σ=1.
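As an illustration of this weighting, the following Python sketch builds a normalized, weighted histogram of pairwise z values. The input layout (a dictionary of lists of value and uncertainty tuples), the bin width and the function names are assumptions made for the example, and only symmetric uncertainties are handled.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_weight(n):
    """Weight of each pair so that a quantity with n measurements has total weight n."""
    return n / (n * (n - 1) / 2)


def weighted_z_histogram(quantities, bin_width=0.5):
    """quantities: dict name -> list of (value, uncertainty) tuples (assumed layout).
    Returns dict bin_index -> probability, normalized to a total of 1."""
    hist = defaultdict(float)
    for measurements in quantities.values():
        n = len(measurements)
        if n < 2:
            continue  # no pairs can be formed
        w = pair_weight(n)
        for (xa, ua), (xb, ub) in combinations(measurements, 2):
            z = abs(xa - xb) / math.hypot(ua, ub)  # equation (2.1)
            hist[int(z // bin_width)] += w
    total = sum(hist.values())
    if total == 0:
        return {}
    return {b: w / total for b, w in sorted(hist.items())}
```

With 10 measurements `pair_weight(10)` is 10/45, matching the example in the text.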
Probability distribution uncertainties (e.g. the vertical error bars in figure 1) were evaluated using a bootstrap Monte Carlo method where quantities were drawn randomly with replacement from the actual dataset until the number of Monte Carlo quantities equalled the actual number of quantities. The resulting artificial dataset was histogrammed, the process repeated 1000 times, and the standard deviations of the Monte Carlo probabilities calculated for each z bin.
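A minimal Python sketch of this quantity-level bootstrap is given below. The data layout, the bin width and the function names are assumptions for illustration; the histogram helper applies equation (2.1) with symmetric uncertainties and the per-quantity weighting described above.

```python
import math
import random
import statistics
from collections import Counter
from itertools import combinations


def z_histogram(quantities, bin_width=1.0):
    """Normalized histogram of pairwise z; quantities is a list of lists of
    (value, uncertainty) tuples, each with at least two measurements."""
    hist = Counter()
    for meas in quantities:
        weight = 2.0 / (len(meas) - 1)  # n measurements shared over n(n-1)/2 pairs
        for (xa, ua), (xb, ub) in combinations(meas, 2):
            z = abs(xa - xb) / math.hypot(ua, ub)
            hist[int(z // bin_width)] += weight
    total = sum(hist.values())
    return {b: w / total for b, w in hist.items()}


def bootstrap_bin_uncertainties(quantities, n_boot=1000, seed=0):
    """Resample whole quantities with replacement and return, for each z bin,
    the standard deviation of its probability across the replicas."""
    rng = random.Random(seed)
    replicas = [
        z_histogram(rng.choices(quantities, k=len(quantities)))
        for _ in range(n_boot)
    ]
    bins = set().union(*replicas)
    return {
        b: statistics.stdev([r.get(b, 0.0) for r in replicas])
        for b in sorted(bins)
    }
```

Because whole quantities are resampled, every replica keeps each drawn quantity's full set of measurements and pairs, which is the property the measurement-level alternative lacks.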
Random selection of measurements instead of quantities was not chosen for uncertainty evaluation because of the corrections then required to avoid bias and artefacts. For example, if measurements are randomly drawn, a quantity with only five measurements will often be missing from the artificial dataset for having too few (fewer than five) measurements drawn, or if it does have five measurements some of them will be duplicates generating unrealistic z=0 values, or if duplicates are excluded then they will always be the same five non-random measurements. Without correcting for such effects, the uncertainties generated by a measurement-level Monte Carlo are too small to be consistent with the observed bin-to-bin fluctuations in figure 1. Correcting for such effects requires using characteristics of the actual quantities and would be effectively equivalent to using random quantities.
2.5. Data fits
Attempts were made to fit the data to a wide variety of functions, but by far the best fits were to non-standardized Student’s t probability density distributions with ν degrees of freedom.
S_{\nu,\sigma}(z) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\,\sqrt{\nu\pi}\,\sigma} \left(1 + \frac{(z/\sigma)^{2}}{\nu}\right)^{-(\nu+1)/2} \qquad (2.2)
Student’s t-distribution is essentially a smoothly symmetric normalizable power law, with S_{ν,σ}(z)∼(z/σ)^{−(ν+1)} for z ≫ σ√ν.
The fitted parameter σ defines the core width and overall scale of the distribution and is equal to the standard deviation in the ν→∞ Gaussian limit and to the half-width at half maximum in the ν→1 Cauchy (also known as Lorentzian or Breit-Wigner) limit. The parameter ν determines the size of the tails, with small ν corresponding to large tails. The values and standard uncertainties in σ and ν were determined from a non-linear least-squares fit to the data that minimizes the nominal χ^{2} [43]:
\chi^{2} = \sum_{i} \frac{[B_{i} - S_{\nu,\sigma}(z_{i})]^{2}}{u_{Bi}^{2}} \qquad (2.3)
where z_{i}, B_{i} and u_{Bi} are the bin z, contents, and uncertainties of the observed z-distributions shown in figure 1.
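The fit can be sketched in Python as follows. This is an illustration under stated assumptions, not the fitting code used for the results reported here: the density is the standard non-standardized Student's t normalized over the whole real line, the model is evaluated at the bin centre rather than integrated over the bin, and a coarse grid search stands in for a proper non-linear least-squares minimizer. All function names are hypothetical.

```python
import math


def student_t_pdf(z, nu, sigma):
    """Non-standardized Student's t density with nu degrees of freedom and scale sigma."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p((z / sigma) ** 2 / nu))


def nominal_chi2(nu, sigma, bins):
    """Nominal chi-square for bins given as (z_i, B_i, u_Bi) tuples."""
    return sum(((b - student_t_pdf(z, nu, sigma)) / u) ** 2 for z, b, u in bins)


def grid_fit(bins, nus, sigmas):
    """Return the (nu, sigma) grid point with the smallest nominal chi-square.
    A grid search is used here only to keep the sketch dependency-free."""
    best = min((nominal_chi2(nu, s, bins), nu, s) for nu in nus for s in sigmas)
    return best[1], best[2]


# Self-consistency check on noiseless artificial bins generated from the model.
true_nu, true_sigma = 2.5, 1.1
fake_bins = [(0.25 + 0.5 * k, student_t_pdf(0.25 + 0.5 * k, true_nu, true_sigma), 0.01)
             for k in range(20)]
fit_nu, fit_sigma = grid_fit(fake_bins,
                             nus=[k / 2 for k in range(2, 21)],
                             sigmas=[k / 10 for k in range(5, 21)])
```

In practice the grid search would be replaced by a gradient-based or simplex minimizer, with parameter uncertainties taken from the curvature of the nominal χ^{2} surface.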
Possible values of z are sometimes limited by the allowed range of measurement values, which could suppress heavy tails. For example, many quantities are fractions that must lie between 0 and 1, and there is less room for two measurements with 10% uncertainty to disagree by 5σ than for two 0.01% measurements. The size of this effect was estimated using Monte Carlo methods to generate simulated data based on the values and uncertainties of the actual data, constrained by any obvious bounds on their allowed values. The simulated data were then fit to see if applying the bounds changed the fitted values for σ and ν. The largest effect was for Medical data where ν was reduced by about 0.1 when minimally restrictive bounds were assumed. Stronger bounds might exist for some quantities, but determining them would require careful measurement-by-measurement assessment beyond the scope of this study. For example, each measurement of the long-term duration of the effect of a medical drug or treatment would have an upper bound set by the length of that study. Since correcting for bounds can only make ν smaller (corresponding to even heavier tails), and the observed effects were negligible, no corrections were applied to the values of ν reported here.
3. Results
3.1. Observed distributions
Histograms of the z-distributions for different datasets are shown in figure 1. The complementary cumulative distributions of the data are given in table 1 and shown in figure 2.
None of the data are close to Gaussian, but all can reasonably be described by almost-Cauchy Student’s t-distributions with ν∼2–3. For comparison, fits to these data with Lévy stable distributions have nominal χ^{2} 4–30 times worse than the fits to Student’s t-distributions. The fraction of ‘5σ’ (i.e. z>5) disagreements observed is as high as 0.12, compared with the 6×10^{−7} expected for a Normal distribution.
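The scale of this discrepancy can be checked with a short Python calculation of two-sided tail probabilities. The closed forms for ν=1 and ν=2 are standard results for Student's t; taking σ=1 is an assumption made here for simplicity, whereas the fitted σ values in table 2 differ from 1.

```python
import math


def normal_tail(z):
    """Two-sided Normal probability of a deviation larger than z standard deviations."""
    return math.erfc(z / math.sqrt(2.0))


def cauchy_tail(z, sigma=1.0):
    """Two-sided tail probability of a Student's t with nu = 1 (Cauchy) and scale sigma."""
    return 1.0 - (2.0 / math.pi) * math.atan(z / sigma)


def t2_tail(z, sigma=1.0):
    """Two-sided tail probability of a Student's t with nu = 2 and scale sigma."""
    t = z / sigma
    return 1.0 - t / math.sqrt(2.0 + t * t)


p_normal = normal_tail(5.0)  # about 5.7e-7
p_cauchy = cauchy_tail(5.0)  # about 0.126
p_t2 = t2_tail(5.0)          # about 0.038
```

Under these assumptions a Cauchy tail beyond z=5 is more than five orders of magnitude more probable than the Normal tail, and even ν=2 leaves a few per cent of pairs beyond 5σ.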
The fitted values for ν and σ are shown in table 2. Also shown in table 2 are two data subsets expected to be of higher quality, BIPM Interlaboratory Key comparisons (372 quantities, 3712 measurements and 20 245 pairs) and Stable Particle properties (335 quantities, 3041 measurements and 16 649 pairs). The Key comparisons [44] should define state-of-the-art accuracy, since they are measurements of important metrological standards carried out by national laboratories. Stable particles are often easier to study than other particles, so their properties are expected to be better determined. Both ‘better’ data subsets do have narrower distributions consistent with higher quality, but they still have heavy tails. More selected data subsets are discussed in §3.4.
The probability distribution for the nominal χ^{2} statistic is not expected to be an exact regular χ^{2}-distribution. The differences are due to the non-Gaussian uncertainties of the low-population high-z bins, and because the bin contents are not independent since a single measurement can contribute to multiple bins as part of different permutation pairs. Based on fits of simulated datasets with a mix of ν comparable to the observed data, the range of nominal χ^{2} reported in table 2 seems reasonable, i.e. the chances of χ^{2}/d.f. ≤0.6 or ≥1.9 were 15% and 2%, respectively.
To see whether more important quantities are measured with less disagreement, a small additional dataset of measurements of fundamental physical constants (7 quantities, 320 measurements and 9098 pairs) was also analysed. The constants are Avogadro’s number, the fine structure constant, the Planck constant, Newton’s gravitational constant, the deuteron binding energy, the Rydberg constant and the speed of light (before it became a defined constant). These measurements have very heavy tails, despite their importance in physical science. Quantities with more interest do not seem to be better measured, as is also shown by considering only quantities with at least 10 published measurements, which do not have significantly smaller tails (see σ_{10},ν_{10} in table 2).
Figure 1 shows that the comparison pairs z_{ij} are Student’s t-distributed, but what does this imply about the dispersion of individual x_{i} measurements? Except for the ν=1 and ∞ Cauchy and Normal limits, the distribution of differences of values selected from a Student’s t-distribution is not itself a t-distribution, but it can be closely approximated as one [45]. The distributions of the parent individual x measurements were estimated by Monte Carlo deconvolution. Artificial measurements were generated from t-distributions with parameters ν_{x} and σ_{x}, and these measurements combined into permutation pairs to generate an artificial z-distribution. This distribution was compared to the observed z-distributions, and then ν_{x} and σ_{x} were iteratively adjusted until the best match was achieved between the artificial and observed z-distributions. As shown in table 2, the approximate Student’s t-parameters (ν_{x}, σ_{x}) of the individual measurement populations have ν_{x}<ν and hence are slightly more Cauchy-like than the permutation pairs distributions.
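The forward step of such a deconvolution can be sketched in Python. The sketch assumes that every artificial measurement reports unit uncertainty while its actual scatter follows a Student's t with parameters (ν_{x}, σ_{x}); the iterative adjustment of ν_{x} and σ_{x} against the observed histogram is omitted, and the function names and sample sizes are illustrative choices.

```python
import math
import random
import statistics
from itertools import combinations


def student_t(rng, nu, sigma):
    """One draw from a zero-centred Student's t with nu degrees of freedom and scale sigma."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square variate with nu degrees of freedom
    return sigma * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)


def simulated_z(nu_x, sigma_x, n_quantities=10_000, n_meas=5, seed=0):
    """Pairwise z values from equation (2.1) for artificial quantities whose
    measurements all report unit uncertainty (an assumption of this sketch)."""
    rng = random.Random(seed)
    zs = []
    for _ in range(n_quantities):
        xs = [student_t(rng, nu_x, sigma_x) for _ in range(n_meas)]
        zs.extend(abs(a - b) / math.sqrt(2.0) for a, b in combinations(xs, 2))
    return zs


# For Cauchy parents (nu_x = 1, sigma_x = 1) the pairs are again Cauchy and the
# median of |z| equals the scale of the pair distribution, expected near sqrt(2).
cauchy_pair_scale = statistics.median(simulated_z(1.0, 1.0))
```

A full deconvolution would histogram `simulated_z` with the same binning and weighting as the data and vary (ν_{x}, σ_{x}) until the artificial and observed distributions agree.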
3.2. Combined uncertainty
The definition of z by equation (2.1) assumes that the measurements x_{i} and x_{j} are independent and that the uncertainties u_{i} and u_{j} can be combined following the rules for standard uncertainties.
If x_{i} and x_{j} are correlated, however, equation (2.1) should be replaced by
z_{ij} = \frac{|x_{i} - x_{j}|}{\sqrt{u_{i}^{2} + u_{j}^{2} - 2\,\mathrm{cov}(x_{i},x_{j})}} \qquad (3.1)
where cov(x_{i},x_{j}) is the covariance of x_{i} and x_{j} [21].
It is not in general possible to quantitatively evaluate the covariance for individual pairs of measurements in the datasets, but the effects of any correlations are not expected to be large, and they cannot explain the observed heavy tails. Any positive covariance would decrease the denominator in equation (3.1) and increase the width of the z-distributions. Correlations between measurements are expected to be much more likely positive than negative, but even perfect anticorrelation could only decrease z-values by at most a factor of 1/√2 compared to the uncorrelated case (i.e. changing cov(x_{i},x_{j}) from 0 to −u_{i}u_{j} in equation (3.1) reduces z_{ij} by √2 if u_{i}=u_{j}, and less if u_{i}≠u_{j}). Correlations are further discussed in §3.6.
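The bound on the effect of correlations can be verified numerically with a small Python sketch of equation (3.1). Expressing the covariance through a correlation coefficient `rho`, so that cov(x_{i},x_{j}) = rho u_{i} u_{j}, is a parametrization chosen here for illustration.

```python
import math


def z_correlated(xi, ui, xj, uj, rho=0.0):
    """Equation (3.1) with cov(x_i, x_j) = rho * u_i * u_j.
    The combined variance vanishes for rho = 1 with u_i = u_j, so that case is excluded."""
    variance = ui**2 + uj**2 - 2.0 * rho * ui * uj
    return abs(xi - xj) / math.sqrt(variance)


z_uncorrelated = z_correlated(0.0, 1.0, 3.0, 1.0, rho=0.0)
z_anticorrelated = z_correlated(0.0, 1.0, 3.0, 1.0, rho=-1.0)  # smaller by sqrt(2)
z_positive = z_correlated(0.0, 1.0, 3.0, 1.0, rho=0.5)         # larger than uncorrelated
```

For unequal uncertainties the reduction under perfect anticorrelation is smaller, for example a factor of about 0.83 for u_{i}=1 and u_{j}=3.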
Another possible issue with equation (2.1) is that its usual derivation assumes that u_{i} and u_{j} are standard deviations of the expected dispersion of possible values (e.g. [21], §E.3.1). This assumption is a concern since the standard deviation is an undefined quantity for Student’s t-distributions if ν<2, and the observed z and inferred x-distributions have ν near or below this value. Even if the variance of a distribution is undefined, however, the dispersion of the difference of two independent variables drawn from such distributions may still be calculated numerically and in some cases analytically.
Cauchy uncertainties add linearly instead of in quadrature, since the distribution of differences of two variables drawn from two Cauchy distributions with widths σ_{1} and σ_{2} is simply another Cauchy distribution with width σ_{diff}=σ_{1}+σ_{2}. The corresponding definition of z would be
z_{ij} = \frac{|x_{i} - x_{j}|}{u_{i} + u_{j}} \qquad (3.2)
Almost-Cauchy distributions should almost follow the rules for combination of Cauchy (ν=1) distributions. Applying equation (3.2) to the data produces z-distributions that appear almost identical to those in figure 1, except that the fitted values of σ for Interlab, Nuclear, Particle and Medical data are smaller by factors of 0.78, 0.80, 0.75 and 0.74, respectively, while the fitted values of ν are almost unchanged (ν_{linear}/ν_{quad}=0.99, 1.00, 0.98, 0.94). The scale factor for σ would be 1/√2=0.71 if all measurements of a quantity had equal uncertainties (u_{i}=u_{j}), since switching from quadrature (equation (2.1)) to linear (equation (3.2)) would simply scale all the calculated z-values by 1/√2 and not affect ν. Similarly, if data with equal ν=1, σ=1 Cauchy uncertainties were analysed using equation (2.1), the resulting permutation pairs would have ν=1, σ=√2, as shown in the last line of table 2.
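The linear addition of Cauchy widths can be checked with a short Monte Carlo in Python. The sketch relies on the fact that for a zero-centred Cauchy distribution the median of the absolute value equals the half-width at half maximum; the sample size, seed and function names are arbitrary illustrative choices.

```python
import math
import random
import statistics


def cauchy(rng, scale):
    """Draw from a zero-centred Cauchy distribution with half-width `scale`."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))


def difference_width(scale1, scale2, n=200_000, seed=0):
    """Estimate the half-width of x1 - x2 as the median of |x1 - x2|."""
    rng = random.Random(seed)
    diffs = [abs(cauchy(rng, scale1) - cauchy(rng, scale2)) for _ in range(n)]
    return statistics.median(diffs)


width = difference_width(1.0, 2.0)  # close to 1 + 2, not to sqrt(1 + 4)
```

The estimate lies near σ_{1}+σ_{2}=3 and well away from the quadrature value √5≈2.24.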
3.3. Alternative weighting schemes and compatibility measures
There are several ways to weight data in the distribution plots, but the fitted parameter values are not usually greatly affected by the choice (table 3). The default method was to give each measurement equal weight (‘M’ in table 3). Jeng [20] gave equal weight to all measurement pairs (‘P’), but this gives extreme weight to quantities with a large number (N) of measurements since the number of permutations grows as (N−1)N/2. Giving each quantity equal weight (‘Q’) also seems less fair, since a quantity measured many times will be weighted the same as a quantity measured only a few times.
Instead of using measurement pairs to study compatibility, Roos et al. [16] calculated the weighted mean for each quantity, and then plotted the distribution of the uncertainty-normalized difference (‘h’) from that mean for each measurement, i.e.
h_{i} = \frac{x_{i} - \bar{x}}{\sqrt{u_{i}^{2} - u_{\bar{x}}^{2}}} \qquad (3.3)
where
\bar{x} = \frac{\sum_{i} x_{i}/u_{i}^{2}}{\sum_{i} 1/u_{i}^{2}}, \qquad u_{\bar{x}}^{2} = \frac{1}{\sum_{i} 1/u_{i}^{2}} \qquad (3.4)
h is very similar to interlaboratory comparison ζ-scores [46], which are the standard-uncertainty-normalized differences between measurements and an externally assigned value for the quantity. The problem with using actual ζ-scores is that they depend on having assigned values for the quantity independent of the measurements. Such values are not usually available for the quantities studied here, so any assigned value must be determined from the measurements themselves, and such ‘consensus values’ can be problematic [46]. The particular issue with h is whether the weighted mean \bar{x} is the best assigned value for a quantity given all the available measurements. This is a reasonable assumption if the uncertainties are Normal, since then \bar{x} from equation (3.4) is the maximum likelihood value for x [16]. If the uncertainties are not Normal, however, \bar{x} may be far from maximum likelihood, so it is not clear if \bar{x} is the best choice for the assigned value. Because of these issues, z was preferred over h in this study, but the h- and z-distributions are very similar. As can be seen from table 3, the fit quality and parameter values are comparable for h- and z-distributions, except the tails appear even heavier in h.
3.4. Selected data subsets
To further investigate the variance in the distributions for different types of measurements, several additional data subsets were examined and their parameters listed in table 4.
The Key Metrology data subset is for electrical, radioactivity, length, mass and other similar physical metrology standards. To see whether the most experienced national laboratories were more consistent, table 4 also lists Selected Metrology data from only the six national laboratories that reported the most Key Metrology measurements. These laboratories were PTB (Physikalisch-Technische Bundesanstalt, Germany), NMIJ (National Metrology Institute of Japan), NIST (National Institute of Standards and Technology, USA), NPL (National Physical Laboratory, UK), NRC (National Research Council, Canada) and LNE (Laboratoire national de métrologie et d’essais, France). Similarly, Key Analytical chemistry data selected from the same national laboratories are also shown. These are for measurements such as the amount of mercury in salmon, PCBs in sediment or chromium in steel. The metrology measurements by the selected national laboratories do have much lighter tails with ν∼10, but this is not the case for their analytical measurements where ν∼2.
New Stable particle data have the lightest tail in table 2, but it is not clear if this is because the newer results have better determined uncertainties or are just more correlated. The trend in particle physics is for fewer but larger experiments, and more than a third of the newer Stable measurements were made by just two very similar experiments (Belle and BaBar), so the New Stable data are split into two groups in table 4. There is no significant difference between the Belle/BaBar and Other experiments data.
Nuclear lifetimes with small and large relative uncertainties were compared. They have similar tails, but the smaller uncertainty measurements appear to underestimate their uncertainty scales.
Measurements of Newton’s gravitational constant are notoriously variable [14,47], so a dataset without G_{N} results was examined. The heavy tail is reduced, albeit with large uncertainty.
3.5. Relative uncertainty
The accuracy of uncertainty evaluations appears to be similar in all fields, but unsurprisingly there are notable differences in the relative sizes of the uncertainties. In particular, although individual physics measurements are not typically more reproducible than in medicine, they often have smaller relative uncertainty (i.e. uncertainty/value) as shown in figure 3.
Perhaps more importantly for discovery reproducibility, uncertainty improves more rapidly in physics than in medicine, as is shown in figure 4. This difference in rates of improvement reflects the difference between measurements that depend on steadily evolving technology versus those using stable methods that are limited by sample sizes and heterogeneity [48]. The expectation of reduced uncertainty in physics means that it is feasible to take a wait-and-see attitude towards new discoveries, since better measurements will quickly confirm or refute the new result. Measurement uncertainty in nuclear and particle physics typically improves by about a factor of 2 every 15 years. Constants data improve twice as fast, which is unsurprising since more effort is expected for more important quantities.
Physicists also tend not to make new measurements unless they are expected to be more accurate than previous measurements. In the datasets reported here, the median improvement in uncertainty of Nuclear measurements compared to the best previous measurement of the same quantity is u_{best}/u_{new}=2.0±0.3, and the improvement factors for Constants, Particle and Stable measurements are 1.8±0.3, 1.7±0.2 and 1.3±0.1, respectively. By contrast, Medical measurements typically have greater uncertainties than the best previous measurements, with median u_{best}/u_{new}=0.62±0.03. This is an understandable consequence of different uncertainty to cost relationships in physics and medicine. Study population size is a major cost driver in medical research, so reducing the uncertainty by a factor of two can cost almost four times as much, which is rarely the case in physics.
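The cost scaling can be illustrated with a short Python sketch, assuming the uncertainty is purely statistical and follows the Normal approximation for a binomial proportion, u = √(p(1−p)/n). The function names are hypothetical; the values p=0.16 and n=100 are the median values quoted in the data selection section.

```python
import math


def binomial_standard_uncertainty(p, n):
    """Normal-approximation standard uncertainty of an observed proportion p
    in a group of size n (statistical contribution only)."""
    return math.sqrt(p * (1.0 - p) / n)


def group_size_ratio(uncertainty_reduction):
    """Factor by which the group size must grow to shrink u by the given factor."""
    return uncertainty_reduction ** 2


u_100 = binomial_standard_uncertainty(0.16, 100)  # about 0.037
u_400 = binomial_standard_uncertainty(0.16, 400)  # half of u_100
```

Halving the uncertainty therefore requires four times as many participants under this assumption, so study costs that scale with group size rise by nearly the same factor.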
3.6. Expectations and correlations
Prior expectations exist for most measurements reported here except for the Interlab data. Such expectations may suppress heavy tails by discouraging publication of the anomalous results that populate the tails, since before publishing a result dramatically different from prior results or theoretical expectations, researchers are likely to make great efforts to ensure that they have not made a mistake. Journal editors, referees and other readers also ask tough questions of such results, either preventing publication or inducing further investigation. For example, initial claims [49,50] of 6σ evidence for faster-than-light neutrinos and cosmic inflation did not survive to actual publication [51,52].
Figure 5 shows that physics (Particle, Nuclear and Constants) measurements are more likely to agree if the difference in their publication dates is small. Such ‘bandwagon effects’ [20,36] are not observed in the Medical data, and they are irrelevant for Interlab quantities which are usually measured almost simultaneously. These correlations imply that measurements are biased either by expectations or common methodologies. Such correlations might explain the small (less than 1) values of σ_{x} for Nuclear, Particle and Constants data, or it could be that researchers in these fields simply tend to overestimate the scale of their uncertainties [53]. Removing expectation biases from the physics data would probably make their tails heavier.
Although Interlab data are not supposed to have any expectation biases, they are subject to methodological correlations due to common measurement models, procedures and types of instrumentation, so even their tails would probably increase if all measurements could be made truly independent.
4. Discussion
4.1. Comparison with earlier studies
In a famous dispute with Cauchy in 1853, eminent statistician Irénée-Jules Bienaymé ridiculed the idea that any sensible instrument had Cauchy uncertainties [54]. A century later, however, Harold Jeffreys noted that systematic errors may have a significant Cauchy component, and that the scale of the uncertainty contributed by systematic effects depends on the size of the random errors [55].
The results of this study agree with earlier research that also observed Student’s t tails, but only looked at a handful of subatomic or astrophysics quantities up to z∼5−10 [16,19,56–58]. Unsurprisingly, the tails reported here are mostly heavier than those reported for repeated measurements made with the same instrument (ν∼3−9) [59–61], which should be closer to Normal as they are not independent and share most systematic effects.
Instead of Student’s t tails, exponential tails have been reported for several nuclear and particle physics datasets [15,17,18,20], but in all cases some measurements were excluded. For example, the largest of these studies [20] looked at particle data (315 quantities and 53 322 pairs) using essentially the same method as this paper, but rejected the 20% of the data that gave the largest contributions to the χ^{2} for each quantity, suppressing the heaviest tails. Despite this data selection, all these studies have supra-exponential heavy tails for z≳5, and so are qualitatively consistent with the results of this paper. It is possible that averaging different quantities with exponential tails might produce apparent power laws [62], but this would require wild variations in the accuracy of the uncertainty estimates.
Instead of looking directly at the shapes of the measurement consistency distributions, Hedges [10] compared particle physics and psychology results and found them to have similar compatibility, with typically almost half of the quantities in both fields having statistically significant disagreements.
Thompson & Ellison [63] reported substantial amounts of ‘dark uncertainty’ in chemical analysis interlaboratory comparisons. Uncertainty is ‘dark’ if it does not appear as part of the known contributions to the uncertainty of individual measurements, but is inferred to exist because the dispersion of measured values is greater than expected based on the reported uncertainties. For example, six (21%) of 28 BIPM Key Comparisons studied had ratios (\bar{s}_{exp}/s_{obs}) of expected to observed standard deviations less than 0.2. This agrees with the key analytical results in table 4 (which include some of the same key comparisons). For sample sizes matching the 28 comparisons, 20% of samples drawn from a ν=2, σ=1.4 Student’s t-distribution would be expected to have \bar{s}_{exp}/s_{obs}<0.2. Pavese has also emphasized the large number of discrepant results in Key Comparisons [64].
The Open Science Collaboration (OSC) recently replicated 100 studies in psychology [65], providing some of the most direct evidence yet for poor scientific reproducibility. Using the OSC study’s supplementary information, z can be calculated for 87 of the reported original/replication measurement pairs; 27 (31%) disagree by more than 2σ and 2 (2.3%) by more than 5σ. This rate of disagreements is inconsistent with selection bias acting on a Normal distribution unless the more than 5σ data are excluded, but can be explained by selection-biased Student’s t data with ν∼3, consistent with the medical data reported in table 2.
4.2. How measurements fail
When a measurement turns out to be wrong, the reasons for this failure are often unknown, or at least unpublished, so it is interesting to look at examples where the causes were later understood or can be inferred.
For medical research, heterogeneity in methods or populations is a major source of variance. The largest inconsistency in the Medical dataset is in a comparison of fever rates after acellular versus whole-cell pertussis vaccines [66]. The large variance can probably be explained by significant differences among the study populations and especially in how minor adverse events were defined and reported.
The biggest z-values in the particle data come from complicated multichannel partial wave analyses of strong scattering processes, where many dozens of quantities (particle masses, widths, helicities, etc.) are simultaneously determined. Significant correlations often exist between the fitted values of the parameters but are not always clearly reported, and evaluations may not always include the often large uncertainties from choices in data and parametrization.
The largest disagreement in the Interlab data appears to be an obvious mistake. In a comparison of radioactivity in water [67], one laboratory reported an activity of 139 352±0.82 Bq kg^{−1} when the true value was about 31. Even without knowing the expected activity, the unreasonably small fractional uncertainty should probably have flagged this result. Such gross errors can produce almost-Cauchy deviations. For example, if the numerical result of a measurement is simply considered as an infinite bit string, then any ‘typographical’ glitch that randomly flips any bit with equal probability will produce deviations with a 1/x distribution.
One can hope that the best research will not be sloppy, but not even the most careful scientists can avoid all unpleasant surprises. In 1996, a team from PTB (the National Metrological Institute of Germany) reported a measurement of G_{N} that differed by 50σ from the accepted value; it took 8 years to track down the cause: a plausible but erroneous assumption about their electrostatic torque transmitter unit [68]. A 6.5σ difference between the CODATA 2006 and CODATA 2010 fine-structure constant values was due to a mistake in the calculation of some eighth-order terms in the theoretical value of the electron anomalous magnetic moment [69]. A 1999 determination [70] of Avogadro’s number by a team from Japan’s National Research Laboratory of Metrology using the newer X-ray crystal density method was off by approximately 9σ due to subtle silicon inhomogeneities [71]. In an interlaboratory comparison measuring PCB contamination in sediments, the initial measurement by BAM (the German Federal Institute for Materials Research and Testing) disagreed by many standard uncertainties, but this was later traced to cross-contamination in sample preparation [72]. Several nuclear half-lives measured by the US National Institute for Standards and Technology were known for some years to be inconsistent with other measurements; it was finally discovered that a NIST sample positioning ring had been slowly slipping over 35 years of use [73].
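The bit-flip argument can be checked with a small Python sketch. It assumes a finite 32-bit integer in place of the infinite bit string, and the function names are hypothetical. Flipping bit k changes the value by exactly 2^k, so choosing k uniformly gives equal probability per octave of deviation size, which corresponds to a 1/x density.

```python
import random

BITS = 32  # assumption: a finite 32-bit word stands in for the infinite bit string


def flip_random_bit(value: int, rng: random.Random) -> int:
    """Return value with one uniformly chosen bit flipped."""
    k = rng.randrange(BITS)
    return value ^ (1 << k)


def deviation_histogram(value: int, trials: int, seed: int = 0) -> list[int]:
    """Count deviations |glitched - value| by octave (power of two)."""
    rng = random.Random(seed)
    counts = [0] * BITS
    for _ in range(trials):
        dev = abs(flip_random_bit(value, rng) - value)
        counts[dev.bit_length() - 1] += 1  # dev is exactly 2**k
    return counts


counts = deviation_histogram(value=31, trials=32_000)
# Each octave [2**k, 2**(k+1)) receives roughly the same number of glitches:
# equal probability per logarithmic interval, i.e. a 1/x density.
print(min(counts), max(counts))
```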
Often discrepancies are never understood and are simply replaced by newer results. For example, despite bringing in a whole new research team to go over every component and system, the reason for a discordant NIST measurement of Planck’s constant was never found, but newer measurements by the same group were not anomalous [74].
4.3. Causes of heavy tails
Heavy tails have many potential causes, including bias [7], overconfident uncertainty underestimates [75] and uncertainty in the uncertainties [17], but it is not immediately obvious how these would produce the observed t-distributions with so few degrees of freedom.
Even when the uncertainty u is evaluated from the standard deviation of multiple measurements from a Normal distribution so that Student’s t-distribution would be expected, there are typically so many measurements that ν should be much larger than what is observed. Exceptions to this are when calibration uncertainties dominate, since often only a few independent calibration points are available, or when uncertainties from systematic effects are evaluated by making a few variations to the measurements, but these cannot explain most of the data.
Any reasonable publication bias applied to measurements with Gaussian uncertainties cannot create very heavy tails, just a distorted distribution with Gaussian tails: to produce one false published 5σ result would require bias strong enough to reject millions of studies. Underestimating σ does not produce a heavy tail, only a broader Normal z-distribution. Mixing multiple Normal distributions does not naturally produce almost-Cauchy distributions, except in special cases such as the ratio of two zero-mean Gaussians.
The heavy tails are not caused by poor older results. The heaviest-tailed data in figure 1 are actually the newest (93% of the interlaboratory data are less than 16 years old), and eliminating older results taken prior to the year 2000 does not reduce the tails for most data as shown in table 2.
Intentionally making up results, i.e. fraud, could certainly produce outliers, but this is unlikely to be a significant problem here. Since most of the data were extracted from secondary meta-analyses (e.g. Review of Particle Properties, Table of Radionuclides and Cochrane systematic reviews), results withdrawn for misconduct prior to the time of the review would probably be excluded. One meta-analysis in the medical dataset does include studies that were later shown to be fraudulent [76], but the fraudulent results actually contribute slightly less than average to the overall variance among the results for that meta-analysis.
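Two of these statements can be checked numerically. The Python sketch below uses hypothetical helper names and assumes unit-variance Gaussians. It computes the two-sided Normal probability of a 5σ deviation, whose reciprocal is the number of unbiased studies needed per chance 5σ result, and it simulates the ratio of two zero-mean Gaussians, which is Cauchy distributed.

```python
import math
import random


def normal_two_sided_tail(z: float) -> float:
    """P(|Z| > z) for a standard Normal variable."""
    return math.erfc(z / math.sqrt(2.0))


def gaussian_ratio_tail(threshold: float, trials: int, seed: int = 0) -> float:
    """Fraction of ratios num/den of two standard Normals with |ratio| > threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        num, den = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if abs(num) > threshold * abs(den):  # avoids dividing by a tiny den
            hits += 1
    return hits / trials


p5 = normal_two_sided_tail(5.0)
studies_per_fluke = 1.0 / p5                            # between one and two million
cauchy_tail = 1.0 - 2.0 * math.atan(5.0) / math.pi      # exact Cauchy P(|x| > 5)
print(f"Normal P(|z|>5) = {p5:.2e}, studies per 5-sigma fluke = {studies_per_fluke:.2e}")
print(f"Gaussian-ratio P(|r|>5) = {gaussian_ratio_tail(5.0, 100_000):.3f} vs Cauchy {cauchy_tail:.3f}")
```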
4.4. Modelling
Modelling the heavy tails may help us understand the observed distributions. One way is to assume that the measurement values are normally distributed with standard deviation t that is unknown but which has a probability distribution f(t) [15,17–19,77]. The measured value x is then expected to have a probability distribution
P(x) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}\,t} e^{-x^{2}/2t^{2}} f(t)\,dt    (4.1)
This is essentially a Bayesian estimate with prior f(t) and a Normal likelihood with unknown variance. If the uncertainties are accurately evaluated and Normal with variance σ^{2}, f(t) will be a narrow peak at t=σ. Assuming that f(t) is a broad Normal distribution leads to exponential tails [17] for large z.
To generate Student’s t-distributions, f(t) must be a scaled inverse χ^{2} (or Gamma) distribution in t^{2} [19,77]. This works mathematically, but why would variations in σ for independent measurements have such a distribution?
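The scale-mixture construction can be illustrated by simulation. In the Python sketch below, whose function names are hypothetical, t^{2} is drawn as νσ^{2} divided by a χ^{2} variate with ν degrees of freedom, which is the standard scaled inverse χ^{2} construction, and x is then drawn from a zero-mean Normal with standard deviation t. For ν=2 the two-sided tail of Student’s t has a simple closed form, so the simulated 5σ tail fraction can be compared with it.

```python
import math
import random


def sample_scale_mixture(nu: float, sigma: float, rng: random.Random) -> float:
    """Draw x ~ Normal(0, t), with t**2 from a scaled inverse chi-squared law."""
    chi2 = 0.0
    while chi2 == 0.0:                          # guard against an exact zero draw
        chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared with nu degrees of freedom
    t = sigma * math.sqrt(nu / chi2)            # t**2 = nu * sigma**2 / chi2
    return rng.gauss(0.0, t)


def tail_fraction(nu: float, sigma: float, z: float, trials: int, seed: int = 0) -> float:
    """Simulated fraction of draws with |x| > z * sigma."""
    rng = random.Random(seed)
    hits = sum(abs(sample_scale_mixture(nu, sigma, rng)) > z * sigma
               for _ in range(trials))
    return hits / trials


def student_t2_tail(z: float) -> float:
    """Exact P(|T| > z) for Student's t with nu = 2."""
    return 1.0 - z / math.sqrt(2.0 + z * z)


simulated = tail_fraction(nu=2.0, sigma=1.0, z=5.0, trials=200_000)
print(f"simulated {simulated:.4f} vs Student's t (nu=2) {student_t2_tail(5.0):.4f}")
```

The simulated fraction should match the closed-form value to within sampling noise, and both are far larger than the Normal expectation for a 5σ deviation.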
Heavy tails can only be generated by effects that can produce a wide range of variance, so we must model how consistency testing is used by researchers to constrain such effects. Consistency is typically tested using a metric such as the calculated χ^{2}-statistic for the agreement of N measurements x_{i} [43]
χ^{2}_{c}(x,u) = \sum_{i=1}^{N} \frac{(x_{i}-\bar{x})^{2}}{u_{i}^{2}}    (4.2)
where \bar{x} is the weighted mean of the x_{i} and u_{i} are the standard uncertainties reported by the researchers. For accurate standard uncertainties, χ^{2}_{c} will have a χ^{2} probability distribution with ν=N−1. If, however, the reported uncertainties are incorrect and the true standard uncertainties are tu_{i}, then it will be χ^{2}_{true}(x,tu)=χ^{2}_{c}(x,u)/t^{2} that is χ^{2}-distributed.
Researchers will probably search for problems if different consistency measurements have a poor χ^{2}_{c}(x,u), which typically means χ^{2}_{c}(x,u)>ν. The larger an unknown systematic error is, the more likely it is to be detected and either corrected or included in the reported uncertainty, so published results typically have χ^{2}_{c}(x,u)∼ν. Since χ^{2}_{c}(x,u)/t^{2} is expected to have a χ^{2}-distribution, a natural prior for t^{2} is indeed the scaled inverse χ^{2}-distribution needed to generate Student’s t-distributions from equation (4.1).
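The consistency statistic and its scaling with t can be sketched in a few lines of Python. The function names and the example numbers are hypothetical, and the weighted mean is assumed to use the conventional inverse-variance weights 1/u_{i}^{2}.

```python
def weighted_mean(x: list[float], u: list[float]) -> float:
    """Inverse-variance weighted mean (assumed weights 1/u_i**2)."""
    weights = [1.0 / ui ** 2 for ui in u]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)


def chi2_consistency(x: list[float], u: list[float]) -> float:
    """Chi-squared of N measurements about their weighted mean."""
    xbar = weighted_mean(x, u)
    return sum(((xi - xbar) / ui) ** 2 for xi, ui in zip(x, u))


# Hypothetical example: three measurements with equal reported uncertainties.
x = [10.0, 12.0, 11.0]
u = [1.0, 1.0, 1.0]
chi2_reported = chi2_consistency(x, u)                 # compare with nu = N - 1 = 2

# If the true uncertainties were t = 2 times larger, chi-squared shrinks by t**2.
t = 2.0
chi2_true = chi2_consistency(x, [t * ui for ui in u])
print(chi2_reported, chi2_true, chi2_reported / t ** 2)
```

Scaling every uncertainty by the same factor leaves the weighted mean unchanged, which is why the statistic scales exactly as 1/t^{2}.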
More mechanistically, it could be assumed that a normally distributed systematic error will be missed by N_{m} independent measurements if their χ^{2}(u)=(t^{2}/u^{2})χ^{2}(t) is less than some threshold χ^{2}_{max}∼ν=N_{m}−1. If the distribution of all possible systematic effects is P_{0}(t), then the probability distribution for the unfound errors will be
f(t) ∝ P_{0}(t) F(χ^{2}_{max}u^{2}/t^{2})    (4.3)
where F is the cumulative χ^{2}-distribution. P_{0}(t) is unknown, but a common Bayesian scale-invariant choice is P_{0}(t)∝1/t^{α}, with α>0.
Using this model with the reported uncertainty σ as the lower integration bound, the curve generated from equations (4.1) and (4.3) is very close to a ν=N_{m}−1+α Student’s t-distribution. The observed small values for ν mean that both N_{m} and α must be small. Making truly independent consistency tests is difficult, so it is not surprising that the effective number of checks (N_{m}) is usually small.
This model is plausible, but why are systematic effects consistent with a P_{0}(t)∝1/t^{α} power-law size distribution?
4.5. Complex systems
Scientific measurements are made by complex systems of people and procedures, hardware and software, so one would expect the distribution of scientific errors to be similar to those produced by other comparable systems.
Power-law behaviour is ubiquitous in complex systems [78], with the cumulative distributions of observed sizes (s) for many effects falling as 1/s^{α}, and these heavy tails exist even when the system has been designed and refined for optimal results.
A Student’s t-distribution has cumulative tail exponent α=ν, and the values for ν reported here are consistent with power-law tails observed in other designed complex systems. The frequency of software errors typically has a cumulative power-law tail corresponding to small ν∼2−3 [79], and in scientific computing these errors can lead to quantitative discrepancies orders of magnitude greater than expected [80]. The size distribution of electrical power grid failures has ν∼1.5−2 [81], and the frequency of spacecraft failures has ν∼0.6−1.3 [82]. Even when designers and operators really, really want to avoid mistakes, they still occur: the severity of nuclear accidents falls off only as ν∼0.7 [83], similar to the power laws observed for the sizes of industrial accidents [84] and oil spills [85]. Some complex medical interventions have power-law distributed outcomes with ν∼3−4 [86].
Combining the observed power-law responses of complex systems with the power-law constraints of consistency checking for systematic effects discussed in §4.4 leads naturally to the observed consistency distributions with heavy power-law tails. There are also several theoretical arguments that such distributions should be expected.
A systematic error or mistake is an example of a risk analysis incident, and power-law distributions are the maximal entropy solutions for such incidents when there are multiple nonlinear interdependent causes [85], which is often the case when things go wrong in research.
Scientists want to make the best measurements possible with the limited resources they have available, so scientific research endeavours are good examples of highly structured complex systems designed to optimize outcomes in the presence of constraints. Such systems are expected to exhibit ‘highly optimized tolerance’ [87,88], being very robust against designed-for uncertainties, but also hypersensitive to unanticipated effects, resulting in power-law distributed responses. Simple continuous models for highly optimized tolerant systems are consistent with the heavy tails observed in this study. These models predict that α∼1+1/d [88,89], where d(>0) is the effective dimensionality of the system, but larger values of α arise when some of the resources are used to avoid large deviations [89], e.g. spending time doing consistency checks.
4.6. How can heavy tails be reduced?
If one believes that mistakes can be eliminated and all systematic errors found if we just work hard enough and apply the most rigorous methodological and statistical techniques, then results from the best scientists should not have heavy tails. Such a belief, however, is not consistent with the experienced challenges of experimental science, which are usually hidden in most papers reporting scientific measurements [4,90]. As Beveridge famously noted [91], often everyone else believes an experiment more than the experimenters themselves. Researchers always fear that there are unknown problems with their work, and traditional error analysis cannot ‘include what was not thought of’ [47].
It is not easy to make accurate a priori identifications of those measurements that are so well done that they avoid having almost-Cauchy tails. Expert judgement is subject to well-known biases [92], and obvious criteria to identify better measurements may not work. For example, the Open Science Collaboration found that researchers’ experience or expertise did not significantly correlate with the reproducibility of their results [65]; the best predictive factor was simply the statistical significance of the original result. The best researchers may be better at identifying problems and not making mistakes, but they also tend to choose the most difficult challenges that provide the most opportunities to go wrong.
Reducing heavy tails is challenging because complex systems exhibit scale-invariant behaviour such that reducing the size of failures does not significantly change the shape of their distribution. Improving sensitivity makes previously unknown small systematic issues visible so they can be corrected or included in the total uncertainty. This improvement reduces σ, but even smaller systematic effects now become significant and tails may even become heavier and ν smaller. Comparing figures 1 and 3, it appears that data with higher average relative uncertainty tend to have heavier tails. This relationship between relative uncertainty and measurement dispersion is reminiscent of the empirical Horwitz power law in analytical chemistry [93], where the relative spread of interlaboratory measurements increases as the required sensitivity gets smaller, and of Taylor’s Law in ecology, where the variance grows with sample size so that the uncertainty on the mean does not shrink as 1/\sqrt{N} [94].
In principle, statistical errors can be made arbitrarily small by taking enough data, and ν can be made arbitrarily large by making enough independent consistency checks, but researchers have only finite time and resources so choices must be made. Taking more consistency check data limits the statistical uncertainty, since it is risky to treat data taken under different conditions as a single dataset. Consistency checks are never completely independent since it is impossible for different measurements of the same quantity not to share any people, methods, apparatus, theory or biases, so researchers must decide what tests are reasonable. The observed similar small values for ν may reflect similar spontaneous and often unconscious cost–benefit analyses made by researchers.
The data showing the lightest tail reported here (in table 4) may provide some guidance and caution. The high quality of the selected metrology standards measurements by leading national laboratories shows that heavy tails can be reduced by collaboratively taking great care to ensure consistency by sharing methodology and making regular comparisons. There are, however, limits to what can be achieved, as illustrated by the much heavier tail of the analytical standards measured by the same leading labs. Secondly, consistency is easier than accuracy. Interlaboratory comparisons typically take place over relatively short periods of time, with participating institutions using the best standard methods available at that time. Biases in the standard methods may only be later discovered when new methods are introduced. For example, work towards a redefinition of the kilogram and the associated development of new silicon atom counting technology revealed inconsistencies with earlier watt-balance measurements, and this has driven improvements in both methods [74]. Finally, selection bias that hides anomalous results is hard to eliminate. For one metrology key comparison, results from one quantity were not published because some laboratories reported ‘incorrect results’ [95].
Reducing tails is particularly challenging for measurements where the primary goal is improved sensitivity that may lead to new scientific understanding. By definition, a better measurement is not an identical measurement, and every difference provides room for new systematic errors, and every improvement that reduces the uncertainty makes smaller systematic effects more significant. Frontier measurements are always likely to have heavier tails.
5. Conclusion
Published scientific measurements typically have non-Gaussian, almost-Cauchy ν∼2−4 Student’s t error distributions, with up to 10% of results in disagreement by greater than 5σ. These heavy tails occur in even the most careful modern research, and do not appear to be caused by selection bias, old inaccurate data, or sloppy measurements of uninteresting quantities. For even the best scientists working on well-understood measurements using similar methodology, it appears difficult to achieve consistency better than ν∼10, with about 0.1% of results expected to be greater than 5σ outliers, a rate a thousand times higher than for a Normal distribution. These may, however, be underestimates. Because of selection/confirmation biases and methodological correlations, historical consistency can only set lower bounds on heavy tails: multiple measurements may all agree but all be (somewhat) wrong.
The effects of unknown systematic problems are not completely unpredictable. Scientific measurement is a complex process and the observed distributions are consistent with unknown systematics following the low-exponent power laws that are theoretically expected and experimentally observed for fluctuations and failures in almost all complex systems.
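The orders of magnitude quoted here can be illustrated with a short Python sketch. It assumes unit-scale Student’s t-distributions, whereas the fitted distributions in this study also have a scale parameter, so the numbers are indicative only. The function names are hypothetical, and the tail probability is obtained by Simpson’s rule integration of the t density.

```python
import math


def student_t_pdf(x: float, nu: float) -> float:
    """Density of a unit-scale Student's t-distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def student_t_two_sided_tail(z: float, nu: float, steps: int = 2000) -> float:
    """P(|T| > z) from Simpson's rule on [0, z]; steps must be even."""
    h = z / steps
    total = student_t_pdf(0.0, nu) + student_t_pdf(z, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, nu)
    central = total * h / 3.0            # integral of the density from 0 to z
    return 1.0 - 2.0 * central


def normal_two_sided_tail(z: float) -> float:
    """P(|Z| > z) for a standard Normal variable."""
    return math.erfc(z / math.sqrt(2.0))


for nu in (1, 2, 4, 10):
    print(f"nu={nu:2d}: P(|t|>5) = {student_t_two_sided_tail(5.0, nu):.2e}")
print(f"Normal: P(|z|>5) = {normal_two_sided_tail(5.0):.2e}")
```

For unit scale, the exact values are about 12.6% for ν=1 (Cauchy) and about 3.8% for ν=2, while ν=10 gives a 5σ rate roughly a thousand times the Normal value.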
Researchers do determine the scale of their uncertainties with fair accuracy, with the scale of medical uncertainties (σ_{x}∼1) slightly more consistent with the expected value (σ_{x}=1) than in physics (σ_{x}∼0.7−0.8). Medical and physics research have comparable reproducibility in terms of how well different studies agree within their uncertainties, consistent with a previous comparison of particle physics with social sciences [10]. Medical research may have slightly lighter tails, while physics results typically have better relative uncertainty and greater statistical significance.
Understanding that error distributions are often almost-Cauchy should encourage use of t-based [96], median [97] and other robust statistical methods [98], and supports choosing Student’s t [99] or Cauchy [100] priors in Bayesian analysis. Outlier-tolerant methods are already common in modern meta-analysis, so there should be little effect on accepted values of quantities with multiple published measurements, but this better understanding of the uncertainty may help improve methods and encourage consistency.
False discoveries are more likely if researchers apply Normal conventions to almost-Cauchy data. Although much abused, the historically common use of p<0.05 as a discovery criterion suggests that many scientists would like to be wrong less than 5% of the time. If so, the results reported here support the nominal 5σ discovery rule in particle physics, and may help discussion of more rigorous significance criteria in other fields [101–103].
This study should help researchers better understand the uncertainties in their measurements, and may help decision-makers and the public better interpret the implications of scientific research [104]. If nothing else, it should also remind everyone to never use Normal/Gaussian statistics when discussing the likelihood of extreme results.
Data accessibility
The sources for all data analysed are listed at http://dx.doi.org/10.5061/dryad.jb3mj [37].
Competing interests
I have no competing interests.
Funding
Financial support came from the University of Toronto.
Acknowledgements
I thank the students of the University of Toronto Advanced Undergraduate Physics Lab for inspiring consideration of realistic experimental expectations, and the University of Auckland Physics Department for their hospitality during very early stages of this work. I am grateful to D. Pitman for patient and extensive feedback, to R. Bailey for his constructive criticism, to R. Cousins, D. Harrison, J. Rosenthal and P. Sinervo for useful suggestions and discussion, and to M. Cox for many helpful comments on the manuscript.
Footnotes

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3660959.
 Received August 18, 2016.
 Accepted November 30, 2016.
 © 2017 The Authors.
Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
References
Source: rsos.royalsocietypublishing.org