Lies, damn lies and how statistical significance is affected by sample size
In my review of the effect of HTTP2 on performance I mentioned that the result wasn't statistically significant. Jim Newbery looked at the numbers and pointed out that the result did meet the criteria for statistical significance. He was correct: I had got my sums wrong. But in checking my work I found something else.
You're on 10 - where can you go from there?
The initial data was based on 10 sample points. (One was missing from the HTTP2 data set - that must have been the drummer.)
Day | HTTP2 First Byte (s) | HTTP1.1 First Byte (s)
---|---|---
Average | 1.87 | 2.23
Standard deviation | 0.195 | 0.301
1 | 2.13 | 2.13
2 | 1.89 | 2.49
3 | 1.83 | 2.15
4 | 2.10 | 1.88
5 | 1.59 | 1.87
6 | 2.06 | 1.89
7 | 1.81 | 2.45
8 | ---- | 2.20
9 | 1.84 | 2.46
10 | 1.62 | 2.74
From this data we can say - with 99% confidence (p=0.0078) - that this result is statistically significant.
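(If you'd rather check the sums in code than in a browser, here's a minimal sketch using SciPy. It assumes the online calculator runs a Welch's unequal-variance two-sample t-test, which reproduces the p-value above; the variable names are mine.)

```python
# Welch's (unequal-variance) two-sample t-test on the first-byte times.
# Assumption: this matches what the online calculator computes.
from scipy import stats

http2 = [2.13, 1.89, 1.83, 2.10, 1.59, 2.06, 1.81, 1.84, 1.62]  # day 8 missing
http11 = [2.13, 2.49, 2.15, 1.88, 1.87, 1.89, 2.45, 2.20, 2.46, 2.74]

t_stat, p_value = stats.ttest_ind(http2, http11, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p ≈ 0.0078, under the 0.01 cut-off
```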
These go to eleven.
The next runs in the sequences were 2.12s for HTTP2 and 1.90s for HTTP1.1 - against the trend, but still not the furthest outliers in either data set.
Day | HTTP2 First Byte (s) | HTTP1.1 First Byte (s)
---|---|---
Average | 1.90 | 2.20
Standard deviation | 0.199 | 0.302
1 | 2.13 | 2.13
2 | 1.89 | 2.49
3 | 1.83 | 2.15
4 | 2.10 | 1.88
5 | 1.59 | 1.87
6 | 2.06 | 1.89
7 | 1.81 | 2.45
8 | ---- | 2.20
9 | 1.84 | 2.46
10 | 1.62 | 2.74
11 | 2.12 | 1.90
Crunching the numbers we can now say - with 99% confidence (p=0.0154) - that this result is not statistically significant.
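(The same sketch with the eleventh run appended to each series shows the flip.)

```python
# The same Welch's t-test, now including day 11 in both series.
from scipy import stats

http2 = [2.13, 1.89, 1.83, 2.10, 1.59, 2.06, 1.81, 1.84, 1.62, 2.12]
http11 = [2.13, 2.49, 2.15, 1.88, 1.87, 1.89, 2.45, 2.20, 2.46, 2.74, 1.90]

t_stat, p_value = stats.ttest_ind(http2, http11, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p ≈ 0.0154, over the 0.01 cut-off
```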
It's one louder, isn't it?
By adding one result we've managed to flip the statistical significance (at the 99% confidence level), so how meaningful is either result? The root problem here is that we're trying to draw too much significance from too little data taken from a low-quality set. (And committing crimes against p-values.)
The safe conclusion from this small data set is that the change to HTTP2 hasn't introduced a regression and it seems to be slightly faster - which is Good Enough™ for this purpose.
I've always been wary of putting too much faith in statistical significance since a first-year chemistry degree experiment disproved the second law of thermodynamics (with 95% confidence). It turned out a faulty water bath had skewed the data, and figuring that out was a much more valuable lesson than whatever, now long forgotten, chemical reaction we were ostensibly studying.
Resources
[Evan’s Awesome A/B Tools](https://www.evanmiller.org/ab-testing/) were invaluable for this - especially the Two-Sample T-Test calculator and visualiser.