Lies, damn lies and how statistical significance is affected by sample size

Published by Orde Saunders

In my review of the effect of HTTP2 on performance I mentioned that the result wasn't statistically significant. Jim Newbery looked at the numbers and said that the result did fit the criteria for statistical significance. He was correct and I had got my sums wrong, but in checking my work I found something else.

You're on 10 - where can you go from there?

The initial data was based on 10 sample points. (One was missing from the HTTP2 data set - that must have been the drummer.)

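| Day | First Byte, HTTP2 (s) | First Byte, HTTP1.1 (s) |
|---|---|---|
| Average | 1.87 | 2.23 |
| Standard deviation | 0.195 | 0.301 |
| 1 | 2.13 | 2.13 |
| 2 | 1.89 | 2.49 |
| 3 | 1.83 | 2.15 |
| 4 | 2.10 | 1.88 |
| 5 | 1.59 | 1.87 |
| 6 | 2.06 | 1.89 |
| 7 | 1.81 | 2.45 |
| 8 | ---- | 2.20 |
| 9 | 1.84 | 2.46 |
| 10 | 1.62 | 2.74 |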

From this data we can say - with 99% confidence (p=0.0078) - that this result is statistically significant.
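As a rough check on that figure, here is the arithmetic for Welch's unequal-variance t-test using the summary rows of the table. That the calculator uses Welch's test is my assumption, not something stated by the tool here; the subscripts 2 and 1.1 label the HTTP2 and HTTP1.1 samples, and the means are taken to three decimal places from the raw rows.

```latex
\[
t = \frac{\bar{x}_{1.1} - \bar{x}_{2}}
         {\sqrt{\dfrac{s_{2}^{2}}{n_{2}} + \dfrac{s_{1.1}^{2}}{n_{1.1}}}}
  \approx \frac{2.226 - 1.874}
               {\sqrt{\dfrac{0.195^{2}}{9} + \dfrac{0.301^{2}}{10}}}
  \approx 3.05
\]
\[
\nu \approx
\frac{\left(\dfrac{s_{2}^{2}}{n_{2}} + \dfrac{s_{1.1}^{2}}{n_{1.1}}\right)^{2}}
     {\dfrac{\left(s_{2}^{2}/n_{2}\right)^{2}}{n_{2} - 1}
      + \dfrac{\left(s_{1.1}^{2}/n_{1.1}\right)^{2}}{n_{1.1} - 1}}
\approx 15.5
\]
```

A t of about 3.05 on roughly 15.5 degrees of freedom corresponds to a two-tailed p of roughly 0.008, which is in line with the value quoted above.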

These go to eleven.

The next runs in the sequences were 2.12s and 1.90s - against trend but still not the furthest outliers in either data set.

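| Day | First Byte, HTTP2 (s) | First Byte, HTTP1.1 (s) |
|---|---|---|
| Average | 1.90 | 2.20 |
| Standard deviation | 0.199 | 0.302 |
| 1 | 2.13 | 2.13 |
| 2 | 1.89 | 2.49 |
| 3 | 1.83 | 2.15 |
| 4 | 2.10 | 1.88 |
| 5 | 1.59 | 1.87 |
| 6 | 2.06 | 1.89 |
| 7 | 1.81 | 2.45 |
| 8 | ---- | 2.20 |
| 9 | 1.84 | 2.46 |
| 10 | 1.62 | 2.74 |
| 11 | 2.12 | 1.90 |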

Crunching the numbers we can now say - with 99% confidence (p=0.0154) - that this result is not statistically significant.

It's one louder, isn't it?

By adding one result we've managed to flip the statistical significance (with 99% confidence), so how meaningful is either result? The root problem here is that we're trying to draw too much significance from too little data taken from a low quality set. (And committing crimes against p values.)
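For anyone who wants to reproduce the flip without the online calculator, here is a minimal sketch in plain Python. It assumes a two-tailed Welch (unequal variance) t-test, which is my guess at what the calculator does, and the helper names `t_pdf`, `two_tailed_p` and `welch_p` are made up for this example. The p-value comes from numerically integrating the t density with Simpson's rule rather than from a statistics library; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` should do the same job.

```python
import math
from statistics import mean, variance

# First-byte times in seconds, copied from the tables above.
# Day 8 has no HTTP2 reading, so that list is one entry shorter.
HTTP2 = [2.13, 1.89, 1.83, 2.10, 1.59, 2.06, 1.81, 1.84, 1.62, 2.12]
HTTP11 = [2.13, 2.49, 2.15, 1.88, 1.87, 1.89, 2.45, 2.20, 2.46, 2.74, 1.90]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """1 minus the area under the density between -|t| and |t| (Simpson's rule)."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0


def welch_p(a, b):
    """Two-tailed p-value for Welch's t-test on two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return two_tailed_p(t, df)


p_ten = welch_p(HTTP2[:9], HTTP11[:10])  # days 1 to 10
p_eleven = welch_p(HTTP2, HTTP11)        # days 1 to 11
for label, p in (("10 days", p_ten), ("11 days", p_eleven)):
    verdict = "significant" if p < 0.01 else "not significant"
    print(f"{label}: p = {p:.4f} -> {verdict} at the 99% level")
```

If the Welch assumption holds, the two p-values should land close to the 0.0078 and 0.0154 quoted above, one on either side of the 0.01 threshold.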

The safe conclusion from this small data set is that the change to HTTP2 hasn't introduced a regression and it seems to be slightly faster - which is Good Enough™ for this purpose.

I've always been wary of putting too much faith in statistical significance since a first-year chemistry degree experiment disproved the second law of thermodynamics (with 95% confidence). It turned out a faulty water bath had skewed the data and figuring that out was a much more valuable lesson than whatever, now long forgotten, chemical reaction we were ostensibly studying.

Resources

Evan’s Awesome A/B Tools were invaluable for this - especially the Two-Sample T-Test calculator and visualiser.