Colin McFarland: A/B testing

Published by Orde Saunders

Colin McFarland was talking about A/B testing at DiBi, these are my notes from his talk.

The more experiments you run, the more difficult they become to manage.

Skyscanner use an in-house A/B testing framework.

A/B testing shouldn't just be for data scientists - can you reduce the barrier so that anyone can run a test? Understanding the pitfalls is key when you let anyone run tests.

  • Confirmation bias - we have a tendency to focus on data that confirms our beliefs.
  • HARKing: Hypothesising After the Results are Known. Very easy to get false positives.
  • Cherry picking - not understanding null hypothesis testing. (You can get different results on different days.)
  • If you get a big result, ask where the errors are.
  • Have a hypothesis you are looking to validate.
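The false-positive risk above is easy to see in an A/A simulation - a sketch, not from the talk, comparing two samples drawn from the *same* conversion rate with a two-proportion z-test at the 5% level; the numbers and 5% base rate are illustrative assumptions:

```python
import math
import random

random.seed(42)

def conversions(n, rate):
    """Simulate n visitors converting at the given rate."""
    return sum(random.random() < rate for _ in range(n))

def z_score(conv_a, conv_b, n):
    """Two-proportion z-score, equal sample sizes per variant."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    return (conv_b / n - conv_a / n) / se if se else 0.0

N, RATE, RUNS = 2000, 0.05, 2000
# Both "variants" are identical, yet some runs still look significant.
false_positives = sum(
    abs(z_score(conversions(N, RATE), conversions(N, RATE), N)) > 1.96
    for _ in range(RUNS)
)
print(f"A/A 'wins' at the 5% level: {false_positives / RUNS:.1%}")
```

Even with no real difference, roughly 1 in 20 runs "wins" by chance - and checking many metrics per experiment, or picking a hypothesis after seeing the data, multiplies those chances.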

Test like you're wrong - assume nothing interesting is happening. Design like you're right - try and make something interesting happen.

Default to rejecting the test if the result isn't significant. Run fast experiments; you might need to extend them or run a bigger experiment. Think about what's measurable. Check out the Hypothesis kit from Craig Sullivan - it might not work for you without modification.
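The hypothesis kit is a fill-in-the-blanks template; a minimal sketch of it as a structured record, with field names and example values that paraphrase the kit rather than quote it:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    observation: str  # Because we saw ...
    change: str       # we expect that ...
    impact: str       # will cause ...
    metric: str       # We'll measure this using ...

    def __str__(self):
        return (f"Because we saw {self.observation}, "
                f"we expect that {self.change} will cause {self.impact}. "
                f"We'll measure this using {self.metric}.")

# Hypothetical example values for illustration only.
h = Hypothesis("drop-off on the date picker", "a simpler calendar",
               "more completed searches", "search completion rate")
print(h)
```

Writing the hypothesis down in this shape *before* the test runs is what guards against HARKing.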

When we use A/B tests we find out that things we love don't work - 5% (or less) of tests are successful. You need to change how you work: change tack or iterate. The Ikea effect: you are more attached to something you create yourself. Experiment calculator.
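A sketch of what an experiment calculator computes - visitors per variant needed to detect a given lift - using the standard two-proportion approximation at 5% significance and 80% power (z values 1.96 and 0.84); this is an assumption about the calculator mentioned, not its actual implementation:

```python
import math

def sample_size_per_variant(base_rate, min_lift):
    """Rough visitors per variant to detect an absolute lift in conversion."""
    z_alpha, z_beta = 1.96, 0.84  # 5% two-sided significance, 80% power
    variance = base_rate * (1 - base_rate)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / min_lift ** 2)

# Detecting a 1-point lift on a 5% base rate takes thousands of visitors:
print(sample_size_per_variant(0.05, 0.01))
```

Running this kind of calculation up front tells you whether a "fast experiment" can realistically detect the effect you care about.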

Things may not move in the direction you want - they will move in the direction that users want. We build products based on truth. We can give ideas that would normally be rejected a chance, they get assessed by users not by internal stakeholders.

Run A/B tests to learn - they don't have to be focused on complete features. Create a button and see if people use it; if they do, then you can build out what happens afterwards.

Don't take one experiment as the truth, run variations to validate the core principle.

You can use A/B tests to ensure that changes don't have a negative effect. This means we don't have to be purely data driven; there is room for experimentation.