Saturday, May 3, 2014

What is Statistics all about?

"What is statistics?" is a question a lot of people should ask. When I say I am doing a PhD in statistics, I get this "aha" look all the time, as if people were thinking "so that's what you are doing".

I find this very weird! Why? Because even after I had been doing statistics for a year at university level (some of it really hard stuff), I still had no idea what it was about.

So what is it all about?

Imagine the following: you are hired by Greenpeace to find out how many killer whales are in the ocean. This is just a small project in the big scope of trying to figure out how the environment is developing.

How would you do that? Assume that you are given a huge budget for this relatively small task.

Will you hire people to count all the killer whales in the entire world? No, that would not be possible. Would you put out observation posts then, at random locations, and count the number of whales there? Would you make a survey among fishermen and ask how many they have seen? Would you combine different methods?

It is easy to see that none of these methods are very good. There is no good solution to this problem. But Greenpeace still wants you to solve it. So you will have to settle for a bad/imperfect solution.
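To get a feeling for why such a solution is imperfect, here is a minimal sketch of the observation-post idea. Everything in it is made up for illustration: the ocean is a hypothetical grid of 10,000 cells, most of them empty and a few holding a pod of 12 whales, and `estimate_total` is just a name I picked. It counts whales at a handful of random cells and scales the average up to the whole ocean.

```python
import random

def estimate_total(ocean, n_posts, rng):
    """Count whales at n_posts random cells and scale up to the whole ocean."""
    sampled = rng.sample(ocean, n_posts)
    return sum(sampled) / n_posts * len(ocean)

rng = random.Random(2014)  # fixed seed so the run is repeatable

# Hypothetical ocean: about 1 cell in 10 holds a pod of 12 whales, the rest are empty.
ocean = [rng.choice([0] * 9 + [12]) for _ in range(10_000)]
true_total = sum(ocean)

# Repeat the same survey five times with different random posts.
estimates = [estimate_total(ocean, 200, rng) for _ in range(5)]

print("true total:", true_total)
for e in estimates:
    print("estimate:  ", round(e))
```

Each run of the survey lands on a different number, even though the ocean has not changed. On average the method is right, but any single answer can be quite far off, and that is the kind of answer we will have to live with.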

What does a solution look like? Is it a number? Let us ask a typical physicist (or any other kind of scientist): "When you measure the acceleration due to gravity, how do you give your result?" "I give a number, and then an uncertainty. For example I say 9.8 plus or minus 0.1, which means that I believe the answer to be 9.8, but it would not surprise me to hear that it was instead 9.7 or 9.9."

In statistics we often have poor (imprecise) solutions, because that is the best we can do. So our answer is always given together with a bullshitfactor (a relevant technical term is 'variance').

Where a physicist says 9.8 $\pm$ 0.1, a statistician is a bit more precise and says: "I am 90% certain that the answer is between 9.7 and 9.9, and my best estimate is 9.8".
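One common way to arrive at a statement like that is sketched below. It is only a sketch under assumptions: the ten measurements are invented, `confidence_interval` is a hypothetical helper, and the interval uses the normal approximation (mean plus or minus a multiple of the standard error). With this few measurements, a more careful analysis would typically use a t-distribution and get a slightly wider interval.

```python
import math
import statistics

def confidence_interval(samples, level=0.90):
    """Return (estimate, margin) using the normal approximation."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation
    # Two-sided interval: for level=0.90 this is the 95th percentile of N(0, 1).
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sd / math.sqrt(n)  # z times the standard error of the mean
    return mean, margin

# Invented measurements, for illustration only.
measurements = [9.6, 9.9, 9.8, 10.1, 9.7, 9.8, 9.9, 9.6, 10.0, 9.8]

estimate, margin = confidence_interval(measurements)
print(f"{estimate:.2f} ± {margin:.2f} (90% interval)")
```

The margin shrinks when the measurements agree more with each other or when there are more of them, which matches the intuition that more and better data should give a smaller bullshitfactor.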

A few examples (90% certainty that the truth is in our interval):
3.4 $\pm$ 0.1 - I'm quite confident the answer is 3.4
3.4 $\pm$ 0.5 - The answer should be somewhere near 3.4
3.4 $\pm$ 2.1 - I'm thinking the answer is 3.4, but really, it's mostly bullshit
3.4 $\pm$ 4.0 - I have no idea, but 3.4 is a number I like

So statistics is all about giving as good an answer as you can and, just as importantly, giving a truthful bullshitfactor, so that people using your conclusions are neither too certain nor too hesitant, and so that they know how much is knowledge and how much is bullshit.


