Data dredging

Data dredging is the inappropriate (sometimes deliberately so) search for statistically significant relationships in large quantities of data. The statistical community formerly called this activity data mining, but that term is now in widespread use with a substantially different meaning, so the term data dredging is used instead. Data fishing is another name for the same practice.

The conventional statistical procedure is to formulate a research hypothesis (such as "people in higher social classes live longer"), then collect relevant data, and then carry out a statistical significance test to see whether the results could be due to chance.
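As an illustration of the last step, the following is a minimal sketch of one possible significance test, a permutation test on the difference in means. The function name `permutation_p_value` and the two lists of lifespans are hypothetical and invented for this example; they are not real measurements. The sketch assumes the hypothesis ("the first group lives longer") was fixed before the data were looked at.

```python
import random

def permutation_p_value(group_x, group_y, n_permutations=10_000, seed=0):
    """One-sided p-value for 'mean(group_x) > mean(group_y)'.

    Repeatedly shuffles the group labels and counts how often a random
    relabelling gives a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_x = len(group_x)
    n_y = len(group_y)
    observed = sum(group_x) / n_x - sum(group_y) / n_y
    pooled = list(group_x) + list(group_y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / n_y
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical ages at death, made up purely for illustration.
higher_class = [82, 79, 85, 88, 77, 81, 84, 90]
lower_class = [70, 74, 68, 72, 75, 69, 71, 73]

p_value = permutation_p_value(higher_class, lower_class)
print(f"p-value: {p_value:.4f}")
```

A small p-value means that a difference this large would rarely arise from chance relabelling alone. That reading is only valid because the hypothesis was chosen in advance rather than suggested by these same numbers.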

A key point is that one should not formulate a hypothesis as a result of seeing the data, at least not if the same data is then used as proof of the hypothesis. To work from data to hypotheses while avoiding the problems of data dredging, collect a data set and partition it into two subsets, A and B, with data items placed randomly in the two subsets. Only one subset (say, subset B) is examined for interesting hypotheses. Once a hypothesis has been formulated by examining subset B, it can be tested on subset A, since subset A was not used to construct the hypothesis. Only where the hypothesis is also supported by subset A is it reasonable to believe that it might be valid.
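The random partition described above can be sketched in a few lines. This is only an illustration: the function name `split_dataset`, the fixed seed, and the equal split are assumptions made for the example, and `records` stands in for whatever data items were actually collected.

```python
import random

def split_dataset(items, seed=0):
    """Randomly partition items into two disjoint subsets A and B.

    Subset B is the one to explore for hypotheses; subset A is held
    back untouched and used only to test a hypothesis found in B.
    """
    rng = random.Random(seed)
    shuffled = list(items)      # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    subset_a = shuffled[:half]  # confirmation subset
    subset_b = shuffled[half:]  # exploration subset
    return subset_a, subset_b

# Placeholder data items (record identifiers) for illustration only.
records = list(range(100))
subset_a, subset_b = split_dataset(records)
print(len(subset_a), len(subset_b))
```

The split has to be made before any exploration starts. If subset A is inspected while hypotheses are still being formed, it no longer provides an independent test.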

Any large data set contains some chance features that will not be present in similar data sets. Simply declaring these chance features to be facts produces spurious conclusions.

An example: assume a major bank runs a television marketing campaign to sell its banking services. Suppose the campaign is run in one geographical area but not in another, similar area, which serves as a control group. Suppose that overall sales in the treatment group, where the campaign was run, did not rise significantly compared to the control area. Suppose further that an analysis comparing the treatment and control groups found that sales did go up more for Spanish-speaking households, or for households with incomes between $35,000 and $50,000, or for households that had refinanced in the past two years, or for some other subgroup, and that the increase was statistically significant. There would certainly be a temptation to report such findings as proof that the campaign was successful, or would be successful if targeted at such groups in other markets.

It is important to realise that the alleged statistical significance here is completely spurious: significance tests do not protect against data dredging. When a hypothesis is tested on a data set on which it is already known to hold (because the hypothesis was derived from that data), the data set is by definition not representative, and any resulting significance levels are meaningless.
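A rough calculation shows why searching through many subgroups almost guarantees a "significant" result. The sketch below assumes that each subgroup comparison is an independent test at the 5% level and that the campaign has no real effect in any subgroup. Independence is a simplification, since subgroups such as those in the example can overlap, and the function names and the choice of 20 subgroups are hypothetical.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests on
    effect-free data comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** n_tests

def simulate_dredging(n_tests, alpha=0.05, n_trials=10_000, seed=0):
    """Estimate the same probability by simulation.

    With no real effect, a p-value from a continuous test statistic is
    uniform on [0, 1], so each subgroup test is modelled as one uniform
    random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials

analytic = chance_of_false_positive(20)
simulated = simulate_dredging(20)
print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Under these assumptions, examining 20 subgroups gives roughly a 64% chance of finding at least one nominally significant increase even though nothing real is there.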

See, for example:

Why Most Published Research Findings Are False, Public Library of Science, August 2005. http://medicine.plosjournals.org/perlserv/request=get-document&doi=10%2E1371%2Fjournal%2Epmed%2E0020124