Data mining
Data mining, also known as knowledge discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
=Definition=
Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and as "the science of extracting useful information from large data sets or databases". Although it is usually used in relation to the analysis of data, data mining, like artificial intelligence, is an umbrella term that is used with varied meanings in a wide range of contexts.
A simple example of data mining is its use in a retail sales department. If a store tracks the purchases of a customer and notices that the customer buys a lot of silk shirts, the data mining system will record an association between that customer and silk shirts. The sales department can then act on that information: it may begin direct mail marketing of silk shirts to that customer, or it may instead try to get the customer to buy a wider range of products. In this case, the data mining system used by the retail store discovered information about the customer that was previously unknown to the company.
Another widely used (though hypothetical) example is that of a very large North American chain of supermarkets. Through intensive analysis of the transactions and the goods bought over a period of time, analysts found that beer and diapers were often bought together. Though explaining this relationship might be difficult, taking advantage of it should not be hard (e.g. placing the high-profit diapers next to the high-profit beer). This technique is often referred to as market basket analysis.
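The beer-and-diapers example can be made concrete with a small sketch of market basket analysis. The code below is an illustration only, not a description of any particular retail system: the baskets are invented, and the function name `pair_rules` and the `min_support` threshold are assumptions chosen for the example. For each pair of items it computes the support (share of baskets containing both), the confidence (share of baskets with the first item that also contain the second), and the lift (confidence divided by the overall share of baskets containing the second item).

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Return (x, y, support, confidence, lift) for frequent item pairs."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    # How many baskets contain each item, and each unordered pair of items.
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be of interest
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    # Highest lift first: these pairs co-occur more often than chance suggests.
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Hypothetical baskets, invented purely for illustration.
transactions = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "chips"],
    ["milk", "diapers", "beer"],
]

for x, y, s, c, l in pair_rules(transactions):
    print(f"{x} -> {y}: support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 means the two items appear together more often than would be expected if they were bought independently. A lift near 1 suggests no association.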
In statistical analyses in which there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, wherein the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is searched heuristically. With the advent of grid computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all-subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of clinical data.
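The idea of examining all 2^k candidate models can be sketched in a few lines. The following is a minimal illustration and not a production implementation: it assumes ordinary least squares with an intercept, solves the normal equations with a small hand-written Gaussian elimination, and uses a tiny invented data set with k = 3. The helper names (`solve`, `rss`, `all_subsets`) are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss(rows, y, subset):
    """Residual sum of squares of a least-squares fit: intercept plus `subset`."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, d))) ** 2
        for d, yi in zip(design, y)
    )


def all_subsets(rows, y):
    """Fit every one of the 2**k subsets of explanatory variables."""
    k = len(rows[0])
    return {
        s: rss(rows, y, s)
        for size in range(k + 1)
        for s in combinations(range(k), size)
    }


# Invented data: the outcome depends on variables 0 and 2 only.
rows = [
    [1, 4, 2], [2, 1, 7], [3, 6, 1], [4, 2, 5],
    [5, 8, 3], [6, 3, 8], [7, 7, 4], [8, 5, 6],
]
y = [2 * r[0] - r[2] + 1 for r in rows]

results = all_subsets(rows, y)
for size in range(len(rows[0]) + 1):
    best = min((s for s in results if len(s) == size), key=results.get)
    print(f"best subset of size {size}: {best}")
```

Because the residual sum of squares never increases when a variable is added, a real analysis would compare models of different sizes with a criterion that penalizes complexity. The number of fits doubles with each additional variable, which is why the exhaustive approach is limited to moderate k.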
=Data dredging=
Used in the technical context of data warehousing and analysis, the term data mining is neutral. However, it sometimes has a more pejorative usage that implies imposing patterns (and particularly causal relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlation is more properly criticized as data dredging in the statistical literature. Another term for this misuse of statistics is data fishing.
Used in this latter sense, data dredging implies scanning the data for any relationships and then, when one is found, coming up with an interesting explanation. (This is also referred to as overfitting the model.) The problem is that large data sets invariably happen to have some exciting relationships peculiar to that data, so any conclusions reached are likely to be highly suspect. In spite of this, some exploratory data analysis is always required in any applied statistical analysis to get a feel for the data, so the line between good statistical practice and data dredging is sometimes less than clear.
The common approach in data mining to overcoming the problem of overfitting is to separate the data into two or three separate data sets (called the training set, validation set, and testing set). The model is built using the training and validation sets and is then tested using the testing set. The procedure can be repeated many times by resampling the data sets, in order to be more certain that a real pattern has been found and that the model is not merely capitalizing on random chance (i.e. overfitting).
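The splitting and resampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the 60/20/20 proportions, the synthetic data, the straight-line model, and the function names `split`, `fit_line`, and `mse` are all choices made for the example, not part of any standard. Here the model is fitted on the training set, checked on the validation set, and scored on the testing set, and the whole procedure is repeated with different random splits.

```python
import random


def split(records, fractions=(0.6, 0.2, 0.2), seed=None):
    """Shuffle records and cut them into training, validation and testing sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # everything left over is the testing set
    )


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(model, points):
    """Mean squared prediction error of a (slope, intercept) model."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Synthetic data invented for the example: a linear trend plus noise.
rng = random.Random(0)
data = [(x, 3.0 * x + 2.0 + rng.gauss(0, 1)) for x in range(100)]

val_errors, test_errors = [], []
for seed in range(20):  # repeat with a fresh random split each time
    train, val, test = split(data, seed=seed)
    model = fit_line(train)
    val_errors.append(mse(model, val))
    test_errors.append(mse(model, test))

print("mean validation error:", sum(val_errors) / len(val_errors))
print("mean testing error:   ", sum(test_errors) / len(test_errors))
```

If the error on the testing set were consistently much larger than the error on the data used to build the model, that would suggest the model had fitted noise rather than a real pattern.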
A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly vulnerable to this. As Fred Schwed, Jr. put it: "There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it." However, when properly done, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has also been shown to be very useful in risk management. Finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.
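The roulette-wheel effect is easy to reproduce. The sketch below is an illustration under invented assumptions (the sample sizes, the seed, and the function names `pearson` and `dredge` are arbitrary): it generates a random target series and many unrelated random candidate series, picks the candidate that correlates best with the target on the first half of the observations, and then checks the same candidate on the second half.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def dredge(n_obs=20, n_candidates=1000, seed=0):
    """Search pure noise for the candidate that best 'explains' the target."""
    rng = random.Random(seed)
    # Every series is independent noise, so no real relationship exists.
    target = [rng.gauss(0, 1) for _ in range(2 * n_obs)]
    candidates = [
        [rng.gauss(0, 1) for _ in range(2 * n_obs)] for _ in range(n_candidates)
    ]
    # Dredging step: keep whichever candidate looks best on the first half.
    best = max(candidates, key=lambda c: abs(pearson(c[:n_obs], target[:n_obs])))
    in_sample = pearson(best[:n_obs], target[:n_obs])
    out_of_sample = pearson(best[n_obs:], target[n_obs:])
    return in_sample, out_of_sample


in_sample, out_of_sample = dredge()
print(f"correlation on the searched data: {in_sample:+.2f}")
print(f"correlation on fresh data:        {out_of_sample:+.2f}")
```

With enough candidates, some series will correlate strongly with the target on the searched data purely by chance. Since the series are unrelated by construction, there is no reason for that correlation to persist on the held-out half.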
Most data mining efforts are focused on developing a fine-grained, highly detailed model of some large data set. Other researchers have described an alternative method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent the relevant data.
=Privacy concerns=
There are also privacy concerns associated with data mining. For example, if an employer has access to medical records, they may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut costs for insurance, but it creates ethical and legal problems.
Data mining government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns.
There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs with adverse reactions. Since the combination may occur in only 1 out of 1000 people, a single case may not be apparent. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.
Essentially, data mining gives information that wouldn't be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.
=Combinatorial game data mining=
*Data mining from combinatorial game oracle machines:
Since the early 1990s, with the availability of oracles for certain combinatorial games, also called tablebases, [...] are notable examples of people doing this work, though they were not and are not involved in tablebase generation.
=Notable uses of data mining=
*Data mining has been cited as the method by which the U.S. Army unit Able Danger supposedly identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al Qaeda cell operating in the U.S. more than a year before the attack.
**See the Wikinews article at: [http://en.wikinews.org/w/index.php?title=U.S._Army_intelligence_had_detected_9/11_terrorists_year_before%2C_says_officer&oldid=130741 Wikinews: U.S. Army intelligence detection of 9/11 terrorists before attack]
**See also the Wikipedia article on the unit Able Danger.
=In fiction=
Vernor Vinge's science fiction novel A Fire Upon the Deep takes place in a universe where almost every piece of information is already known, but the precise location of that information is not, giving rise to the profession of "Programmer Archaeologist".
=See also=
*Artificial intelligence
*Artificial neural network
*Business intelligence
*Business performance management
*Database
*Data stream mining
*Data warehouse
*Decision tree
*Descriptive statistics
*Discovery Science
*Document warehouse
*Fuzzy logic
*Hypothesis testing
*Java Data Mining
*Linear discriminant analysis
*Logit (in reference to logistic regression)
*Loyalty card
*Machine learning
*Nearest neighbor (pattern recognition)
*Pattern recognition
*Principal components analysis
*Regression analysis
*Relational data mining
*Statistics
*Text mining
=References=
Endnotes:
W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, Fall 1992, pages 213-228.
D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X
Fred Schwed, Jr., Where Are the Customers' Yachts? (1940). ISBN 0471119792
T. Menzies, Y. Hu, Data Mining For Very Busy People. IEEE Computer, October 2003, pages 18-25.
K. A. Taipale, [http://ssrn.com/abstract=546782 Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data], [http://www.advancedstudies.org/ Center for Advanced Studies in Science and Technology Policy]. [http://www.stlr.org/cite.cgi?volume=5&article=2 5 Colum. Sci. & Tech. L. Rev. 2] (December 2003).
Eddie Reed, Jing Jie Yu, Antony Davies, et al., [http://www.business.duq.edu/faculty/davies/research/rer.doc Clear Cell Tumors Have Higher mRNA Levels of ERCC1 and XPB than Other Types of Epithelial Ovarian Cancer], Clinical Cancer Research, 2003.
