Document-term matrix |
Document-term matrices are used in natural language processing programs. They represent natural language documents as mathematical objects and make it possible to process them as a whole.
=General Concept=
When building a database of the terms that appear in a set of documents, the document-term matrix has one row per document and one column per term. For instance, given the following two (short) documents:
*D1 = I like databases
*D2 = I hate hate databases
the document-term matrix would be:

      I    like   hate   databases
D1    1    1      0      1
D2    1    0      2      1

which shows which documents contain which terms and how many times each term appears.
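The construction described here can be sketched in Python. This is a minimal illustration, assuming that terms are simply whitespace-separated, case-sensitive tokens; the function name `build_document_term_matrix` is made up for this example and does not come from any library.

```python
from collections import Counter

def build_document_term_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] counts term j in document i."""
    # Assumption: a term is any whitespace-separated token, case-sensitive.
    tokenized = [doc.split() for doc in documents]

    # Columns: distinct terms, in order of first appearance.
    terms = []
    seen = set()
    for tokens in tokenized:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                terms.append(token)

    # Rows: one count vector per document.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in terms])
    return terms, matrix

terms, matrix = build_document_term_matrix(
    ["I like databases", "I hate hate databases"]
)
print(terms)
for row in matrix:
    print(row)
```

Columns follow the order in which terms are first seen, so the column order depends on the input and is otherwise arbitrary; only the counts per document and term matter.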
More sophisticated weights can be used in place of raw counts; a typical example is tf-idf.
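As an illustration of such a weighting, the sketch below applies one common tf-idf variant to a small count matrix. It assumes term frequency is the count divided by the document length and inverse document frequency is the natural logarithm of N/df; other variants exist (smoothing, different logarithm bases), and the function name `tf_idf` is chosen for this example only.

```python
import math

def tf_idf(matrix):
    """Weight a count matrix (rows = documents, columns = terms) by tf-idf."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0

    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]

    weighted = []
    for row in matrix:
        length = sum(row)  # total number of term occurrences in the document
        weighted.append([
            (row[j] / length) * math.log(n_docs / df[j])
            if length and df[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted

# Counts for the columns I, like, hate, databases.
counts = [
    [1, 1, 0, 1],  # D1 = I like databases
    [1, 0, 2, 1],  # D2 = I hate hate databases
]
weights = tf_idf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

Under this variant, terms that occur in every document ("I" and "databases" here) receive weight zero, while terms specific to one document keep a positive weight.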
=Choice of Terms=
One way to view the matrix is that each row represents a document. In the vector space model, which is normally the one used when computing a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are the semantic units of the documents. For Indo-European languages, it is often assumed that nouns, verbs and adjectives are the more significant syntactic categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
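To make the row-as-vector view concrete, the following sketch compares two documents by cosine similarity, with and without word pairs as extra terms. It rests on loud assumptions: terms are lowercased whitespace tokens, no part-of-speech filtering is done, and every adjacent word pair is treated as a candidate collocation, which is only a crude stand-in for real collocation extraction. The helper names `extract_terms` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter

def extract_terms(text, use_bigrams=False):
    """Split text into word terms; optionally add adjacent word pairs."""
    words = text.lower().split()
    terms = list(words)
    if use_bigrams:
        # Simplification: every adjacent pair counts as a candidate
        # collocation; real systems keep only statistically significant pairs.
        terms += [" ".join(pair) for pair in zip(words, words[1:])]
    return terms

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = Counter(extract_terms("I like databases"))
d2 = Counter(extract_terms("I hate hate databases"))
similarity = cosine_similarity(d1, d2)

b1 = Counter(extract_terms("I like databases", use_bigrams=True))
b2 = Counter(extract_terms("I hate hate databases", use_bigrams=True))
similarity_with_pairs = cosine_similarity(b1, b2)

print(round(similarity, 3), round(similarity_with_pairs, 3))
```

The two example documents share single words but no word pairs, so adding pairs as terms lowers their similarity; the pairs contribute dimensions in which the documents differ.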
=Applications=
==Improving search results==
Latent semantic analysis (performing singular value decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and by matching synonyms of the query terms. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.|
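The decomposition behind latent semantic analysis can be written out as follows. This is a sketch of the usual textbook formulation, which uses a truncated singular value decomposition; the symbols X, U, Σ, V, k and q are notation introduced here for illustration, and the rank k is a free parameter that must be chosen by the user.

```latex
\begin{align*}
% X is the m-by-n document-term matrix (m documents, n terms).
X &= U \Sigma V^{\mathsf{T}}
  && \text{(thin singular value decomposition)} \\
% The right singular vectors are eigenvectors of X^T X, which is
% the link to an eigenvalue decomposition.
X^{\mathsf{T}} X &= V \Sigma^{2} V^{\mathsf{T}} \\
% Keep only the k largest singular values and their vectors.
X_k &= U_k \Sigma_k V_k^{\mathsf{T}}
  && \text{(rank-$k$ approximation)} \\
% A query q, written as a term-count vector of length n, is mapped
% into the same k-dimensional space as the rows of U_k.
\hat{q} &= \Sigma_k^{-1} V_k^{\mathsf{T}} q
\end{align*}
```

Documents are then ranked by comparing the projected query with the rows of U_k, for example by cosine similarity, so that documents can match a query without sharing its exact terms.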
|