Term-document matrix

Term-document matrices, or term-by-document matrices, are used in natural language processing programs. They represent natural language documents as mathematical objects and make it possible to process them as a whole.

=General Concept=

When creating a database of the terms that appear in a set of documents, the term-document matrix contains rows corresponding to the terms and columns corresponding to the documents (or contexts). Each cell value represents the occurrence of a particular term in a particular document.

For instance, if one has the following two (short) documents:

*D1 = I like databases
*D2 = I hate hate databases

then the term-document matrix would be:

             D1  D2
 I            1   1
 like         1   0
 hate         0   2
 databases    1   1

Reading along a row shows which documents contain a given term, reading down a column shows which terms a given document contains, and each cell gives the number of times the term appears.
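The construction for the two example documents can be sketched in a few lines of Python. This is only a minimal illustration: the function name `term_document_matrix` and the whitespace tokenization are assumptions made for the sketch, and a real system would normalize case, strip punctuation and so on.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j].

    Rows correspond to terms and columns to documents.
    """
    # Naive whitespace tokenization (an assumption of this sketch).
    tokenized = [doc.split() for doc in docs]
    counts = [Counter(tokens) for tokens in tokenized]

    # Collect the vocabulary in order of first appearance.
    terms = []
    for tokens in tokenized:
        for token in tokens:
            if token not in terms:
                terms.append(token)

    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


docs = ["I like databases", "I hate hate databases"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Because the vocabulary is collected in order of first appearance, the rows here come out in the order I, like, databases, hate; the row order carries no meaning as long as it is used consistently.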

Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
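As a rough illustration of such a weighting, the sketch below applies one common tf-idf variant (raw count multiplied by the logarithm of the number of documents divided by the document frequency) to the counts of the two example documents. Other variants with smoothing or normalization exist; the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(matrix):
    """Weight a term-document count matrix (rows = terms, columns = documents).

    Uses weight = count * log(N / df), where N is the number of documents
    and df is the number of documents containing the term.
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted


counts = [
    [1, 1],  # I
    [1, 0],  # like
    [0, 2],  # hate
    [1, 1],  # databases
]
weights = tf_idf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

With this variant, terms that occur in every document ("I" and "databases") receive weight zero, so only "like" and "hate" distinguish the two documents.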

=Choice of Terms=

One point of view on the matrix is that each column represents a document. In the vector space model, which is normally the one used when computing a term-document matrix, the goal is to represent the topic of a document by the frequency of its semantically significant terms. The terms are the semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the most significant syntactic categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
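Similarity between documents in the vector space model is commonly measured with the cosine of the angle between their vectors. The sketch below assumes the rows-are-terms layout described earlier and uses the counts of the two example documents; the helper names `column` and `cosine_similarity` are illustrative, and cosine is one common choice of measure rather than the only one.

```python
import math


def column(matrix, j):
    """Extract document j as a vector of term counts."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


counts = [
    [1, 1],  # I
    [1, 0],  # like
    [0, 2],  # hate
    [1, 1],  # databases
]
d1 = column(counts, 0)
d2 = column(counts, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))
```

The two documents share "I" and "databases", so the similarity is positive, but the differing verbs keep it well below 1.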

=Applications=

==Improving search results==

Latent semantic analysis (performing a singular value decomposition of the term-document matrix) can improve search results by disambiguating polysemous words and by matching synonyms of the query terms. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
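The synonym effect can be illustrated with a small sketch using NumPy's `numpy.linalg.svd`. The toy corpus, the choice of two latent dimensions and the variable names are all assumptions made for illustration; this is not a description of how any particular search engine implements latent semantic analysis.

```python
import numpy as np

# Hypothetical toy corpus: rows = terms, columns = documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],  # car
    [0, 1, 1, 0],  # automobile
    [1, 1, 1, 0],  # engine
    [0, 0, 0, 1],  # flower
    [0, 0, 0, 1],  # petal
], dtype=float)

# Truncated singular value decomposition: keep the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]


def to_latent(term_vector):
    """Project a term-count vector onto the k leading left singular vectors."""
    return U_k.T @ term_vector


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# The query contains only "car"; document 1 contains only
# "automobile" and "engine", so the two share no terms.
query = np.array([1, 0, 0, 0, 0], dtype=float)
doc1 = A[:, 1]
doc3 = A[:, 3]

raw_sim = cosine(query, doc1)
latent_sim = cosine(to_latent(query), to_latent(doc1))
latent_sim_flower = cosine(to_latent(query), to_latent(doc3))
print(raw_sim, latent_sim, latent_sim_flower)
```

In the raw term space the query and document 1 have similarity zero because they share no terms. After projection they become close, because "car" and "automobile" both co-occur with "engine" in the corpus, while the flower document stays unrelated to the query.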