One task of natural language processing is to compare documents for similarity. Under a few simplifying assumptions, document comparison is a good place to start and provides a useful benchmark for comparison with more sophisticated techniques.
So let's start by assuming that language is just a "bag of words." Specifically, let's assume that word order (i.e. grammar) is not important and let's assume that words do not have synonyms or antonyms. Under these assumptions, we can measure the similarity between two documents by counting the words in each document and comparing the counts with the cosine measure.
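As a rough sketch of the counting step, we can split a document into words and tally them in R (the variable name docA is just for illustration):

## split document A into words and count each one
docA <- "up up down"
table( strsplit( docA , " " )[[1]] )
## counts:  down = 1 , up = 2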
Before discussing the details of the cosine measure, let's start with a simple example. Suppose that document A contains the words: {up,up,down} and document B contains the words: {up,down,down}.
Since there are only two unique words in the documents (i.e. "up" and "down"), we can plot the documents as vectors in two-dimensional space (i.e. one dimension for each word).
In this simple example, the cosine of the angle between the two vectors, cos(θ), is our measure of the similarity between the two documents. In the example above, cos(37°) = 0.80.
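As a quick check, we can compute that value directly in R. Treating the word counts as vectors, document A is (2, 1) and document B is (1, 2), where the first component counts "up" and the second counts "down" (the names a and b are just for illustration):

## word-count vectors for the two documents
a <- c( up = 2 , down = 1 )
b <- c( up = 1 , down = 2 )

## cosine of the angle between them:  (2*1 + 1*2) / (sqrt(5) * sqrt(5)) = 0.80
sum( a * b ) / ( sqrt( sum( a^2 ) ) * sqrt( sum( b^2 ) ) )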
Note that if both vectors were the same (e.g. if both documents contained one "up" and one "down"), then the angle would be zero degrees and the cosine measure of similarity would be one (i.e. cos(0°) = 1.00).
Note also that if there were no similarity between the documents, then the vectors would meet at a right angle and the cosine measure of similarity would be zero (i.e. cos(90°) = 0.00).
In the more general case, where the documents contain many unique words, we can calculate the cosine measure as the dot product of the two vectors, A · B, divided by the product of the lengths of the two vectors, ||A|| · ||B||:
cos(θ) = ( A · B ) / ( ||A|| · ||B|| )
where the dot product is:
A · B = a₁·b₁ + a₂·b₂ + ... + aₙ·bₙ
where the length of a vector is:
||A|| = √( a₁² + a₂² + ... + aₙ² )
and where aᵢ is the number of times that word i occurs in document A and bᵢ is the number of times that it occurs in document B.
Because word counts must be non-negative, the cosine measure will always return a value between zero and one. The measure will be zero when the documents share no words and one when the documents contain the same words in the same proportions.
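A minimal sketch of those two boundary cases, using a small helper function (the name cosSim is just for illustration):

## cosine measure of two word-count vectors
cosSim <- function( a , b ) {
    sum( a * b ) / ( sqrt( sum( a^2 ) ) * sqrt( sum( b^2 ) ) )
}

## no words in common:  vectors meet at a right angle
cosSim( c( up = 3 , down = 0 ) , c( up = 0 , down = 2 ) )    ## 0

## same words in the same proportions:  angle of zero degrees
cosSim( c( up = 2 , down = 1 ) , c( up = 4 , down = 2 ) )    ## 1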
In practice, we will have many documents to compare, so we will need a matrix of cosine measures. For that purpose, we can write the following R function:
## compute cosine measure
mkCosine <- function( tdm ) {
    ## interpret term doc matrix as a matrix
    ## (rows are treated as documents; transpose first if terms are the rows)
    tdm <- as.matrix( tdm )
    ## compute dot product of each pair of documents
    dot <- tdm %*% t(tdm)
    ## cosine measure = dot product / product of vector lengths
    csim <- dot / sqrt( diag(dot) %*% t(diag(dot)) )
    csim
}
which uses the term-document matrix to compute the cosine measure for each pair of documents.
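For example, applying the function to the two documents from above (assuming each row of the matrix is a document, as in the function's comments):

## word counts for documents A and B (one row per document)
tdm <- rbind( A = c( up = 2 , down = 1 ) ,
              B = c( up = 1 , down = 2 ) )

## matrix of pairwise cosine measures
mkCosine( tdm )
## a 2 x 2 matrix:  1.0 on the diagonal, 0.8 off the diagonal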