Simplified Data Set
Assume terms a document are unique (tag-like dataset).
doc | terms |
---|---|
d0 | [ t0, t2, t3 ] |
d1 | [ t1, t3 ] |
d2 | [ t0, t2 ] |
and
docs = [ d0, d1, d2 ]
d0.terms = [ t0, t2, t3 ]
Weight of Terms
Represent weight of terms with a term-by-document Matrix, A,
where ti
denotes term i
and dj
denotes document j
.
Entry A(i, j)
is computed by TF-IDF
. For example:
A(0, 0) =
(1 / # of terms of d0) *
log2(# of docs / t0 occurrences for all docs)
= (1/3) * log2(3/2)
~= 0.195
Hence A of the dataset is
term\doc | d0 | d1 | d2 |
---|---|---|---|
t0 | 0.195 | 0 | 0.292 |
t1 | 0 | 0.792 | 0 |
t2 | 0.195 | 0 | 0.292 |
t3 | 0.195 | 0.292 | 0 |
Note: t0 … t3 can be viewed as a multi-dimension space such that d0
is a point
(0.195, 0, 0.195, 0.195)
of the space.
Document Similarity
Per the weight matrix we can compute document-to-document similarities matrix, S, by cosine-similarity. For example:
intersect(d0.terms, d0.terms) = [ t3 ]
S(0, 1) =
sum([ A(i, 0) * A(i, 1)) | ti is shared by d0 and d1 ]) /
[ sqrt(sum( A(i, 0)^2 ) for all i) +
sqrt(sum( A(i, 1)^2 ) for all i) ]
= [ A(3,0) * A(3, 1) ] / [
sqrt(0.195^2 + 0^2 + 0.195^2 + 0.195^2) +
sqrt(0^2 + 0.792^2 + 0^2 + 0.292^2)
]
= 0.199
doc\doc | d0 | d1 | d2 |
---|---|---|---|
d0 | 1 | 0.199 | 0.81 |
d1 | 1 | 0.0 | |
d2 | 1 |
As one can see, documents have more terms in common will have higher similarity score.
e.g. S(0, 1) = 0.199 represents 1 common term. S(0, 2) = 0.81 represents 2 common terms.
Nearest Neighbor
Given a new document, d3
, and its term list [t2, t3]
, we can compute the nearest neighbor of it.
First we compute weight of terms of d3
, which is
term\doc | d3 w/upd | d3 wo/upd |
---|---|---|
t0 | 0 | 0 |
t1 | 0 | 0 |
t2 | 0.292 | 0.792 |
t3 | 0.499 | 0.792 |
Note: Among calculation of weight, only log2(# of docs / term occurrences for all docs)
need to be updated. In practice, within large enough dataset, I think it should be fine to eliminate the updating. Above table shows both weight values with/without updating.
Then we compute similarity score for d3
to other documents.
doc\doc | d0 | d1 | d2 |
---|---|---|---|
d3 w/upd | 0.789 | 0.299 | 0.5 |
d3 wo/upd | 0.816 | 0.245 | 0.5 |
The nearest neighbor, i.e. 1-NN, of d3
is d0
.
Written with StackEdit.
留言
張貼留言