KNN Text Classification

Simplified Data Set

Assume terms a document are unique (tag-like dataset).

doc	terms
d0	[ t0, t2, t3 ]
d1	[ t1, t3 ]
d2	[ t0, t2 ]

and

docs = [ d0, d1, d2 ]
d0.terms = [ t0, t2, t3 ]

Weight of Terms

Represent weight of terms with a term-by-document Matrix, A,
where ti denotes term i and dj denotes document j.
Entry A(i, j) is computed by TF-IDF. For example:

A(0, 0) = 
	(1 / # of terms of d0) * 
	log2(# of docs / t0 occurrences for all docs)
	= (1/3) * log2(3/2)
	~= 0.195

Hence A of the dataset is

term\doc	d0	d1	d2
t0	0.195	0	0.292
t1	0	0.792	0
t2	0.195	0	0.292
t3	0.195	0.292	0

Note: t0 … t3 can be viewed as a multi-dimension space such that d0 is a point
(0.195, 0, 0.195, 0.195) of the space.

Document Similarity

Per the weight matrix we can compute document-to-document similarities matrix, S, by cosine-similarity. For example:

intersect(d0.terms, d0.terms) = [ t3 ]

S(0, 1) = 
    sum([ A(i, 0) * A(i, 1)) | ti is shared by d0 and d1 ]) /
    [ sqrt(sum( A(i, 0)^2 ) for all i) + 
      sqrt(sum( A(i, 1)^2 ) for all i) ]
      
    = [ A(3,0) * A(3, 1) ]  / [
	    sqrt(0.195^2 + 0^2 + 0.195^2 + 0.195^2) +
	    sqrt(0^2 + 0.792^2 + 0^2 + 0.292^2)
	    ]
	= 0.199

doc\doc	d0	d1	d2
d0	1	0.199	0.81
d1		1	0.0
d2			1

As one can see, documents have more terms in common will have higher similarity score.
e.g. S(0, 1) = 0.199 represents 1 common term. S(0, 2) = 0.81 represents 2 common terms.

Nearest Neighbor

Given a new document, d3, and its term list [t2, t3], we can compute the nearest neighbor of it.
First we compute weight of terms of d3, which is

term\doc	d3 w/upd	d3 wo/upd
t0	0	0
t1	0	0
t2	0.292	0.792
t3	0.499	0.792

Note: Among calculation of weight, only log2(# of docs / term occurrences for all docs) need to be updated. In practice, within large enough dataset, I think it should be fine to eliminate the updating. Above table shows both weight values with/without updating.

Then we compute similarity score for d3 to other documents.

doc\doc	d0	d1	d2
d3 w/upd	0.789	0.299	0.5
d3 wo/upd	0.816	0.245	0.5

The nearest neighbor, i.e. 1-NN, of d3 is d0.

Written with StackEdit.

Acerlog 捨漏格

搜尋此網誌