KNN Text Classification

Simplified Data Set

Assume the terms of a document are unique (a tag-like dataset).

doc   terms
d0    [ t0, t2, t3 ]
d1    [ t1, t3 ]
d2    [ t0, t2 ]

and

docs = [ d0, d1, d2 ]
d0.terms = [ t0, t2, t3 ]

Weight of Terms

Represent the weights of terms with a term-by-document matrix, A, where ti denotes term i and dj denotes document j. Entry A(i, j) is computed by TF-IDF. Because terms are unique within a document, the term frequency of each term in d0 is 1 / (# of terms of d0). For example:

A(0, 0) = (1 / # of terms of d0) * log2(# of docs / # of occurrences of t0 across all docs)
        = (1/3) * log2(3/2)
        ~= 0.195

Hence A for the dataset is

term\doc   d0      d1      d2
t0         0.195   0       0.292
t1         0       0.792   0
t2         0.195   0       0.292
t3         0.195   0.292   0

Note: t0 … t3 can be viewed as the axes of a multi-dimensional space in which d0 is the point (0.195, 0, 0.195, 0.195).

Document Similarity

From the weight matrix we can compute a document-to-document similarity matrix, S, by cosine similarity. For example:

intersect(d0.terms, d0.terms) = ...
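
The weighting and similarity steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: the names `tf_idf`, `column`, and `cosine` are hypothetical, and `cosine` uses the textbook formula over whole document vectors, which may differ in form from the intersect-based computation that the example above begins to show.

```python
import math

# Toy dataset from the text: each document is a list of unique terms.
docs = {
    "d0": ["t0", "t2", "t3"],
    "d1": ["t1", "t3"],
    "d2": ["t0", "t2"],
}
doc_names = list(docs)                                      # ['d0', 'd1', 'd2']
terms = sorted({t for ts in docs.values() for t in ts})     # ['t0', 't1', 't2', 't3']


def tf_idf(term, doc_name):
    """TF-IDF weight of `term` in `doc_name` (hypothetical helper)."""
    doc_terms = docs[doc_name]
    if term not in doc_terms:
        return 0.0
    tf = 1 / len(doc_terms)                                 # terms are unique per document
    df = sum(1 for ts in docs.values() if term in ts)       # documents containing the term
    return tf * math.log2(len(docs) / df)


# Term-by-document matrix A: rows are terms, columns are documents.
A = [[tf_idf(t, d) for d in doc_names] for t in terms]


def column(j):
    """Document j as a vector in term space (column j of A)."""
    return [row[j] for row in A]


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Document-to-document similarity matrix S.
n = len(doc_names)
S = [[cosine(column(i), column(j)) for j in range(n)] for i in range(n)]

for t, row in zip(terms, A):
    print(t, [round(w, 3) for w in row])
for d, row in zip(doc_names, S):
    print(d, [round(s, 3) for s in row])
```

The first loop prints the rows of A, which should match the table above after rounding. In S, each document has similarity 1 with itself, and d1 and d2 have similarity 0 because they share no terms.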