again, don't forget:
and, on the k-means clustering front, this might be nice: https://code.google.com/p/ekmeans/
and the idea of named entity recognition... I need to implement this: http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo
and, of course, all of these: http://en.wikipedia.org/wiki/Category:String_similarity_measures
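
for the string similarity part, a rough starting sketch of two measures from that category (Levenshtein edit distance and Jaccard similarity). the function names and the choice of character bigrams for Jaccard are my own assumptions for illustration, not taken from any of the links above:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string as b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def bigrams(s: str) -> set:
    """Set of overlapping character pairs in s (an assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' bigram sets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(x & y) / len(x | y)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaccard("night", "nacht"))
```

Levenshtein gives a distance (lower is closer) and Jaccard gives a similarity (higher is closer), so they'd need to be put on a common scale before comparing or combining them.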