I recently created a project on GitHub called wiki-sim-search where I used gensim to perform concept searches on English Wikipedia.
I recently needed to run a benchmarking experiment with k-NN in C++, and found mlpack, which appears to be a popular, high-performance machine learning library for C++.
In part 2 of the word2vec tutorial (here’s part 1), I’ll cover a few additional modifications to the basic skip-gram model that are important for making it feasible to train.
DBSCAN is a popular clustering algorithm which is fundamentally very different from k-means.
In this post I’m sharing a technique I’ve found for showing which words in a piece of text contribute most to its similarity with another piece of text when using Latent Semantic Indexing (LSI) to represent the two documents. This has proven valuable to me in debugging bad search results from “concept search” using LSI. You’ll find the equations for the technique as well as example Python code.
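To give a rough idea of what such a breakdown can look like, here is a minimal sketch in NumPy. It is not the code from the post: it assumes each document is projected into LSI space by a plain linear map `U` (any scaling by singular values would be folded into `U`), and the names `word_contributions`, `x1`, `x2`, and `U` are made up for illustration. Because the projection is linear, the cosine similarity splits into one term per word of the first document.

```python
import numpy as np

def word_contributions(x1, x2, U):
    """Split the LSI cosine similarity of two documents into per-word terms.

    x1, x2: term-weight vectors (e.g. tf-idf), shape (n_terms,)
    U:      assumed term-to-topic projection matrix, shape (n_terms, n_topics)
    """
    z1 = x1 @ U  # document 1 in LSI space
    z2 = x2 @ U  # document 2 in LSI space
    norm = np.linalg.norm(z1) * np.linalg.norm(z2)
    # z1 is the sum over terms of x1[i] * U[i], so the dot product z1 . z2
    # separates into one contribution per word of document 1.
    contrib = x1 * (U @ z2) / norm
    similarity = float(z1 @ z2 / norm)
    return contrib, similarity

# Toy example: 6 vocabulary terms, 2 topics, random projection (hypothetical data).
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 2))
x1 = np.array([1.0, 0.0, 2.0, 0.0, 0.5, 0.0])
x2 = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 3.0])

contrib, sim = word_contributions(x1, x2, U)
# Most positive contributors first.
ranking = np.argsort(contrib)[::-1]
```

The per-word contributions sum to the overall similarity, and words absent from the first document contribute zero, so sorting `contrib` shows which words pull the two documents together and which push them apart.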