Concept Search on Wikipedia
I recently created a project on GitHub called wiki-sim-search where I used gensim to perform concept searches on English Wikipedia.
gensim includes a script,
make_wikicorpus.py, which converts all of Wikipedia into vectors. They’ve also got a nice tutorial on using it here.
I started from this gensim script and modified it heavily to comment and organize it, and to gain more insight into each of the steps. I also added a few steps, like training and applying the LSI model, and performing similarity searches by article title.
What it takes
Building a corpus from Wikipedia is a pretty lengthy process: about 12 hours from Wikipedia dump to a set of LSI vectors. It also uses a lot of hard disk space, as you can imagine.
Here’s a breakdown of the steps involved in my version of
make_wikicorpus.py, along with the files generated and their sizes.
These numbers are from running on my desktop PC, which has an Intel Core i7 4770, 16GB of RAM, and an SSD.
| # | Step | Time (h:m) | Output File | File Size |
|---|------|------------|-------------|-----------|
| 0 | Download Wikipedia dump | -- | enwiki-latest-pages-articles.xml.bz2 | 12.6 GB |
| 1 | Parse Wikipedia & build dictionary | 3:12 | dictionary.txt.bz2 | 769 KB |
| 2 | Convert articles to bag-of-words vectors | 3:32 | bow.mm | 9.44 GB |
| 2a | Store article titles | -- | bow.mm.metadata.cpickle | 152 MB |
| 3 | Learn tf-idf model from document statistics | 0:47 | tfidf.tfidf_model | 4.01 MB |
| 4 | Convert articles to tf-idf | 1:40 | corpus_tfidf.mm | 17.9 GB |
| 5 | Learn LSI model with 300 topics | 2:07 | lsi.lsi_model | 3.46 MB |
| 6 | Convert articles to LSI | 0:58 | lsi_index.mm | 1 KB |
The original gensim script stops after step 4.
Notice that it parses Wikipedia twice: once in step 1 and again in step 2. The alternative would be to write the extracted tokens out to another file on the first pass (there’s no way you could keep them all in memory). If you want to save a little time, I included my compiled dictionary in the project, so you can skip step 1 entirely.
Insights into Wikipedia
It’s fun to look at the statistics on Wikipedia.
These statistics are from the Wikipedia dump that I pulled down on 1/18/17.
| Count | Description |
|-------|-------------|
| 17,180,273 | Total number of articles (without any filtering) |
| 4,198,780 | Number of articles after filtering out "article redirects" and "short stubs" |
| 2,355,066,808 | Total number of tokens in all articles (without any filtering) |
| 2,292,505,314 | Total number of tokens after filtering articles |
| 8,746,676 | Total number of unique words found in all articles (*after* filtering articles) |
- There are ~4.2M Wikipedia articles with real content.
- There are ~2.3B words across all of these articles, which means the average article is about 546 words long (2,292,505,314 tokens / 4,198,780 articles).
- There are 8.75M unique words in Wikipedia.
Similarity search
Once you have the LSI vectors, you can search Wikipedia to find the most similar articles to a specified article.
As a fun example, I searched for the top 10 articles conceptually similar to Topic model.
The results looked pretty reasonable to me. Interestingly, I didn’t know about most of these related articles; I had never heard of “Online content analysis”, and wouldn’t have thought to look at “Semantic similarity” while researching topic modeling. A concept search like this seems pretty helpful for researching a topic.
gensim Web App
The gensim guys turned this Wikipedia concept-search functionality into a cool little web app here.
It uses an approximate nearest neighbor library called Annoy to deliver fast results.
If you search for the article ‘Topic model’ using their web app, though, the results don’t seem nearly as good.
Top words per topic
It’s interesting to look at the top words per topic. You can see the top 10 words for each of the 300 learned topics here.
Some of them make sense:
- Topic #37: democratic, republican, trump, hillary, racing, airport, pt, huckabee, obama, clinton
- Topic #51: ef, tornado, tropical, airport, church, damage, utc, storm, url, fc
- Topic #114: forests, stm, broadleaf, shrublands, estero, subtropical, palearctic, grasslands, moist, utc
But I’d say most of the topics are pretty confusing. There are a lot of words that show up which don’t seem like they should be nearly so important. For example:
- ‘saleen’ (a car manufacturer) appears in 15 topics
- ‘astragalus’ (a genus of plants) appears in 7 topics
- ‘southampton’ (a city in England) appears in 15 topics
There are also a lot of short words that look like HTML or styling tags that are somehow not getting parsed out, such as ‘http’, ‘ft’, ‘bgcolor’, and ‘alt’.
I’m hoping to dig into the parsing code a bit and see if I can’t improve the results.
Memory requirements
Performing similarity searches on Wikipedia gets interesting because of the sheer size of the matrix.
I used LSI with 300 topics, so the matrix is 4,198,780 articles by 300 features. Stored as 32-bit floats (the default dtype for gensim’s MatrixSimilarity), that works out to roughly 5 GB. That’s a lot of RAM!
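The back-of-the-envelope arithmetic, assuming the float32 dtype that MatrixSimilarity defaults to and the filtered article count from the statistics above:

```python
# 4,198,780 filtered articles x 300 LSI features, 4 bytes per float32 value.
n_articles = 4_198_780
n_features = 300
bytes_per_value = 4

total_bytes = n_articles * n_features * bytes_per_value
print(total_bytes / 1024**3)  # ~4.7 GiB
```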
For my implementation, I used
gensim.similarities.MatrixSimilarity, which loads the entire set of LSI vectors into memory.
However, gensim also includes a class
gensim.similarities.Similarity that allows you to “shard” the matrix so that you can keep it on disk and process it in chunks.
It would be interesting to try this and compare search performance.