Concept Search on Wikipedia

22 Feb 2017

I recently created a project on GitHub called wiki-sim-search where I used gensim to perform concept searches on English Wikipedia.

gensim includes a script, make_wikicorpus.py, which converts all of Wikipedia into vectors. They’ve also got a nice tutorial on using it here.

I started from this gensim script and modified it heavily to comment and organize it, and achieve some more insight into each of the steps. I also included a few additional steps, like training and applying the LSI model, and performing similarity searches by article title.

What it takes

Building a corpus from Wikipedia is a pretty lengthy process–about 12 hours from Wikipedia dump to a set of LSI vectors. It also uses a lot of hard disk space, as you can imagine.

Here’s a breakdown of the steps involved in my version of make_wikicorpus.py, along with the files generated and their sizes.

These numbers are from running on my desktop PC, which has an Intel Core i7 4770, 16GB of RAM, and an SSD.

#	Step	Time (h:m)	Output File	File Size
0	Download Wikipedia Dump	--	enwiki-latest-pages-articles.xml.bz2	12.6 GB
1	Parse Wikipedia & Build Dictionary	3:12	dictionary.txt.bz2	769 KB
2	Convert articles to bag-of-words vectors	3:32	bow.mm	9.44 GB
2a.	Store article titles	--	bow.mm.metadata.cpickle	152 MB
3	Learn tf-idf model from document statistics	0:47	tfidf.tfidf_model	4.01 MB
4	Convert articles to tf-idf	1:40	corpus_tfidf.mm	17.9 GB
5	Learn LSI model with 300 topics	2:07	lsi.lsi_model	3.46 MB
			lsi.lsi_model.projection	3 KB
			lsi.lsi_model.projection.u.npy	228 MB
6	Convert articles to LSI	0:58	lsi_index.mm	1 KB
			lsi_index.mm.index.npy	4.69 GB
	TOTALS	12:16		45 GB

The original gensim script stops after step 4.

Notice how it parses Wikipedia twice–in steps 1 and steps 2. The alternative would be that, on the first pass, you write out the extracted tokens to another file (there’s no way you could keep them all in memory). If you want to save a little bit of time, I included my compiled dictionary in the project, so that you can skip over step 1.

Insights into Wikipedia

It’s fun to look at the statistics on Wikipedia.

These statistics are from the Wikipedia dump that I pulled down on 1/18/17.

17,180,273	Total number of articles (without any filtering)
4,198,780	Number of articles after filtering out "article redirects" and "short stubs"
2,355,066,808	Total number of tokens in all articles (without any filtering)
2,292,505,314	Total number of tokens after filtering articles
8,746,676	Total number of unique words found in all articles (after filtering articles)

In summary:

There are ~4.2M Wikipedia articles with real content.
There are ~2.3B words across all of these articles, which means the average article length is 762 words.
There are 8.75M unique words in Wikipedia.

Similarity Search

Once you have the LSI vectors, you can search wikipedia to find the most similar articles to a specified article.

As a fun example, I searched for the top 10 articles conceptually similar to Topic model.

The results look pretty reasonable to me:

Most similar documents:
  0.90    Online content analysis
  0.90    Semantic similarity
  0.89    Information retrieval
  0.89    Data-oriented parsing
  0.89    Concept search
  0.89    Object-role modeling
  0.89    Software analysis pattern
  0.88    Content analysis
  0.88    Adaptive hypermedia
  0.88    Model-driven architecture

It’s interesting, I didn’t know about most of these related articles. I had never heard of “Online content analysis”, and wouldn’t have thought to look at “Semantic similarity” in researching topic modeling. A concept search like this seems pretty helpful for researching topics.

gensim Web App

The gensim guys turned this concept search of Wikipedia functionality into a cool little web app here.

It uses an approximate nearest neighbor library called Annoy to deliver fast results.

If you search by the article ‘Topic model’ using their web app, the results don’t seem nearly as good:

    Topic model
    Knowledge modeling
    Technology acceptance model
    Anchor Modeling
    Articulatory synthesis
    Technology adoption lifecycle
    MMT (Eclipse)
    Resource Description Framework
    Geotechnical centrifuge modeling
    Trans-Proteomic Pipeline

Memory Concerns

Performing similarity searches on Wikipedia gets interesting because of the matrix size.

I used LSI with 300 topics, so the matrix is:

([4.2E6 articles] x [300 features/article] x [4 bytes/feature]) / [2^30 bytes/GB] = 4.69GB

That’s a lot of RAM!

For my implementation, I used gensim.similarities.MatrixSimilarity, which loads the entire set of LSI vectors into memory.

However, gensim also includes a class gensim.similarities.Similarity that allows you to “shard” the matrix so that you can keep it on disk and process it in chunks.

It would be interesting to try this and compare search performance.

Chris McCormick Patterns & Messages Archive