Getting Started with mlpack
I recently needed to run a benchmarking experiment with k-NN in C++, and I found mlpack, which appears to be a popular, high-performance machine learning library for C++.
I’m not a very strong Linux user (though I’m working on it!), so I actually had a lot of trouble getting up and going with mlpack, despite their documentation.
In this guide, I’ll cover the steps needed to get up and running, but also offer some explanation of what’s going on in each. So if you’re already an expert at working with C++ libraries in Linux, you may find this post pretty boring :).
Downloading pre-compiled mlpack with the package manager
I’m currently running Ubuntu 16.04, so some of this may be Ubuntu-specific.
The Ubuntu package manager helps you get mlpack installed as well as any dependencies (and it appears that mlpack has a lot of dependencies on, e.g., other linear algebra libraries).
The name of the package is ‘libmlpack-dev’. This installs the mlpack libraries and header files for you. It does not include the source code for mlpack, which you don’t need if you’re just planning to reference the libraries. It also does not include any source examples. They provide a couple of code examples as text on their website; to run these you would create your own .cpp file and paste in the code (but you also need to supply your own dataset! 0_o). More on example code later.
I found the package name a little confusing (why isn’t it just “mlpack”?), so here are some clarifications on the “lib” and “-dev” parts of the package name:
- Dynamically-linked libraries like mlpack all have ‘lib’ prepended to their package name (like liblapack, libblas, etc.).
- When you link with -l<name>, the linker in Linux (called “ld”) looks for a shared library file named lib<name>.so, which is why these libraries all start with “lib” (Reference).
- “.so” stands for “Shared Object”, and it’s analogous to DLLs on Windows.
- The “-dev” suffix on the package name is a convention that indicates that this package contains libraries and header files that you can compile against, as opposed to just executable binaries. (Reference)
Another thing that confused me–how would you know the name of the package you want to install if all you know is that the project is called “mlpack”?
This page provides a nice tutorial (with a lot of detail) about how you can find packages and learn about them. Here’s the command that I would have found most helpful, though:
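apt-cache search 'mlpack'
The single quotes are not wildcards; they are ordinary shell quoting and are optional here. apt-cache search treats its argument as a regular expression and lists every package whose name or description contains a match, so it finds any package with mlpack anywhere in the name.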
Here’s what each of those packages provides.
- libmlpack-dev - If you are going to write code that references the mlpack libraries, this is what you need.
- libmlpack2 - If you’re not programming with mlpack, but you’re using an application that uses the mlpack libraries, then you’d just need this package with the “runtime library” (the dynamically-linked library).
- mlpack-bin - The mlpack project actually includes command line tool versions of some of the machine learning algorithms it implements. So, for example, you could run k-means clustering on a dataset from the command line without doing any programming. This package contains those binaries.
- mlpack-doc - Documentation for the libraries.
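To see why searching for mlpack turns up names like libmlpack-dev, here is a small sketch of unanchored regular-expression matching using std::regex. It only imitates the name-matching part of the search; it is not how apt-cache is implemented. The function name searchNames and the list kPackages are made up for illustration, and the last two package names are there only for contrast.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Return the names that contain a match for `pattern` anywhere in the string.
// std::regex_search (unlike std::regex_match) does not require the pattern
// to cover the whole string, so "mlpack" also matches "libmlpack-dev".
std::vector<std::string> searchNames(const std::vector<std::string>& names,
                                     const std::string& pattern)
{
  assert(!pattern.empty());  // an empty pattern would match every name
  const std::regex re(pattern);
  std::vector<std::string> hits;
  for (const std::string& name : names)
  {
    if (std::regex_search(name, re))
      hits.push_back(name);
  }
  return hits;
}

// The four mlpack packages described in this post, plus two unrelated
// names (illustrative only) to show what does not match.
const std::vector<std::string> kPackages = {
    "libmlpack-dev", "libmlpack2", "mlpack-bin", "mlpack-doc",
    "libarmadillo-dev", "libboost-all-dev"};
```

Calling searchNames(kPackages, "mlpack") returns the four mlpack packages, while an anchored pattern such as "^mlpack" returns only mlpack-bin and mlpack-doc.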
So to write our own code using the mlpack libraries, we just need libmlpack-dev. Grab it with the APT (Advanced Packaging Tool) package manager with the following command:
sudo apt-get install libmlpack-dev
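This will install mlpack and all of the libraries it depends on. Except one, apparently: you’ll also need to install Boost. On Ubuntu, the libboost-all-dev package pulls in all of the Boost development libraries:
sudo apt-get install libboost-all-dev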
Maybe Boost was left out of the dependency list because it’s so commonly used? I don’t know.
Something that left me pretty confused from the installation was that I had no idea where mlpack was installed to. (Mainly, I wanted to know this because I assumed it would have installed some example source files for me somewhere, but I learned later that it doesn’t include any.)
To list out all of the files installed by the mlpack package, use this command:
dpkg -L libmlpack-dev
There are some default locations for libraries in Linux, and that’s where you’ll find mlpack:
- It installs lots of headers under /usr/include/mlpack/.
- It installs the library file to /usr/lib/x86_64-linux-gnu/libmlpack.so
These default locations are already on the search paths that gcc / g++ use for headers and libraries, so you’re all set to #include the mlpack header files in your code!
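If you want to confirm that the headers really are visible to the compiler before writing any mlpack code, a sketch like the following can help. It assumes the header you care about is mlpack/core.hpp (my assumption, based on the /usr/include/mlpack/ location above) and relies on __has_include, which is standard in C++17 and which GCC has offered as an extension in earlier modes since version 5, as far as I know. The macro MLPACK_HEADER_STATUS and the function describeHeaderStatus are names I made up.

```cpp
#include <cassert>
#include <string>

// Ask the preprocessor whether the header can be found on the include path.
// Where __has_include is unavailable, fall back to "unknown" (-1).
#ifdef __has_include
#  if __has_include(<mlpack/core.hpp>)
#    define MLPACK_HEADER_STATUS 1   // header found
#  else
#    define MLPACK_HEADER_STATUS 0   // header not found
#  endif
#else
#  define MLPACK_HEADER_STATUS (-1)  // compiler cannot tell us
#endif

// Turn the compile-time status into a readable message.
std::string describeHeaderStatus(int status)
{
  assert(status >= -1 && status <= 1);  // only the three values defined above
  if (status == 1) return "mlpack/core.hpp found on the include path";
  if (status == 0) return "mlpack/core.hpp NOT found on the include path";
  return "this compiler does not support __has_include";
}
```

Printing describeHeaderStatus(MLPACK_HEADER_STATUS) from a main function tells you what the compiler saw. This only checks that the header can be found; it says nothing about whether linking against the library will succeed.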
Compiling and Running an Example
As a first example, we’ll use the sample code from the mlpack site for doing a nearest neighbor search.
This very simple example takes a dataset of vectors and finds the nearest neighbor for each data point. It uses the dataset both as the reference and the query vectors.
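Before getting to the mlpack code, it may help to see what that search computes. The following is a brute-force sketch in plain C++ with no mlpack involved. It is not the mlpack sample and not how mlpack does it internally. The Manhattan (L1) distance is my assumption, since the metric is not stated here, and the names Point, Neighbor, manhattan and nearestNeighbors are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

struct Neighbor
{
  std::size_t index;  // index of the closest other point
  double distance;    // distance to that point
};

// Manhattan (L1) distance: sum of absolute coordinate differences.
double manhattan(const Point& a, const Point& b)
{
  assert(a.size() == b.size());  // both points need the same dimension
  double sum = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d)
    sum += std::abs(a[d] - b[d]);
  return sum;
}

// The same set is used as reference and query: for every point, scan all
// OTHER points and keep the closest one. Ties go to the lowest index; a
// tree-based search may break ties differently.
std::vector<Neighbor> nearestNeighbors(const std::vector<Point>& points)
{
  assert(points.size() >= 2);  // every point needs at least one other point
  std::vector<Neighbor> result;
  for (std::size_t i = 0; i < points.size(); ++i)
  {
    Neighbor best{i, std::numeric_limits<double>::infinity()};
    for (std::size_t j = 0; j < points.size(); ++j)
    {
      if (j == i) continue;  // a point is not its own neighbor
      const double dist = manhattan(points[i], points[j]);
      if (dist < best.distance) best = Neighbor{j, dist};
    }
    result.push_back(best);
  }
  return result;
}
```

For the hypothetical points (0,0), (1,0), (5,5) and (5,7), this pairs point 0 with point 1 at distance 1 and point 2 with point 3 at distance 2. The scan compares every pair of points, which is fine for a toy dataset but is exactly the cost a library like mlpack is meant to avoid on larger ones.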
I’ve taken their original example and added some detailed comments to explain what’s going on.
Save the following source code in a file called knn_example.cpp:
And save this toy dataset as data.csv:
To compile the example, you’ll use g++ (the C++ equivalent of gcc):
g++ knn_example.cpp -o knn_example -std=c++11 -larmadillo -lmlpack -lboost_serialization
- knn_example.cpp - The code to compile.
- -o knn_example - The binary (executable) to output.
- -std=c++11 - mlpack documentation says you need to set this flag.
- -larmadillo -lmlpack -lboost_serialization - The “-l” flag tells the linker to look for these libraries.
- armadillo is a linear algebra library that mlpack uses.
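The link between those “-l” flags and the “lib” naming convention from earlier can be spelled out in a few lines. This sketch just does the string mapping; sharedObjectName is a made-up helper, and it ignores details such as version suffixes and the fallback to static lib<name>.a archives.

```cpp
#include <cassert>
#include <string>

// Map a linker flag such as "-lmlpack" to the shared-object file name the
// linker looks for ("libmlpack.so").
std::string sharedObjectName(const std::string& flag)
{
  // The flag must start with "-l" and carry a library name after it.
  assert(flag.size() > 2 && flag.compare(0, 2, "-l") == 0);
  return "lib" + flag.substr(2) + ".so";
}
```

So sharedObjectName("-lmlpack") gives libmlpack.so, which matches the library file the package installed under /usr/lib/x86_64-linux-gnu/.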
Finally, to run the example, execute the binary:
./knn_example
And you should see the following output:
Nearest neighbor of point 0 is point 7 and the distance is 1.
Nearest neighbor of point 1 is point 2 and the distance is 0.
Nearest neighbor of point 2 is point 1 and the distance is 0.
Nearest neighbor of point 3 is point 10 and the distance is 1.
Nearest neighbor of point 4 is point 11 and the distance is 1.
Nearest neighbor of point 5 is point 12 and the distance is 1.
Nearest neighbor of point 6 is point 12 and the distance is 1.
Nearest neighbor of point 7 is point 10 and the distance is 1.
Nearest neighbor of point 8 is point 9 and the distance is 0.
Nearest neighbor of point 9 is point 8 and the distance is 0.
Nearest neighbor of point 10 is point 9 and the distance is 1.
Nearest neighbor of point 11 is point 4 and the distance is 1.
Nearest neighbor of point 12 is point 9 and the distance is 1.
You’re up and running!