Designing libraries: Expanding pre-existing capabilities

Today I want to continue discussing the three categories of software I outlined in my last post (found here), and focus on libraries. Libraries are pieces of software that expand the pre-existing capabilities of a given programming language. Note the focus on the programming language itself, not on any particular task one might be writing software to do. This leads me to the number one principle of library design:

The users of a library are software developers.

Going back to the idea that you as the developer of a piece of software need to understand what your users need and need not know, we can outline some expectations for users of any given library.

They will know how to program in the target language. This is the most basic requirement for using a library. Libraries are not meant to completely abstract away from a language, but rather to augment the capabilities of that language.
They will know how to use a library to program their primary task. A software developer is building a piece of software, and you can safely assume that they will know how to go about creating that piece of software.
They will not know how to implement specific pieces of their program. For example, a developer might be writing some code to perform linear algebra. They might know what they need to do (e.g. multiply matrix A by matrix B), but they don’t necessarily know how to implement matrix multiplication efficiently.
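To make that last point concrete, here is a minimal sketch of what "knowing what to do" looks like in code. The names `Matrix` and `naive_multiply` are hypothetical and chosen only for illustration. This is the textbook triple loop: it gives the right answer, but it makes no attempt at the cache-aware or vectorized tricks that an optimized library would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias for illustration: a matrix stored as a vector of rows.
using Matrix = std::vector<std::vector<double>>;

// Textbook matrix multiplication, C = A * B.
// Correct, but deliberately unoptimized.
Matrix naive_multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();                     // rows of A
    const std::size_t k = b.size();                     // rows of B
    const std::size_t m = (k == 0) ? 0 : b[0].size();   // columns of B

    Matrix c(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        // Every row of A must have as many entries as B has rows.
        assert(a[i].size() == k);
        for (std::size_t j = 0; j < m; ++j) {
            for (std::size_t p = 0; p < k; ++p) {
                c[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    return c;
}
```

A developer can write this from the definition of matrix multiplication alone. Making it fast on large matrices is a separate body of knowledge, and that is the gap a library fills.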

So how do these points translate into design principles? In my experience, there are three main qualities that I ascribe to a well-designed library.

A well-designed library closely follows the general principles of the programming language it augments. This is so that it does not violate the expectations of the user. If I am working with a library, and it requires me to start using constructors to instantiate objects when the language I am working in doesn’t typically need that, it might throw me off in my use of it.
Libraries should be internally self-consistent. This is not a category of software meant to hold a bunch of miscellaneous functions, nor should there be inconsistencies in how the set of functions is used.
Libraries should have extensive documentation. Now granted, all software should be well documented, but to me, libraries need even more care with respect to documentation. When I am working on some software, I need to know how every function I am using performs. Undocumented side effects could cause serious issues down the line.

So let’s talk about an example of a well-designed library: Armadillo, a C++ library for matrix algebra. I use this library quite a bit, because it can be integrated into Rcpp. It makes for very fast, memory-efficient matrix algebra programs, which I write a lot of because I work with networks. Armadillo, for the most part, follows all of the design principles above.

Armadillo functions/objects are C++ functions/objects. Nothing fancy: they behave exactly as you would expect, and most of the fancy template meta-programming is hidden from the user.
Armadillo is very internally consistent. All objects are accessed the same way, and everything is clearly labeled in the same way. That makes it quite easy to work with.
It also has very extensive documentation, as befits a matrix algebra library.

Armadillo also makes user expectations very clear. Using it requires knowing how to program in C++ and how to do linear algebra. It simply provides implementations that are well optimized and simple to use. A great example of this is the sparse matrix class SpMat. Sparse matrices are representations of matrices with many zero entries, and a sparse representation is considerably more memory efficient than a dense one. It achieves this by storing only the locations and values of the non-zero entries, while a dense representation stores all values. Note that these are both representations of a matrix, which means that mathematically a matrix behaves the same whether it is represented as dense or sparse. But in terms of programming, a sparse matrix is much quicker to use in some applications. Armadillo implements this, and makes it simple to use for somebody like me, who knows linear algebra and C++ but not how to optimize matrix math.
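The difference between the two representations can be sketched with nothing but the C++ standard library. This is not Armadillo's code, and the names `DenseMatrix` and `SparseMatrix` are hypothetical. A real library would use a more compact and faster layout than the `std::map` assumed here. The only point is that both structures answer "what is the value at (i, j)?" identically while storing very different amounts of data.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Dense storage: every entry is kept, zeros included (row-major order).
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> values;

    DenseMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), values(r * c, 0.0) {}

    double get(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return values[i * cols + j];
    }
    void set(std::size_t i, std::size_t j, double v) {
        assert(i < rows && j < cols);
        values[i * cols + j] = v;
    }
    // Number of values held in memory: always rows * cols.
    std::size_t stored_values() const { return values.size(); }
};

// Sparse storage: only non-zero entries are kept, keyed by (row, column).
struct SparseMatrix {
    using Key = std::pair<std::size_t, std::size_t>;

    std::size_t rows, cols;
    std::map<Key, double> nonzeros;

    SparseMatrix(std::size_t r, std::size_t c) : rows(r), cols(c) {}

    double get(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        auto it = nonzeros.find(std::make_pair(i, j));
        // Anything not stored is an implicit zero.
        return it == nonzeros.end() ? 0.0 : it->second;
    }
    void set(std::size_t i, std::size_t j, double v) {
        assert(i < rows && j < cols);
        const Key key = std::make_pair(i, j);
        if (v == 0.0) {
            nonzeros.erase(key);  // zeros are never stored
        } else {
            nonzeros[key] = v;
        }
    }
    // Number of values held in memory: only the non-zero entries.
    std::size_t stored_values() const { return nonzeros.size(); }
};
```

For a 1000 by 1000 matrix with a single non-zero entry, the dense version holds a million doubles while the sparse version holds one entry, yet `get` returns the same value from both for every position.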

There are some aspects of Armadillo that are a bit odd from a design perspective. For example, why does the library have a function to fully model data using Gaussian Mixture Models? I am sure it works quite well, I just find it very odd to see in a matrix algebra library. Same goes for the fact that it has a k-means clustering function. Not necessarily a problem, but odd nonetheless.

If you are programming software in the social sciences, it is unlikely that you are going to be developing a library. Still, recognizing what separates a well-designed library from a poorly designed one helps me decide which libraries I am going to use for a task. One of the most frustrating parts of developing software is not when your code throws an error, but when you figure out that the library you thought could do X cannot actually do that thing.

Thanks for reading! During the holidays (and job app/interview season), I am planning on a two-week schedule for these posts. In two weeks I will finish this series with my final notes on designing modules, which to me represent the middle ground between applications and libraries. After that, I am going to talk about the why, how, and what of code profiling.

Cheers,

Teague