Thoughts on Design

Designing libraries: Expanding pre-existing capabilities

Today I want to continue discussing the three categories of software I outlined in my last post (found here), and focus on libraries. Libraries are pieces of software that serve to expand the pre-existing capabilities of a given programming language. Note the focus on the programming language itself, not on any given task one might be writing a piece of software to do. This leads me to the number one principle of designing libraries:

The users of a library are software developers.

Going back to the idea that you as the developer of a piece of software need to understand what your users need and need not know, we can outline some expectations for users of any given library.

  • They will know how to program in the target language. This is the most basic requirement for using a library. Libraries are not meant to completely abstract away from a language, but rather to augment the capabilities of that language.

  • They will know how to approach their primary task. A software developer is building a piece of software, and you can safely assume that they know how to go about creating it.

  • They will not know how to implement specific pieces of their program. For example, a developer might be writing some code to perform linear algebra. They might know what they need to do (e.g. multiply matrix A by matrix B), but they don’t necessarily know how to implement matrix multiplication efficiently.
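For instance, the straightforward implementation such a developer might write themselves is the textbook triple loop: correct, but far slower on large matrices than the cache-aware, vectorized routines a good linear algebra library supplies. A quick sketch in Python, purely for illustration:

```python
def matmul(A, B):
    """Naive matrix multiplication: the O(n^3) triple loop.

    Correct, but it ignores cache behavior and vectorization,
    exactly the kind of detail a linear algebra library
    optimizes so its users don't have to.
    """
    n, k, m = len(A), len(B), len(B[0])
    assert len(A[0]) == k, "inner dimensions must match"
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The library user knows they need A times B; the library's job is to make that call fast without asking them to learn why this version is slow.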

So how do these points translate into design principles? From my experience, there are three main qualities that I ascribe to a well designed library.

  • A well designed library closely follows the general principles of the programming language it augments. This avoids violating the expectations of the user. If I am working with a library and it requires me to start using constructors to instantiate objects when the language I am working in doesn't typically require that, it might throw me off in my use of it.

  • Libraries should be internally self consistent. A library is not meant to be a grab bag of miscellaneous functions, nor should there be inconsistencies in how its functions are used.

  • Libraries should have extensive documentation. Now granted, all software should be well documented, but to me, libraries need even more care with respect to documentation. When I am working on some software, I need to know how every function I am using performs. Undocumented side effects could cause serious issues down the line.
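To make the consistency point concrete, here is a hypothetical sketch (the function names are invented, not drawn from any real library) of what an internally self consistent set of functions looks like in practice: one naming scheme, one argument convention, documented behavior, no surprises between calls.

```python
# Hypothetical mini-library of summary statistics. Every function
# takes the data as its first argument, is named with a single
# lowercase noun, and returns a plain float.

def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

def variance(values, sample=True):
    """Variance of the values; sample=True uses the n-1 denominator."""
    m = mean(values)
    denom = len(values) - 1 if sample else len(values)
    return sum((v - m) ** 2 for v in values) / denom

def median(values):
    """Middle value of the sorted sequence (average of two if even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return float(s[mid]) if n % 2 else (s[mid - 1] + s[mid]) / 2

print(mean([1, 2, 3]), median([1, 2, 3, 4]))  # 2.0 2.5
```

If one of these instead took its data last, or was named `GetMedian`, or silently mutated its input, the user would have to re-learn the rules at every call site; that is the inconsistency the bullet above warns against.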

So let’s talk about an example of a well designed library: Armadillo, a C++ library for matrix algebra. I use this library quite a bit, because it can be integrated into Rcpp. It makes for very fast, memory-efficient programs when it comes to matrix algebra, which my work on networks has me writing a lot of. Armadillo, for the most part, follows all of the design principles above.

  • Armadillo functions/objects are C++ functions/objects. They behave exactly as you would expect, and the heavy template meta-programming is hidden from the user.

  • Armadillo is very internally consistent. All objects are accessed the same way, and everything is clearly labeled in the same way. That makes it quite easy to work with.

  • It also has very extensive documentation, as befitting a matrix algebra library.

Armadillo also makes user expectations very clear. Using it requires knowing how to program in C++ and how to do linear algebra; it simply provides implementations that are well optimized and simple to use. A great example of this is the sparse matrix class SpMat. Sparse matrices are representations of matrices with many zero entries, and a sparse matrix representation is considerably more memory efficient than a dense one. It achieves this by only storing the locations and values of non-zero entries, while a dense representation stores all values. Note that both are representations of a matrix, which means that mathematically, a matrix represented as dense or sparse behaves the same. But in terms of programming, a sparse matrix is much quicker to use in some applications. Armadillo implements this, and makes it simple for somebody like me, who knows linear algebra and C++ but not how to optimize matrix math, to use.
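The storage trick behind sparse matrices can be sketched in a few lines. This is an illustration of the idea only, not Armadillo's actual implementation (a minimal coordinate/dictionary format in Python):

```python
class SparseMatrix:
    """Minimal sparse matrix sketch: store only non-zero entries.

    A dense n x m matrix stores n * m values; this stores one
    (row, col) -> value pair per non-zero entry, which is the
    memory saving SpMat-style classes rely on. Illustrative only.
    """

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self.data = {}  # (row, col) -> value

    def __setitem__(self, idx, value):
        if value != 0:
            self.data[idx] = value
        else:
            self.data.pop(idx, None)  # writing a zero frees the entry

    def __getitem__(self, idx):
        return self.data.get(idx, 0.0)  # absent entries read as zero

    def nnz(self):
        """Number of stored non-zero entries."""
        return len(self.data)

# A 1000 x 1000 matrix with 3 non-zero entries stores 3 values,
# not 1,000,000; mathematically it is the same matrix either way.
m = SparseMatrix(1000, 1000)
m[0, 0] = 2.5
m[500, 250] = -1.0
m[999, 999] = 4.0
print(m.nnz(), m[500, 250], m[3, 7])  # 3 -1.0 0.0
```

A real implementation adds compressed storage formats and arithmetic, but the user-facing promise is the one above: index it like any matrix, pay only for the non-zeros.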

There are some aspects of Armadillo that are a bit odd from a design perspective. For example, why does the library have a function to fully model data using Gaussian Mixture Models? I am sure it works quite well, I just find it very odd to see in a matrix algebra library. Same goes for the fact that it has a k-means clustering function. Not necessarily a problem, but odd nonetheless.

If you are programming software in the social sciences, it is unlikely that you are going to be developing a library. Still, recognizing what separates a well designed library from a poorly designed one helps me decide which libraries to use for a task. One of the most frustrating parts of developing software is not when your own code has a bug, but when you discover that the library you thought could do X cannot actually do that thing.

Thanks for reading! During the holidays (and job app/interview season), I am planning on a 2 week schedule for these posts. In two weeks I will finish this series of posts with my final notes on designing modules, which to me represent the middle ground between applications and libraries. After that, I am going to talk about the why, how and what of code profiling.

Cheers,

Teague

A taxonomy of software: designing applications

Today I want to drill down a bit into last week’s post (found here) about software design and talk about the first of three very general categories of software one might develop in the social sciences. These categories are a) my own taxonomy and b) only very general, with many pieces of software of course falling into a mixture of categories. All that to say, I have found these distinctions helpful in understanding how to structure software:

  • Applications - These pieces of software are designed to perform a primary task or set of tasks, while minimizing the amount of secondary knowledge (e.g. programming, data management) required of its users. This comes at the cost of being relatively inflexible.

  • Libraries - These pieces of software extend the capabilities of an existing programming language in some way. They require high secondary and primary knowledge of the user. This allows libraries to be very flexible in their use.

  • Modules - A middle ground between applications and libraries, this type of software simplifies a primary task, reduces secondary knowledge cost, and allows for a great deal of flexibility. Often, this type of software is made to work with several other modules as well.

With those brief descriptions, I want to start by discussing the general design of applications.

Applications minimize secondary knowledge cost.

The category that I refer to as “Applications” refers to any piece of software that aims to a) perform a complete task and b) minimize what additional knowledge users need to know. This is best illustrated with some examples of what I do and don’t consider applications.

Applications:

  • SPSS is an obvious choice for the category of application. It handles all aspects of running statistics, and it abstracts away from the language it was written in, a combination of Java and likely C.

  • The R package lavaan I also consider an application. It aims to handle all aspects of running SEM models, and it abstracts away from R considerably. Besides data input and some very basic function calls, most of the work in using lavaan is setting up the model syntax.

Not Applications:

  • I wouldn’t consider the R package ggplot2 to be an application. It performs a specific task, yes, but it doesn’t abstract away from R sufficiently. Instead I would consider it a module.

  • The C++ library Armadillo (link) is definitely not an application, but rather I would consider this to be a library. It simply aims to extend the linear algebra capabilities of C++.

Designing the user interface for an application requires a great deal of careful consideration of what your user base is going to be, as you can have very little expectation as to the technical knowledge of a user. For example, SPSS is successful because it makes the act of running fairly complex statistical models a matter of navigating a set of graphical user interfaces (GUIs). This of course requires knowledge of the statistical models (at least in theory, if not in practice), but it doesn’t require any programming expertise. The only secondary knowledge it really requires is the ability to navigate GUIs.

Contrast this with base R’s statistical capabilities. I can easily run a regression in R in a single line of code that might take me several minutes of clicking through GUIs in SPSS. This, however, requires more knowledge. Not only do I need to know how to set up a regression, I need to understand R formulas, data input, and how to assign variable names to objects.

This “secondary knowledge cost” is what you are trying to minimize when you are writing a program. You can expect the user to know about what the program does (e.g. SPSS does statistics) and you are trying to minimize everything else the user needs to know (e.g. SPSS does not require object oriented programming).

Let me elaborate on this idea of secondary knowledge cost with a more personal example. I develop and maintain a Python package called clpipe (link). This “package” is really a set of command line functions for quickly processing neuroimaging data on high performance clusters. For those of you who aren’t neuroimagers, neuroimaging data requires extensive processing before analysis, and this processing is quite mathematically complex. People spend entire academic careers on processing, and many software programs have been developed to perform it. There were several issues that I felt warranted an additional piece of software:

  • To get neuroimaging data from scanner to analysis requires the use of several programs at a minimum, which in turn requires knowing how to use those programs (non-trivial, as neuroimaging software is not typically well designed).

  • Quite a bit of time is spent on data management when you are working with neuroimaging data. Ideally, this can be done using some sort of scripting language, but that requires knowledge of the scripting language.

  • Processing neuroimaging data takes quite a bit of time. Processing subjects in parallel on high performance clusters makes this much quicker, but that requires knowledge of how to use an HPC.

So, in sum, to process neuroimaging data you not only need to know about the actual processing, you have to understand the idiosyncrasies of several neuroimaging programs, know how to do data management and ideally understand how to use a high performance cluster.

My program, clpipe, attempts to lessen this secondary knowledge cost by automating many of those steps. I have written very little code that actually processes the data; that is covered by a variety of programs that clpipe interfaces with (FMRIPREP, dcm2bids). Instead, clpipe manages data and the submission of jobs to HPCs. All the secondary knowledge it requires is a working knowledge of navigating Linux filesystems (not unreasonable in neuroimaging) and a very basic understanding of how to format a couple of JSON files (configuration of the pipelines is done via JSON files). Of course, I made no attempt at lessening the primary knowledge cost. To use clpipe, you do need to know how to process neuroimaging data and the myriad choices involved.
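The general pattern (read a small JSON config, then generate the cluster submission commands so the user never writes job scripts) might look something like the sketch below. The config keys, script name, and sbatch flags here are invented for illustration; this is not clpipe's actual configuration format or code.

```python
import json
import shlex

# Hypothetical JSON config of the kind a clpipe-style tool asks
# its users to fill in. The keys are invented for this example.
config_text = """
{
  "subjects": ["sub-01", "sub-02"],
  "memory_gb": 16,
  "time_limit": "08:00:00"
}
"""

def build_job_commands(config):
    """Turn a JSON config into one SLURM submission per subject.

    The user supplies a few plain values; the tool handles the
    HPC-specific details (flags, quoting, one job per subject).
    """
    cfg = json.loads(config)
    commands = []
    for subject in cfg["subjects"]:
        cmd = (
            f"sbatch --mem={cfg['memory_gb']}G "
            f"--time={cfg['time_limit']} "
            f"process.sh {shlex.quote(subject)}"
        )
        commands.append(cmd)
    return commands

for cmd in build_job_commands(config_text):
    print(cmd)
```

The design point: the JSON file exposes only decisions the user already understands (which subjects, how much memory, how long), while the scheduler syntax, the secondary knowledge, stays inside the tool.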

So, stepping back, what makes a good application? To me, a good application minimizes what additional things you need to know to do your primary task. The cost however, is that a good application is not flexible. It makes what it does easy, but you are SOL if you want to do something outside of that specific task (try tricking SPSS into doing something outside of what it is explicitly designed to do). So how does this translate into design principles? Here are my thoughts:

  • Identify what the primary task of an application is. Imagine your user as somebody who knows everything about that task (e.g. they are an expert in regression) but has absolutely no knowledge of anything else (e.g. they have never programmed in their lives).

  • Given that theoretical user and the restrictions on your implementation, minimize what additional things the user needs to know. If you are writing an R package to do one specific type of analysis, you are going to be hard pressed to make a GUI, but you can minimize what the user needs to know about R to use your package (again, lavaan is an excellent example of this).

  • Make sure not to violate the expected flow of a given task. An application is not providing the tools to do a task, it is doing the task for the user.

  • Be very wary of designing an application so that it is easiest for you to use. I see this quite a bit, and fall victim to it quite a bit as well. By definition, if you are developing an application, you have far more secondary knowledge than the target user.

  • In a related vein, don’t underestimate how little secondary knowledge a user may have.

  • Finally, if you are developing an application, fully commit to that minimization of secondary knowledge. If you half-ass it, the resulting application will be much worse than if you decided to just develop a library or module. This is because you might be muddying user expectations of what they need to know. If you are honest with your users about expectations, that always makes for a better piece of software.

Designing an application as I defined it previously is quite a difficult task. When I started working on clpipe I was astonished how difficult it was getting to a point where users felt comfortable using it (they still don’t, but that is neither here nor there). This category is really the most design intensive of the three, because it is all about putting yourself in the place of a user who, by definition, doesn’t have the same level of knowledge you have. Think carefully, draft out your UX before you ever write a line of code, and have a number of beta testers!

Next week I will give some thoughts on how to think about developing libraries. These pieces of software are the opposite of an application, as they attempt to minimize primary knowledge cost at the price of requiring high secondary knowledge.

Cheers!

Teague