Everything in Its Place: A Guide to Data Audits

After a bit of a hiatus due to starting at University of Virginia, I’ve finally sat down and written the next post in my series on Data Management for Social Scientists. For those of you who missed the first two, you can check them out here and here! As always, this guide is based on my own experiences, and there are many ways to successfully and efficiently document a dataset. If I’ve missed anything, please feel free to let me know!


So, you have joined a new lab, started a new lab, or received a dataset from a collaborator, and you are looking forward to digging in. You quickly realize that the new lab or the new dataset doesn’t look anything like what you are used to, and you need to take time to better understand the data structure. This sounds like a good time to perform a Data Audit. Data auditing is a practice often used in corporate settings to evaluate the location, quality, and consistency of a company’s databases, with a particular eye to how the data are being used. In an academic research setting, the overall goals of a data audit remain the same:

  1. Determine where the data are. In many cases, this is a simple question to answer. If a collaborator sends you a single CSV file with their data, you probably have a good idea where those data are, but only if the data are complete, which brings us to our next goal.

  2. Determine if the data are complete. Studies, particularly in the social or biomedical sciences and particularly when dealing with human subjects, have extensive study design documentation (this is almost always a requirement for getting ethics approval for human subjects studies). This documentation tells you, the auditor, what should be in the data you were pointed to.

  3. Determine if the data can be used for their specified purpose. In most studies, data will be analyzed, and this typically requires them to be formatted in a particular way. If, for example, the study collected free-form responses as a collection of .txt documents, these are less amenable to quantitative analysis than if those responses had been collected into a single tabular data file.

  4. Determine if the data follow good data management practices. It is one thing to identify where the data are and whether the data are complete. In some cases, that portion of the data audit can be scripted. It is another thing entirely to determine whether the data follow good data management practices, or which data management principles the data structure violates.

The end goal of any audit is not to restructure the dataset. I want to repeat that: you, as the auditor, should not be changing how the data are managed. This even applies to heads of labs who want to perform their own data audit. If you change a data structure without full buy-in from the rest of the team, you will cause problems and might even make the data structure worse. Refactoring data is a distinct process, albeit one that is informed by the results of a data audit. The end goal of a data audit is the data audit report.


The Data Audit Report

A data audit report is a human readable document that describes the results of the data audit, identifies issues, and suggests a set of solutions. This is not scholarly work, and should be written as straightforwardly as possible. This is not a trivial requirement: many of you who have been asked to perform, or have planned, a data audit likely have more computer science/data management experience than your colleagues, and if you are not careful, you might use more technical terminology than is useful. Remember, the goal of a data audit is not to create a document for you to reference (though this is a major advantage), it is to create a document that anybody can use to understand the dataset in question. Take for example the following scenario:

Scenario:

In performing a data audit of a longitudinal study, you find that the data from multiple timepoints are stored in wide format .SAV files. This makes them difficult to access using open source data analysis tools, and the wide format of the data makes it difficult to perform longitudinal modeling. You want to propose converting the master copy of the dataset to long format, writing a script that when run will produce a wide format datafile, and changing the file type to a common delimited file type, like a CSV. In your report you write:

Solution:

Convert wide to long, create reverse conversion script in R, change file format to CSV.

This is informative language, and if you handed me a report with that as a solution, I would be able to understand it. But that requires knowledge of wide/long formats and why one would use them, why one would create a reverse conversion script rather than simply create an additional copy of the dataset, and why CSV is better than SAV as a file format. The solution to these issues is to divide the description of a solution from the implementation of said solution, and to add rationale to the solution:

Solution:

First, the dataset needs to be converted from wide format (rows are subjects, columns are variable/timepoint combinations) to long format (one row per subject/timepoint, with a single variable name column and a single value column for variables that differ over time), which would improve the ability of analysts to run longitudinal models on the dataset. However, as wide format is useful for computing summary statistics, a script needs to be created that will take the long format dataset and convert it to a wide format dataset whenever necessary. The long format dataset acts as the immutable raw data, and the wide format dataset can be reconstructed whenever necessary. Finally, the long raw datafile should be stored in a delimited text format, such as a .csv, and accompanied by a JSON codebook.

Implementation Details:

  • Conversion from wide to long in R (reshape/melt+cast)

  • Conversion script written as “sourceable” in R, hard coded to take long data

  • Conversion to CSV one-time non-automated via R and the foreign package

  • Codebook generated using R, filled in manually.

As you can see, while there is more writing, there are far more details, and the proposed solution can be evaluated by a non-technical researcher. The implementation details act as a guide for a technical researcher, the aim being to provide enough information that any reasonably experienced data manager could carry them out.
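To make those implementation details concrete, here is a minimal sketch of what they might look like in R. The file and column names (behav_wide.sav, subject_id, and a variable_timepoint naming pattern) are assumptions for illustration, and I use the haven package rather than foreign to read the .SAV file; either works.

    # Minimal sketch of the proposed conversion. File and column names are
    # hypothetical; this assumes wide columns are named variable_timepoint.
    library(haven)   # reads .SAV files
    library(tidyr)
    library(readr)

    wide <- read_sav("behav_wide.sav")

    # Wide -> long: one row per subject/timepoint/variable.
    long <- pivot_longer(wide,
                         cols = -subject_id,
                         names_to = c("variable", "timepoint"),
                         names_sep = "_")

    # The long file becomes the immutable raw copy, stored as delimited text.
    write_csv(long, "behav_long.csv")

    # "Reverse conversion" script: rebuild the wide file whenever it is needed.
    rebuild_wide <- function(path = "behav_long.csv") {
      pivot_wider(read_csv(path),
                  names_from  = c("variable", "timepoint"),
                  values_from = "value")
    }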


How to Write a Data Audit Report

I have a certain structure I like to use when I perform a data audit. Broadly, it is broken into three main sections:

Summary of the Project

This is a high level summary of the project, and is mainly included so that future readers can understand the context of the dataset itself. If, for example, the dataset in question is from a large longitudinal neuroimaging study, this summary would describe what that study was about, and also describe the relevant aspects of the study design. For example, if this neuroimaging dataset contained 4 tasks, the relevant information is what those tasks are called, how many individual runs of each task there are in a given dataset, and any aspect of the task that might lead to uncommon datatypes (e.g. was physiology collected during a given task?). It would not be useful to include scientific information about the study design in this summary. From a data management perspective, it makes no difference if one task is an inhibitory control task and the other is a working memory task. That being said, this summary should point out where the actual study design documents are, so that the scientific information is accessible.

Data Locations

In the report, this section provides a high level overview of where all the data is. A machine readable file, preferably a spreadsheet, needs to be generated that contains a comprehensive list of files and a summary of their content, but this does not need to be contained in the written report itself.
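As a sketch of what generating that file inventory might look like, assuming the audited project lives under a single root directory (the path below is hypothetical):

    # Minimal sketch of a machine readable file inventory for the audit.
    root  <- "/lab/study01"   # hypothetical project root
    files <- list.files(root, recursive = TRUE, full.names = TRUE)
    info  <- file.info(files)

    inventory <- data.frame(
      path      = files,
      extension = tools::file_ext(files),
      size_kb   = round(info$size / 1024, 1),
      modified  = info$mtime,
      summary   = ""   # filled in by hand while auditing
    )

    write.csv(inventory, "data_audit_inventory.csv", row.names = FALSE)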

I like to break this section into meaningful divisions. For example, if you were auditing a study that had both baseline self report measures and ecological momentary assessment data, I would divide the data locations into those two categories. Again, I wouldn’t structure this section on the basis of scientific similarity, e.g. Anxiety Measures (self report, EMA). This is because, presumably, the divisions you come up with group data that are similar in format, which is the relevant aspect for data management.

Data Completeness

This is a checklist of every aspect of the data that you expected to be present. There are two ways I like to identify what data are expected to be present. First, I look at the design documents, usually an IRB protocol or a grant application. These list all types of data collected, but don’t necessarily describe the data format. Next, I talk to the PIs, lab managers and the RAs that run the study data collection itself. This is always an enlightening exercise, as there is usually a disconnect between what the PIs think has been collected (with respect to format), and what is actually collected and stored. If an aspect of the data is not present at all, then that needs to be noted. If data are missing for a subset of subjects, then that needs to be noted as well (this is not statistical missingness within the data; it refers to whether the files themselves are present and where they are stored).
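For the parts of completeness that can be scripted, a minimal sketch might look like the following; the subject IDs, file names, and directory are assumptions for illustration.

    # Minimal sketch of a scripted completeness check. Expected file names are
    # built from the design documents; all names and paths here are hypothetical.
    subjects <- sprintf("sub-%02d", 1:50)
    expected <- c(outer(subjects,
                        c("_behav_baseline.csv", "_behav_followup.csv"),
                        paste0))

    present <- list.files("/lab/study01/behavioral")

    missing    <- setdiff(expected, present)   # promised but not found
    unexpected <- setdiff(present, expected)   # found but never documented

    writeLines(c("Missing:", missing, "", "Unexpected:", unexpected),
               "completeness_check.txt")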

Issues and Solutions

This is a list of issues that arose during the audit, and proposed solutions. This should be as specific as possible, with screenshots and references as needed. It should be immediately apparent upon reading an issue a) what the auditor thinks the issue is and b) that the evidence overwhelmingly points to that issue being a real concern.

I break issues down into red flags and yellow flags. Red flag issues are serious data integrity problems: e.g. a survey is not where it is expected to be, some aspect of the chain of data custody has been broken, neuroimaging files are in an unusable format, and so on. There is no question that these problems need to be fixed right away, or at the very least brought to somebody’s attention. Unfortunately, these are the issues that are usually the hardest to solve. For example, in a recent dataset I was working on, due to a series of drive failures on a workstation used to process neuroimaging data, all the neuroimaging data from that dataset was wiped clean. Fortunately we had backups, but we only backed up the raw data and not the processed data that had taken a previous postdoc several months to process. We only lost time, rather than losing data, but it was still problematic. As nobody had been looking at this dataset since the previous postdoc left, I was the one to detect this problem during my audit.

Yellow flag issues are a bit more of a touchy subject. These issues are ones that you have identified as sub-optimal. The problem with raising these issues, though, is that they are often due to the well-meaning practices of the people who collected the data, and who have worked with the data for years. You are effectively telling the PI, lab manager, and RAs: “In my opinion, you did this wrong, here is a better way of doing it.” Well, quite frankly, most folks in academia don’t appreciate that sort of thing, and so it pays to be, for lack of a better word, politic, when raising these yellow flag issues. Here’s an example I’ve encountered a number of times:

SPSS is a commonly used statistical software. I won’t fault it, it does what it says on the tin, but I personally cannot stand using it. The reason I cannot stand using it is that its native file storage format, the .SAV file, has a “proprietary” structure. These files can be opened in SPSS, but opening them in other software like R requires additional packages. More to the point, I cannot open a .SAV file in a text editor. I like files that can be opened in a text editor, if at all possible. It makes it so much quicker to look for problems, or to get an understanding of how a dataset is structured. I also make an effort to only use open source tools, so I don’t actually have a copy of SPSS installed anywhere.

Now anybody working in psychological research will have encountered these files. For me, storing data in a .SAV (or a .mat, or any other proprietary format) is a big yellow flag issue. But I guarantee you that telling your PI they need to stop using SPSS and switch over to a simple file format like .csv will not go over as well as you might think. Yes, if they made the switch YOU would work faster, because presumably you are interested in automating all of your data management processes. But if everybody else is working with SPSS, then they are just not going to want to make that switch suddenly. So instead of making that very harsh suggestion, I would approach it like so:

  1. Note the concern, and describe it: .SAV files are difficult to work with using most open source scripting languages.

  2. Lay out the long term solution: In the long term, .SAV files should be converted to .csv files, and item metadata stored as .json codebooks. 

  3. Suggest a shorter term improvement: In the meantime, all .SAV files should have their names standardized (e.g. behav_ses-01_parent.sav, behav_ses-01_child.sav), and all variable names should have a standardized structure.

  4. Note the advantages of this shorter term fix: Standardization would decrease analysis time and provide guarantees with respect to linking variables (variables that link cases across multiple datasets). 

Foremost in your mind should be: how would this change in data structure improve the experience of everybody who will work with this data in the future, not just me? If you are performing a data audit, you are likely the most experienced data manager in the room, so these issues are things you know how to deal with on the fly. Your job is to smooth these issues over, so that less experienced analysts don’t get caught up on them.

Finally, I personally like to highlight things I liked about a dataset, green flags. I believe that you can’t really learn what is good practice if nobody points out what was done well, so I try to point out cases where I don’t see an issue in how the data is stored. Strictly speaking, this is not a requirement, but I’ve found it to be helpful in my own learning.


Closing Thoughts

So let’s return to the question: why perform a data audit? A good data audit produces a document that can be used to a) reference the dataset as it currently exists and b) guide a data refactor. The former is useful for anybody working with the dataset currently, the latter useful to anybody who might take on the task of actually improving how the data is stored. A data audit, in my view, is a useful service to your colleagues in the lab or your collaborators. A well documented dataset is easier to work with than a poorly documented one, and a well structured and documented dataset is even better.

Eight Principles of Good Data Management

This post is the second in a series of posts about data management practices (see the introduction here). Before I get into talking about my principles of good data management, I want to say that I found out after my previous post that librarians, and library science as a field, have been thinking and writing about data management for years now. Many university libraries have programs/seminars where the librarians will consult with you about data management. This is a wonderful resource, and if my own experience is common, a very underutilized one. So, if you are at a university, check out your library system!


In today’s post, I walk through 8 principles of good data management. These are wholly informed by my own experiences with data management and analysis, and I wrote these with the following in mind: when it comes to data management, you are your own worst enemy. I’ve lost count of the times I’ve started complaining about some aspect of a dataset, checked who made the relevant change, and found out it was me a week ago. So these are principles that I try to follow (and oftentimes don’t quite) to protect my work against mistakes, to save time, and to make it easier to collaborate with others. This is, of course, not an exhaustive list, nor likely the absolute best way of managing data, but I’ve found them to be helpful personally.

Document Everything

This principle should be fairly self-explanatory. There should be well formatted documentation for every bit of data in a dataset. Seems simple enough, but in practice documentation tends to fall by the wayside. After all, you know what type of data you collected, right? But good documentation makes everyone's work so much simpler, so what makes for a good set of documentation? For me, good documentation follows these guidelines:

  • It references files/variable names. A codebook is documentation, but we can think of any sort of data documentation as codebook-esque. It is not useful to simply say: “The UPPS was collected.” I need to know that the UPPS is in the behav_baseline.csv, and is labeled along the lines of “upps_i1_ss” (UPPS, item 1, sensation seeking subscale).

  • It’s complete. First, there shouldn’t be any bit of data that is not described by some sort of document. If, for example, your raw data include a screenshot of a behavioral test result (which is a real thing; the test in question was proprietary, and the only way to store a record of the score the participant got was to take a screenshot of the reporting page!), then these files need to be described, even though all the test results have presumably been transcribed into a machine readable file.

This also holds for item text (and response option text), or descriptions of biological assays, or even neuroimaging acquisition details. The codebook should contain all the information about how the data was collected in an easily human-readable format. The time that an analyst spends hunting through Qualtrics surveys or reading through scanner protocol notes is less time they spend actually analyzing your data.

  • It’s cross-referenced: For the love of science, please put in the correct references when you describe a measure in your codebook! This makes it much easier to prepare an analysis for publication. Additionally, if certain measures are repeated, or have different varieties (child, parent, teacher version), make sure that these are all linked together. The goal here is to make it easy for an analyst to understand the structure of the data with respect to the analysis they are trying to perform, and to make it easier for them to prepare that analysis for eventual publication.

  • It’s thorough: This is not the same as being complete. Thoroughness is more about how aspects of the data are described. Technically the following is a complete description:

    • UPPS Child Scale (Zapolski et al., 2011)

But it doesn’t tell you anything about that measure. A more thorough description would be: 

  • UPPS-P Child Scale (Zapolski et al., 2011): Self report impulsivity scale with 4 subscales: negative urgency (8 items), perseverance (8 items), premeditation (8 items), sensation seeking (8 items). Items are on a 1-4 Likert scale. The child version is a modified version of the adult UPPS (Whiteside & Lynam, 2001).

This description tells me what the scale is, what it has in it, and what to expect about the items themselves. It’s also cross referenced, with historical information. It doesn’t go into the meaning of each subscale, which wouldn’t be within the scope of a codebook, but it provides meaningful information for any analyst.

  • It’s well indexed: Give us a table of contents at the very least. I don’t want to have to flip through the codebook to find exactly what I need. The ability to look for, say, baseline child self report measures, and see that they start on page 20, just makes my job much easier. 

  • It describes any deviations from the expected: Say you modified a scale from a 1-5 Likert scale to a 1-7. That needs to be noted in the documentation, else it can cause big issues down the line. On the other hand, if you used a scale as published, you just need to provide the minimal (but thorough!) description.

When writing a codebook, one needs to remember: you are not writing this for yourself. You are writing it for somebody who has never seen this data before (which also applies to you, 2 weeks after you last looked at the data). What do they need to know?
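Since I keep mentioning JSON codebooks, here is a minimal sketch of what one machine readable codebook entry might look like, generated from R with the jsonlite package. The structure and field names are my own assumptions, built from the UPPS-P description above; adapt them to whatever convention your lab settles on.

    # Minimal sketch of one codebook entry written out as JSON.
    # The entry structure and the variable naming pattern are assumptions.
    library(jsonlite)

    codebook <- list(
      upps_p_child = list(
        file        = "behav_baseline.csv",
        variables   = "upps_i#_<subscale>",   # e.g. upps_i1_ss
        description = "UPPS-P Child Scale, self report impulsivity",
        subscales   = c("negative urgency", "perseverance",
                        "premeditation", "sensation seeking"),
        items_per_subscale = 8,
        response_scale     = "1-4 Likert",
        references  = c("Zapolski et al., 2011",
                        "Whiteside & Lynam, 2001 (adult UPPS)"),
        deviations  = "none"
      )
    )

    write_json(codebook, "codebook.json", pretty = TRUE, auto_unbox = TRUE)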


Chain of Processing

Very few data management issues are worse than not knowing what somebody did to a piece of data. It literally makes the data unusable. If I don’t know how an fMRI image was processed, or how a scale was standardized, I cannot use it in an analysis. On the other hand, if I have documentation as to what happened, who did it, and why, I can likely recover the raw form, or, at the very least, evaluate what was done.

This principle is obviously inspired by the idea of a “chain of custody” in criminal investigations. My (admittedly lay-person) understanding of this principle is that for evidence to be considered in a trial, there needs to be a clear record of what was done to it and by whom, from the moment the piece of evidence was collected to the moment the trial concludes. This protects everybody involved, from the police (from accusations of mishandling) to the accused (from actual mishandling). Similarly, this idea applied to data management protects both the analyst and the analysis at hand.

Describing the chain of processing can be done in multiple ways. I am in favor of a combined scripting/chain of processing approach, where I write processing scripts that take raw data, process it, and return either data ready to be analyzed, or the results of an analysis themselves. In this case, the script itself shows the chain of processing, and anybody who looks at it will be able to understand what was done in a given case (if I’ve written my code to be readable, which is always a dicey proposition). Another way is the use of changelogs. These are text files (or some equivalent machine and human readable format, like JSON) that analysts can use to note when they make changes to any aspect of the data. Sometimes changes need to be made by hand, as when the data require manual cleaning (e.g. psychophysiology data), and then the changelog needs to be updated manually. Other times these changelogs can be created by the scripts used to process the data.
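A changelog helper doesn’t need to be fancy. Here is a minimal sketch of one in R; the file name, field layout, and example entry are all assumptions.

    # Minimal sketch of a changelog helper. Processing scripts can call it
    # automatically; analysts can call it after by-hand changes.
    log_change <- function(file_changed, what, why,
                           log_path = "CHANGELOG.txt") {
      entry <- paste(format(Sys.time(), "%Y-%m-%d %H:%M"),
                     Sys.info()[["user"]],
                     file_changed, what, why,
                     sep = " | ")
      cat(entry, "\n", sep = "", file = log_path, append = TRUE)
    }

    # Hypothetical example entry after hand-cleaning a physio file.
    log_change("physio_sub-07.csv",
               "removed movement artifact at 210-240s",
               "cleaned by hand per lab cleaning procedure")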

This is such an important principle to follow that I will say, I would prefer a badly written changelog or a hard to read script to no chain of processing at all.  


Immutable Raw / Deletable Derivatives

Imagine the case where a scale has been standardized. This is a fairly common procedure: subtracting the mean and dividing by the standard deviation. It makes variables that are on different scales comparable. Now imagine that you only have the standardized scale, and no longer have the raw data. This is a huge issue. Because you do not know what values were used to standardize the scale, you wouldn’t be able to add any more observations.

Okay, so that might be a trivial example. Let me mention an example that I’ve encountered many times: the processed neuroimaging files are available, but the raw images are not. This is usually not because the raw data were deleted, though that has occurred in my experience.

If you don’t have the truly raw data, you cannot recover from any data management mistakes. This means that your raw data is precious. You might never analyze the truly raw data, I know I don’t, but it needs to be kept safe from deletion or modification. Ideally, your dataset can be divided into two broad sections. The first is the rawest data possible: images right off the scanner, a .csv downloaded straight from Qualtrics. Label it well, document it, and then set it to read only and never touch it again. The second is your data derivatives. These are any bits of data that have undergone manipulation. If you have pulled out scales from that raw dataset, that is a derivative. If you have cleaned your raw physio data, derivative. Because you are presumably following the previous principle, Chain of Processing, you know precisely how each derivative bit of data was created. As such, your derivatives are safely deletable. Now, it might not be wise to delete some derivatives; for example, if your physiological data was cleaned by hand (as much of it is), even if you know exactly how it was cleaned, given the time and effort you likely shouldn’t delete those derivative files. But if push came to shove, and those files were deleted, you would be able to recover any work you had done.

I delete derivatives all the time, because my workflow involves writing data processing scripts. In fact, I often don’t even produce actual derivative files, instead keeping any processing in memory so that when I go to run an analysis, I reprocess the data each time. Whatever way you do it, make sure your raw data is immutable, read only and backed up in multiple locations, and that you have a chain of processing to tell you how each derivative file was created. If both of those are in place, you can rest much easier when thinking about how your data is stored.
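On the “read only” point, this can be enforced rather than just agreed on. Here is a minimal sketch in R for a Unix-like system, assuming a hypothetical raw-data directory (on a shared workstation you may need to set group permissions at the system level instead):

    # Minimal sketch: make every file under the raw-data directory read only.
    # The path is hypothetical.
    raw_files <- list.files("/lab/study01/raw",
                            recursive = TRUE, full.names = TRUE)
    Sys.chmod(raw_files, mode = "0444")   # read only for owner, group, others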


Automate or Validate

I’ve mentioned scripts several times so far, so it is not a surprise that scripting is one of my principles of good data management. This principle says: if you can automate a part of your processing, do that. However, oftentimes one can’t automate fully. In those cases, write scripts to validate the results of the non-automated processing. By validation I don’t mean checks to ensure the processing was done correctly; I mean checks to make sure that your files are stored correctly and that you don’t have any file format inconsistencies.

Why write scripts/validators? Because you are a weak, tired human being (presumably?). You make mistakes, you make typos, and you forget. Sure, you are clever, and can think of unique solutions to various data issues, but a solution is only useful when applied consistently. A computer, on the other hand, does only exactly what it is told to do, nothing more and nothing less. Take advantage of that quality! A script that performs your data processing will process the same way each time, and acts as a record of what was done. But what about mistakes? Yes, you will make mistakes in your code, mistakes you don’t catch until later. I develop a processing pipeline for neuroimaging data (link!). In an early development build, I didn’t add a “-” in front of a single variable in a single equation. This led to an inversion of the frequency domain filter I was implementing, so instead of removing frequencies outside of .001-.1 Hz, it removed frequencies within .001-.1 Hz. Fortunately, when I was testing this function this was simple to detect, and a couple of hours of tearing my hair out looking at my code for any errors led me to find the issue and correct it.

Contrast this with an experience a colleague had with their data. They were doing a fairly simple linear regression, and needed to merge data from two spreadsheets. Each spreadsheet looked identically ordered with respect to subject ID, so they copied and pasted the columns from one into the target spreadsheet. I’ve done this, we all have, I don’t fault them for it. We really shouldn’t be doing by-hand merges though. As my colleague realized the night before they were going to submit the manuscript, the first dataset was in fact only ordered the same way for the first 100 or so observations. Then there was a subject ID that was not in the second dataset. So, after the copy and paste, the first 100 observations were accurately matched, but after the first 100, all the observations were offset by one row. Visually, this wasn’t apparent until you scrolled to the bottom of a very long dataset, as there were no missing rows (which would visually indicate these mismatch issues quite quickly). Statistically, this is effectively equivalent to randomly reordering your variables column by column. A Very Bad Thing™. No wonder they had some very counterintuitive results! I am glad they found the issue before submitting that manuscript for publication, because had it been published with that mistake, it would have needed to be retracted!

So what happened? Well, my colleague did a very innocuous, commonly done bit of data processing, in a way they had done 100 times before. Just that this time it led, really through no fault of their own other than the same momentary lapse of attention that afflicts most of us from time to time, to a retraction-worthy mistake. A retraction-worthy mistake that was nearly undetectable, and was only found because my colleague happened to scroll to the bottom of the spreadsheet while looking at some unrelated aspect of the data.

Would scripting the data merge avoid this? Categorically yes. There are ways of messing up data merges when scripting, many ways, but in my experience those become apparent very quickly. This particular issue (a single additional observation in the first dataset) would have been completely avoided by scripting the merge. The scripted solution would also be more robust to other potential issues. For example, what if the ordering of the observations was completely different? Well, if you script the merge, you don’t even need to worry about that, the software will take care of it.
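For the sake of illustration, here is a minimal sketch of such a scripted merge in base R; the file names and the subject_id column are assumptions.

    # Minimal sketch of a scripted merge keyed on subject ID.
    # File and column names are hypothetical.
    demo   <- read.csv("demographics.csv")
    scales <- read.csv("questionnaires.csv")

    merged <- merge(demo, scales, by = "subject_id", all = TRUE)

    # Surface unmatched IDs immediately instead of silently shifting rows.
    only_demo   <- setdiff(demo$subject_id,   scales$subject_id)
    only_scales <- setdiff(scales$subject_id, demo$subject_id)
    if (length(only_demo) || length(only_scales)) {
      warning("Unmatched subject IDs: ",
              paste(c(only_demo, only_scales), collapse = ", "))
    }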

Validators are quite useful too, though I will admit I haven’t used/encountered many of them. The one I do use is the Brain Imaging Data Structure (BIDS) validator. BIDS is a standardized way of storing neuroimaging data (and an inspiration for this series of blog posts!), and the validator simply looks at the data structure to see if everything is where it needs to be. It flags files that have typos, and identifies where metadata needs to be added. Another validator I’ve written checks to make sure file names are structured the same for datasets of psychophysiological data, which require cleaning by hand. This leads to typos in file names, because RAs need to click Save As and type out the name of the file. So, I run this validator before I do any additional data kludging, just so I know my scripts are going to get all the data I was sent. Which is a great segue into my next principle: Guarantees.


Guarantees

What are guarantees in the context of data management? Guarantees are simple: If I am looking at a dataset, I should be guaranteed that, if there is a certain file naming structure, or file format, etc, all relevant files follow that rule. Not most, not almost all, not all almost surely, but absolutely all relevant files. One way of guaranteeing your Guarantees is to use scripts to process your data. Guarantees are all about consistency, and nothing is more consistent than a computer running a data processing script. Validators are a way of verifying that your guarantee holds. 

But why bother? Why does it matter if the variable names are not consistent between scales? Or that mid-study, the behavioral task output filename convention changed? Well, if I was doing analyses with a calculator (like the stats nerd I am), I would be able to adjust for small deviations. But I’m not going to do that (still a nerd), I write analysis scripts. And again, computers only ever do precisely what they are told to do. Guarantees are a way of simplifying writing new analysis scripts, or even new processing scripts. Here is an example. Consider two variable names: “UPPS_i1_ss_p” and “Conners_p_1”. I do quite a bit of text processing to create metadata when running analyses, and in this case, I might want to pull out item level means for every item in the UPPS and the Conners. But if I do a string split on “_” and look for the item number in the second slot, well, in the UPPS the second slot is “i1”, but in the Conners it is “p”. I would have to make a modified version of my UPPS processing code to fit the Conners.

But what if my variables were guaranteed to have the following formatting?

“scale_i#_subscale_source” (with an indicator if there is no subscale). 

Then I can write a single script that pulls the necessary information from each variable name, and apply it to every scale in my dataset. It makes programming analyses much simpler, and reduces the need to check the codebook for every new scale.   
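As a minimal sketch of what that single parser might look like (the exact naming convention is whatever your lab guarantees; the one below is an assumption):

    # Minimal sketch: with names guaranteed to follow scale_i#_subscale_source,
    # one parser covers every questionnaire variable in the dataset.
    parse_varname <- function(x) {
      parts <- strsplit(x, "_", fixed = TRUE)
      data.frame(
        variable = x,
        scale    = vapply(parts, `[`, "", 1),
        item     = vapply(parts, `[`, "", 2),
        subscale = vapply(parts, `[`, "", 3),
        source   = vapply(parts, `[`, "", 4)
      )
    }

    # Hypothetical example: both scales parse with the same code.
    parse_varname(c("upps_i1_ss_p", "conners_i1_none_p"))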

The main benefit of guarantees is that they reduce the cognitive load on the analyst. If I know that the file names have a standard structure, and that they were script generated, I can immediately relax and not be on the lookout for any deviations in file naming that might mess up my processing stream. Because of this, I can better focus on performing whatever processing I am doing correctly. In my experience, when one has to adapt code to little idiosyncrasies in the data structure, these adaptations are where the mistakes creep in. I’ve never written cleaner code than when I work with guaranteed data.


Open (Lab) Access

Science is fundamentally collaborative. Even if you are the main statistical analyst on a project, you will still need to access data that was generated or collected by other lab members. This brings up an annoying issue, that of data permissioning. There are two ways I’ve seen this issue come up. 

The first is a literal file permission problem. I work on a variety of neuroimaging projects, and, for a variety of historical reasons, neuroimaging data is usually processed on Linux workstations. One aspect of Linux, and particularly Linux on shared workstations, is that each file has a set of access permissions. In Linux, these permissions are the following: 1) can you, the file creator, read/write/execute the file? 2) can members of your user group read/write/execute the file? And 3) can anybody else read/write/execute the file? If you happen to be using a personal Linux machine (more power to you, I guess?), this is not an issue, as you can be fairly certain that the only person accessing files on your computer is you. But on a workstation this can become an issue, because if the permissions aren’t set correctly, other members of your lab won’t be able to access any files you have created. In neuroimaging this quickly becomes problematic, as each step in preprocessing creates temporary files. About 70% of issues I have encountered using various pipelines I have developed have ultimately come down to permission issues. 

Of course, fixing actual file permissions is a fairly simple thing to do. But there is a more problematic “permissions” issue that often occurs in larger labs. I like to refer to this as a lab balkanization of data. This is when, due to internal lab politics, different bits of data are only accessible to certain investigators. One example of this, that I have personally experienced, is where I had access to the neuroimaging data from a large longitudinal study, but the self report data was not only not accessible by me, it was held by an investigator at a university halfway across the country. To get access to this data, we had to request specific cases/variables, and this investigator would then send us just those records.

Now before I start criticizing this practice, I will note that this sort of data access issue can happen for very good reasons (as was the case with the large longitudinal study). Oftentimes, if there is very sensitive information (think substance use self report in children), that data has additional security requirements. A good example of this is the ADDHealth national survey. This is a very large national survey which collected health behavior data on high schoolers, and one unique aspect of it is that there is social network information for many of the participants. Additionally, ADDHealth collected romantic partner data for participants, including whether two participants were romantic partners. This, combined with the fact that one could, theoretically, easily identify specific participants based on the extensive self report data (deductive disclosure), means that this data needs to be specially protected. To access the romantic partner data, an investigator needs to dedicate an entire room that only they can access (not anybody in their lab, just the actual investigator), with a computer that has been chained to the room and has no internet access. There are a number of other requirements, but the one that made me laugh a bit is that if you were to store the data on a solid state drive (SSD), you are required to physically destroy the drive at the end of the analysis. So there are a number of cases where restricting access to sensitive data is quite reasonable.

That being said, I believe that a PI should make every effort to ensure equal access to all data for all analysts in their lab. This smooths the working process, and reduces mistakes due to miscommunication. When I am looking for data, I usually know exactly what I need, but I might not know exactly what form it takes. If I have access to all the data for a given study, I can hunt for what I need. If I have to ask another person to send me the data, I usually will have to go back and forth a couple times to get exactly what I need. 

So what are the reasons that this balkanization happens? Usually, there is no reason. Somebody just ran a processing script on their own desktop, and never put the file in a shared drive. Occasionally, balkanization can be subtly encouraged by competitive lab culture. Grad students might keep “their” data close to the chest because they worry that somebody might scoop their results. I’ll be blunt: scooping within a lab should be impossible. If two lab members get into this sort of conflict, the PI is either ill-informed about what is happening in the lab, in that they didn’t nip it in the bud or come down hard on whoever was trying to scoop, or malevolent, in that they encouraged this behavior in an extremely misguided belief that competition, red in tooth and claw, makes for better scientists. It categorically does not. This balkanization can also occur at the investigator level, for example when two labs that have collaborated on a larger study divide the data between themselves. Personally, I find this to be ridiculous, as again, any concerns about who gets to publish what paper should be dealt with by dialogue, not by restricting access. But, admittedly, when this sort of divide happens, it is rarely resolved in the fashion I prefer (data pooled and everybody has equal access), simply due to investigator inertia/ego.

To avoid issues with data access, data storage plans should be drawn up before the first subject is collected. These plans should indicate if there are any aspects of the data that are deemed sensitive and would require secure storage. Beyond that, these plans should work to provide as equal and as full access as possible to any lab member who would, reasonably, be performing analyses. Who gets to write/publish a certain project should be negotiated openly and clearly. If this kind of transparency is encouraged, then questions about who has access to data quickly become irrelevant.


Redundant Metadata

Metadata refers to data about data. A good example of this is the scanner parameters for neuroimaging data. The data itself is the scan, while the metadata are the acquisition settings for that scan. In neuroimaging these are vital to know, as they tell you, among many other things, how fast the data was collected, what direction the scan was in, what the actual dimensions of the image are, and so on. In a more traditional self report survey, metadata could be the actual text of each item, along with the text of the response options.

For multiple file datasets, such as ones where there are separate data files for each subject, a piece of metadata would be which subject is associated with each file. Metadata is obviously important, but oftentimes it is only stored in a single place at a time. Take this simple example comparing two directory structures:

/data/sub-01/behavioral/baseline.csv

/data/sub-01/behavioral/sub-01_behavioral_baseline.csv

Both directory/file combinations contain the same information: the file is the baseline behavioral data for subject 01. But the second combination has redundant information. Not only does the directory structure tell you that this is the behavioral data for subject 01, the file name itself reiterates this. Why is this useful? Well, say you want to analyze all the baseline behavioral data. You extract all the baseline data into a new directory. In the first case:

/newdir/baseline.csv

/newdir/baseline(1).csv

/newdir/baseline(2).csv

In the second:

/newdir/sub-01_behavioral_baseline.csv

/newdir/sub-02_behavioral_baseline.csv

/newdir/sub-03_behavioral_baseline.csv

In the first case, you’ve lost all identifying information, while in the second case, the important metadata is carried along in the file name. I know which case I would prefer to work with! While this scenario is a bit of a straw man, it does happen. I’ve seen it in neuroimaging datasets, where the subject is indicated only at the directory level, à la /sub-01/anat/mprage.nii.gz. In fact, this is/was a fairly common data structure, as certain neuroimaging software packages effectively incentivized it.
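If you are stuck with the directory-only structure, the redundancy can be rebuilt when files are pulled out, rather than lost. A minimal sketch, assuming the hypothetical paths used in the example above:

    # Minimal sketch: carry the subject ID from the directory structure into
    # the file name when extracting files. Paths are hypothetical.
    old_paths <- Sys.glob("/data/sub-*/behavioral/baseline.csv")
    subjects  <- basename(dirname(dirname(old_paths)))   # "sub-01", "sub-02", ...

    new_paths <- file.path("/newdir",
                           paste0(subjects, "_behavioral_baseline.csv"))
    file.copy(old_paths, new_paths)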

Metadata is tricky, because there is usually so much of it and you usually don’t know every piece you might need. So, store it all!


Standardization

So, you’ve decided to implement all the principles so far, and you’ve convinced your colleagues to implement good data management practices as well. Wonderful! Your analysis pipelines are humming along, your graduate students and postdocs have slightly less haunted looks in their eyes, and you feel that warm feeling that only well organized data can give you (no? Only me then?). 

With your newfound confidence in your data, you decide to strike up a collaboration with a colleague. That colleague has also jumped onto the data management train, so you are confident that when they send you data it will be well organized and easy to use.

So they send you their data! And it is beautifully organized! Beautifully organized in a completely different way than your data!

Well, now all of your data processing/analysis scripts will need to be rewritten. This might be much easier than normal, because of how your colleague’s data is organized, but it still takes time. So, how can we streamline this? 

Now we come to the final principle, to which all other data management principles lead: Standardization. On the face of it, this principle is fairly simple. Labs working with similar study designs/datasets should use a single standard data management setup, rather than many different setups, no matter how well managed each of those setups might be. However, this is much easier said than done.

Different labs/projects/PIs have different data requirements and likely use differing software tools. This leads to a proliferation of data management choices that makes a single standardized data management schema that works for every possible case nearly impossible. The closest thing I have seen to complete standardization is the (previously mentioned) BIDS format, which is only possible because there are a limited number of data sources for neuroimaging data and a great deal of effort has gone into standardizing the low level file formats used in neuroimaging (e.g. NIfTI files as a standard MRI storage format).

If universal data management standardization is impossible, what can be standardized? I think of datasets as puzzles made up of different modalities of data. Each modality represents a type of data whose files share most of their characteristics. For example, I consider questionnaire data to be a single modality, as is fMRI data. Conceivably, a standard format for any questionnaire data could be developed (I would suggest pairing CSV files with a metadata JSON, but there are many other ways as well). I think standardization of different modalities of data is the right way of approaching this problem.

Even with restricting the scope to specific modalities of data, true standardization is difficult. So what can the individual researcher do? Well, first, and most importantly, researchers need to be talking about data management with colleagues and students. There is a tendency for PIs to abstract away from the nitty gritty of data management and data analysis, and while I understand the reason for that (grants don’t write themselves!), this inattention is one of the leading drivers of data rot. By working data management into scientific discussions and project planning, I find it grounds the conversation and focuses it on the question of what we can do with what we have. From there, researchers should explicitly share their data management scheme with colleagues. If you’ve saved time by implementing good data management, then likely your colleagues would benefit from adopting what you’ve done. While this can be a bit of work, I’ve found that by emphasizing the timesaving aspects of good data management, otherwise very busy PIs become much more amenable to changing around the structure of their data storage.

Ideally, as data management is discussed and setups are shared, this would naturally lead to a kind of standardization. Consider a fairly simple type of standardization, a well structured variable naming scheme: scale_i#_subscale_source. Changing variable names in a dataset tends to be very simple, and once researchers see how useful a standardized naming scheme can be, it can be quickly adopted. The key here is for researchers who are trying to bring good data management practices to the table to keep up the pressure. Researchers/scientists/academics, myself included, tend to have quite a bit of inertia in how we like to do things. We get into ruts, where the way we know how to do a task is so familiar and easy that we continue out of convenience. But the siren call of “it will save you time” is strong, and I’ve gotten the best results when pitching standardization by emphasizing the advantages, rather than pointing out what is going wrong.


Summary 

The above data management principles were derived from my own experiences working with all kinds of data and are not meant to be exhaustive or overly rigid. My goals when thinking about data management are: how do I protect my work from my greatest enemy (me from a week ago), and how do I save time and cognitive energy? I don’t like wasting time, and I don’t like to repeat work. That being said, all of these principles are well and good when you are setting up a new study, but what if you are currently working with a dataset? Or you are a new graduate student or postdoc, and you are being handed a dataset? You might want to start reorganizing the data immediately into a better management structure. I would urge caution though. Not only do PIs tend to not like their datasets being unilaterally reorganized by a new member of the lab (a scenario that I know nothing about, nothing at all), you also likely don’t know enough about the dataset in question to even begin to reorganize it. In order to efficiently and correctly reorganize the data, you need to understand what you have. You need to perform a data audit, which is a systematic investigation of an existing dataset for the purposes of identifying:

  1. What should be in it. 

  2. What is actually in it. 

  3. How the data is currently organized.

  4. How it could be organized better.

In my next data management post, I’ll be walking through how I perform a data audit, and what I think needs to be in one. Thanks for reading!

Data Management for Researchers: Three Tales

This is part one of a series of blog posts about best practices in data management for academic research situations. Many of the issues/principles I talk about here are less applicable to large scale tech/industry data analysis pipelines, as data in those contexts tend to be managed by dedicated database engineers/managers, and the data storage setup tends to be fairly different from an academic setting. I also don’t touch much on sensitive data issues, where data security is paramount. It goes without saying that if good data management practices conflict with our ethical obligations to protect sensitive data, ethics should win every time.


Good Data Management is Good Research Practice 

Data are everywhere in research. If you are performing any sort of study, in any field of science, you will be generating data files. These files will then be used in your analyses and in your grant applications. Given how important the actual data files are, it is a shame that we as scientists don’t get more training in how to manage our data. While we often have excellent data analysis training, we usually have no training at all in how to organize data, how to build processing streams for our data, and how to curate and document our data so that future researchers can use it. 

However, researchers are not rewarded for being excellent at managing data. Reviewers are never going to comment on how beautifully a dataset is documented, and you will never get back a summary statement from the NIH commenting on the choice of comma delimited vs tab delimited files. Quite frankly, if you do manage your data well, your reward will be the lack of comments. You will know you did well if you can send a new collaborator a single link to your dataset, and they respond with, “Everything looks like it is here, and the documentation answered all of my questions.” So, given that lack of extrinsic motivation, why should you take the time and effort to learn and practice good data management? Let me illustrate why good data management matters with a few examples. All of the examples below are issues that arose in real life projects (albeit with details changed both for anonymity and to improve the exposition), and each one of them could have been prevented with better data management.


Reverse coded, or was it?

I once had the pleasure of working on a large scale study that involved subjects visiting the lab to take a 2 hour battery of measures. This was a massive collection effort, which fortunately, as is so often the case with statisticians, I got to avoid, and come in when the data was already collected. This lab operated primarily in SPSS, which, for those not familiar with it, is a very common statistical analysis package used throughout the social sciences. For many, many years, SPSS was the primary software that everybody in psychology was trained on, and to its credit, it is quite flexible, has many features, and is easy to use. The reason it is so easy to use, however, is that it is a GUI based system, where users can specify statistical analyses through a series of dialog boxes. Layered under this is a robust syntax system that users can access; however, this syntax is not a fully featured scripting language like R, and is, to put it mildly, difficult to understand.

In this particular instance, I was handed the full dataset and went about my merry way doing some scale validation. But then I ran into an issue. A set of items on one particular measure were not correlating in the expected direction with the rest of the items. In this particular measure, these items were part of a subscale that was typically reverse coded. The issue was, I couldn’t determine if the items had already been reverse coded! There were no notes, the person who prepared the data couldn’t remember what they did, and nobody could find any syntax. Originally, I was under the impression that the dataset I was handed was completely raw, but as it turned out, it had passed through 3 different people, all of whom had recomputed summary statistics, recomputed scale scores, and made other changes to the dataset. Because we couldn’t determine if the items were reverse scored, we couldn’t use the subscale, and this particular subscale was one of the most important ones in the study (I believe it was listed as one of the ones we were going to analyze in the PI’s grant, which meant we had to report those results to the funding agency).

After a solid month of trying everything I could to determine if the items were reverse scored or not, I ran across a folder from a graduate student who had since left the lab. In that folder, I found an SPSS syntax file, which turned out to be the syntax file used to process this specific version of the dataset. However, the only reason I could determine that is because, at the end of the syntax file, the data was output to a file named identically to the one I had.

Fortunately, this story had a happy ending in terms of data analysis, but the journey through the abyss of data management was frustrating. I spent a month (albeit on and off) trying to determine if items were reverse coded or not! That was a great deal of time wasted! Now, many of you might be thinking, why didn’t I go back to the raw data? Well, the truly raw data had disappeared, and the dataset I was working with was considered raw, so verifying against the raw data was impossible. 

What I haven’t mentioned yet, is that this was my very first real data analysis project, and I was very green to the whole data management issue. This was a formative experience for me, and led me to switch entirely over to R from SPSS, in part to avoid this scenario in the future! This situation illustrated several violations of good data management practices (these will be explained in depth in a future post):

  • The data violated the Chain of Processing rule, in that nobody could determine how the dataset I was working with was derived from the original raw data.

  • It violated the Document Everything rule, in that there was no documentation at all, at least not for the dataset itself. The measures were well documented, but here I am referring to how the actual file itself was documented. 

  • The data management practices for that study as a whole violated the Immutable Raw / Deletable Derivatives rule, in that the raw data was changed, and if we had deleted the data file (which was a derivative file) I was working with, we would have lost everything.

  • It partially violated the Open Lab Access rule, in that the processing scripts were accessible to me, but were in the student’s personal working directory, rather than saved alongside the datafile itself.

This particular case is an excellent example of data rot. Data rot is what happens when many different people work with a collection of data without a strong set of data management policies in place. What happens is that, over time, more and more derivative files, scripts, subsets of data files, and even copies of the raw data are created as researchers work on various projects. This is natural, and with good data management policies in place, not a problem overall. But here, data rot led to a very time consuming situation.

Data rot is the primary enemy that good data management can help to combat, and it is usually the phenomenon that causes the largest, most intractable problems (e.g. nobody can find the original raw data). It is not the only problem that good data management practices can defend against, as we will see in the next vignette.


Inconsistent filenames make Teague a dull boy.

I am often asked to help out with data kludging issues people have, which is to say I help collaborators and colleagues get their data into a form that they can work with. In one particular instance, I was helping a colleague compute a measure of reliability between two undergraduates’ data cleaning efforts. I was given two folders filled with CSV files, a file per subject per data cleaner, and I went about writing a script to match the file names between both folders, and then to compute the necessary reliability statistics. When I went to run my script, it threw a series of errors. As it turns out, sometimes one data cleaner would have cleaned one subject’s datafile while the other data cleaner missed that subject, which is expected. So I adjusted my script and ran it again. This time it ran perfectly, and I sent the results over to my colleague. They responded within 20 minutes to say that there were far fewer subjects with reliability statistics than they expected and asked me to double check my work. I went line by line through my script and responded that, assuming the filenames were consistent between both data cleaners, my script picked out all subjects with files present in both cleaners’ folders.

Now, some of you readers might be seeing my mistake. I assumed that the filenames were consistent, and like most assumptions I make, it made a fool out of me. Looking at the file names, I found cases that looked like:

s001_upps_rewarded.csv vs. s001_ upps_rewarded_2.csv

Note the space after the first underscore and the _2 in the second file name. These sorts of issues were everywhere in the dataset. To my tired human eyes, they were minor enough that on a cursory examination of the two folders I missed them (though I did notice several egregious examples and corrected them). But to my computer's unfailingly rigid eyes, these were not the same file names, and therefore my script didn't work.

This happened because, when this particular data was collected, the RAs running each session had to manually type in the filename before saving it. Humans are fallible but can adjust for small errors; computers will do exactly what you tell them to. In my case, there was nothing wrong with the script I wrote: it did exactly what I wanted it to do and ignored any unpaired files. The issue was that there was no guarantee about the file structure. So what data management principles did this case violate?

  • Absolute Consistency: The files were inconsistently named, which caused issues with how my script detected them.

  • Automate or Validate: The files were manually named, which means there was no guarantee that they would be named consistently. Additionally, there was no validation tool to detect violations of the naming convention (there is now; I had to write one, and a minimal sketch of that kind of check follows below).
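That validation tool does not need to be anything elaborate. Here is a minimal sketch in R; the naming pattern and folder name below are hypothetical stand-ins, not the actual convention from that study:

# Hypothetical convention: sXXX_<task>_<condition>.csv, lowercase, no spaces
expected_pattern <- "^s[0-9]{3}_[a-z]+_[a-z]+\\.csv$"
files <- list.files("cleaner_1", pattern = "\\.csv$")
bad_files <- files[!grepl(expected_pattern, files)]
if (length(bad_files) > 0) {
  warning("Files violating the naming convention:\n",
          paste(bad_files, collapse = "\n"))
}

Run something like this every time new files are added, and the stray space or extra suffix gets caught before it costs anyone an afternoon.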

Now, this was not a serious case. I didn't spend a month fixing the issue, nor was it particularly difficult to fix. I did have to spend several hours manually fixing the filenames, and any time spent fixing data management issues is time wasted. This is because all data management issues can be preempted, or at the very least minimized, by using good data management principles.


Lift your voices in praise of the sacred text fmri_processing_2007.sh

In addition to behavioral data, much of my work deals with neuroimaging data, and in fact, many of the ideas in this post came out of those experiences. Neuroimaging, be it EEG, MRI, fNIRS, or some other modality they invented between the time this was posted and the time you read it, produces massive numbers of files. For example, one common way of storing imaging data is the DICOM format. DICOMs are not single files, but rather collections of files representing "slices" of a 3D image, along with header information containing a variety of metadata. There might be hundreds of files in a given DICOM, and multiple DICOMs can live in the same folder. This is not necessarily an issue, as most software can determine which file goes with which image. But now imagine those files, their conversions into a better data format, and the associated behavioral data (usually multiple files per scan, and usually multiple scans per person), and you can get a sense of my main issue with neuroimaging data: it can be stored in an infinite number of ways.

When I first started working with neuroimaging data, I was asked to preprocess a collection of raw functional MRI scans. Preprocessing is an important step in neuroimaging because it a) corrects for a variety of artifacts and b) addresses the small issue of people having differently shaped brains (by transforming their brain images into what is known as a standard space). Preprocessing fMRI images involves quite literally thousands of decision points, and I wanted to see how the lab I received the data from did it. They sent me a shell script titled fmri_processing_2007.sh. The 2007 in the file name was the year it was originally written. This occurred in 2020. The lab I was collaborating with was using a 13-year-old shell script to process their neuroimaging data.

As aghast as I was, I couldn't change that fact, so I took the time to understand what processing steps were being done, and I set the script running on my local copy of the dataset. It failed almost immediately. I realized I had made the mistake of fixing what I considered issues in the file names and organization, and although I had tried to do so in a way that wouldn't break the script, it broke anyway. After adjusting the processing script, I managed to run it to completion.

Around the same time, I was working with a different neuroimaging group, and they requested processing code to run on their end. I sent over my modified script, as it was the only processing script I had on hand, and I felt I had made it generalizable enough that it should handle most folder structures. I was severely mistaken. My folder structure looked something like this:

/project
    /fMRI/
        s001_gng_1.nii.gz
        … 
    /MPRAGE/
        s001_mprage.nii.gz
        …

While the other lab's folder structure looked like:

/project/
    s001/
        fMRI/
            gng_1.nii.gz
        MPRAGE/
            mprage.nii.gz

I had written my script to assume that the first component of a file name was the subject ID, which it was in my data. In the other lab's data, however, subject IDs were specified at the level of the folder. Obviously my script would not work without substantial alteration. I don't think they ever did make those alterations.
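To make the failure concrete, here is a small illustration in R (the original was a shell script, and these paths are made up to mirror the two layouts above):

# My layout: the subject ID is the first underscore-delimited piece of the file name
# Their layout: the subject ID is a folder name two levels above the file
my_file    <- "/project/fMRI/s001_gng_1.nii.gz"
their_file <- "/project/s001/fMRI/gng_1.nii.gz"

id_from_filename <- function(path) strsplit(basename(path), "_")[[1]][1]

id_from_filename(my_file)     # "s001" -- correct for my layout
id_from_filename(their_file)  # "gng"  -- silently wrong for theirs

# For their layout, the ID has to come from the directory structure instead
basename(dirname(dirname(their_file)))  # "s001"

Note that the script does not error out on their data; it just quietly produces nonsense subject IDs, which is arguably worse.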

There are two good data management principles violated here:

  1. Redundant Metadata: In the case of the other lab, the file names did not contain the subject information. What would have happened if those files were ever removed from the subject’s folder? 

  2. Standardization: This is more of a goal than a principle. Imagine if both I and the other lab had used a standardized way of storing our data and had written our scripts to fit. We would have been able to pass code back and forth without an issue, and that would have saved us time and trouble.

Neither data rot nor human fallibility was to blame for these issues. In fact, both datasets were extremely consistently organized, and there were no mistakes in naming. We simply didn't use the same data structure, and it is worthwhile to ask why. The answer here was simple inertia. Both the analysts at the other lab and I had scripts written for a given data structure. In my case, the scripts I had were handed down from PI to PI for years, until the original reasons for certain data design decisions faded from memory. I like to term this the sacred text effect. It usually occurs with code or scripts, but it can occur with any practice. Usually the conversation goes like this:

You: Why is this data organized this way? 

Them: Because that is how my PI organized data when I was in graduate school, and besides, all of our analysis scripts are designed for this data structure.

You: Would you consider changing over to a more standardized data structure? There are several issues with the current structure that would be easily fixable, and if we use this standard, we can share the data more freely as well as use tools designed for this data structure. 

Them: Sure, I guess, but could you fix our current scripts to deal with the new structure?

Suddenly you have signed up for more work! It is vital that labs do not get locked into suboptimal data management practices simply due to inertia. If a practice doesn't work, or causes delays, take the time to fix it. It might cost time now, but you will make that time back tenfold. A great example of this, and a major inspiration for this work, is the BIDS standard, a data structuring scheme for storing neuroimaging data.
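For the unfamiliar, a BIDS-style layout for the toy example above would look roughly like this (simplified; the actual specification pins down the naming rules in far more detail):

/project/
    sub-001/
        anat/
            sub-001_T1w.nii.gz
        func/
            sub-001_task-gng_run-1_bold.nii.gz

Note that the subject label appears in both the folder structure and the file names, and any lab that adopts the standard can share scripts and use BIDS-aware tools without the kind of rewriting described above.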


These three cases illustrate the consequences of bad data management, but there are many more examples I could write about. To adapt a common idiom about relationships:

Every example of good data management is good in the same ways, but every example of bad data management is bad in its own unique way.

But again, it is important to point out that this is not due to any incompetence on the part of researchers. I have spoken and worked with many researchers who do not have the same technical background I do, and each one recognizes the issues inherent in bad data management practices (and can usually come up with a startling number of examples from their own work). It is simply that they were never trained in good data management, so they have had to figure everything out on their own, and these are very busy people. In the next post, I'll lay out what I see as 8 principles of good data management for researchers. These principles are based on my experience in the social and biomedical sciences, and so they might not wholly apply to, for example, database management in a corporate setting.

Designing libraries: Expanding pre-existing capabilities

Today I want to continue discussing the three categories of software I outlined in my last post (found here), and focus on libraries. Libraries are pieces of software that expand the pre-existing capabilities of a given programming language. Note the focus on the programming language itself, not on any given task one might be writing software to accomplish. This leads me to the number one principle of designing libraries:

The user base of a library is software developers.

Going back to the idea that you, as the developer of a piece of software, need to understand what your users do and do not need to know, we can outline some expectations for the users of any given library.

  • They will know how to program in the target language. This is the most basic requirement for using a library. Libraries are not meant to completely abstract away from a language, but rather to augment the capabilities of that language.

  • They will know how to program their primary task. A software developer is building a piece of software, and you can safely assume that they know how to go about creating that piece of software.

  • They will not know how to implement specific pieces of their program. For example, a developer might be writing some code to perform linear algebra. They might know what they need to do (e.g. multiply matrix A by matrix B), but they don’t necessarily know how to implement matrix multiplication efficiently.

So how do these points translate into design principles? From my experience, there are three main qualities that I ascribe to a well designed library.

  • A well designed library closely follows the general principles of the programming language it augments. This is so that it does not violate the expectations of the user. If I am working with a library and it requires me to start using constructors to instantiate objects when the language I am working in doesn't typically require that, it will throw me off in my use of it.

  • Libraries should be internally consistent. This is not a category of software meant to hold a bunch of miscellaneous functions, nor should there be inconsistencies in how its functions are used.

  • Libraries should have extensive documentation. Now granted, all software should be well documented, but to me, libraries need even more care with respect to documentation. When I am working on some software, I need to know how every function I am using performs. Undocumented side effects could cause serious issues down the line.

So let's talk about an example of a well designed library: Armadillo, a C++ library for matrix algebra. I use this library quite a bit because it integrates with Rcpp. It makes for very fast, memory-efficient programs when it comes to matrix algebra, which I write a lot of due to my work with networks. Armadillo, for the most part, follows all of the design principles above.

  • Armadillo functions/objects are C++ functions/objects. Nothing fancy: they behave exactly as you would expect, and most of the template meta-programming under the hood is hidden from the user (see the sketch after this list).

  • Armadillo is very internally consistent. All objects are accessed the same way and everything is labeled the same way, which makes it quite easy to work with.

  • It also has very extensive documentation, as befits a matrix algebra library.
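As a small illustration of the first two points, here is a sketch of calling Armadillo from R via Rcpp (this assumes the Rcpp and RcppArmadillo packages are installed; the function name is my own):

library(Rcpp)

cppFunction(depends = "RcppArmadillo", code = '
  arma::mat arma_crossprod(const arma::mat& A, const arma::mat& B) {
    // Armadillo objects behave like ordinary C++ objects:
    // .t() is the transpose and * is matrix multiplication
    return A.t() * B;
  }
')

arma_crossprod(matrix(rnorm(6), 3, 2), matrix(rnorm(6), 3, 2))

Nothing about the call looks exotic: you write a C++ function, Armadillo supplies the matrix type and the optimized multiplication, and Rcpp handles the conversion to and from R.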

Armadillo also makes user expectations very clear. Using it requires knowing how to program in C++ and how to do linear algebra; it simply provides implementations that are well optimized and simple to use. A great example of this is the sparse matrix class SpMat. Sparse matrices are representations of matrices with many zero entries, and a sparse representation is considerably more memory efficient than a dense one. It achieves this by only storing the locations and values of the non-zero entries, while a dense representation stores all values. Note that these are both representations of a matrix, which means that mathematically a matrix represented as dense or sparse behaves the same. But in terms of programming, a sparse matrix is much quicker to work with in some applications. Armadillo implements this and makes it simple for somebody like me, who knows linear algebra and C++ but not the details of optimizing matrix math, to use.
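To get a feel for why the sparse representation matters, here is a quick analogue in R using the Matrix package (not Armadillo itself, but the same dense-versus-sparse idea):

library(Matrix)

n <- 2000
dense <- matrix(0, n, n)
dense[sample(length(dense), 5000)] <- rnorm(5000)  # only 5000 of the 4 million cells are non-zero

sparse <- Matrix(dense, sparse = TRUE)  # stores only the non-zero locations and values

object.size(dense)   # roughly 32 MB: every cell is stored
object.size(sparse)  # well under 1 MB: just the non-zero entries and their indices

Mathematically the two objects represent the same matrix; the sparse version simply refuses to pay for all of the zeros.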

There are some aspects of Armadillo that are a bit odd from a design perspective. For example, why does the library have a function to fully model data using Gaussian mixture models? I am sure it works quite well; I just find it very odd to see in a matrix algebra library. The same goes for the fact that it has a k-means clustering function. Not necessarily a problem, but odd nonetheless.

If you are programming software in the social sciences, it is unlikely that you are going to be developing a library. Still, recognizing what separates a well designed library from a poorly designed one helps me decide which libraries I am going to use for a task. One of the most frustrating parts of developing software is not when your code throws a bug, but when you discover that the library you thought could do X cannot actually do it.

Thanks for reading! During the holidays (and job app/interview season), I am planning on a two-week schedule for these posts. In two weeks I will finish this series with my final notes on designing modules, which to me represent the middle ground between applications and libraries. After that, I am going to talk about the why, how, and what of code profiling.

Cheers,

Teague

A taxonomy of software: designing applications

Today I want to drill down a bit into last week's post (found here) about software design and talk about the first of three very general categories of software one might develop in the social sciences. These categories are a) my own taxonomy and b) only a very general one, with many pieces of software falling into a mixture of categories. All that to say, I have found these distinctions helpful in understanding how to structure software:

  • Applications - These pieces of software are designed to perform a primary task or set of tasks, while minimizing the amount of secondary knowledge (e.g. programming, data management) required of its users. This comes at the cost of being relatively inflexible.

  • Libraries - These pieces of software extend the capabilities of an existing programming language in some way. They require high secondary and primary knowledge of the user. This allows libraries to be very flexible in their use.

  • Modules - A middle ground between applications and libraries, this type of software simplifies a primary task, reduces secondary knowledge cost, and allows for a great deal of flexibility. Often, this type of software is made to work with several other modules as well.

With those brief descriptions, I want to start by discussing the general design of applications.

Applications minimize secondary knowledge cost.

The category that I refer to as "Applications" covers any piece of software that aims to a) perform a complete task and b) minimize the additional knowledge users need. This is best illustrated with some examples of what I do and don't consider an application.

Applications:

  • SPSS is an obvious choice for the category of application. It handles all aspects of running statistics, and it abstracts away from the language it was written in, a combination of Java and likely C.

  • The R package lavaan I also consider an application. It aims to handle all aspects of running SEM models, and it abstracts away from R considerably. Besides data input and some very basic function calls, most of the work in using lavaan is setting up the model syntax.

Not Applications:

  • I wouldn’t consider the R package ggplot2 to be an application. It performs a specific task yes, but it doesn’t abstract away from R sufficiently. Instead I would consider this a module.

  • The C++ library Armadillo (link) is definitely not an application, but rather I would consider this to be a library. It simply aims to extend the linear algebra capabilities of C++.

Designing the user interface for an application requires careful consideration of who your user base is going to be, as you can assume very little about the technical knowledge of any given user. For example, SPSS is successful because it makes the act of running fairly complex statistical models a matter of navigating a set of graphical user interfaces (GUIs). This of course requires knowledge of the statistical models (at least in theory, if not in practice), but it doesn't require any programming expertise. The only secondary knowledge it really requires is the ability to navigate GUIs.

Contrast this with base R's statistical capabilities. I can easily run a regression in R in a single line of code that might take me several minutes of clicking through GUIs in SPSS. This, however, requires more knowledge. Not only do I need to know how to set up a regression, I need to understand R formulas, data input, and how to assign objects to variable names.
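For example, a regression with two predictors is a one-liner (the variable and data frame names here are placeholders):

fit <- lm(outcome ~ predictor1 + predictor2, data = my_data)
summary(fit)

One line to fit the model, but only if you already know what a formula is, what a data frame is, and what the assignment arrow does.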

This “secondary knowledge cost” is what you are trying to minimize when you are writing a program. You can expect the user to know about what the program does (e.g. SPSS does statistics) and you are trying to minimize everything else the user needs to know (e.g. SPSS does not require object oriented programming).

Let me elaborate on this idea of secondary knowledge cost with a more personal example. I develop and maintain a Python package called clpipe (link). This "package" is really a set of command line functions for quickly processing neuroimaging data on high performance clusters. For those of you who aren't neuroimagers, neuroimaging data requires extensive processing before analysis, and this processing is quite mathematically complex. People spend entire academic careers on processing, and many software programs have been developed to perform it. There were several issues that I felt warranted an additional piece of software:

  • To get neuroimaging data from scanner to analysis requires the use of several programs at a minimum, which in turn requires the knowledge of how to use these programs (non-trivial, neuroimaging software is not typically designed well).

  • Quite a bit of time is spent on data management when you are working with neuroimaging data. Ideally, this can be done using some sort of scripting language, but that requires knowledge of the scripting language.

  • Processing neuroimaging data takes quite a bit of time. Processing subjects in parallel on high performance clusters makes this much quicker, but that requires knowledge of how to use an HPC.

So, in sum, to process neuroimaging data you not only need to know about the actual processing, you have to understand the idiosyncrasies of several neuroimaging programs, know how to do data management and ideally understand how to use a high performance cluster.

My program, clpipe, attempts to lessen this secondary knowledge cost by automating many of those steps. I have written very little code that actually processes the data; that is covered by a variety of programs that clpipe interfaces with (FMRIPREP, dcm2bids). Instead, clpipe manages the data and the submission of jobs to HPCs. All the secondary knowledge it requires is a working knowledge of navigating Linux filesystems (not unreasonable in neuroimaging) and a very basic understanding of how to format a couple of JSON files (configuration of the pipelines is done via JSON files). Of course, I made no attempt at lessening the primary knowledge cost. To use clpipe, you do need to know how to process neuroimaging data and the myriad choices you can make.

So, stepping back, what makes a good application? To me, a good application minimizes the additional things you need to know to do your primary task. The cost, however, is that a good application is not flexible. It makes what it does easy, but you are SOL if you want to do something outside of that specific task (try tricking SPSS into doing something outside of what it is explicitly designed to do). So how does this translate into design principles? Here are my thoughts:

  • Identify what the primary task of an application is. Imagine your user as somebody who knows everything about that task (e.g. they are an expert in regression) but has absolutely no knowledge of anything else (e.g. they have never programmed in their lives).

  • Given that theoretical user and the restrictions on your implementation, minimize what additional things the user needs to know. If you are writing an R package to do one specific type of analysis, you are going to be hard pressed to make a GUI, but you can minimize what the user needs to know about R to use your package (again, lavaan is an excellent example of this).

  • Make sure not to violate the expected flow of a given task. An application is not providing the tools to do a task, it is doing the task for the user.

  • Be very wary of designing an application so that it is easiest for you to use. I see this quite a bit, and fall victim to it quite a bit as well. By definition, if you are developing an application, you have far more secondary knowledge than the target user.

  • In a related vein, don't underestimate how little secondary knowledge a user might have.

  • Finally, if you are developing an application, fully commit to that minimization of secondary knowledge. If you half-ass it, the resulting application will be much worse than if you decided to just develop a library or module. This is because you might be muddying user expectations of what they need to know. If you are honest with your users about expectations, that always makes for a better piece of software.

Designing an application as I defined it previously is quite a difficult task. When I started working on clpipe, I was astonished at how difficult it was to get to a point where users felt comfortable using it (they still don't, but that is neither here nor there). This category is really the most design intensive of the three, because it is all about putting yourself in the place of a user who, by definition, doesn't have the same level of knowledge you have. Think carefully, draft out your UX before you ever write a line of code, and have a number of beta testers!

Next week I will give some thoughts on how to think about developing libraries. These pieces of software are the opposite of an application, as they attempt to minimize primary knowledge cost at the price of requiring high secondary knowledge.

Cheers!

Teague

Software design is not software engineering

One of the issues I see in many open source software packages, particularly those developed by academics, is that they are designed very poorly. This is not to say they are not engineered well; that distinction is the point of this post. Every package or program I have used does what it says it will do, but oftentimes it is quite difficult to figure out how to make it do so. So today, I want to highlight some software I think is well designed, some I think is poorly designed, and talk a bit about how to design open source software with users in mind.

UX is undertaught

Most of these issues with software design come into play at the user interface level. User experience, or UX, is an aspect of software design that is not commonly taught in non-computer-science fields (in my field, quant psych, there was absolutely no training in it). The fundamental difference between a well designed software package and a poorly designed one, I believe, comes down to how much effort a user needs to make to use the package.

Before we get into the weeds with some examples, I want to highlight an excellent book about product design, "The Design of Everyday Things" by Donald A. Norman. It is a fantastic discussion of the psychology of product design (in fact it was originally titled "The Psychology of Everyday Things"), and I am planning on making it required reading for any software design course I teach. In Chapter 1, the author describes the design of the temperature controls in a refrigerator/freezer. There are two dials, one to control the cooling unit, and one to control the valve that partitions the cold air between the fridge and the freezer. This setup is absolutely reasonable from an engineering standpoint, as you have two distinct mechanisms to control. However, from a user standpoint it is quite difficult. The author shows the user guide for the fridge, with combinations such as A on the first dial and 5 on the second for normal settings, C-7 for a colder freezer, and so on. To make matters worse, the dials are labeled "Freezer" and "Fresh Food". Again, from an engineering perspective this is a fine choice, but from a design perspective it is counter-intuitive and difficult to use. This tension between fidelity to the actual mechanisms and user experience is a constant in software design, and it requires careful thought and planning to create a package that performs complex tasks while being easy to use and understand. Let's go through a couple of examples.

Simple vs. Comprehensive

For readers who are in psychology, or indeed any social science, you have probably heard of structural equation modeling, or SEM. SEM is a powerful tool for modeling relations between sets of variables, as well as latent constructs, but it requires the use of specialized programs. Two of these programs highlight two approaches to user experience that in my opinion are equally valid, which I will term the simple approach and the comprehensive approach.

Lavaan, simple syntax for complex problems.

Lavaan is an open source R package (found here) for structural equation modeling that is extraordinarily usable, particularly for actual data analysis tasks (as opposed to simulation studies). The user experience follows the simple approach, in that while you can customize any aspect of the model you want, the package is fairly intelligent in choosing correct defaults. That means you can quickly specify models and get to interpreting results. Below is the code for running a two factor model, where the two latent factors, f1 and f2, are correlated and have three indicators each.

library(lavaan)

model <- "
  f1 =~ x1 + x2 + x3
  f2 =~ y1 + y2 + y3
  f1 ~~ f2
"
model.fit <- sem(model = model, data = data)

To me, this is quite intuitive. Latent variables are defined as being "equal" to a combination of indicators, and covariances are specified using "~~". All observed variables correspond to variables in the dataset "data". Of course, if you have not been steeped in SEM for years, this might be less intuitive. But you also have to think about who the users are going to be. Is lavaan going to be used by undergraduate lab assistants taking their first statistics course? I hope not. But a graduate student or advanced undergraduate who has seen SEM before, or a faculty member who took SEM many years ago and now has a need to use it? This design will likely work very well for them.

The key to lavaan's simplicity is sensible defaults, a term that I will likely use over and over again in this blog. Here, the sensible default is to use the first indicator of each latent factor as the scaling indicator. Lavaan automatically sets the loadings of x1 and y1 to 1, which puts the latent factors on the same scale as x1 and y1 respectively. Note that the user did not need to specify this at all; rather, the package has it as a default. Lavaan is also very flexible. For example, this is the argument list for the sem function:

sem(model = NULL, data = NULL, ordered = NULL, sampling.weights = NULL,
    sample.cov = NULL, sample.mean = NULL, sample.th = NULL,
    sample.nobs = NULL, group = NULL, cluster = NULL,
    constraints = "", WLS.V = NULL, NACOV = NULL,
    ...)

There are a number of different options, any of which can completely change how the model is estimated. In this way, lavaan simplifies the life of the user by only asking for direction when the user explicitly needs more complex options. It provides an intuitive model specification syntax that allows for quite a bit of flexibility while remaining easy to use for simpler models. Finally, the capabilities of lavaan are quite extensive. Overall, lavaan is a very well designed and engineered package and is a joy to use. Now, let's take a look at a program that embodies the comprehensive approach to design, LISREL.
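For instance, if you would rather standardize the latent variables than use the default marker-indicator scaling, a single extra argument does it (a minimal sketch reusing the model from above; std.lv = TRUE fixes the latent variances to 1 and frees all loadings):

model.fit.std <- sem(model = model, data = data, std.lv = TRUE)

The default never got in the way, and the option was there the moment it was needed.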

LISREL, the Brussels sprouts of SEM.

I have to admit, I love both LISREL and Brussels sprouts. My SEM training was all done in LISREL, and I credit that with developing a deeper understanding of the modeling process behind SEM. However, while it might be good for you to learn LISREL (or eat Brussels sprouts), you might not necessarily enjoy it. This is because LISREL takes a comprehensive approach to its user interface. Newer versions of LISREL have GUIs and a simplified syntax called SIMPLIS, but here I am going to focus on the classic LISREL syntax. LISREL is proprietary software, which you can find here.

I am not going to go into the full specification of a model, which involves file input and a couple of other things. Instead, here is the necessary syntax for specifying the previous two factor model (at least, I think; I don't use LISREL anymore, as it is proprietary, so if you see errors, reach out and I am happy to correct them!):

LK
f1 f2
VA 1.0 LX(1,1) LX(4,2)
FR LX(2,1) LX(3,1) LX(5,2) LX(6,2) PH(2,1)

Let's break this down. LISREL uses matrix notation for SEM, which means the commands above explicitly free or fix parameters corresponding to elements of the model matrices. The "LK" command simply labels our two exogenous latent variables, "f1" and "f2". "VA" fixes the loadings of our scaling indicators to 1.0 in the lambda-x matrix, which is a 6x2 matrix of factor loadings; so LX(1,1) corresponds to x1 loading onto f1 with a value of 1. "FR" stands for free parameters, and is how we specify which indicators load onto which factor. The PH(2,1) frees the covariance between the factors "f1" and "f2".

As you might be thinking, this type of user interface is not the easiest for the uninitiated to use. But it is comprehensive. Unlike lavaan, which abstracts the model specification away from the specific model matrices, LISREL requires you to completely specify the model in its mathematical form. This corresponds much more closely to the actual "engineering" of SEM. I still consider this good design, because 1) it is comprehensive, allowing the user to access every possible part of the model, and 2) it makes no claim to simplicity. The worst design occurs when you try and fail to simplify a complex process. Lavaan succeeds in simplification, while LISREL doesn't attempt to simplify.

In determining whether the design of a piece of software is good or not, we also have to consider the intended audience. I believe that both lavaan and LISREL succeed with different audiences. Lavaan is accessible to new researchers, while LISREL is built for people who know the matrix algebra like the back of their hand. While somebody might not pick up LISREL as quickly as lavaan, both are excellent examples of consistent user designs that provide a map from mechanism to interface, albeit for different audiences.

Now let's get into a couple of objectively bad design choices.

Bad defaults and inconsistent arguments

One thing guaranteed to result in user difficulties is when your default options are in some sense "bad." This is best illustrated with an example. In my own research, I use an R package called igraph quite a bit. This package can do everything and anything with networks and is well engineered as well, in that it runs quickly and I trust the results that come out of it. I use it, I will continue to use it, and I suggest its use to colleagues. It does have some design issues, and the one I want to highlight is the "diag" argument.

When you create a network in igraph, you are often converting an adjacency matrix or edge list into an igraph graph object. For adjacency matrices, the command looks like this:

net <- graph.adjacency(your_matrix, mode = "directed", diag = FALSE)

What I want to focus on is the "diag" argument. This argument, when true, considers elements on the diagonal of the adjacency matrix (which correspond to self loops) as valid. The issue is that in a vast number of network applications, self loops are nonsensical. For example, in a social network, a self loop of a friendship nomination is nonsense, and in a correlation matrix, the 1's on the diagonal are not meaningful. The key issue is that "diag" is set to TRUE by default, and this has massive impacts on the results of algorithms down the line. The reason I bring this up as a prime example of poor design is that this is a default option where the default is not the common use case, and the choice of TRUE vs. FALSE has large consequences for subsequent analyses. This is not only annoying to work with, it is quite dangerous. I once had to redo several months' worth of simulations because of this very option.
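To make the consequence concrete, here is a minimal sketch using a tiny correlation-style matrix (graph_from_adjacency_matrix is the current name for graph.adjacency):

library(igraph)

adj <- matrix(1, nrow = 3, ncol = 3)  # correlation-like: 1's on the diagonal

g_default <- graph_from_adjacency_matrix(adj, mode = "undirected")                # diag = TRUE by default
g_clean   <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)

ecount(g_default)  # 6 edges: the 3 real ties plus 3 self loops
ecount(g_clean)    # 3 edges: only the real ties

Degree, centrality, and community detection results can all differ between those two graphs, which is exactly how a quiet default turns into months of redone simulations.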

Now, a caveat: we have to consider the user base. igraph is a general R package for graph theoretic analysis, used in both the social and hard sciences. It might be the case that in physics and biology, self loops are the norm (I am fairly certain they are not, but let us give the benefit of the doubt here). If that were the case, then okay, I can see an argument for having this default. However, a much, much better choice would be to not have a default at all. Force the user to explicitly decide whether or not to use the diagonal. Don't ever make large decisions for your user without asking them first.

The previous example was of what I consider dangerously poor design, while the next is what I consider annoyingly poor design. Consider the two following functions for plotting a network:

plot_network(network, vertex.size, edge.size,
             vertex.color, edge.color, vertex.label, edge.labels)

plot(network, radius, labels, color, e.width, e.color, e.labs)

Which one is easier to work with? I hope you will agree with me and say it is the first one. The issue with the second one is not that it won't work, but that the arguments are inconsistent and vague. What does "radius" refer to? Likely the radius of individual nodes, but it could refer to the radius of the whole network. "labels" and "e.labs" likely refer to node and edge labels, but it is a) unclear and b) inconsistent. You will likely have to look at the documentation for the second function every time you use it, and that could have been avoided with an argument naming scheme that is internally consistent and matches user expectations.

Good design is about knowing the user

So what is my point? Whenever I use software, I make the good faith assumption that it does what it says on the box. The thing that drives me toward or away from a given package is ease of use. Now, I am a bit of a specialized user, in that I live in software and scientific programming, and my interface needs are very different from most. But that highlights my last point: when you are designing software, the target user is never you. As you build your program or package, make sure you are thinking about how other people will use it, not how to make it intuitive for yourself. Let me wrap up with some summary suggestions:

  1. Know your user - Is this software going to be used by other quanty people, or is it going to be used by applied researchers? What can you assume about the skill level of your user base? How much complexity can your users handle?

  2. Make it simple or make it comprehensive - Either spend the time to make the default options and workflow simple, or explicitly tell the user they need to specify everything. Don't try to split the difference, making parts of the package easy while other parts require extensive user input.

  3. Sensible defaults, but no dangerous defaults - It’s alright to put in sensible default options, but make sure that they are clearly labeled, and that the consequences of changing them are laid out. Don’t have default options that violate norms/expectations, and that have major consequences in other sections of the program.

  4. Be consistent - Decide on how your user will interact with your software, and stick with a single general schema. Don’t change gears mid workflow. Keep in mind the cognitive effort a user has to make to learn new interfaces. They should struggle with their problems, not with your software.

UX is an underappreciated aspect of open source software development, and I hope this post gave some reasonable suggestions! As always, get in touch if I made any errors or if you want to guest post about scientific software development!

About this blog

Over the past few years, I have been working on a variety of open source software packages, some my own and some I contribute to. Throughout that process, I learned about software development best practices, code optimization techniques, user experience design, and a variety of other topics that I never received formal training in. At the same time, I have seen more and more interest in software development in my field, quantitative psychology, both from early career researchers and from researchers further in their career. One of the inspirations for starting this blog was seeing the various job ads come out in quant, many of which mentioned software development experience as a positive.

This is a big change from the perception of software development early in my graduate education, and a very welcome one. However, what I haven't seen is formal training in scientific software development aimed at quantitative researchers. Best practices, tips and tricks, and sticking points all tend to be shared informally. As such, this blog is a very small effort to get some information out about scientific programming for methodologists. I am planning on having the majority of posts be about issues that arise in open source software development, with maybe the occasional diversion into more statistical topics. I certainly hope that at some point people will want to guest post, and please, if you have a topic or some advice you want to give about scientific programming, send me an email!

Cheers,

Teague