Everything in Its Place: A Guide to Data Audits

After a bit of a hiatus due to starting at the University of Virginia, I’ve finally sat down and written the next post in my series on Data Management for Social Scientists. For those of you who missed the first two, you can check them out here and here! As always, this guide is based on my own experiences, and there are many ways to successfully and efficiently document a dataset. If I’ve missed anything, please feel free to let me know!


So, you have joined a new lab, started a new lab, or received a dataset from a collaborator, and you are looking forward to digging in. You quickly realize that because that new lab or that new dataset doesn’t look anything like what you are used to, you need to take time to better understand the data structure. This sounds like a good time to perform a Data Audit. Data auditing is a practice often used in corporate settings to evaluate the location, quality, and consistency of an organization’s databases, with a particular eye to how the data are being used. In an academic research setting, the overall goals of a data audit remain the same:

  1. Determine where the data are. In many cases, this is a simple question to answer. If a collaborator sends you a single CSV file with their data, you probably have a good idea of where that data is, but only if the data are complete, which brings us to our next goal.

  2. Determine if the data are complete. Studies, particularly in the social or biomedical sciences and particularly when dealing with human subjects, have extensive study design documentation (this is almost always a requirement for getting ethics approval for human subjects studies). This documentation tells you, the auditor, what should be present in the data that you were directed to.

  3. Determine if the data can be used for their specified purpose. In most studies, data will be analyzed, and this typically requires them to be formatted in a particular way. If, for example, the study collected freeform responses as a collection of .txt documents, this is less amenable to quantitative analyses than if those responses had been collected into a single tabular data file.

  4. Determine if the data follow good data management practices. It is one thing to identify where the data are and whether they are complete; in some cases, that portion of the data audit can be scripted. It is another thing entirely to determine whether the data follow good data management practices, or which data management principles the data structure violates.

The end goal of any audit is not to restructure the dataset. I want to repeat that: you, as the auditor, should not be changing how the data are managed. This even applies to heads of labs who want to perform their own data audit. If you change a data structure without full buy-in from the rest of the team, you will cause problems and might even make the data structure worse. Refactoring data is a distinct process, albeit one that is informed by the results of a data audit. The end goal of a data audit is the data audit report.


The Data Audit Report

A data audit report is a human readable document that describes the results of the data audit, identifies issues, and suggests a set of solutions. This is not scholarly work, and should be written as straightforwardly as possible. This is not a trivial requirement: many of you who have been asked to perform a data audit, or have planned one, likely have more computer science/data management experience than your colleagues, and if you are not careful, you might use more technical terminology than is useful. Remember, the goal of a data audit is not to create a document for you to reference (though this is a major advantage); it is to create a document that anybody can use to understand the dataset in question. Take for example the following scenario:

Scenario:

In performing a data audit of a longitudinal study, you find that the data from multiple timepoints are stored in wide format .SAV files. This makes them difficult to access using open source data analysis tools, and the wide format of the data makes it difficult to perform longitudinal modeling. You want to propose converting the master copy of the dataset to long format, writing a script that, when run, will produce a wide format datafile, and changing the file type to a common delimited file type, like a CSV. In your report you write:

Solution:

Convert wide to long, create reverse conversion script in R, change file format to CSV.

This is informative language, and if you handed me a report with that as a solution, I would be able to understand it. But that requires knowledge of wide/long formats and why one would use them, why one would create a reverse conversion script rather than simply creating an additional copy of the dataset, and why CSV is better than SAV as a file format. The solution to these issues is to divide the description of a solution from the implementation of said solution, and to add a rationale to the solution:

Solution:

First, the dataset needs to be converted from wide format (rows are subjects, columns are variable/timepoint combinations) to long format (one row per subject/timepoint/variable combination, with a single variable name column and a single value column), which would improve the ability of analysts to run longitudinal models on the dataset. However, as wide format is useful in computing summary statistics, a script needs to be created that will take the long format dataset and convert it over to a wide format dataset whenever necessary. The long format dataset acts as the immutable raw data, and the wide format dataset can be reconstructed whenever necessary. Finally, the long raw datafile should be stored in a delimited text format, such as a .csv, and accompanied by a JSON codebook.

Implementation Details:

  • Conversion from wide to long in R (reshape, or melt + cast)

  • Reverse conversion script written as a “sourceable” R script, hard coded to take the long data

  • One-time, non-automated conversion from .SAV to CSV via R and the foreign package

  • Codebook skeleton generated using R, filled in manually.

As you can see, while there is more writing, there are far more details, and the proposed solution can be evaluated by a non-technical researcher. The implementation details act as a guide for a technical researcher, with the aim of providing enough information that any reasonably experienced data manager could carry them out.
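As a concrete (if minimal) sketch of how one might carry out those details in R: every file, column, and variable name below (behav_wide.sav, subject_id, anxiety_t1, and so on) is a hypothetical placeholder, and the real script would need to match the dataset’s actual naming.

```r
## A minimal sketch of the implementation details above, under assumed names.
library(foreign)   # read.spss() for the one-time .SAV import
library(reshape2)  # melt()/dcast() for wide <-> long conversion
library(jsonlite)  # toJSON() for the codebook skeleton

# One-time, non-automated import of the wide-format .SAV file
wide <- read.spss("behav_wide.sav", to.data.frame = TRUE)

# Wide -> long: one row per subject/variable/timepoint combination,
# assuming time-varying columns are named like "anxiety_t1", "anxiety_t2"
long <- melt(wide, id.vars = "subject_id",
             variable.name = "var_time", value.name = "value")
parts <- do.call(rbind, strsplit(as.character(long$var_time), "_"))
long$variable  <- parts[, 1]
long$timepoint <- parts[, 2]
long$var_time  <- NULL

# The long file becomes the immutable raw data, stored as delimited text
write.csv(long, "behav_long.csv", row.names = FALSE)

# Codebook skeleton, generated once and then filled in manually
codebook <- lapply(unique(long$variable), function(v)
  list(variable = v, label = "", units = "", values = ""))
write(toJSON(codebook, pretty = TRUE, auto_unbox = TRUE), "behav_codebook.json")
```

The reverse conversion lives in its own “sourceable” script, so the wide file is always regenerated from the long raw data rather than edited directly:

```r
## long_to_wide.R: hard coded to take the long raw data and
## regenerate a wide format file whenever one is needed
library(reshape2)
long <- read.csv("behav_long.csv")
wide <- dcast(long, subject_id ~ variable + timepoint, value.var = "value")
write.csv(wide, "behav_wide_derived.csv", row.names = FALSE)
```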


How to Write a Data Audit Report

I have a certain structure I like to use when I perform a data audit. Broadly, it is broken into three main sections:

Summary of the Project

This is a high level summary of the project, and is mainly included so that future readers can understand the context of the dataset itself. If, for example, the dataset in question is from a large longitudinal neuroimaging study, this summary would describe what that study was about, and also describe the relevant aspects of the study design. For example, if this neuroimaging dataset contained 4 tasks, the relevant information is what those tasks are called, how many individual runs of each task there are in a given dataset, and any aspect of the task that might lead to uncommon datatypes (e.g., was physiology collected during a given task?). It would not be useful to include scientific information about the study design in this summary. From a data management perspective, it makes no difference if one task is an inhibitory control task and the other is a working memory task. That being said, this summary should point out where the actual study design documents are, so that the scientific information is accessible.

Data Locations

In the report, this section provides a high level overview of where all the data are. A machine readable file, preferably a spreadsheet, needs to be generated that contains a comprehensive list of files and a summary of their content, but this does not need to be contained in the written report itself.
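Generating that listing is easy to script. A minimal sketch in R, where the project root and output filename are hypothetical placeholders:

```r
## Build a machine readable inventory of every file under the project root
files <- list.files("path/to/project", recursive = TRUE, full.names = TRUE)
info  <- file.info(files)
inventory <- data.frame(
  path     = files,
  size_kb  = round(info$size / 1024, 1),
  modified = info$mtime,
  summary  = ""  # content summaries are filled in by hand as the audit proceeds
)
write.csv(inventory, "data_locations.csv", row.names = FALSE)
```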

I like to break this section out into meaningful divisions. For example, if you were auditing a study that had both baseline self report measures and ecological momentary assessment data, I would divide up my data locations into those two categories. Again, I wouldn’t structure this section on the basis of scientific similarity, e.g. Anxiety Measures (self report, EMA). This is because, presumably, the divisions you come up with group data that are similar in format, which is the relevant aspect for data management.

Data Completeness

This is a checklist of every aspect of the data that you expected to be present. There are two ways I like to identify what data are expected to be present. First, I look at the design documents, usually an IRB protocol or a grant application. These list all types of data collected, but don’t necessarily describe the data format. Next, I talk to the PIs, lab managers, and the RAs that run the study data collection itself. This is always an enlightening exercise, as there is usually a disconnect between what the PIs think has been collected (with respect to format) and what is actually collected and stored. If an aspect of the data is not present at all, then that needs to be noted. If data are missing for a subset of subjects, then that needs to be noted as well (this does not refer to statistical missingness; rather, it refers to how the dataset itself is stored).
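This is also the portion of the audit most amenable to scripting. A minimal sketch in R, where the subject IDs, session filenames, and data path are all hypothetical placeholders that should come from the design documents:

```r
## Check expected subject/file combinations against what is actually on disk
subjects <- sprintf("sub-%03d", 1:50)                  # expected subjects
expected <- c("behav_ses-01.csv", "behav_ses-02.csv")  # expected files each

present <- list.files("path/to/data", recursive = TRUE)
grid <- expand.grid(subject = subjects, file = expected,
                    stringsAsFactors = FALSE)
grid$path  <- file.path(grid$subject, grid$file)
grid$found <- grid$path %in% present

# Anything not found feeds directly into the report's completeness checklist
write.csv(grid[!grid$found, ], "completeness_gaps.csv", row.names = FALSE)
```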

Issues and Solutions

This is a list of issues that arose during the audit, and proposed solutions. This should be as specific as possible, with screenshots and references as needed. It should be immediately apparent upon reading an issue a) what the auditor thinks the issue is and b) that the evidence overwhelmingly points to that issue being a real concern.

I break issues down into red flags and yellow flags. Red flag issues are serious data integrity problems: e.g., a survey is not where it is expected to be, some aspect of the chain of data custody has been broken, or neuroimaging files are in an unusable format. There is no question that these problems need to be fixed right away, or at the very least brought to somebody’s attention. Unfortunately, these are usually the hardest issues to solve. For example, in a recent dataset I was working on, due to a series of drive failures on a workstation used to process neuroimaging data, all the neuroimaging data from that dataset were wiped clean. Fortunately we had backups, but we had only backed up the raw data, not the processed data that had taken a previous postdoc several months to produce. We only lost time, rather than losing data, but it was still problematic. As nobody had been looking at this dataset since the previous postdoc left, I was the one to detect this problem during my audit.

Yellow flag issues are a bit more of a touchy subject. These issues are ones that you have identified as sub-optimal. The problem with raising these issues, though, is that they are often due to the well-meaning practices of the people who collected the data and have worked with the data for years. You are effectively telling the PI, lab manager, and RAs: “In my opinion, you did this wrong, here is a better way of doing it.” Well, quite frankly, most folks in academia don’t appreciate that sort of thing, and so it pays to be, for lack of a better word, politic when raising these yellow flag issues. Here’s an example I’ve encountered a number of times:

SPSS is a commonly used statistical software package. I won’t fault it, it does what it says on the tin, but I personally cannot stand using it. The reason is that its native file storage format, the .SAV file, has a “proprietary” structure. These files can be opened in SPSS, but opening them in other software like R requires additional packages. More to the point, I cannot open a .SAV file in a text editor. I like files that can be opened in a text editor, if at all possible. It makes it so much quicker to look for problems, or to get an understanding of how a dataset is structured. I also make an effort to only use open source tools, so I don’t actually have a copy of SPSS installed anywhere.

Now anybody working in psychological research will have encountered these files. For me, storing data in a .SAV (or a .mat, or any other proprietary format) is a big yellow flag issue. But I guarantee you that telling your PI they need to stop using SPSS and switch over to a simple file format like .csv will not go over as well as you might think. Yes, if they made the switch YOU would work faster, because presumably you are interested in automating all of your data management processes. But if everybody else is working with SPSS, then they are just not going to want to make that switch suddenly. So instead of making that very harsh suggestion, I would approach it like so:

  1. Note the concern, and describe it: .SAV files are difficult to work with using most open source scripting languages.

  2. Lay out the long term solution: In the long term, .SAV files should be converted to .csv files, and item metadata stored as .json codebooks. 

  3. Suggest a shorter term improvement: In the meantime, all .SAV files should have their names standardized (e.g., behav_ses-01_parent.sav, behav_ses-01_child.sav), and all variable names should have a standardized structure (a naming check like the sketch after this list can flag nonconforming files).

  4. Note the advantages of this shorter term fix: Standardization would decrease analysis time and provide guarantees with respect to linking variables (variables that link cases across multiple datasets). 
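The naming check in step 3 is straightforward to script. A minimal sketch in R; the directory and the regular expression encode an assumed convention and would need to match whatever the team actually agrees on:

```r
## Flag .SAV files whose names do not follow the agreed convention
## (e.g., behav_ses-01_parent.sav); path and pattern are assumptions
savs <- list.files("path/to/data", pattern = "\\.sav$",
                   recursive = TRUE, ignore.case = TRUE)
pattern <- "^behav_ses-[0-9]{2}_(parent|child)\\.sav$"
nonconforming <- savs[!grepl(pattern, basename(savs))]
print(nonconforming)  # candidates for renaming before the long term conversion
```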

Foremost in your mind should be: how would this change in data structure improve the experience of everybody who will work with this data in the future, not just me? If you are performing a data audit, you are likely the most experienced data manager in the room, so these issues are things you know how to deal with on the fly. Your job is to smooth these issues over, so that less experienced analysts don’t get caught up on them.

Finally, I personally like to highlight things I liked about a dataset, green flags. I believe that you can’t really learn what is good practice if nobody points out what was done well, so I try to point out cases where I don’t see an issue in how the data is stored. Strictly speaking, this is not a requirement, but I’ve found it to be helpful in my own learning.


Closing Thoughts

So let’s return to the question: why perform a data audit? A good data audit produces a document that can be used to a) reference the dataset as it currently exists and b) guide a data refactor. The former is useful for anybody working with the dataset currently; the latter is useful to anybody who might take on the task of actually improving how the data are stored. A data audit, in my view, is a useful service to your colleagues in the lab or your collaborators. A well documented dataset is easier to work with than a poorly documented one, and a well structured and documented dataset is even better.