
Everything in Its Place: A Guide to Data Audits

After a bit of a hiatus due to starting at the University of Virginia, I’ve finally sat down and written the next post in my series on Data Management for Social Scientists. For those of you who missed the first two, you can check them out here and here! As always, this guide is based on my own experiences, and there are many ways to successfully and efficiently document a dataset. If I’ve missed anything, please feel free to let me know!


So, you have joined a new lab, started a new lab, or received a dataset from a collaborator, and you are looking forward to digging in. You quickly realize that because the new lab’s data or the new dataset doesn’t look anything like what you are used to, you need to take time to better understand the data structure. This sounds like a good time to perform a Data Audit. Data auditing is a practice often used in corporate settings to evaluate the location, quality, and consistency of an organization’s databases, with a particular eye to how the data are being used. In an academic research setting, the overall goals of a data audit remain the same:

  1. Determine where the data are. In many cases, this is a simple question to answer. If a collaborator sends you a single CSV file with their data, you probably have a good idea of where those data are, but only if the data are complete, which brings us to our next goal.

  2. Determine if the data are complete. Studies, particularly in the social or biomedical sciences and particularly when dealing with human subjects, have extensive study design documentation (this is almost always a requirement for getting ethics approval for human subjects studies). This documentation tells you, the auditor, what should be present in the data you were directed to.

  3. Determine if the data can be used for their specified purpose. In most studies, data will be analyzed, and this typically requires the data to be formatted in a particular way. If, for example, the study collected free-form responses as a collection of .txt documents, these are less amenable to quantitative analysis than if those responses had been collected into a single tabular data file.

  4. Determine if the data follow good data management practices. It is one thing to identify where the data are and whether they are complete; in some cases, that portion of the data audit can be scripted. It is another thing entirely to determine whether the data follow good data management practices, and which principles the current data structure violates.

The end goal of any audit is not to restructure the dataset. I want to repeat that: you, as the auditor, should not be changing how the data are managed. This even applies to heads of labs who want to perform their own data audit. If you change a data structure without full buy-in from the rest of the team, you will cause problems and might even make the data structure worse. Refactoring data is a distinct process, albeit one that is informed by the results of a data audit. The end goal of a data audit is the data audit report.


The Data Audit Report

A data audit report is a human readable document that describes the results of the data audit, identifies issues, and suggests a set of solutions. This is not scholarly work, and should be written as straightforwardly as possible. This is not a trivial requirement: many of you who have been asked to perform, or have planned, a data audit likely have more computer science/data management experience than your colleagues, and if you are not careful, you might use more technical terminology than is useful. Remember, the goal of a data audit is not to create a document for you to reference (though this is a major advantage); it is to create a document that anybody can use to understand the dataset in question. Take for example the following scenario:

Scenario:

In performing a data audit of a longitudinal study, you find that the data from multiple timepoints are stored in wide format .SAV files. This makes them difficult to access using open source data analysis tools, and the wide format of the data makes it difficult to perform longitudinal modeling. You want to propose converting the master copy of the dataset to long format, writing a script that, when run, will produce a wide format datafile, and changing the file type to a common delimited file type, like a CSV. In your report you write:

Solution:

Convert wide to long, create reverse conversion script in R, change file format to CSV.

This is informative language, and if you handed me a report with that as a solution, I would be able to understand it. But that requires knowledge of wide/long formats and why one would use them, why one would create a reverse conversion script rather than simply create an additional copy of the dataset, and why CSV is a better file format than SAV. The solution to these issues is to divide the description of a solution from the implementation of said solution, and to add rationale to the solution:

Solution:

First, the dataset needs to be converted from wide format (rows are subjects, columns are variable/timepoint combinations) to long format (rows are subject/timepoint combinations, and variables that differ over time are specified by a single variable name column and a single value column), which would improve the ability of analysts to run longitudinal models on the dataset. However, as wide format is useful for computing summary statistics, a script needs to be created that will take the long format dataset and convert it over to a wide format dataset whenever necessary. The long format dataset acts as the immutable raw data, and the wide format dataset can be reconstructed on demand. Finally, the long raw datafile should be stored in a delimited text format, such as a .csv, and accompanied by a JSON codebook.

Implementation Details:

  • Conversion from wide to long in R (reshape/melt+cast)

  • Conversion script written as “sourceable” in R, hard coded to take long data

  • Conversion to CSV one-time non-automated via R and the foreign package

  • Codebook generated using R, filled in manually.

As you can see, while there is more writing, there are far more details, and the proposed solution can be evaluated by a non-technical researcher. The implementation details act as a guide for a technical researcher, the aim being to provide enough information that any reasonably experienced data manager could carry them out.
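To make those implementation details concrete, here is a minimal sketch in R. It assumes hypothetical column names (a subject_id column plus variable/timepoint columns such as anx_t1, anx_t2) and a hypothetical filename, and it uses tidyr’s pivot functions, though reshape2’s melt() and dcast() would work just as well:

library(haven)   # read_sav() reads the .SAV master copy
library(tidyr)   # pivot_longer() / pivot_wider()

# zap_labels() strips SPSS value labels so the columns pivot cleanly
wide <- zap_labels(read_sav("master_wide.sav"))      # hypothetical filename

# Wide -> long: one row per subject / variable / timepoint
long <- pivot_longer(
  wide,
  cols      = -subject_id,
  names_to  = c("variable", "timepoint"),
  names_sep = "_",
  values_to = "value"
)
write.csv(long, "master_long.csv", row.names = FALSE)   # the new master copy

# Long -> wide: the "reverse conversion" script, sourced whenever
# wide format summary statistics are needed
rebuild_wide <- function(long_data) {
  pivot_wider(
    long_data,
    names_from  = c(variable, timepoint),
    values_from = value,
    names_sep   = "_"
  )
}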


How to Write a Data Audit Report

I have a certain structure I like to use when I perform a data audit. Broadly, it is broken into three main sections:

Summary of the Project

This is a high level summary of the project, and is mainly included so that future readers can understand the context of the dataset itself. If, for example, the dataset in question is from a large longitudinal neuroimaging study, this summary would describe what that study was about, and would also describe the relevant aspects of the study design. For example, if this neuroimaging dataset contained 4 tasks, the relevant information is what those tasks are called, how many individual runs of each task there are in a given dataset, and any aspect of the tasks that might lead to uncommon datatypes (e.g., was physiology collected during a given task?). It would not be useful to include scientific information about the study design in this summary. From a data management perspective, it makes no difference if one task is an inhibitory control task and the other is a working memory task. That being said, this summary should point out where the actual study design documents are, so that the scientific information is accessible.

Data Locations

In the report, this section provides a high level overview of where all the data is. A machine readable file, preferably a spreadsheet, needs to be generated that contains a comprehensive list of files and a summary of their content, but this does not need to be contained in the written report itself.
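As a sketch of what generating that machine readable inventory might look like in R (the project path, columns, and output name here are placeholders rather than a prescription):

files <- list.files("path/to/project", recursive = TRUE, full.names = TRUE)
info  <- file.info(files)

inventory <- data.frame(
  path          = files,
  extension     = tools::file_ext(files),
  size_bytes    = info$size,
  last_modified = info$mtime,
  summary       = ""    # filled in by hand: what each file actually contains
)

write.csv(inventory, "data_audit_file_inventory.csv", row.names = FALSE)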

I like to break this section out into meaningful divisions. For example, if you were auditing a study that had both baseline self report measures and ecological momentary assessment data, I would divide the data locations into those two categories. Again, I wouldn’t structure this section on the basis of scientific similarity, e.g. Anxiety Measures (self report, EMA). This is because, presumably, the divisions you come up with group data that are similar in format, which is the relevant aspect for data management.

Data Completeness

This is a checklist of every aspect of the data that you expected to be present. There are two ways I like to identify what data are expected to be present. First, I look at the design documents, usually an IRB protocol or a grant application. These list all types of data collected, but don’t necessarily describe the data format. Next, I talk to the PIs, lab managers, and the RAs that run the study data collection itself. This is always an enlightening exercise, as there is usually a disconnect between what the PIs think has been collected (with respect to format) and what is actually collected and stored. If an aspect of the data is not present at all, then that needs to be noted. If data are missing for a subset of subjects, then that needs to be noted as well (this is not referring to statistical missingness; rather, it refers to whether the data are present in storage at all).
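Parts of this check can often be scripted once you have the roster of expected subjects. A minimal, hypothetical sketch in R (the ID format, folder, and filename pattern are stand-ins for whatever the study actually uses):

expected_ids <- sprintf("s%03d", 1:120)               # from the roster / design docs
ema_files    <- list.files("data/ema", pattern = "\\.csv$")
present_ids  <- unique(sub("^(s[0-9]{3}).*$", "\\1", ema_files))

missing <- setdiff(expected_ids, present_ids)
if (length(missing) > 0) {
  message("EMA data missing for: ", paste(missing, collapse = ", "))
}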

Issues and Solutions

This is a list of issues that arose during the audit, and proposed solutions. This should be as specific as possible, with screenshots and references as needed. It should be immediately apparent upon reading an issue a) what the auditor thinks the issue is and b) that the evidence overwhelmingly points to that issue being a real concern.

I break issues down into red flags and yellow flags. Red flag issues are serious data integrity problems: a survey is not where it is expected to be, some aspect of the chain of data custody has been broken, neuroimaging files are in an unusable format, and so on. There is no question that these problems need to be fixed right away, or at the very least brought to somebody’s attention. Unfortunately, these are the issues that are usually the hardest to solve. For example, in a recent dataset I was working on, a series of drive failures on a workstation used to process neuroimaging data wiped out all of the neuroimaging data from that dataset. Fortunately we had backups, but we had only backed up the raw data and not the processed data that had taken a previous postdoc several months to produce. We only lost time, rather than losing data, but it was still problematic. As nobody had been looking at this dataset since the previous postdoc left, I was the one to detect the problem during my audit.

Yellow flag issues are a bit more of a touchy subject. These are issues that you have identified as sub-optimal. The problem with raising them, though, is that they are often due to the well meaning practices of the people who collected the data and have worked with it for years. You are effectively telling the PI, lab manager, and RAs: “In my opinion, you did this wrong, here is a better way of doing it.” Well, quite frankly, most folks in academia don’t appreciate that sort of thing, and so it pays to be, for lack of a better word, politic when raising these yellow flag issues. Here’s an example I’ve encountered a number of times:

SPSS is a commonly used statistical software package. I won’t fault it, it does what it says on the tin, but I personally cannot stand using it. The reason is that its native file storage format, the .SAV file, has a “proprietary” structure. These files can be opened in SPSS, but opening them in other software like R requires additional packages. More to the point, I cannot open a .SAV file in a text editor. I like files that can be opened in a text editor, if at all possible. It makes it so much quicker to look for problems, or to get an understanding of how a dataset is structured. I also make an effort to only use open source tools, so I don’t actually have a copy of SPSS installed anywhere.

Now, anybody working in psychological research will have encountered these files. For me, storing data in a .SAV (or a .mat, or any other proprietary format) is a big yellow flag issue. But I guarantee you that telling your PI they need to stop using SPSS and switch over to a simple file format like .csv will not go over as well as you might think. Yes, if they made the switch YOU would work faster, because presumably you are interested in automating all of your data management processes. But if everybody else is working with SPSS, then they are just not going to want to make that switch suddenly. So instead of making that very harsh suggestion, I would approach it like so:

  1. Note the concern, and describe it: .SAV files are difficult to work with using most open source scripting languages.

  2. Lay out the long term solution: In the long term, .SAV files should be converted to .csv files, and item metadata stored as .json codebooks (see the sketch after this list).

  3. Suggest a shorter term improvement: In the meantime, all .SAV files should have their names standardized (e.g., behav_ses-01_parent.sav, behav_ses-01_child.sav), and all variable names should follow a standardized structure.

  4. Note the advantages of this shorter term fix: Standardization would decrease analysis time and provide guarantees with respect to linking variables (variables that link cases across multiple datasets). 
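As a sketch of what the long term fix might look like, using the haven and jsonlite packages in R (the file name comes from the example above, and the codebook fields are just one reasonable starting point):

library(haven)      # read_sav() preserves SPSS variable and value labels
library(jsonlite)

dat <- read_sav("behav_ses-01_parent.sav")

# The data values go to a plain delimited file
write.csv(dat, "behav_ses-01_parent.csv", row.names = FALSE)

# Variable labels and value labels become a codebook skeleton,
# to be reviewed and filled in by hand afterwards
codebook <- lapply(dat, function(v) {
  list(
    label        = attr(v, "label", exact = TRUE),
    value_labels = as.list(attr(v, "labels", exact = TRUE))
  )
})
write_json(codebook, "behav_ses-01_parent_codebook.json",
           pretty = TRUE, auto_unbox = TRUE)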

Foremost in your mind should be: how would this change in data structure improve the experience of everybody who will work with these data in the future, not just me? If you are performing a data audit, you are likely the most experienced data manager in the room, so these issues are things you know how to deal with on the fly. Your job is to smooth these issues over, so that less experienced analysts don’t get caught up on them.

Finally, I personally like to highlight things I liked about a dataset, green flags. I believe that you can’t really learn what is good practice if nobody points out what was done well, so I try to point out cases where I don’t see an issue in how the data is stored. Strictly speaking, this is not a requirement, but I’ve found it to be helpful in my own learning.


Closing Thoughts

So let’s return to the question: why perform a data audit? A good data audit produces a document that can be used to a) reference the dataset as it currently exists and b) guide a data refactor. The former is useful for anybody working with the dataset currently, the latter useful to anybody who might take on the task of actually improving how the data is stored. A data audit, in my view, is a useful service to your colleagues in the lab or your collaborators. A well documented dataset is easier to work with than a poorly documented one, and a well structured and documented dataset is even better.

Data Management for Researchers: Three Tales

This is part one of a series of blog posts about best practices in data management for academic research situations. Many of the issues/principles I talk about here are less applicable to large scale tech/industry data analysis pipelines, as data in those contexts tend to be managed by dedicated database engineers/managers, and the data storage setup tends to be fairly different from an academic setting. I also don’t touch much on sensitive data issues, where data security is paramount. It goes without saying that if good data management practices conflict with our ethical obligations to protect sensitive data, ethics should win every time.


Good Data Management is Good Research Practice 

Data are everywhere in research. If you are performing any sort of study, in any field of science, you will be generating data files. These files will then be used in your analyses and in your grant applications. Given how important the actual data files are, it is a shame that we as scientists don’t get more training in how to manage our data. While we often have excellent data analysis training, we usually have no training at all in how to organize data, how to build processing streams for our data, and how to curate and document our data so that future researchers can use it. 

However, researchers are not rewarded for being excellent at managing data. Reviewers are never going to comment on how beautifully a dataset is documented, and you will never get back a summary statement from the NIH commenting on your choice of comma delimited vs. tab delimited files. Quite frankly, if you do manage your data well, your reward will be the lack of comments. You will know you did well if you can send a new collaborator a single link to your dataset, and they respond with, “Everything looks like it is here, and the documentation answered all of my questions.” So, given that lack of extrinsic motivation, why should you take the time and effort to learn and practice good data management? Let me illustrate a couple of examples of why good data management matters. All of the examples below are issues that arose in real life projects (albeit with details changed both for anonymity and to improve the exposition), and each one of them could have been prevented with better data management.


Reverse coded, or was it?

I once had the pleasure of working on a large scale study that involved subjects visiting the lab to take a 2 hour battery of measures. This was a massive collection effort, which fortunately, as is so often the case with statisticians, I got to avoid, coming in once the data were already collected. This lab operated primarily in SPSS, which, for those not familiar with this software, is a very common statistical analysis package used throughout the social sciences. For many, many years, SPSS was the primary software that everybody in psychology was trained on, and to its credit, it is quite flexible, has many features, and is easy to use. The reason it is so easy to use, however, is that it is a GUI based system, where users specify statistical analyses through a series of dialog boxes. Layered under this is a robust syntax system that users can access; however, this syntax is not a fully featured scripting language like R, and is, to put it mildly, difficult to understand.

In this particular instance, I was handed the full dataset and went about my merry way doing some scale validation. But then I ran into an issue. A set of items on one particular measure were not correlating in the expected direction with the rest of the items. In this particular measure, these items were part of a subscale that was typically reverse coded. The issue was, I couldn’t determine if the items had already been reverse coded! There were no notes, and the person who prepared the data couldn’t remember what they did, and couldn’t find any syntax. Originally, I was under the impression that the dataset I was handed was completely raw, but as it turns out it had gone through 3 different people, all of whom had recomputed summary statistics, scale scores, and made other changes to the dataset. Because we couldn’t determine if the items were reverse scored, we couldn’t use the subscale, and this particular subscale was one of the most important ones in the study (I believe it was listed as one of the ones we were going to analyze in the PI’s grant, which meant we had to report those results to the funding agency.) 

After a solid month of trying everything I could to determine if the items were reverse scored or not, I ran across a folder from a graduate student who had since left the lab. In that folder, I found an SPSS syntax file, which turned out to be the syntax file used to process this specific version of the dataset. The only reason I could determine that is that, at the end of the syntax file, the data were output to a file named identically to the one I had.

Fortunately, this story had a happy ending in terms of data analysis, but the journey through the abyss of data management was frustrating. I spent a month (albeit on and off) trying to determine if items were reverse coded or not! That was a great deal of time wasted! Now, many of you might be thinking, why didn’t I go back to the raw data? Well, the truly raw data had disappeared, and the dataset I was working with was considered raw, so verifying against the raw data was impossible. 

What I haven’t mentioned yet, is that this was my very first real data analysis project, and I was very green to the whole data management issue. This was a formative experience for me, and led me to switch entirely over to R from SPSS, in part to avoid this scenario in the future! This situation illustrated several violations of good data management practices (these will be explained in depth in a future post):

  • The data violated the Chain of Processing rule, in that nobody could determine how the dataset I was working with was derived from the original raw data.

  • It violated the Document Everything rule, in that there was no documentation at all, at least not for the dataset itself. The measures were well documented, but here I am referring to how the actual file itself was documented. 

  • The data management practices for that study as a whole violated the Immutable Raw / Deletable Derivatives rule, in that the raw data was changed, and if we had deleted the data file (which was a derivative file) I was working with, we would have lost everything.

  • It partially violated the Open Lab Access rule, in that the processing scripts were accessible to me, but were in the student’s personal working directory rather than saved alongside the datafile itself.

This particular case is an excellent example of data rot. Data rot is what happens when many different people work with a collection of data without a strong set of data management policies put in place. What happens is that, over time, more and more derivative files, scripts, subsets of data files, and even copies of the raw data are created as researchers work on various projects. This is natural, and with good data management policies in place, not a problem overall. But here, data rot led to a very time consuming situation. 

Data rot is the primary enemy that good data management can help to combat, and it is usually the phenomenon that causes the largest, most intractable problems (e.g., nobody can find the original raw data). It is not the only problem that good data management practices can defend against, as we will see in the next vignette.


Inconsistent filenames make Teague a dull boy.

I am often asked to help out with data kludging issues, which is to say I help collaborators and colleagues get their data into a form that they can work with. In one particular instance, I was helping a colleague compute a measure of reliability between two undergraduates’ data cleaning efforts. I was given two folders filled with CSV files, a file per subject per data cleaner, and I went about writing a script to match the file names between both folders and then compute the necessary reliability statistics. When I went to run my script, it threw a series of errors. As it turns out, sometimes one data cleaner would have cleaned a subject’s datafile while the other data cleaner missed that subject, which is expected. So I adjusted my script and ran it again. This time it ran perfectly, and I sent the results over to my colleague. They responded within 20 minutes to say that there were far fewer subjects with reliability statistics than they expected and asked me to double check my work. I went line by line through my script and responded that, assuming the filenames were consistent between the two data cleaners, my script picked out all subjects with files present in both cleaners’ folders.

Now, some of you readers might be seeing my mistake. I assumed that the filenames were consistent, and like most assumptions I make, it made a fool out of me. Looking at the file names, I found cases that looked like:

s001_upps_rewarded.csv vs. s001_ upps_rewarded_2.csv

Note the space after the first underscore and the _2 in the second file name. These sorts of issues were everywhere in the dataset. To my tired human eyes, they were minor enough that on a cursory examination of the two folders I missed them (though I did notice several egregious examples and corrected them). But to my computer’s unfailingly rigid eyes, these were not the same file names, and therefore my script didn’t work.

The reason this happened was that, when these particular data were collected, the RAs running the session had to manually type in the filename before saving it. Humans are fallible but can adjust for small errors, while computers will do exactly what you tell them to. In my case, there was nothing wrong with the script I wrote; it did exactly what I wanted it to do: ignore any unpaired files. The issue was that there was no guarantee about the file naming. So what data management principles did this case violate?

  • Absolute Consistency: The files were inconsistently named, which caused issues with how my script detected them.

  • Automate or Validate: The files were manually named, which means that there was no guarantee that they would be named consistently. Additionally, there was no validation tool to detect any violations of the naming convention (there is now; I had to write one, and a minimal version of that kind of check is sketched below).
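Such a validation tool does not need to be anything fancy. Here is a minimal sketch in R, using the naming convention from the example above (the folder names are placeholders):

check_names <- function(folder,
                        pattern = "^s[0-9]{3}_[a-z]+_[a-z]+\\.csv$") {
  files <- list.files(folder)
  bad   <- files[!grepl(pattern, files)]
  if (length(bad) > 0) {
    warning("Files violating the naming convention in ", folder, ":\n  ",
            paste(bad, collapse = "\n  "), call. = FALSE)
  }
  invisible(bad)
}

check_names("cleaner_1")
check_names("cleaner_2")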

Now, this was not a serious case. I didn’t spend a month of time fixing this issue, nor was it particularly difficult to fix. I did have to spend several hours of my time manually fixing the filenames, and any amount of time spent fixing data management issues is time wasted. This is because all data management issues can be preempted, or at the very least minimized, by using good data management principles. 


Lift your voices in praise of the sacred text fmri_processing_2007.sh

In addition to behavioral data, much of my work deals with neuroimaging data, and in fact many of the ideas in this post came out of those experiences. Neuroimaging, be it EEG, MRI, fNIRS, or some other modality invented between the time this was posted and the time you read it, produces massive numbers of files. For example, one common way of storing imaging data is the DICOM format. DICOMs are not single files, but rather collections of files representing “slices” of a 3D image, along with a header that contains a variety of metadata. There might be hundreds of files in a given DICOM image, and multiple DICOMs can sit in the same folder. This is not necessarily an issue, as most software can determine which file goes with which image, but now imagine those files, their conversions into a better data format, and the associated behavioral data (usually multiple files per scan, and multiple scans per person), and you get a sense of my main issue with neuroimaging data: it can be stored in an infinite number of ways.

When I first started working with neuroimaging data, I was asked to preprocess a collection of raw functional MRI scans. Preprocessing is an important step in neuroimaging, because it a) corrects for a variety of artifacts and b) fixes the small issue of people having differently shaped brains (by transforming their brain images into what is known as a standard space). Preprocessing fMRI images involves quite literally thousands of decision points, and I wanted to see how the lab I received the data from did it. They proceeded to send me a shell script titled fmri_processing_2007.sh. The 2007 in the file title was the year it was originally written. This occurred in 2020. The lab I was collaborating with was using a 13 year old shell script to process their neuroimaging data.

As aghast as I was, I couldn’t change that fact, so I took the time to try to understand what processing steps were done, and I set the script running on my local copy of the dataset. It failed almost immediately. I realized that I had made the mistake of fixing what I considered to be issues in the file names and organization, even though I had attempted to do so in a way that wouldn’t break the script. After fixing the processing script, I managed to run it, and it completed processing successfully.

At the same time, I was working with a different neuroimaging group, and they requested processing code to run on their end. I sent over my modified script, as it was the only processing script I had on hand, and I felt I had made it generalizable enough that it should have handled most folder structures. I was severely mistaken. My folder structure looked something like this:

/project
    /fMRI/
        s001_gng_1.nii.gz
        … 
    /MPRAGE
        s001_mprage.nii.gz
        …

While the other lab’s folder structure looked like:

/project/
    s001/
        fMRI/
            gng_1.nii.gz
        MPRAGE/
            mprage.nii.gz

I had written my script to assume that the first component of a file name was the subject ID, which it was in my data. In the other lab’s data, however, the subject IDs were specified at the folder level. Obviously my script would not work without substantial alteration. I don’t think they ever did make those alterations.
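In hindsight, the script could have been written more defensively, for example by looking for the subject ID anywhere in the path, checking the filename first and then the enclosing folders. A hypothetical sketch in R (the s001-style ID pattern is just the convention from these examples):

get_subject_id <- function(path, id_pattern = "s[0-9]{3}") {
  # Check the filename first, then the enclosing folders, deepest first
  parts <- c(basename(path), rev(strsplit(dirname(path), "/")[[1]]))
  hits  <- regmatches(parts, regexpr(id_pattern, parts))
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

get_subject_id("/project/fMRI/s001_gng_1.nii.gz")   # "s001" from the filename
get_subject_id("/project/s001/fMRI/gng_1.nii.gz")   # "s001" from the folder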

There are two good data management principles violated here:

  1. Redundant Metadata: In the case of the other lab, the file names did not contain the subject information. What would have happened if those files were ever removed from the subject’s folder? 

  2. Standardization: This is more of a goal than a principle. Imagine if both the other lab and I had used a standardized way of storing our data and had written our scripts to fit. We would have been able to pass code back and forth without an issue, and that would have saved us both time and trouble.

Neither data rot nor human fallibility was to blame for these issues. In fact, both datasets were extremely consistently organized, and there were no mistakes with naming. We simply didn’t use the same data structure, and it is worthwhile to ask why. In this case, it was a simple case of inertia. Both I and the analysts at the other lab had scripts written for a given data structure. In my case, the scripts I had were handed down from PI to PI for years, until the original reason certain data design decisions were made faded from memory. I like to term this the sacred text effect. This usually occurs with code or scripts, but can occur with any practice. Usually the conversation goes like this:

You: Why is this data organized this way? 

Them: Because that is how my PI organized data when I was in graduate school, and besides all of our analysis scripts are designed for this data structure.

You: Would you consider changing over to a more standardized data structure? There are several issues with the current structure that would be easily fixable, and if we use this standard, we can share the data more freely as well as use tools designed for this data structure. 

Them: Sure, I guess, but could you fix our current scripts to deal with the new structure?

Suddenly you have signed up for more work! It is vital that labs do not get locked into suboptimal data management practices simply due to inertia. If a practice doesn’t work, or causes delays, take the time to fix it. It might take time now, but you will make that time up tenfold. A great example of this, and a major inspiration for this work, is the BIDS standard, a data structuring scheme for storing neuroimaging data.


These three cases illustrate the consequences of bad data management, but there are many more examples I could write about. To adapt a common idiom about relationships:

Every example of good data management is good in the same ways, but every example of bad data management is bad in its own unique way.

But again, it is important to point out that this is not due to any incompetence on the part of researchers. I’ve spoken and worked with many researchers who do not have the same technical background I do, and each one recognizes the issues inherent in bad data management practices (and can usually come up with a startling number of examples from their own work). It is simply that they were never trained in good data management, so they have had to figure everything out on their own, and these are very busy people. In the next post, I’ll lay out what I see as 8 principles of good data management for researchers. These principles are based on my experience in the social and biomedical sciences, and so they might not wholly apply to, for example, database management in a corporate setting.