This post is about quality control of AERS data. It seems that every quarter there’s some type of SNAFU with the data release (last year FDA released AERS data partially contaminated with a previous quarter's data).
This quarter, we have the classic newline-characters-where-they-don’t-belong error that's screwing up my AERS parser.
A little background:
FDA releases its data in two forms (ASCII and SGML). ASCII is the one that I use, and each ASCII file consists of row after row after row of $-delimited adverse event records.
Two sample rows might look something like this:
$12345$abcdef$somestuff here$blah$more blah$blah
$34321$blahblah$doscum$etcetc$vixerunt$gaius$cicero
Each row should represent one particular database record and my parser dutifully goes through each row extracting all the little bits of information between the dollar $ign$.
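As a rough illustration of that extraction step, here is a minimal Python sketch. It is not my actual parser: the function name `parse_row` is made up for this example, and it assumes rows start with a delimiter the way the sample rows above do.

```python
def parse_row(line, delimiter="$"):
    """Split one $-delimited row into a list of field strings."""
    # Strip the line ending first so it does not end up in the last field.
    fields = line.rstrip("\r\n").split(delimiter)
    # A row that starts with the delimiter yields an empty first element;
    # drop it so fields[0] is the first real value.
    if fields and fields[0] == "":
        fields = fields[1:]
    return fields


row = "$12345$abcdef$somestuff here$blah$more blah$blah"
print(parse_row(row))
```

Each element of the returned list is one of the little bits of information that then gets stuffed into a database column.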
But in this latest quarterly release, FDA's Drug data file (aka DRUG08Q4.txt) contains 4 significant quality control errors (see sample screenshot below).
[For those who want the gruesome details, the following lines in the DRUG08Q4.txt file contain errors: 537-538, 258909-258910, 281285-281286, 408948]
The gist of the issue is that whoever entered the data for these 4 drug records forgot to remove stray line breaks (newlines, aka “carriage returns”) from inside the fields, and so each record is actually split across 2 or more lines.
This may not seem like a big deal, but a parser that isn’t “smart” will treat each fragment as a complete record and could inadvertently stuff the wrong data into the wrong slots in your database.
And so, you have to design your parser to look for these types of errors--and then you have to have a human look at the problem just to assure yourself that there wasn’t a bigger error. This wastes time...especially when the file you’re looking at has 416,000 records.
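One way to look for this type of error is to count delimiters on each line and glue short lines back together. The sketch below is only an illustration under assumptions: the function name `repair_split_records` and the `expected_delims` parameter are hypothetical, and you would have to supply the correct delimiter count for the file yourself (for example, by counting the delimiters on a known-good row).

```python
def repair_split_records(lines, expected_delims, delimiter="$"):
    """Yield (start_line_number, record, suspect) tuples.

    Physical lines with too few delimiters are joined with the lines that
    follow until the delimiter count reaches expected_delims. `suspect` is
    True for anything that was joined or has an unexpected count, so a
    human can review it.
    """
    buffer = ""
    start = None
    parts = 0
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if start is None:
            start = number  # first physical line of this record
        buffer += text
        parts += 1
        count = buffer.count(delimiter)
        if count < expected_delims:
            continue  # record looks incomplete; keep accumulating
        yield start, buffer, parts > 1 or count != expected_delims
        buffer, start, parts = "", None, 0
    if start is not None:
        # Leftover fragment at end of file: always flag it.
        yield start, buffer, True


sample = ["$1$a$b\n", "$2$c\n", "d$e\n", "$3$f$g\n"]
for line_no, record, suspect in repair_split_records(sample, expected_delims=3):
    if suspect:
        print(f"check line {line_no}: {record}")
```

This only handles breaks that land before the last delimiter of a record. A break inside the final field would leave a fragment that gets merged into the next row, though the merged row is still flagged as suspect. Either way, the flagged line numbers are the ones a human still has to eyeball.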
’Twould be nicer if FDA did more quality control on its data releases.