[co-author: Kartikeya Thakur]
How FTI Consulting “discovered” the key to “hieroglyphic” client data dating from the 1960s and turned it into a readable format for today’s computers.
It’s no stretch to say data plays a massive role in the way we work and live, whether it’s sending rockets into space or determining the best place to grab lunch at noon. In many ways, you could say data is the lingua franca of our modern world. However, like any language, the roots of today’s data trace back to humble, often-forgotten beginnings.
To understand that concept, let’s get some context. All data is spoken in “files.” Typically, these files are formatted a certain way — in rows and columns — so that modern-day computers can easily read and process the information. If you want to get technical about it, see the example below.
With present-day data files, all the rows in one file belong to one table. Each column has a fixed data type and character length, while rows are separated by new lines. Columns are separated either by fixed positions within a row or by delimiter characters (such as "~", "|", etc.).
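To make that concrete, here is a toy example (the field names and values are invented for illustration): a pipe-delimited file, and the few lines of Python it takes a modern computer to split it into rows and columns.

```python
# A toy pipe-delimited file: each new line is one row, "|" separates columns.
# The field names and values here are invented for illustration.
raw = """id|name|balance
1001|ALICE|250.75
1002|BOB|90.10"""

rows = [line.split("|") for line in raw.splitlines()]
header, records = rows[0], rows[1:]
print(header)   # the column names
print(records)  # each row as a list of fields
```

This is exactly why delimited files are so easy for today's software to read: the structure is spelled out in the characters themselves.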
This was not always the case. Back during data’s humble beginnings — around the 1950s and 1960s — computer scientists, and by extension computers themselves, needed a more “hands-on” approach to data. Often, a punched card would be used to represent digital data. The presence or absence of holes in predefined positions on these cards would tell the computer what data it should read.
Naturally, this led to the glorious creation of the Extended Binary Coded Decimal Interchange Code (EBCDIC). The what? The EBCDIC — an eight-bit character encoding that was used mainly on IBM mainframes and IBM midrange computer operating systems.
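Python happens to ship with codecs for common EBCDIC code pages — cp037, for example, is the code page historically used on US and Canadian IBM mainframes — so a quick sketch can show how differently EBCDIC and ASCII spell the same word:

```python
# The same word "Hello" encoded two ways.
ascii_bytes = "Hello".encode("ascii")
ebcdic_bytes = "Hello".encode("cp037")  # cp037 is a common EBCDIC code page

print(ascii_bytes.hex())             # 48656c6c6f
print(ebcdic_bytes.hex())            # c885939396
print(ebcdic_bytes.decode("cp037"))  # Hello
```

Open an EBCDIC file in a tool that assumes ASCII and you get exactly the kind of "hieroglyphs" described below: every byte maps to the wrong character.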
Flash forward to November 2020. FTI Consulting’s Applied Statistical Data Sciences group receives a set of massive files as part of multiple data productions for a client. Normally, this wouldn’t raise any eyebrows, but the files had no file extensions, a.k.a. the suffix at the end of a filename that indicates what type of file it is (think .pdf, .doc, etc.). Upon closer examination, the team uncovered what initially looked like a corrupted file.
After speaking with the client, the team learned that the “hieroglyphs” in front of them were, in fact, unprocessed EBCDIC files — 5.1TB of EBCDIC files, to be exact. That’s roughly equivalent to 45 billion punch cards full of data. Given that this data is so old and has been archived for so long, it is not surprising that the data owners never updated the files to be read by anything other than a mainframe. Why spend money updating something that will probably never be used anyway?
An Investigation That Goes Back Decades
Upon discovering the files’ true nature, FTI Consulting was left with a long string of numbers, stored as Variable Blocked (VB) data, that made little sense to modern computers, which expect Fixed Blocked (FB) data made up of standard keyboard characters. How could the team unlock its meaning?
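To give a flavor of what "Variable Blocked" means in practice, here is a minimal sketch, assuming the common IBM convention that each record begins with a 4-byte Record Descriptor Word (RDW) whose first two bytes give the record's length (including the RDW itself) in big-endian form, and that any Block Descriptor Words were already stripped during transfer. The sample bytes are invented for illustration.

```python
import struct

def split_vb_records(data: bytes):
    """Split IBM Variable Blocked data into individual records.

    Assumes each record starts with a 4-byte Record Descriptor Word (RDW):
    2 bytes of big-endian length (which counts the RDW itself), then
    2 reserved bytes. Block Descriptor Words are assumed already removed.
    """
    records = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        records.append(data[pos + 4:pos + length])  # skip the 4-byte RDW
        pos += length
    return records

# Two toy records: lengths 9 (5 payload bytes) and 7 (3 payload bytes).
sample = b"\x00\x09\x00\x00HELLO" + b"\x00\x07\x00\x00ABC"
print(split_vb_records(sample))  # [b'HELLO', b'ABC']
```

Notice there are no new lines anywhere: without the length prefixes, a modern text tool has no idea where one record ends and the next begins.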
The answer arrived on paper — that is, in the form of COBOL copybooks corresponding to each data file. COBOL, a computer language dating back to 1959, provided information about how the data in the EBCDIC files was organized into columns. It was like stumbling upon a veritable Rosetta Stone that could bridge the gap between the various data languages.
FTI Consulting determined it could build a conversion process that would take the EBCDIC data and translate the files into modern-day files that today’s computer systems could easily understand. Granted, creating the translated files was not easy. Each file needed to be converted a few characters at a time. By using Python, the powerful open-source programming language, FTI Consulting was able to create programs that translated all the files in just one week.
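The core idea can be sketched in a few lines. This is not FTI Consulting's actual code — the two-field layout below is invented, and real copybooks also describe packed decimals, signed fields, and far longer records — but it shows the shape of the translation: slice each fixed-width record by the widths the copybook documents, decode each slice from EBCDIC, and join the results with a delimiter.

```python
# Sketch: convert a fixed-width EBCDIC record to pipe-delimited text.
# The layout below is invented for illustration; in practice it would be
# derived from the COBOL copybook that documents each file.
LAYOUT = [("CUST-ID", 4), ("CUST-NAME", 8)]  # (field name, byte width)

def convert_record(record: bytes) -> str:
    fields = []
    pos = 0
    for _name, width in LAYOUT:
        chunk = record[pos:pos + width]
        fields.append(chunk.decode("cp037").strip())  # EBCDIC -> text
        pos += width
    return "|".join(fields)

# One toy 12-byte record: "1001" then "ALICE   ", encoded as EBCDIC (cp037).
record = "1001ALICE   ".encode("cp037")
print(convert_record(record))  # 1001|ALICE
```

Run over billions of records, character by character, this is the kind of translation that turned 5.1TB of mainframe "hieroglyphs" into files any modern system can read.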
Keep reading to discover what steps the team took to bring data’s dead language back to life.