WARC and WORM Digital Storage: Web Archiving Essentials

Hanzo

WARCs? WORMs? Is this a lost installment of the Lord of the Rings? Unfortunately, no. Rather, WARC and WORM are abbreviations that you should be familiar with if you’re maintaining digital archives for regulatory compliance, especially in the financial services industry.

WARC is a file format for web archives, while WORM describes a type of storage that’s protected from overwriting.

Let’s break these terms down a bit more and review why they’re important for archiving online communications and messages that live on your website and social media channels.

WEB ARCHIVING WITH WARC FILES

WARC, logically enough, stands for Web ARChive, a file format that fully captures the content of a website. Why is capturing the entire content of a website important? That’s a two-part answer.

For starters, both the Financial Industry Regulatory Authority (FINRA) and the Securities and Exchange Commission (SEC) require that you create and maintain archives of all your business communications, regardless of where they occur.

The supervision and record-keeping provisions of FINRA and the SEC explicitly include online communication: FINRA Regulatory Notice 10-06 extended the standard business records requirements to communications that occur via social media, while FINRA Regulatory Notice 17-18 clarified that those requirements apply to messages sent via text or chat applications as well. The bottom line is that all customer communications, whether they occur on Facebook and LinkedIn, over email, on your website, or another digital platform, need to be retained for compliance oversight.

Second, when archiving online content, you need to capture all of it, not just the “easy” parts.

Sure, you could take a screenshot of your website or save it as a PDF. That might look similar at first glance and might (or might not) display all of the words that are on the screen. But your website is almost certainly more complex and dynamic than a static image could ever capture. That’s because typical website design includes interactive components such as introductory videos, photo and text carousels, and fillable calculators.

The Ameriprise website is a good example: its menu options are interactive, only appearing via mouse-over selection, and it includes a “confident retirement” tool (shown below) with choose-your-own-adventure buttons, where the accompanying text changes depending on your selection.

[Image: Ameriprise website “confident retirement” tool, 2019 screenshot]

Will a still-frame PDF adequately convey all of the communication happening on that website? No way.

It’s even worse with social media: how can you prove that the company itself didn’t “like” a client’s positive comment on its post, thereby adopting that comment, if you can’t investigate, within your archive, who responded and how? For that matter, how can you supervise the activity of associated persons if you can’t navigate your archives to explore the full context of a conversation? Oh, that’s right. You can’t. This is where the WARC file format comes into play.

WARC files are generated from a web crawl, in which software “crawls” through every link and component on a webpage and downloads that content along with its structure and description.

WARC files also store full supporting metadata, which allows confirmation that the archive was captured in its original format on a specific date. Each component on the page is captured as its own record within a WARC file, and together those records preserve not only what the content should include but also what it should look like and how it should respond to user interactions.
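To make the idea of capture metadata concrete, here is a minimal sketch of how one captured resource might be serialized in the WARC format, where each resource is stored as a record with its own header block. It uses only the Python standard library and is a simplified illustration, not Hanzo's implementation; the function name `build_warc_record` and the sample URL are hypothetical, and real crawlers write additional records (such as `warcinfo` and `request`) and usually compress the output.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_payload: bytes, captured_at: datetime) -> bytes:
    """Serialize one captured HTTP response as a simplified WARC 'response' record."""
    # Digest of the captured bytes, so a later reader can check nothing changed.
    digest = hashlib.sha256(http_payload).hexdigest()
    # WARC-Date is expressed in UTC; captured_at is assumed to be timezone-aware.
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {stamp}",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Block-Digest: sha256:{digest}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record is: header block, blank line, content block, two CRLFs.
    return head + http_payload + b"\r\n\r\n"


# Hypothetical captured response for illustration only.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>Hello</body></html>"
record = build_warc_record(
    "https://example.com/",
    payload,
    datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc),
)
# Show just the WARC header block (everything before the first blank line).
print(record.split(b"\r\n\r\n")[0].decode("utf-8"))
```

The capture date, the original URL, and the digest travel with the content itself, which is what lets someone later confirm when and from where a given record was captured.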

Those files can then be reassembled, creating a replica website that looks and operates exactly like the original site did. Hanzo's Time Machine will let you directly experience a WARC-file web archive and take it for a test-drive.
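As a rough sketch of that reassembly step, the snippet below reads records back out of an uncompressed WARC byte stream and indexes them by their original URL, which is the lookup a replay tool needs when a page asks for its stylesheet or images. The helper names (`read_warc_records`, `_sample_record`) and the sample URLs are hypothetical, and the parser assumes well-formed, uncompressed input; production archives are typically gzip-compressed and read with a dedicated WARC library.

```python
import io


def read_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        headers = {}
        # Header lines run until the first empty line.
        for line in iter(stream.readline, b"\r\n"):
            if not line:
                raise ValueError("truncated WARC record header")
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says exactly how many bytes belong to this record.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def _sample_record(uri: str, body: bytes) -> bytes:
    """Build a tiny 'resource' record for the demo (illustrative only)."""
    head = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    _sample_record("https://example.com/", b"<html>home</html>")
    + _sample_record("https://example.com/style.css", b"body { margin: 0 }")
)

# Index every captured component by the URL it was captured from.
index = {
    headers["WARC-Target-URI"]: block
    for headers, block in read_warc_records(io.BytesIO(sample))
}
stylesheet = index["https://example.com/style.css"]
```

A replay tool works along these lines: it answers each request the archived page makes from the index rather than from the live web.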

WARC archives can also be accessed from any platform and any operating system, and they’re future-proof. They’ll work just as well in five years or 10 years or even 30 years, which is critical when you have to retain records for a decade or more.

How do we know WARC-based archives will stand the test of time? Because the structure and function of WARC files are memorialized in ISO standard 28500:2017, a specification maintained by professional archivists through the International Internet Preservation Consortium (IIPC). They’re the archival format used by institutions that are in the business of maintaining records over the seriously long term, like the Library of Congress.

While WARC files aren’t technically required by FINRA or the SEC, they represent the best way to create compliant archives that enable full supervision of online communications and comprehensive record-keeping. For a more in-depth technical look at how Hanzo uses WARC files, check out our WARC FAQ. But there’s another component to fully compliant web archives, and this one isn’t optional: WORM storage.

CREATE IMMUTABLE ARCHIVES WITH WORM STORAGE

WORM—which stands for “write once, read many”—describes a type of file storage. Refreshingly, it means exactly what it says it means: WORM storage devices can be written on only once, but then their contents can be read many times.
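The write-once, read-many behavior can be sketched in a few lines of Python. This is only an illustration of the semantics, under the assumption of an ordinary file system: the function names and file name are hypothetical, and because a file owner or administrator could still change permissions, this is not compliant WORM storage on its own. Real WORM guarantees are enforced by the storage hardware or service.

```python
import os
import tempfile


def write_once(path: str, data: bytes) -> None:
    """Create `path` containing `data`; refuse if the file already exists."""
    # Mode "xb" is exclusive creation: open() raises FileExistsError
    # rather than truncating a file that is already there.
    with open(path, "xb") as f:
        f.write(data)
    # Drop write permission as a second, advisory guard.
    os.chmod(path, 0o444)


def read_many(path: str) -> bytes:
    """Read the stored bytes; this can be repeated any number of times."""
    with open(path, "rb") as f:
        return f.read()


archive_dir = tempfile.mkdtemp()
record_path = os.path.join(archive_dir, "capture-0001.warc")

write_once(record_path, b"first and only write")

try:
    write_once(record_path, b"attempted overwrite")
    overwritten = True
except FileExistsError:
    overwritten = False  # the second write is rejected

first_read = read_many(record_path)
second_read = read_many(record_path)
```

The second call to `write_once` fails instead of replacing the contents, while reads succeed as often as needed.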

If you remember the early days of recordable CDs, before rewriteable discs became common, you’ve experienced WORM storage. You had to figure out every song you wanted on your “1995 Summer Beach Music” mix before you started burning the CD, or else you’d have to start over with a new disc. Similarly, but with less Hootie and the Blowfish, the traditional tape archives used to back up business data operated in a write-once format that couldn’t be overwritten.

As we mentioned, WORM storage isn’t just a good idea: it’s a requirement under the SEC’s Rule 17a-4.

In subsection (f)(2)(ii)(A), the rule notes that records that are stored on “electronic storage media” (rather than microfilm or microfiche) must be “preserve[d] … exclusively in a non-rewriteable, non-erasable format.” This isn’t surprising, since the purpose of an archive is to ensure that a correct record is maintained. If an archive could be overwritten, edited, or “cleaned up” after the fact, the way a PDF file can be, it wouldn’t be much of an archive.

Most of the digital storage we encounter today is rewriteable, though: USB thumb drives, internal or external hard drives, rewriteable CDs (CD-RW), and typical cloud storage repositories like Dropbox. The data on these storage media can be readily written and overwritten until the media itself wears out.

That means these commonly used storage devices aren’t appropriate for archiving business communications.

Written by:

Hanzo