If you’re a regular reader of Hanzo, we’ve probably already convinced you that you need to be archiving any social media profiles that your organization maintains, along with your website. Chances are you’ve heard us talk about the benefits of native-format archiving using WARC (short for Web ARChive) files too.
If you’re new here, both the Financial Industry Regulatory Authority (FINRA) and the Securities and Exchange Commission (SEC) require regulated firms to keep records of their business communications, including those that happen online. Beyond compliance, archiving your web and social media presence brings business and risk management benefits too.
In the process of researching your web-archiving technology options to comply with these regulations, you’ve no doubt encountered a number of different technical approaches to capturing, archiving, and preserving online content. While we're firm believers in the WARC approach, many others tout their use of application programming interfaces, or APIs, to archive social media feeds and other websites.
So, just what are APIs, and what are the pros and cons of using them for web archiving? Let’s take a closer look.
WHAT ARE APIS?
An application programming interface, or API, provides a third-party developer with access to the data or functions of another application, system, or platform. For example, Facebook has APIs that allow companies to collect data about Facebook users so they can better target advertisements, enable developers to create games that work within the social media platform, and more.
In essence, an API operates as a side door, granting third parties access to some of the functions and information in the platform so that they can design add-ons or additional features not supported by the original platform.
These plug-in functions are generally supported because they’re beneficial to the API provider. For example, a ticket-purchasing app might use a Google Calendar API to let you add an event to your calendar directly from the purchase screen. Google appreciates that, because it keeps you using its calendar.
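To make the “side door” idea concrete, here’s a minimal sketch in Python. Everything in it is hypothetical: the `Platform` class, its field names, and the `api_get_post` method are invented for illustration and don’t correspond to any real platform’s API. The point is simply that the platform decides which fields come out through the door.

```python
# Hypothetical platform: class, method, and field names are invented
# for illustration and do not match any real service.
class Platform:
    # The fields the platform has chosen to expose through its API.
    PUBLIC_FIELDS = ("author", "text", "posted_at")

    def __init__(self):
        # Internal storage holds more than the API will ever return.
        self._posts = {
            1: {
                "author": "acme_bank",
                "text": "Our new savings rates are live.",
                "posted_at": "2018-03-01T09:00:00Z",
                "internal_rank": 0.87,
                "ad_targeting": ["finance", "savers"],
            }
        }

    def api_get_post(self, post_id):
        """Return only the fields the platform has decided to expose."""
        post = self._posts[post_id]
        return {key: post[key] for key in self.PUBLIC_FIELDS}


platform = Platform()
public_view = platform.api_get_post(1)
print(sorted(public_view))  # ['author', 'posted_at', 'text']
```

If the platform later shortens `PUBLIC_FIELDS`, every third-party tool built on `api_get_post` silently gets less data, and none of those tools has a say in it.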
Note that APIs are created and controlled by platforms, and that they’re continually being used by third-party developers who range from the innocuous to the nefarious. We’ll circle back to these characteristics, since they lead directly to our biggest concern with using APIs for web archiving. But before we get there, let’s look at the bright side. What’s good about using APIs to archive web content?
THE PROS OF USING APIS TO ARCHIVE WEB CONTENT
When they work, APIs offer authorized access to user data that might not otherwise be available. They generally capture most of the data and metadata contained on a site. Once that data has been exported to a standard file type such as JSON, the export is fairly stable: it remains usable for as long as the file type is readily supported, and it can be opened by any program that works with that file type.
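Here’s a rough sketch of what that stability looks like in practice, using only Python’s standard `json` module. The record below is a made-up example of the kind of data an API export might contain; the field names and values are assumptions, not taken from any real platform.

```python
import json
import os
import tempfile

# Hypothetical record, shaped like something an API export might contain.
record = {
    "id": "12345",
    "author": "acme_bank",
    "text": "Our new savings rates are live.",
    "posted_at": "2018-03-01T09:00:00Z",
    "likes": 42,
}

# Write the record to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "post_12345.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2, sort_keys=True)

# Any JSON-capable program can read the file back later.
with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == record)  # True
```

The data survives the round trip intact, which is the good news. What the file doesn’t contain is anything about how the post looked on the page.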
For archiving, APIs offer a one-size-fits-all approach that doesn’t take a lot of setup. You don’t have to give a lot of thought to how deep your archives should be or how many page links you should include—you only get what the API provides access to, and that’s the end of that.
For compliance teams in financial services firms and other organizations archiving their websites, using an API may be the easy road on paper, but it's not always the best or most responsible path to take.
THE CONS OF USING APIS TO ARCHIVE WEB CONTENT
The first problem is that you don’t get a recognizable “rendering” of the data out of an API collection. Instead, you get a string of data that, while it includes the information you’d need to reconstruct the text that a customer would have read on the site, does not remotely resemble the website it’s supposedly “collecting.”
If you’re trying to replicate a customer’s experience of a website or show a regulator what you said and what the context for that statement was, be aware that you’re not going to get that from an API archive.
And there may not even be an API for a particular site you need to collect, or at least not a publicly accessible one. Or it may come with terms and conditions that you can’t agree to. Additionally, instead of a single tool that can crawl and capture any online content, you need separate software for each website you archive, because each one has to interact with that site’s own API.
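To illustrate the one-tool-per-platform problem, here’s a small sketch. The two payload shapes and the collector classes are hypothetical, invented to show the pattern rather than to mirror any real API: each platform names and nests its fields differently, so each needs its own code to produce a common archive record.

```python
from abc import ABC, abstractmethod


class Collector(ABC):
    """One collector per platform API; each is written and maintained separately."""

    @abstractmethod
    def normalize(self, payload: dict) -> dict:
        """Map a platform-specific payload onto a common archive record."""


class PlatformACollector(Collector):
    # Hypothetical payload shape: nested user object, "message" field.
    def normalize(self, payload: dict) -> dict:
        return {
            "author": payload["user"]["name"],
            "text": payload["message"],
            "timestamp": payload["created_time"],
        }


class PlatformBCollector(Collector):
    # Hypothetical payload shape: flat structure, different field names.
    def normalize(self, payload: dict) -> dict:
        return {
            "author": payload["screen_name"],
            "text": payload["full_text"],
            "timestamp": payload["created_at"],
        }


payload_a = {
    "user": {"name": "acme_bank"},
    "message": "Our new savings rates are live.",
    "created_time": "2018-03-01T09:00:00Z",
}
payload_b = {
    "screen_name": "acme_bank",
    "full_text": "Our new savings rates are live.",
    "created_at": "2018-03-01T09:00:00Z",
}

record_a = PlatformACollector().normalize(payload_a)
record_b = PlatformBCollector().normalize(payload_b)
print(record_a == record_b)  # True
```

Both collectors end up with the same record, but only because someone wrote and tested a separate mapping for each platform. Add a third platform and you add a third collector.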
This all brings us to our biggest complaint: APIs put you, the archiver, at the mercy of the platform developer providing your API and, by extension, at the mercy of other developers using that API.
If the API provider decides to change or remove its API in response to unintended use of its data or a data leak, your archiving tool will no longer work. And that means that—for however long you don’t have API access—you’re not getting updated web archives.
Oh, hello, FINRA and SEC regulators! Didn’t see you standing there. Hang on one second, please, while I try to restore my critical compliance function of creating and maintaining business records.
But wait, you say: this vendor promises that API access is stable, and that changes to APIs are announced well in advance so that developers can prepare! Surely we’re blowing this out of proportion, right? Most of the time, the vendor is right about that. Platform developers create APIs so that they don’t have to build every little function that a user might want; by providing some access to their platform, they allow other developers to share the load. So, yes, it’s to their advantage to maintain those APIs and give third-party developers time to adjust to upcoming changes.
The thing is, your compliance demands don’t just apply “most of the time.”
You can’t explain to a regulator that something went wrong with the API and you don’t have any archives from August, but hey, you got it straightened out within a few weeks.
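If you want to know whether you’re already sitting on a gap like that, a quick check over your capture dates will tell you. The sketch below assumes a daily capture schedule and uses made-up dates; the `find_gaps` helper is our own illustration, not part of any archiving product.

```python
from datetime import date, timedelta


def find_gaps(capture_dates, max_gap=timedelta(days=1)):
    """Return (earlier, later) pairs of consecutive captures that are
    further apart than max_gap."""
    ordered = sorted(set(capture_dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


# Made-up capture log: daily captures, with an outage for most of August.
captures = [
    date(2018, 7, 30),
    date(2018, 7, 31),
    date(2018, 8, 21),
    date(2018, 8, 22),
]

gaps = find_gaps(captures)
for start, end in gaps:
    print(f"No captures between {start} and {end} ({(end - start).days} days)")
# No captures between 2018-07-31 and 2018-08-21 (21 days)
```

A check like this can tell you that a gap exists. It can’t go back and fill it in.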
Not to mention, do you really want to be sitting on a data "time bomb" that could go off at the worst possible moment, creating an entirely avoidable archiving crisis that you now have to manage? We’ve already seen this happen.
In 2018, we were all pummeled with news about data privacy. In the EU, the General Data Protection Regulation (GDPR) took effect; countless data breaches exposed practically everyone’s private personal information; and Facebook weathered scandal after scandal as users learned just how unprotected their social media data was.
In the midst of that storm, and specifically in response to the Cambridge Analytica news, Facebook announced in April 2018 that it was limiting the amount of user data that developers could access through its API. It reduced API access even further in July. Both Instagram and Twitter followed suit.
Can you afford to rely on APIs that you don’t control to capture your online business communications?
Or do you need a web archiving approach that works everywhere and produces fully functional archives of any site you need, in a form that can be navigated by supervisors or regulators as if it were the original live website that customers interacted with?