Authenticated Web Archives

Accurate, reliable, simple to use, and secure workflows for archiving web content.


The Problem

Online content disappears rapidly, erasing critical evidence for investigative journalism, accountability, and cultural preservation. Social media platforms and hosting providers face pressure to implement stricter content moderation, with automated filters and human moderators making rapid decisions about what stays online. Records documenting potential crimes – especially those with violent imagery – risk being permanently deleted. Restoring content is often impossible: original posters may be arrested, lose device access, or no longer be alive when investigations begin.

Existing archiving methods face three challenges: platforms actively block automated crawlers, preserved content lacks the cryptographic verification and chain-of-custody documentation required for legal admissibility, and saved material becomes unsearchable across large collections.

JOURNALISM
Strong web archives provide a tamper-evident way to capture online evidence, safeguarding reporting against censorship and the erosion of digital sources.

HISTORY
These archives create a trustworthy and resilient collection of digital primary sources, ensuring that the ephemeral nature of the web does not erase our collective memory.

LAW
This technology establishes a tamper-evident digital chain of custody, helping turn fleeting web content into verifiable evidence that can support admissibility in court.


The Solution

Starling is developing web archiving workflows built on open source software, designed to keep preserved archives accurate and reliable while accounting for the sensitivity of the data. We draw on the considerable expertise of national libraries and legal deposit institutions around the world.

Our case studies have experimented with forensically sound web archiving, focusing on capturing broad contextual snapshots of web material.

The WACZ standard and file format

The Web Archive Collection Zipped (WACZ) standard provides a portable packaging format for web archives that bundles WARC data, indexes, metadata, and verification information into a single ZIP file. Unlike traditional WARC files that lack contextual information and require complex server infrastructure for viewing, WACZ enables efficient browser-based rendering by organizing content with indexes that allow random access to only the data needed for each page.
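Because a WACZ file is an ordinary ZIP, its layout can be inspected with standard tooling. The sketch below groups the entries of an archive by top-level folder, using only Python's standard library. The folder names (archive/, indexes/, pages/) and datapackage.json follow the WACZ specification as we understand it; the helper names summarize_wacz and build_demo_wacz are hypothetical, and the demo archive contains placeholder bytes rather than real WARC data.

```python
import io
import zipfile
from collections import defaultdict


def summarize_wacz(source):
    """Group the entries of a WACZ (a ZIP file) by top-level folder.

    `source` may be a filesystem path or a binary file-like object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Files stored at the root (e.g. datapackage.json) go under "(root)".
            top, sep, _rest = info.filename.partition("/")
            groups[top if sep else "(root)"].append(info.filename)
    return dict(groups)


def build_demo_wacz():
    """Build a tiny in-memory ZIP that mimics the WACZ layout.

    Assumption: the contents are placeholders, not valid WARC or index data.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("archive/data.warc.gz", b"placeholder WARC bytes")
        zf.writestr("indexes/index.cdx", b"placeholder index")
        zf.writestr("pages/pages.jsonl", b'{"format": "json-pages-1.0"}\n')
        zf.writestr("datapackage.json", b"{}")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for folder, names in summarize_wacz(build_demo_wacz()).items():
        print(folder, names)
```

Passing a path to a real .wacz file instead of the demo buffer should list its WARC data, indexes, page list, and manifest in the same way.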

Built-in Integrity Through Cryptographic Hashing

Every WACZ file includes a datapackage.json manifest that contains cryptographic hashes of all resources within the archive, providing a verifiable fingerprint to detect any unauthorized modifications. This hash-based integrity checking ensures that archived content remains tamper-evident throughout its lifecycle.
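The integrity check described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each entry in the manifest's resources list carries a path and a hash of the form "sha256:<hex digest>", and it handles only SHA-256. The function names verify_resources and make_demo_archive are hypothetical, and the demo archive holds placeholder bytes.

```python
import hashlib
import io
import json
import zipfile


def verify_resources(source):
    """Return the resource paths whose contents do not match datapackage.json."""
    mismatched = []
    with zipfile.ZipFile(source) as zf:
        manifest = json.loads(zf.read("datapackage.json"))
        for resource in manifest.get("resources", []):
            path = resource["path"]
            algorithm, _, expected = resource["hash"].partition(":")
            try:
                data = zf.read(path)
            except KeyError:
                # Listed in the manifest but missing from the ZIP.
                mismatched.append(path)
                continue
            # Assumption: only sha256 is supported in this sketch.
            if algorithm != "sha256" or hashlib.sha256(data).hexdigest() != expected:
                mismatched.append(path)
    return mismatched


def make_demo_archive(tamper=False):
    """Build an in-memory archive with one resource; optionally alter it after hashing."""
    payload = b"placeholder WARC bytes"
    manifest = {
        "profile": "data-package",
        "resources": [{
            "name": "data.warc.gz",
            "path": "archive/data.warc.gz",
            "hash": "sha256:" + hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload),
        }],
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        stored = payload + (b"!" if tamper else b"")
        zf.writestr("archive/data.warc.gz", stored)
        zf.writestr("datapackage.json", json.dumps(manifest))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(verify_resources(make_demo_archive()))
    print(verify_resources(make_demo_archive(tamper=True)))
```

Note that this check only shows that the resources agree with the manifest. Someone who can rewrite the archive can also rewrite the manifest, which is why the signing step matters for authenticity.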

Authentication Through Digital Signatures

The specification adds optional authentication by allowing creators to digitally sign archives, notably with TLS (X.509) certificates. These signatures both validate the identity of the entity creating the archive and establish a trusted timestamp for when the capture occurred.
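As a rough illustration of how a signature ties back to the archive contents, the sketch below checks that a datapackage-digest.json file records the SHA-256 hash of the bundled datapackage.json, and that a signedData block (if present) refers to the same hash. The field names path, hash, and signedData reflect our reading of the WACZ signing specification and should be treated as assumptions; the function names and the demo values are hypothetical. The sketch does not verify any signature or certificate chain, which requires a cryptography library and the signer's certificate.

```python
import hashlib
import io
import json
import zipfile


def check_manifest_digest(source):
    """Return (digest_matches, signed_data_covers_manifest).

    No cryptographic signature is verified here; this only checks that
    the recorded hashes point at the manifest actually in the archive.
    """
    with zipfile.ZipFile(source) as zf:
        manifest_bytes = zf.read("datapackage.json")
        digest_doc = json.loads(zf.read("datapackage-digest.json"))
    actual = "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()
    digest_matches = digest_doc.get("hash") == actual
    signed = digest_doc.get("signedData")
    signed_data_covers_manifest = (
        isinstance(signed, dict) and signed.get("hash") == actual
    )
    return digest_matches, signed_data_covers_manifest


def make_signed_demo(include_signed_data=True):
    """Build an in-memory archive with a manifest and a digest file."""
    manifest_bytes = json.dumps(
        {"profile": "data-package", "resources": []}
    ).encode("utf-8")
    manifest_hash = "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()
    digest_doc = {"path": "datapackage.json", "hash": manifest_hash}
    if include_signed_data:
        # Placeholder values only: a real archive carries signature and
        # certificate fields here, which this sketch does not model.
        digest_doc["signedData"] = {
            "hash": manifest_hash,
            "created": "2024-01-01T00:00:00Z",
        }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", manifest_bytes)
        zf.writestr("datapackage-digest.json", json.dumps(digest_doc))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_manifest_digest(make_signed_demo()))
    print(check_manifest_digest(make_signed_demo(include_signed_data=False)))
```

The design point is that the signature covers the manifest hash, and the manifest in turn covers every resource, so a single signed value can vouch for the whole archive.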
