How to Find Original Files on Internet Archive

May 19, 2026

Open almost any item on archive.org and expand the full file list. A single scanned book might show thirty files: PDFs, EPUBs, DjVu files, plain text, page images, thumbnails, XML metadata, and more. Most of these were not uploaded by anyone; the Archive's processing pipeline generated them automatically. If you want the highest-quality version, or need to cite the actual source material, you need to know which file is the original and which are derivatives.

The three categories

Every file in an Internet Archive item falls into one of three categories.

Original files are what the uploader submitted: the source material in its native format. This is the highest-quality version and the authoritative copy.

Derivative files are auto-generated from the original by the Archive's derivation process. These include format conversions for broader compatibility (PDF from scanned images, MP3 from FLAC, thumbnail from video), accessibility formats (DAISY, plain text via OCR), and analysis files (spectrograms, waveforms).

Metadata files describe the item itself. They are created by the system, not the uploader.

What derivatives get created?

The derivation process depends on the media type.

For audio, uploading a lossless file (FLAC, WAV, AIFF) generates VBR MP3, OGG Vorbis, spectrograms, and waveform images.

For video, originals generate H.264 MP4 at various resolutions, thumbnail images, and sometimes animated GIFs.

For texts and books, originals generate PDF, EPUB, plain text (via OCR), DAISY, and page-level scan thumbnails. DjVu files used to be generated but have not been created for new uploads since March 2016; existing DjVu files remain available.

For images, derivatives include thumbnails at various resolutions and display-optimized versions.

The system will not create a derivative that duplicates the uploaded format. If someone uploads an MP3, no additional MP3 will be generated. If someone uploads a PDF, no derived PDF will appear.

How to identify the original: web interface

On any item page, click Show All in the Download Options panel to see the full file listing. Look for an option or link labeled something like all original files or Original near the bottom of the panel. This filters the list to just the source material. If that option is not visible, you can usually distinguish originals by their format and context — the largest file in the primary format (the big FLAC, the full-resolution TIFF, the source PDF) is typically the original.

How to identify the original: _files.xml

The definitive way to check is the item's _files.xml metadata file. Navigate to archive.org/download/[identifier]/[identifier]_files.xml in your browser. This XML file lists every file in the item with a source field that says "original", "derivative", or "metadata". Derivative files also include an original field that points back to the source file they were generated from. Each file entry includes its format name, MD5 and SHA1 checksums, and file size. This is the authoritative record of what is what.
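As a rough illustration of reading that listing programmatically, the sketch below groups file names by their source value and maps each derivative back to its original. It assumes source is an attribute on each file element and that the other fields (format, original, md5, size) are child elements; the sample XML, the file names, and the helper names group_by_source and derived_from are made up for the example, so check them against a real _files.xml before relying on the layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample listing; real _files.xml entries carry more fields.
SAMPLE_FILES_XML = """<files>
  <file name="concert.flac" source="original">
    <format>Flac</format>
    <md5>0123456789abcdef0123456789abcdef</md5>
    <size>31457280</size>
  </file>
  <file name="concert.mp3" source="derivative">
    <format>VBR MP3</format>
    <original>concert.flac</original>
    <md5>fedcba9876543210fedcba9876543210</md5>
    <size>5242880</size>
  </file>
  <file name="concert_meta.xml" source="metadata">
    <format>Metadata</format>
  </file>
</files>"""


def group_by_source(files_xml: str) -> dict[str, list[str]]:
    """Return file names grouped by their source designation."""
    root = ET.fromstring(files_xml)
    groups = defaultdict(list)
    for entry in root.iter("file"):
        # Assumption: "source" is an attribute of <file>.
        groups[entry.get("source", "unknown")].append(entry.get("name"))
    return dict(groups)


def derived_from(files_xml: str) -> dict[str, str]:
    """Map each derivative's name to the original it was generated from."""
    root = ET.fromstring(files_xml)
    mapping = {}
    for entry in root.iter("file"):
        parent = entry.findtext("original")
        if entry.get("source") == "derivative" and parent:
            mapping[entry.get("name")] = parent
    return mapping


if __name__ == "__main__":
    print(group_by_source(SAMPLE_FILES_XML))
    print(derived_from(SAMPLE_FILES_XML))
```

To use it on a real item, save the XML from the URL above to disk and pass the file's contents to either function.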

How to identify the original: ia command line tool

With the ia command line tool, run ia metadata [identifier] | jq '.files[] | select(.source=="original")' to return only the original files. This pipes the item's JSON metadata through jq, which must be installed separately. Alternatively, run ia list [identifier] to see all files and examine the output. For the full ia tool guide, see How to Use the IA Command Line Tool.
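If jq is not available, the same filter is easy to reproduce in Python once the output of ia metadata has been loaded as JSON. This is a minimal sketch that assumes the JSON has a top-level files list whose entries carry name and source keys, which is what the jq expression above implies; the sample data and the function name original_files are hypothetical.

```python
import json


def original_files(metadata: dict) -> list[dict]:
    """Keep only the file entries whose source is "original"."""
    return [f for f in metadata.get("files", []) if f.get("source") == "original"]


# Hypothetical stand-in for the JSON printed by: ia metadata [identifier]
SAMPLE_METADATA = json.loads("""
{
  "files": [
    {"name": "book.pdf", "source": "original", "format": "Text PDF"},
    {"name": "book.epub", "source": "derivative", "original": "book.pdf"},
    {"name": "book_meta.xml", "source": "metadata"}
  ]
}
""")

if __name__ == "__main__":
    for entry in original_files(SAMPLE_METADATA):
        print(entry["name"])
```

In practice you would redirect the ia metadata output to a file and read it with json.load instead of using the inline sample.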

The key metadata files explained

Three system-generated files appear in virtually every item. _meta.xml contains item-level metadata — title, creator, description, date, subject, collection, and media type. This is the catalog record for the item. _files.xml lists every file with its source designation, format, checksums, and size. This is your definitive guide to which files are originals versus derivatives. _reviews.xml stores user reviews, if any exist. There may also be a _rules.conf file — this is an optional uploader-created file that can block the generation of certain lossy derivatives (for example, preventing MP3 creation for a high-fidelity audio item).

Practical tips

When downloading for preservation or archival purposes, always grab the original files rather than derivatives. Derivatives can be regenerated from originals, but not the reverse. When downloading for everyday use (reading, listening, watching), derivatives are usually the more convenient choice — they are in common formats and smaller file sizes. When citing an Internet Archive item in research, reference the original file specifically, including its filename and checksum from _files.xml if precision matters.
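When precision matters, it can help to confirm that a downloaded original matches the MD5 recorded in _files.xml. The sketch below computes a file's MD5 with Python's standard hashlib module and compares it to a recorded value; the function names and the example path are illustrative, and the recorded checksum is something you would copy from the item's own listing.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large originals do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_md5(path: str, recorded_md5: str) -> bool:
    """Compare a local file against the checksum copied from _files.xml."""
    return file_md5(path) == recorded_md5.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <downloaded file> <md5 from _files.xml>
    if len(sys.argv) == 3:
        print(matches_recorded_md5(sys.argv[1], sys.argv[2]))
```

A mismatch usually means an incomplete download, so fetching the file again is the first thing to try.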

Never bookmark or link to the numeric server URLs that sometimes appear in download links (like ia600204.us.archive.org). These are internal server addresses that change over time and will eventually break. Always use the canonical form: archive.org/download/[identifier]/[filename].
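A small helper can make the canonical form the default when generating links. The sketch below builds archive.org/download URLs from an identifier and filename, and flags links that point at a numbered server. The host pattern is an assumption based only on the example hostname above, and the function names are hypothetical, so other server naming schemes would need their own check.

```python
import re
from urllib.parse import quote, urlsplit

# Assumption: numbered servers look like ia600204.us.archive.org.
NUMBERED_HOST = re.compile(r"^ia\d+\.us\.archive\.org$")


def canonical_download_url(identifier: str, filename: str) -> str:
    """Build the stable archive.org/download link for a file in an item."""
    # quote() percent-encodes spaces and similar characters but keeps "/",
    # so filenames inside subdirectories still work.
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"


def is_server_specific(url: str) -> bool:
    """Return True if the URL points at a numbered server hostname."""
    host = urlsplit(url).hostname or ""
    return bool(NUMBERED_HOST.match(host))


if __name__ == "__main__":
    print(canonical_download_url("example-item", "scan 01.tiff"))
    print(is_server_specific("https://ia600204.us.archive.org/example"))
```

Running a list of saved links through is_server_specific is a quick way to find the ones worth replacing before they break.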

Arkibber helps you evaluate items before downloading by giving you a clean view of what is available. Search and filter the Archive's collections through Arkibber, then navigate to archive.org to grab the specific files you need — originals for preservation, derivatives for convenience.

For the step-by-step download process, see How to Download from Internet Archive. For downloading everything from an item at once, see How to Download All Files from an Internet Archive Item.
