A Quick Introduction to PDF Syntax

Anatomy of a PDF File

June 2023

%PDF-1.4
1 0 obj
<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>
endobj
2 0 obj
<< /Type /Outlines /Count 0 >>
endobj
3 0 obj
<< /Type /Pages /Kids [4 0 R] /Count 1 >>
endobj
4 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R /Font << /F1 7 0 R >> >> >>
endobj
5 0 obj
<< /Length 73 >>
stream
BT /F1 24 Tf 100 100 Td (Hello World) Tj ET
endstream
endobj
6 0 obj
[/PDF /Text]
endobj
7 0 obj
<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica /Encoding /MacRomanEncoding >>
endobj
xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000080 00000 n
0000000129 00000 n
0000000192 00000 n
0000000376 00000 n
0000000498 00000 n
0000000526 00000 n
trailer
<< /Size 8 /Root 1 0 R >>
startxref
646
%%EOF
4 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R /Font << /F1 7 0 R >> >> /Annots 8 0 R >>
endobj
8 0 obj
[9 0 R]
endobj
9 0 obj
<< /Type /Annot /Subtype /Text /Rect [44 616 162 735] /Contents (Text #1) /Open true >>
endobj
xref
0 1
0000000000 65535 f
4 1
0000000866 00000 n
8 2
0000001067 00000 n
0000001090 00000 n
trailer
<< /Size 10 /Root 1 0 R /Prev 646 >>
startxref
1205
%%EOF

Byte offsets in the original file: 0 (header), 9, 80, 129, 192, 376, 498, 526 (objects 1 to 7), 646 (xref).

How do you read a PDF file?

This introductory memo walks you through the process of decoding the internal structure of a PDF file. A very simple "Hello World" file, similar to an example given in the PDF specification, will serve as our material.

Syntax Overview

The PDF specification distinguishes 4 domains: objects, file structure, document structure, and content streams.

The following explanations will show you how reading a PDF file makes use of these domains.

Header

The first line of a PDF file is a %PDF-X.Y header. These numbers indicate the version of the specification the file complies with. A high version number means that "modern" features may be used, but they do not have to be, because PDF is backward compatible. For example, a PDF 1.2 document is also a valid PDF 1.7 document.

But do not take this header for granted: since PDF 1.4, another object that you will see later on (the root, or catalog) may specify a later version through its /Version entry, and that version takes precedence.

%PDF-1.4
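As a minimal sketch of this first step, the header can be read with a regular expression. The function name read_header_version is only illustrative, and the sketch assumes the header sits at the very first byte of the file and uses single-digit version numbers.

```python
import re


def read_header_version(data: bytes) -> tuple[int, int]:
    """Return (major, minor) from the %PDF-X.Y header at the start of the file."""
    # Assumption: the header starts at byte 0 and both numbers are single digits.
    match = re.match(rb"%PDF-(\d)\.(\d)", data)
    if match is None:
        raise ValueError("not a PDF file: missing %PDF-X.Y header")
    return int(match.group(1)), int(match.group(2))


print(read_header_version(b"%PDF-1.4\n1 0 obj\n"))  # (1, 4)
```

Remember that this only gives the header version; a later version declared in the catalog is not taken into account here.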

End of File

At the very end of the file sits a %%EOF line.

%%EOF

We must go up a few lines to find a startxref keyword, which implies that something actually starts here. The number immediately following this keyword is the file offset, in bytes, of a structure named the Cross-Reference. This structure is an index that allows direct access to all parts (objects) of the file and gives an entry point to the root of the document.

startxref 646
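Reading backwards from the end of the file could be sketched as follows. The function name find_startxref and the 1024-byte search window are assumptions made for this illustration, not values taken from the specification.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    # Assumption: the keyword sits within the last 1024 bytes of the file.
    tail = data[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found near the end of the file")
    # The last occurrence belongs to the most recent revision of the document.
    return int(matches[-1])


end_of_file = b"trailer\n<< /Size 8 /Root 1 0 R >>\nstartxref\n646\n%%EOF\n"
print(find_startxref(end_of_file))  # 646
```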

Why is the entry point located at the end of the document? This approach allows efficient incremental updates. More on that later.

Cross-Reference Table, and Trailer

When startxref points to an xref keyword, the Cross-Reference is implemented as a table, immediately followed by a trailer. Each table subsection starts with a line giving the number of the first object and the total number of objects referenced in the subsection; fixed-length lines (20 bytes each) follow, one per object, specifying the object's location and its status (in use, or free).

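xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000080 00000 n
0000000129 00000 n
0000000192 00000 n
0000000376 00000 n
0000000498 00000 n
0000000526 00000 n
trailer
<< /Size 8 /Root 1 0 R >>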

The subsection lists 8 indirect objects starting at index 0, so object #7 is mentioned on the 8th line and can be found at file offset 526.
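The lookup described above can be sketched in a few lines. The function name parse_xref_table is illustrative, and the sketch splits lines on whitespace instead of relying on the fixed 20-byte entry length, which is a simplification.

```python
def parse_xref_table(data: bytes, offset: int) -> dict:
    """Map object number -> (byte offset, generation, in_use) for one table."""
    lines = iter(data[offset:].splitlines())
    if next(lines).strip() != b"xref":
        raise ValueError("no xref keyword at this offset")
    entries = {}
    for line in lines:
        fields = line.split()
        if fields and fields[0] == b"trailer":
            break  # the table is over, the trailer dictionary follows
        if len(fields) == 2:  # subsection header: first object number, count
            first, count = int(fields[0]), int(fields[1])
            for number in range(first, first + count):
                position, generation, status = next(lines).split()
                entries[number] = (int(position), int(generation), status == b"n")
    return entries


# The table of the "Hello World" file, transcribed by hand.
XREF = b"""xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000080 00000 n
0000000129 00000 n
0000000192 00000 n
0000000376 00000 n
0000000498 00000 n
0000000526 00000 n
trailer
<< /Size 8 /Root 1 0 R >>
"""

print(parse_xref_table(XREF, 0)[7])  # (526, 0, True)
```

Because each subsection header is handled separately, the same function also reads tables made of several subsections, such as the one written by an incremental update.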

Indirect Objects

An N G obj line denotes an indirect object, where N is its object number (ID) and G is its generation number. These two numbers form an envelope that makes the object addressable. The payload itself is just a "regular" object, enclosed between the obj and endobj keywords.

7 0 obj
<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica /Encoding /MacRomanEncoding >>
endobj

In the previous example, indirect object #7 contains a payload that is a dictionary defining 5 key-value pairs. The following section describes most of the object types.
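Separating the envelope from the payload could look like the sketch below. The name read_indirect_object is illustrative, and the pattern naively stops at the first endobj keyword, so it assumes that this word does not appear inside a string or a stream.

```python
import re

# N G obj <payload> endobj (naive: stops at the first "endobj" it meets)
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def read_indirect_object(data: bytes, offset: int) -> tuple[int, int, bytes]:
    """Return (object number, generation number, raw payload) found at offset."""
    match = INDIRECT_OBJECT.match(data, offset)
    if match is None:
        raise ValueError("no indirect object at this offset")
    number, generation, payload = match.groups()
    return int(number), int(generation), payload.strip()


sample = b"%PDF-1.4\n7 0 obj\n<< /Type /Font /Subtype /Type1 >>\nendobj\n"
print(read_indirect_object(sample, 9))
# (7, 0, b'<< /Type /Font /Subtype /Type1 >>')
```

The payload is returned as raw bytes; turning it into a dictionary or another object type is a separate parsing step.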

Object Types

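There are atomic types: booleans (true, false), numbers (integer such as 612, or real such as 0.5), strings (literal such as (Hello World), or hexadecimal such as <48656C6C6F>), names (such as /Type), and the null object.

And there are collection types: arrays (such as [0 0 612 792]) and dictionaries (such as << /Size 8 /Root 1 0 R >>).

And there is a composite type for content: the stream, which is a dictionary followed by a sequence of bytes enclosed between the stream and endstream keywords.

Last but not least: the indirect reference (such as 7 0 R), which points to the indirect object that has the given object number and generation number.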

Filters

In this example the stream content is made of plain ASCII characters:

5 0 obj
<< /Length 73 >>
stream
BT /F1 24 Tf 100 100 Td (Hello World) Tj ET
endstream
endobj

But very often a filter modifies the byte sequence. A filter may compress the data or encode it, and several filters may be chained to form a pipeline. For example, the data of a stream whose dictionary contains /Filter [/ASCII85Decode /FlateDecode] (besides the mandatory /Length attribute) should first be decoded from ASCII base-85 into binary and then decompressed with the deflate algorithm.
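The pipeline from this example can be sketched with the Python standard library. The names DECODERS and decode_stream are illustrative, only the two filters mentioned above are covered, and the filter names are written without their leading slash.

```python
import base64
import zlib


def ascii85_decode(raw: bytes) -> bytes:
    """Decode ASCII base-85 data, dropping the ~> end-of-data marker if present."""
    raw = raw.strip()
    if raw.endswith(b"~>"):
        raw = raw[:-2]
    return base64.a85decode(raw)


# Assumption: only these two filters are needed for the illustration.
DECODERS = {
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": zlib.decompress,
}


def decode_stream(raw: bytes, filters: list[str]) -> bytes:
    """Apply the filters in the order in which the /Filter array lists them."""
    for name in filters:
        raw = DECODERS[name](raw)
    return raw


content = b"BT /F1 24 Tf 100 100 Td (Hello World) Tj ET"
encoded = base64.a85encode(zlib.compress(content)) + b"~>"
print(decode_stream(encoded, ["ASCII85Decode", "FlateDecode"]) == content)  # True
```

The order matters: a writer applies the filters in reverse (compress first, then encode as text), and a reader undoes them from left to right.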

Cross-Reference Stream

The most common type of Cross-Reference, as explained above, is a table. But since PDF 1.5 a Cross-Reference may also be encoded as a Stream object.

This mechanism adds a feature that was not possible with Cross-Reference tables, where every object is located by its file offset in bytes: an indirect object may be stored inside another indirect object. In that case, the container is called an Object Stream and the objects it holds are called compressed objects.

Document Structure

The /Root attribute of the Trailer or Cross-Reference Stream holds the reference to the /Catalog indirect object (1 0 R in our example).

The Catalog object starts a tree of nested Pages (plural) objects. This hierarchy leads to Page (singular) objects. A Page has dimensions (/MediaBox), content, and associated resources such as fonts.
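To illustrate the traversal, here is a sketch that works on objects already parsed into Python values. The Ref class, the OBJECTS and TRAILER dictionaries, and the iter_pages function are all hypothetical; the values are transcribed by hand from the "Hello World" file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as 3 0 R (hypothetical helper)."""
    number: int
    generation: int = 0


# Objects 1, 3 and 4 of the example, already parsed into Python values.
OBJECTS = {
    1: {"Type": "Catalog", "Outlines": Ref(2), "Pages": Ref(3)},
    3: {"Type": "Pages", "Kids": [Ref(4)], "Count": 1},
    4: {"Type": "Page", "Parent": Ref(3), "MediaBox": [0, 0, 612, 792],
        "Contents": Ref(5)},
}
TRAILER = {"Size": 8, "Root": Ref(1)}


def iter_pages(node: dict):
    """Walk the tree: Pages nodes have /Kids, Page nodes are the leaves."""
    if node["Type"] == "Page":
        yield node
    else:
        for kid in node["Kids"]:
            yield from iter_pages(OBJECTS[kid.number])


catalog = OBJECTS[TRAILER["Root"].number]
pages = list(iter_pages(OBJECTS[catalog["Pages"].number]))
print([page["MediaBox"] for page in pages])  # [[0, 0, 612, 792]]
```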

Incremental Updates

It is possible to build a new revision of a document without writing a whole new file: changes are appended to the original file. Changes consist of new or modified objects, a Cross-Reference, and a startxref that points to it. The Cross-Reference (either its trailer or its stream dictionary) contains a /Prev attribute that links the new revision to the original Cross-Reference.

[Figure: the update is appended after the original file, and /Prev points to the original xref.]
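Following the /Prev chain can be sketched as below. The SECTIONS dictionary and the resolve_offsets function are hypothetical; the offsets are those of the two revisions of the "Hello World" file, with the free entry for object 0 left out.

```python
# Cross-Reference sections keyed by their own byte offset (transcribed by hand).
SECTIONS = {
    646: {
        "entries": {1: 9, 2: 80, 3: 129, 4: 192, 5: 376, 6: 498, 7: 526},
        "trailer": {"Size": 8, "Root": 1},
    },
    1205: {
        "entries": {4: 866, 8: 1067, 9: 1090},
        "trailer": {"Size": 10, "Root": 1, "Prev": 646},
    },
}


def resolve_offsets(startxref: int) -> dict[int, int]:
    """Merge all revisions, starting from the newest Cross-Reference."""
    merged = {}
    offset = startxref
    while offset is not None:
        section = SECTIONS[offset]
        for number, position in section["entries"].items():
            # The newest revision is visited first, so it wins over older ones.
            merged.setdefault(number, position)
        offset = section["trailer"].get("Prev")
    return merged


offsets = resolve_offsets(1205)
print(offsets[4])  # 866, the modified page replaces the one at offset 192
print(offsets[7])  # 526, untouched objects still come from the original table
```

Starting from offset 646 instead gives the document as it was before the update, which is why older revisions stay readable.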

Conclusion

This was an overview of the main concepts and syntactic elements. To go further, you can read chapter 7 of the freely available Adobe PDF 1.7 specification or, if you can access it, the ISO 32000 specification that superseded it.