How do you read a PDF file?
This introductory memo walks you through the process of decoding its internal structure. A very simple "Hello World" file, similar to an example given in the PDF specification, serves as working material.
The PDF specification distinguishes 4 domains: objects, file structure, document structure and content streams.
The following explanations show how reading a PDF file makes use of these domains.
The first line of a PDF file is a %PDF-X.Y header.
These numbers indicate the version of the specification the file complies with.
When the numbers are high, "modern" features may be used, but this is not an obligation because of PDF backward compatibility.
For example, a PDF 1.2 document is also a valid PDF 1.7 document.
But do not take this header for granted: since PDF 1.4, another object that you will see later on (the root/catalog) may specify a later version that takes precedence over the header.
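As a rough illustration of the header rule, the Python sketch below reads the version from the first bytes of a file and then applies an optional catalog override. The function names are invented for this example, and the catalog version is assumed to have been extracted already as a string such as "1.7".

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data):
    """Return (major, minor) from the %PDF-X.Y header at the start of the file."""
    match = HEADER_RE.match(data)
    if match is None:
        raise ValueError("the data does not start with a %PDF-X.Y header")
    return int(match.group(1)), int(match.group(2))


def effective_version(data, catalog_version=None):
    """Combine the header with an optional catalog version (hypothetical input).

    catalog_version is assumed to be a string such as "1.7"; the later of
    the two versions wins.
    """
    version = header_version(data)
    if catalog_version is not None:
        major, minor = catalog_version.split(".")
        version = max(version, (int(major), int(minor)))
    return version


print(effective_version(b"%PDF-1.4\n"))         # (1, 4)
print(effective_version(b"%PDF-1.4\n", "1.7"))  # (1, 7)
```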
At the very end of the file sits a %%EOF line.
A few lines above it is a startxref keyword, which suggests that something actually starts here.
In fact the number immediately following this keyword is the file offset - in bytes - of a structure named the Cross-Reference.
This structure is an index that allows direct access to all parts (objects) and gives an entry point into the root of the document.
Why is the entry point located at the end of the document? This approach allows efficient incremental updates. More on that later.
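Locating the entry point can be sketched in Python as follows. This is a minimal sketch: the helper name find_startxref, the 1024-byte search window and the sample offset 565 are assumptions made for the example, not values taken from the memo's file.

```python
import re


def find_startxref(data):
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-1024:]  # the keyword sits a few lines above %%EOF
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


sample_tail = b"...\nstartxref\n565\n%%EOF\n"
print(find_startxref(sample_tail))  # 565
```

Taking the last match matters once a file has been updated incrementally, because every revision appends its own startxref.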
When startxref points to an xref keyword, the Cross-Reference is implemented as a table, immediately followed by a trailer.
A table subsection starts with a line specifying the number of the first object mentioned and the total number of objects referenced in the subsection.
It is followed by fixed-length lines (20 bytes each) that specify the location of each object and its status (in use, or freed).
The subsection lists 8 indirect objects starting at index 0, so object #7 is mentioned on the 8th line and can be found at file offset 526.
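The layout of a subsection lends itself to a short Python sketch. The function name and the three-entry sample below are hypothetical (the offsets are made up for illustration); the sketch assumes that pos points at the subsection header line, just after the xref keyword line.

```python
def parse_xref_subsection(data, pos):
    """Parse one subsection and return {object number: (offset, generation, in_use)}."""
    line_end = data.index(b"\n", pos)
    first, count = (int(field) for field in data[pos:line_end].split())
    pos = line_end + 1
    entries = {}
    for i in range(count):
        # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        # a status letter (n = in use, f = free) and a 2-byte end of line.
        entry = data[pos + 20 * i:pos + 20 * (i + 1)]
        offset, generation, status = entry.split()
        entries[first + i] = (int(offset), int(generation), status == b"n")
    return entries


sample = (
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000074 00000 n \n"
)
print(parse_xref_subsection(sample, 0)[2])  # (74, 0, True)
```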
An N G obj line denotes an indirect object, where N is its object number (ID) and G is its generation number.
These indirection properties are an envelope that makes the object addressable. The payload itself is just a "regular" object, enclosed between the obj and endobj keywords.
In the previous example, indirect object #7 contains a payload that is a dictionary defining 5 key-value pairs. The following section describes most of the object types.
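Separating the envelope from its payload might look like the Python sketch below. The helper name and the sample bytes are hypothetical, and the regular expression is deliberately naive: it would be fooled by a payload that itself contains the word endobj (inside a string or a stream, for instance).

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def read_indirect_object(data, offset):
    """Return (object number, generation, raw payload) for the object at offset."""
    match = OBJ_RE.match(data, offset)
    if match is None:
        raise ValueError("no 'N G obj' envelope at offset %d" % offset)
    number, generation, payload = match.groups()
    return int(number), int(generation), payload.strip()


# Hypothetical file fragment: a header line followed by one indirect object.
sample = b"%PDF-1.4\n7 0 obj\n<< /Type /Font >>\nendobj\n"
print(read_indirect_object(sample, 9))  # (7, 0, b'<< /Type /Font >>')
```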
There are atomic types:
Boolean: true or false,
Integer: 800,
Real: -3.14,
Literal string: (ABC),
Hexadecimal string: <414243> (3 ASCII bytes for "ABC"),
Name: /Something,
Comment: from % to the end of the line, like % some comment.
And there are collection types:
Array: [true 800 (ABC) /Something],
Dictionary: << /Key1 (Value1) /Key2 (Value2) >>;
Note that the same separator (for example a space or a carriage return) may occur between a key and a value and between distinct pairs: a parser needs to keep a context in order to determine whether the next token is a key or a value.
And there is a composite type for content:
Stream: a dictionary followed by bytes enclosed between the stream and endstream keywords;
It typically conveys either a sequence of commands that write content on a page or a blob used in a sequence of commands (font file, image).
Last but not least:
Indirect reference: two integers followed by the R keyword, which together reference an indirect object, like 7 0 R for object #7 in its generation 0;
This sequence is not enclosed in delimiters (unlike an array), so special attention is needed when parsing it in order to group tokens correctly.
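One way to group the tokens is to look two tokens ahead whenever an integer is met. The Python sketch below does this for a deliberately tiny subset of the syntax (integers, names and references only); the Ref type, the regular expression and the function name are assumptions made for the example.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    number: int
    generation: int


# Only integers, names and the R keyword are recognized in this sketch.
TOKEN_RE = re.compile(rb"[+-]?\d+|/\w+|R\b")


def group_tokens(body):
    """Turn a flat token sequence into integers, names and Ref items."""
    tokens = TOKEN_RE.findall(body)
    items = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        is_reference = (
            i + 2 < len(tokens)
            and token.isdigit()
            and tokens[i + 1].isdigit()
            and tokens[i + 2] == b"R"
        )
        if is_reference:
            items.append(Ref(int(token), int(tokens[i + 1])))
            i += 3
        elif token.startswith(b"/"):
            items.append(token.decode("ascii"))
            i += 1
        else:
            items.append(int(token))  # a stray R would raise ValueError here
            i += 1
    return items


print(group_tokens(b"800 /Something 7 0 R"))
# [800, '/Something', Ref(number=7, generation=0)]
```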
For example the array [3 0 R 4 0 R 5 0 R] does not begin with 2 integers and does not contain 9 items: it contains 3 indirect references to objects #3, #4 and #5.
In this example the stream content is made of plain ASCII characters:
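A stream with plain ASCII content can be unpacked with a few lines of Python. The sketch below is only an approximation: the sample object is hypothetical (loosely modeled on the page content of the specification's Hello World file), the helper name extract_stream is invented here, and /Length is assumed to be a direct integer rather than an indirect reference.

```python
import re


def extract_stream(obj):
    """Return the bytes enclosed between the stream and endstream keywords."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream") + len(b"stream")
    # The stream keyword is followed by an end of line before the payload.
    if obj[start:start + 2] == b"\r\n":
        start += 2
    elif obj[start:start + 1] == b"\n":
        start += 1
    return obj[start:start + length]


content = b"BT /F1 24 Tf 100 100 Td (Hello World) Tj ET"
sample = b"<< /Length %d >>\nstream\n%b\nendstream" % (len(content), content)
print(extract_stream(sample).decode("ascii"))
# BT /F1 24 Tf 100 100 Td (Hello World) Tj ET
```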
But very often some filter modifies the byte sequence. A filter may compress the data or encode it, and several may be chained to form a pipeline.
For example, a stream whose dictionary contains /Filter [/ASCII85Decode /FlateDecode] (besides the mandatory /Length attribute) should be decoded from ASCII Base85 into binary and then decompressed with the deflate algorithm.
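The pipeline in this example can be approximated with the Python standard library, as in the sketch below. The DECODERS table and the function names are assumptions made for illustration, and only the two filters named above are handled.

```python
import base64
import zlib


def ascii85_decode(data):
    """Decode ASCII Base85 data, dropping the ~> end marker if present."""
    data = data.strip()
    if data.endswith(b"~>"):
        data = data[:-2]
    return base64.a85decode(data)


# Hypothetical lookup table from filter names to decoding functions.
DECODERS = {
    "/ASCII85Decode": ascii85_decode,
    "/FlateDecode": zlib.decompress,
}


def apply_filters(data, filters):
    """Apply each filter in the order listed in the /Filter array."""
    for name in filters:
        data = DECODERS[name](data)
    return data


# Build an encoded sample by running the pipeline backwards.
raw = b"BT (Hello World) Tj ET"
encoded = base64.a85encode(zlib.compress(raw)) + b"~>"
print(apply_filters(encoded, ["/ASCII85Decode", "/FlateDecode"]) == raw)  # True
```

The filters are applied in the order in which they appear in the array, which is the reverse of the order used when the stream was written.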
The most common type of Cross-Reference, as explained above, is a table. But since PDF 1.5 a Cross-Reference may be encoded as a Stream object:
its dictionary is marked with /Type /XRef and contains the same /Root attribute that occurs in a trailer.
This mechanism adds a feature that was not possible with Cross-Reference tables, where all objects are accessed by file offset in bytes: an indirect object may be located inside another indirect object. In that case, the terminology says that the container is an Object Stream that contains compressed objects.
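To make the difference with a table concrete, here is a minimal Python sketch that reads the rows of an already decoded Cross-Reference Stream. It relies on details from the specification that are not covered in this memo: each row is split into three binary fields whose byte widths are given by a /W array, and the first field selects the kind of entry (0 for free, 1 for an object at a file offset, 2 for an object compressed inside an Object Stream). The function name and the sample rows are hypothetical.

```python
ENTRY_KINDS = {0: "free", 1: "in_use", 2: "compressed"}


def parse_xref_stream_rows(decoded, widths, first=0):
    """Split decoded stream bytes into rows of three big-endian fields."""
    row_size = sum(widths)
    entries = {}
    for index in range(len(decoded) // row_size):
        row = decoded[index * row_size:(index + 1) * row_size]
        fields = []
        pos = 0
        for width in widths:
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        # When the type field has zero width, the entry kind defaults to 1.
        kind = fields[0] if widths[0] else 1
        entries[first + index] = (ENTRY_KINDS[kind], fields[1], fields[2])
    return entries


# Hypothetical rows for /W [1 2 1]: a free entry, an object at offset 526,
# and an object stored at index 0 inside Object Stream #7.
rows = bytes([0, 0, 0, 0, 1, 2, 14, 0, 2, 0, 7, 0])
print(parse_xref_stream_rows(rows, [1, 2, 1]))
# {0: ('free', 0, 0), 1: ('in_use', 526, 0), 2: ('compressed', 7, 0)}
```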
The /Root attribute of the Trailer or Cross-Reference Stream holds the reference to the /Catalog indirect object.
The Catalog object starts a tree of nested Pages (plural) objects. This hierarchy leads to Page (singular) objects.
A Page has dimensions (/MediaBox), content and associated resources like fonts.
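Walking this hierarchy is a simple recursion. The Python sketch below assumes that the indirect objects have already been parsed into a hypothetical in-memory form: a dictionary keyed by (object number, generation) tuples, in which references are stored as the same kind of tuple. The object numbers and the page size are made up for the example.

```python
def iter_pages(objects, ref):
    """Yield (reference, page dictionary) for every Page under the given node."""
    node = objects[ref]
    if node["/Type"] == "/Page":
        yield ref, node
    else:
        # A Pages (plural) node lists its children in /Kids.
        for kid in node["/Kids"]:
            yield from iter_pages(objects, kid)


# Hypothetical document: a catalog, two Pages nodes and two Page leaves.
objects = {
    (1, 0): {"/Type": "/Catalog", "/Pages": (2, 0)},
    (2, 0): {"/Type": "/Pages", "/Kids": [(3, 0), (4, 0)]},
    (3, 0): {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]},
    (4, 0): {"/Type": "/Pages", "/Kids": [(5, 0)]},
    (5, 0): {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]},
}

catalog = objects[(1, 0)]
print([ref for ref, page in iter_pages(objects, catalog["/Pages"])])
# [(3, 0), (5, 0)]
```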
It is possible to build a new revision of a document without writing a whole new file: changes are appended to the original file.
Changes consist of new or modified objects, a Cross-Reference, and a startxref that points to it.
The Cross-Reference (either its trailer or its stream dictionary) contains a /Prev attribute that links the new revision to the original Cross-Reference.
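Following the chain of revisions could be sketched in Python as below. The sketch handles Cross-Reference tables only and makes several simplifying assumptions: the trailer dictionary contains no nested dictionary, the helper names are invented, and the two-revision sample is a hypothetical skeleton rather than a complete PDF file.

```python
import re


def find_startxref(data):
    """Return the offset that follows the last startxref keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


def xref_chain(data):
    """Return the Cross-Reference offsets, from the newest revision to the oldest."""
    offsets = []
    offset = find_startxref(data)
    while offset is not None and offset not in offsets:
        offsets.append(offset)
        trailer_pos = data.find(b"trailer", offset)
        trailer_end = data.find(b">>", trailer_pos)
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer_pos:trailer_end])
        offset = int(prev.group(1)) if prev else None
    return offsets


# Hypothetical skeleton: an original revision, then an appended update.
rev1 = (
    b"%PDF-1.4\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)
rev2 = (
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 /Prev 9 >>\n"
    b"startxref\n" + str(len(rev1)).encode("ascii") + b"\n%%EOF\n"
)
print(xref_chain(rev1 + rev2) == [len(rev1), 9])  # True
```

A reader that wants the state of the document at an earlier revision can stop anywhere along this chain and ignore the newer Cross-References.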
This was an overview of the main concepts and syntactic elements. To go further you can read chapter 7 of the freely available Adobe PDF 1.7 Specification or - if you can access it - the subsequent ISO 32000 Specification that took over.