Browse

Introduction

Inspecting the internal structure of a PDF file requires several processing steps (decompression, parsing, xref indexing, etc.) before the raw bytes make sense.
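
To give a feel for two of these steps, here is a minimal sketch that uses only the Python standard library. It is not how PDFSyntax is implemented. The helper names `find_startxref` and `inflate_stream` are hypothetical, and the byte strings are hand-made fragments rather than a complete PDF file.

```python
import re
import zlib


def find_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword.

    That offset tells a reader where the most recent cross-reference
    section starts. Hypothetical helper, for illustration only.
    """
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])


def inflate_stream(raw: bytes) -> bytes:
    """Decompress a stream payload stored with the FlateDecode filter (zlib)."""
    return zlib.decompress(raw)


# Hand-made fragments (assumptions for this sketch, not taken from a real file).
content = b"BT /F1 24 Tf (Hello) Tj ET"
compressed = zlib.compress(content)
tail = b"trailer\n<< /Size 6 >>\nstartxref\n408\n%%EOF\n"

xref_offset = find_startxref(tail)
readable = inflate_stream(compressed)
```

The compressed payload is unreadable as stored, and the xref offset is only meaningful once it has been located and followed, which is why a plain hex or text dump of a PDF is rarely enough.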

PDFSyntax takes care of this processing and proposes a visualization approach that consists of adding information and hyperlinks on top of a text that is mostly a pretty-print of the uncompressed PDF data. It respects the physical flow of the file while offering logical navigation between revisions (incremental updates) and between objects.

Architecture

PDFSyntax is a self-contained Python package, without any dependency, and is principally a low-level PDF library. The browse command is its highest-level and most visible part. It produces static HTML that stays interactive enough to navigate even when JavaScript is disabled.

Demo

Please try the LIVE DEMO of a full static HTML output that you can browse, at https://pdfsyntax.dev/simple_text_string.html (hosted on GitHub Pages).

Here is the same example, as a partial screenshot:

[Image: PDFSyntax screenshot]

NB: this is the output produced for the Simple Text String example file from the PDF Specification.

Usage

PDFSyntax can be installed from the GitHub repo (no dependency) or from PyPI:

    pip install pdfsyntax

Redirect the standard output to a file that you can open in your browser:

    python3 -m pdfsyntax browse file.pdf > inspection_file.html
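
The same redirection can be scripted from Python. This is a minimal sketch that only wraps the command line shown above with `subprocess`. The function names `build_browse_command` and `browse_to_html` are hypothetical helpers, not part of the PDFSyntax API, and the sketch assumes that `pdfsyntax` is installed in the current interpreter's environment.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_browse_command(pdf_path: str) -> List[str]:
    """Build the argv equivalent of `python3 -m pdfsyntax browse <pdf_path>`."""
    # sys.executable ensures the same interpreter (and environment) is used.
    return [sys.executable, "-m", "pdfsyntax", "browse", pdf_path]


def browse_to_html(pdf_path: str, html_path: str) -> None:
    """Run the browse command and save its standard output as an HTML file."""
    result = subprocess.run(
        build_browse_command(pdf_path),
        capture_output=True,  # collect stdout instead of printing it
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    Path(html_path).write_bytes(result.stdout)


# Example call (requires pdfsyntax and an existing file.pdf):
# browse_to_html("file.pdf", "inspection_file.html")
```

Writing the captured bytes unchanged avoids guessing at the output encoding; the resulting file can then be opened in a browser like the one produced by the shell redirection.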

Features

The generated HTML "looks" like an augmented raw PDF file with the following additional work:

WARNING: Encrypted files are not supported yet.