High-level API
File information
structure
and metadata
are functions showing general information about the document.
>>> #File structure >>> structure(doc) {'Version': '1.4', 'Pages': 1, 'Revisions': 1, 'Encrypted': False, 'Paper of 1st page': '215x279mm or 8.5x11.0in (US Letter)'} >>> #File metadata >>> metadata(doc) {'Title': None, 'Author': None, 'Subject': None, 'Keywords': None, 'Creator': None, 'Producer': None, 'CreationDate': None, 'ModDate': None}
Basic text extraction
The function outputs a full extract of the text content, with a spatial awareness: the algorithm tries to respect the original layout, as if characters of all sizes were approximately rendered on a fixed-size grid.
>>> #Extracting text of first page >>> text = pdf.extract_page_text(doc, 0) >>> print(text) Hello World
High-level transformation
rotate
turns pages relatively to their current position by multiples of 90 degrees clockwise. NB: It takes into account the inherited attributes from the page hierarchy.
>>> #Default rotation applies 90 degrees to all pages >>> doc90 = rotate(doc) >>> #Apply 180 degrees to first two page >>> doc180 = doc.rotate(180, [0, 1])
WARNING: To REMOVE something means it still exists but it is hidden.
remove_pages
cuts a set of pages from the document as incremental update: they are not permanently deleted because it is still possible to revert to the previous revision.
>>> #Remove first 3 pages of a 6-page doc >>> second_half_doc = pdf.remove_pages(doc, {0, 1, 2})
keep_pages
does the opposite:
>>> #Keep last 3 pages of a 6-page doc >>> second_half_doc = pdf.keep_pages(doc, {3, 4, 5})
concat
merges documents:
>>> #Concatenate doc2 pages after doc1 pages into a new doc >>> doc = pdf.concat(doc1, doc2)
A Doc object can also be seen as a virtual list of pages. It is possible to use operators to slice or concatenate:
>>> #Equivalent to pdf.keep_pages(doc, {3, 4, 5}) >>> last_3_pages = doc[3:] >>> #Equivalent to pdf.concat(doc1, doc2) >>> doc = doc1 + doc2
add_text_annotation
inserts a simple text annotation in a page.
>>> annotated_doc = add_text_annotation(doc, 0, "abcdefg", [100, 100, 100, 100])
TO BE CONTINUED