Definition
The use of the PyPDF2 module to programmatically read, merge, split, and manipulate PDF documents.
Why It Matters
PDFs are the “digital paper” of the world, designed to be read but not manipulated. Without automation, tasks like merging thousands of invoices or redacting sensitive data from archives are impractical at scale. Automating PDF handling bridges the gap between human-readable documents and machine-processable data, and it removes the manual-processing bottleneck in data-heavy fields such as law, finance, and research.
Core Concepts
- The “Problematic Format” Model: PDFs are optimized for display and print, not for data extraction. As a result, text extraction (extractText()) is often imperfect and may miss layout details.
- The “Copy-Modify-Write” Pattern: You cannot directly edit a PDF file. Instead, you must:
  - Read the source PDF into a PdfFileReader.
  - Create a new PdfFileWriter.
  - Copy and transform pages into the writer.
  - Write the final result to a new file.
- Security: decrypt('password') for reading and encrypt('password') for writing.
- Transformations: rotateClockwise(deg) and mergePage(other_page) (the “Watermark” mental model).
- Indexing: Uses 0-based indexing for pages (e.g., getPage(0) is the first page).
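The Copy-Modify-Write steps listed above can be sketched as a small helper. This is a minimal illustration rather than a definitive implementation: it assumes the legacy PyPDF2 API used in these notes (PdfFileReader, PdfFileWriter, getPage, addPage, rotateClockwise, available in PyPDF2 releases before 3.0). The names copy_pages, rotate_quarter_turn, and rotate_pdf, as well as the file paths passed in, are hypothetical.

```python
def copy_pages(reader, writer, transform=None):
    """Copy every page from reader into writer, optionally modifying each one.

    Works with any objects exposing the legacy PyPDF2 interface:
    reader.numPages, reader.getPage(i), and writer.addPage(page).
    """
    for index in range(reader.numPages):  # pages are 0-indexed
        page = reader.getPage(index)
        if transform is not None:
            transform(page)  # the "modify" step, applied in place
        writer.addPage(page)
    return writer


def rotate_quarter_turn(page):
    """Example transform: rotate a page 90 degrees clockwise."""
    page.rotateClockwise(90)


def rotate_pdf(src_path, dst_path):
    """Read src_path, rotate all pages, and write the result to dst_path."""
    import PyPDF2  # legacy API: assumes PyPDF2 < 3.0

    with open(src_path, "rb") as src:
        reader = PyPDF2.PdfFileReader(src)
        writer = copy_pages(reader, PyPDF2.PdfFileWriter(), rotate_quarter_turn)
        # Write while the source is still open: page data is read lazily.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

Keeping the transform as a separate callable means the same loop could serve other edits. For instance, a watermark might be applied by passing a function that calls page.mergePage(stamp_page), where stamp_page is a page read from a separate watermark PDF (an assumed input, not part of the original notes).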
import PyPDF2

# Open a PDF file (legacy PyPDF2 syntax, as in the note source; these names were removed in PyPDF2 3.0)
with open('sample.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)
    page = reader.getPage(0)  # first page (0-based index)
    print(page.extractText())
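
The single-page read above extends naturally to a whole document, including the decrypt step mentioned under Security. The following is a hedged sketch under the same assumption of the legacy PyPDF2 API (isEncrypted, decrypt, numPages, getPage, extractText); the function name extract_all_text is hypothetical, and the quality of the extracted text still depends on how the PDF was produced.

```python
def extract_all_text(reader, password=None):
    """Return a list with the extracted text of each page, in page order.

    `reader` is expected to behave like a legacy PyPDF2.PdfFileReader.
    """
    if reader.isEncrypted:
        if password is None:
            raise ValueError("PDF is encrypted; a password is required")
        # Legacy decrypt() returns 0 when the password does not match.
        if reader.decrypt(password) == 0:
            raise ValueError("the supplied password was rejected")
    # Pages use 0-based indexing, so range(numPages) covers all of them.
    return [reader.getPage(i).extractText() for i in range(reader.numPages)]
```

It would be called inside the same kind of with open('sample.pdf', 'rb') block as the snippet above, passing the PdfFileReader as the first argument, so that the file stays open while pages are read.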