The Clause
There are 200 PDFs in the folder. They arrived this morning from legal. Somewhere inside them is the clause that determines whether your company owes $340,000 or $0. You have until end of day to find it.
You open the first PDF. It is 47 pages. The table of contents is an image, not text. The clause numbering restarted at Section 1 after an amendment on page 23. You search for "indemnification" and get 14 hits across 9 pages, each in a slightly different context. You are on PDF one of two hundred.
This is the reality of PDF processing. PDFs are containers, not documents. They store text as positioned glyphs, tables as invisible grids, images as embedded binaries. There is no DOM. There is no query language. A PDF is a printed page that happens to live on a hard drive. It resists every form of automation that works on every other file format.
Processing PDFs manually is reading every book in a library to find one sentence. What you need is an index that builds itself.
This article builds that index. A terminal pipeline that takes a directory of PDFs, feeds each one to an AI agent, extracts structured data, classifies documents by type, flags important clauses, and generates a summary report with page references. The entire pipeline is a shell script. It runs in minutes, not days.
What You Will Build
A three-stage CLI pipeline:
- Extract -- Feed each PDF to Claude Code, which reads the document natively and extracts text, tables, key clauses, and metadata into structured JSON.
- Classify -- The agent categorizes each document (contract, invoice, compliance filing, amendment, correspondence) and flags critical terms.
- Report -- A summary report aggregates findings across all documents, with page references back to the source PDFs.
The output is a directory of JSON extraction files, a classification index, and a markdown report you can hand to legal, compliance, or your manager.
Prerequisites
- Claude Code installed and authenticated. See First Hour with Claude Code for setup.
- A directory of PDFs. The pipeline works on any count, but the examples use 200 documents as the reference scenario.
- Basic terminal comfort. You will run shell scripts and read JSON output.
mkdir -p ~/pdf-pipeline/source ~/pdf-pipeline/output
# Copy or symlink your PDFs into ~/pdf-pipeline/source/
Why PDFs Break Every Pipeline
Before building the solution, it is worth understanding why PDFs are uniquely difficult. This is not academic -- it determines the design of the pipeline.
Text extraction fails silently. Tools like pdftotext extract character sequences, but they lose table structure, header hierarchy, and reading order. A two-column layout comes out as interleaved garbage. A table becomes a stream of numbers with no column association.
OCR adds noise. Scanned PDFs require optical character recognition. OCR on legal documents typically achieves 95-98% accuracy. That sounds good until you realize a 50-page contract has roughly 15,000 words. At 97% word-level accuracy, that is roughly 450 misrecognized words -- enough to corrupt dollar amounts, dates, and party names.
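The arithmetic behind that estimate:

```python
# 3% error rate over a ~15,000-word contract.
words = 15_000
accuracy = 0.97
errors = round(words * (1 - accuracy))
print(errors)  # 450
```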
Structure is implicit. A Word document has heading styles, table objects, and numbered list elements. A PDF has none of this. "Section 4.2" is just text rendered at a certain font size. There is no metadata saying "this is a subsection." Extracting the document hierarchy requires understanding visual layout, not parsing a data structure.
Traditional toolchains require per-format code. Building a PDF pipeline with PyPDF2, Tabula, Tesseract, and spaCy means writing format-specific extraction logic, table detection heuristics, and entity recognition rules. When the document format changes -- different contract template, different table layout -- the code breaks.
AI changes the equation. Claude reads PDFs natively. It sees the rendered page the way a human does: text in context, tables as grids, headers as structural elements. It does not need OCR for text-based PDFs. It does not need table detection algorithms. It reads the page.
Step 1: Configure CLAUDE.md for Document Extraction
The foundation is a CLAUDE.md that tells the agent how to process documents. Create this in your pipeline directory:
# PDF Document Processing Rules
## Extraction Standards
- Read each PDF in full. Do not skip pages.
- For each document, extract:
- Title (from cover page or first header)
- Document type (contract, invoice, amendment, compliance, correspondence, report)
- Date (effective date, signing date, or issue date)
- Parties (all named entities that are parties to the document)
- Key clauses (indemnification, limitation of liability, termination, payment terms, confidentiality, governing law)
- Tables (preserve structure as arrays of objects)
- Dollar amounts (every monetary figure with its context)
- Deadlines and dates (every date mentioned with its context)
## Classification Rules
1. Read the full document before classifying. Do not classify based on filename alone.
2. If a document contains amendment language, classify as "amendment" even if the base document is a contract.
3. Flag any clause that modifies standard liability, indemnification, or payment terms.
4. Flag any dollar amount above $100,000 with surrounding context.
## Output Format
- One JSON file per PDF, named {original-filename}.json
- All monetary values in cents (integer) with currency code
- All dates in ISO 8601 format
- Page references for every extracted element
## Safety
- Never modify source PDFs
- Log every file processed with timestamp and status
This configuration encodes your extraction rules once. Every run follows the same standards, whether you are processing 10 documents or 10,000.
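Those rules are also mechanically checkable after a run. The record below is a hypothetical example of a conforming extraction, not real output, and the validator is a sketch -- extend it to cover whichever rules matter most to you:

```python
from datetime import date

# Hypothetical extraction record following the CLAUDE.md rules above:
# integer cents, ISO 8601 dates, page references on every element.
record = {
    "filename": "msa-acme-2023.pdf",
    "document_type": "contract",
    "date": "2023-04-17",
    "monetary_values": [
        {"amount_cents": 34_000_000, "currency": "USD",
         "context": "liability cap", "page": 12},
    ],
    "key_clauses": [
        {"type": "indemnification", "text": "...", "page": 12, "flag": "critical"},
    ],
}

def validate(rec: dict) -> list[str]:
    """Return a list of rule violations for one extraction record."""
    problems = []
    date.fromisoformat(rec["date"])  # raises ValueError if not ISO 8601
    for mv in rec.get("monetary_values", []):
        if not isinstance(mv["amount_cents"], int):
            problems.append(f"non-integer amount on page {mv['page']}")
    for clause in rec.get("key_clauses", []):
        if "page" not in clause:
            problems.append(f"clause {clause['type']} missing page reference")
    return problems

print(validate(record))  # []
```

Run this over the output directory after each batch; an empty list per file means the agent held to the contract you wrote in CLAUDE.md.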
Step 2: The Extraction Script
Here is the single-file extraction script. It iterates over every PDF in the source directory, feeds each one to Claude Code with extraction instructions, and saves structured JSON output.
#!/usr/bin/env bash
set -euo pipefail
# === Configuration ===
SOURCE_DIR="${1:?Usage: pdf-extract.sh <source-dir>}"
OUTPUT_DIR="${SOURCE_DIR}/../output"
REPORT_DIR="${SOURCE_DIR}/../reports"
LOG_FILE="${OUTPUT_DIR}/processing.log"
mkdir -p "$OUTPUT_DIR" "$REPORT_DIR"
echo "PDF Processing started at $(date -Iseconds)" > "$LOG_FILE"
# === Count files ===
PDF_COUNT=$(find "$SOURCE_DIR" -maxdepth 1 -name "*.pdf" -type f | wc -l | tr -d ' ')
echo "Found $PDF_COUNT PDF files in $SOURCE_DIR"
if [[ "$PDF_COUNT" -eq 0 ]]; then
echo "No PDF files found."
exit 0
fi
PROCESSED=0
FAILED=0
# === Process each PDF ===
for pdf_file in "$SOURCE_DIR"/*.pdf; do
filename=$(basename "$pdf_file" .pdf)
output_file="$OUTPUT_DIR/${filename}.json"
# Skip already processed files
if [[ -f "$output_file" ]]; then
echo " skip: $filename (already processed)"
PROCESSED=$((PROCESSED + 1))  # not ((PROCESSED++)): that returns nonzero when the count is 0, tripping set -e
continue
fi
echo " processing: $filename..."
if claude -p "Read this PDF file completely: $pdf_file
Extract the following into a JSON object:
{
\"filename\": \"original filename\",
\"document_type\": \"contract|invoice|amendment|compliance|correspondence|report|other\",
\"title\": \"document title\",
\"date\": \"ISO 8601 date\",
\"parties\": [\"Party A\", \"Party B\"],
\"page_count\": number,
\"summary\": \"2-3 sentence summary\",
\"key_clauses\": [
{
\"type\": \"indemnification|liability|termination|payment|confidentiality|governing_law|other\",
\"text\": \"exact clause text\",
\"page\": number,
\"flag\": \"normal|attention|critical\"
}
],
\"tables\": [
{
\"description\": \"what the table contains\",
\"page\": number,
\"headers\": [\"col1\", \"col2\"],
\"rows\": [[\"val1\", \"val2\"]]
}
],
\"monetary_values\": [
{
\"amount_cents\": number,
\"currency\": \"USD\",
\"context\": \"what this amount refers to\",
\"page\": number
}
],
\"dates_mentioned\": [
{
\"date\": \"ISO 8601\",
\"context\": \"what this date refers to\",
\"page\": number
}
],
\"flags\": [\"list of anything unusual or requiring attention\"]
}
Output ONLY valid JSON. No markdown fences. No explanation." \
> "$output_file" 2>/dev/null; then
echo "$(date -Iseconds) OK $filename" >> "$LOG_FILE"
PROCESSED=$((PROCESSED + 1))
else
echo " FAILED: $filename"
echo "$(date -Iseconds) FAIL $filename" >> "$LOG_FILE"
rm -f "$output_file"
FAILED=$((FAILED + 1))
fi
done
echo ""
echo "=== Extraction Complete ==="
echo "Processed: $PROCESSED"
echo "Failed: $FAILED"
echo "Output: $OUTPUT_DIR/"
echo "Log: $LOG_FILE"
Save this as pdf-extract.sh and make it executable:
chmod +x pdf-extract.sh
./pdf-extract.sh ~/pdf-pipeline/source
Key design decisions:
- Skip already processed files. If the script fails midway through 200 PDFs, rerun it. It picks up where it left off.
- One JSON file per PDF. Each extraction is independent. You can inspect, re-process, or discard individual results without affecting others.
- Structured logging. The processing log tells you exactly which files succeeded and which failed, with timestamps.
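The log format also makes failure triage scriptable. A sketch that lists the files whose most recent status is FAIL, so a rerun after a fix only touches those:

```python
def failed_files(log_text: str) -> list[str]:
    """Names of PDFs whose last logged status is FAIL.

    Assumes the format written by pdf-extract.sh: a header line,
    then '<timestamp> OK <name>' or '<timestamp> FAIL <name>'.
    """
    status: dict[str, str] = {}
    for line in log_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[1] in ("OK", "FAIL"):
            status[parts[2]] = parts[1]  # later entries win after a rerun
    return sorted(name for name, s in status.items() if s == "FAIL")

log = """PDF Processing started at 2024-05-01T09:00:00+00:00
2024-05-01T09:00:05+00:00 OK msa-acme
2024-05-01T09:00:22+00:00 FAIL invoice-0042
2024-05-01T09:01:10+00:00 OK invoice-0042
2024-05-01T09:01:30+00:00 FAIL side-letter
"""
print(failed_files(log))  # ['side-letter']
```

Note that invoice-0042 does not appear: it failed once, then succeeded on a rerun, and the later entry wins.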
Step 3: The Classification Script
Extraction gives you raw data per document. Classification gives you the big picture: what types of documents are in the collection, and which ones need attention.
#!/usr/bin/env bash
set -euo pipefail
OUTPUT_DIR="${1:?Usage: pdf-classify.sh <output-dir>}"
REPORT_DIR="${OUTPUT_DIR}/../reports"
INDEX_FILE="$REPORT_DIR/classification_index.json"
mkdir -p "$REPORT_DIR"
JSON_COUNT=$(find "$OUTPUT_DIR" -maxdepth 1 -name "*.json" -type f | wc -l | tr -d ' ')
echo "Classifying $JSON_COUNT extracted documents..."
# Merge all extraction JSONs into a single array
echo "[" > /tmp/all_extractions.json
first=true
for json_file in "$OUTPUT_DIR"/*.json; do
[[ "$(basename "$json_file")" == "processing.log" ]] && continue
if $first; then
first=false
else
echo "," >> /tmp/all_extractions.json
fi
cat "$json_file" >> /tmp/all_extractions.json
done
echo "]" >> /tmp/all_extractions.json
claude -p "You are a document analyst. Here is a JSON array of extracted PDF data:
$(cat /tmp/all_extractions.json)
Create a classification index as a JSON object:
{
\"total_documents\": number,
\"by_type\": {
\"contract\": {\"count\": number, \"files\": [\"filename1\", \"filename2\"]},
\"invoice\": {\"count\": number, \"files\": [...]},
...
},
\"flagged_documents\": [
{
\"filename\": \"name\",
\"reason\": \"why flagged\",
\"severity\": \"attention|critical\",
\"details\": \"specific clause or amount\"
}
],
\"total_monetary_value_cents\": number,
\"date_range\": {\"earliest\": \"ISO date\", \"latest\": \"ISO date\"},
\"parties_involved\": [\"unique list of all parties\"]
}
Output ONLY valid JSON." > "$INDEX_FILE"
echo "Classification index saved to $INDEX_FILE"
# Show summary
echo ""
echo "=== Document Classification ==="
python3 -c "
import json
idx = json.load(open('$INDEX_FILE'))
print(f\"Total documents: {idx['total_documents']}\")
print(f\"\\nBy type:\")
for dtype, info in idx.get('by_type', {}).items():
print(f\" {dtype}: {info['count']}\")
flagged = idx.get('flagged_documents', [])
if flagged:
print(f\"\\nFlagged documents: {len(flagged)}\")
for f in flagged:
print(f\" [{f['severity'].upper()}] {f['filename']}: {f['reason']}\")
"
When you run this on the output of Step 2, you get a single classification index that tells you: 87 contracts, 45 invoices, 32 amendments, 19 compliance filings, 17 correspondence documents. And critically: 6 documents flagged, 2 critical.
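One caveat: the shell merge assumes every extraction file is valid JSON. A single corrupt extraction breaks the whole array. A defensive variant skips and names the bad files instead -- a sketch, not a drop-in replacement for the script above:

```python
import json
from pathlib import Path

def merge_extractions(output_dir: str) -> tuple[list[dict], list[str]]:
    """Merge per-PDF extraction JSONs into one list, skipping invalid files."""
    merged, bad = [], []
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            merged.append(json.loads(path.read_text()))
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path.name)  # candidates for re-extraction
    return merged, bad
```

Write `merged` out with `json.dump` to /tmp/all_extractions.json, and feed `bad` back into the extraction stage.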
Step 4: The Summary Report
The final stage generates a human-readable report with page references.
#!/usr/bin/env bash
set -euo pipefail
OUTPUT_DIR="${1:?Usage: pdf-report.sh <output-dir>}"
REPORT_DIR="${OUTPUT_DIR}/../reports"
INDEX_FILE="$REPORT_DIR/classification_index.json"
REPORT_FILE="$REPORT_DIR/summary_report.md"
if [[ ! -f "$INDEX_FILE" ]]; then
echo "Error: Run pdf-classify.sh first."
exit 1
fi
echo "Generating summary report..."
claude -p "You are a senior document analyst preparing a summary report for legal review.
Classification index:
$(cat "$INDEX_FILE")
Full extraction data:
$(cat /tmp/all_extractions.json)
Generate a comprehensive markdown report with these sections:
# Document Collection Summary Report
## Executive Summary
- Total documents, types breakdown, date range, parties involved
- Total monetary value across all documents
- Number of flagged items requiring attention
## Critical Flags
For each flagged document:
- Document name and type
- What was flagged and why
- Exact clause text with page reference
- Recommended action
## Document Type Breakdown
For each document type:
- Count and list of files
- Key terms and dates
- Notable variations from standard language
## Monetary Summary
- All monetary values grouped by document
- Total obligations by party
- Payment timelines
## Key Dates and Deadlines
- Chronological list of all deadlines
- Documents expiring within 90 days
- Renewal dates
## Appendix: Document Index
Table with: filename, type, date, parties, page count, flags
Every claim must include the source filename and page number.
Format as clean markdown with tables where appropriate." > "$REPORT_FILE"
echo "Report saved to $REPORT_FILE"
echo ""
head -30 "$REPORT_FILE"
The output is a markdown report that a lawyer, compliance officer, or executive can read without opening a single PDF. Every finding cites a specific document and page number. The report answers the question you started with: where is the clause that matters, and what does it say.
The Full Pipeline: One Command
Wrap all three stages into a single orchestration script:
#!/usr/bin/env bash
set -euo pipefail
SOURCE_DIR="${1:?Usage: pdf-pipeline.sh <source-dir>}"
echo "=== PDF Processing Pipeline ==="
echo "Source: $SOURCE_DIR"
echo ""
echo "--- Stage 1: Extraction ---"
./pdf-extract.sh "$SOURCE_DIR"
echo ""
echo "--- Stage 2: Classification ---"
./pdf-classify.sh "$SOURCE_DIR/../output"
echo ""
echo "--- Stage 3: Report Generation ---"
./pdf-report.sh "$SOURCE_DIR/../output"
echo ""
echo "=== Pipeline Complete ==="
echo "Extractions: $(ls "$SOURCE_DIR/../output/"*.json 2>/dev/null | wc -l | tr -d ' ') files"
echo "Report: $SOURCE_DIR/../reports/summary_report.md"
Run it:
chmod +x pdf-pipeline.sh pdf-extract.sh pdf-classify.sh pdf-report.sh
./pdf-pipeline.sh ~/pdf-pipeline/source
200 PDFs. Three stages. One command. The output directory contains individual JSON extractions, a classification index, and a summary report. The lawyer who sent you those PDFs this morning gets a structured answer before lunch.
Extended Thinking for Complex Documents
Some PDFs are not simple. A 200-page master services agreement with three amendments and a side letter requires the agent to hold contradictory clauses in context and determine which version governs. This is where extended thinking pays for itself.
claude -p "Read this PDF: $pdf_file
This is a complex legal document with amendments. Use careful reasoning:
1. Identify the base agreement and all amendments
2. For each clause that was amended, determine the CURRENT governing version
3. Flag any contradictions between the base agreement and amendments
4. Identify any clauses where the amendment language is ambiguous
Think hard about the amendment chain. Think step by step before producing the final extraction." \
> "$OUTPUT_DIR/$(basename "$pdf_file" .pdf).json"
The phrase "think hard" is one of Claude Code's thinking triggers ("think" < "think hard" < "ultrathink"), each allocating a larger reasoning budget. The agent works through the document structure, tracks amendment chains, and resolves conflicts before producing the final output. This costs more tokens per document, but for a contract that determines six-figure liability, the cost is trivial.
Use extended thinking selectively. Simple invoices and one-page correspondence do not need it. Reserve it for:
- Multi-party contracts with amendment histories
- Compliance documents with cross-references to regulations
- Financial statements with complex table structures
- Any document where clauses reference other clauses
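The routing decision can itself be automated from the Step 1 output. A sketch, assuming the document_type, page_count, and flags fields from the extraction schema; the thresholds are arbitrary starting points, not tuned values:

```python
def needs_extended_thinking(extraction: dict) -> bool:
    """Heuristic: reserve extended thinking for the document classes above."""
    doc_type = extraction.get("document_type", "other")
    if doc_type in ("invoice", "correspondence"):
        return False  # simple documents: fast extraction is enough
    if doc_type == "amendment":
        return True   # amendment chains need conflict resolution
    # Long documents and anything already flagged get careful reasoning.
    return extraction.get("page_count", 0) > 50 or bool(extraction.get("flags", []))

print(needs_extended_thinking({"document_type": "amendment", "page_count": 3}))   # True
print(needs_extended_thinking({"document_type": "invoice", "page_count": 200}))   # False
```

Run this against each first-pass extraction and queue the True cases for a second, slower pass.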
Parallel Processing with Split Terminals
Processing 200 PDFs sequentially works but is slow. Each PDF takes 10-30 seconds depending on length and complexity. For 200 documents, that is 30 minutes to an hour.
Speed this up by running multiple extraction processes in parallel. Split your terminal into panes, each running the extraction script on a subset of the files. (The pdf-extract-single.sh below is a single-file variant of Step 2's script -- same prompt, one PDF path as the argument -- left as an exercise.)
# Split the 200 PDFs into two batches of 100
ls ~/pdf-pipeline/source/*.pdf | head -100 > /tmp/batch1.txt
ls ~/pdf-pipeline/source/*.pdf | tail -n +101 > /tmp/batch2.txt
# Pane 1:
while read -r pdf; do
./pdf-extract-single.sh "$pdf"
done < /tmp/batch1.txt
# Pane 2:
while read -r pdf; do
./pdf-extract-single.sh "$pdf"
done < /tmp/batch2.txt
Two panes, two parallel streams, half the time. Four panes, quarter the time. The extraction outputs are independent JSON files, so there are no race conditions. The classification and report stages run after all extractions complete.
A split-terminal layout also lets you monitor progress in one pane while reviewing completed extractions in another. Open a JSON extraction file in the right pane as the left pane continues processing. Spot-check the agent's work in real time.
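The head/tail split assumes a known file count. For arbitrary counts and pane counts, a round-robin splitter keeps batch sizes within one file of each other -- a sketch:

```python
def split_batches(files: list[str], panes: int) -> list[list[str]]:
    """Divide a file list into `panes` near-equal batches, one per terminal pane."""
    batches: list[list[str]] = [[] for _ in range(panes)]
    for i, f in enumerate(files):
        batches[i % panes].append(f)  # round-robin assignment
    return batches

files = [f"doc-{n:03}.pdf" for n in range(7)]
for i, batch in enumerate(split_batches(files, 3), start=1):
    print(f"batch{i}: {len(batch)} files")
# batch1: 3 files
# batch2: 2 files
# batch3: 2 files
```

Write each batch to /tmp/batchN.txt and point one pane at each file, exactly as above.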
Adapting the Pipeline
The scripts above are a starting point. Here are common modifications for specific use cases:
Contract Review
Add clause-specific extraction to the CLAUDE.md:
## Contract-Specific Rules
- Extract the entire indemnification section verbatim
- Flag any limitation of liability below $1,000,000
- Identify governing law and dispute resolution mechanism
- Note any non-standard termination conditions
- Extract all defined terms and their definitions
Invoice Processing
Switch the extraction focus to line items and payment terms:
## Invoice-Specific Rules
- Extract every line item: description, quantity, unit price, total
- Identify payment terms (Net 30, Net 60, etc.)
- Extract bank details and payment instructions
- Flag any invoice with a total exceeding $50,000
- Match invoice numbers to PO numbers where present
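Once those rules produce extractions, the downstream check is a plain JSON filter. A sketch against the Step 1 schema (the field names are assumptions about your own output, and the default threshold mirrors the $50,000 rule above):

```python
def large_invoices(extractions: list[dict],
                   threshold_cents: int = 5_000_000) -> list[tuple[str, int]]:
    """Invoices whose largest monetary value exceeds the threshold."""
    hits = []
    for doc in extractions:
        if doc.get("document_type") != "invoice":
            continue
        amounts = [mv["amount_cents"] for mv in doc.get("monetary_values", [])]
        if amounts and max(amounts) > threshold_cents:
            hits.append((doc["filename"], max(amounts)))
    return hits

docs = [
    {"filename": "inv-001.pdf", "document_type": "invoice",
     "monetary_values": [{"amount_cents": 7_500_000}]},   # $75,000
    {"filename": "inv-002.pdf", "document_type": "invoice",
     "monetary_values": [{"amount_cents": 1_200_000}]},   # $12,000
]
print(large_invoices(docs))  # [('inv-001.pdf', 7500000)]
```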
Compliance Audit
Focus on regulatory references and requirement mapping:
## Compliance Rules
- Identify all regulatory references (CFR, GDPR articles, SOX sections)
- Map each document to the specific compliance requirement it satisfies
- Flag any document that is expired or missing required signatures
- Extract certification dates and expiry dates
- Note any gaps: requirements referenced but not evidenced
When to Use This (and When Not To)
This pipeline works well for:
- Due diligence document review (M&A, investment, vendor assessment)
- Contract portfolio analysis (what are our total obligations?)
- Compliance audit preparation (do we have evidence for every requirement?)
- Invoice batch processing (extract line items for accounting systems)
- Legal discovery (find specific clauses across a document collection)
Use specialized tools when:
- You need certified OCR output for regulatory submission (use ABBYY or similar)
- The PDFs contain primarily images with minimal text (use dedicated OCR pipelines first)
- You need real-time processing of incoming documents (build an event-driven system)
- The document count exceeds 1,000 and processing time matters (add queueing infrastructure)
The sweet spot is ad-hoc batch processing of document collections in the 10-500 range. Large enough that manual review is impractical. Small enough that a terminal pipeline handles it without infrastructure.
Key Takeaways
PDFs resist automation because they are containers, not data. Traditional toolchains require per-format extraction logic that breaks when document templates change. An AI agent reads the rendered page and extracts structured data without format-specific code.
The pipeline pattern:
- Extract -- one PDF in, one JSON out, with page references for every finding
- Classify -- aggregate extractions into a typed index with severity flags
- Report -- generate a human-readable summary with citations to source documents
The core design principles:
- Idempotent extraction. Skip already-processed files. Rerun safely after failures.
- Structured output. JSON per document. Machine-readable. Composable with other tools.
- Extended thinking for complexity. Simple documents get fast extraction. Complex documents get careful reasoning.
- Parallel processing. Independent outputs mean independent processes. Scale with terminal panes.
Two hundred PDFs arrived this morning. By lunch, you have a report that tells you exactly which clause matters, on which page, in which document. The library indexed itself.