2025-12-21 – Multimodal Document Analysis with Claude

💡 Claude and PDFs — Extracting Structure from Complex Documents

Claude's native PDF support goes well beyond simple text extraction. When you pass a PDF as a file attachment (or base64-encoded in the API), Claude sees the document as it was intended to be read — with layout awareness, table recognition, and figure understanding. The practical result is that Claude can answer questions about a 200-page financial report, extract all table data into structured JSON, or summarise individual sections of a dense technical specification without you having to parse the PDF yourself. The 200,000-token context window means even very large documents can often fit in a single call.

High-value PDF workflows

Structured extraction: "Parse every table in this PDF and return each as a JSON array, with the table heading as the key." Claude handles merged cells, spanning headers, and footnotes better than most dedicated PDF parsers.
Contract analysis: "List all clauses relating to termination, liability, and IP ownership. For each, quote the exact text and its page number." Cite-back is one of Claude's strongest document analysis patterns — it reduces hallucination risk because you can verify the source.
Cross-document comparison: Pass two PDFs (e.g. two versions of a contract) and ask Claude to identify every material difference. Works surprisingly well for legal and regulatory review.
Research paper digestion: Ask Claude to produce: (1) a one-paragraph lay summary, (2) the three most important findings, (3) the limitations the authors acknowledge, (4) the datasets used, and (5) what open questions remain. A structured ask gets a structured answer.

For very large PDFs

If a document exceeds the context window, chunk it by section rather than by character count. Ask Claude to extract chapter headings first, then process one chapter at a time, accumulating results. This preserves semantic coherence across chunks.

💡 Image Understanding — What Claude Sees and How to Ask About It

Claude's vision capabilities handle a wide variety of image types — screenshots, diagrams, charts, photographs, hand-drawn wireframes, and scanned documents. The model can describe, interpret, compare, and reason about images in the same conversation as text. But how you frame the question significantly affects the quality of the response. Here are the patterns that produce the most reliable results.

Vision prompting patterns

Be specific about what you want extracted: "List every UI element visible in this screenshot, its label (if any), and its approximate position (top/middle/bottom × left/centre/right)." Specificity beats open-ended description for data extraction.
Use images to anchor code review: Paste a screenshot of an error, a broken UI, or a failing chart alongside the relevant code. "Given the error in the screenshot and the code below, what is causing this?" is a highly effective debugging prompt.
Charts and graphs: Claude can extract data series from a chart image and produce a table or describe trends. For critical data, always verify numerically — chart-reading is approximate.
Diagrams and architecture: Paste an architecture diagram and ask Claude to (a) describe the data flow, (b) identify single points of failure, or (c) compare it against a described target architecture.

Resolution matters

Claude performs best on images where the relevant content is clearly legible. Small text in screenshots, compressed JPEGs, or low-contrast diagrams increase error rates. If accuracy is critical, pass the image at the highest available resolution.