Problem → Solution | April 2, 2026 | 5 min read

Copying Text From PDF Pastes as Garbled Characters

PDF text that pastes as random symbols, boxes, or wrong characters has broken font encoding. Learn why it happens and how to recover readable text.

If you select text in a PDF, press Ctrl+C, and paste it only to get random symbols, boxes, question marks, or wrong characters that merely resemble the originals, the cause is a font encoding problem. The PDF stores glyphs (visual letter shapes), but the table that maps those glyphs back to their Unicode text characters is missing or incorrect.

Why PDF Text Extraction Fails

A PDF's text data identifies the visual shapes (glyphs) to draw, not the characters they represent. To extract text, a PDF viewer reads the ToUnicode or Encoding entry in the font dictionary to translate, for example, "glyph index 42" back to "the letter A." If that table is missing, wrong, or based on a custom encoding (common in PDFs generated from TeX/LaTeX with custom fonts, by some engineering tools, and by older PostScript printers), the translation produces garbage. The document looks correct on screen but cannot be read programmatically.
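
The lookup described above can be illustrated with a toy model. This is a minimal Python sketch, not how any particular viewer is implemented: the character codes, the three mapping tables, and the function name extract_text are all invented for illustration. It shows the same sequence of codes yielding readable text, wrong characters, or replacement characters depending only on the mapping table.

```python
# Toy model of PDF text extraction. All codes and tables are hypothetical.

def extract_text(codes, to_unicode):
    """Translate character codes to text using a ToUnicode-style table.

    Codes with no entry become U+FFFD (the replacement character), which
    many applications render as a box or a question mark.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Codes as they might appear on a page for the word "Cab".
codes = [3, 1, 2]

correct_map = {1: "a", 2: "b", 3: "C"}                 # table matches the glyphs
wrong_map = {1: "\u03b1", 2: "\u00df", 3: "\u00a9"}    # table present but wrong
missing_map = {}                                       # no table at all

print(extract_text(codes, correct_map))   # readable text
print(extract_text(codes, wrong_map))     # valid characters, wrong ones
print(extract_text(codes, missing_map))   # only replacement characters
```

In all three cases the page would render identically, because rendering uses the glyph shapes and never consults the table.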

Fix 1: OCR the PDF to Get Correct Text

The most reliable recovery method is to re-OCR the PDF, even if it already has a text layer. FixMyPDF OCR replaces the broken text layer with freshly recognized text derived from the visual appearance of the page, so the new layer reflects what is actually printed. Accuracy depends on font clarity and scan quality; clear printed fonts typically reach 98-99%. After re-OCR, copy/paste and search work correctly.

Fix 2: Use Acrobat's "Export as Text" Option

Adobe Acrobat's text extraction engine handles more encoding edge cases than a basic Ctrl+C copy. In Acrobat, choose File → Export To → Text (Plain). Acrobat attempts to extract the complete document text and saves it as a .txt file. This does not always fix encoding problems: if the ToUnicode table is truly broken, no extraction tool that relies on it can recover the text, and OCR (Fix 1) is the fallback. For many documents with partial encoding issues, though, Acrobat's export produces cleaner results than clipboard copy.
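
To judge whether an exported .txt file is actually clean, a rough heuristic is to measure how many characters are ones that rarely appear in ordinary text. The sketch below is an assumption-laden illustration, not a standard test: the function name suspect_ratio and the choice of Unicode categories are ours. It cannot detect text that was mapped to valid but wrong letters, only the "boxes and symbols" kind of damage.

```python
import unicodedata

# Unicode general categories that rarely occur in ordinary extracted text:
# Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}


def suspect_ratio(text):
    """Return the fraction of non-whitespace characters that look like
    encoding damage (U+FFFD or a character in a suspect category)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspect = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)


print(suspect_ratio("Invoice total: 42.00"))       # clean text
print(suspect_ratio("\ue001\ue002\ufffd\ue003"))   # private-use and replacement characters
```

A ratio near zero suggests the export is usable; a noticeably higher value suggests falling back to OCR. Where to draw that line is a judgment call that depends on the document.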

Fix 3: Check for a Different Version of the Document

If the PDF was sent to you by a colleague or organization, ask for either a Word/text version of the document or a re-generated PDF. A PDF with a text layer that cannot be extracted reliably has a generation problem that should be fixed at the source. For TeX-generated PDFs, the source needs to be compiled with the cmap package and proper font configuration so that correct ToUnicode tables are embedded. Many modern TeX setups handle this automatically, but legacy documents may not.

Identify the Source of the PDF

In Acrobat, open File → Properties → Description tab and check "Application" and "PDF Producer." Common sources of garbled-text PDFs include "GPL Ghostscript" (some versions), old "pdfTeX" (without cmap), some "doPDF" versions, and various "Microsoft Print to PDF" edge cases with custom fonts. If you created the PDF yourself with one of these tools, switch to a different export method. If using LaTeX, add \usepackage{cmap} followed by \usepackage[T1]{fontenc} (cmap should be loaded before fontenc). If using a print driver, try exporting from the application directly (File → Save as PDF) instead.
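
Without Acrobat, the same two fields can sometimes be read straight from the file, since Acrobat's "Application" and "PDF Producer" correspond to the /Creator and /Producer entries of the PDF's document information dictionary. The following is a rough Python sketch under strong assumptions: it only finds entries stored uncompressed as simple literal strings, so it will miss metadata kept in compressed object streams, hex strings, or UTF-16 text, and it leaves escape sequences undecoded. The function name find_producer_fields is hypothetical; a proper PDF library is the more robust choice.

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" where the value is a literal
# string without nested unescaped parentheses. Escaped characters such as
# "\(" are accepted but left undecoded.
_FIELD = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def find_producer_fields(pdf_bytes):
    """Return the first /Producer and /Creator literal strings found in the
    raw bytes of a PDF, as a dict. Missing fields are simply absent."""
    found = {}
    for match in _FIELD.finditer(pdf_bytes):
        key = match.group(1).decode("ascii")
        value = match.group(2).decode("latin-1")
        found.setdefault(key, value)  # keep the first occurrence only
    return found


# Hand-written fragment standing in for part of a real file.
sample = b"1 0 obj\n<< /Producer (GPL Ghostscript 9.26) /Creator (LaTeX with hyperref) >>\nendobj\n"
print(find_producer_fields(sample))
```

For a real document, the bytes would come from reading the file in binary mode, for example open("document.pdf", "rb").read(), where the file name is a placeholder.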
