Problem → Solution · April 2, 2026 · 5 min read

Copying Text From PDF Shows Wrong or Garbled Characters — How to Fix

PDF text that copies as random symbols, question marks, or wrong letters has an encoding problem. Learn why it happens and how to extract the correct text.

If you copy text from a PDF and the pasted result contains random symbols, reversed characters, a single symbol where a ligature such as "fi" should be, or completely wrong letters, the PDF has an encoding problem: specifically, a missing or incorrect ToUnicode mapping in its font definition. The text looks correct on screen because the renderer only needs the character codes to pick glyph shapes from the font, but the mapping from those codes back to Unicode characters is wrong or missing.

Why This Happens

PDF fonts map glyph codes (numbers) to visual shapes. A separate ToUnicode map tells the viewer "glyph code 65 corresponds to Unicode character U+0041 (A)." When ToUnicode is missing or wrong, the viewer renders the correct glyph (so it looks right) but cannot tell you what character it represents when you copy. This is common in: PDFs exported from old InDesign or Quark versions, PDFs from certain Asian-language typesetting systems, PDFs with ligature glyphs (fi, fl, ffi) not mapped to their Unicode equivalents, and scanned-then-OCR'd PDFs with imprecise character mapping.
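
The lookup described above can be pictured with a toy model. This is only a sketch: the codes, the dictionaries, and the `copy_text` helper are invented for illustration, and real viewers have additional fallbacks (such as glyph names) that are not modeled here.

```python
# Toy model of how a viewer turns character codes into clipboard text.
# All codes and mappings below are made up for illustration.

REPLACEMENT = "\ufffd"  # U+FFFD, used here when no Unicode mapping is known


def copy_text(codes, to_unicode):
    """Return the text a viewer could copy for the given character codes."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Codes as they appear in the page content; the font draws "A", "fi", "x".
codes = [65, 12, 120]

good_map = {65: "A", 12: "fi", 120: "x"}        # complete ToUnicode map
partial_map = {65: "A", 120: "x"}               # ligature entry missing
wrong_map = {65: "A", 12: "\ue001", 120: "x"}   # ligature mapped to a private-use code point

print(copy_text(codes, good_map))            # prints "Afix"
print(repr(copy_text(codes, partial_map)))   # the ligature becomes U+FFFD
print(repr(copy_text(codes, wrong_map)))     # the ligature becomes a private-use character
```

In all three cases the page would render identically, because rendering uses only the codes and the font; only the copied text differs.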

Fix 1: Re-Export From the Source Application

If you have the source document (InDesign, Word, Publisher), re-export the PDF from a current version of the application with Unicode text encoding enabled. In InDesign (CS6+): Export PDF → in the Advanced tab, ensure "Include Hyperlinks" and standard encoding options are on. Missing ToUnicode maps were a known bug in older InDesign versions and were fixed in CS5.5. Exporting again from a current version of InDesign, Illustrator, or Word normally produces a PDF with correct ToUnicode maps, which makes the text copyable.

Fix 2: Run OCR to Replace the Text Layer

For PDFs where re-export is not possible, running OCR on the file replaces the broken font-encoded text with freshly recognized Unicode characters. In Acrobat Pro: Tools → Enhance Scans → Recognize Text → In This File. Choose the correct language and run recognition. The OCR engine reads the visual glyphs (not the broken encoding) and writes Unicode characters, so copy-paste then returns readable text. The trade-off: OCR can introduce recognition errors of its own, especially with unusual fonts or small text.
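
If Acrobat Pro is not available, the same idea can be tried from a script. The sketch below assumes the open-source OCRmyPDF command-line tool is installed. That tool is not mentioned in this article and is swapped in here purely as an illustration. Its `--force-ocr` option rasterizes each page and runs OCR even when the page already has a text layer. The helper names `build_ocr_cmd` and `run_ocr` are hypothetical.

```python
import shutil
import subprocess


def build_ocr_cmd(src, dst, language="eng"):
    """Build an OCRmyPDF command that re-OCRs pages even if they already contain text."""
    return ["ocrmypdf", "--force-ocr", "-l", language, src, dst]


def run_ocr(src, dst, language="eng"):
    """Run OCRmyPDF if it is installed; return False when it is not on PATH."""
    if shutil.which("ocrmypdf") is None:
        return False
    subprocess.run(build_ocr_cmd(src, dst, language), check=True)
    return True
```

Calling `run_ocr("broken.pdf", "fixed.pdf")` would write a new file and leave the original untouched, so the two can be compared before the original is discarded.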

Fix 3: Use a PDF Text Extractor That Handles Encoding

Some PDF text extraction tools handle broken ToUnicode maps better than clipboard copy. Try pdftotext (one of the Poppler command-line utilities: pdftotext -enc UTF-8 file.pdf), which attempts to recover the Unicode mapping. Apache PDFBox's ExtractText command also handles some encoding recovery. Neither tool can repair a severely broken encoding, but both often recover more readable text than clipboard copy.
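
A minimal sketch of scripting the pdftotext call, assuming Poppler's pdftotext is on the PATH. The helper names and the `suspicious_ratio` heuristic are hypothetical: it simply counts replacement and private-use characters as a rough signal that the extraction is still garbled.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path):
    """Build the pdftotext command; '-' as the output file sends text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that are U+FFFD or private-use (U+E000..U+F8FF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars if c == "\ufffd" or "\ue000" <= c <= "\uf8ff")
    return bad / len(chars)


def extract_text(pdf_path):
    """Run pdftotext if installed; return None when it is not on PATH."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(build_pdftotext_cmd(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

A high ratio from `suspicious_ratio(extract_text("file.pdf"))` suggests the extractor could not recover the mapping either, in which case OCR is the more promising route. What counts as "high" is a judgment call and depends on the document.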

Fix 4: Ligature and Special Character Lookup

If only certain characters copy wrong, specifically sequences such as "fi", "fl", "ffi", and "ffl" appearing as single symbols, this is a ligature encoding issue. The font uses combined ligature glyphs but maps them to a private-use Unicode code point rather than to the component characters. Acrobat Pro's "Copy with Formatting" sometimes handles ligatures better than plain copy. Alternatively, use search-and-replace on the pasted text: for each ligature, copy the symbol, paste it into Find, and type the correct letters in Replace.
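
For longer documents, the search-and-replace step can be scripted. The sketch below expands the standard Unicode ligature characters (U+FB00 to U+FB04) and accepts an optional `extra` mapping for private-use symbols. The `expand_ligatures` name and the example private-use code point are hypothetical. The right private-use values differ from font to font, so they have to be identified by inspecting the pasted text.

```python
# Standard Unicode ligature characters and their component letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text, extra=None):
    """Replace ligature characters with their component letters.

    `extra` maps additional single characters (for example, private-use
    symbols found in the pasted text) to their replacements.
    """
    table = dict(LIGATURES)
    if extra:
        table.update(extra)
    return text.translate(str.maketrans(table))


# Standard ligature characters are handled without any extra mapping.
print(expand_ligatures("\ufb01le o\ufb03ce"))  # prints "file office"

# A private-use symbol must be mapped by hand; U+E001 is an invented example.
print(expand_ligatures("\ue001nd", {"\ue001": "fi"}))  # prints "find"
```

A fixed table is used here rather than blanket Unicode normalization, so that other characters in the text are left exactly as pasted.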
