PDF is designed for faithful visual display, not for structured data extraction, which makes extracting text from PDFs challenging.
Beyond simple text extraction, pypdf also takes fonts, encodings, and typical character distances into account, which improves the accuracy of the extracted text.
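To see why character distance matters, consider that a PDF page often stores text as positioned fragments with no explicit spaces between them. The sketch below is a simplified illustration, not pypdf's actual implementation: the `Fragment` type, the `join_fragments` helper, and the `space_ratio` threshold are all hypothetical names and values chosen for this example. It inserts a space only when the gap between two fragments is wide compared with the average character width on the line.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A run of text and its horizontal extent on the page (hypothetical type)."""
    text: str
    x_start: float
    x_end: float


def join_fragments(fragments, space_ratio=0.3):
    """Join fragments of one line, adding a space where the gap is wide.

    space_ratio is an assumed threshold: a gap wider than this fraction of
    the average character width is treated as a word break.
    """
    fragments = sorted(fragments, key=lambda f: f.x_start)
    if not fragments:
        return ""

    # Estimate the typical character width from the fragments themselves.
    total_width = sum(f.x_end - f.x_start for f in fragments)
    total_chars = sum(len(f.text) for f in fragments)
    avg_char_width = total_width / total_chars if total_chars else 0.0

    parts = [fragments[0].text]
    for prev, cur in zip(fragments, fragments[1:]):
        gap = cur.x_start - prev.x_end
        if gap > space_ratio * avg_char_width:
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)


# "Hel" and "lo" are nearly touching; "world" starts after a wide gap.
line = [
    Fragment("Hel", 0.0, 15.0),
    Fragment("lo", 15.2, 25.0),
    Fragment("world", 30.0, 55.0),
]
result = join_fragments(line)
print(result)
```

A fixed threshold would misfire as font sizes change, which is why the sketch scales the threshold by the observed character width. A real extractor also has to handle font metrics, encodings, and vertical positioning, which this example leaves out.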