PolyglotPDF is an advanced PDF processing tool that employs specialized techniques for ultra-fast text, table, and formula recognition in PDF documents, typically completing processing within 1 second. It features OCR capabilities and layout-preserving translation, with full document translations usually completed within 10 seconds (speed may vary depending on the translation API provider).
- Ultra-Fast Recognition: Processes text, tables, and formulas in PDFs within ~1 second
- Layout-Preserving Translation: Maintains original document formatting while translating content
- OCR Support: Handles scanned documents efficiently
- Text-based PDFs: No GPU required
- Quick Translation: Complete PDF translation in approximately 10 seconds
- Flexible API Integration: Compatible with various translation service providers
- Web-based Comparison Interface: Side-by-side comparison of original and translated documents
- Enhanced OCR Capabilities: Improved accuracy in text recognition and processing
- Offline Translation Support: Uses a smaller translation model
- Clone the repository:
git clone https://github.com/yourusername/polyglotpdf.git
cd polyglotpdf
- Install required packages:
pip install -r requirements.txt
- Configure your API key in config.json. The alicloud translation API is not recommended.
- Run the application:
python app.py
- Access the web interface: open your browser and navigate to http://127.0.0.1:8000
- Python 3.8+
- alibabacloud-alimt20181012==1.3.0
- alibabacloud-tea-openapi==0.3.12
- alibabacloud-tea-util==0.3.13
- deepl==1.17.0
- Flask==2.0.1
- Flask-Cors==5.0.0
- langdetect==1.0.9
- Pillow==10.2.0
- PyMuPDF==1.24.0
- pytesseract==0.3.10
- requests==2.31.0
- tencentcloud-sdk-python==3.0.1300
- tiktoken==0.6.0
- Werkzeug==2.0.1
This project leverages PyMuPDF's capabilities for efficient PDF processing and layout preservation.
- PDF chat functionality
- Academic PDF search integration
- Optimization for even faster processing speeds
- Issue Description: Error during text re-editing:
code=4: only Gray, RGB, and CMYK colorspaces supported
- Symptom: Unsupported color space encountered during text block editing
- Current Workaround: Skip text blocks with unsupported color spaces
- Proposed Solution: Switch to OCR mode for entire pages containing unsupported color spaces
- Example: View PDF sample with unsupported color spaces
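The proposed solution can be sketched as a page-level fallback: try to edit each text block, and if any block fails with the colorspace error, give up on in-place editing and mark the whole page for OCR. This is a minimal sketch, not the project's implementation. `plan_page` and `edit_block` are hypothetical names, and matching on the error message text is an assumption about how the failure surfaces.

```python
from typing import Any, Callable, Iterable, List, Tuple

# Message fragment taken from the error quoted above.
COLORSPACE_ERROR = "only Gray, RGB, and CMYK colorspaces supported"


def plan_page(
    blocks: Iterable[Any],
    edit_block: Callable[[Any], Any],  # hypothetical per-block editing routine
) -> Tuple[str, List[Any]]:
    """Return ("edit", results) if every block was edited, else ("ocr", [])."""
    results = []
    for block in blocks:
        try:
            results.append(edit_block(block))
        except Exception as exc:
            if COLORSPACE_ERROR in str(exc):
                # One unsupported block is enough: handle the whole page via OCR.
                return "ocr", []
            raise  # unrelated errors should not be hidden
    return "edit", results
```

Compared with the current workaround of skipping individual blocks, this trades some speed on affected pages for not leaving untranslated text behind.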
Current font configuration in the start function of main.py:
# Current configuration
css=f"* {{font-family:{get_font_by_language(self.target_language)};font-size:auto;color: #111111 ;font-weight:normal;}}"
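A CSS string like this one can also be assembled by a small helper, so that optional properties (letter spacing, line height, an embedded font) are added only when needed. This is a sketch under assumptions: `build_css` and its parameters are hypothetical and not part of the project, and in the real code the fallback family would come from `get_font_by_language`.

```python
from typing import Optional


def build_css(
    fallback_family: str,
    custom_font_path: Optional[str] = None,  # hypothetical, e.g. "fonts/your-font.ttf"
    letter_spacing: Optional[str] = None,
    line_height: Optional[str] = None,
) -> str:
    """Assemble the CSS string passed as `css` (hypothetical helper)."""
    parts = []
    family = fallback_family
    if custom_font_path:
        # .otf files use the 'opentype' format hint, .ttf files 'truetype'.
        fmt = "opentype" if custom_font_path.lower().endswith(".otf") else "truetype"
        parts.append(
            "@font-face {font-family: 'CustomFont'; "
            + "src: url('" + custom_font_path + "') format('" + fmt + "');}"
        )
        family = "'CustomFont', " + fallback_family
    rules = [
        "font-family: " + family,
        "font-size: auto",
        "color: #111111",
        "font-weight: normal",
    ]
    if letter_spacing:
        rules.append("letter-spacing: " + letter_spacing)
    if line_height:
        rules.append("line-height: " + line_height)
    parts.append("* {" + "; ".join(rules) + ";}")
    return "\n".join(parts)
```

Building the string from plain parts avoids the doubled braces that an f-string requires around CSS blocks.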
You can optimize font display through the following methods:
- Modify Default Font Configuration
# Custom font styles
css=f"""* {{
font-family: {get_font_by_language(self.target_language)};
font-size: auto;
color: #111111;
font-weight: normal;
letter-spacing: 0.5px; /* Adjust letter spacing */
line-height: 1.5; /* Adjust line height */
}}"""
- Embed Custom Fonts: You can embed custom fonts by following these steps:
  - Place font files (.ttf, .otf) in the project's fonts directory
  - Use @font-face to declare custom fonts in CSS
css=f"""
@font-face {{
font-family: 'CustomFont';
src: url('fonts/your-font.ttf') format('truetype');
}}
* {{
font-family: 'CustomFont', {get_font_by_language(self.target_language)};
font-size: auto;
font-weight: normal;
}}
"""
This project follows basic principles similar to those of Adobe Acrobat DC's PDF editing, using PyMuPDF for text block recognition and manipulation:
- Core Process:
# Get text blocks from the page
blocks = page.get_text("dict")["blocks"]
# Process each text block
for block in blocks:
    if block.get("type") == 0:  # text block
        bbox = block["bbox"]  # get text block boundary
        text = ""
        font_info = None
        # Collect text and font information
        for line in block["lines"]:
            for span in line["spans"]:
                text += span["text"] + " "
This approach directly processes PDF text blocks, maintaining the original layout while achieving efficient text extraction and modification.
- Technical Choices:
  - Utilizes PyMuPDF for PDF parsing and editing
  - Focuses on text processing
  - Avoids complex operations like AI formula recognition, table processing, or page restructuring
- Why Avoid Complex Processing:
  - AI recognition of formulas, tables, and PDF restructuring faces severe performance bottlenecks
  - Complex AI processing leads to high computational costs
  - Significantly increased processing time (potentially tens of seconds or more)
  - Difficult to deploy at scale with low costs in production environments
  - Not suitable for online services requiring quick response times
- Project Scope:
  - This project serves only to demonstrate the correct approach to layout-preserving PDF translation and AI-assisted PDF reading. In my opinion, converting PDF files to Markdown for large language models to read is not a wise approach.
  - Aims for an optimal performance-to-cost ratio
- Performance:
  - PolyglotPDF API response time: ~1 second per page
  - Low computational resource requirements, suitable for deployment at scale
  - High cost-effectiveness for commercial applications
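To make the core process above concrete, the following sketch completes the loop over the dictionary returned by `page.get_text("dict")`, collecting each text block's bounding box, its text, and the font of its first span. The function name `collect_text_blocks` is hypothetical, and treating the first span's font and size as the block's font information is an assumption, not necessarily what the project does. It operates on the plain dictionary, so it can be tried without opening a PDF.

```python
from typing import Any, Dict, List


def collect_text_blocks(page_dict: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract bbox, text, and first-span font info from each text block."""
    collected = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        pieces = []
        font_info = None
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                pieces.append(span["text"])
                if font_info is None:
                    # Assumption: the first span is representative of the block.
                    font_info = {"font": span.get("font"), "size": span.get("size")}
        collected.append(
            {
                "bbox": block["bbox"],
                "text": " ".join(pieces).strip(),
                "font": font_info,
            }
        )
    return collected
```

With PyMuPDF installed, the input would be `page.get_text("dict")` for a page of an opened document.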