OCRonos
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Text Generation / OCR Correction |
| Architecture | LLaMA-based Transformer |
| Languages | French, English, German, Spanish, Italian |
| License | Apache 2.0 |
| Tensor Type | BF16 |
What is OCRonos?
OCRonos is a language model designed to correct poorly digitized text. Developed by PleIAs as part of its Bad Data Toolbox initiative, it targets OCR error correction rather than general-purpose generation. The model is built on the LLaMA-3-8B architecture and was trained using HPC resources from GENCI-IDRIS.
Implementation Details
The model uses a dedicated instruction structure for text correction. The prompt follows a custom format, "### Text ###\n[text]\n\n### Correction ###\n", and the model signals the end of its correction with a specific EOS token, "#END#". It uses BF16 tensors and is optimized for text-generation-inference endpoints.
- Built on LLaMA-3-8B architecture
- Trained on diverse cultural heritage and financial document sources
- Supports 5 major European languages
- Implements custom instruction structure for correction tasks
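The instruction format described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the helper names `build_prompt` and `extract_correction` are hypothetical, no model is actually called, and removing an echoed prompt from the output is an assumption about how a serving stack might return text.

```python
# Prompt format and end marker as quoted in the description above.
PROMPT_TEMPLATE = "### Text ###\n{text}\n\n### Correction ###\n"
EOS_MARKER = "#END#"


def build_prompt(ocr_text: str) -> str:
    """Wrap raw OCR text in the correction prompt format (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(text=ocr_text)


def extract_correction(generated: str, prompt: str = "") -> str:
    """Return only the corrected text from a raw generation (hypothetical helper).

    Assumption: some backends echo the prompt before the completion, so an
    echoed prompt is removed first. Everything from the "#END#" marker onward
    is then discarded.
    """
    if prompt and generated.startswith(prompt):
        generated = generated[len(prompt):]
    correction, _, _ = generated.partition(EOS_MARKER)
    return correction.strip()


if __name__ == "__main__":
    prompt = build_prompt("Tlie qnick brown f ox")
    # Stand-in for a real model response; a real call would go to an inference backend.
    simulated_output = prompt + "The quick brown fox\n#END# trailing tokens"
    print(extract_correction(simulated_output, prompt))
```

When serving the model, passing "#END#" as a stop sequence to the inference backend (where the backend supports one) would avoid generating past the marker; the trimming step above then acts only as a safeguard.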
Core Capabilities
- OCR error correction and text restoration
- Fixing of incorrect word splits and merges
- Broken text structure repair
- Multi-language support
- Preservation of original content meaning
- Handling of highly deteriorated content through synthetic rewriting
Frequently Asked Questions
Q: What makes this model unique?
OCRonos stands out for its specialized training in handling OCR artifacts while maintaining content fidelity. Unlike general-purpose LLMs, it's specifically designed to handle digitization errors and rarely rewrites correct words, making it ideal for document restoration tasks.
Q: What are the recommended use cases?
The model is particularly suitable for correcting damaged PDF sources, making challenging resources usable for LLM applications and search retrieval. It's especially valuable when original sources are too damaged for conventional OCR or are inaccessible.