TF-ID-large-no-caption
| Property | Value |
|---|---|
| Parameter Count | 823M parameters |
| Model Type | Image-Text-to-Text Transformer |
| License | MIT |
| Framework | PyTorch |
| Author | Yifei Hu |
What is TF-ID-large-no-caption?
TF-ID-large-no-caption is a specialized object detection model designed to identify and extract tables and figures from academic papers without including their caption text. Built on Microsoft's Florence-2 architecture, this model represents the largest variant in the TF-ID family, offering superior performance with 97.32% accuracy on test datasets.
Implementation Details
The model utilizes a transformer-based architecture fine-tuned on manually annotated academic papers from the Hugging Face Daily Papers dataset. It processes single-page images and outputs precise bounding box coordinates for tables and figures, excluding caption text areas; a usage sketch follows the list below.
- Architecture based on Florence-2 with 823M parameters
- Uses F32 tensor type for computation
- Implements image-text-to-text pipeline for detection tasks
- Trained on human-verified annotations
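Since the model is built on Florence-2, inference can follow the standard Florence-2 pattern with the `transformers` library. The snippet below is a minimal sketch, not the card's official example: the repository id `yifeihu/TF-ID-large-no-caption`, the page image filename, and the use of the `<OD>` task prompt are assumptions based on the Florence-2 convention.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id based on the author/model name; adjust if it differs.
model_id = "yifeihu/TF-ID-large-no-caption"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Florence-2 models are driven by a task token; "<OD>" is the standard
# object-detection prompt.
prompt = "<OD>"
image = Image.open("page_1.png").convert("RGB")  # one full paper page (hypothetical file)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the raw generated text into labelled bounding boxes.
result = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(result)  # expected shape: {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}
```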
Core Capabilities
- High-precision table and figure detection (97.32% success rate)
- Outputs bounding box coordinates in the format [x1, y1, x2, y2] (see the cropping sketch after this list)
- Processes full academic paper pages
- Excludes caption text from detection boxes
- Supports batch processing with PyTorch integration
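Because the boxes exclude caption text, cropping them directly yields clean table and figure images. The sketch below assumes the Florence-2 `<OD>` output layout and reuses the hypothetical `result` variable and page image from the previous example.

```python
from PIL import Image

image = Image.open("page_1.png").convert("RGB")

# "result" is assumed to follow the Florence-2 <OD> output layout:
# {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}
detections = result["<OD>"]
for i, (box, label) in enumerate(zip(detections["bboxes"], detections["labels"])):
    x1, y1, x2, y2 = box  # pixel coordinates on the input page image
    crop = image.crop((x1, y1, x2, y2))
    crop.save(f"{label}_{i}.png")  # caption text is not included in the box
```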
Frequently Asked Questions
Q: What makes this model unique?
This model specifically focuses on academic paper content extraction without caption text, offering higher accuracy than the base-sized TF-ID variants and specialized handling of complex academic layouts.
Q: What are the recommended use cases?
The model is ideal for automated academic paper processing, dataset creation, and research content extraction where precise figure and table boundaries are needed without caption text interference.