TF-ID-large-no-caption
| Property | Value |
|---|---|
| Parameter Count | 823M parameters |
| Model Type | Image-Text-to-Text Transformer |
| License | MIT |
| Framework | PyTorch |
| Author | Yifei Hu |
What is TF-ID-large-no-caption?
TF-ID-large-no-caption is a specialized object detection model designed to identify and extract tables and figures from academic papers without including their caption text. Built on Microsoft's Florence-2 architecture, this model represents the largest variant in the TF-ID family, offering superior performance with 97.32% accuracy on test datasets.
Implementation Details
The model utilizes a transformer-based architecture fine-tuned on manually annotated academic papers from the Hugging Face Daily Papers dataset. It processes single-page images and outputs precise bounding box coordinates for tables and figures, excluding caption text areas; a usage sketch follows the list below.
- Architecture based on Florence-2 with 823M parameters
- Uses F32 tensor type for computation
- Implements image-text-to-text pipeline for detection tasks
- Trained on human-verified annotations
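Since the model is built on Florence-2, inference can follow the standard Florence-2 pattern with the `transformers` library. The snippet below is a minimal sketch, not the card's official example: the repository id `yifeihu/TF-ID-large-no-caption`, the page image filename, and the use of the `<OD>` task prompt are assumptions based on the Florence-2 convention.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id based on the author/model name; adjust if it differs.
model_id = "yifeihu/TF-ID-large-no-caption"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Florence-2 models are driven by a task token; "<OD>" is the standard
# object-detection prompt.
prompt = "<OD>"
image = Image.open("page_1.png").convert("RGB")  # one full paper page (hypothetical file)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the raw generated text into labelled bounding boxes.
result = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(result)  # expected shape: {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}
```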
Core Capabilities
- High-precision table and figure detection (97.32% success rate)
- Outputs bounding box coordinates in the format [x1, y1, x2, y2] (see the cropping sketch after this list)
- Processes full academic paper pages
- Excludes caption text from detection boxes
- Supports batch processing with PyTorch integration
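Because the boxes exclude caption text, cropping them directly yields clean table and figure images. The sketch below assumes the Florence-2 `<OD>` output layout and reuses the hypothetical `result` variable and page image from the previous example.

```python
from PIL import Image

image = Image.open("page_1.png").convert("RGB")

# "result" is assumed to follow the Florence-2 <OD> output layout:
# {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}
detections = result["<OD>"]
for i, (box, label) in enumerate(zip(detections["bboxes"], detections["labels"])):
    x1, y1, x2, y2 = box  # pixel coordinates on the input page image
    crop = image.crop((x1, y1, x2, y2))
    crop.save(f"{label}_{i}.png")  # caption text is not included in the box
```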
Frequently Asked Questions
Q: What makes this model unique?
This model specifically focuses on academic paper content extraction without caption text, offering higher accuracy than the base-sized TF-ID variants and specialized handling of complex academic layouts.
Q: What are the recommended use cases?
The model is ideal for automated academic paper processing, dataset creation, and research content extraction where precise figure and table boundaries are needed without caption text interference.