TF-ID-large-no-caption

Maintained by: yifeihu


Property          Value
Parameter Count   823M parameters
Model Type        Image-Text-to-Text Transformer
License           MIT
Framework         PyTorch
Author            Yifei Hu

What is TF-ID-large-no-caption?

TF-ID-large-no-caption is an object detection model that locates tables and figures in academic papers and returns bounding boxes that exclude their caption text. It is built on Microsoft's Florence-2 architecture and is the largest variant in the TF-ID family, with a reported 97.32% success rate on test data.

Implementation Details

The model utilizes a transformer-based architecture fine-tuned on manually annotated academic papers from the Hugging Face Daily Papers dataset. It processes single page images and outputs precise bounding box coordinates for tables and figures, excluding caption text areas.

  • Architecture based on Florence-2 with 823M parameters
  • Uses F32 tensor type for computation
  • Implements image-text-to-text pipeline for detection tasks
  • Trained on human-verified annotations
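
The single-page detection flow described above can be sketched as follows. This is a minimal sketch, not the author's reference code: it assumes the checkpoint is published on the Hugging Face Hub as yifeihu/TF-ID-large-no-caption, that it follows the usual Florence-2 remote-code interface (the "<OD>" task prompt and processor.post_process_generation), and that labels come back as strings such as "table" or "figure". The helper names flatten_detections and detect_tables_and_figures are hypothetical.

```python
def flatten_detections(parsed, task="<OD>"):
    """Convert {'<OD>': {'bboxes': [...], 'labels': [...]}} into a list of dicts.

    The dict layout is the Florence-2 object-detection output format; it is an
    assumption that this fine-tuned checkpoint keeps it unchanged.
    """
    result = parsed.get(task, {})
    return [
        {"label": label, "box": [float(v) for v in box]}
        for box, label in zip(result.get("bboxes", []), result.get("labels", []))
    ]


def detect_tables_and_figures(image_path, model_id="yifeihu/TF-ID-large-no-caption"):
    """Run detection on one page image and return labeled [x1, y1, x2, y2] boxes."""
    # Heavy imports stay local so flatten_detections works without them installed.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # model_id is an assumed Hub repository name, not confirmed by this page.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text="<OD>", images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Map the generated location tokens back to pixel coordinates of this page.
    parsed = processor.post_process_generation(
        text, task="<OD>", image_size=(image.width, image.height)
    )
    return flatten_detections(parsed)
```

Calling detect_tables_and_figures("page_3.png") on a rendered page image (a hypothetical file name) would return one dictionary per detected table or figure, provided the assumptions above hold.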

Core Capabilities

  • High-precision table and figure detection (97.32% success rate)
  • Outputs bounding box coordinates in format [x1, y1, x2, y2]
  • Processes full academic paper pages
  • Excludes caption text from detection boxes
  • Supports batch processing with PyTorch integration
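
To make the [x1, y1, x2, y2] format concrete, the sketch below shows a few box utilities. It assumes the common image convention that (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner, in pixels, with the origin at the top-left of the page; the page does not state this explicitly. The function names are illustrative only.

```python
def clamp_box(box, page_width, page_height):
    """Clip a box to the page and make sure x1 <= x2 and y1 <= y2."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((min(max(x1, 0), page_width), min(max(x2, 0), page_width)))
    y1, y2 = sorted((min(max(y1, 0), page_height), min(max(y2, 0), page_height)))
    return [x1, y1, x2, y2]


def box_size(box):
    """Return (width, height) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return (x2 - x1, y2 - y1)


def to_xywh(box):
    """Convert [x1, y1, x2, y2] to [x, y, width, height]."""
    width, height = box_size(box)
    return [box[0], box[1], width, height]


# A box that spills past a 100 x 80 page is clipped to the page edges.
clipped = clamp_box([-5, 10, 120, 90], page_width=100, page_height=80)
print(clipped)           # [0, 10, 100, 80]
print(box_size(clipped)) # (100, 70)
print(to_xywh(clipped))  # [0, 10, 100, 70]
```

Clamping before cropping guards against coordinates that land slightly outside the page after rounding.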

Frequently Asked Questions

Q: What makes this model unique?

This model specifically focuses on academic paper content extraction without caption text, offering higher accuracy compared to base models and specialized handling of complex academic layouts.

Q: What are the recommended use cases?

The model is ideal for automated academic paper processing, dataset creation, and research content extraction where precise figure and table boundaries are needed without caption text interference.
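
For the dataset-creation use case, one possible approach is to turn each page's detections into metadata records and caption-free crops. The sketch below is an illustration under assumptions: detections are dictionaries with "label" and "box" keys (a hypothetical shape, not defined by the model itself), the record fields and file naming scheme are invented for this example, and cropping relies on Pillow.

```python
import json


def detection_records(paper_id, page_number, detections):
    """Build one metadata record per detection, with a unique crop file name."""
    records = []
    counters = {}  # running index per label, e.g. figure1, figure2, table1
    for det in detections:
        label = det["label"]
        counters[label] = counters.get(label, 0) + 1
        records.append({
            "paper_id": paper_id,
            "page": page_number,
            "label": label,
            "box": det["box"],
            "crop_file": f"{paper_id}_p{page_number}_{label}{counters[label]}.png",
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


def save_crops(page_image_path, records, out_dir):
    """Crop each box out of the page image and save it under out_dir."""
    # Imported here so the pure-Python helpers above need no extra packages.
    from pathlib import Path
    from PIL import Image

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with Image.open(page_image_path) as page:
        for record in records:
            # Pillow expects (left, upper, right, lower) integer pixel bounds.
            bounds = tuple(int(round(v)) for v in record["box"])
            page.crop(bounds).save(out / record["crop_file"])
```

Keeping the metadata alongside the crops lets each extracted image be traced back to its paper, page, and original coordinates.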
