Pegasus CNN/DailyMail
| Property | Value |
|---|---|
| Author | |
| Paper | arXiv:1912.08777 |
| Task | Text Summarization |
| Framework | PyTorch, Transformers |
What is pegasus-cnn_dailymail?
Pegasus-cnn_dailymail is a state-of-the-art abstractive text summarization model from Google's Pegasus family, fine-tuned on the CNN/DailyMail dataset. It achieves ROUGE-1/ROUGE-2/ROUGE-L scores of 44.16/21.56/41.30. The checkpoint follows the Mixed & Stochastic recipe, in which pre-training draws on both the C4 and HugeNews datasets.
Implementation Details
The model uses a transformer-based encoder-decoder architecture with a gap-sentence generation pre-training objective: important sentences are masked out of the input, and the model learns to generate them. In the Mixed & Stochastic variant, the gap-sentence ratio is sampled uniformly between 15% and 45%, and 20% uniform noise is applied to the sentence importance scores used to pick which sentences to mask.
- Trained on combined C4 and HugeNews datasets
- 1.5M training steps (3x the 500k steps of the original Pegasus)
- Updated sentencepiece tokenizer with newline character support
- Implements dynamic gap sentence ratio sampling
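The sentence-selection procedure described above can be sketched in a few lines. This is an illustrative simplification, not the training code: it stands in a unigram-overlap F1 for the paper's ROUGE-based importance score, and the helper names are invented for this example.

```python
import random


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (a stand-in for ROUGE-1)."""
    cand, ref = set(candidate), set(reference)
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, rng):
    """Choose which sentences to mask, Mixed & Stochastic style:
    - importance = overlap-F1 of each sentence vs. the rest of the document
    - 20% uniform noise multiplied into each importance score
    - gap-sentence ratio sampled uniformly from 15%-45% of the document
    Returns the (sorted) indices of the selected sentences.
    """
    tokenized = [s.lower().split() for s in sentences]
    scored = []
    for i, sent in enumerate(tokenized):
        rest = [tok for j, t in enumerate(tokenized) if j != i for tok in t]
        score = unigram_f1(sent, rest)
        score *= rng.uniform(0.8, 1.2)  # 20% uniform noise on importance
        scored.append((score, i))
    ratio = rng.uniform(0.15, 0.45)  # dynamic gap-sentence ratio
    k = max(1, round(ratio * len(sentences)))
    return sorted(i for _, i in sorted(scored, reverse=True)[:k])
```

During pre-training, the selected sentences are replaced by a mask token in the input, and the decoder is trained to reproduce them, which pushes the model toward summary-like generation.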
Core Capabilities
- High-quality abstractive text summarization
- Effective handling of news articles and long-form content
- Balanced precision and recall in generated summaries
- Robust performance across various domains
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its Mixed & Stochastic training approach, which combines multiple datasets and implements dynamic sentence sampling. The extended training period of 1.5M steps and specialized tokenizer modifications make it particularly effective for news summarization tasks.
Q: What are the recommended use cases?
The model is best suited for summarizing news articles, long-form content, and other formal documents where high-quality abstractive summaries are needed. It's particularly effective for content similar to CNN and DailyMail articles.