FineWeb-Edu Classifier
| Property | Value |
|---|---|
| Parameter Count | 109M |
| License | Apache 2.0 |
| Tensor Type | F32 |
| Training Data | 450,000 annotated samples |
What is fineweb-edu-classifier?
The FineWeb-Edu classifier is a model that scores the educational value of web content. Built on Snowflake-arctic-embed, it reads a text and assigns an educational quality score from 0 to 5, where 0 indicates non-educational content and 5 indicates highly educational material. It was trained on 450,000 web samples annotated by Llama3-70B-Instruct and is intended for filtering and curating educational content from web datasets.
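A minimal sketch of batch scoring with the Hugging Face `transformers` library is shown below. It assumes the checkpoint is published on the Hub as `HuggingFaceTB/fineweb-edu-classifier` and that it loads as a sequence-classification model with a single output; neither detail is stated above, so verify both before relying on this. The helper names `batched`, `load_classifier`, and `score_texts` are illustrative only.

```python
# Assumed Hub repository ID; not stated in this document.
MODEL_ID = "HuggingFaceTB/fineweb-edu-classifier"


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_classifier(model_id=MODEL_ID):
    """Load the tokenizer and model (imports are local so the helpers above stay lightweight)."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    return tokenizer, model


def score_texts(texts, tokenizer, model, batch_size=32):
    """Return one raw regression score per input text."""
    import torch

    scores = []
    for batch in batched(texts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding="longest", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # assumed shape: (batch, 1)
        scores.extend(logits.squeeze(-1).tolist())
    return scores
```

Typical use would be `tokenizer, model = load_classifier()` followed by `score_texts(["some web page text"], tokenizer, model)`. The returned values are raw regression outputs, which are not bounded by construction and may fall slightly outside the 0 to 5 range.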
Implementation Details
The model adds a classification head with a single regression output to Snowflake-arctic-embed. It was trained for 20 epochs with a learning rate of 3e-4, with the embedding and encoder layers kept frozen. When its scores are converted to a binary label at a threshold of 3, the model reaches an F1 score of 82%.
- Architecture: Snowflake-arctic-embed with custom classification head
- Training Dataset: 450k Llama3-annotated samples
- Evaluation Metric: F1 score of 82% as a binary classifier (threshold of 3)
- Output: Continuous score from 0 to 5, optionally converted to an integer
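The integer conversion and binary thresholding mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes the raw score is clamped to the 0 to 5 range before rounding, and that the threshold of 3 is inclusive and applied to the rounded score. The function names are hypothetical.

```python
def to_int_score(score):
    """Clamp a raw regression score to [0, 5] and round to the nearest integer.

    Note: Python's round() rounds exact halves to the nearest even integer,
    so 2.5 becomes 2 rather than 3.
    """
    return int(round(max(0.0, min(score, 5.0))))


def is_educational(score, threshold=3):
    """Binary label; assumes the threshold is inclusive and applied after rounding."""
    return to_int_score(score) >= threshold


def binary_f1(predicted, actual):
    """F1 score for two equal-length sequences of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With these helpers, an F1 figure like the one reported above would come from calling `binary_f1` on the model's binary predictions and the binarized reference annotations for a held-out set.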
Core Capabilities
- Educational content scoring and filtering
- Web content quality assessment
- Dataset curation for educational materials
- Binary classification of educational vs non-educational content
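As a rough sketch of the dataset-curation use case, the function below keeps only documents whose rounded score meets a threshold and reports the retention rate. The `score_fn` argument is a hypothetical callable that returns the classifier's raw score for a text, and the `"text"` field name and the inclusive threshold of 3 are assumptions made for the example.

```python
def filter_educational(docs, score_fn, threshold=3):
    """Return (kept_docs, retention_rate) for documents scoring at or above `threshold`.

    `docs` is a list of dicts with a "text" key (assumed layout).
    `score_fn` is a hypothetical callable mapping text to a raw score.
    """
    kept = []
    for doc in docs:
        score = score_fn(doc["text"])
        # Clamp to [0, 5], then round to an integer score.
        int_score = int(round(max(0.0, min(score, 5.0))))
        if int_score >= threshold:
            kept.append({**doc, "score": score, "int_score": int_score})
    retention = len(kept) / len(docs) if docs else 0.0
    return kept, retention
```

Storing both the raw and the integer score alongside each kept document makes it possible to tighten the threshold later without scoring the corpus again.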
Frequently Asked Questions
Q: What makes this model unique?
The model is trained specifically on web samples annotated by Llama3 for educational value, so it is tuned to identify and score educational web content. Because it outputs a regression score, it provides graded scores from 0 to 5 rather than a single yes/no label, and it still reaches an 82% F1 score when that score is thresholded at 3 for binary classification.
Q: What are the recommended use cases?
The model is best suited to filtering educational content from web datasets, curating learning materials, and assessing the educational value of web pages. It is particularly effective for primary and grade school level content; scores for higher-education or specialized-domain content should be treated with more caution.