Sabia-7B
| Property | Value |
|---|---|
| Parameter Count | 6.74B |
| Model Type | Text Generation |
| Architecture | LLaMA-based |
| Paper | Sabiá: Portuguese Large Language Models |
| License | Same as LLaMA-1 (research-only) |
| Context Length | 2048 tokens |
What is Sabia-7B?
Sabia-7B is a Portuguese-specialized language model developed by Maritaca AI on top of the LLaMA-1-7B architecture. Starting from the LLaMA-1-7B weights, it was further pretrained on the Portuguese subset of ClueWeb22 (a corpus of roughly 7 billion tokens) for about 10 billion additional training tokens, i.e., approximately 1.4 epochs over that corpus.
Implementation Details
The model reuses the LLaMA-1-7B architecture and tokenizer unchanged; only the weights were adapted for Portuguese understanding and generation. It uses BF16 precision, supports a maximum sequence length of 2048 tokens, and has a training-data cutoff of mid-2022. A minimal loading sketch follows the list below.
- Pretrained on ClueWeb22 Portuguese subset
- Approximately 1.4 epochs of additional training
- Maintains LLaMA's architecture while specializing in Portuguese
- Supports few-shot learning tasks
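As a minimal sketch of loading and querying the model with Hugging Face Transformers, assuming the checkpoint is published on the Hub as `maritaca-ai/sabia-7b` (verify the identifier before use):

```python
# Minimal loading sketch; "maritaca-ai/sabia-7b" is an assumed Hub identifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "maritaca-ai/sabia-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # the model uses BF16 precision
    device_map="auto",           # requires accelerate; or use .to("cuda")
)

prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Keep prompt plus generated tokens within the 2048-token context window.
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```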
Core Capabilities
- ENEM Challenge: 55.07% accuracy
- Portuguese hate speech detection (PT Hate Speech Binary): 64.13% F1-score
- Sentiment analysis (tweetSentBR): 46.64% F1-score
- Natural language inference (FaQuAD NLI): 58.34% F1-score; see the F1 illustration below
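For context, the F1-scores above combine precision and recall as their harmonic mean. A small illustration of the binary case, with invented labels rather than benchmark data:

```python
# Illustration of binary F1 computation; labels are invented, not from any benchmark.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # gold labels (e.g., 1 = hate speech)
y_pred = [1, 0, 0, 1, 0, 1]  # labels parsed from model generations

p = precision_score(y_true, y_pred)  # 3/3 = 1.00
r = recall_score(y_true, y_pred)     # 3/4 = 0.75
print(f1_score(y_true, y_pred))      # 2*p*r/(p+r) ≈ 0.857
```

Multi-class tasks such as tweetSentBR typically report a macro-averaged variant of this score.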
Frequently Asked Questions
Q: What makes this model unique?
Sabia-7B is optimized specifically for Portuguese-language tasks while maintaining competitive performance in English. It achieves 48.5 NPM (Normalized Preferred Metric) on the Poeta benchmark, a suite of Portuguese evaluation tasks, surpassing both LLaMA-1 and LLaMA-2.
Q: What are the recommended use cases?
The model is designed for few-shot prompting rather than zero-shot use. It is particularly effective for Portuguese text generation, classification, and natural language understanding in academic and research contexts (note the research-only license). A few-shot prompt sketch follows.
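Continuing from the loading sketch above (reusing `model` and `tokenizer`), a hypothetical few-shot prompt for Portuguese sentiment classification might look like this; the example sentences are invented:

```python
# Few-shot sentiment prompt in Portuguese; reuses `model` and `tokenizer`
# from the loading sketch above. All example sentences are invented.
few_shot_prompt = """Classifique o sentimento da frase como Positivo ou Negativo.

Frase: Adorei o filme, recomendo a todos!
Sentimento: Positivo

Frase: O atendimento foi péssimo e demorado.
Sentimento: Negativo

Frase: O produto chegou antes do prazo e funciona perfeitamente.
Sentimento:"""

inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=3, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())  # expected: "Positivo"
```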