japanese-gpt-neox-3.6b-instruction-ppo
| Property | Value |
|---|---|
| Parameter Count | 3.6B |
| Model Type | GPT-NeoX |
| License | MIT |
| Paper | Link |
| Architecture | 36-layer, 2816-hidden-size transformer |
What is japanese-gpt-neox-3.6b-instruction-ppo?
This is a Japanese language model aligned with Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO). Built upon the SFT variant, it has been further tuned to follow instructions and hold natural conversations. In human evaluation it achieves a 47% win rate against its SFT counterpart, and ChatGPT-based evaluation puts the win rate higher, at 63%.
Implementation Details
The model uses a two-stage training approach: Supervised Fine-Tuning (SFT) followed by reinforcement learning with PPO. The PPO stage is built on CarperAI/trlx's implementation and trained on Japanese-translated Anthropic HH RLHF data.
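The section above names the training recipe but not the exact scripts; the following is a minimal, hypothetical sketch of what the PPO stage could look like on top of CarperAI/trlx. The reward function, prompt list, and config path are illustrative placeholders (a real run would score samples with a reward model trained on the translated HH comparisons), and trlx's API may differ between versions.

```python
# Hypothetical sketch of the PPO stage with CarperAI/trlx; not rinna's published training script.
import trlx
from trlx.data.configs import TRLConfig


def reward_fn(samples, **kwargs):
    # Placeholder reward: a real RLHF run would score samples with a preference/reward
    # model trained on the Japanese-translated Anthropic HH comparison data.
    return [float(len(s)) / 100.0 for s in samples]


# Tiny illustrative prompt set in the model's conversation format.
prompts = ["ユーザー: 日本の首都はどこですか?<NL>システム: "]

# Placeholder path to a PPO hyperparameter config (batch sizes, KL coefficient, etc.).
config = TRLConfig.load_yaml("ppo_config.yml")

trainer = trlx.train(
    "rinna/japanese-gpt-neox-3.6b-instruction-sft",  # start from the SFT checkpoint
    reward_fn=reward_fn,
    prompts=prompts,
    config=config,
)
```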
- SentencePiece-based tokenization with a 32,000-token vocabulary
- Byte fallback to handle text outside the vocabulary
- Recommended generation parameters of temperature=0.7 and repetition_penalty=1.1
- Conversation-style input format built from ユーザー (user) and システム (system) turns (see the usage sketch after this list)
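A minimal usage sketch with Hugging Face Transformers is shown below; it follows the ユーザー/システム conversation format (turns joined with the <NL> token) and the generation parameters listed above. The example question and max_new_tokens value are illustrative, so check the official model card for the authoritative prompt format.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "rinna/japanese-gpt-neox-3.6b-instruction-ppo"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)  # slow SentencePiece tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build the prompt as ユーザー/システム turns joined by the <NL> token,
# ending with an open システム turn for the model to complete.
turns = [
    {"speaker": "ユーザー", "text": "日本のおすすめの観光地を教えてください。"},  # illustrative question
]
prompt = "<NL>".join(f"{t['speaker']}: {t['text']}" for t in turns) + "<NL>システム: "

token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        do_sample=True,
        max_new_tokens=128,          # illustrative length limit
        temperature=0.7,             # recommended sampling temperature
        repetition_penalty=1.1,      # recommended repetition penalty
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens and restore newlines.
output = tokenizer.decode(output_ids[0][token_ids.size(1):], skip_special_tokens=True)
print(output.replace("<NL>", "\n"))
```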
Core Capabilities
- Instruction following in Japanese
- Natural conversation handling with structured input/output
- Improved response quality compared to the SFT version
- Efficient handling of unknown characters through byte fallback (see the sketch below)
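As a small illustration of the byte-fallback behavior, the SentencePiece tokenizer (loaded with use_fast=False) can handle characters that are unlikely to be in its 32,000-piece vocabulary by decomposing them into byte-level pieces rather than mapping them to an unknown token; the input string here is just an example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "rinna/japanese-gpt-neox-3.6b-instruction-ppo", use_fast=False
)

# "Hello" plus an emoji; the emoji is unlikely to be a vocabulary item,
# so byte fallback should split it into byte-level pieces instead of <unk>.
text = "こんにちは🦜"
print(tokenizer.tokenize(text))
```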
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its successful application of RLHF in Japanese language modeling, showing measurable improvements over its SFT variant. It uses a special conversation format and has been optimized for instruction-following tasks.
Q: What are the recommended use cases?
The model is ideal for Japanese language conversation systems, chatbots, and instruction-following applications. It's particularly suited for scenarios requiring natural dialogue flow and accurate response generation in Japanese.