japanese-gpt-neox-3.6b-instruction-ppo
| Property | Value |
|---|---|
| Parameter Count | 3.6B |
| Model Type | GPT-NeoX |
| License | MIT |
| Paper | Link |
| Architecture | 36-layer, 2816-hidden-size transformer |
What is japanese-gpt-neox-3.6b-instruction-ppo?
This is a Japanese language model aligned with Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO). Built upon the SFT variant, it has been further tuned to follow instructions and hold natural conversations. In human evaluation it achieves a 47% win rate against its SFT counterpart, and ChatGPT-based evaluation puts the win rate higher, at 63%.
Implementation Details
The model uses a two-stage training approach: Supervised Fine-Tuning (SFT) followed by reinforcement learning with PPO. The PPO stage is built on CarperAI/trlx's implementation and trained on Japanese-translated Anthropic HH RLHF data.
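The section above names the training recipe but not the exact scripts; the following is a minimal, hypothetical sketch of what the PPO stage could look like on top of CarperAI/trlx. The reward function, prompt list, and config path are illustrative placeholders (a real run would score samples with a reward model trained on the translated HH comparisons), and trlx's API may differ between versions.

```python
# Hypothetical sketch of the PPO stage with CarperAI/trlx; not rinna's published training script.
import trlx
from trlx.data.configs import TRLConfig


def reward_fn(samples, **kwargs):
    # Placeholder reward: a real RLHF run would score samples with a preference/reward
    # model trained on the Japanese-translated Anthropic HH comparison data.
    return [float(len(s)) / 100.0 for s in samples]


# Tiny illustrative prompt set in the model's conversation format.
prompts = ["ユーザー: 日本の首都はどこですか?<NL>システム: "]

# Placeholder path to a PPO hyperparameter config (batch sizes, KL coefficient, etc.).
config = TRLConfig.load_yaml("ppo_config.yml")

trainer = trlx.train(
    "rinna/japanese-gpt-neox-3.6b-instruction-sft",  # start from the SFT checkpoint
    reward_fn=reward_fn,
    prompts=prompts,
    config=config,
)
```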
- SentencePiece-based tokenization with a 32,000-token vocabulary
- Byte fallback to handle text outside the vocabulary
- Recommended generation parameters of temperature=0.7 and repetition_penalty=1.1
- Conversation-style input format built from ユーザー (user) and システム (system) turns (see the usage sketch after this list)
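A minimal usage sketch with Hugging Face Transformers is shown below; it follows the ユーザー/システム conversation format (turns joined with the <NL> token) and the generation parameters listed above. The example question and max_new_tokens value are illustrative, so check the official model card for the authoritative prompt format.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "rinna/japanese-gpt-neox-3.6b-instruction-ppo"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)  # slow SentencePiece tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build the prompt as ユーザー/システム turns joined by the <NL> token,
# ending with an open システム turn for the model to complete.
turns = [
    {"speaker": "ユーザー", "text": "日本のおすすめの観光地を教えてください。"},  # illustrative question
]
prompt = "<NL>".join(f"{t['speaker']}: {t['text']}" for t in turns) + "<NL>システム: "

token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        do_sample=True,
        max_new_tokens=128,          # illustrative length limit
        temperature=0.7,             # recommended sampling temperature
        repetition_penalty=1.1,      # recommended repetition penalty
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens and restore newlines.
output = tokenizer.decode(output_ids[0][token_ids.size(1):], skip_special_tokens=True)
print(output.replace("<NL>", "\n"))
```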
Core Capabilities
- Instruction following in Japanese
- Natural conversation handling with structured input/output
- Improved response quality compared to the SFT version
- Efficient handling of unknown characters through byte fallback (see the sketch below)
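As a small illustration of the byte-fallback behavior, the SentencePiece tokenizer (loaded with use_fast=False) can handle characters that are unlikely to be in its 32,000-piece vocabulary by decomposing them into byte-level pieces rather than mapping them to an unknown token; the input string here is just an example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "rinna/japanese-gpt-neox-3.6b-instruction-ppo", use_fast=False
)

# "Hello" plus an emoji; the emoji is unlikely to be a vocabulary item,
# so byte fallback should split it into byte-level pieces instead of <unk>.
text = "こんにちは🦜"
print(tokenizer.tokenize(text))
```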
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its successful application of RLHF in Japanese language modeling, showing measurable improvements over its SFT variant. It uses a special conversation format and has been optimized for instruction-following tasks.
Q: What are the recommended use cases?
The model is ideal for Japanese language conversation systems, chatbots, and instruction-following applications. It's particularly suited for scenarios requiring natural dialogue flow and accurate response generation in Japanese.