Japanese CLIP ViT-B/16
| Property | Value |
|---|---|
| Parameter Count | 197M |
| License | Apache 2.0 |
| Paper | CLIP Paper |
| Author | rinna |
What is japanese-clip-vit-b-16?
Japanese CLIP is a vision-language model developed by rinna Co., Ltd. that learns the relationship between images and Japanese text. Built on the CLIP architecture, it pairs a ViT-B/16 Vision Transformer image encoder with a 12-layer BERT text encoder.
Implementation Details
The model combines a Vision Transformer (ViT-B/16) image encoder initialized from AugReg weights with a BERT-based text encoder. It was trained on the CC12M dataset with captions translated into Japanese, enabling zero-shot image classification and cross-modal understanding in Japanese.
- ViT-B/16 image encoder with patch size 16
- 12-layer BERT text encoder
- Trained on translated CC12M dataset
- Supports zero-shot classification (see the sketch after this list)
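As a concrete illustration, the sketch below runs zero-shot classification with the model. It assumes rinna's `japanese_clip` helper package (published in the rinnakk/japanese-clip GitHub repository) and follows the usage pattern shown there; function names, arguments, and the image path are assumptions that may differ between releases, so treat this as a sketch rather than the definitive API.

```python
# Zero-shot classification sketch. Assumes rinna's `japanese_clip` package
# (https://github.com/rinnakk/japanese-clip); calls follow its published
# usage example and may differ between versions. The image path is a placeholder.
import torch
from PIL import Image
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the ViT-B/16 model, its image preprocessor, and the Japanese tokenizer.
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# Candidate Japanese labels ("dog", "cat", "elephant").
labels = ["犬", "猫", "象"]
encodings = ja_clip.tokenize(texts=labels, max_seq_len=77, device=device, tokenizer=tokenizer)

# Preprocess a local image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)

# L2-normalize so the dot product is a cosine similarity, then turn the
# scaled similarities into probabilities over the candidate labels.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```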
Core Capabilities
- Japanese text-image similarity scoring (illustrated below)
- Zero-shot image classification
- Feature extraction for both images and text
- Cross-modal understanding in Japanese
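Similarity scoring and feature extraction come down to encoding each modality into the shared embedding space and comparing vectors. The sketch below, under the same `japanese_clip` package assumption as above, scores one image against several Japanese captions with cosine similarity; the file name and captions are placeholders.

```python
# Similarity scoring sketch: embed one image and several Japanese captions,
# then compare them with cosine similarity. Assumes rinna's `japanese_clip`
# package; the path and captions below are placeholders.
import torch
from PIL import Image
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# "a dog playing in a park", "Tokyo Tower at night", "a bowl of ramen"
captions = ["公園で遊ぶ犬", "夜の東京タワー", "ラーメンの丼"]
encodings = ja_clip.tokenize(texts=captions, max_seq_len=77, device=device, tokenizer=tokenizer)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_emb = model.get_image_features(image)
    txt_emb = model.get_text_features(**encodings)

# L2-normalize so the dot product equals cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ txt_emb.T).squeeze(0)

for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.3f}  {caption}")
```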
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Japanese language understanding, making it one of the few CLIP models trained explicitly for Japanese text-image relationships. It uses a proven ViT architecture while adapting to Japanese language nuances.
Q: What are the recommended use cases?
The model excels at Japanese image-text matching, zero-shot image classification, and cross-modal search applications. It's particularly useful for applications requiring Japanese language understanding in visual contexts, such as image search, content moderation, and automated tagging systems.
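As a rough sketch of the image-search use case mentioned above: pre-compute embeddings for an image collection once, then rank them against a Japanese query at search time. This again assumes the `japanese_clip` helper package; the gallery directory and query string are placeholders.

```python
# Text-to-image search sketch: index a folder of images, then rank them
# against a Japanese query. Assumes rinna's `japanese_clip` package;
# the image directory and query string are placeholders.
from pathlib import Path

import torch
from PIL import Image
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# 1) Offline: embed every image in the gallery and keep a normalized index.
paths = sorted(Path("gallery/").glob("*.jpg"))
with torch.no_grad():
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    index = model.get_image_features(batch)
index = index / index.norm(dim=-1, keepdim=True)

# 2) Online: embed the Japanese query and rank gallery images by cosine similarity.
query = "海辺を走る犬"  # "a dog running on the beach"
encodings = ja_clip.tokenize(texts=[query], max_seq_len=77, device=device, tokenizer=tokenizer)
with torch.no_grad():
    q = model.get_text_features(**encodings)
q = q / q.norm(dim=-1, keepdim=True)

scores = (q @ index.T).squeeze(0)
for score, path in sorted(zip(scores.tolist(), paths), reverse=True)[:5]:
    print(f"{score:.3f}  {path}")
```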