japanese-clip-vit-b-16

Maintained by: rinna

Japanese CLIP ViT-B/16

  • Parameter Count: 197M
  • License: Apache 2.0
  • Paper: CLIP Paper
  • Author: rinna

What is japanese-clip-vit-b-16?

Japanese CLIP is a vision-language model developed by rinna Co., Ltd. that learns the relationship between images and Japanese text. Built on the CLIP architecture, it uses a ViT-B/16 Transformer for image encoding and a 12-layer BERT for text processing.

Implementation Details

The model pairs a Vision Transformer (ViT-B/16) image encoder, initialized from AugReg weights, with a BERT-based text encoder. It was trained on the CC12M dataset with captions translated into Japanese, enabling zero-shot image classification and cross-modal understanding in Japanese.

  • ViT-B/16 image encoder with patch size 16
  • 12-layer BERT text encoder
  • Trained on translated CC12M dataset
  • Supports zero-shot classification
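
The snippet below is a minimal zero-shot classification sketch following the usage pattern published in rinna's japanese-clip repository (the `japanese_clip` package); the image URL and candidate labels are placeholders rather than part of the model card.

```python
import io

import requests
import torch
from PIL import Image
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained model, its image preprocessing pipeline, and the Japanese tokenizer.
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# Any RGB image works; this URL is a placeholder.
image = Image.open(io.BytesIO(requests.get("https://example.com/dog.jpg").content))
image = preprocess(image).unsqueeze(0).to(device)

# Candidate labels written as Japanese text prompts.
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],  # "dog", "cat", "elephant"
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    # Scaled cosine-similarity logits turned into probabilities over the candidate labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```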

Core Capabilities

  • Japanese text-image similarity scoring
  • Zero-shot image classification
  • Feature extraction for both images and text
  • Cross-modal understanding in Japanese
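
As a sketch of feature extraction, the hypothetical helpers below wrap the same assumed `japanese_clip` API to produce L2-normalized image and text embeddings in a shared vector space, so a single dot product gives a cosine-similarity score.

```python
import torch
from PIL import Image
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()


def embed_texts(texts):
    """Return L2-normalized text embeddings suitable for cosine-similarity search."""
    encodings = ja_clip.tokenize(
        texts=texts, max_seq_len=77, device=device, tokenizer=tokenizer
    )
    with torch.no_grad():
        features = model.get_text_features(**encodings)
    return features / features.norm(dim=-1, keepdim=True)


def embed_images(pil_images):
    """Return L2-normalized image embeddings in the same space as the text embeddings."""
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    with torch.no_grad():
        features = model.get_image_features(batch)
    return features / features.norm(dim=-1, keepdim=True)


# Similarity between a caption and an image is then just a dot product, e.g.:
# score = (embed_texts(["桜の木の写真"]) @ embed_images([img]).T).item()
```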

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Japanese, making it one of the few CLIP models trained explicitly on Japanese text-image pairs. It keeps the standard ViT-B/16 image encoder while pairing it with a Japanese text encoder, so image-text matching works on native Japanese prompts rather than translated English ones.

Q: What are the recommended use cases?

The model excels at Japanese image-text matching, zero-shot image classification, and cross-modal search applications. It's particularly useful for applications requiring Japanese language understanding in visual contexts, such as image search, content moderation, and automated tagging systems.
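
For instance, a small text-to-image search could reuse the `embed_texts`/`embed_images` helpers sketched above to rank a gallery of images against a Japanese query; the file paths and query text here are purely illustrative.

```python
from PIL import Image

# Hypothetical text-to-image search over a small pre-embedded gallery.
gallery_paths = ["photos/001.jpg", "photos/002.jpg", "photos/003.jpg"]  # placeholder paths
gallery = embed_images([Image.open(p).convert("RGB") for p in gallery_paths])

query = embed_texts(["夕日の海辺"])  # "a beach at sunset"
scores = (query @ gallery.T).squeeze(0)

# Highest cosine similarity first.
for score, path in sorted(zip(scores.tolist(), gallery_paths), reverse=True):
    print(f"{score:.3f}  {path}")
```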
