clip-japanese-base

Maintained by: line-corporation

Parameter Count: 197M
License: Apache 2.0
Paper: CLIP Paper
Author: LINE Corporation

What is clip-japanese-base?

clip-japanese-base is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. Trained on approximately one billion web-collected image-text pairs, it represents a significant advance in Japanese vision-language understanding. The model combines an Eva02-B Transformer for image encoding with a 12-layer BERT model for text encoding.

Implementation Details

The model architecture consists of two main components: an 86M-parameter Eva02-B Transformer for image encoding and a 100M-parameter BERT model for text encoding. The text encoder was initialized from rinna/japanese-clip-vit-b-16, providing a strong foundation for Japanese language understanding.

  • Dual-encoder architecture with separate image and text encoders
  • Weights stored in float32 (F32) precision
  • Supports zero-shot image classification (see the usage sketch below)
  • Enables text-to-image and image-to-text retrieval
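
As a quick illustration of the dual-encoder workflow, the sketch below runs zero-shot classification through Hugging Face transformers. It assumes the model follows the standard CLIP-style get_image_features / get_text_features interface loaded with trust_remote_code=True; the image URL and the Japanese labels ("dog", "cat", "elephant") are placeholders, and exact call signatures may differ in the model's custom code.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

model_name = "line-corporation/clip-japanese-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Placeholder image and Japanese candidate labels ("dog", "cat", "elephant").
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)
inputs = processor(image, return_tensors="pt")
text = tokenizer(["犬", "猫", "象"])

with torch.no_grad():
    image_features = model.get_image_features(**inputs)
    text_features = model.get_text_features(**text)
    # Cosine-similarity logits over the label set, softmaxed to probabilities.
    probs = (image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)
```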

Core Capabilities

  • State-of-the-art retrieval performance on STAIR Captions (0.30 R@1; see the recall@1 sketch below)
  • 89% zero-shot image classification accuracy on the Recruit Datasets
  • 58% accuracy on ImageNet-1K using Japanese class names
  • Efficient processing with 197M total parameters
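
R@1 (recall at 1) measures how often the correct counterpart is the single top-ranked match across a paired evaluation set. A toy sketch of the metric, assuming L2-normalized embeddings where row i of each matrix belongs to the same pair:

```python
import torch

def recall_at_1(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Fraction of images whose top-ranked text is their paired caption.

    Assumes row i of image_emb pairs with row i of text_emb, and that both
    are L2-normalized so the dot product equals cosine similarity.
    """
    sims = image_emb @ text_emb.T           # (N, N) similarity matrix
    top1 = sims.argmax(dim=1)               # best text index per image
    targets = torch.arange(len(image_emb))  # ground-truth pairing
    return (top1 == targets).float().mean().item()

# Toy usage with random unit vectors (illustrative numbers only).
imgs = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
txts = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
print(recall_at_1(imgs, txts))
```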

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimal balance between model size and performance, achieving comparable or better results than larger models while using fewer parameters. It particularly excels in Japanese-specific tasks, outperforming other Japanese CLIP models on the STAIR Captions and Recruit Datasets.

Q: What are the recommended use cases?

The model is ideal for Japanese-language visual tasks including zero-shot image classification, cross-modal retrieval (text-to-image and image-to-text), and visual semantic understanding. It's particularly well-suited for applications requiring Japanese language understanding in conjunction with visual processing.
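
For the retrieval use case, the sketch below ranks a pre-encoded image collection against a Japanese text query. The query string, collection size, and file paths are illustrative, and the random image_bank stands in for embeddings you would compute offline with model.get_image_features.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "line-corporation/clip-japanese-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Encode a Japanese query ("a red sports car" -- illustrative only).
query = tokenizer(["赤いスポーツカー"])
with torch.no_grad():
    q = model.get_text_features(**query)
q = torch.nn.functional.normalize(q, dim=-1)

# Placeholder image index: random unit vectors standing in for embeddings
# computed offline over a real image collection.
image_bank = torch.nn.functional.normalize(torch.randn(1000, q.shape[-1]), dim=-1)
image_paths = [f"images/{i:04d}.jpg" for i in range(1000)]  # hypothetical paths

# Rank the collection by cosine similarity and show the top 5 matches.
scores = (q @ image_bank.T).squeeze(0)
top = scores.topk(5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{image_paths[idx]}  score={score:.3f}")
```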
