# CogVideoX-2b
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | arXiv:2408.06072 |
| Framework | Diffusers |
| Task | Text-to-Video Generation |
## What is CogVideoX-2b?
CogVideoX-2b is an entry-level text-to-video generation model designed for efficient video creation with minimal computational requirements. It is the lightweight member of the CogVideoX family, generating 6-second videos at 720x480 resolution and 8 frames per second.
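A minimal generation sketch with the diffusers `CogVideoXPipeline` is shown below; the prompt, seed, and output filename are placeholders to adapt to your setup:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 2B checkpoint in FP16, the precision this variant is distributed for.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # lowers VRAM usage at some speed cost

prompt = "A panda playing a guitar in a bamboo forest"  # placeholder prompt

video = pipe(
    prompt=prompt,
    num_inference_steps=50,  # ~90 s on A100, ~45 s on H100
    num_frames=49,           # roughly 6 seconds at 8 fps
    guidance_scale=6.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```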
## Implementation Details
The model uses FP16 precision and offers substantial VRAM optimization, requiring as little as 4 GB when run with diffusers and its memory optimizations enabled (see the sketch after the list below). It employs `3d_sincos_pos_embed` positional encoding and supports multiple precision formats, including FP16, BF16, FP32, and INT8.
- Inference speed: ~90 seconds on A100, ~45 seconds on H100 (50 steps)
- VRAM usage: 18GB with SAT, 4GB with diffusers (FP16)
- Supports English prompts up to 226 tokens
- Compatible with PytorchAO and Optimum-quanto for quantization
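As a rough illustration of the memory optimizations referenced above (a sketch assuming a recent diffusers release; actual savings vary by version and hardware):

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Sequential CPU offload keeps only the actively running submodule on the GPU,
# and VAE slicing/tiling decodes the video in smaller chunks. Combined, these
# are what bring peak VRAM down toward the ~4 GB figure, at the cost of speed.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```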
## Core Capabilities
- High-quality video generation from text descriptions
- Efficient memory management with multiple optimization options
- Support for various precision formats and quantization methods (see the quantization sketch after this list)
- Multi-GPU inference support
- Fine-tuning support via LoRA and SFT
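A sketch of INT8 weight quantization of the transformer with Optimum-quanto, assuming the `optimum-quanto` package is installed (PytorchAO follows a similar pattern):

```python
import torch
from diffusers import CogVideoXPipeline
from optimum.quanto import quantize, freeze, qint8

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Quantize the transformer (the largest component) to INT8 weights in place,
# then freeze it to materialize the quantized weights and drop the FP16 copies.
quantize(pipe.transformer, weights=qint8)
freeze(pipe.transformer)

pipe.enable_model_cpu_offload()
```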
## Frequently Asked Questions
**Q: What makes this model unique?**
CogVideoX-2b stands out for its efficient balance between performance and resource requirements, making it accessible for users with limited computational resources while maintaining good video generation quality.
**Q: What are the recommended use cases?**
The model is ideal for standard text-to-video generation tasks. It is particularly suited to development and testing environments, content creation, and scenarios where computational resources are limited but good video quality is still required.