Paraformer-large

Author: funasr
Model Type: Non-autoregressive Speech Recognition
Training Data: 60,000 hours of Mandarin
Paper: Paraformer: Fast and Accurate Parallel Transformer (INTERSPEECH 2022)

What is Paraformer-large?

Paraformer-large is a non-autoregressive end-to-end speech recognition model. Unlike traditional autoregressive models, which decode one token at a time, it generates the transcription for an entire sentence in parallel, which substantially improves inference efficiency and reduces computational cost.

Implementation Details

The model is deployed through the funasr_onnx runtime and supports both CPU and GPU inference. It offers quantization options for faster inference and flexible batch processing, and it accepts several input formats, including file-path strings and numpy arrays. Installation is via pip; a minimal usage sketch appears after the feature list below.

  • Parallel text generation for entire sentences
  • Support for both CPU and GPU inference
  • Quantization options for optimized performance
  • Flexible batch processing capabilities
  • Easy deployment through pip installation
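
The following is a minimal usage sketch based on the funasr_onnx interface described above. The model directory and wav file names are placeholders, and the quantize and batch_size keyword arguments correspond to the quantization and batching options listed above; consult the FunASR documentation for the authoritative API.

```python
# Install the ONNX runtime package referenced above: pip install funasr_onnx
from funasr_onnx import Paraformer

# Placeholder model directory; substitute the Paraformer-large model path or
# model ID you have downloaded.
model_dir = "path/to/paraformer-large"

# quantize=True selects the quantized weights; batch_size controls how many
# utterances are decoded per forward pass (both options are listed above).
model = Paraformer(model_dir, batch_size=2, quantize=True)

# Inputs can be file-path strings; passing a list enables batch processing.
wav_paths = ["example_1.wav", "example_2.wav"]

results = model(wav_paths)
for res in results:
    print(res)
```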

Core Capabilities

  • State-of-the-art speech recognition performance (Top ranked on SpeechIO leaderboard)
  • 10x reduction in machine costs for cloud services
  • Efficient parallel inference using GPUs
  • Support for industrial-scale speech processing
  • Integration with comprehensive speech processing pipeline

Frequently Asked Questions

Q: What makes this model unique?

Paraformer-large's non-autoregressive architecture enables parallel processing of entire sentences, dramatically improving inference efficiency while maintaining accuracy comparable to traditional models. This makes it particularly valuable for large-scale deployment scenarios.

Q: What are the recommended use cases?

The model is ideal for industrial-scale speech recognition applications, particularly in Mandarin language processing. It's especially suitable for cloud services where computational efficiency is crucial, and for applications requiring real-time or near-real-time speech recognition.
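
For near-real-time or cloud pipelines where audio arrives as an in-memory buffer rather than a file on disk, the numpy-array input noted under Implementation Details can be passed instead of a path. The sketch below is illustrative only: it assumes 16 kHz mono audio, uses soundfile merely to produce an example array, and the exact dtype and shape handling should be checked against the funasr_onnx documentation.

```python
import soundfile as sf  # used here only to obtain an example 16 kHz mono waveform

from funasr_onnx import Paraformer

# Placeholder model directory, as in the earlier sketch.
model = Paraformer("path/to/paraformer-large", batch_size=1, quantize=True)

# In a real service this buffer might arrive from a network stream or message
# queue; here it is read from disk purely for illustration.
waveform, sample_rate = sf.read("example.wav", dtype="float32")
assert sample_rate == 16000, "Paraformer-large expects 16 kHz input"

# Pass the in-memory array directly, per the numpy-array support noted above.
result = model(waveform)
print(result)
```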
