Parakeet is an NVIDIA speech-to-text model that converts English audio into text with high accuracy. It supports punctuation and capitalization, and can process up to 24 minutes of audio in a single pass.
Speech recognition
Built on the FastConformer architecture, Parakeet focuses on fast transcription while preserving speech details. It’s designed to handle long recordings and noisy audio, and can be used for tasks like subtitles, voice assistants, and call analytics.
Key capabilities include:
- Transcription throughput of up to 60 minutes of audio in about 1 second (an aggregate speed figure, distinct from the 24-minute per-pass limit below)
- Punctuation and capitalization in the output
- Word-level timestamps
- Robustness to background noise
- Long-form audio support (up to 24 minutes per pass)
- Python and PyTorch compatibility
- Batch processing for multiple audio files
- Integration with the NVIDIA NeMo toolkit
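As a rough illustration of how word-level timestamps could feed the subtitle use case mentioned above, the sketch below groups words into SRT cues. The input shape (a list of dicts with `word`, `start`, and `end` keys, times in seconds) is an assumption made for this example, and the helper names `format_srt_time` and `words_to_srt` are hypothetical. Check the actual output structure of the toolkit version you run.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group word timestamps into numbered SRT cues.

    `words` is assumed to be a list of dicts with "word", "start" and "end"
    keys (times in seconds). Each cue holds at most `max_words` words.
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues in SRT files.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "world.", "start": 0.5, "end": 1.0},
    ]
    print(words_to_srt(sample))
```

Splitting on a fixed word count is the simplest choice here. A real subtitle pipeline would more likely break cues at punctuation or pauses.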
Parakeet is listed on the Hugging Face Open ASR Leaderboard with a reported word error rate (WER) of 6.05%.
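For readers unfamiliar with the metric, word error rate is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words needed to turn the reference transcript into the model's output, and N is the number of words in the reference. The sketch below computes it with a word-level edit distance. The function name `word_error_rate` is hypothetical, and the code compares raw whitespace-separated tokens. Benchmark scoring typically normalizes case and punctuation first, which this example does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 6.05% WER corresponds to roughly six word errors per hundred reference words.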
How to use Parakeet
Parakeet is available via Hugging Face as a web demo and as a model you can run locally with NVIDIA NeMo. Input audio should be WAV or FLAC sampled at 16,000 Hz. Hugging Face access is free but subject to processing limits; local use is also free but requires an NVIDIA GPU. The model supports English only.
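A minimal sketch of the local workflow, under several assumptions: it checks that each WAV file is sampled at 16,000 Hz using Python's standard `wave` module, then hands the files to NeMo. The helper names (`wav_sample_rate`, `needs_resampling`, `transcribe_files`) are hypothetical, `model_name` is left as a parameter because the text above does not name a specific checkpoint, and the NeMo calls shown (`ASRModel.from_pretrained` and `transcribe`) should be checked against the NeMo version you have installed. FLAC input is not covered here, since `wave` reads WAV only.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # Hz, the rate stated in the text above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str) -> bool:
    """True if the WAV file is not sampled at the target rate."""
    return wav_sample_rate(path) != TARGET_SAMPLE_RATE


def transcribe_files(paths, model_name):
    """Transcribe WAV files locally with NeMo (needs NeMo and an NVIDIA GPU).

    `model_name` is the identifier of the Parakeet checkpoint to load;
    it is deliberately not hard-coded in this sketch.
    """
    # Validate inputs first so format problems surface before model loading.
    bad = [p for p in paths if needs_resampling(p)]
    if bad:
        raise ValueError(f"Resample to {TARGET_SAMPLE_RATE} Hz first: {bad}")

    # Imported lazily so the header checks above work without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    results = model.transcribe(list(paths))
    # Assumption: recent NeMo releases return objects with a `.text`
    # attribute, while older ones may return plain strings. Handle both.
    return [getattr(r, "text", r) for r in results]
```

Passing a list of several paths to `transcribe_files` is how the batch-processing capability would be exercised in this sketch. Files at other sample rates need to be resampled with an audio tool before transcription.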

