What is StepFun AI Step-Audio 2 Mini: Complete Guide

  • End-to-end model for speech understanding, reasoning, and generation with Tool Calling and multimodal RAG.
  • 2:3 dual tokenization, prosodic control, and local/online demo with ready-made scripts.
  • SOTA results in ASR, paralinguistics, MMAU, and translation; competitive speech-to-speech conversation.


StepFun AI Step-Audio 2 Mini is an end-to-end speech model that unifies audio understanding, reasoning, and generation in a single architecture. Designed for natural conversation and deep speech analysis, it handles tasks such as ASR, paralinguistic understanding, sound reasoning, translation, and voice-to-voice dialogue, reducing latency and minimizing hallucinations thanks to tool calling and multimodal retrieval.

Beyond the theory, Step-Audio 2 Mini shines in public benchmarks and real-life scenarios: it understands accents and dialects, captures emotions and prosody, and can adjust timbre, rhythm, and style, even singing or rapping. It also integrates web and audio search, and is openly available on GitHub and Hugging Face, making it easy to test, audit, and adapt to product or research needs.

What is StepFun AI Step-Audio 2 Mini

In short, it is the compact version of the Step-Audio 2 family: a production-ready, end-to-end multimodal voice model that unifies classic tasks (ASR and TTS) with advanced dialogue and tool-use capabilities. Unlike staged ASR + LLM + TTS pipelines, its direct audio-to-audio/text design reduces complexity and latency while preserving paralinguistic details (intonation, timbre, rhythm) and non-vocal signals.

Its pillars include intelligent conversation with long context and prosodic sensitivity, and native Tool Calling with multimodal RAG (text and audio) to inject up-to-date knowledge and switch timbre according to retrieved references. This combination reduces hallucinations and makes the answers more useful and natural.

The family is completed by Step-Audio 2 (higher capacity) and related components of the Step-Audio ecosystem, including a 130B-parameter base model used for contextual pre-training with audio and an efficient TTS (Step-Audio-TTS-3B). Although Mini does not require the massive infrastructure of the 130B model, it inherits its generative data pipeline and fine voice-control guidelines.

Architecture and technical keys


The system adopts dual, interleaved tokenization: a semantic codebook of 1024 entries at ~16.7 Hz and an acoustic codebook of 4096 entries at ~25 Hz, synchronized at a 2:3 temporal ratio. This token-level integration allows both linguistic content and sound texture to be represented in detail at the same time.
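As an illustration of the 2:3 temporal ratio, here is a minimal sketch of how two token streams at those rates could be merged at the token level; the function and the toy token names are hypothetical, not StepFun's actual implementation.

```python
# Illustrative sketch (not the official code): interleaving a semantic stream
# (~16.7 Hz, codebook of 1024) with an acoustic stream (~25 Hz, codebook of
# 4096) in repeating groups of 2 semantic + 3 acoustic tokens.

def interleave_2_3(semantic, acoustic):
    """Merge token streams at the fixed 2:3 semantic:acoustic ratio."""
    merged = []
    s, a = 0, 0
    while s < len(semantic) or a < len(acoustic):
        merged.extend(semantic[s:s + 2])   # 2 semantic tokens...
        merged.extend(acoustic[a:a + 3])   # ...followed by 3 acoustic tokens
        s += 2
        a += 3
    return merged

# Toy example (1 s of real audio would yield ~17 semantic and ~25 acoustic tokens):
sem = ["s0", "s1", "s2", "s3"]
aco = ["a0", "a1", "a2", "a3", "a4", "a5"]
print(interleave_2_3(sem, aco))
# ['s0', 's1', 'a0', 'a1', 'a2', 's2', 's3', 'a3', 'a4', 'a5']
```

The grouping of 2 and 3 keeps the two streams time-aligned, which is what lets a single token sequence carry linguistic content and acoustic texture simultaneously.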

For generation, it uses a hybrid voice decoder that combines a flow-matching model with a mel-to-waveform vocoder. Trained with the interleaved dual-codebook scheme, the system retains speech intelligibility and naturalness during synthesis, even when controlling emotion, speed, or style.

The streaming architecture relies on a Controller that coordinates VAD (Voice Activity Detection), real-time audio tokenization, the Step-Audio language model, and the decoder. It incorporates speculative generation (committing ~40% of tokens) and text-based context management with 14:1 compression, which helps maintain coherence in long dialogues at manageable cost.
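To make the coordination concrete, here is a hypothetical sketch of such a controller loop; every class, method, and the naive history-truncation stand-in for the 14:1 compression are illustrative assumptions, not StepFun's actual API.

```python
# Hypothetical sketch of the streaming controller described above. All names
# (Controller, vad, tokenizer, lm, decoder) are illustrative placeholders.

class Controller:
    def __init__(self, vad, tokenizer, lm, decoder, compress_ratio=14):
        self.vad, self.tokenizer, self.lm, self.decoder = vad, tokenizer, lm, decoder
        self.compress_ratio = compress_ratio  # ~14:1 text-based context compression
        self.history = []

    def step(self, audio_chunk):
        if not self.vad.is_speech(audio_chunk):      # gate on voice activity
            return None
        tokens = self.tokenizer.encode(audio_chunk)  # real-time audio tokenization
        reply = self.lm.generate(self.history, tokens)  # LM (speculative decoding inside)
        self.history.append(self.compress(reply))    # keep only compressed text context
        return self.decoder.synthesize(reply)        # tokens -> waveform

    def compress(self, text):
        # Naive stand-in for text-based history compression (~14:1); the real
        # system would summarize rather than truncate.
        return text[: max(1, len(text) // self.compress_ratio)]
```

The key design point is that long-term dialogue state is kept as compressed text rather than raw audio tokens, which is what keeps long conversations affordable.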

Post-training combines SFT for ASR and TTS with Reinforcement Learning from Human Feedback (RLHF) and Chain-of-Thought reasoning focused on paralinguistics. This improves the model's ability to interpret signals such as emotion, tone, or music and respond in a nuanced, controllable way.

Download, installation and local use

The model is available on Hugging Face and in the official repository, with inference-ready scripts and a local web demo. The steps for preparing the environment (conda + pip) and downloading with Git LFS are straightforward and, on modern machines, quick to replicate.

conda create -n stepaudio2 python=3.10
conda activate stepaudio2
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml

# Repository and weights
git clone https://github.com/stepfun-ai/Step-Audio2.git
cd Step-Audio2

# Models on Hugging Face
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini

To run a first test, simply execute the example script: inference works with audio and text and lets you validate the environment configuration without complications.

python examples.py

There is also a local web demo with a simple interface that is built with Gradio, ideal for evaluating voice interaction in a browser.

pip install gradio
python web_demo.py

Online demos, console, and mobile app

StepFun offers a real-time console to test the model from the browser, as well as a mobile assistant with built-in web and audio search. In the app, simply download it from the store, open it, and tap the phone icon in the top right corner to activate voice mode.

The community can join a WeChat group via QR code to discuss, share results, and resolve questions. The direct download links are as follows: GitHub (Step-Audio2), Hugging Face (Step-Audio-2-mini), and ModelScope (same-named model). On some external listings you may see cookie warnings or browser-compatibility messages (as on Reddit or X), which is common on social platforms.

  • GitHub: https://github.com/stepfun-ai/Step-Audio2
  • Hugging Face: https://huggingface.co/stepfun-ai/Step-Audio-2-mini
  • ModelScope: https://www.modelscope.cn/models/stepfun-ai/Step-Audio-2-mini

Benchmark performance: comprehension, paralinguistics, and more

In public and in-house tests, Step-Audio 2 Mini and its bigger sibling show strong benchmark results. Below, we review the key points compared to commercial and open-source systems: GPT-4o Audio, Qwen-Omni/Qwen2.5-Omni, Kimi-Audio, Omni-R1, Audio Flamingo 3, Doubao LLM ASR, among others.

Multilingual ASR (lower CER/WER rates are better)

In English, the average WER is 3.14 for Step-Audio 2 and 3.50 for Step-Audio 2 Mini, over sets such as Common Voice, FLEURS, and LibriSpeech (clean/other). LibriSpeech "other" stands out with 2.42 for Step-Audio 2, below open and commercial alternatives. In Chinese, the averages are 3.08 (Step-Audio 2) and 3.19 (Mini), with good results in AISHELL/AISHELL-2, KeSpeech, and WenetSpeech.

In multilingual scenarios it also shines in Japanese (FLEURS), with 3.18 (Step-Audio 2) and 4.67 (Mini), and competes in Cantonese (Common Voice yue). On the in-house set of Chinese accents and dialects, the averages are 8.85 (Step-Audio 2) and 9.85 (Mini), with clear improvements on demanding dialects such as Shanghainese (17.77 vs. 19.30, against other options that exceed 58).
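For reference, the WER/CER figures quoted above are edit distance between hypothesis and reference, normalized by reference length (words for WER, characters for CER). A minimal self-contained sketch, using a standard Levenshtein dynamic program rather than any StepFun code:

```python
# Standard WER computation: Levenshtein distance over words, divided by the
# number of reference words. CER is the same over characters.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = prev if r == h else 1 + min(prev, dp[j], dp[j - 1])
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

# One substitution and one deletion over six reference words -> 2/6 ≈ 0.333
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Lower is better, which is why a 2.42 on LibriSpeech "other" is a strong result.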

Paralinguistic understanding

In the StepEval-Audio-Paralinguistic suite, Step-Audio 2 reaches 83.09 on average and Step-Audio 2 Mini 80.00. By dimension: gender and age reach 100/96 (Step-Audio 2) and 100/94 (Mini); timbre 82/80; scenario 78/78; emotion 86/82; rhythm 86/68; speed 88/74; style 88/86; and vocals 68/76. The leap over previous systems demonstrates fine prosodic control and perceptual robustness.

Audio Reasoning and Comprehension (MMAU)

On the MMAU benchmark, Step-Audio 2 leads with an average of 78.0 (83.5 in sound, 76.9 in speech, 73.7 in music), while Step-Audio 2 Mini scores 73.2. Among the systems compared: Omni-R1 77.0, Audio Flamingo 3 73.1, Gemini 2.5 Pro 71.6, Qwen2.5-Omni 71.5, and GPT-4o Audio 58.1. This shows competitive auditory reasoning even against commercial alternatives.

Voice translation

In CoVoST 2 (S2TT), the averages are 39.29 for Step-Audio 2 Mini and 39.26 for Step-Audio 2, with particular strength in English→Chinese (~49). In CVSS (S2ST), Step-Audio 2 leads with an average score of 30.87, while Mini scores 29.08; GPT-4o Audio scores around 23.68. Overall, these results consolidate cross-language competence in both text and generated speech.

Native Tool Calling

In StepEval-Audio-Toolcall (audio, date/time, weather, and web search), Step-Audio 2 achieves high trigger precision/recall and 100% type/parameter identification where applicable. For example, in audio search its trigger averages are 86.8/99.5; in web search, 88.4/95.5; and in weather, 92.2/100. Against a strong baseline (Qwen3-32B), it maintains a very solid balance between trigger, type, and parameters.
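The trigger precision/recall pairs quoted above follow the usual definition: did the model invoke a tool only when it should (precision) and whenever it should (recall)? A minimal sketch of that computation, assuming boolean trigger labels (not the benchmark's actual harness):

```python
# Illustrative trigger metric: predicted/expected are parallel lists of
# booleans, True meaning a tool call was (or should have been) triggered.

def trigger_precision_recall(predicted, expected):
    tp = sum(p and e for p, e in zip(predicted, expected))          # correct triggers
    fp = sum(p and not e for p, e in zip(predicted, expected))      # spurious triggers
    fn = sum(e and not p for p, e in zip(predicted, expected))      # missed triggers
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Three triggers fired, one spurious, none missed -> precision 2/3, recall 1.0
print(trigger_precision_recall([True, True, False, True],
                               [True, False, False, True]))
```

A pair like 92.2/100 for weather thus means a few unnecessary calls but no missed ones.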

Voice-to-Voice Conversation (URO-Bench)

For Chinese (basic/pro), Step-Audio 2 scores 83.32/68.25 and Step-Audio 2 Mini 77.81/69.57. In English, GPT-4o Audio scores 84.54/90.41 on average, but Step-Audio 2 follows closely in understanding and reasoning (92.72/76.51 in basic U/R and 64.86/67.75 in pro), while Mini offers a 74.36 basic average, remarkable for an open end-to-end system.

Relationship with Step-Audio (130B) and TTS 3B

The Step-Audio ecosystem includes a 130B model that serves as the textual base, with continued contextualized audio pre-training and task-based post-training. Thanks to a generative data engine, high-quality audio is synthesized to train, and publicly release, an efficient 3B TTS (Step-Audio-TTS-3B) with very granular instruction control (emotions, dialects, styles).

In ASR, compared to references such as Whisper Large-v3 and Qwen2-Audio, the Step-Audio Pretrain and Step-Audio-Chat variants record competitive CER/WER on AISHELL-1/2, WenetSpeech, and LibriSpeech. For example, on AISHELL-1, Step-Audio Pretrain reaches 0.87% CER; and on LibriSpeech test-clean, Step-Audio-Chat achieves 3.11% WER, with Qwen2-Audio at 1.6% as a reference. These figures show that discrete audio tokenization can match or outperform hidden-feature approaches across different sets.

In TTS, the Step-Audio-TTS-3B and "Single" variants show low error rates and high speaker similarity (SS) compared to FireRedTTS, MaskGCT, and CosyVoice/2. In test-zh, for example, Step-Audio-TTS reaches 1.17% CER; in test-en, 2.0% WER, with competitive SS. Furthermore, when generating from discrete tokens, Step-Audio-TTS-3B achieves 2.192% CER (zh) and 3.585% WER (en), with SS around 0.784/0.742, values that reflect voice clarity and stability.
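Speaker similarity (SS) scores like 0.784/0.742 are typically the cosine similarity between speaker embeddings of the reference and the synthesized audio. A minimal sketch of that final step; the speaker-embedding extractor itself is assumed and out of scope here:

```python
# Cosine similarity between two speaker-embedding vectors, the usual basis of
# the SS metric. Embeddings would come from a speaker-verification model.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical embeddings score 1.0; orthogonal ones score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

An SS near 0.78 therefore means the cloned voice's embedding points in nearly the same direction as the reference speaker's.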

Requirements and deployment

For the complete Step-Audio family, NVIDIA GPUs with CUDA are recommended. The reference configuration for Step-Audio-Chat (130B) is four A800/H800 80 GB. A Dockerfile is provided to prepare the environment, along with recommendations such as using vLLM with tensor parallelism for the 130B model (bearing in mind that the official branch may not yet support the Step 1 model, and that a customized flash attention is required due to the ALiBi variant used).

In the case of Step-Audio 2 Mini, the requirements are more modest and local inference is viable for testing and prototyping. The web demo and example scripts make it easy to validate the stack without complex orchestration or distributed infrastructure.

Use cases and practical examples

Step-Audio 2 Mini has already shown it can detect natural sounds and professional voiceovers, control speech tempo on demand, and perform real-time searches to surface breaking news. Faced with philosophical dilemmas, it turns abstract queries into clear methods and steps, reflecting its auditory and verbal reasoning power.

There are also fluent multilingual examples (Chinese, English, Japanese), and language games and idioms such as "It's raining cats and dogs", which it can explain simply and in a natural tone. Public demos include speed control (very fast/very slow), showing that the model not only understands the content but governs prosody on demand.

License and citation

The code and models in the repository are published under the Apache 2.0 license. The associated technical report can be cited as Step-Audio 2 Technical Report (arXiv: 2507.16632), with extensive authorship led by Boyong Wu et al. and affiliation with StepFun AI. For more details, see the arXiv entry and the official BibTeX.

@misc{wu2025stepaudio2technicalreport,
  title={Step-Audio 2 Technical Report},
  author={Boyong Wu et al.},
  year={2025},
  eprint={2507.16632},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.16632}
}

Step-Audio 2 Mini offers a rare blend of ASR precision, paralinguistic understanding, auditory reasoning, and natural synthesis, packaged in an end-to-end framework ready for practical deployment; with tool calling, multimodal RAG, and fine-grained voice control, it stands as an open, versatile, and effective option that leads in several key tasks.