[Draft] Parakeet export #1977

Draft
jiafatom wants to merge 7 commits into microsoft:main from jiafatom:parakeet_export
@jiafatom
Contributor

No description provided.

- C++ implementation: NemotronModel + NemotronState with 3-session RNNT
  pipeline (encoder, decoder, joint) and greedy decode
- Model type registration: nemotron_asr added to ALM types
- Config parsing: encoder/decoder I/O names, audio parameters
- Export tooling: ONNX export, tokenizer conversion, graph fusion +
  INT4 quantization (3.6x encoder size reduction)
- E2E tests: dummy audio + real speech validation
- Documentation: architecture overview and usage guide
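
The greedy decode over the encoder, decoder, and joint sessions can be sketched roughly as follows. This is an illustration in Python, not the PR's C++ code: `decoder_step` and `joint_step` are hypothetical callables standing in for the decoder and joint ONNX sessions, and `blank_id`, `initial_state`, and `max_symbols_per_frame` are assumed parameters. Priming the decoder with the blank token is also an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_rnnt_decode(
    encoder_frames: Sequence[Sequence[float]],
    decoder_step: Callable[[int, object], Tuple[object, object]],
    joint_step: Callable[[Sequence[float], object], List[float]],
    blank_id: int,
    initial_state: object = None,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNNT decode: per encoder frame, emit tokens until blank wins."""
    tokens: List[int] = []
    # Assumption: the decoder is primed with the blank token as a start symbol.
    dec_out, state = decoder_step(blank_id, initial_state)
    for frame in encoder_frames:
        # Cap emissions per frame so a misbehaving joint cannot loop forever.
        for _ in range(max_symbols_per_frame):
            logits = joint_step(frame, dec_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next encoder frame
            tokens.append(best)
            # Non-blank: feed the token back into the decoder, stay on this frame.
            dec_out, state = decoder_step(best, state)
    return tokens
```

The decoder state only advances on non-blank emissions, which is what lets an incremental variant carry `dec_out` and `state` across chunks.
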
Add RunStreamingEncoder with cache carry-forward (MHA + causal conv).
Add GreedyDecodeIncremental for per-chunk RNNT decode.
Auto-detect streaming mode via encoder ONNX input probing.
Batch mode fallback supports per-chunk re-inference.
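
A minimal sketch of how streaming mode could be detected from the encoder's input names. The PR does this in C++ and its exact logic may differ; the function name `is_streaming_encoder` is hypothetical, and the cache input names are taken from the config-parser commit message further down.

```python
from typing import Iterable

# Cache input names as listed in the config-parser commit message.
CACHE_INPUT_NAMES = (
    "cache_last_channel",
    "cache_last_time",
    "cache_last_channel_len",
)


def is_streaming_encoder(input_names: Iterable[str]) -> bool:
    """Return True only if every cache input is present on the encoder."""
    names = set(input_names)
    return all(name in names for name in CACHE_INPUT_NAMES)
```

With ONNX Runtime's Python API, the names could be collected as `[i.name for i in session.get_inputs()]` on an `onnxruntime.InferenceSession`.
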

Export script: --streaming flag wraps encoder with cache I/O.
Streaming encoder: 5 inputs (audio + length + 3 caches), 5 outputs.
Cache shapes: channel [B,24,70,1024], time [B,24,1024,8], len [B].
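
For reference, zero-initialized caches with the shapes above could be built like this. It is a sketch under assumptions: the tensor names follow the config-parser commit message, and the dtypes (`float32` for the two caches, `int64` for the length) are guesses rather than something the commit states.

```python
import numpy as np


def init_encoder_caches(batch_size: int = 1) -> dict:
    """Build zeroed streaming caches in [B, n_layers, ...] layout."""
    return {
        # channel cache: [B, 24, 70, 1024]
        "cache_last_channel": np.zeros((batch_size, 24, 70, 1024), dtype=np.float32),
        # time cache: [B, 24, 1024, 8]
        "cache_last_time": np.zeros((batch_size, 24, 1024, 8), dtype=np.float32),
        # valid length of the channel cache per batch item: [B] (dtype assumed)
        "cache_last_channel_len": np.zeros((batch_size,), dtype=np.int64),
    }
```
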
forward_for_export() handles [B, n_layers, ...] <-> [n_layers, B, ...]
transposition internally. The wrapper was incorrectly adding another
transpose, causing RuntimeError in multi_head_attention during export.

Fix: Remove transpose calls from StreamingEncoderWrapper.forward().
ONNX I/O consistently uses [B, n_layers, ...] format for caches.

Also: clarify cache format comments in nemotron.h/cpp.
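
The double-transpose problem can be reproduced in isolation. In this sketch `inner_forward` is a hypothetical stand-in for `forward_for_export()` that only reports the layout it ends up with, and the tensor sizes are shrunk for illustration.

```python
import numpy as np


def to_layer_major(cache: np.ndarray) -> np.ndarray:
    """Swap the first two axes: [B, n_layers, ...] <-> [n_layers, B, ...]."""
    return np.swapaxes(cache, 0, 1)


def inner_forward(cache_batch_major: np.ndarray) -> tuple:
    """Stand-in for the inner call: expects [B, n_layers, ...], transposes itself."""
    return to_layer_major(cache_batch_major).shape


batch, n_layers = 2, 3
cache = np.zeros((batch, n_layers, 4, 5), dtype=np.float32)

# Fixed wrapper: pass the cache through untouched.
correct = inner_forward(cache)
# Buggy wrapper: an extra transpose cancels the internal one, so the layers
# and batch axes reach the attention code in the wrong order.
buggy = inner_forward(to_layer_major(cache))
```

Here `correct` is `(3, 2, 4, 5)` (layer-major, as the inner code wants) while `buggy` is `(2, 3, 4, 5)`.
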
- export_nemotron_to_onnx.py: add generate_genai_config() and
  generate_audio_processor_config() that extract model dimensions,
  I/O names, and audio params from the loaded NeMo model
- optimize_encoder.py: annotate genai_config.json with optimization
  metadata (fusion type, quantization method) when INT4 is applied
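
The annotation step might look like the sketch below. The `model.encoder.optimization` location and the key names `fusion` and `quantization` are assumptions for illustration; the commit only says that fusion type and quantization method are recorded.

```python
import json


def annotate_optimization(config: dict, fusion: str, quant_method: str) -> dict:
    """Attach optimization metadata to the encoder section of a config dict."""
    # Assumed layout: {"model": {"encoder": {...}}}; created if missing.
    encoder = config.setdefault("model", {}).setdefault("encoder", {})
    encoder["optimization"] = {
        "fusion": fusion,              # hypothetical key name
        "quantization": quant_method,  # hypothetical key name
    }
    return config


# Example: annotate an in-memory config and serialize it.
example = annotate_optimization({"model": {"encoder": {}}}, "attention", "rtn")
example_text = json.dumps(example, indent=2)
```
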
…d quantization

- Mel replay buffer: save mel chunks during blank periods, replay after
  encoder+decoder reset to recover lost audio instead of hallucinating
- Encoder cache reset: zero all 3 encoder cache tensors when stuck detector
  triggers (2+ consecutive blank chunks), not just decoder LSTM state
- Hybrid decoder reset: reset decoder state on first blank chunk, encoder
  caches on second+ blank, then replay buffered mel data
- k_quant_mixed quantization: mixed-precision INT4 that preserves FP32 for
  sensitive layers (attention Q/K/V/Out, first/last encoder, pre_encode)
- HQQ quantization option: Half-Quadratic Quantization support
- optimize_encoder.py: --quant_method flag (rtn|k_quant_mixed|hqq),
  sensitive node detection, external data filename preservation
- generators.cpp: graceful stop via shouldStop flag, drain-on-stop,
  CommitAudio + polling for clean shutdown
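
The hybrid reset policy from the first three bullets can be sketched as a small state machine. This is an interpretation of the commit message, not the PR's C++ code: the class name, the action strings, and the choice to clear the buffer once tokens are emitted again are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StuckRecovery:
    """Tracks consecutive blank chunks and buffers their mel data for replay."""

    blank_chunks: int = 0
    mel_buffer: List[object] = field(default_factory=list)

    def on_chunk(self, mel_chunk: object, emitted_tokens: int) -> List[str]:
        """Return the recovery actions to run after decoding one chunk."""
        if emitted_tokens > 0:
            # Decoding is healthy again: forget the blank streak (assumption).
            self.blank_chunks = 0
            self.mel_buffer.clear()
            return []
        self.blank_chunks += 1
        self.mel_buffer.append(mel_chunk)  # keep audio that produced no tokens
        if self.blank_chunks == 1:
            return ["reset_decoder"]
        # Second or later consecutive blank: reset everything, then replay.
        return ["reset_decoder", "reset_encoder_caches", "replay_mel"]
```

A caller would map each action string to the corresponding operation, for example zeroing the three encoder cache tensors for `reset_encoder_caches` and feeding `mel_buffer` back through the encoder for `replay_mel`.
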
…ig parser

Support streaming cache-aware encoder inputs (cache_last_channel,
cache_last_time, cache_last_channel_len) and outputs (*_next variants)
in genai_config.json parsing. Also add optimization section sink to
silently consume encoder.optimization metadata.
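
A rough Python equivalent of this parsing behavior is shown below. The real parser is C++; the `model.encoder.inputs` / `outputs` layout and the fallback to default tensor names are assumptions, while the cache names and the `*_next` suffix come from the commit message.

```python
import json
from typing import Dict

CACHE_KEYS = ("cache_last_channel", "cache_last_time", "cache_last_channel_len")


def parse_cache_io(config_text: str) -> Dict[str, Dict[str, str]]:
    """Extract cache input/output tensor names from a genai_config.json string."""
    encoder = json.loads(config_text)["model"]["encoder"]
    # The optimization section is metadata only: consume it and ignore it.
    encoder.pop("optimization", None)
    inputs = encoder.get("inputs", {})
    outputs = encoder.get("outputs", {})
    io: Dict[str, Dict[str, str]] = {"inputs": {}, "outputs": {}}
    for key in CACHE_KEYS:
        next_key = key + "_next"
        # Assumption: fall back to the key itself when no name is configured.
        io["inputs"][key] = inputs.get(key, key)
        io["outputs"][next_key] = outputs.get(next_key, next_key)
    return io
```
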
@jiafatom jiafatom changed the title from "Parakeet export" to "[Draft] Parakeet export" Feb 12, 2026
@jiafatom jiafatom marked this pull request as draft February 12, 2026 22:23
"""

import argparse
import os

Check notice, Code scanning / CodeQL: Unused import (Note)
Import of 'os' is not used.

print(f" ✓ Full sequence ({len(token_list)} tokens): {token_list[:20]}{'...' if len(token_list) > 20 else ''}")

# Decode tokens to text (skip the first token which is the dummy BOS)
decoded_ids = np.array(token_list[1:], dtype=np.int32) # Skip dummy BOS

Check notice, Code scanning / CodeQL: Unused local variable (Note, test)
Variable decoded_ids is not used.

resampler = torchaudio.transforms.Resample(sr, 16000)
waveform_t = resampler(waveform_t)
waveform_np = waveform_t.squeeze(0).numpy()
sr = 16000

Check notice, Code scanning / CodeQL: Unused local variable (Note, test)
Variable sr is not used.



if __name__ == "__main__":
    success = main()

Check notice, Code scanning / CodeQL: Unused global variable (Note, test)
The global variable 'success' is not used.

@jiafatom jiafatom force-pushed the parakeet_export branch 2 times, most recently from 4fb3784 to f456a3d Compare February 17, 2026 17:36
import argparse
import json
import os
import shutil

Check notice, Code scanning / CodeQL: Unused import (Note)
Import of 'shutil' is not used.

3 participants