Draft
Conversation
- C++ implementation: NemotronModel + NemotronState with a 3-session RNNT pipeline (encoder, decoder, joint) and greedy decode
- Model type registration: nemotron_asr added to ALM types
- Config parsing: encoder/decoder I/O names, audio parameters
- Export tooling: ONNX export, tokenizer conversion, graph fusion + INT4 quantization (3.6x encoder size reduction)
- E2E tests: dummy audio + real speech validation
- Documentation: architecture overview and usage guide
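The greedy RNNT decode mentioned above can be sketched in Python. This is a minimal sketch of the standard greedy RNNT loop, not the actual C++ sessions: `decoder_step` and `joint` are stand-in callables for the decoder and joint ONNX sessions, and `max_symbols` is an assumed per-frame emission cap.

```python
import numpy as np

def greedy_rnnt_decode(enc_frames, decoder_step, joint, blank_id, max_symbols=10):
    """Greedy RNNT decoding: for each encoder frame, emit symbols until the
    joint network predicts blank (or a per-frame symbol cap is reached)."""
    tokens = []
    dec_state = None
    last_token = blank_id                  # start from blank / dummy BOS
    for enc in enc_frames:                 # enc: one encoder output frame
        for _ in range(max_symbols):       # cap emissions per frame
            dec_out, new_state = decoder_step(last_token, dec_state)
            logits = joint(enc, dec_out)   # (vocab_size,)
            k = int(np.argmax(logits))
            if k == blank_id:
                break                      # blank: advance to next frame
            tokens.append(k)
            # commit decoder state only when a real symbol is emitted
            last_token, dec_state = k, new_state
    return tokens
```

Note the asymmetry that makes RNNT decoding work: the decoder state advances only on non-blank emissions, while blank advances the encoder time axis.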
- Add RunStreamingEncoder with cache carry-forward (MHA + causal conv).
- Add GreedyDecodeIncremental for per-chunk RNNT decode.
- Auto-detect streaming mode via encoder ONNX input probing.
- Batch-mode fallback supports per-chunk re-inference.
- Export script: --streaming flag wraps the encoder with cache I/O.
- Streaming encoder: 5 inputs (audio + length + 3 caches), 5 outputs. Cache shapes: channel [B, 24, 70, 1024], time [B, 24, 1024, 8], len [B].
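The streaming-mode detection and cache carry-forward above can be sketched as follows. Shapes are taken from the commit message; the function names and the `run_encoder` callable are illustrative, not the real C++ API.

```python
import numpy as np

CACHE_INPUTS = ("cache_last_channel", "cache_last_time", "cache_last_channel_len")

def is_streaming_encoder(input_names):
    """Streaming mode is assumed when the encoder graph exposes cache inputs."""
    return all(name in input_names for name in CACHE_INPUTS)

def init_caches(batch=1, n_layers=24, d_model=1024, att_ctx=70, conv_ctx=8):
    # Zero-initialized caches matching the stated shapes:
    # channel [B, 24, 70, 1024], time [B, 24, 1024, 8], len [B].
    return {
        "cache_last_channel": np.zeros((batch, n_layers, att_ctx, d_model), np.float32),
        "cache_last_time": np.zeros((batch, n_layers, d_model, conv_ctx), np.float32),
        "cache_last_channel_len": np.zeros((batch,), np.int64),
    }

def run_chunk(run_encoder, audio, length, caches):
    """One streaming step: feed current caches, carry *_next outputs forward."""
    enc_out, ch_next, t_next, len_next = run_encoder(audio, length, caches)
    caches["cache_last_channel"] = ch_next
    caches["cache_last_time"] = t_next
    caches["cache_last_channel_len"] = len_next
    return enc_out, caches
```

A non-streaming export simply lacks the cache inputs, which is what makes input probing a reliable mode switch.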
forward_for_export() already handles the [B, n_layers, ...] <-> [n_layers, B, ...] transposition internally. The wrapper was incorrectly adding another transpose on top of it, causing a RuntimeError in multi_head_attention during export. Fix: remove the transpose calls from StreamingEncoderWrapper.forward(), so the ONNX I/O consistently uses the [B, n_layers, ...] layout for caches. Also: clarify the cache-format comments in nemotron.h/cpp.
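The double-transpose bug is easy to see with a quick NumPy sketch (shapes from the cache spec above; variable names are illustrative). Two stacked swaps cancel out, so the attention code ends up receiving the batch-first layout it was never written for:

```python
import numpy as np

B, n_layers = 2, 24
cache = np.zeros((B, n_layers, 70, 1024), np.float32)  # ONNX layout: [B, n_layers, ...]

# What forward_for_export() does internally: -> [n_layers, B, ...]
internal = np.swapaxes(cache, 0, 1)
assert internal.shape[:2] == (n_layers, B)             # correct layout for MHA

# Buggy wrapper: transposing *before* calling forward_for_export() means the
# internal swap lands back on [B, n_layers, ...], the wrong layout for MHA.
double = np.swapaxes(np.swapaxes(cache, 0, 1), 0, 1)
assert double.shape[:2] == (B, n_layers)               # MHA expected (n_layers, B)
```

Since B == n_layers never holds in practice, the shape mismatch surfaces as a RuntimeError inside attention rather than a silent accuracy bug.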
- export_nemotron_to_onnx.py: add generate_genai_config() and generate_audio_processor_config(), which extract model dimensions, I/O names, and audio parameters from the loaded NeMo model
- optimize_encoder.py: annotate genai_config.json with optimization metadata (fusion type, quantization method) when INT4 is applied
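A minimal sketch of what such a config generator might produce. The field names below are illustrative, not the exact genai_config.json schema, and `model_info` stands in for values pulled from the loaded NeMo model:

```python
def generate_genai_config(model_info):
    """Sketch: collect model dimensions, I/O names, and audio parameters
    into a single config dict ready to be serialized as JSON."""
    return {
        "model": {
            "type": "nemotron_asr",
            "encoder": {
                "inputs": model_info["encoder_inputs"],
                "outputs": model_info["encoder_outputs"],
                "hidden_size": model_info["d_model"],
            },
            "decoder": {
                "inputs": model_info["decoder_inputs"],
                "outputs": model_info["decoder_outputs"],
            },
        },
        "audio": {
            "sample_rate": model_info["sample_rate"],
            "n_mels": model_info["n_mels"],
        },
    }
```

Generating the config at export time keeps the C++ runtime free of model-specific constants: it reads dimensions and I/O names instead of hard-coding them.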
…d quantization
- Mel replay buffer: save mel chunks during blank periods, then replay them after the encoder+decoder reset to recover the lost audio instead of hallucinating
- Encoder cache reset: zero all 3 encoder cache tensors when the stuck detector triggers (2+ consecutive blank chunks), not just the decoder LSTM state
- Hybrid decoder reset: reset decoder state on the first blank chunk, encoder caches on the second+ blank, then replay the buffered mel data
- k_quant_mixed quantization: mixed-precision INT4 that preserves FP32 for sensitive layers (attention Q/K/V/Out, first/last encoder layers, pre_encode)
- HQQ quantization option: Half-Quadratic Quantization support
- optimize_encoder.py: --quant_method flag (rtn|k_quant_mixed|hqq), sensitive-node detection, external-data filename preservation
- generators.cpp: graceful stop via a shouldStop flag, drain-on-stop, CommitAudio + polling for clean shutdown
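The hybrid blank-recovery policy described above can be sketched as a small state machine. This is a sketch of the policy logic only, with an assumed class name; the real implementation lives in the C++ runtime and acts on actual decoder/encoder state:

```python
class StuckRecovery:
    """Sketch of the hybrid blank-recovery policy:
    1st consecutive blank chunk -> reset decoder state;
    2nd+  -> also zero the encoder caches, then replay buffered mel chunks."""

    def __init__(self):
        self.blank_streak = 0
        self.mel_buffer = []          # mel chunks saved during blank periods

    def on_chunk(self, mel_chunk, emitted_tokens):
        if emitted_tokens:            # real output: not stuck, clear state
            self.blank_streak = 0
            self.mel_buffer.clear()
            return []
        self.blank_streak += 1
        self.mel_buffer.append(mel_chunk)   # keep audio for later replay
        if self.blank_streak == 1:
            return ["reset_decoder"]
        # 2+ consecutive blanks: stuck detector fires a full reset + replay
        return ["reset_encoder_caches", "replay_mel"]
```

The buffer is the key design choice: without it, resetting the caches would simply drop the audio that arrived while the model was stuck.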
…ig parser: support streaming cache-aware encoder inputs (cache_last_channel, cache_last_time, cache_last_channel_len) and outputs (*_next variants) in genai_config.json parsing. Also add an optimization-section sink that silently consumes the encoder.optimization metadata.
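In Python terms, that parsing change might look like the sketch below (the real parser is C++; the key names follow the commit messages, the function name is made up):

```python
STREAM_CACHE_INPUTS = {"cache_last_channel", "cache_last_time",
                       "cache_last_channel_len"}

def parse_encoder_io(encoder_cfg):
    """Sketch: split encoder I/O into regular vs. streaming-cache names, and
    silently consume an 'optimization' metadata section if present."""
    cache_inputs = [n for n in encoder_cfg.get("inputs", [])
                    if n in STREAM_CACHE_INPUTS]
    cache_outputs = [n for n in encoder_cfg.get("outputs", [])
                     if n.endswith("_next")]
    encoder_cfg.pop("optimization", None)   # sink: ignore, don't reject
    return {
        "streaming": len(cache_inputs) == len(STREAM_CACHE_INPUTS),
        "cache_inputs": cache_inputs,
        "cache_outputs": cache_outputs,
    }
```

The silent sink matters for compatibility: older runtimes can load a config annotated by optimize_encoder.py without choking on the unfamiliar section.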
| """ | ||
|
|
||
| import argparse | ||
| import os |
Check notice
Code scanning / CodeQL
Unused import Note
| print(f" ✓ Full sequence ({len(token_list)} tokens): {token_list[:20]}{'...' if len(token_list) > 20 else ''}") | ||
|
|
||
| # Decode tokens to text (skip the first token which is the dummy BOS) | ||
| decoded_ids = np.array(token_list[1:], dtype=np.int32) # Skip dummy BOS |
Check notice
Code scanning / CodeQL
Unused local variable Note test
| resampler = torchaudio.transforms.Resample(sr, 16000) | ||
| waveform_t = resampler(waveform_t) | ||
| waveform_np = waveform_t.squeeze(0).numpy() | ||
| sr = 16000 |
Check notice
Code scanning / CodeQL
Unused local variable Note test
| resampler = torchaudio.transforms.Resample(sr, 16000) | ||
| waveform_t = resampler(waveform_t) | ||
| waveform_np = waveform_t.squeeze(0).numpy() | ||
| sr = 16000 |
Check notice
Code scanning / CodeQL
Unused local variable Note test
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| success = main() |
Check notice
Code scanning / CodeQL
Unused global variable Note test
Force-pushed: 4fb3784 → f456a3d → 79c5025 → f9160cd