
Fix YaRN RoPE bugs in model builder and add parity tests#2076

Open
titaiwangms wants to merge 2 commits into main from pr/fix-yarn-config-resolution

Conversation


@titaiwangms titaiwangms commented Apr 8, 2026

Summary

Fixes four bugs in the ModelBuilder's YaRN RoPE configuration resolution that caused completely wrong cos/sin caches, producing garbage output for ALL YaRN-based models (Ministral-3-3B, OpenAI OS-minier, and any model using beta_fast/beta_slow in rope_scaling).

Also adds comprehensive parity tests and addresses review feedback.

Bugs Fixed

Bug 1: hasattr on dict for original_max_position_embeddings

hasattr(config.rope_scaling, 'original_max_position_embeddings') always returns False for dicts, because dict keys are not attributes. Fixed to use an isinstance(..., Mapping) check plus the in operator.
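A minimal sketch of the difference (the `rope_scaling` dict here is a stand-in, not the builder's actual config object):

```python
from collections.abc import Mapping

rope_scaling = {"original_max_position_embeddings": 4096}

# Broken: hasattr checks for an *attribute*; plain dicts never expose
# their keys as attributes, so this is always False.
assert not hasattr(rope_scaling, "original_max_position_embeddings")

# Fixed: membership test on any Mapping (dict, MappingProxyType, FrozenDict-style).
has_key = (
    isinstance(rope_scaling, Mapping)
    and "original_max_position_embeddings" in rope_scaling
)
assert has_key
```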

Bug 2: rope_theta fallback

Models that store rope_theta only in rope_scaling dict (not as a top-level config attribute) fell through to default theta=10000. Added fallback chain: config.rope_theta → config.rope_embedding_base → config.rope_scaling["rope_theta"] → 10000.
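The fallback chain above can be sketched as follows; `config` is an illustrative namespace and `resolve_rope_theta` a hypothetical helper, not the builder's actual code:

```python
from types import SimpleNamespace

def resolve_rope_theta(config):
    # 1) top-level rope_theta  2) rope_embedding_base
    # 3) rope_scaling["rope_theta"]  4) default 10000
    if getattr(config, "rope_theta", None) is not None:
        return config.rope_theta
    if getattr(config, "rope_embedding_base", None) is not None:
        return config.rope_embedding_base
    scaling = getattr(config, "rope_scaling", None) or {}
    if "rope_theta" in scaling:
        return scaling["rope_theta"]
    return 10000.0

# A model that only carries theta inside rope_scaling no longer falls
# through to the default.
config = SimpleNamespace(rope_scaling={"rope_theta": 1_000_000.0})
assert resolve_rope_theta(config) == 1_000_000.0
assert resolve_rope_theta(SimpleNamespace()) == 10000.0
```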

Bug 3: YaRN mscale override

Always computed mscale from factor via make_mscale(), ignoring explicit mscale=1.0 in config. Ministral-3-3B sets mscale=1.0 but was getting mscale≈1.277. Now respects the config value when > 0.
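A sketch of the override logic, assuming the standard YaRN attention-scaling formula 0.1·ln(factor) + 1; `make_mscale` is named after the helper the PR mentions, and `resolve_mscale` is an illustrative wrapper:

```python
import math

def make_mscale(factor):
    # Standard YaRN attention scaling: 0.1*ln(factor) + 1 for factor > 1.
    return 0.1 * math.log(factor) + 1.0 if factor > 1 else 1.0

def resolve_mscale(rope_scaling, factor):
    # Respect an explicit positive mscale from config; otherwise derive it.
    configured = rope_scaling.get("mscale", 0)
    return configured if configured > 0 else make_mscale(factor)

# Ministral-3-3B-like config: factor=16 would derive ~1.277, but the
# config pins mscale=1.0, which must win.
assert resolve_mscale({"mscale": 1.0}, factor=16) == 1.0
assert abs(resolve_mscale({}, factor=16) - 1.2772588722239781) < 1e-12
```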

Bug 4: inv_freq double-inversion

make_inv_freq_rescaled_with_ntk computed 1/(factor * inv_freq) which double-inverts since inv_freq is already 1/pos_freqs, giving pos_freqs/factor (wrong). Fixed to inv_freq / factor.
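A toy numeric illustration of the double-inversion; plain Python floats stand in for the builder's inv_freq tensor:

```python
theta, dim, factor = 10000.0, 8, 4.0

# inv_freq is already 1/pos_freqs: theta^(-2i/dim) for each pair of dims.
inv_freq = [theta ** (-(i / dim)) for i in range(0, dim, 2)]

broken = [1.0 / (factor * f) for f in inv_freq]  # = pos_freqs/factor (wrong)
fixed = [f / factor for f in inv_freq]           # = inv_freq/factor (correct)

assert broken != fixed
# The fixed interpolation really is the original inv_freq scaled down.
assert all(abs(f * factor - g) < 1e-12 for f, g in zip(fixed, inv_freq))
```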

Review Feedback Addressed

  • Changed isinstance(rope_scaling, dict) → isinstance(rope_scaling, Mapping) in 2 locations (handles FrozenDict, MappingProxyType, dict subclasses)
  • Added Mapping import from collections.abc
  • Added documentation comment for mscale_all_dim handling

Tests Added

Added test/python/test_yarn_rope_parity.py with 8 tests verifying cos/sin cache parity against HuggingFace transformers reference:

  1. test_ministral_3b_cos_sin_match — end-to-end parity (128 positions)
  2. test_bug_a_hasattr_on_dict — regression test for dict key access
  3. test_bug_b_rope_theta_fallback — regression test for theta resolution
  4. test_bug_c_mscale_override — regression test for explicit mscale
  5. test_mscale_fallback_when_absent — tests .get("mscale", 0) fallback path
  6. test_bug_d_inv_freq_no_double_inversion — regression test for NTK scaling
  7. test_full_cache_length — parity at 2048 positions
  8. test_mapping_isinstance_with_frozen_dict — validates Mapping check with MappingProxyType
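The check exercised by test 8 can be shown in isolation (a sketch of what the test verifies, not the actual test code):

```python
from collections.abc import Mapping
from types import MappingProxyType

# A read-only dict view, similar to FrozenDict-style configs.
frozen = MappingProxyType({"rope_theta": 10000.0})

assert not isinstance(frozen, dict)  # why isinstance(..., dict) broke
assert isinstance(frozen, Mapping)   # why the Mapping check works
assert "rope_theta" in frozen        # membership still behaves like a dict
```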

Impact

  • Fixes ALL models using YaRN RoPE (beta_fast/beta_slow in rope_scaling)
  • Verified: cos/sin caches match HF transformers reference (atol=1e-5)
  • Verified: Ministral-3-3B generates correct output ("The capital of France is Paris.")

Files Changed

  • src/python/py/models/builders/base.py — 4 bug fixes + Mapping import + mscale docs
  • test/python/test_yarn_rope_parity.py — 8 new parity tests (new file)

Fix four bugs in the YaRN RoPE configuration that caused completely
wrong cos/sin caches, producing garbage output for YaRN-based models
(e.g. Ministral-3-3B, OpenAI OS-minier).

Bug 1 - hasattr on dict (line 60):
  hasattr(config.rope_scaling, 'original_max_position_embeddings')
  always returns False for dict objects. Fixed to use 'in' operator:
  'original_max_position_embeddings' in config.rope_scaling

Bug 2 - rope_theta fallback (line 231):
  Models that store rope_theta only in rope_scaling dict (not as a
  top-level config attribute) fell through to default theta=10000.
  Added fallback to check config.rope_scaling['rope_theta'].

Bug 3 - mscale override (line 464):
  Always computed mscale from factor via make_mscale_yarn(), ignoring
  the explicit mscale value from config. Models like Ministral-3-3B
  set mscale=1.0 to disable scaling, but got mscale=1.277 instead.
  Now respects the config value when explicitly provided.

Bug 4 - inv_freq double-inversion (line 1750):
  make_inv_freq_rescaled_with_ntk computed:
    interpolation = 1.0 / (factor * inv_freq)
    extrapolation = 1.0 / inv_freq
  Since inv_freq is already 1/pos_freqs, this produced pos_freqs/factor
  and pos_freqs respectively (the raw frequencies, not their inverses).
  Fixed to match HF transformers _compute_yarn_parameters:
    interpolation = inv_freq / factor
    extrapolation = inv_freq
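A hedged sketch of the corrected computation, structured like the HF blend of interpolated and extrapolated inverse frequencies; the ramp values below are illustrative stand-ins for the real beta_fast/beta_slow ramp:

```python
def yarn_inv_freq(dim, theta, factor, ramp):
    # ramp[i] = 1.0 means pure extrapolation, 0.0 means pure interpolation.
    inv_freq = [theta ** (-(2 * i) / dim) for i in range(dim // 2)]
    interpolation = [f / factor for f in inv_freq]  # fixed: not 1/(factor*f)
    extrapolation = inv_freq                        # fixed: not 1/f
    return [lo * (1.0 - r) + hi * r
            for lo, hi, r in zip(interpolation, extrapolation, ramp)]

dim, theta, factor = 8, 10000.0, 4.0
ramp = [1.0, 1.0, 0.5, 0.0]  # illustrative mask, high-freq dims extrapolate
out = yarn_inv_freq(dim, theta, factor, ramp)
base = [theta ** (-(2 * i) / dim) for i in range(dim // 2)]

assert out[0] == base[0]           # pure extrapolation keeps inv_freq
assert out[3] == base[3] / factor  # pure interpolation divides by factor
```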

Impact: Affects ALL models using YaRN RoPE (beta_fast/beta_slow config).
Verified: cos/sin caches now match HuggingFace reference implementation.
Tested with Ministral-3-3B: 'The capital of France is **Paris**.'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 23:15
@titaiwangms titaiwangms marked this pull request as draft April 8, 2026 23:17
Contributor

Copilot AI left a comment


Pull request overview

Fixes YaRN RoPE configuration resolution in the Python ModelBuilder to prevent incorrect rotary cos/sin cache generation (which can produce unusable outputs for YaRN-based models).

Changes:

  • Fixes original_max_position_embeddings detection when rope_scaling is stored as a dict-like object.
  • Adds rope_theta fallback to read from rope_scaling when not present as a top-level config field.
  • Respects explicit YaRN mscale from config and corrects NTK rescaling math to avoid double-inversion.

@titaiwangms titaiwangms force-pushed the pr/fix-yarn-config-resolution branch from 2bb8e6d to dbb6147 Compare April 9, 2026 20:25
- Change isinstance(rope_scaling, dict) to isinstance(rope_scaling, Mapping)
  in 2 locations in base.py. This handles dict subclasses and MappingProxyType
  (e.g., FrozenDict) that HuggingFace configs may use.

- Add test/python/test_yarn_rope_parity.py with 7 tests verifying cos/sin
  cache parity against HuggingFace reference for YaRN RoPE. Covers all 4
  bugs fixed in this PR:
  (a) hasattr on dict — original_context_length from rope_scaling dict
  (b) rope_theta fallback — theta from rope_scaling when not top-level
  (c) mscale=1.0 override — explicit mscale respected, not recomputed
  (d) inv_freq double-inversion — uses inv_freq/factor, not 1/(factor*inv_freq)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the pr/fix-yarn-config-resolution branch from dbb6147 to d786f35 Compare April 9, 2026 20:28
@titaiwangms titaiwangms marked this pull request as ready for review April 9, 2026 21:00
@titaiwangms titaiwangms changed the title from "Fix 4 ModelBuilder config resolution bugs for YaRN RoPE models" to "Fix YaRN RoPE bugs in model builder and add parity tests" Apr 9, 2026