Fix YaRN RoPE bugs in model builder and add parity tests #2076
Open
titaiwangms wants to merge 2 commits into main from
Conversation
Fix four bugs in the YaRN RoPE configuration that caused completely
wrong cos/sin caches, producing garbage output for YaRN-based models
(e.g. Ministral-3-3B, OpenAI OS-minier).
Bug 1 - hasattr on dict (line 60):
hasattr(config.rope_scaling, 'original_max_position_embeddings')
always returns False for dict objects. Fixed to use 'in' operator:
'original_max_position_embeddings' in config.rope_scaling
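The difference is easy to demonstrate in isolation (a minimal repro, not the builder's actual code):

```python
# Minimal repro of Bug 1: hasattr() checks object attributes, not dict keys.
rope_scaling = {"original_max_position_embeddings": 4096, "factor": 8.0}

# Buggy check: a plain dict exposes keys via __getitem__, not attributes,
# so hasattr() is always False here and the YaRN branch is silently skipped.
buggy = hasattr(rope_scaling, "original_max_position_embeddings")

# Fixed check: the membership test looks at the dict's keys.
fixed = "original_max_position_embeddings" in rope_scaling

print(buggy, fixed)  # False True
```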
Bug 2 - rope_theta fallback (line 231):
Models that store rope_theta only in rope_scaling dict (not as a
top-level config attribute) fell through to default theta=10000.
Added fallback to check config.rope_scaling['rope_theta'].
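A sketch of the fallback logic described above (the function and class names here are illustrative, not the exact builder code):

```python
DEFAULT_THETA = 10000.0

def resolve_rope_theta(config) -> float:
    # Prefer a top-level rope_theta, then fall back to the
    # rope_scaling dict, then the conventional default.
    if getattr(config, "rope_theta", None) is not None:
        return config.rope_theta
    rope_scaling = getattr(config, "rope_scaling", None) or {}
    if "rope_theta" in rope_scaling:
        return rope_scaling["rope_theta"]
    return DEFAULT_THETA

class Cfg:  # minimal stand-in for an HF-style config object
    rope_theta = None
    rope_scaling = {"rope_theta": 1000000.0, "factor": 8.0}

print(resolve_rope_theta(Cfg()))  # 1000000.0, not the 10000.0 default
```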
Bug 3 - mscale override (line 464):
Always computed mscale from factor via make_mscale_yarn(), ignoring
the explicit mscale value from config. Models like Ministral-3-3B
set mscale=1.0 to disable scaling, but got mscale=1.277 instead.
Now respects the config value when explicitly provided.
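The intended behavior can be sketched as follows (helper names are illustrative; the YaRN default uses the standard 0.1 * ln(factor) + 1.0 attention-scaling formula):

```python
import math

def make_mscale_yarn(factor: float) -> float:
    # YaRN's default attention scaling grows with the context-extension factor.
    return 0.1 * math.log(factor) + 1.0 if factor > 1 else 1.0

def resolve_mscale(rope_scaling: dict) -> float:
    # Fixed behavior: an explicit positive mscale from config wins;
    # only compute the YaRN default when none is provided.
    explicit = rope_scaling.get("mscale", 0)
    if explicit > 0:
        return explicit
    return make_mscale_yarn(rope_scaling["factor"])

# Ministral-3-3B-style config: mscale=1.0 explicitly disables scaling.
print(resolve_mscale({"factor": 16.0, "mscale": 1.0}))  # 1.0
# Without an explicit mscale, the formula applies (about 1.277 at factor=16).
print(resolve_mscale({"factor": 16.0}))
```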
Bug 4 - inv_freq double-inversion (line 1750):
make_inv_freq_rescaled_with_ntk computed:
interpolation = 1.0 / (factor * inv_freq)
extrapolation = 1.0 / inv_freq
Since inv_freq is already 1/pos_freqs, this produced pos_freqs/factor
and pos_freqs respectively (the raw frequencies, not their inverses).
Fixed to match HF transformers _compute_yarn_parameters:
interpolation = inv_freq / factor
extrapolation = inv_freq
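A small numeric illustration of the double-inversion (toy values, not the builder's actual tensors):

```python
# inv_freq already holds 1/pos_freqs, so wrapping it in another
# 1/(...) re-inverts it back into raw frequencies.
theta, dim, factor = 10000.0, 8, 4.0
pos_freqs = [theta ** (2 * i / dim) for i in range(dim // 2)]
inv_freq = [1.0 / f for f in pos_freqs]

buggy_interp = [1.0 / (factor * x) for x in inv_freq]  # equals pos_freqs/factor (wrong)
fixed_interp = [x / factor for x in inv_freq]          # equals inv_freq/factor (matches HF)

# e.g. at index 1: pos_freqs[1] = 10.0, inv_freq[1] = 0.1
print(buggy_interp[1], fixed_interp[1])  # 2.5 0.025
```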
Impact: Affects ALL models using YaRN RoPE (beta_fast/beta_slow config).
Verified: cos/sin caches now match HuggingFace reference implementation.
Tested with Ministral-3-3B: 'The capital of France is **Paris**.'
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Fixes YaRN RoPE configuration resolution in the Python ModelBuilder to prevent incorrect rotary cos/sin cache generation (which can produce unusable outputs for YaRN-based models).
Changes:
- Fixes original_max_position_embeddings detection when rope_scaling is stored as a dict-like object.
- Adds rope_theta fallback to read from rope_scaling when not present as a top-level config field.
- Respects explicit YaRN mscale from config and corrects NTK rescaling math to avoid double-inversion.
Force-pushed from 2bb8e6d to dbb6147
- Change isinstance(rope_scaling, dict) to isinstance(rope_scaling, Mapping) in 2 locations in base.py. This handles dict subclasses and MappingProxyType (e.g., FrozenDict) that HuggingFace configs may use.
- Add test/python/test_yarn_rope_parity.py with 7 tests verifying cos/sin cache parity against the HuggingFace reference for YaRN RoPE. Covers all 4 bugs fixed in this PR:
  (a) hasattr on dict — original_context_length from rope_scaling dict
  (b) rope_theta fallback — theta from rope_scaling when not top-level
  (c) mscale=1.0 override — explicit mscale respected, not recomputed
  (d) inv_freq double-inversion — uses inv_freq/factor, not 1/(factor*inv_freq)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
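The Mapping change can be demonstrated with a read-only mapping view, the case a plain dict check misses:

```python
from collections.abc import Mapping
from types import MappingProxyType

# Why Mapping instead of dict: some configs hand back read-only
# mapping views, which fail a plain isinstance(..., dict) check
# even though they behave like dicts for key lookups.
rope_scaling = MappingProxyType({"factor": 8.0, "rope_theta": 1000000.0})

print(isinstance(rope_scaling, dict))     # False
print(isinstance(rope_scaling, Mapping))  # True
print("rope_theta" in rope_scaling)       # True, key access still works
```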
Force-pushed from dbb6147 to d786f35
Summary
Fixes four bugs in the ModelBuilder's YaRN RoPE configuration resolution that caused completely wrong cos/sin caches, producing garbage output for all YaRN-based models (Ministral-3-3B, OpenAI OS-minier, and any model using beta_fast/beta_slow in rope_scaling). Also adds comprehensive parity tests and addresses review feedback.
Bugs Fixed
Bug 1: hasattr on dict for original_max_position_embeddings
hasattr(config.rope_scaling, 'original_max_position_embeddings') always returns False for dicts. Fixed to use an isinstance(..., Mapping) check plus the in operator.

Bug 2: rope_theta fallback
Models that store rope_theta only in the rope_scaling dict (not as a top-level config attribute) fell through to the default theta=10000. Added fallback chain: config.rope_theta → config.rope_embedding_base → config.rope_scaling["rope_theta"] → 10000.

Bug 3: YaRN mscale override
Always computed mscale from the factor via make_mscale(), ignoring the explicit mscale=1.0 in config. Ministral-3-3B sets mscale=1.0 but was getting mscale≈1.277. Now respects the config value when > 0.

Bug 4: inv_freq double-inversion
make_inv_freq_rescaled_with_ntk computed 1/(factor * inv_freq), which double-inverts since inv_freq is already 1/pos_freqs, giving pos_freqs/factor (wrong). Fixed to inv_freq / factor.

Review Feedback Addressed
- isinstance(rope_scaling, dict) → isinstance(rope_scaling, Mapping) in 2 locations (handles FrozenDict, MappingProxyType, dict subclasses)
- Mapping import from collections.abc
- mscale_all_dim handling
Added test/python/test_yarn_rope_parity.py with 8 tests verifying cos/sin cache parity against the HuggingFace transformers reference:
- test_ministral_3b_cos_sin_match — end-to-end parity (128 positions)
- test_bug_a_hasattr_on_dict — regression test for dict key access
- test_bug_b_rope_theta_fallback — regression test for theta resolution
- test_bug_c_mscale_override — regression test for explicit mscale
- test_mscale_fallback_when_absent — tests the .get("mscale", 0) fallback path
- test_bug_d_inv_freq_no_double_inversion — regression test for NTK scaling
- test_full_cache_length — parity at 2048 positions
- test_mapping_isinstance_with_frozen_dict — validates the Mapping check with MappingProxyType

Impact
- Affects all models using YaRN RoPE (beta_fast/beta_slow in rope_scaling)
- cos/sin caches now match the HuggingFace reference implementation (atol=1e-5)

Files Changed
- src/python/py/models/builders/base.py — 4 bug fixes + Mapping import + mscale docs
- test/python/test_yarn_rope_parity.py — 8 new parity tests (new file)