Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

run-slow: audioflamingo3, granite_speech, qwen2_audio

This comment contains models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
eustlb
left a comment
This is cool!! 🔥
Models that should be covered by this PR:
- audioflamingo3
- glmasr
- granite_speech
- higgs_audio_v2
- kyutai_speech_to_text
- qwen2_audio
- vibevoice_asr
- voxtral
- voxtral_realtime
- musicflamingo
might:
- gemma3n
- gemma4
- qwen2_5_omni
- qwen3_omni_moe
```python
class AudioModelTester:
    # If the model follows standard naming conventions, only `config_class` and
```

Even if the ALMs acronym is not as conventionally accepted as VLMs, such a notation is used, so I'd rather align fully with VLMs here; it would be clearer imo.

Suggested change:
```python
class ALMModelTester:
    # If the model follows standard naming conventions, only `config_class` and
```

and rename the test file to alm_tester.py
```python
base_model_class = None
sequence_classification_class = None
```
I don't see a model for which sequence_classification_class would be set. Not sure we should keep it.
```python
kwargs.setdefault("audio_token_id", 0)
kwargs.setdefault("audio_token_index", 0)  # Alias for models that use this name
kwargs.setdefault("ignore_index", -100)
```

I would rather have it added directly when `__init__` is overwritten.

Suggested change:
```python
kwargs.setdefault("audio_token_id", 0)
kwargs.setdefault("ignore_index", -100)
```
There was a problem hiding this comment.
+1, i think we won't even need audio_token_index because config has an attribute_map
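For context, an `attribute_map` lets a config expose a legacy attribute name as an alias for the canonical one, so only `audio_token_id` needs a default. A minimal sketch of the mechanism (the class below is a toy stand-in mimicking the convention, not the real `PretrainedConfig`):

```python
class MiniConfig:
    # Maps legacy attribute names to canonical ones, mimicking the
    # attribute_map convention on transformers configs.
    attribute_map = {"audio_token_index": "audio_token_id"}

    def __init__(self, audio_token_id=0):
        self.audio_token_id = audio_token_id

    def __getattr__(self, name):
        # Called only when normal lookup fails: redirect aliased names.
        if name != "attribute_map" and name in self.attribute_map:
            return getattr(self, self.attribute_map[name])
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Writes through the alias land on the canonical attribute.
        name = self.attribute_map.get(name, name)
        super().__setattr__(name, value)


config = MiniConfig(audio_token_id=7)
print(config.audio_token_index)  # legacy name resolves to audio_token_id -> 7
```

With such a map in place, a tester that sets only `audio_token_id` still serves models whose code reads `audio_token_index`.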
```python
raise ValueError(
    f"You have inherited from AudioModelTester but did not set the {required_attribute} attribute."
)
```
we should remove the defaults for text_config and audio_config no? and rather use the same as for VLMs
Suggested change:
```python
for required_attribute in [
    "base_model_class",
    "config_class",
    "conditional_generation_class",
    "text_config_class",
    "audio_config_class",
]:
    if getattr(self, required_attribute) is None:
        raise ValueError(
            f"You have inherited from VLMModelTester but did not set the {required_attribute} attribute."
        )
```
```python
def get_num_audio_tokens(self, audio_features):
    """Compute number of audio placeholder tokens from features. Override for different subsampling."""
    # Default: 2-stage pooling (common for Whisper-style encoders)
    input_length = (audio_features.shape[-1] - 1) // 2 + 1
    return (input_length - 2) // 2 + 1
```
we shouldn't put Whisper defaults here but rather force subclasses to write this method
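For reference, the arithmetic of the current default can be checked in isolation; a sketch of the two-stage stride-2 subsampling (the frame count below is illustrative):

```python
def num_audio_tokens(num_frames: int) -> int:
    # Stage 1: stride-2 convolution over the mel frames.
    after_conv = (num_frames - 1) // 2 + 1
    # Stage 2: stride-2, kernel-2 pooling with no padding.
    return (after_conv - 2) // 2 + 1

# A 3000-frame mel spectrogram (30 s at a 10 ms hop) collapses to 750 tokens.
print(num_audio_tokens(3000))  # -> 750
```

The point of the review comment stands: these strides are one particular encoder's geometry, so each subclass should spell out its own version of this arithmetic.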
```python
config = self.get_config()
audio_features = self.create_audio_features()
num_audio_tokens = self.get_num_audio_tokens(audio_features)
```
Some models take `input_values` instead, e.g. vibevoice_asr. We should handle those too.
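One hedged way to support both names would be to inspect the model's forward signature; `audio_input_name` below is a hypothetical helper sketching the idea, not part of the PR:

```python
import inspect

def audio_input_name(model_class) -> str:
    # Pick the keyword that carries the audio tensor from the model's
    # forward signature; only the two names seen in the discussion are tried.
    params = inspect.signature(model_class.forward).parameters
    for candidate in ("input_features", "input_values"):
        if candidate in params:
            return candidate
    raise ValueError(f"{model_class.__name__} exposes no known audio input")

class DummyModel:
    # Stand-in for a model that consumes raw waveforms rather than features.
    def forward(self, input_values=None, attention_mask=None):
        ...

print(audio_input_name(DummyModel))  # -> input_values
```

The tester could then build its dummy inputs under whichever key the model actually accepts.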
[For maintainers] Suggested jobs to run (before merge):
run-slow: audioflamingo3, granite_speech, qwen2_audio
```python
# Text config defaults (small Qwen2-style backbone)
kwargs.setdefault(
    "text_config",
    {
        "model_type": "qwen2",
        "intermediate_size": 36,
        "initializer_range": 0.02,
        "hidden_size": 32,
        "max_position_embeddings": 52,
        "num_hidden_layers": 2,
        "num_attention_heads": 4,
        "num_key_value_heads": 2,
        "vocab_size": 99,
        "pad_token_id": 1,
    },
)
```
can we really have a default for text/audio configs?
```python
def test_sdpa_can_dispatch_composite_models(self):
    """Verify SDPA toggles propagate correctly to audio and text sub-modules."""
    if not self.has_attentions:
```
hmm why is this not handled by the Mixin, afair it gets base-model and the common attributes?
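For illustration, the propagation such a test verifies reduces to a recursive walk over sub-configs; the stubs below only mimic the `_attn_implementation` naming convention and are not the real transformers API:

```python
class Config:
    def __init__(self, attn_implementation="sdpa"):
        self._attn_implementation = attn_implementation

class Module:
    # Toy composite module: a config plus child sub-modules.
    def __init__(self, config, children=()):
        self.config = config
        self.children = list(children)

def check_attn_dispatch(module, expected="sdpa"):
    # Walk the module tree and confirm every sub-config was toggled.
    ok = module.config._attn_implementation == expected
    return ok and all(check_attn_dispatch(c, expected) for c in module.children)

audio = Module(Config("sdpa"))
text = Module(Config("sdpa"))
model = Module(Config("sdpa"), children=[audio, text])
print(check_attn_dispatch(model))  # -> True
```

If the common Mixin already walks base-model attributes this way, the composite test may indeed be redundant, which is the question raised above.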
What does this PR do?
Similarly to the VLM tester, this patch introduces an audio tester class used in the audio-language model tests.
Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
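As an illustration of the claimed ~8-20 line subclass, here is a self-contained toy of the pattern (the base class, required attributes, and dummy names are assumptions drawn from the review discussion above, not the merged API):

```python
class AudioModelTester:
    # Subclasses must fill these in; __init__ enforces it.
    config_class = None
    base_model_class = None
    conditional_generation_class = None

    def __init__(self):
        for attr in ("config_class", "base_model_class", "conditional_generation_class"):
            if getattr(self, attr) is None:
                raise ValueError(
                    f"You have inherited from AudioModelTester but did not set the {attr} attribute."
                )

    def get_num_audio_tokens(self, num_frames):
        # No Whisper default: subclasses spell out their own subsampling.
        raise NotImplementedError

class DummyConfig: ...
class DummyModel: ...
class DummyForConditionalGeneration: ...

class DummyModelTester(AudioModelTester):
    config_class = DummyConfig
    base_model_class = DummyModel
    conditional_generation_class = DummyForConditionalGeneration

    def get_num_audio_tokens(self, num_frames):
        # Model-specific subsampling: 4x downsampling in this toy.
        return num_frames // 4

tester = DummyModelTester()
print(tester.get_num_audio_tokens(100))  # -> 25
```

The per-model code is just the attribute assignments and one arithmetic method, which is where the ~8-20 line figure comes from.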