
audio tester class#45391

Open
tarekziade wants to merge 3 commits into main from tarekziade-audio-test

Conversation

@tarekziade
Collaborator

What does this PR do?

Similar to the VLM tester, this patch introduces an audio tester class, used in:

  • Qwen2Audio
  • AudioFlamingo3
  • GraniteSpeech

Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
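To make the "~8-20 lines per model" claim concrete, here is a hypothetical sketch (not code from this PR) of the pattern described above: the shared base holds the boilerplate, and a per-model tester only declares which classes it exercises. All names below are illustrative placeholders.

```python
# Hypothetical sketch of the shared-tester pattern; names are placeholders,
# not the actual classes added by this PR.

class AudioModelTesterSketch:
    """Stand-in for the shared audio tester base class."""
    config_class = None
    conditional_generation_class = None

    def __init__(self):
        # Fail loudly if a subclass forgot to declare its model classes.
        for required in ("config_class", "conditional_generation_class"):
            if getattr(self, required) is None:
                raise ValueError(
                    f"You have inherited from AudioModelTesterSketch "
                    f"but did not set the {required} attribute."
                )


class Qwen2AudioTesterSketch(AudioModelTesterSketch):
    # The real tester would point at the actual config/model classes.
    config_class = "Qwen2AudioConfig"
    conditional_generation_class = "Qwen2AudioForConditionalGeneration"
```

With this shape, a new model's tester is just the subclass declaring its classes; the config introspection, input preparation, and common skips come from the base.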

@tarekziade tarekziade self-assigned this Apr 13, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade
Collaborator Author

run-slow: audioflamingo3, granite_speech, qwen2_audio

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 5472faa4 workflow commit (merge commit)
PR 0817bdbd branch commit (from PR)
main a5533957 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

Contributor

@eustlb eustlb left a comment


This is cool!! 🔥
Models that should be covered by this PR:

  • audioflamingo3
  • glmasr
  • granite_speech
  • higgs_audio_v2
  • kyutai_speech_to_text
  • qwen2_audio
  • vibevoice_asr
  • voxtral
  • voxtral_realtime
  • musicflamingo

might:

  • gemma3n
  • gemma4
  • qwen2_5_omni
  • qwen3_omni_moe

Comment on lines +35 to +37

class AudioModelTester:
# If the model follows standard naming conventions, only `config_class` and
Contributor


Even if the ALM acronym is not as conventionally accepted as VLM, the notation is used, so I'd rather align fully with VLMs here; it would be clearer imo.

Suggested change
- class AudioModelTester:
+ class ALMModelTester:
  # If the model follows standard naming conventions, only `config_class` and

and rename the test file to alm_tester.py

Comment on lines +41 to +43
base_model_class = None
sequence_classification_class = None

Contributor


I don't see a model for which sequence_classification_class would be set. Not sure we should keep it.

Comment on lines +86 to +88
kwargs.setdefault("audio_token_id", 0)
kwargs.setdefault("audio_token_index", 0) # Alias for models that use this name
kwargs.setdefault("ignore_index", -100)
Contributor


I would rather have it added directly where `__init__` is overridden.

Suggested change
  kwargs.setdefault("audio_token_id", 0)
- kwargs.setdefault("audio_token_index", 0)  # Alias for models that use this name
  kwargs.setdefault("ignore_index", -100)

Member


+1, I think we won't even need `audio_token_index` because the config has an `attribute_map`.
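For context on the `attribute_map` point: transformers' `PretrainedConfig` lets a config class declare an `attribute_map` that aliases one attribute name to another, so both `audio_token_index` and `audio_token_id` can resolve to the same value. The following is a minimal self-contained sketch of that aliasing behavior, not the real `PretrainedConfig` implementation.

```python
class ConfigWithAliases:
    """Simplified sketch of transformers' `attribute_map` aliasing:
    reads and writes on an aliased name are redirected to the canonical one."""

    attribute_map = {"audio_token_index": "audio_token_id"}

    def __init__(self, audio_token_id=0):
        self.audio_token_id = audio_token_id

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        mapped = type(self).attribute_map.get(name)
        if mapped is not None:
            return getattr(self, mapped)
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Redirect writes on an alias to the canonical attribute.
        name = type(self).attribute_map.get(name, name)
        super().__setattr__(name, value)
```

With this in place, a tester only ever needs to set `audio_token_id`; models that read `audio_token_index` still see the same value.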

raise ValueError(
    f"You have inherited from AudioModelTester but did not set the {required_attribute} attribute."
)

Contributor


We should remove the defaults for `text_config` and `audio_config`, no? And rather use the same approach as for VLMs:

Suggested change
for required_attribute in [
    "base_model_class",
    "config_class",
    "conditional_generation_class",
    "text_config_class",
    "audio_config_class",
]:
    if getattr(self, required_attribute) is None:
        raise ValueError(
            f"You have inherited from VLMModelTester but did not set the {required_attribute} attribute."
        )

Comment on lines +156 to +160
def get_num_audio_tokens(self, audio_features):
    """Compute number of audio placeholder tokens from features. Override for different subsampling."""
    # Default: 2-stage pooling (common for Whisper-style encoders)
    input_length = (audio_features.shape[-1] - 1) // 2 + 1
    return (input_length - 2) // 2 + 1
Contributor


We shouldn't put Whisper defaults here; we should force subclasses to implement this method.
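As a sanity check of the arithmetic in the quoted default: the two integer divisions model Whisper-style subsampling, where a stride-2 convolution roughly halves the mel-frame count and a second stride-2 stage halves it again. The frame counts below are illustrative.

```python
def num_audio_tokens(num_mel_frames: int) -> int:
    """Two-stage subsampling as in the quoted default: a stride-2 stage
    halves the sequence length, then a second stride-2 stage halves it again."""
    input_length = (num_mel_frames - 1) // 2 + 1  # after the first stride-2 stage
    return (input_length - 2) // 2 + 1            # after the second stride-2 stage

# Example: 30 s of audio at 100 mel frames/s gives 3000 frames,
# which this formula reduces to 1500 and then 750 placeholder tokens.
```

The reviewer's point stands: this particular formula only holds for Whisper-style encoders, so models with different subsampling would silently get the wrong token count unless forced to override.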

Comment on lines +210 to +212
config = self.get_config()
audio_features = self.create_audio_features()
num_audio_tokens = self.get_num_audio_tokens(audio_features)
Contributor


Some models take `input_values` (e.g. vibevoice_asr). We should handle those too.
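One hedged way the base tester could accommodate both input styles is an overridable input name, so raw-waveform models swap `input_features` for `input_values` in one line. This is a hypothetical sketch, not code from this PR; the class names are placeholders.

```python
class AudioInputPreparerSketch:
    """Hypothetical hook: subclasses override `audio_input_name` when the
    model consumes raw waveforms (`input_values`) instead of mel features."""

    audio_input_name = "input_features"

    def prepare_audio_inputs(self, audio_tensor):
        # Route the audio tensor under whichever kwarg the model expects.
        return {self.audio_input_name: audio_tensor}


class RawWaveformPreparerSketch(AudioInputPreparerSketch):
    # For models like vibevoice_asr that take raw waveforms.
    audio_input_name = "input_values"
```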

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, granite_speech, qwen2_audio

Comment on lines +91 to +106
# Text config defaults (small Qwen2-style backbone)
kwargs.setdefault(
    "text_config",
    {
        "model_type": "qwen2",
        "intermediate_size": 36,
        "initializer_range": 0.02,
        "hidden_size": 32,
        "max_position_embeddings": 52,
        "num_hidden_layers": 2,
        "num_attention_heads": 4,
        "num_key_value_heads": 2,
        "vocab_size": 99,
        "pad_token_id": 1,
    },
)
Member


Can we really have a default for text/audio configs?

Comment on lines +276 to +278
def test_sdpa_can_dispatch_composite_models(self):
    """Verify SDPA toggles propagate correctly to audio and text sub-modules."""
    if not self.has_attentions:
Member


Hmm, why is this not handled by the mixin? AFAIR it gets the base model and the common attributes.

