@amitport (Contributor) commented Mar 1, 2025

Hi,

When loading an auto-model, max_seq_length is read directly from the Hugging Face configuration and cannot easily be overridden.

example:

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})

assert model.max_seq_length == 32, f"expected 32, but got {model.max_seq_length=}"

This PR ensures that max_seq_length is overridden even when an existing value was loaded from the model configuration.

@tomaarsen (Member) commented:
Hello!

This seems to be an issue only for the models where a sentence_bert_config.json specifies a max_seq_length: https://huggingface.co/BAAI/bge-small-en-v1.5/blob/main/sentence_bert_config.json

This value is indeed seen as "user-provided", which has priority over any values from transformers (including the values you provide with tokenizer_kwargs). This is indeed a bit frustrating, but I don't really want to change it to just min(...) as I'd like to allow users to set whatever maximum sequence length they want (even if the tokenizer/model disagrees). I also risk backwards incompatibility if I change this.
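The resolution order described above can be sketched as plain Python. This is a hypothetical illustration of the described behavior, not the actual sentence-transformers internals; the function name and arguments are made up for clarity.

```python
def resolve_max_seq_length(sbert_config_value, tokenizer_model_max_length):
    """Pick the effective max_seq_length.

    A value found in sentence_bert_config.json is treated as user-provided
    and wins over anything coming from the transformers tokenizer, including
    a model_max_length passed via tokenizer_kwargs.
    """
    if sbert_config_value is not None:
        return sbert_config_value
    return tokenizer_model_max_length


# BAAI/bge-small-en-v1.5 ships a sentence_bert_config.json that sets
# max_seq_length, so a tokenizer_kwargs override of 32 is silently ignored:
assert resolve_max_seq_length(512, 32) == 512
# Only models without such a config pick up the tokenizer value:
assert resolve_max_seq_length(None, 32) == 32
```

This is why changing the rule to min(...) would alter behavior for every model that deliberately sets a larger value in its config, which is the backwards-compatibility concern mentioned above.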

You can avoid this with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", tokenizer_kwargs={"model_max_length": 32})
model.max_seq_length = 32

assert model.max_seq_length == 32, f"expected 32, but got {model.max_seq_length=}"
assert model[0].max_seq_length == 32, f"expected 32, but got {model[0].max_seq_length=}"

but that too isn't ideal.
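The two-step workaround above can be wrapped in a small helper. This helper is hypothetical (not part of sentence-transformers); it sets max_seq_length on both the top-level model and its first module, mirroring the two asserts in the snippet. A minimal stand-in object is used for the demonstration, since loading a real model requires downloading weights.

```python
def set_max_seq_length(model, value):
    """Set max_seq_length on the model and on its first module."""
    model.max_seq_length = value  # top-level attribute
    try:
        model[0].max_seq_length = value  # first module (the Transformer)
    except (TypeError, IndexError):
        pass  # model is not indexable or has no modules


# Demonstration with a dummy SentenceTransformer-like object:
class _Module:
    max_seq_length = 512


class _DummyModel:
    max_seq_length = 512

    def __init__(self):
        self._modules = [_Module()]

    def __getitem__(self, i):
        return self._modules[i]


m = _DummyModel()
set_max_seq_length(m, 32)
assert m.max_seq_length == 32
assert m[0].max_seq_length == 32
```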

  • Tom Aarsen

@amitport (Contributor, Author) commented:
@tomaarsen I used the workaround and it's fine, but the current behavior is still a bug IMHO (and a silent one that may make models fail unexpectedly for the user).

@tomaarsen (Member) commented:
Fair enough, I'll try to revisit this PR and see if there's a solid solution that doesn't break backwards compatibility, but also fixes this issue.
I merged master to avoid some issues with the CI tests.

  • Tom Aarsen
