Conversation

@grimoire commented on Sep 2, 2025

Since we are refactoring the chat template, we will use the internlm2 template for now. Example usage:

```python
from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig, ChatTemplateConfig
# Optional helpers for timing and visualizing pipeline output (unused in this minimal example).
from lmdeploy.pytorch.tools.utils import Timer, visualize_pipe_out

if __name__ == '__main__':
    model_path = 'JetLM/SDAR-1.7B-Chat'
    chat_template_config = ChatTemplateConfig('internlm2')

    log_level = 'WARNING'

    # Unmasking strategy used by the dllm decoder; 'sequential' is the alternative.
    dllm_unmasking_strategy = 'low_confidence_dynamic'
    # dllm_unmasking_strategy = 'sequential'

    prompts = [
        'hakuna matata!',
        'The quick brown fox jumps over the lazy dog.',
    ]

    backend_config = PytorchEngineConfig(
        tp=1,
        dllm_block_length=4,
        dllm_unmasking_strategy=dllm_unmasking_strategy,
    )

    gen_config = GenerationConfig(
        max_new_tokens=512,
    )

    with pipeline(model_path,
                  backend_config=backend_config,
                  chat_template_config=chat_template_config,
                  log_level=log_level) as pipe:
        outputs = pipe(prompts, gen_config=gen_config)
        print(outputs)
```

@grimoire changed the title from [POC] Support SDAR to Support SDAR on Sep 8, 2025
@grimoire marked this pull request as ready for review on September 8, 2025 07:14
@lvhan028 self-requested a review on September 9, 2025 07:21
@lvhan028 added the enhancement label on Sep 12, 2025

```diff
@@ -179,7 +183,7 @@ def _reorder_migrating():
         return running_migration

     @logging_timer('SchedulePrefilling', logger)
-    def _schedule_prefill(self):
+    def _schedule_prefill(self, prealloc_size: int = 0):
```
Collaborator

What's the purpose of prealloc_size? I see it's always passed as 0 from the schedule method. Is this intended for future use or should it be removed?

Collaborator Author

Over-allocating the KV cache might be useful for future models.
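
For illustration only, a minimal sketch of what pre-allocation at prefill time could mean for KV-cache block accounting; `num_prefill_blocks` is a hypothetical helper written for this example, not part of the lmdeploy scheduler:

```python
import math


def num_prefill_blocks(prompt_len: int, prealloc_size: int, block_size: int) -> int:
    """Hypothetical helper: KV-cache blocks to reserve at prefill time,
    covering the prompt plus `prealloc_size` pre-allocated future tokens."""
    return math.ceil((prompt_len + prealloc_size) / block_size)


# With 64-token blocks, a 62-token prompt fits in one block, but pre-allocating
# 4 extra token slots (e.g. one dllm block) reserves a second block up front.
assert num_prefill_blocks(62, 0, 64) == 1
assert num_prefill_blocks(62, 4, 64) == 2
```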

@lvhan028

Does SDAR conflict with `--quant-policy fp8`? When I tried it, the model repeated meaningless words.
FYI, supporting `--quant-policy` for this model is not required in this PR.

@lvhan028

@zhulinJulia24 may add JetLM/SDAR-8B-Chat and JetLM/SDAR-30B-A3B-Chat to the CI.

```python
    1,
    num_tokens,
), dtype=torch.long, device=device)
seq_length = torch.ones((batch_size, ), dtype=torch.long, device=device)
```

Collaborator

`seq_length` should be all `max_q_seqlen`, not all 1.
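
A minimal sketch of the suggested fix, assuming `batch_size`, `max_q_seqlen`, and `device` are already in scope at that point in the code (placeholder values are used here so the snippet runs on its own):

```python
import torch

batch_size, max_q_seqlen, device = 2, 8, 'cpu'  # placeholder values for illustration

# Every sequence reports the full query length instead of 1.
seq_length = torch.full((batch_size, ), max_q_seqlen, dtype=torch.long, device=device)
print(seq_length)  # tensor([8, 8])
```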
