Conversation

@xhcao
Contributor

@xhcao xhcao commented Nov 14, 2025

To use transpose-shared instead of transpose-naive, we can split the transpose with perm {2,3,1,0} into two steps, which benefits the Conv operator.
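As a rough illustration of why the split is valid, the sketch below checks in plain Python that a perm {2,3,1,0} transpose of an `[A, B, C, D]` tensor equals two simpler transposes, each of which only swaps two (merged) axes. The decomposition shown (2-D swap of `[A*B, C*D]`, then a batched swap of the last two axes) is one possible choice and an assumption here, not necessarily the exact split this PR implements. The helper names `transpose` and `transpose_2310_two_step` are hypothetical.

```python
from itertools import product


def transpose(data, shape, perm):
    """Naive N-D transpose of a flat row-major list; returns (new_data, new_shape)."""
    out_shape = [shape[p] for p in perm]
    # Row-major strides of the input.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    out = []
    # product() walks output indices in row-major order.
    for idx in product(*(range(n) for n in out_shape)):
        # Output axis k reads from input axis perm[k].
        src = sum(idx[k] * strides[perm[k]] for k in range(len(perm)))
        out.append(data[src])
    return out, out_shape


def transpose_2310_two_step(data, shape):
    """Hypothetical two-step equivalent of perm {2,3,1,0} (assumed decomposition)."""
    a, b, c, d = shape
    # Step 1: view [A, B, C, D] as 2-D [A*B, C*D] and swap the axes -> [C*D, A*B].
    step1, _ = transpose(data, [a * b, c * d], [1, 0])
    # Step 2: view as [C*D, A, B] and swap the last two axes -> [C*D, B, A].
    step2, _ = transpose(step1, [c * d, a, b], [0, 2, 1])
    return step2, [c, d, b, a]


shape = [2, 3, 4, 5]
data = list(range(2 * 3 * 4 * 5))
direct, direct_shape = transpose(data, shape, [2, 3, 1, 0])
split, split_shape = transpose_2310_two_step(data, shape)
assert direct == split and direct_shape == split_shape
```

Each step here is a plain or batched 2-D transpose, which is presumably the pattern a tiled shared-memory kernel can handle, whereas the single 4-D permutation is not.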

Description

Motivation and Context

To use transpose-shared instead of transpose-naive,
we can split the transpose with perm {2,3,1,0} into two steps, which
benefits the Conv operator.
@xhcao
Contributor Author

xhcao commented Nov 14, 2025

This PR improves performance on the sdunet-v1.5-demo-layernorm model: total Conv|Transpose time drops from 224 ms to 135 ms, a reduction of roughly 40%.

@xhcao
Contributor Author

xhcao commented Nov 14, 2025

@jchen10 @daijh PTAL

@jchen10
Contributor

jchen10 commented Nov 14, 2025

Looks great. As we discussed in #26554 (comment), we are going to cache the transposed kernel, so this PR could be less beneficial for Conv|Transpose. Maybe we could find another place to apply this optimization later.
