I am seeing a major performance difference between an NVIDIA Jetson Orin (ARM64) and a modern x86-64 system: the same code, with the same input data, runs slower on the ARM platform. The difference comes mainly from the decoding part of t2s.infer_panel_batch_infer.
I am seeking guidance on ARM-specific optimizations for this code section.
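
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the per-stage timing could be collected on both machines. It uses only the standard library. The names `timed`, `timings`, `summarize`, and `fake_decode_step` are hypothetical and not part of the project; `fake_decode_step` is a stand-in for one decoding iteration, and the "prefill" / "decode_step" labels are only illustrative. On a GPU, one would presumably pass `torch.cuda.synchronize` as `sync` so that asynchronously queued kernels are included in the measurement.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed wall-clock seconds, one entry per timed block
timings = defaultdict(list)


@contextmanager
def timed(label, sync=None):
    """Record the wall-clock time of the enclosed block under `label`.

    `sync` is an optional zero-argument callable invoked before the timer
    starts and before it stops (assumption: on CUDA this would be
    torch.cuda.synchronize, so pending kernels are counted).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        timings[label].append(time.perf_counter() - start)


def fake_decode_step(n=10_000):
    # Hypothetical stand-in for one autoregressive decoding step.
    return sum(i * i for i in range(n))


def summarize(recorded):
    """Return {label: (call_count, total_seconds, mean_seconds)}."""
    return {
        label: (len(vals), sum(vals), sum(vals) / len(vals))
        for label, vals in recorded.items()
    }


# Time the one-off setup separately from the repeated decoding steps.
with timed("prefill"):
    fake_decode_step(50_000)

for _ in range(20):
    with timed("decode_step"):
        fake_decode_step()

for label, (count, total, mean) in summarize(timings).items():
    print(f"{label}: calls={count} total={total:.6f}s mean={mean:.6f}s")
```

Running the same script on the Jetson Orin and on the x86-64 machine, with the stand-in replaced by the real decoding call, should show whether the gap is spread evenly across steps or concentrated in a few of them.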