Description
Docker Container Version/Tag: r23.07-tf-2.12.0-onednn-acl

ARM System: Graviton 3 (c7g.8xlarge)

```
Architecture:         aarch64
CPU(s):               32
On-line CPU(s) list:  0-31
Vendor ID:            ARM
Model:                1
Thread(s) per core:   1
Core(s) per socket:   32
Caches (sum of all):
  L1d:                2 MiB (32 instances)
  L1i:                2 MiB (32 instances)
  L2:                 32 MiB (32 instances)
  L3:                 32 MiB (1 instance)
```
Intel System: Icelake (c6i.8xlarge)

```
Architecture:            x86_64
CPU(s):                  32
On-line CPU(s) list:     0-31
Vendor ID:               GenuineIntel
Model name:              Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
CPU family:              6
Model:                   106
Thread(s) per core:      2
Core(s) per socket:      16
Socket(s):               1
Stepping:                6
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   768 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    20 MiB (16 instances)
  L3:                    54 MiB (1 instance)
```
As per this blog, ACL inference should be faster on Graviton than on Intel systems for transformer models.
We ran the TensorFlow Hugging Face BERT model for inference (Python code attached as a txt file here):
TF_bert_inf - Copy.txt
Below are the inference times in seconds:
| Env Variables | Graviton | Icelake |
|---|---|---|
| No Opts | 0.2294 | 0.145099 |
| TF_ENABLE_ONEDNN_OPTS=1 | 0.2191 | 0.144636 |
| ONEDNN_DEFAULT_FPMATH_MODE=BF16 | 1.49034 | 0.145511 |
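For reference, a minimal sketch of how such timings can be collected (the attached script is not reproduced here; the function name, warmup, and iteration counts are assumptions):

```python
import os
import statistics
import time

# oneDNN env vars must be set before TensorFlow is imported, otherwise they
# are ignored (assumption: the attached script toggles them the same way).
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "1")
# os.environ["ONEDNN_DEFAULT_FPMATH_MODE"] = "BF16"  # the setting that regresses here

def median_latency(fn, warmup=5, iters=30):
    """Median wall-clock latency of fn() in seconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

In the actual benchmark, `fn` would wrap the BERT forward pass on a fixed encoded input; the warmup and iteration counts here are placeholders.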
From the results above, we see that performance on the ARM cores is roughly 1.5x worse than on the Intel ones (0.219 s vs. 0.145 s with oneDNN opts enabled).
The code is run on 2 cores on both the Intel and ARM systems.
Another issue is that setting FPMATH mode to BF16 degrades performance further, to roughly 10x slower than Intel.
From the oneDNN logs, we see that when BF16 is enabled, there is overhead while executing reorder on the ARM cores:
| Env Variables | Reorder Time (msecs) |
|---|---|
| TF_ENABLE_ONEDNN_OPTS=1 | 0.582031 |
| ONEDNN_DEFAULT_FPMATH_MODE=BF16 | 11.1628 |
This is observed only for larger-sized MatMul operations. Here the size was 768x768, and the reorder uses the "simple:any" implementation instead of "jit:uni" in oneDNN.
Attaching oneDNN verbose logs for both scenarios:
Bert_TF12_issue_verbose_BF16.txt
Bert_TF12_issue_verbose_OPTS.txt
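The per-primitive times in the table above can be aggregated from these logs with a short script along these lines (a sketch, assuming the standard comma-separated `ONEDNN_VERBOSE=1` format used by the oneDNN version in TF 2.12, where `exec` records end with the elapsed time in milliseconds):

```python
import csv
from collections import defaultdict

def exec_time_by_primitive(verbose_path):
    """Sum execution times (ms) per primitive kind from an ONEDNN_VERBOSE=1
    log. Skips non-'exec' records such as version/info lines."""
    totals = defaultdict(float)
    with open(verbose_path) as f:
        for row in csv.reader(f):
            if len(row) > 4 and row[0] == "onednn_verbose" and row[1] == "exec":
                try:
                    totals[row[3]] += float(row[-1])  # row[3] is e.g. 'reorder'
                except ValueError:
                    pass  # ignore records without a trailing time field
    return dict(totals)
```

Running this over the two attached logs is how the reorder totals above were compared.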
We would appreciate your views and comments on whether any other settings are needed to improve performance.