Description
Docker Container Version/Tag: r23.07-tf-2.12.0-onednn-acl

ARM System: Graviton 3 (c7g.8xlarge)

```
Architecture:         aarch64
CPU(s):               32
On-line CPU(s) list:  0-31
Vendor ID:            ARM
Model:                1
Thread(s) per core:   1
Core(s) per socket:   32
Caches (sum of all):
  L1d:                2 MiB (32 instances)
  L1i:                2 MiB (32 instances)
  L2:                 32 MiB (32 instances)
  L3:                 32 MiB (1 instance)
```
Intel System: Icelake (c6i.8xlarge)

```
Architecture:            x86_64
CPU(s):                  32
On-line CPU(s) list:     0-31
Vendor ID:               GenuineIntel
Model name:              Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
CPU family:              6
Model:                   106
Thread(s) per core:      2
Core(s) per socket:      16
Socket(s):               1
Stepping:                6
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   768 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    20 MiB (16 instances)
  L3:                    54 MiB (1 instance)
```
As per this blog, ACL inference should be faster on Graviton than on Intel systems for transformer models.
We ran the TensorFlow Hugging Face BERT model for inference (Python code attached as a txt file here):
TF_bert_inf - Copy.txt
Below are the inference times in seconds:
| Env Variables | Graviton | Icelake |
|---|---|---|
| No Opts | 0.2294 | 0.145099 |
| TF_ENABLE_ONEDNN_OPTS=1 | 0.2191 | 0.144636 |
| ONEDNN_DEFAULT_FPMATH_MODE=BF16 | 1.49034 | 0.145511 |
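For reference, a minimal sketch of how such timings can be collected (the attached script is not reproduced here; the function name, warmup, and iteration counts are assumptions):

```python
import os
import statistics
import time

# oneDNN env vars must be set before TensorFlow is imported, otherwise they
# are ignored (assumption: the attached script toggles them the same way).
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "1")
# os.environ["ONEDNN_DEFAULT_FPMATH_MODE"] = "BF16"  # the setting that regresses here

def median_latency(fn, warmup=5, iters=30):
    """Median wall-clock latency of fn() in seconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

In the actual benchmark, `fn` would wrap the BERT forward pass on a fixed encoded input; the warmup and iteration counts here are placeholders.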
From the results above, we see that performance on the ARM cores is roughly 1.5x worse than on the Intel ones (0.219 s vs. 0.145 s with oneDNN opts enabled).
The code is run on 2 cores on both the Intel and ARM systems.
Another issue is that setting FPMATH mode to BF16 degrades performance further, to roughly 10x slower than Intel.
From the oneDNN logs, we see that when BF16 is enabled, there is overhead while executing reorder on the ARM cores:
| Env Variables | Reorder Time (msecs) |
|---|---|
| TF_ENABLE_ONEDNN_OPTS=1 | 0.582031 |
| ONEDNN_DEFAULT_FPMATH_MODE=BF16 | 11.1628 |
This is observed only for larger-sized MatMul operations. Here the size was 768x768, and the reorder uses the "simple:any" implementation instead of "jit:uni" in oneDNN.
Attaching oneDNN verbose logs for both scenarios:
Bert_TF12_issue_verbose_BF16.txt
Bert_TF12_issue_verbose_OPTS.txt
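The per-primitive times in the table above can be aggregated from these logs with a short script along these lines (a sketch, assuming the standard comma-separated `ONEDNN_VERBOSE=1` format used by the oneDNN version in TF 2.12, where `exec` records end with the elapsed time in milliseconds):

```python
import csv
from collections import defaultdict

def exec_time_by_primitive(verbose_path):
    """Sum execution times (ms) per primitive kind from an ONEDNN_VERBOSE=1
    log. Skips non-'exec' records such as version/info lines."""
    totals = defaultdict(float)
    with open(verbose_path) as f:
        for row in csv.reader(f):
            if len(row) > 4 and row[0] == "onednn_verbose" and row[1] == "exec":
                try:
                    totals[row[3]] += float(row[-1])  # row[3] is e.g. 'reorder'
                except ValueError:
                    pass  # ignore records without a trailing time field
    return dict(totals)
```

Running this over the two attached logs is how the reorder totals above were compared.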
We would appreciate your views and comments on whether any other settings are needed to improve performance.