Poor thread scaling when constructing instances or accessing attributes #139103

@JukkaL

Description

Bug report

Bug description:

When constructing dataclass or NamedTuple instances on multiple threads (on a free-threaded build), or when accessing enum class attributes, performance doesn't scale with the number of threads.
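Since the benchmarks below only demonstrate the scaling problem on a free-threaded interpreter, it may be worth confirming the build and runtime GIL state first. A minimal check, using `sysconfig`'s `Py_GIL_DISABLED` config variable and `sys._is_gil_enabled()` (available on CPython 3.13+; the `getattr` fallback covers older versions):

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 when the interpreter was compiled for free threading.
is_ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Even on a free-threaded build the GIL can be re-enabled at runtime
# (e.g. via PYTHON_GIL=1), so also check the runtime state when available.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"free-threaded build: {is_ft_build}, GIL enabled at runtime: {gil_enabled}")
```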

Regular class example (scales well):

# b_regular_class.py
from threading import Thread
from time import time

class Foo:
    def __init__(self, x):
        self.x = x

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo(x=1)

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Dataclass example (doesn't scale well):

# b_dataclass.py
from threading import Thread
from dataclasses import dataclass
from time import time

@dataclass
class Foo:
    x: int

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo(x=1)

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Named tuple example (doesn't scale well):

# b_namedtuple.py
from threading import Thread
from typing import NamedTuple
from time import time

class Foo(NamedTuple):
    x: int

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo(x=1)

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Enum example (doesn't scale well):

# b_enum.py
from threading import Thread
from time import time
from enum import Enum

class Foo(Enum):
    X = 1
    Y = 2

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo.X
        Foo.Y.value

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Results on recent main branch (running on an EC2 instance):

(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_regular_class.py
nth=1 1.1085155010223389
nth=4 0.2796591520309448
(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_dataclass.py
nth=1 1.1910037994384766
nth=4 1.0931583642959595
(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_namedtuple.py
nth=1 1.5688557624816895
nth=4 2.0257126092910767
(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_enum.py
nth=1 0.9439797401428223
nth=4 2.272495985031128

The expected behavior is that with 4 threads (nth=4), the elapsed time per benchmark iteration (the second printed value) drops significantly compared to a single thread (nth=1). This happens for the first benchmark (b_regular_class.py) but not for the others.
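To put a number on this: the scripts print elapsed/nth, and each thread runs the full niter loop, so with perfect scaling the nth=4 value would be one quarter of the nth=1 value. A small sketch computing parallel efficiency (1.0 = perfect scaling) from the timings reported above:

```python
# Timings copied from the benchmark output in this report:
# (nth=1 printed value, nth=4 printed value) per benchmark.
results = {
    "regular_class": (1.1085, 0.2797),
    "dataclass":     (1.1910, 1.0932),
    "namedtuple":    (1.5689, 2.0257),
    "enum":          (0.9440, 2.2725),
}

for name, (t1, t4) in results.items():
    # Perfect scaling on 4 threads means t4 == t1 / 4, i.e. efficiency == 1.0.
    efficiency = t1 / (4 * t4)
    print(f"{name}: efficiency {efficiency:.2f}")
```

By this measure only b_regular_class.py scales (efficiency near 1.0); the dataclass, NamedTuple, and enum benchmarks all come in below 0.3.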

cc @colesbury (we discussed this in person at the CPython Core Dev Sprint)

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux
