Poor thread scaling when constructing instances or accessing attributes #139103

@JukkaL

Description

Bug report

Bug description:

When constructing dataclass or NamedTuple instances on multiple threads (on a free-threaded build), or when accessing enum class attributes, performance doesn't scale with the number of threads.
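Since the benchmarks below only demonstrate the scaling problem on a free-threaded interpreter, it may be worth confirming the build and runtime GIL state first. A minimal check, using `sysconfig`'s `Py_GIL_DISABLED` config variable and `sys._is_gil_enabled()` (available on CPython 3.13+; the `getattr` fallback covers older versions):

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 when the interpreter was compiled for free threading.
is_ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Even on a free-threaded build the GIL can be re-enabled at runtime
# (e.g. via PYTHON_GIL=1), so also check the runtime state when available.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"free-threaded build: {is_ft_build}, GIL enabled at runtime: {gil_enabled}")
```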

Regular class example (scales well):

# b_regular_class.py
from threading import Thread
from time import time

class Foo:
    def __init__(self, x):
        self.x = x

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo(x=1)

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Dataclass example (doesn't scale well):

# b_dataclass.py
from threading import Thread
from dataclasses import dataclass
from time import time

@dataclass
class Foo:
    x: int

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo(x=1)

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Named tuple example (doesn't scale well):

# b_namedtuple.py
from threading import Thread
from typing import NamedTuple
from time import time

class Foo(NamedTuple):
    x: int

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo(x=1)

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Enum example (doesn't scale well):

# b_enum.py
from threading import Thread
from time import time
from enum import Enum

class Foo(Enum):
    X = 1
    Y = 2

niter = 5 * 1000 * 1000

def benchmark(n):
    for i in range(n):
        Foo.X
        Foo.Y.value

for nth in (1, 4):
    t0 = time()
    threads = [Thread(target=benchmark, args=(niter,)) for _ in range(nth)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{nth=} {(time() - t0) / nth}")

Results on recent main branch (running on an EC2 instance):

(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_regular_class.py
nth=1 1.1085155010223389
nth=4 0.2796591520309448
(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_dataclass.py
nth=1 1.1910037994384766
nth=4 1.0931583642959595
(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_namedtuple.py
nth=1 1.5688557624816895
nth=4 2.0257126092910767
(cpython-dev) jukka@jukka-coder-dbx free-threading-benchmarks $ py b_enum.py
nth=1 0.9439797401428223
nth=4 2.272495985031128

The expected behavior is that with 4 threads (nth=4), the elapsed time per benchmark iteration (the second printed value) drops significantly compared to a single thread (nth=1). This happens for the first benchmark (b_regular_class.py) but not for the others.
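To put a number on this: the scripts print elapsed/nth, and each thread runs the full niter loop, so with perfect scaling the nth=4 value would be one quarter of the nth=1 value. A small sketch computing parallel efficiency (1.0 = perfect scaling) from the timings reported above:

```python
# Timings copied from the benchmark output in this report:
# (nth=1 printed value, nth=4 printed value) per benchmark.
results = {
    "regular_class": (1.1085, 0.2797),
    "dataclass":     (1.1910, 1.0932),
    "namedtuple":    (1.5689, 2.0257),
    "enum":          (0.9440, 2.2725),
}

for name, (t1, t4) in results.items():
    # Perfect scaling on 4 threads means t4 == t1 / 4, i.e. efficiency == 1.0.
    efficiency = t1 / (4 * t4)
    print(f"{name}: efficiency {efficiency:.2f}")
```

By this measure only b_regular_class.py scales (efficiency near 1.0); the dataclass, NamedTuple, and enum benchmarks all come in below 0.3.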

cc @colesbury (we discussed this in person at the CPython Core Dev Sprint)

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux
