
[Bug] Segfault in bytes::bytes_mut::shared_v_drop running workflows under pytest on Alpine Linux #1015

@superimposition

Description


What are you really trying to do?

I'm running unit tests of my workflow and activities in a CI environment. I'm using the local dev server + mocked activities pattern to do so.

I figured this would be an interesting bug report, even though it's super weird and probably not immediately helpful for anyone except me.

Describe the bug

temporal_sdk_bridge.abi3.so is segfaulting in tokio's stream reading. This seems to occur on the first unit test that would use a Temporal client. Oddly enough, it doesn't always happen: by our math it occurs in 8-12% of test runs, but seems to happen more when the runner nodes are busier.

I suspect I'm doing something wrong and opening myself up to some sort of thread-safety issue à la the warning in https://python.temporal.io/temporalio.client.Client.html, but as far as I can tell everything is correct and the event loop is pinned to the session properly.
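As background on that warning: the loop-affinity hazard is easy to demonstrate with the standard library alone. This sketch (stdlib only, nothing Temporal-specific) shows a Future created on one event loop failing when awaited from a task on another, which is the failure mode the client docs warn about:

```python
import asyncio

# Stdlib-only illustration of the loop-affinity hazard the Temporal client
# docs warn about: a Future created on one event loop cannot be awaited
# from a task running on a different loop.
loop_a = asyncio.new_event_loop()
fut = loop_a.create_future()  # bound to loop_a at creation

caught = False

async def misuse():
    # This coroutine runs as a task on loop_b, awaiting a loop_a future.
    await fut

loop_b = asyncio.new_event_loop()
try:
    loop_b.run_until_complete(misuse())
except RuntimeError as exc:
    # asyncio raises "... attached to a different loop" here
    caught = True
    print("cross-loop await failed:", exc)

loop_a.close()
loop_b.close()
```

When the session-scoped loop is pinned correctly, you'd expect a loud RuntimeError like this rather than memory corruption, which is part of why the segfault is so surprising.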

Here's a backtrace of the segfault:

Core was generated by `python3 -m pytest --cov=netfs --cov-report=term-missing --cov-report=xml:report'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __restore_sigs (set=set@entry=0x7f055fee8520) at ./arch/x86_64/syscall_arch.h:40

warning: 40	./arch/x86_64/syscall_arch.h: No such file or directory
[Current thread is 1 (LWP 134)]
(gdb) bt
#0  __restore_sigs (set=set@entry=0x7f055fee8520) at ./arch/x86_64/syscall_arch.h:40
#1  0x00007f056d326e1b in raise (sig=<optimized out>) at src/signal/raise.c:11
#2  <signal handler called>
#3  get_meta (p=p@entry=0x7f05610f2e40 "") at src/malloc/mallocng/meta.h:141
#4  0x00007f056d30eb9f in __libc_free (p=0x7f05610f2e40) at src/malloc/mallocng/free.c:105
#5  0x00007f0566d3debf in bytes::bytes_mut::shared_v_drop () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#6  0x00007f0566ceea06 in <tokio_util::io::stream_reader::StreamReader<S,B> as tokio::io::async_read::AsyncRead>::poll_read ()
   from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#7  0x00007f056767ccc4 in <tokio_util::io::sync_bridge::SyncIoBridge<T> as std::io::Read>::read ()
   from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#8  0x00007f0567865708 in <flate2::gz::read::GzDecoder<R> as std::io::Read>::read () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#9  0x00007f056715a587 in <tar::entry::EntryFields as std::io::Read>::read () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#10 0x00007f056786ea9c in temporal_sdk_core::ephemeral_server::download_and_extract::{{closure}}::{{closure}} ()
   from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#11 0x00007f056769018e in tokio::runtime::task::raw::poll () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#12 0x00007f0567c6e6a2 in std::sys::backtrace::__rust_begin_short_backtrace () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#13 0x00007f0567c6f453 in core::ops::function::FnOnce::call_once{{vtable.shim}} () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#14 0x00007f05671546db in std::sys::pal::unix::thread::Thread::new::thread_start () from /root/.cache/pypoetry/virtualenvs/netfs-JYslTpQ3-py3.12/lib/python3.12/site-packages/temporalio/bridge/temporal_sdk_bridge.abi3.so
#15 0x00007f056d32f9d2 in start (p=0x7f055fefb490) at src/thread/pthread_create.c:207
#16 0x00007f056d331314 in __clone () at src/thread/x86_64/clone.s:22

At the time of failure there are 4 threads; all but one are running temporal_sdk_bridge.abi3.so, the odd one out being Python.
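For anyone triaging from the core, dumping every thread's stack is probably worthwhile, since a double free usually involves a second thread racing on the same buffer; the other bridge threads' backtraces may show who else was touching the BytesMut:

```
(gdb) info threads
(gdb) thread apply all bt
```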

Minimal Reproduction

So I'm actually unsure how to reproduce this. I have a very hard time doing so locally; I can only reproduce it reliably in a CI environment on a node with 1 CPU and ~256M - 4G RAM max. Unfortunately, I can't immediately share the code that reproduces it, but I can try to bring the complexity down a bit and see if that helps.

For simplicity, here's a test definition and conftest parameters that seem to reliably reproduce it - we see that the first test that would talk to the Temporal dev server fails. Prior tests, which mock the Temporal workflow executor itself, run and pass.

---- CONFTEST.PY SNIPPET

import asyncio
from typing import AsyncGenerator

import pytest
from temporalio.client import Client
from temporalio.testing import WorkflowEnvironment

@pytest.fixture(scope="session")
def anyio_backend():
    """Set the default anyio backend to just asyncio.

    See Also:
        https://anyio.readthedocs.io/en/stable/testing.html
    """
    return "asyncio"


@pytest.fixture(scope="session")
def event_loop():
    """Create an event loop for the session scope."""
    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()

@pytest.fixture(scope="session")
async def env(
    # request
) -> AsyncGenerator[WorkflowEnvironment, None]:
    """Create a Temporal workflow environment for testing."""
    ## Example for running in different environments
    ## taken from https://github.com/temporalio/samples-python/blob/main/tests/conftest.py
    # env_type = request.config.getoption("--workflow-environment")
    # if env_type == "ci":
    #     env = WorkflowEnvironment.from_client(await Client.connect(env_type))
    # elif env_type == "time-skipping":
    #     env = await WorkflowEnvironment.start_time_skipping()
    # else:
    env = await WorkflowEnvironment.start_local(
        dev_server_extra_args=[
            "--dynamic-config-value",
            "frontend.enableExecuteMultiOperation=true",
        ]
    )
    yield env
    await env.shutdown()


@pytest.fixture
async def client(env: WorkflowEnvironment) -> Client:
    """Create a Temporal client for testing."""
    return env.client
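One thing worth noting from the backtrace: the fault is inside temporal_sdk_core's download_and_extract, i.e. while start_local() is downloading and un-tarring the dev server binary. A possible workaround is to pre-install the Temporal CLI in the CI image and point the fixture at it, skipping the in-process download path entirely. This is a stdlib-only sketch; dev_server_existing_path is my reading of the WorkflowEnvironment.start_local signature, so double-check it against your SDK version:

```python
import shutil

# Sketch: build the kwargs for WorkflowEnvironment.start_local() so that a
# pre-installed `temporal` CLI binary is used when present, skipping the
# in-process download + gzip/tar extraction where the backtrace faults.
# NOTE: dev_server_existing_path is assumed from the SDK docs; verify it.
start_local_kwargs: dict = {
    "dev_server_extra_args": [
        "--dynamic-config-value",
        "frontend.enableExecuteMultiOperation=true",
    ],
}

cli_path = shutil.which("temporal")  # path to a pre-installed CLI, if any
if cli_path is not None:
    start_local_kwargs["dev_server_existing_path"] = cli_path

# In the fixture body:
#     env = await WorkflowEnvironment.start_local(**start_local_kwargs)
print(sorted(start_local_kwargs))
```

Even if it doesn't fix the root cause, it would at least tell us whether the corruption is confined to the download/extract path.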



---- WORKFLOW TEST SNIPPET

import uuid
from unittest.mock import AsyncMock, patch

import pytest
from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker

# Mocked activities for testing
@activity.defn(name="activity_1")
async def mocked_activity_1() -> dict[str, str]:
    return {}


@activity.defn(name="activity_2")
async def mocked_activity_2(generic_input: InputDataclass) -> dict[str, str]:
    return {"foo": "bar", "bizz": "buzz"}


@activity.defn(name="activity_3")
async def mocked_activity_3(generic_input: InputDataclass) -> dict[str, str]:
    return {"foo": "bar", "bizz": "buzz"}


### A bunch of other tests happen before using the temporal server for the first time.


@pytest.mark.anyio
async def test_workflow1():
    """
        This workflow just runs and returns the activity
    """
    task_queue_name = str(uuid.uuid4())
    id_name = str(uuid.uuid4())
    with patch(
        "temporalio.client.Client.execute_workflow", new_callable=AsyncMock
    ) as mock_execute_workflow:
        mock_execute_workflow.return_value = {"foo": "bar", "bizz": "buzz"}
        # Note: this awaits the mock directly, so no real client is exercised
        res = await mock_execute_workflow(
            activity_2.run,
            GenericInputs(
                foo="bar", bizz="buzz"
            ),
            id=id_name,
            task_queue=task_queue_name,
        )
        assert res == {"foo": "bar", "bizz": "buzz"}

@pytest.mark.anyio
async def test_workflow_2(client: Client):
    task_queue_name = str(uuid.uuid4())
    id_name = str(uuid.uuid4())

    async with Worker(
        client,
        task_queue=task_queue_name,
        workflows=[workflow_2],
        activities=[mocked_activity_1, mocked_activity_2, mocked_activity_3],
    ):
        # TODO: Fix test case - illegal inputs
        res = await client.execute_workflow(
            workflow_2.run,
            GenericInputs(
                foo="bar", bizz="buzz"
            ),
            id=id_name,
            task_queue=task_queue_name,
        )
        assert "foo" in res

Environment/Versions

  • OS and processor: Alpine Linux, Intel Cascade Lake Xeon
  • Temporal SDK version: 1.10.0
  • Building from source on Alpine Linux

Additional context

Trying to reproduce on Debian Trixie has been very difficult - side by side, I ran the same binary compiled on Alpine in a Trixie container and an Alpine container, and over ~20 runs the Alpine container failed 9 times while the Trixie container didn't fail at all. It seems like this must have something to do with musl's implementation of pthreads or its mallocng allocator, but I can't quite tell what. I checked my build; it used the *-unknown-linux-musl target, which may be part of the problem.
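To put numbers like "9 of ~20 runs" on firmer footing, a small harness that reruns the suite and counts non-zero exits helps. REPRO_CMD below defaults to a placeholder (`false`) purely so the sketch is runnable; substitute the real pytest invocation:

```shell
#!/bin/sh
# Rerun a flaky command N times and count failures (non-zero exit codes).
# REPRO_CMD is a placeholder; set it to the real invocation, e.g.
#   REPRO_CMD="python3 -m pytest -q" sh repro.sh
runs=5
fails=0
i=1
while [ "$i" -le "$runs" ]; do
  # Discard output; we only care about the exit status of each run.
  if ! ${REPRO_CMD:-false} >/dev/null 2>&1; then
    fails=$((fails + 1))
  fi
  i=$((i + 1))
done
echo "failed $fails of $runs runs"
```

Running this in both containers gives a comparable failure rate per image rather than anecdotal counts.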

If you're curious about the segfault itself, it faulted trying to access 0x7f in get_meta, which I think means we double-freed. I am entirely uncertain how that could even happen.

Anyhow, let me know if this garners any further curiosity! I'd be happy to see if I can get to a really minimal reproducer.
