Skip to content

Conversation

Kamilake
Copy link

Pull Request Etiquette

Changes

  • Internal code
  • Library interface (affecting end-user code)
  • Documentation
  • Other: _____

Closes Issue: None reported yet, possibly due to the rarity and specific conditions of occurrence.

Description

This PR adds an explicit null check before calling Opus.INSTANCE.opus_encode() to prevent potential SIGSEGV crashes caused by a null opusEncoder reference.

Why is this needed?

Our production Discord bot, which serves over 1.3 million users, has experienced occasional JVM crashes (SIGSEGV in native opus code), approximately once per month. After analyzing JVM crash logs (hs_err_1.log), we identified that the cause was a null pointer (RDI=0x0) being passed to the native opus_encode() method via JNA.

The stack trace clearly showed:


Problematic frame:
C \[jna\*.tmp+0x35582] opus\_encode+0x22

This issue happens despite the existing null checks at higher abstraction levels, suggesting a potential race condition scenario—possibly due to concurrent audio connection closure and encoding operations.

How this change helps

By adding this explicit null check:

if (opusEncoder == null || opusEncoder.getValue() == null)
{
    throw new IllegalStateException("Opus encoder is not initialized! This should never happen, please report this as a bug.");
}

we ensure:

  • JVM safety: We completely eliminate the risk of JVM crashes (SIGSEGV) due to invalid native calls.
  • Easier debugging: If such a situation arises, developers will receive a clear and informative exception instead of an opaque native crash.

Expected scenarios where this exception might be thrown:

  • An unexpected encoder initialization failure (opus_encoder_create() returning null due to unsupported sample rates or invalid arguments).
  • A rare race condition between encoding and closing the audio connection.

Testing & Reproducibility

  • Due to the low frequency (~once a month), reproducing this issue is difficult. We will deploy this fix into our large-scale production environment and monitor for at least 2 months to confirm resolution.
  • This PR is proactively submitted because of the seriousness of the issue (SIGSEGV JVM crash).

Additional context

  • Our Discord bot extensively utilizes TTS functionality; stable audio encoding is mission-critical.
  • To date, we've experienced approximately 10 occurrences of this issue.

Note

  • The proposed solution is very minimal and safe, adding negligible runtime overhead.
  • The code was written by me, but the English explanation was translated and refined with the assistance of ChatGPT.

@freya022
Copy link
Contributor

I think this change only reduces the time window in which the SIGSEGV can happen, this PR should focus on fixing the actual race condition, not changing the crash for an exception, not to mention that a crash can still happen, though it should be even more rare now.

@freya022
Copy link
Contributor

An unexpected encoder initialization failure (opus_encoder_create() returning null due to unsupported sample rates or invalid arguments).

I'm not sure how this would cause an issue, the status code of opus_encoder_create is checked, and returns if it isn't OK, encodeToOpus should thus not be called when creating the encoder fails. The only thing that looks weird to me is the additional opusEncoder == null check in

if (error.get() != Opus.OPUS_OK && opusEncoder == null)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants