
Conversation

@bdshadow commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • Added admin endpoints to inspect project batch locks and the batch job queue.
    • Introduced configurable per-project batch concurrency (default 1).
    • Added optional JSON logging profile alongside standard console logging.
  • Improvements

    • Optimized batch queue ordering and limited fetched items for better performance.
    • Adjusted retry strategy for LLM operations to improve resilience.
  • Tests

    • Added coverage for admin endpoints, including authorization and queue/lock operations.
  • Chores

    • Added a manual Release workflow to build, package, and publish Docker images.

ajeans and others added 10 commits September 30, 2025 11:56
[…] each item will be entirely inefficient.

Improve the situation with 2 changes
* Hardcode a max of 1_000 chunks to be added to the internal queue
* Prioritize RUNNING job chunks in queue population

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
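For illustration, a hedged sketch of what the capped, RUNNING-first fetch could look like (the entity, field, and enum names are assumptions based on this PR's walkthrough, not the actual query):

import jakarta.persistence.EntityManager

// Sketch only: cap the fetch at 1_000 items and surface chunks of RUNNING jobs first.
fun fetchQueueCandidates(entityManager: EntityManager): List<BatchJobChunkExecution> =
  entityManager
    .createQuery(
      """
      select e from BatchJobChunkExecution e
        join e.batchJob j
      where e.status = io.tolgee.model.batch.BatchJobChunkExecutionStatus.PENDING
      order by
        case when j.status = io.tolgee.model.batch.BatchJobStatus.RUNNING then 0 else 1 end,
        e.createdAt, e.executeAfter, e.id
      """,
      BatchJobChunkExecution::class.java,
    )
    .setMaxResults(1_000) // hard cap so huge backlogs don't flood the in-memory queue
    .resultList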
Remove hardcoded 30-second timeout override
and use [60, 120] configuration.

This resolves the issue where OpenAI calls were timing out at 30s
on the same chunks over and over.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add logback-spring.xml with profile-based configuration supporting both
standard and JSON log formats. JSON logging enabled with 'json-logging'
profile for improved log analysis and monitoring capabilities.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add administration endpoints for viewing and managing project batch job locks:
- GET /v2/administration/project-batch-locks - retrieve all project locks
- GET /v2/administration/batch-job-queue - returns all chunk execution items

Features:
- Admin-only access with @RequiresSuperAuthentication
- Works with both Redis and local storage configurations
- Provides lock status (UNLOCKED/UNINITIALIZED/LOCKED) and job details
- Comprehensive test coverage
- Full OpenAPI documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The previous getInitialJobId() logic created phantom locks by assuming jobs
with fewer unlocked chunks were "running", when in reality they were just
PENDING jobs with chunks sitting in local queues but never executing.

1. populateQueue() adds chunks to local queues (counted as "locked")
2. getInitialJobId() sees locked chunks → assumes job is "running"
3. Creates phantom lock pointing to PENDING job
4. All subsequent jobs fail project lock check → returned to queue
5. No job ever reaches handleItem() → no job becomes RUNNING
6. Phantom lock persists indefinitely → system deadlocked

Now prioritizes database job status over chunk counting:
1. First: Look for jobs with status=RUNNING (truly active jobs)
2. Fallback: Use original chunk-counting logic for edge cases
3. If neither found: Return null (allow new job to acquire lock)

This prevents phantom locks from PENDING jobs while preserving protection
for legitimately running jobs. The fix should break the current deadlock
where 203k+ PENDING jobs are stuck due to phantom project locks.

Expected result: Batch jobs will start processing and populate LLM metrics.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
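A hedged sketch of the priority order described above (helper names are illustrative; getJobDto and getAllIncompleteJobIds appear in this PR's walkthrough, but their exact signatures are assumed):

// Sketch only: prefer database job status over chunk counting.
fun getInitialJobId(projectId: Long): Long? {
  val incompleteJobIds = batchJobService.getAllIncompleteJobIds(projectId) // assumed signature
  // 1) Prefer jobs the database already marks RUNNING: these are truly active.
  incompleteJobIds
    .firstOrNull { batchJobService.getJobDto(it).status == BatchJobStatus.RUNNING }
    ?.let { return it }
  // 2) Fall back to the original chunk-counting heuristic for edge cases.
  findJobIdByUnlockedChunkCount(projectId)?.let { return it } // hypothetical helper
  // 3) Neither found: return null so a new job may acquire the project lock.
  return null
}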
Target full public github action to keep the
repository a fork of tolgee/tolgee.
Replace old format Map<Long, Long> with new format Map<Long, Set<Long>>
to support configurable concurrent jobs per project. Migration automatically
runs on startup and performs one-time data conversion.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds a release GitHub Actions workflow, introduces admin endpoints to inspect/clear project batch locks and view the batch-job queue, refactors project lock management to support multiple concurrent locks per project with migration logic, adds a projectConcurrency property, adjusts queue ordering/limits, updates logging, adds tests, and tweaks OpenAI retry attempts.

Changes

  • Release workflow (.github/workflows/forky-release.yml)
    New manual "Release" workflow: sets up JDK 21 and Node 22, runs npm/Gradle tasks, builds and pushes a Docker image to ghcr.io using the releaseVersion input, and packs web resources.
  • Admin APIs for batch locks & queue (backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt)
    New controller with endpoints GET /v2/administration/project-batch-locks and GET /v2/administration/batch-job-queue. Adds models for lock status, job info, and queue items; guarded by super-auth; includes error handling.
  • Tests for admin APIs (backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt, backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt)
    New controller tests for authorized/unauthorized access, lock clearing/deleting, and queue retrieval. Updates the test util to check emptiness of locked project IDs.
  • Batch locking refactor, multi-lock per project (backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt, backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt)
    Refactors locks from a single Long to a Set<Long> per project; supports per-project concurrency via the new BatchProperties.projectConcurrency (see the configuration sketch after this list). Adds a Redis data migration in afterPropertiesSet and updates the APIs (getMap, getLockedForProject, getLockedJobIds) and the locking/unlocking logic.
  • Batch queue inspection and ordering (backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt)
    Adds getAllQueueItems(). Adjusts selection ordering (RUNNING first, then createdAt/executeAfter/id) and caps DB query results at 1000.
  • Batch launcher logging/comment (backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt)
    Suppresses a verbose trace log; updates a comment about project-level concurrency.
  • Scheduled job cleaner logging (backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt)
    Collects stuck job IDs before logging; logs the count and a per-job trace prior to removal.
  • Logging configuration (backend/app/src/main/resources/logback-spring.xml)
    Adds console and JSON console logging profiles with an INFO root level.
  • OpenAI retry tuning, EE (ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt)
    Changes default retry attempts from [30] to [60, 120].
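Since BatchProperties binds the tolgee.batch prefix, the new limit can be set like any other property. A hedged application.yaml sketch (the relaxed-binding key is derived from the projectConcurrency field; the value 2 is only an example):

tolgee:
  batch:
    concurrency: 1           # existing: parallel jobs per Tolgee instance
    project-concurrency: 2   # new: concurrent jobs per project across all instances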

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Admin
  participant API as ProjectBatchLockController
  participant LM as BatchJobProjectLockingManager
  participant BS as BatchJobService

  Admin->>API: GET /v2/administration/project-batch-locks
  API->>LM: getMap()
  LM-->>API: Map<ProjectId, Set<JobId>>
  loop For each locked JobId
    API->>BS: getJobDto(jobId)
    alt Success
      BS-->>API: JobInfo
    else Error
      BS-->>API: Error (handled)
    end
  end
  API-->>Admin: Collection<ProjectLockModel>
  note over API: lockStatus derived: LOCKED/UNLOCKED
sequenceDiagram
  autonumber
  actor Admin
  participant API as ProjectBatchLockController
  participant Q as BatchJobChunkExecutionQueue

  Admin->>API: GET /v2/administration/batch-job-queue
  API->>Q: getAllQueueItems()
  Q-->>API: List<ExecutionQueueItem>
  API-->>Admin: Collection<QueueItemModel>
sequenceDiagram
  autonumber
  participant Launcher
  participant LM as BatchJobProjectLockingManager
  participant Redis as Redis Map
  participant DB as Batch Jobs DB
  participant Props as BatchProperties

  Launcher->>LM: tryLock(projectId, jobId)
  LM->>Props: read projectConcurrency
  alt Redis enabled
    LM->>Redis: compute(projectId){ add jobId if set size < concurrency }
    Redis-->>LM: Updated Set<JobId>
  else Local map
    LM->>LM: update local Set<JobId> if size < concurrency
  end
  LM-->>Launcher: lock acquired? (boolean)
  opt First run / migration
    LM->>Redis: scan old Long values -> convert to Set<Long>
    Redis-->>LM: migrated entries
  end
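To make the third diagram concrete, a minimal local-map sketch of set-based locking under a per-project limit (a simplification: the real manager also covers the Redis path and migration):

import java.util.concurrent.ConcurrentHashMap

// Local (non-Redis) variant only; compute() makes the read-modify-write atomic in-process.
val localProjectLocks = ConcurrentHashMap<Long, Set<Long>>()

fun tryLockLocal(projectId: Long, jobId: Long, projectConcurrency: Int): Boolean {
  var acquired = false
  localProjectLocks.compute(projectId) { _, lockedJobIds ->
    val current = lockedJobIds ?: emptySet()
    when {
      jobId in current -> {
        acquired = true // this job already holds a lock
        current
      }
      current.size < projectConcurrency -> {
        acquired = true // room under the limit: add the job
        current + jobId
      }
      else -> current // at capacity: leave the set unchanged
    }
  }
  return acquired
}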

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • feat: more llm providers #3118 — Touches the same OpenAI service, adjusting response nullability and error handling; related to current retry/backoff changes.

Suggested labels

enhancement

Suggested reviewers

  • JanCizmar
  • Anty0

Poem

A whisk of paws, a queue in line,
Locks now count in sets—how fine!
Admin ears perk—new endpoints gleam,
JSON logs blink in a tidy stream.
With hops of builds to ghcr flow,
Retries stretch long—then off we go! 🐇🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 6.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check (❓ Inconclusive): the title "fix: batch job processing" is overly generic and does not capture the primary changes: refactoring the batch job locking manager to support concurrent locks, adding new administration endpoints, and updating the release workflow and logging configuration. A more descriptive title, for example "Refactor project-level batch job locking for concurrency and add admin inspection endpoints", would make the main objectives clear at a glance.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


@coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (3)

307-309: Fix assertion to new Set API.

getLockedForProject(...) now returns Set<Long>.

-  fun assertJobUnlocked() {
-    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(0L)
-  }
+  fun assertJobUnlocked() {
+    batchJobProjectLockingManager
+      .getLockedForProject(testData.projectBuilder.self.id)
+      .assert.isEmpty()
+  }

385-387: Fix verifyJobLocked to match Set.

Adjust to assert the set contains only the expected job id.

-  fun verifyJobLocked(job: BatchJob) {
-    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(job.id)
-  }
+  fun verifyJobLocked(job: BatchJob) {
+    batchJobProjectLockingManager
+      .getLockedForProject(testData.projectBuilder.self.id)
+      .assert.isEqualTo(setOf(job.id))
+  }

1-478: Update stale getLockedForProject assertions in BatchJobTestUtil

  • In assertJobUnlocked (line 308), replace
    batchJobProjectLockingManager.getLockedForProject(...).assert.isEqualTo(0L)
    with
    batchJobProjectLockingManager.getLockedForProject(...).assert.isEmpty()
  • In verifyJobLocked (line 386), replace
    batchJobProjectLockingManager.getLockedForProject(...).assert.isEqualTo(job.id)
    with
    batchJobProjectLockingManager.getLockedForProject(...).assert.contains(job.id)
🧹 Nitpick comments (8)
backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt (1)

48-53: Skip warn when no stuck jobs.

Avoid a warn for 0 items; early-return keeps logs clean.

   private fun handleStuckJobs() {
-    val stuckJobIds = batchJobService.getStuckJobIds(batchJobStateProvider.getCachedJobIds())
-    logger.warn("Removing stuck job state using scheduled task for ${stuckJobIds.size} jobs")
+    val stuckJobIds = batchJobService.getStuckJobIds(batchJobStateProvider.getCachedJobIds())
+    if (stuckJobIds.isEmpty()) return
+    logger.warn("Removing stuck job state using scheduled task for ${stuckJobIds.size} jobs")
     stuckJobIds.forEach {
       logger.trace("Removing stuck job state using scheduled task for job $it")
       batchJobStateProvider.removeJobState(it)
     }
   }
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)

175-193: Add jitter/exponential backoff when project lock unavailable.

A constant +1 s requeue can cause a synchronized thundering herd. Add jitter or exponential backoff tied to a retry counter to spread retries under contention.

Example adjustment:

  • Increment a retry counter on the item (reuse managementErrorRetrials or add a dedicated field).
  • Delay = base * 2^retries with cap, plus small random jitter.
val retries = (executionItem.managementErrorRetrials + 1).coerceAtMost(6)
val backoff = (1_000L shl (retries - 1)).coerceAtMost(10_000L)
val jitter = (200L..600L).random()
it.managementErrorRetrials = retries
it.executeAfter = currentDateProvider.date.time + backoff + jitter
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (1)

66-71: Queue ordering and limiting look good.

RUNNING-first + stable ordering and max 1000 improves fairness and latency.

Consider adding/confirming DB indexes to support this query efficiently:

  • On BatchJobChunkExecution(status, createdAt, executeAfter, id)
  • On BatchJob(status) for the join

Also applies to: 77-81

backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (1)

38-54: LGTM: lock overview endpoint.

Builds per-project models with defensive job info retrieval. Looks good.

Consider snapshotting the map first to avoid mid-iteration changes:

val locks = batchJobProjectLockingManager.getMap().toMap()
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1)

1-136: Optional: assert response content for stronger guarantees.

Add JSON assertions (e.g., size, IDs) to validate payload shape.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (3)

58-68: Tidy up duplicate unlock logs.

Same debug message logged twice when the job is present; keep one.

-    getMap().compute(projectId) { _, lockedJobIds ->
-      logger.debug("Unlocking job: $jobId for project $projectId")
+    getMap().compute(projectId) { _, lockedJobIds ->
       val currentJobs = lockedJobIds ?: emptySet()
       if (currentJobs.contains(jobId)) {
-        logger.debug("Unlocking job: $jobId for project $projectId")
+        logger.debug("Unlocking job: $jobId for project $projectId")
         val updatedJobs = currentJobs - jobId
         return@compute updatedJobs.ifEmpty { emptySet() }
       }

161-186: Reduce DB round-trips in getInitialJobId (optional).

You call getJobDto per job to check status. If possible, fetch statuses in one query or extend getAllIncompleteJobIds to include status to avoid N calls.
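A hedged sketch of the batched variant (findStatusesByIds is a hypothetical repository helper returning id/status pairs, not an existing method):

// Sketch only: resolve all candidate job statuses in one query instead of N getJobDto calls.
fun findRunningJobId(candidateJobIds: List<Long>): Long? {
  val statusById: Map<Long, BatchJobStatus> =
    batchJobRepository.findStatusesByIds(candidateJobIds) // hypothetical helper
      .associate { (id, status) -> id to status }
  return candidateJobIds.firstOrNull { statusById[it] == BatchJobStatus.RUNNING }
}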


192-226: Guard migration with a Redis lock to avoid multi-node races.

If multiple nodes start concurrently, they could both migrate. Use an RLock around the migration section.

val lock = redissonClient.getLock("project_batch_job_locks:migration")
if (lock.tryLock()) {
  try {
    // migration body...
  } finally {
    lock.unlock()
  }
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3447fc and da44146.

📒 Files selected for processing (11)
  • .github/workflows/forky-release.yml (1 hunks)
  • backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (1 hunks)
  • backend/app/src/main/resources/logback-spring.xml (1 hunks)
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1 hunks)
  • backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (2 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (2 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (4 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt (1 hunks)
  • backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt (1 hunks)
  • ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1)
backend/testing/src/main/kotlin/io/tolgee/testing/AuthorizedControllerTest.kt (3)
  • performAuthGet (130-133)
  • performAuthPut (114-120)
  • performAuthDelete (135-141)
🪛 actionlint (1.7.7)
.github/workflows/forky-release.yml

23-23: the runner of "actions/setup-node@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Frontend static check 🪲
  • GitHub Check: Build backend 🏗️
🔇 Additional comments (11)
ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt (1)

27-27: Clarify and validate timeout values
The defaultAttempts() list items map to RestTemplate read timeouts in seconds (via Duration.ofSeconds). You’ve increased this from [30] to [60, 120]. Confirm these durations align with OpenAI’s typical response times, SLAs, and your intended backoff strategy.
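Illustratively, the change amounts to the following (the exact signature is assumed from the review note, not taken from the source file):

// Each entry is one attempt's read timeout in seconds (applied via Duration.ofSeconds):
// before: listOf(30) meant a single attempt timing out at 30 s;
// after: two attempts, 60 s then 120 s.
override fun defaultAttempts(): List<Int> = listOf(60, 120)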

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)

104-106: Good call on silencing the spammy trace.

backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1)

411-413: LGTM: updated to the new Set return type.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (1)

203-205: getAllQueueItems() snapshot is fine.

Returns a safe snapshot for admin inspection; no hot path impact.

backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (2)

85-107: LGTM: queue inspection endpoint.

Simple mapping over a snapshot; appropriate for admin use.


24-25: Avoid permissive CORS on admin endpoints.

@CrossOrigin(origins = ["*"]) on super-admin APIs is risky. Prefer removing it or restricting it to configured origins.

-@CrossOrigin(origins = ["*"])
 @RequestMapping("/v2/administration")

Optionally, wire allowed origins via configuration and apply at a global CORS config rather than per-controller.

⛔ Skipped due to learnings
Learnt from: dkrizan
PR: tolgee/tolgee-platform#3092
File: backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/LabelsController.kt:32-32
Timestamp: 2025-06-04T15:04:55.253Z
Learning: In the Tolgee platform codebase, the team intentionally uses `CrossOrigin(origins = ["*"])` to allow all origins in their controllers. This is their standard practice and should not be flagged as a security concern in future reviews.
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (2)

48-81: Good coverage for lock listing with/without super auth.


103-126: Queue endpoint tests are fine; mapping aligns with data model.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (3)

128-156: Lock set computation looks correct under a positive limit.

With validated projectConcurrency ≥ 1, the paths are consistent. No action needed beyond property validation.

Edge case: If projectConcurrency were misconfigured to 0, initial lock path would still allow one job. Property validation (suggested in BatchProperties) avoids this.


188-191: Typed access to the Redis map is fine.

Returning ConcurrentMap<Long, Set<Long>> keeps usage consistent across callers.


228-230: LGTM: getLockedJobIds aggregates over Set values.

Comment on lines +22 to +26
      - name: Setup node
        uses: actions/setup-node@v3
        with:
          node-version: "22.x"


⚠️ Potential issue | 🔴 Critical

Upgrade to a supported setup-node version.

actions/setup-node@v3 still runs on the deprecated Node16 runtime and is blocked on ubuntu-24.04 runners, so the workflow fails before any build logic runs. Please bump to @v4 (or later).

-      - name: Setup node
-        uses: actions/setup-node@v3
+      - name: Setup node
+        uses: actions/setup-node@v4
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-      - name: Setup node
-        uses: actions/setup-node@v3
-        with:
-          node-version: "22.x"
+      - name: Setup node
+        uses: actions/setup-node@v4
+        with:
+          node-version: "22.x"
🧰 Tools
🪛 actionlint (1.7.7)

23-23: the runner of "actions/setup-node@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
.github/workflows/forky-release.yml around lines 22-26: actions/setup-node is
pinned to @v3 which uses deprecated Node16 runtime and fails on ubuntu-24.04;
update the action to a supported major by changing the uses line to
actions/setup-node@v4 (or later) and keep or adjust the node-version input
(e.g., "22.x") as needed so the runner uses a supported setup-node
implementation.

Comment on lines +50 to +52
          echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
          docker build . -t ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }} --platform linux/amd64
          docker push ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }}

⚠️ Potential issue | 🔴 Critical

Fix the Docker login username.

-u $ expands to an empty string, so docker login refuses the authentication and the push never happens. Use a real username (e.g. github.actor) and keep the token on stdin.

-          echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
+          echo "${{ secrets.GITHUB_TOKEN }}" | docker login "${{ env.DOCKER_REGISTRY }}" -u "${{ github.actor }}" --password-stdin
📝 Committable suggestion


Suggested change

-          echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
+          echo "${{ secrets.GITHUB_TOKEN }}" | docker login "${{ env.DOCKER_REGISTRY }}" -u "${{ github.actor }}" --password-stdin
           docker build . -t ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }} --platform linux/amd64
           docker push ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }}
🤖 Prompt for AI Agents
.github/workflows/forky-release.yml around lines 50-52: the Docker login uses
`-u $` which expands to an empty username and breaks authentication; replace the
empty username with a real actor variable (e.g. `-u ${{ github.actor }}`) while
continuing to pipe the token via `--password-stdin` so the command becomes: echo
the token | docker login <registry> -u ${{ github.actor }} --password-stdin,
then keep the build and push steps unchanged.

Comment on lines +20 to +36
/**
 * REST API for managing project batch job locks
 */
@RestController
@CrossOrigin(origins = ["*"])
@RequestMapping("/v2/administration")
@Tag(
  name = "Server Administration",
  description = "**Only for self-hosted instances** \n\n" +
    "Management of project-level batch job locks and queue inspection for debugging and maintenance."
)
@OpenApiSelfHostedExtension
class ProjectBatchLockController(
  private val batchJobProjectLockingManager: BatchJobProjectLockingManager,
  private val batchJobService: BatchJobService,
  private val batchJobChunkExecutionQueue: BatchJobChunkExecutionQueue,
) : IController, Logging {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify presence of PUT/DELETE endpoints used in tests.

Tests call:

  • PUT /v2/administration/project-batch-locks/{id}/clear
  • DELETE /v2/administration/project-batch-locks/{id}

They are not in this controller. Ensure they exist elsewhere or add them here.


🏁 Script executed:

#!/bin/bash
# Search for project-batch-locks admin endpoints in the codebase
rg -nP -C3 --type=kotlin '@(PutMapping|DeleteMapping)\s*\(\s*"/v2/administration/project-batch-locks'
rg -nP -C3 --type=kotlin 'RequestMapping\(.*/v2/administration/project-batch-locks'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Broad search for any mappings or references to project-batch-locks
rg -nP --type=kotlin '@\w+Mapping\s*\(\s*"/v2/administration/project-batch-locks' -C3
rg -n --type=kotlin 'project-batch-locks' -C3

Length of output: 5387


Add missing PUT and DELETE mappings in ProjectBatchLockController
Tests call PUT /v2/administration/project-batch-locks/{id}/clear and DELETE /v2/administration/project-batch-locks/{id}, but neither handler is implemented in ProjectBatchLockController.kt. Implement these endpoints to satisfy the tests.

🤖 Prompt for AI Agents
In
backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
around lines 20-36, add two REST handlers: a PUT at
/v2/administration/project-batch-locks/{id}/clear and a DELETE at
/v2/administration/project-batch-locks/{id}. The PUT handler should accept the
path variable id (project id), call the batchJobProjectLockingManager method
that clears/releases the project lock, and also ensure any related queued chunks
or tracking are cleared via batchJobChunkExecutionQueue and/or batchJobService
as appropriate, then return a 204/OK response. The DELETE handler should accept
the path variable id, remove any persistent batch-job records for that project
via batchJobService and clear related queue entries via
batchJobChunkExecutionQueue and/or batchJobProjectLockingManager, and return an
appropriate 204/OK response. Annotate both methods with
@PutMapping/@DeleteMapping and @PathVariable, handle missing/not-found cases
with suitable HTTP statuses, and add minimal logging consistent with the
controller.
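A hedged sketch of the missing handlers inside this controller (it assumes getMap() returns a mutable ConcurrentMap<Long, Set<Long>>, as the tests stub it; this is a sketch, not the merged implementation):

  @PutMapping("/project-batch-locks/{projectId}/clear")
  @RequiresSuperAuthentication
  fun clearProjectLock(@PathVariable projectId: Long) {
    // Replace the lock set with an empty one; the project stays listed but unlocked.
    batchJobProjectLockingManager.getMap()[projectId] = emptySet()
  }

  @DeleteMapping("/project-batch-locks/{projectId}")
  @RequiresSuperAuthentication
  fun deleteProjectLock(@PathVariable projectId: Long) {
    // Drop the project entry entirely.
    batchJobProjectLockingManager.getMap().remove(projectId)
  }

The paths are relative to the class-level @RequestMapping("/v2/administration"), matching the routes the tests call.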

Comment on lines +17 to +26
    <!-- JSON console appender using standard logback pattern -->
    <appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>{"timestamp":"%d{yyyy-MM-dd'T'HH:mm:ss.SSSZ}","level":"%p","logger":"%c","message":"%m","thread":"%t","exception":"%ex{full}"}%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON_CONSOLE"/>
    </root>

⚠️ Potential issue | 🟠 Major

Use a dedicated JSON encoder instead of a raw pattern.

This pattern layout only resembles JSON; %m, %c, %ex{full} etc. will happily inject quotes, backslashes, and newlines without escaping, so any log containing " or a stack trace breaks consumers expecting valid JSON. Please switch to a real JSON encoder (e.g., ch.qos.logback.contrib.json.classic.JsonLayout, net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder, or Spring Boot’s default LoggingSystemProperties with LOGGING_LEVEL_PATTERN) or add a layout that handles escaping.
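For illustration, a minimal sketch using LogstashEncoder (assumes the net.logstash.logback:logstash-logback-encoder dependency is added; it is not part of this PR):

<appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder escapes messages and stack traces, so output stays valid JSON -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>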

🤖 Prompt for AI Agents
In backend/app/src/main/resources/logback-spring.xml around lines 17 to 26, the
ConsoleAppender currently uses a raw pattern string that can produce invalid
JSON when messages or stack traces contain quotes/newlines; replace the
pattern-based encoder with a dedicated JSON encoder (for example use
net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder or
ch.qos.logback.contrib.json.classic.JsonLayout) and configure it to emit
timestamp, level, logger, thread, message and exception fields with proper
escaping/stacktrace handling; remove the pattern block and add the appropriate
encoder/layout element and required dependencies so consumers always receive
well-formed JSON.

Comment on lines +83 to +101
  @Test
  fun `PUT clear project lock works with super auth`() {
    val testLocks = ConcurrentHashMap<Long, Set<Long>>()
    whenever(batchJobProjectLockingManager.getMap()).thenReturn(testLocks)

    performAuthPut("/v2/administration/project-batch-locks/123/clear", null)
      .andIsOk
  }

  @Test
  fun `DELETE project lock works with super auth`() {
    val testLocks = ConcurrentHashMap<Long, Set<Long>>().apply {
      put(123L, setOf(456L))
    }
    whenever(batchJobProjectLockingManager.getMap()).thenReturn(testLocks)

    performAuthDelete("/v2/administration/project-batch-locks/123")
      .andIsOk
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

PUT/DELETE tests assume endpoints exist. Please verify.


🏁 Script executed:

#!/bin/bash
# Verify controller mappings exist for the tested routes
rg -nP --type=kotlin -C3 '@PutMapping\("/v2/administration/project-batch-locks/.*/clear"\)'
rg -nP --type=kotlin -C3 '@DeleteMapping\("/v2/administration/project-batch-locks/.*"\)'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find controller files related to project batch locking endpoints
fd -IH 'ProjectBatchLock*Controller*' -e kt

# Search for mapping annotations for project-batch-locks endpoints
rg -n '/v2/administration/project-batch-locks' -g '*.kt'

# Search for usage of batchJobProjectLockingManager in code
rg -n 'batchJobProjectLockingManager' -g '*.kt'

Length of output: 3796


🏁 Script executed:

#!/bin/bash
rg -n '@PutMapping' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
rg -n '@DeleteMapping' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

Length of output: 0


🏁 Script executed:

#!/bin/bash
rg -nE '@(RequestMapping|GetMapping|PutMapping|DeleteMapping)' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

Length of output: 90


🏁 Script executed:

#!/bin/bash
# Print the beginning of the controller to inspect class-level mapping
sed -n '1,200p' backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

# Look for class-level RequestMapping
rg -n '@RequestMapping' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

# Search for any 'clear' usage
rg -n 'clear' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

# Search for HTTP delete handler
rg -n 'fun .*delete' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

Length of output: 5018


Implement missing PUT/DELETE endpoints for project batch locks
ProjectBatchLockController.kt currently only defines GET mappings; the tests for PUT /v2/administration/project-batch-locks/{id}/clear and DELETE /v2/administration/project-batch-locks/{id} will fail. Add corresponding @PutMapping and @DeleteMapping methods (with @RequiresSuperAuthentication) to handle “clear” and “delete” operations.

🤖 Prompt for AI Agents
In
backend/app/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
around where GET mappings are defined (add new methods corresponding to tests
roughly next to existing endpoints), implement two new endpoints: add a
@PutMapping("/v2/administration/project-batch-locks/{id}/clear") method
annotated with @RequiresSuperAuthentication that reads the id path variable,
retrieves batchJobProjectLockingManager.getMap(), and if an entry exists clears
the Set for that id (e.g. replace with emptySet or clear the collection) and
returns an OK response; add a
@DeleteMapping("/v2/administration/project-batch-locks/{id}") method annotated
with @RequiresSuperAuthentication that removes the map entry for the given id
via batchJobProjectLockingManager.getMap().remove(id) and returns an OK
response; ensure proper path variable types and return types match other
controller methods (e.g. ResponseEntity or Unit mapped to 200) and include
imports/annotations as needed.

Comment on lines +80 to 92
    val computedJobIds =
      getRedissonProjectLocks().compute(projectId) { _, lockedJobIds ->
        val newLockedJobIds = computeFnBody(batchJobDto, lockedJobIds ?: emptySet())
        logger.debug(
          "While trying to lock on redis {} for project {} new lock value is {}",
          batchJobDto.id,
          batchJobDto.projectId,
          newLockedJobIds
        )
        newLockedJobIds
      } ?: emptySet()
    return computedJobIds.contains(batchJobDto.id)
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Confirm atomicity of Redisson RMap.compute across nodes.

Concurrency guarantees are critical to enforce projectConcurrency. Please confirm RMap.compute is cluster-atomic; otherwise, guard with an RLock per project key or switch to a server-side entry-processor/Lua.


🌐 Web query:

Is Redisson RMap.compute atomic across the cluster? What is the recommended approach to perform an atomic read-modify-write on a map entry with Redisson?

💡 Result:

Short answer: No — RMap.compute is a client-side Java callback and is not guaranteed atomic across the cluster. The safe approaches are to perform the read-modify-write inside a Redis-backed primitive or server-side transaction/lock. Recommended options:

  • Acquire a Redis lock for the key (RLock) around your read-modify-write. This serializes updates across clients/nodes. [1][2]
  • Use Redisson transactions (RTransaction) so the map participates with ACID properties and changes are committed atomically. [3]
  • For simple “compute-if-absent” semantics prefer the server-side helper methods (e.g. computeIfAbsent on RMapCache / putIfAbsent / fastPutIfAbsent) which Redisson implements to avoid racing on the client. [4]

References:

  1. Redisson — Distributed collections / RMap (behavior and API). [1]
  2. Redisson — FAQ (thread-safety and use of locks). [2]
  3. Redisson — Transactions (RMap participates in transactions; isolation READ_COMMITTED). [3]
  4. Redisson Javadoc — RMapCache.computeIfAbsent and putIfAbsent helpers. [4]

Guard read-modify-write with a distributed lock or server-side atomic operation

RMap.compute is a client-side callback and not cluster-atomic. Wrap the compute call in an RLock keyed by projectId—or switch to a Redisson RTransaction or a server-side entry processor/Lua script—to enforce atomic read-modify-write across nodes.
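A hedged sketch of the RLock-guarded variant (assumes a redissonClient field is available in the manager; names mirror the snippet above):

private fun tryLockUsingRedisson(batchJobDto: BatchJobDto): Boolean {
  // Serialize the read-modify-write per project across all nodes.
  val lock = redissonClient.getLock("project_batch_job_locks:lock:${batchJobDto.projectId}")
  lock.lock()
  try {
    val map = getRedissonProjectLocks()
    val current = map[batchJobDto.projectId] ?: emptySet()
    val updated = computeFnBody(batchJobDto, current)
    map[batchJobDto.projectId] = updated
    return updated.contains(batchJobDto.id)
  } finally {
    lock.unlock()
  }
}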

Comment on lines +11 to +12
  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
  var projectConcurrency: Int = 1

🛠️ Refactor suggestion | 🟠 Major

Validate concurrency properties (prevent 0/negative).

If set to 0 or negative, project-level locking behavior becomes inconsistent. Add bean validation and validate both properties.

 package io.tolgee.configuration.tolgee
@@
-import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.validation.annotation.Validated
+import jakarta.validation.constraints.Min
@@
-@ConfigurationProperties(prefix = "tolgee.batch")
+@ConfigurationProperties(prefix = "tolgee.batch")
+@Validated
 @DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
 class BatchProperties {
   @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
-  var concurrency: Int = 1
+  @field:Min(1)
+  var concurrency: Int = 1
   @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
-  var projectConcurrency: Int = 1
+  @field:Min(1)
+  var projectConcurrency: Int = 1
 }
📝 Committable suggestion


Suggested change

-  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
-  var projectConcurrency: Int = 1
+package io.tolgee.configuration.tolgee
+
+import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.validation.annotation.Validated
+import jakarta.validation.constraints.Min
+
+@Validated
+@ConfigurationProperties(prefix = "tolgee.batch")
+@DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
+class BatchProperties {
+  @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
+  @field:Min(1)
+  var concurrency: Int = 1
+
+  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
+  @field:Min(1)
+  var projectConcurrency: Int = 1
+}
🤖 Prompt for AI Agents
In
backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt
around lines 11-12, the projectConcurrency (and any other concurrency properties
in this class) must be validated to prevent 0 or negative values; add bean
validation annotations (e.g., @Min(1) or @Positive) to the property declarations
and ensure the class is annotated/configured for validation (e.g., @Validated or
@ConfigurationProperties + enable JSR-303 validation) so Spring will enforce the
constraint at startup; apply the same validation annotation to any other
concurrency-related properties in this file.

@bdshadow changed the title from "fix batch job processing" to "fix: batch job processing" on Oct 3, 2025
@bdshadow commented Oct 5, 2025

@ajeans hello!
I think it's worth splitting the commits here into two separate pull requests. Maybe we can keep this PR for reference, create a separate one just for the batch job fix, and after that is merged, continue with a new one that allows multiple jobs per project.

Let's take the following commits into the first PR.
[Screenshot: list of the selected commits]

General comments:

  1. Please fix the commit messages to follow the convention.
  2. I think the .setMaxResults(1_000) limit in BatchJobChunkExecutionQueue is worth exposing as a configuration property in io.tolgee.configuration.tolgee.BatchProperties.
  3. In the current state, the new ProjectBatchLockControllerTest is failing. Please make sure it passes in the new PRs.

Then we can continue the review in those separate PRs.

@ajeans commented Oct 6, 2025

@bdshadow Hello 👋

Sounds good, I'll create new PR #1 today or tomorrow

@ajeans commented Oct 6, 2025

@bdshadow for internal security reasons, I cannot push the branch to the tolgee repo from my corporate laptop (sigh).

Can you please create the PR from this branch: https://github.com/RakutenFrance/tolgee-platform/commits/fix/batch-job-processing?

Tests and ktlint should be fine, and I amended the commit names.

@bdshadow commented Oct 6, 2025

@ajeans great! Here it is: #3265

@bdshadow marked this pull request as a draft on October 6, 2025 19:18
@JanCizmar commented

I guess we can close this one! Thanks for the contribution. It's much appreciated ❤️

I hope we will get the batch jobs stable and fast soon! 🚀

@bdshadow commented Oct 10, 2025

I will still use this issue for communication.
The first PR is merged. @ajeans, please create the next PR. On my side, I also have a task to fix jobs that hang when a job is cancelled. I'm getting back to it after your first fix to check whether it helped. I hope all these fixes will make batch processing much more stable.

@ajeans commented Oct 11, 2025

Hello @bdshadow

Excellent. I created the second pull request minutes ago; it is #3271.

Let's have the review there, and we can keep this thread for further discussion.

Regards
