
Conversation

@bdshadow commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • Added admin endpoints to inspect project batch locks and the batch job queue.
    • Introduced configurable per-project batch concurrency (default 1).
    • Added optional JSON logging profile alongside standard console logging.
  • Improvements

    • Optimized batch queue ordering and limited fetched items for better performance.
    • Adjusted retry strategy for LLM operations to improve resilience.
  • Tests

    • Added coverage for admin endpoints, including authorization and queue/lock operations.
  • Chores

    • Added a manual Release workflow to build, package, and publish Docker images.

ajeans and others added 10 commits September 30, 2025 11:56
[…] each item will be entirely inefficient.

Improve the situation with 2 changes
* Hardcode a max of 1_000 chunks to be added to the internal queue
* Prioritize RUNNING job chunks in queue population

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
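For illustration, a hedged sketch of what the capped, RUNNING-first fetch could look like (the entity, field, and enum names are assumptions based on this PR's walkthrough, not the actual query):

import jakarta.persistence.EntityManager

// Sketch only: cap the fetch at 1_000 items and surface chunks of RUNNING jobs first.
fun fetchQueueCandidates(entityManager: EntityManager): List<BatchJobChunkExecution> =
  entityManager
    .createQuery(
      """
      select e from BatchJobChunkExecution e
        join e.batchJob j
      where e.status = io.tolgee.model.batch.BatchJobChunkExecutionStatus.PENDING
      order by
        case when j.status = io.tolgee.model.batch.BatchJobStatus.RUNNING then 0 else 1 end,
        e.createdAt, e.executeAfter, e.id
      """,
      BatchJobChunkExecution::class.java,
    )
    .setMaxResults(1_000) // hard cap so huge backlogs don't flood the in-memory queue
    .resultList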
Remove hardcoded 30-second timeout override
and use [60, 120] configuration.

This resolves the issue where OpenAI calls were timing out at 30s
on the same chunks over and over.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add logback-spring.xml with profile-based configuration supporting both
standard and JSON log formats. JSON logging enabled with 'json-logging'
profile for improved log analysis and monitoring capabilities.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add administration endpoints for viewing and managing project batch job locks:
- GET /v2/administration/project-batch-locks - retrieve all project locks
- GET /v2/administration/batch-job-queue - returns all chunk execution items

Features:
- Admin-only access with @RequiresSuperAuthentication
- Works with both Redis and local storage configurations
- Provides lock status (UNLOCKED/UNINITIALIZED/LOCKED) and job details
- Comprehensive test coverage
- Full OpenAPI documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The previous getInitialJobId() logic created phantom locks by assuming jobs
with fewer unlocked chunks were "running", when in reality they were just
PENDING jobs with chunks sitting in local queues but never executing.

1. populateQueue() adds chunks to local queues (counted as "locked")
2. getInitialJobId() sees locked chunks → assumes job is "running"
3. Creates phantom lock pointing to PENDING job
4. All subsequent jobs fail project lock check → returned to queue
5. No job ever reaches handleItem() → no job becomes RUNNING
6. Phantom lock persists indefinitely → system deadlocked

Now prioritizes database job status over chunk counting:
1. First: Look for jobs with status=RUNNING (truly active jobs)
2. Fallback: Use original chunk-counting logic for edge cases
3. If neither found: Return null (allow new job to acquire lock)

This prevents phantom locks from PENDING jobs while preserving protection
for legitimately running jobs. The fix should break the current deadlock
where 203k+ PENDING jobs are stuck due to phantom project locks.

Expected result: Batch jobs will start processing and populate LLM metrics.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
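A hedged sketch of the priority order described above (helper names are illustrative; getJobDto and getAllIncompleteJobIds appear in this PR's walkthrough, but their exact signatures are assumed):

// Sketch only: prefer database job status over chunk counting.
fun getInitialJobId(projectId: Long): Long? {
  val incompleteJobIds = batchJobService.getAllIncompleteJobIds(projectId) // assumed signature
  // 1) Prefer jobs the database already marks RUNNING: these are truly active.
  incompleteJobIds
    .firstOrNull { batchJobService.getJobDto(it).status == BatchJobStatus.RUNNING }
    ?.let { return it }
  // 2) Fall back to the original chunk-counting heuristic for edge cases.
  findJobIdByUnlockedChunkCount(projectId)?.let { return it } // hypothetical helper
  // 3) Neither found: return null so a new job may acquire the project lock.
  return null
}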
Target full public github action to keep the
repository a fork of tolgee/tolgee.
Replace old format Map<Long, Long> with new format Map<Long, Set<Long>>
to support configurable concurrent jobs per project. Migration automatically
runs on startup and performs one-time data conversion.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds a release GitHub Actions workflow, introduces admin endpoints to inspect/clear project batch locks and view the batch-job queue, refactors project lock management to support multiple concurrent locks per project with migration logic, adds a projectConcurrency property, adjusts queue ordering/limits, updates logging, adds tests, and tweaks OpenAI retry attempts.

Changes

  • Release workflow (.github/workflows/forky-release.yml)
    New manual "Release" workflow: sets up JDK 21 and Node 22, runs npm/Gradle tasks, builds and pushes a Docker image to ghcr.io using the releaseVersion input, and packs web resources.
  • Admin APIs for batch locks & queue (backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt)
    New controller with endpoints GET /v2/administration/project-batch-locks and GET /v2/administration/batch-job-queue. Adds models for lock status, job info, and queue items; guarded by super-auth; includes error handling.
  • Tests for admin APIs (backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt, backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt)
    New controller tests for authorized/unauthorized access, lock clearing/deleting, and queue retrieval. Updates the test util to check emptiness of locked project IDs.
  • Batch locking refactor, multi-lock per project (backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt, backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt)
    Refactors locks from a single Long to a Set<Long> per project; supports per-project concurrency via the new BatchProperties.projectConcurrency (see the configuration sketch after this list). Adds a Redis data migration in afterPropertiesSet and updates the APIs (getMap, getLockedForProject, getLockedJobIds) and the locking/unlocking logic.
  • Batch queue inspection and ordering (backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt)
    Adds getAllQueueItems(). Adjusts selection ordering (RUNNING first, then createdAt/executeAfter/id) and caps DB query results at 1000.
  • Batch launcher logging/comment (backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt)
    Suppresses a verbose trace log; updates a comment about project-level concurrency.
  • Scheduled job cleaner logging (backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt)
    Collects stuck job IDs before logging; logs the count and a per-job trace prior to removal.
  • Logging configuration (backend/app/src/main/resources/logback-spring.xml)
    Adds console and JSON console logging profiles with an INFO root level.
  • OpenAI retry tuning, EE (ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt)
    Changes default retry attempts from [30] to [60, 120].
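Since BatchProperties binds the tolgee.batch prefix, the new limit can be set like any other property. A hedged application.yaml sketch (the relaxed-binding key is derived from the projectConcurrency field; the value 2 is only an example):

tolgee:
  batch:
    concurrency: 1           # existing: parallel jobs per Tolgee instance
    project-concurrency: 2   # new: concurrent jobs per project across all instances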

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Admin
  participant API as ProjectBatchLockController
  participant LM as BatchJobProjectLockingManager
  participant BS as BatchJobService

  Admin->>API: GET /v2/administration/project-batch-locks
  API->>LM: getMap()
  LM-->>API: Map<ProjectId, Set<JobId>>
  loop For each locked JobId
    API->>BS: getJobDto(jobId)
    alt Success
      BS-->>API: JobInfo
    else Error
      BS-->>API: Error (handled)
    end
  end
  API-->>Admin: Collection<ProjectLockModel>
  note over API: lockStatus derived: LOCKED/UNLOCKED
sequenceDiagram
  autonumber
  actor Admin
  participant API as ProjectBatchLockController
  participant Q as BatchJobChunkExecutionQueue

  Admin->>API: GET /v2/administration/batch-job-queue
  API->>Q: getAllQueueItems()
  Q-->>API: List<ExecutionQueueItem>
  API-->>Admin: Collection<QueueItemModel>
sequenceDiagram
  autonumber
  participant Launcher
  participant LM as BatchJobProjectLockingManager
  participant Redis as Redis Map
  participant DB as Batch Jobs DB
  participant Props as BatchProperties

  Launcher->>LM: tryLock(projectId, jobId)
  LM->>Props: read projectConcurrency
  alt Redis enabled
    LM->>Redis: compute(projectId){ add jobId if set size < concurrency }
    Redis-->>LM: Updated Set<JobId>
  else Local map
    LM->>LM: update local Set<JobId> if size < concurrency
  end
  LM-->>Launcher: lock acquired? (boolean)
  opt First run / migration
    LM->>Redis: scan old Long values -> convert to Set<Long>
    Redis-->>LM: migrated entries
  end
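To make the third diagram concrete, a minimal local-map sketch of set-based locking under a per-project limit (a simplification: the real manager also covers the Redis path and migration):

import java.util.concurrent.ConcurrentHashMap

// Local (non-Redis) variant only; compute() makes the read-modify-write atomic in-process.
val localProjectLocks = ConcurrentHashMap<Long, Set<Long>>()

fun tryLockLocal(projectId: Long, jobId: Long, projectConcurrency: Int): Boolean {
  var acquired = false
  localProjectLocks.compute(projectId) { _, lockedJobIds ->
    val current = lockedJobIds ?: emptySet()
    when {
      jobId in current -> {
        acquired = true // this job already holds a lock
        current
      }
      current.size < projectConcurrency -> {
        acquired = true // room under the limit: add the job
        current + jobId
      }
      else -> current // at capacity: leave the set unchanged
    }
  }
  return acquired
}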

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • feat: more llm providers #3118 — Touches the same OpenAI service, adjusting response nullability and error handling; related to current retry/backoff changes.

Suggested labels

enhancement

Suggested reviewers

  • JanCizmar
  • Anty0

Poem

A whisk of paws, a queue in line,
Locks now count in sets—how fine!
Admin ears perk—new endpoints gleam,
JSON logs blink in a tidy stream.
With hops of builds to ghcr flow,
Retries stretch long—then off we go! 🐇🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 6.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check (❓ Inconclusive): the title "fix: batch job processing" is overly generic and does not capture the primary changes: refactoring the batch job locking manager to support concurrent locks, adding new administration endpoints, and updating the release workflow and logging configuration. A more descriptive title, for example "Refactor project-level batch job locking for concurrency and add admin inspection endpoints", would make the main objectives clear at a glance.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


@coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (3)

307-309: Fix assertion to new Set API.

getLockedForProject(...) now returns Set<Long>.

-  fun assertJobUnlocked() {
-    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(0L)
-  }
+  fun assertJobUnlocked() {
+    batchJobProjectLockingManager
+      .getLockedForProject(testData.projectBuilder.self.id)
+      .assert.isEmpty()
+  }

385-387: Fix verifyJobLocked to match Set.

Adjust to assert the set contains only the expected job id.

-  fun verifyJobLocked(job: BatchJob) {
-    batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(job.id)
-  }
+  fun verifyJobLocked(job: BatchJob) {
+    batchJobProjectLockingManager
+      .getLockedForProject(testData.projectBuilder.self.id)
+      .assert.isEqualTo(setOf(job.id))
+  }

1-478: Update stale getLockedForProject assertions in BatchJobTestUtil

  • In assertJobUnlocked (line 308), replace
    batchJobProjectLockingManager.getLockedForProject(...).assert.isEqualTo(0L)
    with
    batchJobProjectLockingManager.getLockedForProject(...).assert.isEmpty()
  • In verifyJobLocked (line 386), replace
    batchJobProjectLockingManager.getLockedForProject(...).assert.isEqualTo(job.id)
    with
    batchJobProjectLockingManager.getLockedForProject(...).assert.contains(job.id)
🧹 Nitpick comments (8)
backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt (1)

48-53: Skip warn when no stuck jobs.

Avoid a warn for 0 items; early-return keeps logs clean.

   private fun handleStuckJobs() {
-    val stuckJobIds = batchJobService.getStuckJobIds(batchJobStateProvider.getCachedJobIds())
-    logger.warn("Removing stuck job state using scheduled task for ${stuckJobIds.size} jobs")
+    val stuckJobIds = batchJobService.getStuckJobIds(batchJobStateProvider.getCachedJobIds())
+    if (stuckJobIds.isEmpty()) return
+    logger.warn("Removing stuck job state using scheduled task for ${stuckJobIds.size} jobs")
     stuckJobIds.forEach {
       logger.trace("Removing stuck job state using scheduled task for job $it")
       batchJobStateProvider.removeJobState(it)
     }
   }
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)

175-193: Add jitter/exponential backoff when project lock unavailable.

A constant +1 s requeue can cause a synchronized thundering herd. Add jitter or exponential backoff tied to a retry counter to spread retries under contention.

Example adjustment:

  • Increment a retry counter on the item (reuse managementErrorRetrials or add a dedicated field).
  • Delay = base * 2^retries with cap, plus small random jitter.
val retries = (executionItem.managementErrorRetrials + 1).coerceAtMost(6)
val backoff = (1_000L shl (retries - 1)).coerceAtMost(10_000L)
val jitter = (200L..600L).random()
it.managementErrorRetrials = retries
it.executeAfter = currentDateProvider.date.time + backoff + jitter
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (1)

66-71: Queue ordering and limiting look good.

RUNNING-first + stable ordering and max 1000 improves fairness and latency.

Consider adding/confirming DB indexes to support this query efficiently:

  • On BatchJobChunkExecution(status, createdAt, executeAfter, id)
  • On BatchJob(status) for the join

Also applies to: 77-81

backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (1)

38-54: LGTM: lock overview endpoint.

Builds per-project models with defensive job info retrieval. Looks good.

Consider snapshotting the map first to avoid mid-iteration changes:

val locks = batchJobProjectLockingManager.getMap().toMap()
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1)

1-136: Optional: assert response content for stronger guarantees.

Add JSON assertions (e.g., size, IDs) to validate payload shape.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (3)

58-68: Tidy up duplicate unlock logs.

Same debug message logged twice when the job is present; keep one.

-    getMap().compute(projectId) { _, lockedJobIds ->
-      logger.debug("Unlocking job: $jobId for project $projectId")
+    getMap().compute(projectId) { _, lockedJobIds ->
       val currentJobs = lockedJobIds ?: emptySet()
       if (currentJobs.contains(jobId)) {
-        logger.debug("Unlocking job: $jobId for project $projectId")
+        logger.debug("Unlocking job: $jobId for project $projectId")
         val updatedJobs = currentJobs - jobId
         return@compute updatedJobs.ifEmpty { emptySet() }
       }

161-186: Reduce DB round-trips in getInitialJobId (optional).

You call getJobDto per job to check status. If possible, fetch statuses in one query or extend getAllIncompleteJobIds to include status to avoid N calls.
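A hedged sketch of the batched variant (findStatusesByIds is a hypothetical repository helper returning id/status pairs, not an existing method):

// Sketch only: resolve all candidate job statuses in one query instead of N getJobDto calls.
fun findRunningJobId(candidateJobIds: List<Long>): Long? {
  val statusById: Map<Long, BatchJobStatus> =
    batchJobRepository.findStatusesByIds(candidateJobIds) // hypothetical helper
      .associate { (id, status) -> id to status }
  return candidateJobIds.firstOrNull { statusById[it] == BatchJobStatus.RUNNING }
}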


192-226: Guard migration with a Redis lock to avoid multi-node races.

If multiple nodes start concurrently, they could both migrate. Use an RLock around the migration section.

val lock = redissonClient.getLock("project_batch_job_locks:migration")
if (lock.tryLock()) {
  try {
    // migration body...
  } finally {
    lock.unlock()
  }
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3447fc and da44146.

📒 Files selected for processing (11)
  • .github/workflows/forky-release.yml (1 hunks)
  • backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (1 hunks)
  • backend/app/src/main/resources/logback-spring.xml (1 hunks)
  • backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1 hunks)
  • backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (2 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (2 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (4 hunks)
  • backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt (1 hunks)
  • backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt (1 hunks)
  • ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1)
backend/testing/src/main/kotlin/io/tolgee/testing/AuthorizedControllerTest.kt (3)
  • performAuthGet (130-133)
  • performAuthPut (114-120)
  • performAuthDelete (135-141)
🪛 actionlint (1.7.7)
.github/workflows/forky-release.yml

23-23: the runner of "actions/setup-node@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Frontend static check 🪲
  • GitHub Check: Build backend 🏗️
🔇 Additional comments (11)
ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt (1)

27-27: Clarify and validate timeout values
The defaultAttempts() list items map to RestTemplate read timeouts in seconds (via Duration.ofSeconds). You’ve increased this from [30] to [60, 120]. Confirm these durations align with OpenAI’s typical response times, SLAs, and your intended backoff strategy.
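Illustratively, the change amounts to the following (the exact signature is assumed from the review note, not taken from the source file):

// Each entry is one attempt's read timeout in seconds (applied via Duration.ofSeconds):
// before: listOf(30) meant a single attempt timing out at 30 s;
// after: two attempts, 60 s then 120 s.
override fun defaultAttempts(): List<Int> = listOf(60, 120)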

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)

104-106: Good call on silencing the spammy trace.

backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1)

411-413: LGTM: updated to the new Set return type.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (1)

203-205: getAllQueueItems() snapshot is fine.

Returns a safe snapshot for admin inspection; no hot path impact.

backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (2)

85-107: LGTM: queue inspection endpoint.

Simple mapping over a snapshot; appropriate for admin use.


24-25: Avoid permissive CORS on admin endpoints.

@CrossOrigin(origins = ["*"]) on super-admin APIs is risky. Prefer removing it or restricting it to configured origins.

-@CrossOrigin(origins = ["*"])
 @RequestMapping("/v2/administration")

Optionally, wire allowed origins via configuration and apply at a global CORS config rather than per-controller.

⛔ Skipped due to learnings
Learnt from: dkrizan
PR: tolgee/tolgee-platform#3092
File: backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/LabelsController.kt:32-32
Timestamp: 2025-06-04T15:04:55.253Z
Learning: In the Tolgee platform codebase, the team intentionally uses `CrossOrigin(origins = ["*"])` to allow all origins in their controllers. This is their standard practice and should not be flagged as a security concern in future reviews.
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (2)

48-81: Good coverage for lock listing with/without super auth.


103-126: Queue endpoint tests are fine; mapping aligns with data model.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (3)

128-156: Lock set computation looks correct under a positive limit.

With validated projectConcurrency ≥ 1, the paths are consistent. No action needed beyond property validation.

Edge case: If projectConcurrency were misconfigured to 0, initial lock path would still allow one job. Property validation (suggested in BatchProperties) avoids this.


188-191: Typed access to the Redis map is fine.

Returning ConcurrentMap<Long, Set<Long>> keeps usage consistent across callers.


228-230: LGTM: getLockedJobIds aggregates over Set values.

Comment on lines +22 to +26
      - name: Setup node
        uses: actions/setup-node@v3
        with:
          node-version: "22.x"


⚠️ Potential issue | 🔴 Critical

Upgrade to a supported setup-node version.

actions/setup-node@v3 still runs on the deprecated Node16 runtime and is blocked on ubuntu-24.04 runners, so the workflow fails before any build logic runs. Please bump to @v4 (or later).

-      - name: Setup node
-        uses: actions/setup-node@v3
+      - name: Setup node
+        uses: actions/setup-node@v4
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-      - name: Setup node
-        uses: actions/setup-node@v3
-        with:
-          node-version: "22.x"
+      - name: Setup node
+        uses: actions/setup-node@v4
+        with:
+          node-version: "22.x"
🧰 Tools
🪛 actionlint (1.7.7)

23-23: the runner of "actions/setup-node@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
.github/workflows/forky-release.yml around lines 22-26: actions/setup-node is
pinned to @v3 which uses deprecated Node16 runtime and fails on ubuntu-24.04;
update the action to a supported major by changing the uses line to
actions/setup-node@v4 (or later) and keep or adjust the node-version input
(e.g., "22.x") as needed so the runner uses a supported setup-node
implementation.

Comment on lines +50 to +52
          echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
          docker build . -t ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }} --platform linux/amd64
          docker push ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }}

⚠️ Potential issue | 🔴 Critical

Fix the Docker login username.

-u $ expands to an empty string, so docker login refuses the authentication and the push never happens. Use a real username (e.g. github.actor) and keep the token on stdin.

-          echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
+          echo "${{ secrets.GITHUB_TOKEN }}" | docker login "${{ env.DOCKER_REGISTRY }}" -u "${{ github.actor }}" --password-stdin
📝 Committable suggestion


Suggested change

-          echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
+          echo "${{ secrets.GITHUB_TOKEN }}" | docker login "${{ env.DOCKER_REGISTRY }}" -u "${{ github.actor }}" --password-stdin
           docker build . -t ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }} --platform linux/amd64
           docker push ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }}
🤖 Prompt for AI Agents
.github/workflows/forky-release.yml around lines 50-52: the Docker login uses
`-u $` which expands to an empty username and breaks authentication; replace the
empty username with a real actor variable (e.g. `-u ${{ github.actor }}`) while
continuing to pipe the token via `--password-stdin` so the command becomes: echo
the token | docker login <registry> -u ${{ github.actor }} --password-stdin,
then keep the build and push steps unchanged.

Comment on lines +20 to +36
/**
 * REST API for managing project batch job locks
 */
@RestController
@CrossOrigin(origins = ["*"])
@RequestMapping("/v2/administration")
@Tag(
  name = "Server Administration",
  description = "**Only for self-hosted instances** \n\n" +
    "Management of project-level batch job locks and queue inspection for debugging and maintenance."
)
@OpenApiSelfHostedExtension
class ProjectBatchLockController(
  private val batchJobProjectLockingManager: BatchJobProjectLockingManager,
  private val batchJobService: BatchJobService,
  private val batchJobChunkExecutionQueue: BatchJobChunkExecutionQueue,
) : IController, Logging {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify presence of PUT/DELETE endpoints used in tests.

Tests call:

  • PUT /v2/administration/project-batch-locks/{id}/clear
  • DELETE /v2/administration/project-batch-locks/{id}

They are not in this controller. Ensure they exist elsewhere or add them here.


🏁 Script executed:

#!/bin/bash
# Search for project-batch-locks admin endpoints in the codebase
rg -nP -C3 --type=kotlin '@(PutMapping|DeleteMapping)\s*\(\s*"/v2/administration/project-batch-locks'
rg -nP -C3 --type=kotlin 'RequestMapping\(.*/v2/administration/project-batch-locks'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Broad search for any mappings or references to project-batch-locks
rg -nP --type=kotlin '@\w+Mapping\s*\(\s*"/v2/administration/project-batch-locks' -C3
rg -n --type=kotlin 'project-batch-locks' -C3

Length of output: 5387


Add missing PUT and DELETE mappings in ProjectBatchLockController
Tests call PUT /v2/administration/project-batch-locks/{id}/clear and DELETE /v2/administration/project-batch-locks/{id}, but neither handler is implemented in ProjectBatchLockController.kt. Implement these endpoints to satisfy the tests.

🤖 Prompt for AI Agents
In
backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
around lines 20-36, add two REST handlers: a PUT at
/v2/administration/project-batch-locks/{id}/clear and a DELETE at
/v2/administration/project-batch-locks/{id}. The PUT handler should accept the
path variable id (project id), call the batchJobProjectLockingManager method
that clears/releases the project lock, and also ensure any related queued chunks
or tracking are cleared via batchJobChunkExecutionQueue and/or batchJobService
as appropriate, then return a 204/OK response. The DELETE handler should accept
the path variable id, remove any persistent batch-job records for that project
via batchJobService and clear related queue entries via
batchJobChunkExecutionQueue and/or batchJobProjectLockingManager, and return an
appropriate 204/OK response. Annotate both methods with
@PutMapping/@DeleteMapping and @PathVariable, handle missing/not-found cases
with suitable HTTP statuses, and add minimal logging consistent with the
controller.
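A hedged sketch of the missing handlers inside this controller (it assumes getMap() returns a mutable ConcurrentMap<Long, Set<Long>>, as the tests stub it; this is a sketch, not the merged implementation):

  @PutMapping("/project-batch-locks/{projectId}/clear")
  @RequiresSuperAuthentication
  fun clearProjectLock(@PathVariable projectId: Long) {
    // Replace the lock set with an empty one; the project stays listed but unlocked.
    batchJobProjectLockingManager.getMap()[projectId] = emptySet()
  }

  @DeleteMapping("/project-batch-locks/{projectId}")
  @RequiresSuperAuthentication
  fun deleteProjectLock(@PathVariable projectId: Long) {
    // Drop the project entry entirely.
    batchJobProjectLockingManager.getMap().remove(projectId)
  }

The paths are relative to the class-level @RequestMapping("/v2/administration"), matching the routes the tests call.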

Comment on lines +17 to +26
    <!-- JSON console appender using standard logback pattern -->
    <appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>{"timestamp":"%d{yyyy-MM-dd'T'HH:mm:ss.SSSZ}","level":"%p","logger":"%c","message":"%m","thread":"%t","exception":"%ex{full}"}%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON_CONSOLE"/>
    </root>

⚠️ Potential issue | 🟠 Major

Use a dedicated JSON encoder instead of a raw pattern.

This pattern layout only resembles JSON; %m, %c, %ex{full} etc. will happily inject quotes, backslashes, and newlines without escaping, so any log containing " or a stack trace breaks consumers expecting valid JSON. Please switch to a real JSON encoder (e.g., ch.qos.logback.contrib.json.classic.JsonLayout, net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder, or Spring Boot’s default LoggingSystemProperties with LOGGING_LEVEL_PATTERN) or add a layout that handles escaping.
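For illustration, a minimal sketch using LogstashEncoder (assumes the net.logstash.logback:logstash-logback-encoder dependency is added; it is not part of this PR):

<appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder escapes messages and stack traces, so output stays valid JSON -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>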

🤖 Prompt for AI Agents
In backend/app/src/main/resources/logback-spring.xml around lines 17 to 26, the
ConsoleAppender currently uses a raw pattern string that can produce invalid
JSON when messages or stack traces contain quotes/newlines; replace the
pattern-based encoder with a dedicated JSON encoder (for example use
net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder or
ch.qos.logback.contrib.json.classic.JsonLayout) and configure it to emit
timestamp, level, logger, thread, message and exception fields with proper
escaping/stacktrace handling; remove the pattern block and add the appropriate
encoder/layout element and required dependencies so consumers always receive
well-formed JSON.

Comment on lines +83 to +101
  @Test
  fun `PUT clear project lock works with super auth`() {
    val testLocks = ConcurrentHashMap<Long, Set<Long>>()
    whenever(batchJobProjectLockingManager.getMap()).thenReturn(testLocks)

    performAuthPut("/v2/administration/project-batch-locks/123/clear", null)
      .andIsOk
  }

  @Test
  fun `DELETE project lock works with super auth`() {
    val testLocks = ConcurrentHashMap<Long, Set<Long>>().apply {
      put(123L, setOf(456L))
    }
    whenever(batchJobProjectLockingManager.getMap()).thenReturn(testLocks)

    performAuthDelete("/v2/administration/project-batch-locks/123")
      .andIsOk
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

PUT/DELETE tests assume endpoints exist. Please verify.


🏁 Script executed:

#!/bin/bash
# Verify controller mappings exist for the tested routes
rg -nP --type=kotlin -C3 '@PutMapping\("/v2/administration/project-batch-locks/.*/clear"\)'
rg -nP --type=kotlin -C3 '@DeleteMapping\("/v2/administration/project-batch-locks/.*"\)'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find controller files related to project batch locking endpoints
fd -IH 'ProjectBatchLock*Controller*' -e kt

# Search for mapping annotations for project-batch-locks endpoints
rg -n '/v2/administration/project-batch-locks' -g '*.kt'

# Search for usage of batchJobProjectLockingManager in code
rg -n 'batchJobProjectLockingManager' -g '*.kt'

Length of output: 3796


🏁 Script executed:

#!/bin/bash
rg -n '@PutMapping' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
rg -n '@DeleteMapping' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

Length of output: 0


🏁 Script executed:

#!/bin/bash
rg -nE '@(RequestMapping|GetMapping|PutMapping|DeleteMapping)' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

Length of output: 90


🏁 Script executed:

#!/bin/bash
# Print the beginning of the controller to inspect class-level mapping
sed -n '1,200p' backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

# Look for class-level RequestMapping
rg -n '@RequestMapping' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

# Search for any 'clear' usage
rg -n 'clear' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

# Search for HTTP delete handler
rg -n 'fun .*delete' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt

Length of output: 5018


Implement missing PUT/DELETE endpoints for project batch locks
ProjectBatchLockController.kt currently only defines GET mappings; the tests for PUT /v2/administration/project-batch-locks/{id}/clear and DELETE /v2/administration/project-batch-locks/{id} will fail. Add corresponding @PutMapping and @DeleteMapping methods (with @RequiresSuperAuthentication) to handle “clear” and “delete” operations.

🤖 Prompt for AI Agents
In
backend/app/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
around where GET mappings are defined (add new methods corresponding to tests
roughly next to existing endpoints), implement two new endpoints: add a
@PutMapping("/v2/administration/project-batch-locks/{id}/clear") method
annotated with @RequiresSuperAuthentication that reads the id path variable,
retrieves batchJobProjectLockingManager.getMap(), and if an entry exists clears
the Set for that id (e.g. replace with emptySet or clear the collection) and
returns an OK response; add a
@DeleteMapping("/v2/administration/project-batch-locks/{id}") method annotated
with @RequiresSuperAuthentication that removes the map entry for the given id
via batchJobProjectLockingManager.getMap().remove(id) and returns an OK
response; ensure proper path variable types and return types match other
controller methods (e.g. ResponseEntity or Unit mapped to 200) and include
imports/annotations as needed.

Comment on lines +80 to 92
    val computedJobIds =
      getRedissonProjectLocks().compute(projectId) { _, lockedJobIds ->
        val newLockedJobIds = computeFnBody(batchJobDto, lockedJobIds ?: emptySet())
        logger.debug(
          "While trying to lock on redis {} for project {} new lock value is {}",
          batchJobDto.id,
          batchJobDto.projectId,
          newLockedJobIds
        )
        newLockedJobIds
      } ?: emptySet()
    return computedJobIds.contains(batchJobDto.id)
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Confirm atomicity of Redisson RMap.compute across nodes.

Concurrency guarantees are critical to enforce projectConcurrency. Please confirm RMap.compute is cluster-atomic; otherwise, guard with an RLock per project key or switch to a server-side entry-processor/Lua.


🌐 Web query:

Is Redisson RMap.compute atomic across the cluster? What is the recommended approach to perform an atomic read-modify-write on a map entry with Redisson?

💡 Result:

Short answer: No — RMap.compute is a client-side Java callback and is not guaranteed atomic across the cluster. The safe approaches are to perform the read-modify-write inside a Redis-backed primitive or server-side transaction/lock. Recommended options:

  • Acquire a Redis lock for the key (RLock) around your read-modify-write. This serializes updates across clients/nodes. [1][2]
  • Use Redisson transactions (RTransaction) so the map participates with ACID properties and changes are committed atomically. [3]
  • For simple “compute-if-absent” semantics prefer the server-side helper methods (e.g. computeIfAbsent on RMapCache / putIfAbsent / fastPutIfAbsent) which Redisson implements to avoid racing on the client. [4]

References:

  1. Redisson — Distributed collections / RMap (behavior and API). [1]
  2. Redisson — FAQ (thread-safety and use of locks). [2]
  3. Redisson — Transactions (RMap participates in transactions; isolation READ_COMMITTED). [3]
  4. Redisson Javadoc — RMapCache.computeIfAbsent and putIfAbsent helpers. [4]

Guard read-modify-write with a distributed lock or server-side atomic operation

RMap.compute is a client-side callback and not cluster-atomic. Wrap the compute call in an RLock keyed by projectId—or switch to a Redisson RTransaction or a server-side entry processor/Lua script—to enforce atomic read-modify-write across nodes.
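A hedged sketch of the RLock-guarded variant (assumes a redissonClient field is available in the manager; names mirror the snippet above):

private fun tryLockUsingRedisson(batchJobDto: BatchJobDto): Boolean {
  // Serialize the read-modify-write per project across all nodes.
  val lock = redissonClient.getLock("project_batch_job_locks:lock:${batchJobDto.projectId}")
  lock.lock()
  try {
    val map = getRedissonProjectLocks()
    val current = map[batchJobDto.projectId] ?: emptySet()
    val updated = computeFnBody(batchJobDto, current)
    map[batchJobDto.projectId] = updated
    return updated.contains(batchJobDto.id)
  } finally {
    lock.unlock()
  }
}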

Comment on lines +11 to +12
  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
  var projectConcurrency: Int = 1

🛠️ Refactor suggestion | 🟠 Major

Validate concurrency properties (prevent 0/negative).

If set to 0 or negative, project-level locking behavior becomes inconsistent. Add bean validation and validate both properties.

 package io.tolgee.configuration.tolgee
@@
-import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.validation.annotation.Validated
+import jakarta.validation.constraints.Min
@@
-@ConfigurationProperties(prefix = "tolgee.batch")
+@ConfigurationProperties(prefix = "tolgee.batch")
+@Validated
 @DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
 class BatchProperties {
   @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
-  var concurrency: Int = 1
+  @field:Min(1)
+  var concurrency: Int = 1
   @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
-  var projectConcurrency: Int = 1
+  @field:Min(1)
+  var projectConcurrency: Int = 1
 }
📝 Committable suggestion


Suggested change

-  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
-  var projectConcurrency: Int = 1
+package io.tolgee.configuration.tolgee
+
+import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.validation.annotation.Validated
+import jakarta.validation.constraints.Min
+
+@Validated
+@ConfigurationProperties(prefix = "tolgee.batch")
+@DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
+class BatchProperties {
+  @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
+  @field:Min(1)
+  var concurrency: Int = 1
+
+  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
+  @field:Min(1)
+  var projectConcurrency: Int = 1
+}
🤖 Prompt for AI Agents
In
backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt
around lines 11-12, the projectConcurrency (and any other concurrency properties
in this class) must be validated to prevent 0 or negative values; add bean
validation annotations (e.g., @Min(1) or @Positive) to the property declarations
and ensure the class is annotated/configured for validation (e.g., @Validated or
@ConfigurationProperties + enable JSR-303 validation) so Spring will enforce the
constraint at startup; apply the same validation annotation to any other
concurrency-related properties in this file.

@bdshadow changed the title from "fix batch job processing" to "fix: batch job processing" on Oct 3, 2025
@bdshadow commented Oct 5, 2025

@ajeans hello!
I think it's worth splitting the commits here into two separate pull requests. Maybe we can keep this PR for reference, create a separate one just for the batch job fix, and after that is merged, continue with a new one that allows multiple jobs per project.

Let's take the following commits into the first PR.
[Screenshot: list of the selected commits]

General comments:

  1. Please fix the commit messages to follow the convention.
  2. I think the .setMaxResults(1_000) limit in BatchJobChunkExecutionQueue is worth exposing as a configuration property in io.tolgee.configuration.tolgee.BatchProperties.
  3. In the current state, the new ProjectBatchLockControllerTest is failing. Please make sure it passes in the new PRs.

Then we can continue the review in those separate PRs.

@ajeans commented Oct 6, 2025

@bdshadow Hello 👋

Sounds good, I'll create new PR #1 today or tomorrow

@ajeans commented Oct 6, 2025

@bdshadow for internal security reasons, I cannot push the branch to the tolgee repo from my corporate laptop (sigh).

Can you please create the PR from this branch: https://github.com/RakutenFrance/tolgee-platform/commits/fix/batch-job-processing?

Tests and ktlint should be fine, and I amended the commit names.

@bdshadow commented Oct 6, 2025

@ajeans great! Here it is: #3265

@bdshadow marked this pull request as a draft on October 6, 2025 19:18
@JanCizmar commented

I guess we can close this one! Thanks for the contribution. It's much appreciated ❤️

I hope we will get the batch jobs stable and fast soon! 🚀

@bdshadow commented Oct 10, 2025

I will still use this issue for communication.
The first PR is merged. @ajeans, please create the next PR. On my side, I also have a task to fix jobs that hang when a job is cancelled. I'm getting back to it after your first fix to check whether it helped. I hope all these fixes will make batch processing much more stable.

@ajeans commented Oct 11, 2025

Hello @bdshadow

Excellent. I created the second pull request minutes ago; it is #3271.

Let's have the review there, and we can keep this thread for further discussion.

Regards
