fix: batch job processing #3262
base: main
Conversation
each item will be entirely inefficient. Improve the situation with two changes:

- Hardcode a max of 1_000 chunks to be added to the internal queue
- Prioritize RUNNING job chunks in queue population

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
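For illustration, a minimal sketch of what bounded, RUNNING-first queue population could look like (the types and function names are invented for this example, not the actual Tolgee code):

```kotlin
// Illustrative stand-ins — the real entities live in io.tolgee.batch.
enum class JobStatus { PENDING, RUNNING }
data class ChunkItem(val id: Long, val jobStatus: JobStatus, val createdAt: Long)

const val MAX_QUEUED_CHUNKS = 1_000

// Take at most 1_000 chunks, with chunks of RUNNING jobs first, then oldest first.
fun selectChunksForQueue(pending: List<ChunkItem>): List<ChunkItem> =
    pending
        .sortedWith(
            compareByDescending<ChunkItem> { it.jobStatus == JobStatus.RUNNING }
                .thenBy { it.createdAt },
        )
        .take(MAX_QUEUED_CHUNKS)
```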
Remove hardcoded 30-second timeout override and use [60, 120] configuration. This resolves the issue where OpenAI calls were timing out at 30s on the same chunks over and over. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
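As a rough illustration of attempt-indexed timeouts (the `attemptTimeouts` list and retry loop below are assumptions for the example; the real mapping happens inside Tolgee's LLM provider services via `Duration.ofSeconds`):

```kotlin
import java.time.Duration

// Hypothetical: each entry is the read timeout for one attempt — 60 s, then 120 s.
val attemptTimeouts = listOf(60L, 120L)

fun <T> callWithGrowingTimeouts(call: (Duration) -> T): T {
    var lastError: Exception? = null
    for (seconds in attemptTimeouts) {
        try {
            return call(Duration.ofSeconds(seconds))
        } catch (e: Exception) {
            lastError = e // assume a timeout or transient error; retry with a longer budget
        }
    }
    throw lastError ?: IllegalStateException("no attempts configured")
}
```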
Add logback-spring.xml with profile-based configuration supporting both standard and JSON log formats. JSON logging enabled with 'json-logging' profile for improved log analysis and monitoring capabilities. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Add administration endpoints for viewing and managing project batch job locks:

- GET /v2/administration/project-batch-locks — retrieve all project locks
- GET /v2/administration/batch-job-queue — returns all chunk execution items

Features:

- Admin-only access with @RequiresSuperAuthentication
- Works with both Redis and local storage configurations
- Provides lock status (UNLOCKED/UNINITIALIZED/LOCKED) and job details
- Comprehensive test coverage
- Full OpenAPI documentation

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
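The lock-status derivation described above plausibly reduces to a three-way check over the lock map entry (an illustrative sketch; the null-means-uninitialized mapping is an assumption, not the controller's actual code):

```kotlin
enum class LockStatus { UNLOCKED, UNINITIALIZED, LOCKED }

// Assumption: a missing map entry means the locking manager has never seen the project.
fun deriveLockStatus(lockedJobIds: Set<Long>?): LockStatus =
    when {
        lockedJobIds == null -> LockStatus.UNINITIALIZED
        lockedJobIds.isEmpty() -> LockStatus.UNLOCKED
        else -> LockStatus.LOCKED
    }
```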
The previous getInitialJobId() logic created phantom locks by assuming jobs with fewer unlocked chunks were "running", when in reality they were just PENDING jobs with chunks sitting in local queues but never executing.

1. populateQueue() adds chunks to local queues (counted as "locked")
2. getInitialJobId() sees locked chunks → assumes job is "running"
3. Creates phantom lock pointing to PENDING job
4. All subsequent jobs fail project lock check → returned to queue
5. No job ever reaches handleItem() → no job becomes RUNNING
6. Phantom lock persists indefinitely → system deadlocked

Now prioritizes database job status over chunk counting:

1. First: look for jobs with status=RUNNING (truly active jobs)
2. Fallback: use original chunk-counting logic for edge cases
3. If neither found: return null (allow new job to acquire lock)

This prevents phantom locks from PENDING jobs while preserving protection for legitimately running jobs. The fix should break the current deadlock where 203k+ PENDING jobs are stuck due to phantom project locks. Expected result: batch jobs will start processing and populate LLM metrics.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
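A sketch of the repaired selection order (the two helper functions are placeholders for the real repository queries; only the ordering mirrors the commit message):

```kotlin
// Hypothetical helpers — stand-ins for the actual database queries.
fun findRunningJobIds(projectId: Long): List<Long> = TODO("query jobs with status = RUNNING")
fun findJobIdByLockedChunks(projectId: Long): Long? = TODO("original chunk-counting heuristic")

fun getInitialJobId(projectId: Long): Long? {
    // 1. Prefer jobs the database already marks RUNNING — truly active jobs.
    findRunningJobIds(projectId).firstOrNull()?.let { return it }
    // 2. Fall back to the original chunk-counting logic for edge cases.
    findJobIdByLockedChunks(projectId)?.let { return it }
    // 3. Neither found: return null so a new job may acquire the project lock.
    return null
}
```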
Target the full public GitHub Action to keep the repository a fork of tolgee/tolgee.
Replace old format Map<Long, Long> with new format Map<Long, Set<Long>> to support configurable concurrent jobs per project. Migration automatically runs on startup and performs one-time data conversion. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
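A minimal sketch of such a one-time conversion (assuming, per the old test assertions, that 0 was the "unlocked" sentinel in the single-lock format):

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Convert each old single job id into a set; treat the assumed 0L sentinel as "no locks".
fun migrateLocks(old: Map<Long, Long>): ConcurrentHashMap<Long, Set<Long>> {
    val migrated = ConcurrentHashMap<Long, Set<Long>>()
    old.forEach { (projectId, jobId) ->
        migrated[projectId] = if (jobId == 0L) emptySet() else setOf(jobId)
    }
    return migrated
}
```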
Walkthrough

Adds a release GitHub Actions workflow, introduces admin endpoints to inspect/clear project batch locks and view the batch-job queue, refactors project lock management to support multiple concurrent locks per project with migration logic, adds a projectConcurrency property, adjusts queue ordering/limits, updates logging, adds tests, and tweaks OpenAI retry attempts.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Admin
    participant API as ProjectBatchLockController
    participant LM as BatchJobProjectLockingManager
    participant BS as BatchJobService
    Admin->>API: GET /v2/administration/project-batch-locks
    API->>LM: getMap()
    LM-->>API: Map<ProjectId, Set<JobId>>
    loop For each locked JobId
        API->>BS: getJobDto(jobId)
        alt Success
            BS-->>API: JobInfo
        else Error
            BS-->>API: Error (handled)
        end
    end
    API-->>Admin: Collection<ProjectLockModel>
    note over API: lockStatus derived: LOCKED/UNLOCKED
```

```mermaid
sequenceDiagram
    autonumber
    actor Admin
    participant API as ProjectBatchLockController
    participant Q as BatchJobChunkExecutionQueue
    Admin->>API: GET /v2/administration/batch-job-queue
    API->>Q: getAllQueueItems()
    Q-->>API: List<ExecutionQueueItem>
    API-->>Admin: Collection<QueueItemModel>
```

```mermaid
sequenceDiagram
    autonumber
    participant Launcher
    participant LM as BatchJobProjectLockingManager
    participant Redis as Redis Map
    participant DB as Batch Jobs DB
    participant Props as BatchProperties
    Launcher->>LM: tryLock(projectId, jobId)
    LM->>Props: read projectConcurrency
    alt Redis enabled
        LM->>Redis: compute(projectId){ add jobId if set size < concurrency }
        Redis-->>LM: Updated Set<JobId>
    else Local map
        LM->>LM: update local Set<JobId> if size < concurrency
    end
    LM-->>Launcher: lock acquired? (boolean)
    opt First run / migration
        LM->>Redis: scan old Long values -> convert to Set<Long>
        Redis-->>LM: migrated entries
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (3)
307-309: Fix assertion to new Set API. `getLockedForProject(...)` now returns a Set.

```diff
- fun assertJobUnlocked() {
-   batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(0L)
- }
+ fun assertJobUnlocked() {
+   batchJobProjectLockingManager
+     .getLockedForProject(testData.projectBuilder.self.id)
+     .assert.isEmpty()
+ }
```
385-387: Fix verifyJobLocked to match Set. Adjust to assert the set contains only the expected job id.

```diff
- fun verifyJobLocked(job: BatchJob) {
-   batchJobProjectLockingManager.getLockedForProject(testData.projectBuilder.self.id).assert.isEqualTo(job.id)
- }
+ fun verifyJobLocked(job: BatchJob) {
+   batchJobProjectLockingManager
+     .getLockedForProject(testData.projectBuilder.self.id)
+     .assert.isEqualTo(setOf(job.id))
+ }
```
1-478: Update stale getLockedForProject assertions in BatchJobTestUtil.

- In `assertJobUnlocked` (line 308), replace `batchJobProjectLockingManager.getLockedForProject(...).assert.isEqualTo(0L)` with `batchJobProjectLockingManager.getLockedForProject(...).assert.isEmpty()`
- In `verifyJobLocked` (line 386), replace `batchJobProjectLockingManager.getLockedForProject(...).assert.isEqualTo(job.id)` with `batchJobProjectLockingManager.getLockedForProject(...).assert.contains(job.id)`
🧹 Nitpick comments (8)
backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt (1)
48-53: Skip warn when no stuck jobs. Avoid a warn for 0 items; an early return keeps logs clean.

```diff
 private fun handleStuckJobs() {
-  val stuckJobIds = batchJobService.getStuckJobIds(batchJobStateProvider.getCachedJobIds())
-  logger.warn("Removing stuck job state using scheduled task for ${stuckJobIds.size} jobs")
+  val stuckJobIds = batchJobService.getStuckJobIds(batchJobStateProvider.getCachedJobIds())
+  if (stuckJobIds.isEmpty()) return
+  logger.warn("Removing stuck job state using scheduled task for ${stuckJobIds.size} jobs")
   stuckJobIds.forEach {
     logger.trace("Removing stuck job state using scheduled task for job $it")
     batchJobStateProvider.removeJobState(it)
   }
 }
```

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)
175-193: Add jitter/exponential backoff when project lock unavailable. A constant +1 s requeue can cause synchronized thundering. Add jitter or exponential backoff tied to a retry counter to spread retries under contention.

Example adjustment:

- Increment a retry counter on the item (reuse managementErrorRetrials or add a dedicated field).
- Delay = base * 2^retries with cap, plus small random jitter.

```kotlin
val retries = (executionItem.managementErrorRetrials + 1).coerceAtMost(6)
val backoff = (1_000L shl (retries - 1)).coerceAtMost(10_000L)
val jitter = (200L..600L).random()
it.managementErrorRetrials = retries
it.executeAfter = currentDateProvider.date.time + backoff + jitter
```

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (1)
66-71: Queue ordering and limiting look good. RUNNING-first + stable ordering and max 1000 improves fairness and latency.

Consider adding/confirming DB indexes to support this query efficiently (a declaration sketch follows the list):

- On BatchJobChunkExecution(status, createdAt, executeAfter, id)
- On BatchJob(status) for the join

Also applies to: 77-81
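In JPA terms such an index could be declared roughly like this (illustrative only — the column names are guesses, and Tolgee may manage its schema through migrations rather than annotations):

```kotlin
import jakarta.persistence.Entity
import jakarta.persistence.Id
import jakarta.persistence.Index
import jakarta.persistence.Table

// Sketch entity, not the real one — it only shows the index declaration shape.
@Entity
@Table(
    name = "batch_job_chunk_execution",
    indexes = [Index(name = "idx_chunk_queue_order", columnList = "status, created_at, execute_after, id")],
)
class BatchJobChunkExecutionSketch(
    @Id
    val id: Long = 0,
)
```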
backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (1)
38-54: LGTM: lock overview endpoint. Builds per-project models with defensive job info retrieval. Looks good.

Consider snapshotting the map first to avoid mid-iteration changes:

```kotlin
val locks = batchJobProjectLockingManager.getMap().toMap()
```

backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1)
1-136: Optional: assert response content for stronger guarantees. Add JSON assertions (e.g., size, IDs) to validate payload shape.
backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (3)
58-68: Tidy up duplicate unlock logs. The same debug message is logged twice when the job is present; keep one.

```diff
 getMap().compute(projectId) { _, lockedJobIds ->
-  logger.debug("Unlocking job: $jobId for project $projectId")
   val currentJobs = lockedJobIds ?: emptySet()
   if (currentJobs.contains(jobId)) {
     logger.debug("Unlocking job: $jobId for project $projectId")
     val updatedJobs = currentJobs - jobId
     return@compute updatedJobs.ifEmpty { emptySet() }
   }
```
161-186: Reduce DB round-trips in getInitialJobId (optional). You call getJobDto per job to check status. If possible, fetch statuses in one query or extend getAllIncompleteJobIds to include status to avoid N calls; a possible shape is sketched below.
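A possible shape for that single-query fetch (the projection, repository interface, and JPQL below are assumptions for illustration, not existing Tolgee code; only the `BatchJob` entity is assumed to exist):

```kotlin
import org.springframework.data.jpa.repository.JpaRepository
import org.springframework.data.jpa.repository.Query
import org.springframework.data.repository.query.Param

// Hypothetical interface projection: one row per job, fetched in a single query.
interface BatchJobStatusView {
    val id: Long
    val status: String
}

interface BatchJobStatusRepository : JpaRepository<BatchJob, Long> {
    @Query("select j.id as id, j.status as status from BatchJob j where j.id in :ids")
    fun findStatusesByIds(@Param("ids") ids: Collection<Long>): List<BatchJobStatusView>
}
```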
192-226: Guard migration with a Redis lock to avoid multi-node races. If multiple nodes start concurrently, they could both migrate. Use an RLock around the migration section.

```kotlin
val lock = redissonClient.getLock("project_batch_job_locks:migration")
if (lock.tryLock()) {
  try {
    // migration body...
  } finally {
    lock.unlock()
  }
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/workflows/forky-release.yml (1 hunks)
- backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (1 hunks)
- backend/app/src/main/resources/logback-spring.xml (1 hunks)
- backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1 hunks)
- backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1 hunks)
- backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (2 hunks)
- backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (2 hunks)
- backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (4 hunks)
- backend/data/src/main/kotlin/io/tolgee/batch/cleaning/ScheduledJobCleaner.kt (1 hunks)
- backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt (1 hunks)
- ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (1)
backend/testing/src/main/kotlin/io/tolgee/testing/AuthorizedControllerTest.kt (3)
- performAuthGet (130-133)
- performAuthPut (114-120)
- performAuthDelete (135-141)
🪛 actionlint (1.7.7)
.github/workflows/forky-release.yml
23-23: the runner of "actions/setup-node@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Frontend static check 🪲
- GitHub Check: Build backend 🏗️
🔇 Additional comments (11)
ee/backend/app/src/main/kotlin/io/tolgee/ee/component/llm/OpenaiApiService.kt (1)
27-27: Clarify and validate timeout values. The `defaultAttempts()` list items map to RestTemplate read timeouts in seconds (via `Duration.ofSeconds`). You've increased this from `[30]` to `[60, 120]`. Confirm these durations align with OpenAI's typical response times, SLAs, and your intended backoff strategy.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobConcurrentLauncher.kt (1)
104-106: Good call on silencing the spammy trace.

backend/app/src/test/kotlin/io/tolgee/batch/BatchJobTestUtil.kt (1)
411-413: LGTM: updated to the new Set return type.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt (1)
203-205: getAllQueueItems() snapshot is fine. Returns a safe snapshot for admin inspection; no hot path impact.
backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt (2)
85-107: LGTM: queue inspection endpoint. Simple mapping over a snapshot; appropriate for admin use.
24-25: Avoid permissive CORS on admin endpoints. `@CrossOrigin(origins = ["*"])` on super-admin APIs is risky. Prefer removing it or restricting to configured origins.

```diff
-@CrossOrigin(origins = ["*"])
 @RequestMapping("/v2/administration")
```
Optionally, wire allowed origins via configuration and apply at a global CORS config rather than per-controller.
⛔ Skipped due to learnings
Learnt from: dkrizan
PR: tolgee/tolgee-platform#3092
File: backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/LabelsController.kt:32-32
Timestamp: 2025-06-04T15:04:55.253Z
Learning: In the Tolgee platform codebase, the team intentionally uses `CrossOrigin(origins = ["*"])` to allow all origins in their controllers. This is their standard practice and should not be flagged as a security concern in future reviews.
backend/app/src/test/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockControllerTest.kt (2)
48-81: Good coverage for lock listing with/without super auth.
103-126: Queue endpoint tests are fine; mapping aligns with data model.

backend/data/src/main/kotlin/io/tolgee/batch/BatchJobProjectLockingManager.kt (3)
128-156: Lock set computation looks correct under a positive limit. With validated projectConcurrency ≥ 1, the paths are consistent. No action needed beyond property validation.
Edge case: If projectConcurrency were misconfigured to 0, initial lock path would still allow one job. Property validation (suggested in BatchProperties) avoids this.
188-191: Typed access to the Redis map is fine. Returning ConcurrentMap<Long, Set<Long>> keeps usage consistent across callers.
228-230: LGTM: getLockedJobIds aggregates over Set values.
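That aggregation plausibly reduces to flattening the per-project sets (an illustrative one-liner, not necessarily the actual implementation):

```kotlin
fun getLockedJobIds(locks: Map<Long, Set<Long>>): Set<Long> =
    locks.values.flatten().toSet()
```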
```yaml
      - name: Setup node
        uses: actions/setup-node@v3
        with:
          node-version: "22.x"
```
Upgrade to a supported setup-node version. `actions/setup-node@v3` still runs on the deprecated Node16 runtime and is blocked on ubuntu-24.04 runners, so the workflow fails before any build logic runs. Please bump to `@v4` (or later).

```diff
- - name: Setup node
-   uses: actions/setup-node@v3
+ - name: Setup node
+   uses: actions/setup-node@v4
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
      - name: Setup node
        uses: actions/setup-node@v4
        with:
          node-version: "22.x"
```
🧰 Tools
🪛 actionlint (1.7.7)
23-23: the runner of "actions/setup-node@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
.github/workflows/forky-release.yml around lines 22-26: actions/setup-node is
pinned to @v3 which uses deprecated Node16 runtime and fails on ubuntu-24.04;
update the action to a supported major by changing the uses line to
actions/setup-node@v4 (or later) and keep or adjust the node-version input
(e.g., "22.x") as needed so the runner uses a supported setup-node
implementation.
```sh
echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
docker build . -t ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }} --platform linux/amd64
docker push ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }}
```
Fix the Docker login username. `-u $` expands to an empty string, so `docker login` refuses the authentication and the push never happens. Use a real username (e.g. `github.actor`) and keep the token on stdin.

```diff
- echo ${{ secrets.GITHUB_TOKEN }} | docker login ${{ env.DOCKER_REGISTRY }} -u $ --password-stdin
+ echo "${{ secrets.GITHUB_TOKEN }}" | docker login "${{ env.DOCKER_REGISTRY }}" -u "${{ github.actor }}" --password-stdin
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
echo "${{ secrets.GITHUB_TOKEN }}" | docker login "${{ env.DOCKER_REGISTRY }}" -u "${{ github.actor }}" --password-stdin
docker build . -t ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }} --platform linux/amd64
docker push ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_PATH }}:${{ github.event.inputs.releaseVersion }}
```
🤖 Prompt for AI Agents
.github/workflows/forky-release.yml around lines 50-52: the Docker login uses
`-u $` which expands to an empty username and breaks authentication; replace the
empty username with a real actor variable (e.g. `-u ${{ github.actor }}`) while
continuing to pipe the token via `--password-stdin` so the command becomes: echo
the token | docker login <registry> -u ${{ github.actor }} --password-stdin,
then keep the build and push steps unchanged.
```kotlin
/**
 * REST API for managing project batch job locks
 */
@RestController
@CrossOrigin(origins = ["*"])
@RequestMapping("/v2/administration")
@Tag(
  name = "Server Administration",
  description = "**Only for self-hosted instances** \n\n" +
    "Management of project-level batch job locks and queue inspection for debugging and maintenance."
)
@OpenApiSelfHostedExtension
class ProjectBatchLockController(
  private val batchJobProjectLockingManager: BatchJobProjectLockingManager,
  private val batchJobService: BatchJobService,
  private val batchJobChunkExecutionQueue: BatchJobChunkExecutionQueue,
) : IController, Logging {
```
🧩 Analysis chain
Verify presence of PUT/DELETE endpoints used in tests.
Tests call:
- PUT /v2/administration/project-batch-locks/{id}/clear
- DELETE /v2/administration/project-batch-locks/{id}
They are not in this controller. Ensure they exist elsewhere or add them here.
🏁 Script executed:
```bash
#!/bin/bash
# Search for project-batch-locks admin endpoints in the codebase
rg -nP -C3 --type=kotlin '@(PutMapping|DeleteMapping)\s*\(\s*"/v2/administration/project-batch-locks'
rg -nP -C3 --type=kotlin 'RequestMapping\(.*/v2/administration/project-batch-locks'
```

Length of output: 0
🏁 Script executed:
```bash
#!/bin/bash
# Broad search for any mappings or references to project-batch-locks
rg -nP --type=kotlin '@\w+Mapping\s*\(\s*"/v2/administration/project-batch-locks' -C3
rg -n --type=kotlin 'project-batch-locks' -C3
```

Length of output: 5387
Add missing PUT and DELETE mappings in ProjectBatchLockController. Tests call PUT /v2/administration/project-batch-locks/{id}/clear and DELETE /v2/administration/project-batch-locks/{id}, but neither handler is implemented in ProjectBatchLockController.kt. Implement these endpoints to satisfy the tests.
🤖 Prompt for AI Agents
In
backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
around lines 20-36, add two REST handlers: a PUT at
/v2/administration/project-batch-locks/{id}/clear and a DELETE at
/v2/administration/project-batch-locks/{id}. The PUT handler should accept the
path variable id (project id), call the batchJobProjectLockingManager method
that clears/releases the project lock, and also ensure any related queued chunks
or tracking are cleared via batchJobChunkExecutionQueue and/or batchJobService
as appropriate, then return a 204/OK response. The DELETE handler should accept
the path variable id, remove any persistent batch-job records for that project
via batchJobService and clear related queue entries via
batchJobChunkExecutionQueue and/or batchJobProjectLockingManager, and return an
appropriate 204/OK response. Annotate both methods with
@PutMapping/@DeleteMapping and @PathVariable, handle missing/not-found cases
with suitable HTTP statuses, and add minimal logging consistent with the
controller.
```xml
<!-- JSON console appender using standard logback pattern -->
<appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <encoder>
    <pattern>{"timestamp":"%d{yyyy-MM-dd'T'HH:mm:ss.SSSZ}","level":"%p","logger":"%c","message":"%m","thread":"%t","exception":"%ex{full}"}%n</pattern>
  </encoder>
</appender>

<root level="INFO">
  <appender-ref ref="JSON_CONSOLE"/>
</root>
```
Use a dedicated JSON encoder instead of a raw pattern. This pattern layout only resembles JSON; `%m`, `%c`, `%ex{full}` etc. will happily inject quotes, backslashes, and newlines without escaping, so any log containing `"` or a stack trace breaks consumers expecting valid JSON. Please switch to a real JSON encoder (e.g., `ch.qos.logback.contrib.json.classic.JsonLayout`, `net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder`, or Spring Boot's default `LoggingSystemProperties` with `LOGGING_LEVEL_PATTERN`) or add a layout that handles escaping.
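For instance, the appender could delegate to a real JSON encoder (a sketch, assuming the logstash-logback-encoder dependency is added to the build):

```xml
<!-- Sketch: LogstashEncoder escapes messages and stack traces into valid JSON. -->
<appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```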
🤖 Prompt for AI Agents
In backend/app/src/main/resources/logback-spring.xml around lines 17 to 26, the
ConsoleAppender currently uses a raw pattern string that can produce invalid
JSON when messages or stack traces contain quotes/newlines; replace the
pattern-based encoder with a dedicated JSON encoder (for example use
net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder or
ch.qos.logback.contrib.json.classic.JsonLayout) and configure it to emit
timestamp, level, logger, thread, message and exception fields with proper
escaping/stacktrace handling; remove the pattern block and add the appropriate
encoder/layout element and required dependencies so consumers always receive
well-formed JSON.
```kotlin
@Test
fun `PUT clear project lock works with super auth`() {
  val testLocks = ConcurrentHashMap<Long, Set<Long>>()
  whenever(batchJobProjectLockingManager.getMap()).thenReturn(testLocks)

  performAuthPut("/v2/administration/project-batch-locks/123/clear", null)
    .andIsOk
}

@Test
fun `DELETE project lock works with super auth`() {
  val testLocks = ConcurrentHashMap<Long, Set<Long>>().apply {
    put(123L, setOf(456L))
  }
  whenever(batchJobProjectLockingManager.getMap()).thenReturn(testLocks)

  performAuthDelete("/v2/administration/project-batch-locks/123")
    .andIsOk
}
```
🧩 Analysis chain
PUT/DELETE tests assume endpoints exist. Please verify.
🏁 Script executed:
```bash
#!/bin/bash
# Verify controller mappings exist for the tested routes
rg -nP --type=kotlin -C3 '@PutMapping\("/v2/administration/project-batch-locks/.*/clear"\)'
rg -nP --type=kotlin -C3 '@DeleteMapping\("/v2/administration/project-batch-locks/.*"\)'
```

Length of output: 0
🏁 Script executed:
```bash
#!/bin/bash
# Find controller files related to project batch locking endpoints
fd -IH 'ProjectBatchLock*Controller*' -e kt
# Search for mapping annotations for project-batch-locks endpoints
rg -n '/v2/administration/project-batch-locks' -g '*.kt'
# Search for usage of batchJobProjectLockingManager in code
rg -n 'batchJobProjectLockingManager' -g '*.kt'
```

Length of output: 3796
🏁 Script executed:
#!/bin/bash
rg -n '@PutMapping' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
rg -n '@DeleteMapping' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
Length of output: 0
🏁 Script executed:
#!/bin/bash
rg -nE '@(RequestMapping|GetMapping|PutMapping|DeleteMapping)' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
Length of output: 90
🏁 Script executed:
```bash
#!/bin/bash
# Print the beginning of the controller to inspect class-level mapping
sed -n '1,200p' backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
# Look for class-level RequestMapping
rg -n '@RequestMapping' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
# Search for any 'clear' usage
rg -n 'clear' -C3 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
# Search for HTTP delete handler
rg -n 'fun .*delete' -C2 backend/api/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
```

Length of output: 5018
Implement missing PUT/DELETE endpoints for project batch locks. ProjectBatchLockController.kt currently only defines GET mappings; the tests for PUT /v2/administration/project-batch-locks/{id}/clear and DELETE /v2/administration/project-batch-locks/{id} will fail. Add corresponding @PutMapping and @DeleteMapping methods (with @RequiresSuperAuthentication) to handle "clear" and "delete" operations.
🤖 Prompt for AI Agents
In
backend/app/src/main/kotlin/io/tolgee/api/v2/controllers/ProjectBatchLockController.kt
around where GET mappings are defined (add new methods corresponding to tests
roughly next to existing endpoints), implement two new endpoints: add a
@PutMapping("/v2/administration/project-batch-locks/{id}/clear") method
annotated with @RequiresSuperAuthentication that reads the id path variable,
retrieves batchJobProjectLockingManager.getMap(), and if an entry exists clears
the Set for that id (e.g. replace with emptySet or clear the collection) and
returns an OK response; add a
@DeleteMapping("/v2/administration/project-batch-locks/{id}") method annotated
with @RequiresSuperAuthentication that removes the map entry for the given id
via batchJobProjectLockingManager.getMap().remove(id) and returns an OK
response; ensure proper path variable types and return types match other
controller methods (e.g. ResponseEntity or Unit mapped to 200) and include
imports/annotations as needed.
```kotlin
val computedJobIds =
  getRedissonProjectLocks().compute(projectId) { _, lockedJobIds ->
    val newLockedJobIds = computeFnBody(batchJobDto, lockedJobIds ?: emptySet())
    logger.debug(
      "While trying to lock on redis {} for project {} new lock value is {}",
      batchJobDto.id,
      batchJobDto.projectId,
      newLockedJobIds
    )
    newLockedJobIds
  } ?: emptySet()
return computedJobIds.contains(batchJobDto.id)
```
🧩 Analysis chain
Confirm atomicity of Redisson RMap.compute across nodes.
Concurrency guarantees are critical to enforce projectConcurrency. Please confirm RMap.compute is cluster-atomic; otherwise, guard with an RLock per project key or switch to a server-side entry-processor/Lua.
🌐 Web query:
Is Redisson RMap.compute atomic across the cluster? What is the recommended approach to perform an atomic read-modify-write on a map entry with Redisson?
💡 Result:
Short answer: No — RMap.compute is a client-side Java callback and is not guaranteed atomic across the cluster. The safe approaches are to perform the read-modify-write inside a Redis-backed primitive or server-side transaction/lock. Recommended options:
- Acquire a Redis lock for the key (RLock) around your read-modify-write. This serializes updates across clients/nodes. [1][2]
- Use Redisson transactions (RTransaction) so the map participates with ACID properties and changes are committed atomically. [3]
- For simple “compute-if-absent” semantics prefer the server-side helper methods (e.g. computeIfAbsent on RMapCache / putIfAbsent / fastPutIfAbsent) which Redisson implements to avoid racing on the client. [4]
References:
- Redisson — Distributed collections / RMap (behavior and API). [1]
- Redisson — FAQ (thread-safety and use of locks). [2]
- Redisson — Transactions (RMap participates in transactions; isolation READ_COMMITTED). [3]
- Redisson Javadoc — RMapCache.computeIfAbsent and putIfAbsent helpers. [4]
Guard read-modify-write with a distributed lock or server-side atomic operation
RMap.compute is a client-side callback and not cluster-atomic. Wrap the compute call in an RLock keyed by projectId—or switch to a Redisson RTransaction or a server-side entry processor/Lua script—to enforce atomic read-modify-write across nodes.
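A sketch of that lock-guarded read-modify-write with Redisson (the map and lock names are assumptions for illustration):

```kotlin
import org.redisson.api.RedissonClient

// Serialize the read-modify-write per project with an RLock, so the
// size check and the update happen atomically across nodes.
fun tryLockProject(redisson: RedissonClient, projectId: Long, jobId: Long, limit: Int): Boolean {
    val lock = redisson.getLock("project_batch_job_locks:guard:$projectId")
    lock.lock()
    try {
        val map = redisson.getMap<Long, Set<Long>>("project_batch_job_locks")
        val current = map[projectId] ?: emptySet()
        if (jobId in current) return true          // already holds the lock
        if (current.size >= limit) return false    // concurrency limit reached
        map[projectId] = current + jobId
        return true
    } finally {
        lock.unlock()
    }
}
```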
```kotlin
@DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
var projectConcurrency: Int = 1
```
🛠️ Refactor suggestion | 🟠 Major
Validate concurrency properties (prevent 0/negative).
If set to 0 or negative, project-level locking behavior becomes inconsistent. Add bean validation and validate both properties.
```diff
 package io.tolgee.configuration.tolgee
@@
-import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.boot.context.properties.ConfigurationProperties
+import org.springframework.validation.annotation.Validated
+import jakarta.validation.constraints.Min
@@
-@ConfigurationProperties(prefix = "tolgee.batch")
+@ConfigurationProperties(prefix = "tolgee.batch")
+@Validated
 @DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
 class BatchProperties {
   @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
-  var concurrency: Int = 1
+  @field:Min(1)
+  var concurrency: Int = 1
   @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
-  var projectConcurrency: Int = 1
+  @field:Min(1)
+  var projectConcurrency: Int = 1
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```kotlin
package io.tolgee.configuration.tolgee

import org.springframework.boot.context.properties.ConfigurationProperties
import org.springframework.validation.annotation.Validated
import jakarta.validation.constraints.Min

@Validated
@ConfigurationProperties(prefix = "tolgee.batch")
@DocProperty(description = "Configuration of batch operations.", displayName = "Batch operations")
class BatchProperties {
  @DocProperty(description = "How many parallel jobs can be run at once on single Tolgee instance")
  @field:Min(1)
  var concurrency: Int = 1

  @DocProperty(description = "How many parallel jobs can be run at once per project across all Tolgee instances")
  @field:Min(1)
  var projectConcurrency: Int = 1
}
```
🤖 Prompt for AI Agents
In
backend/data/src/main/kotlin/io/tolgee/configuration/tolgee/BatchProperties.kt
around lines 11-12, the projectConcurrency (and any other concurrency properties
in this class) must be validated to prevent 0 or negative values; add bean
validation annotations (e.g., @Min(1) or @Positive) to the property declarations
and ensure the class is annotated/configured for validation (e.g., @Validated or
@ConfigurationProperties + enable JSR-303 validation) so Spring will enforce the
constraint at startup; apply the same validation annotation to any other
concurrency-related properties in this file.
@ajeans hello! Let's take the following commits to the first PR. General comments:

Then we can continue with the review in these separate PRs.
@bdshadow for internal security reasons, I cannot push the branch to the tolgee repo from my corporate laptop (sigh). Can you please create the PR from this branch: https://github.com/RakutenFrance/tolgee-platform/commits/fix/batch-job-processing? Tests and ktlint should be fine, and I amended the commit names.
I guess we can close this one! Thanks for the contribution. It's much appreciated ❤️ I hope we will get the batch jobs stable and fast soon! 🚀
I will still use this issue for communication.