
fix: make PaddleOCR timeout configurable and surface real ingest progress#14050

Open
joonsoome wants to merge 7 commits into infiniflow:main from joonsoome:paddleocr-enhancement

@joonsoome

What problem does this PR solve?

PaddleOCR request timeout was effectively hardcoded, which made long-form ingest jobs such as books or large PDFs fail or stall without a clear tuning option. In addition, the ingest pipeline did not clearly surface the later chunking, embedding, and indexing stages, so jobs could look stalled, failed, or finished too early even while the backend was still working.

This PR makes the timeout configurable end-to-end, wires it through the parser and UI, adds coverage for timeout propagation, and improves progress/log reporting so ingest reflects the real pipeline state.

Type of change

  • [x] Bug Fix (non-breaking change which fixes an issue)
  • [x] New Feature (non-breaking change which adds functionality)
  • [ ] Documentation Update
  • [ ] Refactoring
  • [ ] Performance Improvement
  • [ ] Other (please describe):

Summary of changes

  • Added configurable paddleocr_request_timeout support end-to-end.
  • Wired the timeout through parser options, dataset defaults, model setup, and tests.
  • Fixed progress reporting so parsing is not marked complete before chunking, embedding, and indexing are finished.
  • Added stage-level LLM logs for keywords, questions, metadata, and embedding to make long ingest jobs easier to debug.
  • Updated the PaddleOCR UI so request timeout is shown as part of the preset-defined configuration, matching the rest of the preset-managed fields.
  • Kept docker/.env local-only by removing the tracked copy from the repository.

Validation

  • python -m py_compile ...
  • cd web && npm exec eslint src/components/paddleocr-options-form-field.tsx

Notes

  • cd web && npm exec tsc --noEmit currently fails because of an existing repository tsconfig.json deprecation flag configuration, unrelated to this PR.

Screenshot

PaddleOCR-VL can take longer than 600 seconds when it processes documents with many pages, such as books, or when it runs custom models. In those cases I received a failure message without any logs, and paddleOCR-api remained silent. On deeper debugging, I found that the vLLM container was still processing the request. Tracing back through the code showed that RAGFlow had a hardcoded timeout of 600 seconds. After making this configurable, I was able to ingest my books as needed.
image

Screenshot 2

There was a bug where, even after setting and selecting a preset, the system required the same value to be entered again. Moreover, this input did not actually override the preset.
[Image: Screenshot 2026-04-09 21-01-07]

Therefore, I made the following changes:
[Image: Screenshot 2026-04-10 23-18-54]

root added 3 commits April 10, 2026 23:11
- resolve PaddleOCR timeout from saved config or env
- use the resolved timeout for the HTTP request
- add timeout input to the PaddleOCR models modal
@dosubot dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Apr 10, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

📝 Walkthrough


This pull request adds a configurable paddleocr_request_timeout (default 600s) into parser defaults and validation, propagates and logs that timeout through the OCR model and PaddleOCR parser request path, adjusts various parser progress callback fractions, updates frontend forms/schemas/translations to expose the setting, and adds regression tests for timeout propagation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Parser defaults & validation**<br>`api/utils/api_utils.py`, `api/utils/validation_utils.py` | Added `paddleocr_request_timeout` default (600) and `ParserConfig` field with `ge=1` validation. |
| **PaddleOCR request path & model**<br>`deepdoc/parser/paddleocr_parser.py`, `rag/llm/ocr_model.py` | Sanitized per-request `request_timeout` coercion/validation, logging of `timeout={request_timeout}s`, and use of sanitized timeout in `requests.post`. OCR model resolves/coerces config value and passes into parser. |
| **Timeout propagation in flows**<br>`rag/app/naive.py`, `rag/flow/parser/parser.py` | Resolve `request_timeout` from kwargs or `parser_config`, log source/value, and pass `request_timeout` into `pdf_parser.parse_pdf(...)`. |
| **Progress callback adjustments**<br>`deepdoc/parser/docling_parser.py`, `deepdoc/parser/pdf_parser.py`, `deepdoc/parser/tcadp_parser.py`, `rag/app/resume.py` | Lowered several progress-fraction milestones (various callbacks changed from ~0.83–1.0 down to ~0.74–0.8). No parsing logic changed. |
| **Server task executor: logging & progress**<br>`rag/svr/task_executor.py` | Introduced progress-range constants, improved structured logs for keyword/question/metadata generation and embedding, precomputed metadata schema, and adjusted batching progress interpolation. |
| **Frontend schemas, defaults & forms**<br>`web/src/components/chunk-method-dialog/index.tsx`, `web/src/components/chunk-method-dialog/use-default-parser-values.ts`, `web/src/pages/dataset/dataset-setting/form-schema.ts`, `web/src/pages/dataset/dataset-setting/index.tsx`, `web/src/pages/user-setting/setting-model/modal/paddleocr-modal/index.tsx` | Added `paddleocr_request_timeout` to validation schemas and form defaults (600), coerced to integer with min 1, added numeric input in PaddleOCR modal, and included default in dataset settings. `useDefaultParserValues` no longer depends on translations. |
| **Frontend component refactor**<br>`web/src/components/paddleocr-options-form-field.tsx` | Reworked component: removed `namePrefix` prop and react-hook-form wiring; now renders a static preset-managed summary block instead of editable inputs. |
| **Frontend types & interfaces**<br>`web/src/interfaces/database/document.ts` | Added optional `paddleocr_request_timeout?: number` to `IParserConfig`. |
| **i18n (multiple locales)**<br>`web/src/locales/...` (ar, bg, de, en, es, fr, id, it, ja, pt-br, ru, tr, vi, zh, zh-traditional) | Added translation keys for PaddleOCR request timeout labels, tips, placeholders, and a minimum-value message across locales. |
| **Tests**<br>`test/unit_test/deepdoc/parser/test_paddleocr_timeout.py` | Added regression tests validating timeout parsing/coercion in `PaddleOCROcrModel`, forwarding into `PaddleOCRParser`, and use of `request_timeout` in `_send_request` (mocked HTTP). |
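
The regression tests listed above under Tests check that the configured timeout reaches the HTTP call. The following is only a minimal sketch of that idea, not the PR's test code: `send_request`, its injectable `http_post` parameter, and the URL are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
from unittest.mock import Mock

DEFAULT_TIMEOUT = 600  # seconds; the default value named in the walkthrough


def send_request(http_post, url, payload, request_timeout=None):
    """Hypothetical stand-in for the parser's request helper.

    http_post is injected so the example can be tested without a network;
    in real code it would be something like requests.post.
    """
    timeout = request_timeout if request_timeout is not None else DEFAULT_TIMEOUT
    response = http_post(url, json=payload, timeout=timeout)
    return response.json()


def check_timeout_is_forwarded():
    """Assert that an explicit timeout is passed through to the HTTP call."""
    fake_post = Mock()
    fake_post.return_value.json.return_value = {"result": "ok"}

    result = send_request(
        fake_post, "http://ocr.invalid/ocr", {"file": "x"}, request_timeout=1800
    )

    # The mock records the call, so the forwarded timeout can be checked directly.
    fake_post.assert_called_once_with(
        "http://ocr.invalid/ocr", json={"file": "x"}, timeout=1800
    )
    return result
```

Injecting the HTTP callable is a design choice for this sketch; patching the real HTTP function with `unittest.mock.patch` would serve the same purpose.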

Sequence Diagram(s)

sequenceDiagram
    participant UI as "Frontend UI"
    participant Server as "App / RAG Flow"
    participant OCRModel as "PaddleOCROcrModel"
    participant Parser as "PaddleOCRParser"
    participant API as "PaddleOCR API"

    UI->>Server: save / start parse (parser_config includes paddleocr_request_timeout?)
    Server->>OCRModel: init (config)
    Note right of OCRModel: _resolve_int_config -> int timeout or default(600)
    OCRModel->>Parser: parse_pdf(..., request_timeout=intTimeout)
    Parser->>Parser: determine request_timeout (config or passed arg)
    Parser->>API: POST /ocr (timeout=request_timeout) 
    API-->>Parser: response JSON
    Parser-->>Server: parsed result
    Server-->>UI: progress callbacks & final result
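
The resolution step in the diagram above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the helper names `coerce_timeout` and `resolve_request_timeout` are hypothetical, and the precedence shown (explicit argument, then `parser_config`, then the 600-second default) is an assumed reading of the walkthrough.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_TIMEOUT = 600  # seconds; default named in the walkthrough


def coerce_timeout(raw_value, default=DEFAULT_TIMEOUT):
    """Return raw_value as a positive int, or default if it is unusable."""
    try:
        timeout = int(raw_value)
    except (TypeError, ValueError):
        logger.warning("Invalid timeout %r; falling back to %ss", raw_value, default)
        return default
    if timeout <= 0:
        logger.warning("Non-positive timeout %r; falling back to %ss", timeout, default)
        return default
    return timeout


def resolve_request_timeout(parser_config=None, request_timeout=None):
    """Pick the timeout source (assumed order) and log which one was used."""
    parser_config = parser_config or {}
    if request_timeout is not None:
        source, raw = "request_timeout argument", request_timeout
    elif parser_config.get("paddleocr_request_timeout") is not None:
        source, raw = "parser_config", parser_config["paddleocr_request_timeout"]
    else:
        source, raw = "default", DEFAULT_TIMEOUT
    timeout = coerce_timeout(raw)
    logger.info("PaddleOCR request timeout: %ss (source: %s)", timeout, source)
    return timeout
```

Logging both the value and its source is what makes a stalled ingest traceable: a reader of the log can tell whether a 600-second timeout was chosen deliberately or reached as a fallback.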

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

🧰 typescript

Suggested reviewers

  • yingfeng
  • dcc123456

Poem

🐰 A timeout set in careful hops,
Six hundred heartbeats, no more flops,
Parsers hum and callbacks sway,
Translations sing in bright array,
Logs and tests cheer—hooray! 🌿

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.24% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the two main changes: making PaddleOCR timeout configurable and improving ingest progress visibility. |
| Description check | ✅ Passed | The description fully addresses the template by explaining the problem solved, checking both Bug Fix and New Feature options, and providing a detailed summary of changes with validation steps. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rag/svr/task_executor.py (1)

648-656: ⚠️ Potential issue | 🟠 Major

Embedding progress is still under-reported for multi-item batches.

Line 656 uses (i + 1) / len(cnts), but i is the batch start index. With batch size > 1, final progress can stay much lower than intended (e.g., 128 chunks, batch 64 ends around 0.876). Use processed-item count instead.

Proposed fix
-        callback(prog=0.8 + 0.15 * (i + 1) / len(cnts), msg="")
+        processed = min(i + settings.EMBEDDING_BATCH_SIZE, len(cnts))
+        callback(prog=0.8 + 0.15 * processed / len(cnts), msg="")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/svr/task_executor.py` around lines 648 - 656, The progress calculation
uses the batch start index i so multi-item batches under-report progress; update
the callback progress numerator to the number of items processed so far (e.g.,
processed = min(i + len(vts), len(cnts)) or i + settings.EMBEDDING_BATCH_SIZE)
and replace (i + 1) / len(cnts) with processed / len(cnts) in the callback;
change code in the loop around batch_encode/callback (variables: cnts, vts,
settings.EMBEDDING_BATCH_SIZE, cnts_, tk_count, callback) to compute processed
from the actual batch size (len(vts)) before calling callback.
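
The arithmetic behind this finding can be reproduced with a small standalone sketch. The two functions below are hypothetical and only mirror the progress formulas quoted in the comment (base 0.8, span 0.15); they are not code from `rag/svr/task_executor.py`.

```python
def progress_by_batch_start(total, batch_size, base=0.8, span=0.15):
    """Final progress when the numerator is the batch start index plus one."""
    last = base
    for i in range(0, total, batch_size):
        # i is the index of the first item in the batch, not the item count.
        last = base + span * (i + 1) / total
    return last


def progress_by_processed(total, batch_size, base=0.8, span=0.15):
    """Final progress when the numerator is the number of items processed."""
    last = base
    for i in range(0, total, batch_size):
        # Clamp so a short final batch does not push the count past total.
        processed = min(i + batch_size, total)
        last = base + span * processed / total
    return last
```

With 128 chunks and a batch size of 64, `progress_by_batch_start(128, 64)` ends at roughly 0.876, matching the figure in the comment, while `progress_by_processed(128, 64)` reaches the intended 0.95 (up to floating-point rounding).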
🧹 Nitpick comments (2)
rag/flow/parser/parser.py (1)

495-503: Add parser-layer logging for the new timeout flow.

The timeout is correctly forwarded, but this new branch-level flow should log the configured value to make ingest-timeout behavior traceable during debugging.

Proposed patch
             pdf_parser = ocr_model.mdl
             request_timeout = conf.get("paddleocr_request_timeout")
+            logging.info(
+                "Using PaddleOCR request timeout: %s seconds",
+                request_timeout,
+            )
 
             lines, _ = pdf_parser.parse_pdf(
                 filepath=name,
                 binary=blob,
                 callback=self.callback,
                 parse_method="pipeline",
                 request_timeout=request_timeout,
             )

As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/flow/parser/parser.py` around lines 495 - 503, Add a parser-layer log
that emits the configured timeout value when you forward request_timeout into
pdf_parser.parse_pdf: log the request_timeout (and optionally the target
filepath/name) immediately after computing request_timeout and before calling
pdf_parser.parse_pdf in parser.py (the block that sets request_timeout and calls
parse_pdf). Use the module/class logger used elsewhere in this file (e.g.,
self.logger or module-level logger) and choose an appropriate level (info/debug)
so ingest-timeout behavior is traceable.
web/src/pages/dataset/dataset-setting/form-schema.ts (1)

24-24: Use a localized min-validation message for timeout.

Line 24 uses Zod’s default .min(1) message, which can leak non-localized text in dataset settings validation.

♻️ Proposed patch
-        paddleocr_request_timeout: z.coerce.number().int().min(1).optional(),
+        paddleocr_request_timeout: z.coerce
+          .number()
+          .int()
+          .min(1, {
+            message: t(
+              'knowledgeConfiguration.paddleocrRequestTimeoutMin',
+              'Request timeout must be at least 1 second',
+            ),
+          })
+          .optional(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/pages/dataset/dataset-setting/form-schema.ts` at line 24, The
paddleocr_request_timeout field currently uses Zod's default .min(1) message;
update the validator for paddleocr_request_timeout in form-schema.ts to pass a
localized error message to .min (e.g., .min(1, { message:
i18n.t('dataset.settings.timeout_min', { min: 1 }) }) or using your project's
translate helper) so validation errors surface with the localized string instead
of the default English text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rag/app/naive.py`:
- Around line 216-219: Add a log that emits the resolved PaddleOCR timeout and
its source after computing request_timeout from kwargs or parser_config: after
the block that sets parser_config = kwargs.get("parser_config") or {} and the
request_timeout resolution, call the module/function logger to log the effective
timeout value and whether it came from the explicit request_timeout kwarg or
from parser_config["paddleocr_request_timeout"] (include fallback/null if
unset); update the same logging in the alternate flow referenced around the
request_timeout handling so both branches produce the same traceable message.

In `@rag/llm/ocr_model.py`:
- Around line 119-126: In _resolve_int_config, add warning logs for both
fallback paths so invalid or non-positive values don't fail silently: when
int(raw_value) raises TypeError/ValueError log a warning including key, env_key,
raw_value and the default being used; similarly, when timeout <= 0 log a warning
including the parsed timeout, key, env_key and the default used; use the module
logger (e.g., logging.getLogger(__name__) or the existing logger) and keep the
return behavior unchanged, still returning default on those fallbacks; reference
_resolve_int_config and the upstream _resolve_config to locate where to add
these logs.

In `@test/unit_test/deepdoc/parser/test_paddleocr_timeout.py`:
- Around line 95-182: Add pytest priority markers to these three tests by
importing pytest if missing and decorating each test function
(test_paddleocr_model_reads_request_timeout_from_json_config,
test_paddleocr_parse_pdf_forwards_request_timeout_to_http_call,
test_paddleocr_send_request_uses_configured_timeout) with the appropriate
priority marker (e.g., `@pytest.mark.p1` or the repo-specified p1/p2/p3 level);
ensure the marker name matches the test-suite convention and run lint/tests to
confirm markers are recognized.

In `@web/src/locales/ar.ts`:
- Around line 445-448: The Arabic locale is missing translations for the new
PaddleOCR timeout keys; update the entries for paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, paddleocrRequestTimeoutPlaceholder (and the
duplicate set around the other block) in ar.ts to Arabic strings so the UI is
not mixed-language—locate those keys in the file (e.g., paddleocrRequestTimeout
/ paddleocrRequestTimeoutTip / paddleocrRequestTimeoutPlaceholder around the
shown diffs and the similar block at lines ~1136-1139) and replace the English
values with appropriate Arabic translations for the label, explanatory tip, and
placeholder/min value.

In `@web/src/locales/bg.ts`:
- Around line 449-452: The Bulgarian locale currently leaves three new keys
untranslated: paddleocrRequestTimeout, paddleocrRequestTimeoutTip, and
paddleocrRequestTimeoutPlaceholder; replace the English strings with proper
Bulgarian translations for the label, descriptive tip (mentioning large
PDFs/books may need higher timeout), and the numeric placeholder (if localized
format differs) so the entries are consistent with the rest of bg locale; update
both occurrences (the one shown and the duplicate at lines referenced) ensuring
the keys remain unchanged.

In `@web/src/locales/de.ts`:
- Around line 451-454: The German locale file contains English strings for the
keys paddleocrRequestTimeout, paddleocrRequestTimeoutTip and
paddleocrRequestTimeoutPlaceholder (also the duplicate occurrence around the
later block), so replace those English values with their German translations
(e.g., "Anforderungs-Timeout (Sekunden)", "Große PDFs oder Bücher benötigen
möglicherweise ein höheres Timeout." and an appropriate numeric placeholder like
"600") by updating the entries in web/src/locales/de.ts for
paddleocrRequestTimeout, paddleocrRequestTimeoutTip and
paddleocrRequestTimeoutPlaceholder (and the matching duplicate keys at the later
block) so the UI displays German text.

In `@web/src/locales/es.ts`:
- Around line 169-172: The Spanish locale currently leaves PaddleOCR timeout
strings in English; update the keys paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder to Spanish
translations (e.g., "Tiempo de espera de la solicitud (segundos)", "PDFs grandes
o libros pueden requerir un tiempo de espera mayor." and keep the numeric
placeholder "600"); ensure you replace both occurrences found for these keys so
the UI is fully localized.

In `@web/src/locales/fr.ts`:
- Around line 303-306: Update the French locale entries for the PaddleOCR
timeout keys: replace the English strings for paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder in fr.ts with
proper French translations (e.g., "Délai d'attente de la requête (secondes)",
"Les PDF volumineux ou les livres peuvent nécessiter un délai plus long.", and
an appropriate numeric placeholder like "600"), and make the same replacements
for the duplicated set of keys further down in the file so both occurrences use
French copy.

In `@web/src/locales/id.ts`:
- Around line 324-327: Replace the English timeout strings for the Indonesian
locale by updating the paddleocrRequestTimeout, paddleocrRequestTimeoutTip, and
paddleocrRequestTimeoutPlaceholder entries: set paddleocrRequestTimeout to
"Waktu tunggu permintaan (detik)", paddleocrRequestTimeoutTip to "PDF besar atau
buku mungkin memerlukan waktu tunggu yang lebih lama.", and keep
paddleocrRequestTimeoutPlaceholder as "600"; also make the same replacements for
the duplicate keys present around the other occurrence (the entries at the other
location referenced in the comment) so the ID locale no longer mixes English and
Indonesian.

In `@web/src/locales/it.ts`:
- Around line 497-500: Replace the English strings for the PaddleOCR timeout
keys with Italian translations: update paddleocrRequestTimeout to "Timeout
richiesta (secondi)", paddleocrRequestTimeoutTip to "PDF voluminosi o libri
potrebbero richiedere un timeout maggiore." and leave
paddleocrRequestTimeoutPlaceholder as "600"; apply the same changes for the
second occurrence of these keys elsewhere in the file (the duplicate block
around the later lines) so both instances use the Italian text.

In `@web/src/locales/ja.ts`:
- Around line 322-325: Replace the English strings introduced for the PaddleOCR
timeout with Japanese translations: update the keys paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder in this file
(and the second occurrence later where the same keys appear) so their values are
Japanese text conveying "Request timeout (seconds)", the tip about large
PDFs/books needing a higher timeout, and the placeholder "600"; keep the key
names unchanged and only modify the string values to appropriate Japanese
translations.

In `@web/src/locales/pt-br.ts`:
- Around line 319-322: Translate the new timeout strings for Portuguese: replace
the English values for the keys paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder with
appropriate pt-BR text; also update the duplicate occurrences of these same keys
further down the file (the second group around the dataset/model settings block)
so both places show the same Portuguese translations.

In `@web/src/locales/ru.ts`:
- Around line 471-474: The RU locale currently leaves the paddleocr timeout
strings in English; update the entries for paddleocrRequestTimeout,
paddleocrRequestTimeoutTip and paddleocrRequestTimeoutPlaceholder (and the
duplicate keys around lines 1299-1302) in web/src/locales/ru.ts to Russian
equivalents — replace "Request timeout (seconds)" with "Таймаут запроса
(секунды)", "Large PDFs or books may require a higher timeout." with "Большие
PDF или книги могут требовать большего таймаута." and adjust the placeholder
"600" if needed; ensure both occurrences of these keys are translated
consistently so the UI/validation text is fully localized.

In `@web/src/locales/tr.ts`:
- Around line 462-465: The Turkish locale is missing translations for the new
PaddleOCR timeout keys; update the entries paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder in the
Turkish locale file by replacing the English strings with proper Turkish
translations (and apply the same replacements for the duplicate set of keys
found elsewhere in the same tr.ts file), so the UI displays consistent Turkish
text for the timeout label, tip, and placeholder.

In `@web/src/locales/vi.ts`:
- Around line 363-366: The English timeout labels need Vietnamese translations:
replace the values for paddleocrRequestTimeout, paddleocrRequestTimeoutTip, and
paddleocrRequestTimeoutPlaceholder with Vietnamese equivalents (e.g., "Thời gian
chờ yêu cầu (giây)", "Các PDF hoặc sách lớn có thể cần thời gian chờ lớn hơn.",
"600"), and make the same replacement for the duplicate keys around lines
620-623 so all timeout labels are consistently localized; update the string
values for those keys (paddleocrRequestTimeout*, etc.) in vi.ts accordingly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 08dc228b-1484-4175-a538-b9e8de642a5d

📥 Commits

Reviewing files that changed from the base of the PR and between 52442c8 and 7fc5b85.

📒 Files selected for processing (35)
  • api/utils/api_utils.py
  • api/utils/validation_utils.py
  • deepdoc/parser/docling_parser.py
  • deepdoc/parser/paddleocr_parser.py
  • deepdoc/parser/pdf_parser.py
  • deepdoc/parser/tcadp_parser.py
  • docker/.env
  • rag/app/naive.py
  • rag/app/resume.py
  • rag/flow/parser/parser.py
  • rag/llm/ocr_model.py
  • rag/svr/task_executor.py
  • test/unit_test/deepdoc/parser/test_paddleocr_timeout.py
  • web/src/components/chunk-method-dialog/index.tsx
  • web/src/components/chunk-method-dialog/use-default-parser-values.ts
  • web/src/components/paddleocr-options-form-field.tsx
  • web/src/interfaces/database/document.ts
  • web/src/locales/ar.ts
  • web/src/locales/bg.ts
  • web/src/locales/de.ts
  • web/src/locales/en.ts
  • web/src/locales/es.ts
  • web/src/locales/fr.ts
  • web/src/locales/id.ts
  • web/src/locales/it.ts
  • web/src/locales/ja.ts
  • web/src/locales/pt-br.ts
  • web/src/locales/ru.ts
  • web/src/locales/tr.ts
  • web/src/locales/vi.ts
  • web/src/locales/zh-traditional.ts
  • web/src/locales/zh.ts
  • web/src/pages/dataset/dataset-setting/form-schema.ts
  • web/src/pages/dataset/dataset-setting/index.tsx
  • web/src/pages/user-setting/setting-model/modal/paddleocr-modal/index.tsx
💤 Files with no reviewable changes (1)
  • docker/.env

@joonsoome joonsoome force-pushed the paddleocr-enhancement branch from 7fc5b85 to d5337fa Compare April 11, 2026 01:41
@joonsoome
Author

Thanks for the review. I’ve addressed the PaddleOCR timeout/logging/test/i18n feedback and rewrote the branch history so the docker/.env deletion is no longer part of the PR diff or commit history. The inline review threads are resolved on my side, and I ran python -m py_compile and git diff --check locally.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rag/svr/task_executor.py`:
- Line 656: The progress values overlap because the embedding phase uses
callback(prog=0.8 + 0.15 * (i + 1) / len(cnts)) while insert_chunks() restarts
at 0.8; update the indexing progress formula in insert_chunks() to start where
embedding ends (e.g., replace 0.8 + 0.1 * (b + 1) / len(chunks) with 0.95 + 0.05
* (b + 1) / len(chunks) or use named constants like EMBED_PHASE_END=0.95 and
INDEX_PHASE_RANGE=0.05) so progress is monotonically increasing; adjust any
related callback invocations to use the same constants to avoid future drift.
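
One way to keep the phases from overlapping is to derive every callback value from shared range constants. The sketch below is an illustration only: `EMBED_PHASE`, `INDEX_PHASE`, `phase_progress`, and `simulate` are hypothetical names, and the ranges follow the 0.8 to 0.95 and 0.95 to 1.0 split suggested in the comment.

```python
# Assumed phase ranges, following the split suggested in the review comment.
EMBED_PHASE = (0.80, 0.95)
INDEX_PHASE = (0.95, 1.00)


def phase_progress(phase, done, total):
    """Map done/total within a phase onto that phase's slice of overall progress."""
    start, end = phase
    if total <= 0:
        return end  # nothing to do, so the phase is trivially complete
    fraction = min(max(done / total, 0.0), 1.0)
    return start + (end - start) * fraction


def simulate(n_embed_batches, n_index_batches):
    """Return the sequence of progress values reported across both phases."""
    values = []
    for b in range(n_embed_batches):
        values.append(phase_progress(EMBED_PHASE, b + 1, n_embed_batches))
    for b in range(n_index_batches):
        values.append(phase_progress(INDEX_PHASE, b + 1, n_index_batches))
    return values
```

Because the indexing range starts where the embedding range ends, the values returned by `simulate` never decrease, which is the monotonic behavior the comment asks for.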

In `@web/src/components/paddleocr-options-form-field.tsx`:
- Around line 29-81: Add the missing Portuguese translations for the two
PaddleOCR UI strings referenced in
web/src/components/paddleocr-options-form-field.tsx by adding entries for
knowledgeConfiguration.paddleocrPresetManaged and
knowledgeConfiguration.paddleocrPresetManagedValue to the knowledgeConfiguration
object in web/src/locales/pt-br.ts; ensure the keys are placed alongside the
other knowledgeConfiguration translations and provide appropriate pt-BR strings
so the UI no longer falls back to English.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 689f7340-9264-4933-b199-d527f0ad4b0a

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc5b85 and d5337fa.

📒 Files selected for processing (23)
  • deepdoc/parser/docling_parser.py
  • deepdoc/parser/paddleocr_parser.py
  • deepdoc/parser/pdf_parser.py
  • deepdoc/parser/tcadp_parser.py
  • rag/app/naive.py
  • rag/app/resume.py
  • rag/flow/parser/parser.py
  • rag/llm/ocr_model.py
  • rag/svr/task_executor.py
  • test/unit_test/deepdoc/parser/test_paddleocr_timeout.py
  • web/src/components/paddleocr-options-form-field.tsx
  • web/src/locales/ar.ts
  • web/src/locales/bg.ts
  • web/src/locales/de.ts
  • web/src/locales/es.ts
  • web/src/locales/fr.ts
  • web/src/locales/id.ts
  • web/src/locales/it.ts
  • web/src/locales/ja.ts
  • web/src/locales/pt-br.ts
  • web/src/locales/ru.ts
  • web/src/locales/tr.ts
  • web/src/locales/vi.ts
✅ Files skipped from review due to trivial changes (10)
  • deepdoc/parser/docling_parser.py
  • rag/app/resume.py
  • deepdoc/parser/tcadp_parser.py
  • web/src/locales/es.ts
  • web/src/locales/it.ts
  • web/src/locales/fr.ts
  • web/src/locales/ja.ts
  • web/src/locales/tr.ts
  • web/src/locales/ru.ts
  • web/src/locales/bg.ts
🚧 Files skipped from review as they are similar to previous changes (10)
  • rag/flow/parser/parser.py
  • rag/app/naive.py
  • web/src/locales/de.ts
  • deepdoc/parser/pdf_parser.py
  • web/src/locales/vi.ts
  • web/src/locales/ar.ts
  • deepdoc/parser/paddleocr_parser.py
  • web/src/locales/id.ts
  • test/unit_test/deepdoc/parser/test_paddleocr_timeout.py
  • rag/llm/ocr_model.py

@joonsoome joonsoome requested a review from yingfeng April 11, 2026 01:52
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
rag/svr/task_executor.py (1)

349-353: Use a single model identifier field in start/done logs.

Start logs use task["llm_id"] while completion logs use chat_mdl.llm_name; using one canonical identifier will make log correlation easier.

Also applies to: 396-401, 445-448

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/svr/task_executor.py` around lines 349 - 353, Start and completion logs
use two different model identifiers (task["llm_id"] vs chat_mdl.llm_name); pick
one canonical identifier and use it consistently in both start and done logs.
Modify the start-log calls (the logging.info that currently references
task["llm_id"]) to use a single model_id variable sourced from chat_mdl.llm_name
with a fallback to task["llm_id"] if chat_mdl is not yet available, and replace
similar occurrences referenced around the other start/done pairs (the blocks at
~396-401 and ~445-448) so all logs use the same model_id variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 863821b7-1fe6-4a24-85eb-04ab1775f8ae

📥 Commits

Reviewing files that changed from the base of the PR and between d5337fa and 4a24907.

📒 Files selected for processing (2)
  • rag/svr/task_executor.py
  • web/src/locales/pt-br.ts
✅ Files skipped from review due to trivial changes (1)
  • web/src/locales/pt-br.ts
