fix: make PaddleOCR timeout configurable and surface real ingest progress#14050
joonsoome wants to merge 7 commits into infiniflow:main
Conversation
- resolve PaddleOCR timeout from saved config or env
- use the resolved timeout for the HTTP request
- add timeout input to the PaddleOCR models modal
📝 Walkthrough
This pull request adds a configurable
Changes
Sequence Diagram(s)
sequenceDiagram
participant UI as "Frontend UI"
participant Server as "App / RAG Flow"
participant OCRModel as "PaddleOCROcrModel"
participant Parser as "PaddleOCRParser"
participant API as "PaddleOCR API"
UI->>Server: save / start parse (parser_config includes paddleocr_request_timeout?)
Server->>OCRModel: init (config)
Note right of OCRModel: _resolve_int_config -> int timeout or default(600)
OCRModel->>Parser: parse_pdf(..., request_timeout=intTimeout)
Parser->>Parser: determine request_timeout (config or passed arg)
Parser->>API: POST /ocr (timeout=request_timeout)
API-->>Parser: response JSON
Parser-->>Server: parsed result
Server-->>UI: progress callbacks & final result
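The parser step in the diagram ("determine request_timeout (config or passed arg)") can be sketched in Python roughly as below. This is a minimal illustration, not the PR's code: the helper name `pick_request_timeout` and the precedence order (explicit argument first, then saved config, then default) are assumptions. Only the `paddleocr_request_timeout` key and the 600-second default come from the discussion.

```python
from typing import Any, Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 600  # default named in the sequence diagram


def pick_request_timeout(
    parser_config: Optional[Mapping[str, Any]] = None,
    request_timeout: Optional[int] = None,
    default: int = DEFAULT_TIMEOUT_SECONDS,
) -> int:
    """Hypothetical helper: explicit argument, then saved config, then default."""
    if request_timeout is not None:
        return int(request_timeout)
    configured = (parser_config or {}).get("paddleocr_request_timeout")
    if configured is not None:
        return int(configured)
    return default


# A saved dataset setting is used when no explicit argument is passed.
print(pick_request_timeout({"paddleocr_request_timeout": 1800}))
# An explicit argument wins over the saved setting (assumed precedence).
print(pick_request_timeout({"paddleocr_request_timeout": 1800}, request_timeout=900))
# Nothing configured: fall back to the default.
print(pick_request_timeout({}))
```

The resolved value would then be handed to the HTTP client as its timeout, so a long-running OCR job on a large PDF is not cut off at the old fixed limit.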
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rag/svr/task_executor.py (1)
648-656: ⚠️ Potential issue | 🟠 Major
Embedding progress is still under-reported for multi-item batches.
Line 656 uses `(i + 1) / len(cnts)`, but `i` is the batch start index. With batch size > 1, final progress can stay much lower than intended (e.g., 128 chunks, batch 64 ends around `0.876`). Use processed-item count instead.
Proposed fix
- callback(prog=0.8 + 0.15 * (i + 1) / len(cnts), msg="")
+ processed = min(i + settings.EMBEDDING_BATCH_SIZE, len(cnts))
+ callback(prog=0.8 + 0.15 * processed / len(cnts), msg="")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/svr/task_executor.py` around lines 648 - 656, The progress calculation uses the batch start index i so multi-item batches under-report progress; update the callback progress numerator to the number of items processed so far (e.g., processed = min(i + len(vts), len(cnts)) or i + settings.EMBEDDING_BATCH_SIZE) and replace (i + 1) / len(cnts) with processed / len(cnts) in the callback; change code in the loop around batch_encode/callback (variables: cnts, vts, settings.EMBEDDING_BATCH_SIZE, cnts_, tk_count, callback) to compute processed from the actual batch size (len(vts)) before calling callback.
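To make the under-reporting concrete, here is a small standalone comparison of the two formulas from the comment above. The function names are hypothetical and the loop only mimics stepping through batch start indices; it is not the code in `rag/svr/task_executor.py`.

```python
def progress_by_start_index(i: int, total: int) -> float:
    """Original formula: treats the batch start index as if it were an item count."""
    return 0.8 + 0.15 * (i + 1) / total


def progress_by_processed(i: int, batch_size: int, total: int) -> float:
    """Proposed formula: counts every item handled up to the end of this batch."""
    processed = min(i + batch_size, total)
    return 0.8 + 0.15 * processed / total


total, batch_size = 128, 64
for i in range(0, total, batch_size):  # i is the start index of each batch
    print(
        i,
        round(progress_by_start_index(i, total), 3),
        round(progress_by_processed(i, batch_size, total), 3),
    )
```

With 128 chunks and a batch size of 64, the last batch starts at `i = 64`, so the original formula stops near 0.876 while the processed-count version reaches the end of the embedding range at about 0.95.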
🧹 Nitpick comments (2)
rag/flow/parser/parser.py (1)
495-503: Add parser-layer logging for the new timeout flow.
The timeout is correctly forwarded, but this new branch-level flow should log the configured value to make ingest-timeout behavior traceable during debugging.
Proposed patch
  pdf_parser = ocr_model.mdl
  request_timeout = conf.get("paddleocr_request_timeout")
+ logging.info(
+     "Using PaddleOCR request timeout: %s seconds",
+     request_timeout,
+ )
  lines, _ = pdf_parser.parse_pdf(
      filepath=name,
      binary=blob,
      callback=self.callback,
      parse_method="pipeline",
      request_timeout=request_timeout,
  )
As per coding guidelines, `**/*.py`: Add logging for new flows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/flow/parser/parser.py` around lines 495 - 503, Add a parser-layer log that emits the configured timeout value when you forward request_timeout into pdf_parser.parse_pdf: log the request_timeout (and optionally the target filepath/name) immediately after computing request_timeout and before calling pdf_parser.parse_pdf in parser.py (the block that sets request_timeout and calls parse_pdf). Use the module/class logger used elsewhere in this file (e.g., self.logger or module-level logger) and choose an appropriate level (info/debug) so ingest-timeout behavior is traceable.
web/src/pages/dataset/dataset-setting/form-schema.ts (1)
24-24: Use a localized min-validation message for timeout.
Line 24 uses Zod’s default `.min(1)` message, which can leak non-localized text in dataset settings validation.
♻️ Proposed patch
- paddleocr_request_timeout: z.coerce.number().int().min(1).optional(),
+ paddleocr_request_timeout: z.coerce
+   .number()
+   .int()
+   .min(1, {
+     message: t(
+       'knowledgeConfiguration.paddleocrRequestTimeoutMin',
+       'Request timeout must be at least 1 second',
+     ),
+   })
+   .optional(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/pages/dataset/dataset-setting/form-schema.ts` at line 24, The paddleocr_request_timeout field currently uses Zod's default .min(1) message; update the validator for paddleocr_request_timeout in form-schema.ts to pass a localized error message to .min (e.g., .min(1, { message: i18n.t('dataset.settings.timeout_min', { min: 1 }) }) or using your project's translate helper) so validation errors surface with the localized string instead of the default English text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rag/app/naive.py`:
- Around line 216-219: Add a log that emits the resolved PaddleOCR timeout and
its source after computing request_timeout from kwargs or parser_config: after
the block that sets parser_config = kwargs.get("parser_config") or {} and the
request_timeout resolution, call the module/function logger to log the effective
timeout value and whether it came from the explicit request_timeout kwarg or
from parser_config["paddleocr_request_timeout"] (include fallback/null if
unset); update the same logging in the alternate flow referenced around the
request_timeout handling so both branches produce the same traceable message.
In `@rag/llm/ocr_model.py`:
- Around line 119-126: In _resolve_int_config, add warning logs for both
fallback paths so invalid or non-positive values don't fail silently: when
int(raw_value) raises TypeError/ValueError log a warning including key, env_key,
raw_value and the default being used; similarly, when timeout <= 0 log a warning
including the parsed timeout, key, env_key and the default used; use the module
logger (e.g., logging.getLogger(__name__) or the existing logger) and keep the
return behavior unchanged, still returning default on those fallbacks; reference
_resolve_int_config and the upstream _resolve_config to locate where to add
these logs.
In `@test/unit_test/deepdoc/parser/test_paddleocr_timeout.py`:
- Around line 95-182: Add pytest priority markers to these three tests by
importing pytest if missing and decorating each test function
(test_paddleocr_model_reads_request_timeout_from_json_config,
test_paddleocr_parse_pdf_forwards_request_timeout_to_http_call,
test_paddleocr_send_request_uses_configured_timeout) with the appropriate
priority marker (e.g., `@pytest.mark.p1` or the repo-specified p1/p2/p3 level);
ensure the marker name matches the test-suite convention and run lint/tests to
confirm markers are recognized.
In `@web/src/locales/ar.ts`:
- Around line 445-448: The Arabic locale is missing translations for the new
PaddleOCR timeout keys; update the entries for paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, paddleocrRequestTimeoutPlaceholder (and the
duplicate set around the other block) in ar.ts to Arabic strings so the UI is
not mixed-language—locate those keys in the file (e.g., paddleocrRequestTimeout
/ paddleocrRequestTimeoutTip / paddleocrRequestTimeoutPlaceholder around the
shown diffs and the similar block at lines ~1136-1139) and replace the English
values with appropriate Arabic translations for the label, explanatory tip, and
placeholder/min value.
In `@web/src/locales/bg.ts`:
- Around line 449-452: The Bulgarian locale currently leaves three new keys
untranslated: paddleocrRequestTimeout, paddleocrRequestTimeoutTip, and
paddleocrRequestTimeoutPlaceholder; replace the English strings with proper
Bulgarian translations for the label, descriptive tip (mentioning large
PDFs/books may need higher timeout), and the numeric placeholder (if localized
format differs) so the entries are consistent with the rest of bg locale; update
both occurrences (the one shown and the duplicate at lines referenced) ensuring
the keys remain unchanged.
In `@web/src/locales/de.ts`:
- Around line 451-454: The German locale file contains English strings for the
keys paddleocrRequestTimeout, paddleocrRequestTimeoutTip and
paddleocrRequestTimeoutPlaceholder (also the duplicate occurrence around the
later block), so replace those English values with their German translations
(e.g., "Anforderungs-Timeout (Sekunden)", "Große PDFs oder Bücher benötigen
möglicherweise ein höheres Timeout." and an appropriate numeric placeholder like
"600") by updating the entries in web/src/locales/de.ts for
paddleocrRequestTimeout, paddleocrRequestTimeoutTip and
paddleocrRequestTimeoutPlaceholder (and the matching duplicate keys at the later
block) so the UI displays German text.
In `@web/src/locales/es.ts`:
- Around line 169-172: The Spanish locale currently leaves PaddleOCR timeout
strings in English; update the keys paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder to Spanish
translations (e.g., "Tiempo de espera de la solicitud (segundos)", "PDFs grandes
o libros pueden requerir un tiempo de espera mayor." and keep the numeric
placeholder "600"); ensure you replace both occurrences found for these keys so
the UI is fully localized.
In `@web/src/locales/fr.ts`:
- Around line 303-306: Update the French locale entries for the PaddleOCR
timeout keys: replace the English strings for paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder in fr.ts with
proper French translations (e.g., "Délai d'attente de la requête (secondes)",
"Les PDF volumineux ou les livres peuvent nécessiter un délai plus long.", and
an appropriate numeric placeholder like "600"), and make the same replacements
for the duplicated set of keys further down in the file so both occurrences use
French copy.
In `@web/src/locales/id.ts`:
- Around line 324-327: Replace the English timeout strings for the Indonesian
locale by updating the paddleocrRequestTimeout, paddleocrRequestTimeoutTip, and
paddleocrRequestTimeoutPlaceholder entries: set paddleocrRequestTimeout to
"Waktu tunggu permintaan (detik)", paddleocrRequestTimeoutTip to "PDF besar atau
buku mungkin memerlukan waktu tunggu yang lebih lama.", and keep
paddleocrRequestTimeoutPlaceholder as "600"; also make the same replacements for
the duplicate keys present around the other occurrence (the entries at the other
location referenced in the comment) so the ID locale no longer mixes English and
Indonesian.
In `@web/src/locales/it.ts`:
- Around line 497-500: Replace the English strings for the PaddleOCR timeout
keys with Italian translations: update paddleocrRequestTimeout to "Timeout
richiesta (secondi)", paddleocrRequestTimeoutTip to "PDF voluminosi o libri
potrebbero richiedere un timeout maggiore." and leave
paddleocrRequestTimeoutPlaceholder as "600"; apply the same changes for the
second occurrence of these keys elsewhere in the file (the duplicate block
around the later lines) so both instances use the Italian text.
In `@web/src/locales/ja.ts`:
- Around line 322-325: Replace the English strings introduced for the PaddleOCR
timeout with Japanese translations: update the keys paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder in this file
(and the second occurrence later where the same keys appear) so their values are
Japanese text conveying "Request timeout (seconds)", the tip about large
PDFs/books needing a higher timeout, and the placeholder "600"; keep the key
names unchanged and only modify the string values to appropriate Japanese
translations.
In `@web/src/locales/pt-br.ts`:
- Around line 319-322: Translate the new timeout strings for Portuguese: replace
the English values for the keys paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder with
appropriate pt-BR text; also update the duplicate occurrences of these same keys
further down the file (the second group around the dataset/model settings block)
so both places show the same Portuguese translations.
In `@web/src/locales/ru.ts`:
- Around line 471-474: The RU locale currently leaves the paddleocr timeout
strings in English; update the entries for paddleocrRequestTimeout,
paddleocrRequestTimeoutTip and paddleocrRequestTimeoutPlaceholder (and the
duplicate keys around lines 1299-1302) in web/src/locales/ru.ts to Russian
equivalents — replace "Request timeout (seconds)" with "Таймаут запроса
(секунды)", "Large PDFs or books may require a higher timeout." with "Большие
PDF или книги могут требовать большего таймаута." and adjust the placeholder
"600" if needed; ensure both occurrences of these keys are translated
consistently so the UI/validation text is fully localized.
In `@web/src/locales/tr.ts`:
- Around line 462-465: The Turkish locale is missing translations for the new
PaddleOCR timeout keys; update the entries paddleocrRequestTimeout,
paddleocrRequestTimeoutTip, and paddleocrRequestTimeoutPlaceholder in the
Turkish locale file by replacing the English strings with proper Turkish
translations (and apply the same replacements for the duplicate set of keys
found elsewhere in the same tr.ts file), so the UI displays consistent Turkish
text for the timeout label, tip, and placeholder.
In `@web/src/locales/vi.ts`:
- Around line 363-366: The English timeout labels need Vietnamese translations:
replace the values for paddleocrRequestTimeout, paddleocrRequestTimeoutTip, and
paddleocrRequestTimeoutPlaceholder with Vietnamese equivalents (e.g., "Thời gian
chờ yêu cầu (giây)", "Các PDF hoặc sách lớn có thể cần thời gian chờ lớn hơn.",
"600"), and make the same replacement for the duplicate keys around lines
620-623 so all timeout labels are consistently localized; update the string
values for those keys (paddleocrRequestTimeout*, etc.) in vi.ts accordingly.
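For the `rag/llm/ocr_model.py` item in the list above, the requested fallback logging might look something like the following. This is a sketch under stated assumptions, not the PR's `_resolve_int_config`: the standalone function name `resolve_int_config`, its signature, the config-before-environment lookup order, and the `PADDLEOCR_REQUEST_TIMEOUT` variable name are all hypothetical.

```python
import logging
import os
from typing import Any, Mapping

logger = logging.getLogger(__name__)


def resolve_int_config(
    config: Mapping[str, Any], key: str, env_key: str, default: int
) -> int:
    """Illustrative stand-in for _resolve_int_config; not the PR's actual code."""
    raw_value = config.get(key)
    if raw_value is None:
        # Assumed order: saved config first, then the environment variable.
        raw_value = os.environ.get(env_key)
    if raw_value is None:
        return default
    try:
        value = int(raw_value)
    except (TypeError, ValueError):
        logger.warning(
            "Invalid value %r for %s/%s; falling back to default %s",
            raw_value, key, env_key, default,
        )
        return default
    if value <= 0:
        logger.warning(
            "Non-positive value %s for %s/%s; falling back to default %s",
            value, key, env_key, default,
        )
        return default
    return value


# Hypothetical key and environment variable names, used only for illustration.
print(resolve_int_config(
    {"paddleocr_request_timeout": "900"},
    "paddleocr_request_timeout",
    "PADDLEOCR_REQUEST_TIMEOUT",
    600,
))
```

Both fallback paths still return the default, so behavior is unchanged; the only difference is that a bad value now leaves a warning in the log instead of failing silently.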
📒 Files selected for processing (35)
- api/utils/api_utils.py
- api/utils/validation_utils.py
- deepdoc/parser/docling_parser.py
- deepdoc/parser/paddleocr_parser.py
- deepdoc/parser/pdf_parser.py
- deepdoc/parser/tcadp_parser.py
- docker/.env
- rag/app/naive.py
- rag/app/resume.py
- rag/flow/parser/parser.py
- rag/llm/ocr_model.py
- rag/svr/task_executor.py
- test/unit_test/deepdoc/parser/test_paddleocr_timeout.py
- web/src/components/chunk-method-dialog/index.tsx
- web/src/components/chunk-method-dialog/use-default-parser-values.ts
- web/src/components/paddleocr-options-form-field.tsx
- web/src/interfaces/database/document.ts
- web/src/locales/ar.ts
- web/src/locales/bg.ts
- web/src/locales/de.ts
- web/src/locales/en.ts
- web/src/locales/es.ts
- web/src/locales/fr.ts
- web/src/locales/id.ts
- web/src/locales/it.ts
- web/src/locales/ja.ts
- web/src/locales/pt-br.ts
- web/src/locales/ru.ts
- web/src/locales/tr.ts
- web/src/locales/vi.ts
- web/src/locales/zh-traditional.ts
- web/src/locales/zh.ts
- web/src/pages/dataset/dataset-setting/form-schema.ts
- web/src/pages/dataset/dataset-setting/index.tsx
- web/src/pages/user-setting/setting-model/modal/paddleocr-modal/index.tsx
💤 Files with no reviewable changes (1)
- docker/.env
7fc5b85 to d5337fa
Thanks for the review. I’ve addressed the PaddleOCR timeout/logging/test/i18n feedback and rewrote the branch history so the |
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rag/svr/task_executor.py`:
- Line 656: The progress values overlap because the embedding phase uses
callback(prog=0.8 + 0.15 * (i + 1) / len(cnts)) while insert_chunks() restarts
at 0.8; update the indexing progress formula in insert_chunks() to start where
embedding ends (e.g., replace 0.8 + 0.1 * (b + 1) / len(chunks) with 0.95 + 0.05
* (b + 1) / len(chunks) or use named constants like EMBED_PHASE_END=0.95 and
INDEX_PHASE_RANGE=0.05) so progress is monotonically increasing; adjust any
related callback invocations to use the same constants to avoid future drift.
In `@web/src/components/paddleocr-options-form-field.tsx`:
- Around line 29-81: Add the missing Portuguese translations for the two
PaddleOCR UI strings referenced in
web/src/components/paddleocr-options-form-field.tsx by adding entries for
knowledgeConfiguration.paddleocrPresetManaged and
knowledgeConfiguration.paddleocrPresetManagedValue to the knowledgeConfiguration
object in web/src/locales/pt-br.ts; ensure the keys are placed alongside the
other knowledgeConfiguration translations and provide appropriate pt-BR strings
so the UI no longer falls back to English.
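The named-constant approach suggested for `rag/svr/task_executor.py` above could be sketched as follows. The constant names `EMBED_PHASE_END` and `INDEX_PHASE_RANGE` come from the review comment; `EMBED_PHASE_START`, both function names, and the use of a batch index over a batch count are simplifying assumptions, not the project's code.

```python
EMBED_PHASE_START = 0.8   # assumed start of the embedding phase
EMBED_PHASE_END = 0.95    # embedding ends here; indexing begins here
INDEX_PHASE_RANGE = 0.05  # indexing covers the remaining 0.95..1.00


def embedding_progress(processed: int, total: int) -> float:
    """Progress while embedding, scaled into [EMBED_PHASE_START, EMBED_PHASE_END]."""
    span = EMBED_PHASE_END - EMBED_PHASE_START
    return EMBED_PHASE_START + span * processed / total


def indexing_progress(batch_index: int, batch_count: int) -> float:
    """Progress while indexing, starting where the embedding phase ended."""
    return EMBED_PHASE_END + INDEX_PHASE_RANGE * (batch_index + 1) / batch_count


# Report embedding after 64 and 128 of 128 chunks, then four indexing batches.
values = [embedding_progress(p, 128) for p in (64, 128)]
values += [indexing_progress(b, 4) for b in range(4)]
print(values)
print("monotonic:", values == sorted(values))
```

Because both phases derive their bounds from the same constants, the indexing phase cannot restart below the point where embedding finished, which is the overlap the comment describes.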
📒 Files selected for processing (23)
- deepdoc/parser/docling_parser.py
- deepdoc/parser/paddleocr_parser.py
- deepdoc/parser/pdf_parser.py
- deepdoc/parser/tcadp_parser.py
- rag/app/naive.py
- rag/app/resume.py
- rag/flow/parser/parser.py
- rag/llm/ocr_model.py
- rag/svr/task_executor.py
- test/unit_test/deepdoc/parser/test_paddleocr_timeout.py
- web/src/components/paddleocr-options-form-field.tsx
- web/src/locales/ar.ts
- web/src/locales/bg.ts
- web/src/locales/de.ts
- web/src/locales/es.ts
- web/src/locales/fr.ts
- web/src/locales/id.ts
- web/src/locales/it.ts
- web/src/locales/ja.ts
- web/src/locales/pt-br.ts
- web/src/locales/ru.ts
- web/src/locales/tr.ts
- web/src/locales/vi.ts
✅ Files skipped from review due to trivial changes (10)
- deepdoc/parser/docling_parser.py
- rag/app/resume.py
- deepdoc/parser/tcadp_parser.py
- web/src/locales/es.ts
- web/src/locales/it.ts
- web/src/locales/fr.ts
- web/src/locales/ja.ts
- web/src/locales/tr.ts
- web/src/locales/ru.ts
- web/src/locales/bg.ts
🚧 Files skipped from review as they are similar to previous changes (10)
- rag/flow/parser/parser.py
- rag/app/naive.py
- web/src/locales/de.ts
- deepdoc/parser/pdf_parser.py
- web/src/locales/vi.ts
- web/src/locales/ar.ts
- deepdoc/parser/paddleocr_parser.py
- web/src/locales/id.ts
- test/unit_test/deepdoc/parser/test_paddleocr_timeout.py
- rag/llm/ocr_model.py
🧹 Nitpick comments (1)
rag/svr/task_executor.py (1)
349-353: Use a single model identifier field in start/done logs.
Start logs use `task["llm_id"]` while completion logs use `chat_mdl.llm_name`; using one canonical identifier will make log correlation easier.
Also applies to: 396-401, 445-448
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/svr/task_executor.py` around lines 349 - 353, Start and completion logs use two different model identifiers (task["llm_id"] vs chat_mdl.llm_name); pick one canonical identifier and use it consistently in both start and done logs. Modify the start-log calls (the logging.info that currently references task["llm_id"]) to use a single model_id variable sourced from chat_mdl.llm_name with a fallback to task["llm_id"] if chat_mdl is not yet available, and replace similar occurrences referenced around the other start/done pairs (the blocks at ~396-401 and ~445-448) so all logs use the same model_id variable.
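One way to read this suggestion is a small helper that both the start and the done log lines call. The sketch below is hypothetical: `canonical_model_id` is not a function in the repository, and `SimpleNamespace` merely stands in for a model object that exposes `llm_name`.

```python
import logging
from types import SimpleNamespace
from typing import Any, Mapping, Optional


def canonical_model_id(task: Mapping[str, Any], chat_mdl: Optional[Any] = None) -> str:
    """Prefer chat_mdl.llm_name; fall back to task["llm_id"] when no model is bound yet."""
    name = getattr(chat_mdl, "llm_name", None)
    return name if name else str(task["llm_id"])


task = {"llm_id": "example-llm-id"}            # placeholder identifier
chat_mdl = SimpleNamespace(llm_name="example-llm-name")  # stand-in model object

model_id = canonical_model_id(task, chat_mdl)
logging.info("task start (model=%s)", model_id)
logging.info("task done (model=%s)", model_id)
print(model_id)
```

Resolving the identifier once and reusing the variable keeps the start and done lines searchable by the same string.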
📒 Files selected for processing (2)
- rag/svr/task_executor.py
- web/src/locales/pt-br.ts
✅ Files skipped from review due to trivial changes (1)
- web/src/locales/pt-br.ts
What problem does this PR solve?
PaddleOCR request timeout was effectively hardcoded, which made long-form ingest jobs such as books or large PDFs fail or stall without a clear tuning option. In addition, the ingest pipeline did not clearly surface the later chunking, embedding, and indexing stages, so jobs could look stalled, failed, or finished too early even while the backend was still working.
This PR makes the timeout configurable end-to-end, wires it through the parser and UI, adds coverage for timeout propagation, and improves progress/log reporting so ingest reflects the real pipeline state.
Type of change
Summary of changes
- Added `paddleocr_request_timeout` support end-to-end.
- Made `docker/.env` local-only by removing the tracked copy from the repository.
Validation
- python -m py_compile ...
- cd web && npm exec eslint src/components/paddleocr-options-form-field.tsx
Notes
- `cd web && npm exec tsc --noEmit` currently fails because of an existing repository `tsconfig.json` deprecation flag configuration, unrelated to this PR.
Screenshot
paddleOCR-vl can take longer than the hardcoded 600-second limit on documents with many pages, on books, or when custom models are used. When that happened, I received a failure message without any logs, and paddleOCR-api remained silent. On deeper debugging, I discovered that the vLLM container was still processing. Tracing back through the code, I found that RAGFlow had a hardcoded timeout of 600 seconds. After making this configurable, I was able to ingest my books as needed.

Screenshot 2
There was a bug where, even after setting and selecting a preset, the system required the same value to be entered again. Moreover, this input did not actually override the preset.

Therefore, I made the following changes:
