read manifest from s3 by karpnv · Pull Request #15330 · NVIDIA-NeMo/NeMo

karpnv · 2026-01-28T02:14:50Z

S3 support

Collection: ASR

Changelog

Read the input manifest from S3 object storage (s3://abc/sharded_manifests/manifest_0.jsonl or s3://abc/sharded_manifests/manifest__OP_0..2047_CL_.jsonl).
Add --s3cfg: the path to the S3 credentials file plus the section to use, e.g. ~/.s3cfg[default]. Set to "" to disable S3 support. Default is "".
Add --tar-base-path: the S3 path to tarred audio files (e.g., s3://ASR/tarred/audio_0.tar or s3://ASR/tarred/audio__OP_0..2047_CL_.tar).
When --tar-base-path is specified, audio_filepath values in the manifest are treated as filenames within the tar archive.
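
For readers unfamiliar with the _OP_a..b_CL_ notation in the paths above, here is a minimal sketch of how such a pattern could be expanded into individual shard paths. The helper name expand_sharded_path is hypothetical and is not taken from this PR; it assumes plain integer ranges without zero padding, and it uses only the standard library.

```python
import itertools
import re

# Matches one numeric range written as _OP_<start>..<end>_CL_
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_sharded_path(pattern: str) -> list[str]:
    """Expand every _OP_a..b_CL_ range in `pattern` (hypothetical helper).

    Assumption: ranges are inclusive and numbers are written without padding.
    """
    # With two capture groups, re.split yields:
    # literal, start, end, literal, start, end, ..., literal
    pieces = _RANGE.split(pattern)
    literals = pieces[0::3]
    ranges = [
        range(int(start), int(end) + 1)
        for start, end in zip(pieces[1::3], pieces[2::3])
    ]

    paths = []
    # The Cartesian product covers patterns with several ranges,
    # e.g. a bucket range combined with a shard range.
    for combo in itertools.product(*ranges):
        path = literals[0]
        for number, literal in zip(combo, literals[1:]):
            path += f"{number}{literal}"
        paths.append(path)
    return paths


if __name__ == "__main__":
    for shard in expand_sharded_path("s3://abc/sharded_manifests/manifest__OP_0..2_CL_.jsonl"):
        print(shard)
```

A pattern with no range is returned unchanged as a single-element list, so the same call would work for both single-file and sharded inputs.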

Usage

python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/manifest_0.json --tar-base-path s3://abc/tarred/audio_0.tar --s3cfg ~/.s3cfg[default]

python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/bucket_OP_1..8_CL_/manifest__OP_0..2047_CL_.jsonl --tar-base-path s3://abc/tarred/bucket_OP_1..8_CL_/audio__OP_0..2047_CL_.tar --s3cfg ~/.s3cfg[default]

GitHub Actions CI

PR Type:

[x] New Feature
[ ] Bugfix
[ ] Documentation

github-actions · 2026-01-31T00:32:07Z

[🤖]: Hi @karpnv 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

vsl9

Thanks, LGTM

Jorjeous · 2026-02-06T14:30:42Z

@karpnv please check again, now all LGTM

tools/speech_data_explorer/data_explorer.py

 import operator
 import os
 import pickle
+import tarfile


To fix an unused import in Python, the general approach is to remove the import statement for the module that is not referenced anywhere in the file. This keeps the namespace clean, avoids unnecessary dependencies, and can slightly improve startup time.

In this file, the specific fix is to delete the import tarfile statement at line 27. All other imports should remain unchanged, since they may be used elsewhere in the file. No additional code, methods, or definitions are required, and no existing functionality needs to change. The only edit is to remove that single line from tools/speech_data_explorer/data_explorer.py.

tools/speech_data_explorer/data_explorer.py

+    for line in lines[1:]:  # Skip header line
+        parts = line.split()
+        if len(parts) >= 4:
+            file_type = parts[0]


In general, to fix an unused local variable, either remove the assignment if it is not needed (taking care not to remove any necessary side effects) or, if the value is intentionally ignored, rename the variable to an “unused” name such as _ or something containing unused so that static analysis and humans both understand it is intentionally not used.

Here, file_type = parts[0] is not used anywhere in parse_dali_index, and reading parts[0] has no required side effects. To avoid changing functionality or subtle debugging behavior (for example, if someone relies on the parse blowing up when parts[0] is missing), the safest fix is to keep the assignment but rename the variable to a conventional unused name. I will rename file_type to _file_type_unused, which satisfies CodeQL’s pattern (“containing unused”) and makes the intent explicit. No imports or additional definitions are required, and only line 279 in tools/speech_data_explorer/data_explorer.py needs to change.

github-actions · 2026-02-11T17:01:53Z

[🤖]: Hi @karpnv 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

Jorjeous · 2026-02-12T18:04:59Z

testing

github-actions · 2026-02-18T17:02:46Z

[🤖]: Hi @karpnv 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

Jorjeous · 2026-02-19T15:20:49Z

python tools/speech_data_explorer/data_explorer.py \
    /path/to/manifest.jsonl \
    --tar-base-path "s3://DATASET_A/lang/tarred_train/audio__OP_0..63_CL_.tar" \
    --s3cfg "~/.s3cfg[default]"

Example usage for a non-bucketed dataset

Jorjeous · 2026-02-19T15:21:27Z

# Explore bucket1
python tools/speech_data_explorer/data_explorer.py \
    /path/to/bucket1/tarred_audio_manifest.json \
    --tar-base-path "s3://DATASET_B/train/bucket1/audio__OP_0..2047_CL_.tar" \
    --s3cfg "~/.s3cfg[default]"

# Explore bucket4
python tools/speech_data_explorer/data_explorer.py \
    /path/to/bucket4/tarred_audio_manifest.json \
    --tar-base-path "s3://DATASET_B/train/bucket4/audio__OP_0..2047_CL_.tar" \
    --s3cfg "~/.s3cfg[default]"

Example usage for a bucketed dataset

Jorjeous · 2026-03-06T13:00:08Z

@karpnv please check again; everything should work now. LGTM

Jorjeous

LGTM

Jorjeous · 2026-03-09T12:26:14Z

Non-bucketed

python3 tools/speech_data_explorer/data_explorer.py \
    "s3://BUCKET/path/to/sharded_manifests/manifest__OP_0..N_CL_.json" \
    --tar-base-path "s3://BUCKET/path/to/audio__OP_0..N_CL_.tar" \
    --s3cfg '~/.s3cfg[SECTION]'

Bucketed

python3 tools/speech_data_explorer/data_explorer.py \
    "s3://BUCKET/path/to/bucket_OP_1..B_CL_/sharded_manifests/manifest__OP_0..N_CL_.json" \
    --tar-base-path "s3://BUCKET/path/to/bucket_OP_1..B_CL_/audio__OP_0..N_CL_.tar" \
    --s3cfg '~/.s3cfg[SECTION]'

Where:

B = number of buckets
N = number of shards per bucket (must match in manifest and tar)
SECTION = default, or the name of another config section
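
To illustrate the PATH[SECTION] form used by --s3cfg, here is a rough sketch of how such a value might be split and the chosen section read. The function names split_s3cfg_spec and read_s3_section are hypothetical and not taken from this PR. The sketch assumes the credentials file is INI-style (as s3cmd config files are) and that a missing section falls back to "default"; the tool itself may behave differently.

```python
import configparser
import os
import re


def split_s3cfg_spec(spec: str) -> tuple[str, str]:
    """Split 'PATH[SECTION]' into (expanded path, section). Hypothetical helper."""
    match = re.fullmatch(r"(?P<path>.+?)(?:\[(?P<section>[^\[\]]+)\])?", spec)
    if match is None:
        # An empty value means "S3 disabled" and should be handled by the caller.
        raise ValueError(f"invalid --s3cfg value: {spec!r}")
    path = os.path.expanduser(match.group("path"))
    # Assumption: fall back to the "default" section when none is given.
    section = match.group("section") or "default"
    return path, section


def read_s3_section(spec: str) -> dict[str, str]:
    """Return the key/value pairs of the selected section of an INI-style file."""
    path, section = split_s3cfg_spec(spec)
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):  # read() returns the list of files it could parse
        raise FileNotFoundError(path)
    return dict(parser[section])


if __name__ == "__main__":
    print(split_s3cfg_spec("~/.s3cfg[default]"))
```

Note that quoting the value with single quotes, as in the commands above, stops the shell from expanding ~, which is why the sketch calls os.path.expanduser itself.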

karpnv and others added 2 commits January 27, 2026 18:13

read manifest from s3

fe21e7e

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Apply isort and black reformatting

9069210

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

github-advanced-security bot found potential problems Jan 28, 2026

karpnv added 2 commits January 30, 2026 16:21

s3cfg parameter

89a595f

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

d67ec95

karpnv marked this pull request as ready for review January 31, 2026 00:23

karpnv requested a review from Jorjeous January 31, 2026 00:23

karpnv added the Run CICD label Jan 31, 2026

karpnv requested a review from vsl9 January 31, 2026 00:31

github-actions bot removed the Run CICD label Jan 31, 2026

vsl9 previously approved these changes Feb 2, 2026

Jorjeous requested changes Feb 3, 2026

file range

da895cb

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

karpnv dismissed vsl9’s stale review via da895cb February 4, 2026 02:03

karpnv and others added 2 commits February 4, 2026 02:04

Apply isort and black reformatting

b79a0da

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

Avoid downloading of full tar, instead extracting specific audio file…

fce458b

…. Updated logging system Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Jorjeous previously approved these changes Feb 6, 2026

Apply isort and black reformatting

3e5f4ec

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

chtruong814 dismissed Jorjeous’s stale review via 3e5f4ec February 6, 2026 14:29

Jorjeous requested a review from vsl9 February 6, 2026 14:30

Jorjeous previously approved these changes Feb 6, 2026

github-advanced-security bot found potential problems Feb 6, 2026

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

391b045

karpnv added the Run CICD label Feb 11, 2026

github-actions bot removed the Run CICD label Feb 11, 2026

karpnv added 2 commits February 11, 2026 11:22

shard_index + 1

64e662f

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

0e043e3

karpnv dismissed Jorjeous’s stale review via 0e043e3 February 11, 2026 19:22

karpnv added the Run CICD label Feb 12, 2026

Merge branch 'main' into karpnv/sde_s3

850fd4c

chtruong814 added Run CICD and removed Run CICD labels Feb 18, 2026

github-actions bot removed the Run CICD label Feb 18, 2026

Undo latest changes, as it was dataset specific

69500f6

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

karpnv and others added 2 commits March 3, 2026 16:36

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

ba40e0a

update table to not fail on "non-string format", update bucketing and…

2bae4dc

… sharding with separate numeration. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Jorjeous approved these changes Mar 6, 2026

Merge branch 'main' into karpnv/sde_s3

3803d5d

Jorjeous requested a review from nithinraok March 6, 2026 14:47

@@ -276,7 +276,7 @@
                 for line in lines[1:]:  # Skip header line
                     parts = line.split()
                     if len(parts) >= 4:
-                        file_type = parts[0]
+                        _file_type_unused = parts[0]
                         offset = int(parts[1])
                         size = int(parts[2])
                         filename = parts[3]
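
For context on what this loop is doing, here is a self-contained sketch of the same parsing idea. It is not the PR's parse_dali_index; the function and type names are hypothetical, and the index layout (a header line followed by "type offset size filename" rows) is inferred only from the hunk above. How the offsets are later used to fetch a single file is not shown here.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    offset: int  # byte offset as recorded in the index
    size: int    # byte size as recorded in the index


def parse_index_lines(lines: list[str]) -> dict[str, IndexEntry]:
    """Map filename -> (offset, size); layout assumed from the diff above."""
    entries: dict[str, IndexEntry] = {}
    for line in lines[1:]:  # Skip header line
        parts = line.split()
        if len(parts) >= 4:
            # parts[0] is the file type; the lookup does not need it,
            # so it is simply not bound to a name here.
            entries[parts[3]] = IndexEntry(offset=int(parts[1]), size=int(parts[2]))
    return entries


if __name__ == "__main__":
    sample = ["header", "wav 1536 32000 a.wav", "wav 35072 16000 b.wav"]
    print(parse_index_lines(sample)["b.wav"])
```

Skipping the assignment entirely, as in this sketch, is the other option the Autofix note mentions; the PR instead keeps the assignment under the name _file_type_unused.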
