read manifest from s3#15330

Open
karpnv wants to merge 16 commits into main from karpnv/sde_s3
@karpnv
Collaborator

@karpnv karpnv commented Jan 28, 2026

S3 support

Collection: ASR

Changelog

  • Read the input manifest from S3 object storage (e.g., s3://abc/sharded_manifests/manifest_0.jsonl or s3://abc/sharded_manifests/manifest__OP_0..2047_CL_.jsonl).
  • Add --s3cfg, the path to the S3 credentials file followed by the section to use, e.g. ~/.s3cfg[default]. Set it to "" to disable S3 support. The default is "".
  • Add --tar-base-path, the S3 path to tarred audio files (e.g., s3://ASR/tarred/audio_0.tar or s3://ASR/tarred/audio__OP_0..2047_CL_.tar).
    When specified, audio_filepath values in the manifest are treated as filenames within this tar archive.
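
A minimal sketch of what "treated as filenames within this tar archive" means in practice. It is not taken from this PR: the helpers build_demo_tar and read_manifest_audio are hypothetical, the archive is built in memory instead of being fetched from S3, and the payload is a placeholder rather than real audio.

```python
import io
import json
import tarfile


def build_demo_tar(files: dict[str, bytes]) -> bytes:
    """Build a small in-memory tar archive (stand-in for an S3 audio shard)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_manifest_audio(manifest_line: str, tar_bytes: bytes) -> bytes:
    """Look up the manifest's audio_filepath as a member name inside the tar."""
    entry = json.loads(manifest_line)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        # extractfile raises KeyError if no member has that name.
        member = tar.extractfile(entry["audio_filepath"])
        return member.read()


if __name__ == "__main__":
    shard = build_demo_tar({"utt_0001.wav": b"placeholder-bytes"})
    line = '{"audio_filepath": "utt_0001.wav", "duration": 1.0}'
    print(len(read_manifest_audio(line, shard)), "bytes read from the archive")
```

The point of the sketch is only the lookup direction: the manifest supplies the member name, and the archive given by --tar-base-path supplies the bytes.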

Usage

  • You can potentially add a usage example below
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/manifest_0.json --tar-base-path s3://abc/tarred/audio_0.tar --s3cfg ~/.s3cfg[default]
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/bucket_OP_1..8_CL_/manifest__OP_0..2047_CL_.jsonl --tar-base-path s3://abc/tarred/bucket_OP_1..8_CL_/audio__OP_0..2047_CL_.tar --s3cfg ~/.s3cfg[default]
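
The _OP_a..b_CL_ groups in these paths stand for numeric ranges. The sketch below shows one way such a pattern could be expanded into concrete paths; expand_shard_pattern is a hypothetical helper written for illustration, not the code in this PR, and it assumes plain integer ranges without zero padding.

```python
import itertools
import re

# Matches one range group such as "_OP_0..2047_CL_".
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_shard_pattern(pattern: str) -> list[str]:
    """Expand every _OP_a..b_CL_ group; several groups give a Cartesian product."""
    literals, ranges, pos = [], [], 0
    for match in _RANGE.finditer(pattern):
        literals.append(pattern[pos:match.start()])
        low, high = int(match.group(1)), int(match.group(2))
        ranges.append(range(low, high + 1))  # both bounds are inclusive
        pos = match.end()
    literals.append(pattern[pos:])

    paths = []
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(str(value))
            pieces.append(literal)
        paths.append("".join(pieces))
    return paths


if __name__ == "__main__":
    for path in expand_shard_pattern("s3://abc/tarred/bucket_OP_1..2_CL_/audio__OP_0..1_CL_.tar"):
        print(path)
```

A path without any range group comes back unchanged as a single-element list, so plain paths such as manifest_0.jsonl pass through the same helper.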

GitHub Actions CI

PR Type:

  • [ V] New Feature
  • Bugfix
  • Documentation

karpnv and others added 2 commits January 27, 2026 18:13
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
@karpnv karpnv marked this pull request as ready for review January 31, 2026 00:23
@karpnv karpnv requested a review from Jorjeous January 31, 2026 00:23
@karpnv karpnv requested a review from vsl9 January 31, 2026 00:31
@github-actions github-actions bot removed the Run CICD label Jan 31, 2026
@github-actions
Contributor

[🤖]: Hi @karpnv 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

vsl9
vsl9 previously approved these changes Feb 2, 2026
Collaborator

@vsl9 vsl9 left a comment

Thanks, LGTM

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
karpnv and others added 2 commits February 4, 2026 02:04
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updated logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Jorjeous
Jorjeous previously approved these changes Feb 6, 2026
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
@Jorjeous
Member

Jorjeous commented Feb 6, 2026

@karpnv please check again, now all LGTM

Jorjeous
Jorjeous previously approved these changes Feb 6, 2026
import operator
import os
import pickle
import tarfile

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'tarfile' is not used.

Copilot Autofix

AI 3 days ago

To fix an unused import in Python, the general approach is to remove the import statement for the module that is not referenced anywhere in the file. This keeps the namespace clean, avoids unnecessary dependencies, and can slightly improve startup time.

In this file, the specific fix is to delete the import tarfile statement at line 27. All other imports should remain unchanged, since they may be used elsewhere in the file. No additional code, methods, or definitions are required, and no existing functionality needs to change. The only edit is to remove that single line from tools/speech_data_explorer/data_explorer.py.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -24,7 +24,6 @@
 import operator
 import os
 import pickle
-import tarfile
 from collections import defaultdict
 from os.path import expanduser
 from pathlib import Path
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
    for line in lines[1:]:  # Skip header line
        parts = line.split()
        if len(parts) >= 4:
            file_type = parts[0]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable file_type is not used.

Copilot Autofix

AI 3 days ago

In general, to fix an unused local variable, either remove the assignment if it is not needed (taking care not to remove any necessary side effects) or, if the value is intentionally ignored, rename the variable to an “unused” name such as _ or something containing unused so that static analysis and humans both understand it is intentionally not used.

Here, file_type = parts[0] is not used anywhere in parse_dali_index, and reading parts[0] has no required side effects. To avoid changing functionality or subtle debugging behavior (for example, if someone relies on the parse blowing up when parts[0] is missing), the safest fix is to keep the assignment but rename the variable to a conventional unused name. I will rename file_type to _file_type_unused, which satisfies CodeQL’s pattern (“containing unused”) and makes the intent explicit. No imports or additional definitions are required, and only line 279 in tools/speech_data_explorer/data_explorer.py needs to change.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -276,7 +276,7 @@
     for line in lines[1:]:  # Skip header line
         parts = line.split()
         if len(parts) >= 4:
-            file_type = parts[0]
+            _file_type_unused = parts[0]
             offset = int(parts[1])
             size = int(parts[2])
             filename = parts[3]
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
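
For context, the loop flagged above reads an index in which each line after the header carries four whitespace-separated fields: file type, offset, size, filename. The following is a hedged sketch of that parsing step, not the PR's parse_dali_index itself; IndexEntry and parse_index_lines are hypothetical names and the sample lines are made up. It discards the unused first field by unpacking it into _, which is another common way to satisfy the "unused local variable" check.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    offset: int    # byte offset of the member inside the tar
    size: int      # member size in bytes
    filename: str  # member name


def parse_index_lines(lines: list[str]) -> list[IndexEntry]:
    """Parse 'type offset size filename' lines, skipping the header line."""
    entries = []
    for line in lines[1:]:  # first line is a header
        parts = line.split()
        if len(parts) >= 4:
            # The leading file-type field is intentionally ignored.
            _, offset, size, filename = parts[:4]
            entries.append(IndexEntry(int(offset), int(size), filename))
    return entries


if __name__ == "__main__":
    # Made-up sample content, for illustration only.
    sample = [
        "header 2",
        "wav 1536 32000 utt_0001.wav",
        "wav 35072 48000 utt_0002.wav",
    ]
    for entry in parse_index_lines(sample):
        print(entry)
```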
@Jorjeous
Member

testing

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member

python tools/speech_data_explorer/data_explorer.py \
    /path/to/manifest.jsonl \
    --tar-base-path "s3://DATASET_A/lang/tarred_train/audio__OP_0..63_CL_.tar" \
    --s3cfg "~/.s3cfg[default]"

Example usage for a non-bucketed dataset

@Jorjeous
Member

# Explore bucket1
python tools/speech_data_explorer/data_explorer.py \
    /path/to/bucket1/tarred_audio_manifest.json \
    --tar-base-path "s3://DATASET_B/train/bucket1/audio__OP_0..2047_CL_.tar" \
    --s3cfg "~/.s3cfg[default]"

# Explore bucket4
python tools/speech_data_explorer/data_explorer.py \
    /path/to/bucket4/tarred_audio_manifest.json \
    --tar-base-path "s3://DATASET_B/train/bucket4/audio__OP_0..2047_CL_.tar" \
    --s3cfg "~/.s3cfg[default]"

Example usage for a bucketed dataset

karpnv and others added 2 commits March 3, 2026 16:36
… sharding with separate numeration.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member

Jorjeous commented Mar 6, 2026

@karpnv please check again; everything should work now. LGTM

Member

@Jorjeous Jorjeous left a comment

LGTM

@Jorjeous Jorjeous requested a review from nithinraok March 6, 2026 14:47
@Jorjeous
Member

Jorjeous commented Mar 9, 2026

Non-bucketed
python3 tools/speech_data_explorer/data_explorer.py \
    "s3://BUCKET/path/to/sharded_manifests/manifest__OP_0..N_CL_.json" \
    --tar-base-path "s3://BUCKET/path/to/audio__OP_0..N_CL_.tar" \
    --s3cfg '~/.s3cfg[SECTION]'
Bucketed
python3 tools/speech_data_explorer/data_explorer.py \
    "s3://BUCKET/path/to/bucket_OP_1..B_CL_/sharded_manifests/manifest__OP_0..N_CL_.json" \
    --tar-base-path "s3://BUCKET/path/to/bucket_OP_1..B_CL_/audio__OP_0..N_CL_.tar" \
    --s3cfg '~/.s3cfg[SECTION]'

Where:
  • B = number of buckets
  • N = number of shards per bucket (must match in manifest and tar)
  • SECTION = default, or the name of another section in the config file
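
The --s3cfg value combines a file path and a section name in one string. Below is a rough sketch of how such a value could be split and the section read; split_s3cfg and load_s3_section are hypothetical helpers, not the PR's implementation. It assumes the file is an INI-style config as written by s3cmd, and it treats an empty string as "S3 disabled", following the PR description.

```python
import configparser
import os
import re

# "<path>[<section>]", where the bracketed section is optional.
_SPEC = re.compile(r"^(?P<path>.*?)(?:\[(?P<section>[^\[\]]+)\])?$")


def split_s3cfg(spec: str, default_section: str = "default"):
    """Split '~/.s3cfg[default]' into (expanded path, section); '' returns None."""
    if not spec:
        return None  # empty value: S3 support is disabled
    match = _SPEC.match(spec)
    path = os.path.expanduser(match.group("path"))
    section = match.group("section") or default_section
    return path, section


def load_s3_section(path: str, section: str) -> dict[str, str]:
    """Read one section of an INI-style credentials file into a dict."""
    # interpolation=None keeps '%' characters in secrets from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(path)
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found in {path}")
    return dict(parser[section])


if __name__ == "__main__":
    print(split_s3cfg("~/.s3cfg[default]"))
```

Quoting the value on the command line, as in the examples above, keeps the shell from interpreting the square brackets as a glob pattern.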
