Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
|
[🤖]: Hi @karpnv 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals. |
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updated logging system
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
|
@karpnv please check again, everything LGTM now.
import operator
import os
import pickle
import tarfile
Check notice (Code scanning / CodeQL): Unused import
Copilot Autofix (AI, 3 days ago)
To fix an unused import in Python, the general approach is to remove the import statement for the module that is not referenced anywhere in the file. This keeps the namespace clean, avoids unnecessary dependencies, and can slightly improve startup time.
In this file, the specific fix is to delete the import tarfile statement at line 27. All other imports should remain unchanged, since they may be used elsewhere in the file. No additional code, methods, or definitions are required, and no existing functionality needs to change. The only edit is to remove that single line from tools/speech_data_explorer/data_explorer.py.
@@ -24,7 +24,6 @@
 import operator
 import os
 import pickle
-import tarfile
 from collections import defaultdict
 from os.path import expanduser
 from pathlib import Path
for line in lines[1:]:  # Skip header line
    parts = line.split()
    if len(parts) >= 4:
        file_type = parts[0]
Check notice (Code scanning / CodeQL): Unused local variable
Copilot Autofix (AI, 3 days ago)
In general, to fix an unused local variable, either remove the assignment if it is not needed (taking care not to remove any necessary side effects) or, if the value is intentionally ignored, rename the variable to an “unused” name such as _ or something containing unused so that static analysis and humans both understand it is intentionally not used.
Here, file_type = parts[0] is not used anywhere in parse_dali_index, and reading parts[0] has no required side effects. To avoid changing functionality or subtle debugging behavior (for example, if someone relies on the parse blowing up when parts[0] is missing), the safest fix is to keep the assignment but rename the variable to a conventional unused name. I will rename file_type to _file_type_unused, which satisfies CodeQL’s pattern (“containing unused”) and makes the intent explicit. No imports or additional definitions are required, and only line 279 in tools/speech_data_explorer/data_explorer.py needs to change.
@@ -276,7 +276,7 @@
 for line in lines[1:]:  # Skip header line
     parts = line.split()
     if len(parts) >= 4:
-        file_type = parts[0]
+        _file_type_unused = parts[0]
         offset = int(parts[1])
         size = int(parts[2])
         filename = parts[3]
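For reference, the parsing loop in the hunk above can be sketched as a standalone function. This is only an illustration built from the lines visible in the diff: the function name `parse_index_lines`, the returned tuple layout, and the sample index text (including its header line) are hypothetical and not taken from `data_explorer.py`. It uses tuple unpacking with `_` to discard the first field, which is the other conventional way to mark a value as intentionally unused.

```python
from typing import List, Tuple


def parse_index_lines(text: str) -> List[Tuple[int, int, str]]:
    """Return (offset, size, filename) for each well-formed entry line.

    Hypothetical helper: mirrors the loop shown in the diff, not the real
    parse_dali_index implementation.
    """
    entries = []
    lines = text.splitlines()
    for line in lines[1:]:  # skip the header line
        parts = line.split()
        if len(parts) >= 4:
            # The first field (entry type) is read but deliberately ignored.
            _, offset, size, filename = parts[:4]
            entries.append((int(offset), int(size), filename))
    return entries


# Made-up sample: one header line, two valid entries, one malformed line.
sample = "header\nwav 512 1024 a.wav\nwav 2048 4096 b.wav\nshort line\n"
print(parse_index_lines(sample))
```

Lines with fewer than four fields are skipped silently, as in the original loop; a non-numeric offset or size would still raise `ValueError`.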
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
|
testing |
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Example usage for a non-bucketed dataset
Example usage for a bucketed dataset
… sharding with separate numeration.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
|
@karpnv check again, everything should work now. LGTM
|
Non-bucketed
|
S3 support
Collection: ASR
Changelog
- --s3cfg. Example: ~/.s3cfg[default]. Set to "" to disable S3 support. Default is "".
- --tar-base-path (e.g., s3://ASR/tarred/audio_0.tar or s3://ASR/tarred/audio__OP_0..2047_CL_.tar). When specified, audio_filepath values in the manifest are treated as filenames within this tar archive.
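The `_OP_0..2047_CL_` form in the `--tar-base-path` example looks like the brace-style shard range used for tarred datasets, with `_OP_` and `_CL_` standing in for `{` and `}`. Assuming that reading is correct, a minimal sketch of how such a path expands into individual shard paths could look like this. The function name `expand_sharded_path` is hypothetical, and this is not the code the tool itself uses.

```python
import itertools
import re
from typing import List

# Matches numeric brace ranges such as {0..2047}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_sharded_path(path: str) -> List[str]:
    """Expand _OP_a..b_CL_ ranges into a list of concrete paths.

    Assumption: _OP_ and _CL_ are placeholders for "{" and "}".
    """
    path = path.replace("_OP_", "{").replace("_CL_", "}")
    # With two capture groups, split yields: text, lo, hi, text, lo, hi, ..., text
    pieces = _RANGE.split(path)
    literals = pieces[0::3]
    ranges = [
        range(int(lo), int(hi) + 1)
        for lo, hi in zip(pieces[1::3], pieces[2::3])
    ]
    results = []
    # Every combination of range values; a path without ranges yields itself.
    for combo in itertools.product(*ranges):
        out = literals[0]
        for value, literal in zip(combo, literals[1:]):
            out += str(value) + literal
        results.append(out)
    return results


shards = expand_sharded_path("s3://ASR/tarred/audio__OP_0..2047_CL_.tar")
print(shards[0], shards[-1], len(shards))
```

When a path contains two ranges (as in the bucketed case), this sketch enumerates every combination of the two; how the tool pairs manifest shards with tar shards is not shown here.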
Usage
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/manifest_0.json --tar-base-path s3://abc/tarred/audio_0.tar --s3cfg ~/.s3cfg[default]
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/bucket_OP_1..8_CL_/manifest__OP_0..2047_CL_.jsonl --tar-base-path s3://abc/tarred/bucket_OP_1..8_CL_/audio__OP_0..2047_CL_.tar --s3cfg ~/.s3cfg[default]
GitHub Actions CI
PR Type: