Commit 027cac0
Release v0.3.0: New Features and Fixes (#2)

* 📦 chore(release): Version bump: v0.2.0 → v0.3.0
* chore: remove outdated comments from Cargo.toml
* Add: Minimum version numbers to project.optional-dependencies in pyproject.toml
* chore: rename benchmark file for clarity
* feat: add wer_analysis module for detailed WER analytics
* Add type annotations to utils.py and update docstrings for clarity
* 📄 docs(readme): updated the content information
* chore: update project status to Production/Stable in pyproject.toml
* 📄 docs(changelog): update for v0.3.0 release
* Add: latest uv.lock file for reproducible Python dependency management
1 parent 77ed452 · commit 027cac0

15 files changed: +638 −25 lines

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
```diff
@@ -23,6 +23,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ---
 
+## [0.3.0] - 2025-05-16
+
+### Added
+- `wer_analysis` module for detailed WER analytics and word-level error breakdown
+- New utilities: `to_pandas()` and `to_polars()` for converting analysis results into Pandas and Polars DataFrames.
+
+### Changed
+- Added minimum version requirements for all packages in `[project.optional-dependencies]` in `pyproject.toml`. This improves dependency management and reduces the risk of incompatibility with older package versions.
+- Updated `__init__.py` to expose `analysis`, `to_pandas`, and `to_polars` at the top level for easier access.
+- Updated `README.md` to include a detailed user guide and instructions for using the `analysis()` function.
+- Documented optional dependency installation steps for Pandas and Polars.
+- Included instructions for converting analysis results to Pandas and Polars DataFrames using the `to_pandas()` and `to_polars()` utilities.
+
+### Fixed
+- Added type annotations to all public functions in `utils.py`, resolving Pylance warnings about unknown or missing types.
+- Improved docstrings and code comments for better clarity and maintainability.
+
+---
+
 ## [0.2.0] - 2025-05-14
 
 ### Added
```
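The "word-level error breakdown" the changelog refers to can be illustrated with a plain-Python Levenshtein backtrace. This is a sketch only: `werx` implements the real algorithm in Rust, and `word_errors` is a hypothetical helper name invented for this illustration.

```python
def word_errors(ref: str, hyp: str):
    """Return (inserted, deleted, substituted) word lists for a sentence pair."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein DP table over words (unit costs).
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    # Backtrace to recover which words were inserted, deleted, or substituted.
    inserted, deleted, substituted = [], [], []
    i, j = len(r), len(h)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and r[i - 1] == h[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                     # words match
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            substituted.append((r[i - 1], h[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            deleted.append(r[i - 1])
            i -= 1
        else:
            inserted.append(h[j - 1])
            j -= 1
    return inserted[::-1], deleted[::-1], substituted[::-1]

ins, dele, sub = word_errors("the quick brown fox", "the quick brown dog")
print(ins, dele, sub)  # [] [] [('fox', 'dog')]
```

The same pair appears in the README's "Getting Started" example, where `werx.analysis` reports the substitution `('fox', 'dog')`.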

README.md

Lines changed: 94 additions & 4 deletions
```diff
@@ -41,7 +41,7 @@
 
 🧩 **Robust:** Designed to handle edge cases gracefully, including empty strings and mismatched sequences<br>
 
-📐 **Accurate:** Carefully tested to ensure consistent and reliable results<br>
+📐 **Insightful:** Provides rich word-level error breakdowns, including substitutions, insertions, deletions, and weighted error rates<br>
 
 🛡️ **Production-Ready:** Minimal dependencies, memory-efficient, and engineered for stability<br>
 
```

````diff
@@ -73,7 +73,7 @@ import werx
 
 ### Examples:
 
-#### 1. Single sentence comparison
+### 1. Single sentence comparison
 
 *Python Code:*
 ```python
````

````diff
@@ -86,7 +86,9 @@ print(wer)
 0.25
 ```
 
-#### 2. Corpus level Word Error Rate Calculation
+<br/>
+
+### 2. Corpus level Word Error Rate Calculation
 
 *Python Code:*
 ```python
````

````diff
@@ -101,7 +103,9 @@ print(wer)
 0.2
 ```
 
-#### 3. Weighted Word Error Rate Calculation (Custom Weights)
+<br/>
+
+### 3. Weighted Word Error Rate Calculation
 
 *Python Code:*
 ```python
````

````diff
@@ -124,6 +128,92 @@ print(wer)
 0.15
 ```
 
+<br/>
+
+### 4. Complete Word Error Rate Breakdown
+
+The `analysis()` function provides a complete breakdown of word error rates, supporting both standard WER and weighted WER calculations.
+
+It delivers detailed, per-sentence metrics, including insertions, deletions, substitutions, and word-level error tracking, with the flexibility to customize error weights.
+
+Results are easily accessible through standard Python objects or can be conveniently converted into Pandas and Polars DataFrames for further analysis and reporting.
+
+#### 4a. Getting Started
+
+*Python Code:*
+```python
+ref = ["the quick brown fox"]
+hyp = ["the quick brown dog"]
+
+results = werx.analysis(ref, hyp)
+
+print("Inserted Words :", results[0].inserted_words)
+print("Deleted Words :", results[0].deleted_words)
+print("Substituted Words:", results[0].substituted_words)
+```
+
+*Results Output:*
+```
+Inserted Words : []
+Deleted Words : []
+Substituted Words: [('fox', 'dog')]
+```
+
+<br/>
+
+#### 4b. Converting Analysis Results to a DataFrame
+
+*Note:* To use this module, you must have either `pandas` or `polars` (or both) installed.
+
+*Install Pandas / Polars for DataFrame Conversion*
+```bash
+uv pip install pandas
+uv pip install polars
+```
+
+*Python Code:*
+```python
+ref = ["i love cold pizza", "the sugar bear character was popular"]
+hyp = ["i love pizza", "the sugar bare character was popular"]
+results = werx.analysis(
+    ref, hyp,
+    insertion_weight=2,
+    deletion_weight=2,
+    substitution_weight=1
+)
+```
+
+We’ve created a special utility to make working with DataFrames seamless.
+Just import the following helpers:
+
+```python
+import werx
+from werx.utils import to_polars, to_pandas
+```
+
+You can then convert the analysis results to a **Polars** DataFrame:
+```python
+# Convert to Polars DataFrame
+df_polars = to_polars(results)
+print(df_polars)
+```
+
+Alternatively, use **Pandas**, depending on your preference:
+```python
+# Convert to Pandas DataFrame
+df_pandas = to_pandas(results)
+print(df_pandas)
+```
+
+*Results Output:*
+
+| wer    | wwer   | ld  | n_ref | insertions | deletions | substitutions | inserted_words | deleted_words | substituted_words  |
+|--------|--------|-----|-------|------------|-----------|---------------|----------------|---------------|--------------------|
+| 0.25   | 0.50   | 1   | 4     | 0          | 1         | 0             | []             | ['cold']      | []                 |
+| 0.1667 | 0.1667 | 1   | 6     | 0          | 0         | 1             | []             | []            | [('bear', 'bare')] |
+
 <br/>
 
 ## 📄 License
````
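The `wwer` values in the results table above come from a weighted edit distance: each insertion, deletion, and substitution carries a configurable cost. A minimal pure-Python sketch of that calculation follows; it is illustrative only (werx implements this in Rust), and `weighted_wer` is a hypothetical name for the demo.

```python
def weighted_wer(ref: str, hyp: str,
                 insertion_weight: float = 1.0,
                 deletion_weight: float = 1.0,
                 substitution_weight: float = 1.0) -> float:
    """Weighted WER: minimum weighted edit cost divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum weighted cost to turn r[:i] into h[:j]
    dp = [[0.0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = i * deletion_weight
    for j in range(1, len(h) + 1):
        dp[0][j] = j * insertion_weight
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # match, no cost
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + deletion_weight,          # delete ref word
                    dp[i][j - 1] + insertion_weight,         # insert hyp word
                    dp[i - 1][j - 1] + substitution_weight,  # substitute
                )
    return dp[len(r)][len(h)] / len(r)

# Matches the table's first row: one deletion ("cold") out of 4 reference words.
print(weighted_wer("i love cold pizza", "i love pizza"))                     # 0.25
print(weighted_wer("i love cold pizza", "i love pizza", deletion_weight=2))  # 0.5
```

With unit weights this reduces to the standard WER, which is why `wer` and `wwer` agree on the second row of the table (a substitution with weight 1).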
Lines changed: 68 additions & 0 deletions
```python
from datasets import load_dataset
import werx
import werpy
import timeit
from werx.utils import to_pandas, to_polars

# Load the consolidated CSV from the Hugging Face Hub
dataset = load_dataset(
    "analyticsinmotion/librispeech-eval",
    data_files="all_splits.csv",
    split="train"
)

# Specify which split and model/version to evaluate
split = "test-clean"
model_name = "whisper-base"
model_version = "v20240930"

# Filter references and hypotheses for the chosen split/model/version
filtered = dataset.filter(
    lambda x: x["split"] == split and
              x["model_name"] == model_name and
              x["model_version"] == model_version
)

filtered = list(filtered)
references = [str(werpy.normalize(row["reference"])) for row in filtered]
hypotheses = [str(werpy.normalize(row["hypothesis"])) for row in filtered]

# --- Run werx.analysis once for Standard and Weighted ---
results_standard = werx.analysis(references, hypotheses)
results_weighted = werx.analysis(references, hypotheses, insertion_weight=2, deletion_weight=2, substitution_weight=1)

# --- DataFrame conversion tools ---
df_tools = {
    "Pandas (Standard)": lambda: to_pandas(results_standard),
    "Polars (Standard)": lambda: to_polars(results_standard),
    "Pandas (Weighted)": lambda: to_pandas(results_weighted),
    "Polars (Weighted)": lambda: to_polars(results_weighted),
}

# --- Run + time each DataFrame conversion using timeit ---
df_results = []
n_repeats = 10

for name, func in df_tools.items():
    total_time = timeit.timeit(func, number=n_repeats)
    avg_time = total_time / n_repeats
    # Actually create the DataFrame once for display
    df = func()
    df_results.append((name, df, avg_time))

# --- Sort by fastest execution time ---
df_results.sort(key=lambda x: x[2])

# --- Print CLI-friendly table ---
print("\nWERX Analysis: DataFrame Conversion Benchmark (Ordered by Speed)\n")
print(f"{'Method':<20} {'Rows':<8} {'Cols':<6} {'Time (s)':<12}")
print("-" * 50)
for name, df, t in df_results:
    n_rows = len(df)
    n_cols = len(df.columns)
    print(f"{name:<20} {n_rows:<8} {n_cols:<6} {t:.6f}")

# --- Optionally, show a preview of each DataFrame ---
for name, df, _ in df_results:
    print(f"\n{name} DataFrame preview:")
    print(df.head())
```
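The benchmark above uses the standard `timeit` averaging pattern: time a zero-argument callable over `number` repetitions, then divide to get a mean per-call time. A stdlib-only sketch of the same idea, with a toy workload standing in for the DataFrame conversions:

```python
import timeit

def build_rows():
    # Toy workload standing in for a DataFrame conversion.
    return [{"wer": i / 100, "n_ref": i} for i in range(1_000)]

n_repeats = 10
total = timeit.timeit(build_rows, number=n_repeats)  # total seconds for 10 calls
avg = total / n_repeats                              # mean seconds per call
print(f"avg seconds per call: {avg:.6f}")
```

Passing the callable directly (rather than a statement string) lets `timeit` capture closed-over state, which is what allows the benchmark to time `to_pandas(results_standard)` without re-running `werx.analysis` each repetition.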

benchmarks/wer_analysis_results.py

Lines changed: 111 additions & 0 deletions
```python
import werx
from werx.utils import to_polars, to_pandas

# ****************************
# Test 1 - Sentence-Level WER Analysis with Word-Level Errors
# ****************************

ref = [
    'it is consumed domestically and exported to other countries',
    'rufino street in makati right inside the makati central business district',
    'its estuary is considered to have abnormally low rates of dissolved oxygen',
    'he later cited his first wife anita as the inspiration for the song',
    'no one else could claim that'
]

hyp = [
    'it is consumed domestically and exported to other countries',
    'rofino street in mccauti right inside the macasi central business district',
    'its estiary is considered to have a normally low rates of dissolved oxygen',
    'he later sighted his first wife anita as the inspiration for the song',
    'no one else could claim that'
]

results = werx.analysis(ref, hyp)
print("===== Test 1: Full Analysis Results =====")
print(results)

# Inspect detailed word-level analysis for the second sentence (index 1)
second_result = results[1]

print("\n===== Test 1: Detailed Word-Level Errors for Sentence 2 =====")
print("Inserted Words :", second_result.inserted_words)
print("Deleted Words :", second_result.deleted_words)
print("Substituted Words:", second_result.substituted_words)


# ****************************
# Test 2 - Simple WER Calculation Example
# ****************************

ref2 = ["this is a test"]
hyp2 = ["this was a test"]
results = werx.analysis(ref2, hyp2)

print("\n===== Test 2: Simple WER Calculation =====")
print(f"WER: {results[0].wer}")

# ****************************
# Test 3 - Weighted WER Calculation Example
# ****************************

ref3 = ["i love cold pizza", "the sugar bear character was popular"]
hyp3 = ["i love pizza", "the sugar bare character was popular"]
results = werx.analysis(ref3, hyp3, insertion_weight=2, deletion_weight=2, substitution_weight=1)

print("\n===== Test 3: Weighted WER Calculation =====")
print(f"Weighted WER (wwer): {results[0].wwer}")

# ****************************
# Test 4 - Results with Polars Example
# ****************************

ref4 = ["i love cold pizza", "the sugar bear character was popular"]
hyp4 = ["i love pizza", "the sugar bare character was popular"]

results = werx.analysis(ref4, hyp4)

# Convert results to Polars DataFrame
df_polars = to_polars(results)

print("\n===== Test 4: Polars DataFrame Output =====")
print(df_polars)

# ****************************
# Test 5 - Weighted WER with Polars Example
# ****************************

results = werx.analysis(ref, hyp, insertion_weight=2, deletion_weight=2, substitution_weight=1)

# Convert results to Polars DataFrame
df_polars = to_polars(results)

print("\n===== Test 5: Polars DataFrame Output =====")
print(df_polars)

# ****************************
# Test 6 - Results with Pandas Example
# ****************************

ref6 = ["i love cold pizza", "the sugar bear character was popular"]
hyp6 = ["i love pizza", "the sugar bare character was popular"]

results = werx.analysis(ref6, hyp6)

# Convert results to Pandas DataFrame
df_pandas = to_pandas(results)

print("\n===== Test 6: Pandas DataFrame Output =====")
print(df_pandas)

# ****************************
# Test 7 - Weighted WER with Pandas Example
# ****************************

results = werx.analysis(ref, hyp, insertion_weight=2, deletion_weight=2, substitution_weight=1)

# Convert results to Pandas DataFrame
df_pandas = to_pandas(results)

print("\n===== Test 7: Pandas DataFrame Output =====")
print(df_pandas)
```

pyproject.toml

Lines changed: 13 additions & 8 deletions
```diff
@@ -1,6 +1,6 @@
 [project]
 name = "werx"
-version = "0.2.0"
+version = "0.3.0"
 description = "A high-performance Python package for calculating Word Error Rate (WER), powered by Rust."
 readme = "README.md"
 authors = [
@@ -9,7 +9,7 @@ authors = [
 requires-python = ">=3.10"
 license = {file = 'LICENSE'}
 classifiers = [
-    "Development Status :: 4 - Beta",
+    "Development Status :: 5 - Production/Stable",
     "License :: OSI Approved :: Apache Software License",
     "Programming Language :: Python :: 3",
     "Programming Language :: Python :: 3 :: Only",
@@ -60,14 +60,19 @@ dev = [
     "ruff >=0.11.8",
 ]
 
+dataframes = [
+    "polars >=1.29.0",
+    "pandas >=2.2.3"
+]
+
 benchmarks = [
-    "werpy",
-    "jiwer",
-    "pywer",
+    "werpy >=3.1.0",
+    "jiwer >=3.1.0",
+    "pywer >=0.1.1",
     "torchmetrics",
-    "evaluate",
-    "memory-profiler",
-    "datasets",
+    "evaluate >=0.4.3",
+    "memory-profiler >=0.61.0",
+    "datasets >=3.6.0",
 ]
 
 [project.urls]
```

src/werx/__init__.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -1,5 +1,7 @@
-__version__ = "0.2.0"
+__version__ = "0.3.0"
 from .wer import wer
 from .weighted_wer import weighted_wer, wwer
+from .wer_analysis import analysis
+from .utils import to_polars, to_pandas
 
-__all__ = ["wer", "weighted_wer", "wwer"]
+__all__ = ["wer", "weighted_wer", "wwer", "analysis", "to_polars", "to_pandas"]
```
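The expanded `__all__` above controls what `from werx import *` exposes. A stdlib-only illustration of the mechanism, using a throwaway in-memory module (all names here are invented for the demo):

```python
import sys
import types

# Build a throwaway module mimicking the werx __init__ re-export pattern.
mod = types.ModuleType("toy_werx")
mod.wer = lambda ref, hyp: 0.0
mod.analysis = lambda ref, hyp: []
mod._internal = object()           # not listed, so star-import skips it
mod.__all__ = ["wer", "analysis"]  # the star-import surface
sys.modules["toy_werx"] = mod

ns = {}
exec("from toy_werx import *", ns)
print(sorted(n for n in ns if not n.startswith("_")))  # ['analysis', 'wer']
```

Without `__all__`, a star-import would fall back to every non-underscore name, so listing the public API explicitly keeps re-exported internals out of consumers' namespaces.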
