fix(lsi): improve sentence and paragraph splitting in Summary #98

cardmagic · 2025-12-28T21:19:23Z

Summary

Improves text summarization by fixing sentence and paragraph splitting edge cases.

Changes

Sentence splitting now handles:

Abbreviations: Mr., Mrs., Dr., Prof., Inc., Corp., etc.
Decimal numbers: $3.50, 2.5, etc.
No trailing space: "Hello.World" splits correctly

Optional high-accuracy mode:

If pragmatic_segmenter gem is installed, uses it for 98%+ accuracy on edge cases.

Paragraph splitting improved:

Handles Unix (\n\n) and Windows (\r\n\r\n) line endings
Handles multiple consecutive newlines

Test plan

Added 13 new tests for sentence/paragraph splitting
Tests cover abbreviations, decimals, exclamations, mixed line endings
All 221 tests pass

Closes #68

Sources:

pragmatic_segmenter - Ruby sentence boundary detection

greptile-apps · 2025-12-28T21:22:04Z

Greptile Summary

Fixes sentence and paragraph splitting edge cases in text summarization by handling abbreviations (Mr., Dr., Inc., etc.), decimal numbers ($3.50, 2.5), and mixed line endings. Adds optional high-accuracy mode via pragmatic_segmenter gem.

Replaces primitive regex patterns with abbreviation-aware splitting logic
Handles both Unix and Windows line endings in paragraph splitting
Adds 13 comprehensive tests covering all edge cases
Simplifies redundant filtering logic in perform_lsi:47
Contains nested conditional in perform_lsi:42-45 that violates style guide (should use early return)

Confidence Score: 4/5

Safe to merge with minor style improvements recommended
Core logic is sound with comprehensive test coverage. Two style issues exist (nested conditional, redundant filter) but don't affect functionality. The improvements address real edge cases that were previously broken.
lib/classifier/lsi/summary.rb needs minor refactoring for cleaner conditional structure

Important Files Changed

Filename	Overview
lib/classifier/lsi/summary.rb	Improved sentence/paragraph splitting with abbreviation and decimal handling. Contains nested conditional and redundant filter that should be simplified.
test/lsi/summary_test.rb	Comprehensive test coverage for new splitting logic covering abbreviations, decimals, and line endings. All tests are well-structured.
README.md	Added documentation for text summarization feature with optional pragmatic_segmenter gem integration. Clear and accurate.

Sequence Diagram

sequenceDiagram
    participant User
    participant String
    participant Summary as split_sentences_regex
    participant LSI as Classifier::LSI
    
    User->>String: text.summary(count)
    String->>String: split_sentences()
    alt PragmaticSegmenter available
        String->>String: pragmatic_segment()
        Note over String: Uses PragmaticSegmenter gem
    else Fallback to regex
        String->>Summary: split_sentences_regex()
        Summary->>Summary: Replace abbreviations with placeholder
        Note over Summary: Mr., Dr., Inc. → <<<DOT>>>
        Summary->>Summary: Replace decimals with placeholder
        Note over Summary: 3.50 → 3<<<DOT>>>50
        Summary->>Summary: Split on sentence boundaries
        Note over Summary: /(?<=[.!?])(?:\s+|(?=[A-Z]))/
        Summary->>Summary: Restore dots from placeholder
        Summary-->>String: sentences array
    end
    String->>LSI: perform_lsi(sentences, count, separator)
    LSI->>LSI: Create new LSI instance
    loop each sentence
        LSI->>LSI: Filter empty/single-word chunks
        LSI->>LSI: Add chunk to index
    end
    LSI->>LSI: build_index() - SVD computation
    LSI->>LSI: highest_relative_content(count)
    LSI-->>String: top sentences
    String-->>User: joined summary string

greptile-apps

_{2 files reviewed, 6 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2025-12-28T21:22:01Z

lib/classifier/lsi/summary.rb

 # License::   LGPL

 class String
+  # Common abbreviations that shouldn't trigger sentence breaks


style: remove comment - constant name is self-explanatory

Suggested change

# Common abbreviations that shouldn't trigger sentence breaks

ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 6:6 Comment: **style:** remove comment - constant name is self-explanatory ```suggestion ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:22:01Z

lib/classifier/lsi/summary.rb

+  def split_sentences_regex
+    abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|')
+    text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') }
+    text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals


style: remove comment - restates what the code shows

Suggested change

text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals

text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2')

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 36:36 Comment: **style:** remove comment - restates what the code shows ```suggestion text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:22:02Z

lib/classifier/lsi/summary.rb

+    abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|')
+    text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') }
+    text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals
+    # Split on punctuation followed by space, or punctuation followed by capital letter


style: remove comment - obvious from the regex pattern

Suggested change

# Split on punctuation followed by space, or punctuation followed by capital letter

sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 37:37 Comment: **style:** remove comment - obvious from the regex pattern ```suggestion sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/) ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:22:02Z

test/lsi/summary_test.rb

+require_relative '../test_helper'
+
+class SummaryTest < Minitest::Test
+  # Sentence splitting tests


style: remove section comment - test method names are self-documenting

Suggested change

# Sentence splitting tests

def test_split_sentences_basic

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: test/lsi/summary_test.rb Line: 4:4 Comment: **style:** remove section comment - test method names are self-documenting ```suggestion def test_split_sentences_basic ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:22:03Z

test/lsi/summary_test.rb

+    assert_includes sentences[0], 'etc.'
+  end
+
+  # Paragraph splitting tests


style: remove section comment - test method names are self-documenting

Suggested change

# Paragraph splitting tests

def test_split_paragraphs_double_newline

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: test/lsi/summary_test.rb Line: 70:70 Comment: **style:** remove section comment - test method names are self-documenting ```suggestion def test_split_paragraphs_double_newline ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:22:03Z

test/lsi/summary_test.rb

+    assert_equal 3, paragraphs.size
+  end
+
+  # Integration tests


style: remove section comment - test method names are self-documenting

Suggested change

# Integration tests

def test_summary_returns_string

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: test/lsi/summary_test.rb Line: 104:104 Comment: **style:** remove section comment - test method names are self-documenting ```suggestion def test_summary_returns_string ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

cardmagic · 2025-12-28T21:26:04Z

@greptile-apps Removed all comments as suggested. Please re-review.

greptile-apps

Additional Comments (1)

lib/classifier/lsi/summary.rb, line 45 (link)

logic: summaries.select { |chunk| summaries.include?(chunk) } is redundant - filtering an array by items that are already in it returns the same array

_{3 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2025-12-28T21:28:18Z

lib/classifier/lsi/summary.rb

+    abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|')
+    text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') }
+    text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2')
+    sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)


logic: regex pattern (?:\s+|(?=[A-Z])) is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior.

Suggested change

sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)

sentences = text.split(/(?<=[.!?])\s+/)

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 36:36 Comment: **logic:** regex pattern `(?:\s+|(?=[A-Z]))` is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior. ```suggestion sentences = text.split(/(?<=[.!?])\s+/) ``` How can I resolve this? If you propose a fix, please make it concise.

cardmagic · 2025-12-28T21:42:21Z

@greptile-apps Fixed the redundant .select call.

Regarding the regex suggestion to remove (?=[A-Z]):

The capital letter lookahead is intentional - it handles cases where sentences are concatenated without spaces (e.g., "Dogs.This text"). This is needed because the existing test data uses .join without separators.

The "inconsistent" behavior is actually correct:

"Hello.World" → splits (capital after period = likely new sentence)
"Hello.world" → doesn't split (lowercase after period = likely abbreviation or domain)

Removing it would break tests. Please re-review.

greptile-apps

_{3 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2025-12-28T21:44:51Z

lib/classifier/lsi/summary.rb

+
  def perform_lsi(chunks, count, separator)
    lsi = Classifier::LSI.new auto_rebuild: false
    chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }


style: chunk.strip called twice - store in variable

Suggested change

chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }

stripped = chunk.strip

chunks.each { |chunk| lsi << chunk unless stripped.empty? || stripped.split.size == 1 }

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 42:42 Comment: **style:** `chunk.strip` called twice - store in variable ```suggestion stripped = chunk.strip chunks.each { |chunk| lsi << chunk unless stripped.empty? || stripped.split.size == 1 } ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:44:52Z

lib/classifier/lsi/summary.rb

  def perform_lsi(chunks, count, separator)
    lsi = Classifier::LSI.new auto_rebuild: false
    chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
    lsi.build_index
    summaries = lsi.highest_relative_content count
-    summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator)
+    summaries.map(&:strip).join(separator)
  end


style: calls chunk.strip twice per iteration - store result

Suggested change

def perform_lsi(chunks, count, separator)

lsi = Classifier::LSI.new auto_rebuild: false

chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }

lsi.build_index

summaries = lsi.highest_relative_content count

summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator)

summaries.map(&:strip).join(separator)

end

def perform_lsi(chunks, count, separator)

lsi = Classifier::LSI.new auto_rebuild: false

chunks.each do |chunk|

stripped = chunk.strip

lsi << chunk unless stripped.empty? || stripped.split.size == 1

end

lsi.build_index

summaries = lsi.highest_relative_content count

summaries.map(&:strip).join(separator)

end

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 40:46 Comment: **style:** calls `chunk.strip` twice per iteration - store result ```suggestion def perform_lsi(chunks, count, separator) lsi = Classifier::LSI.new auto_rebuild: false chunks.each do |chunk| stripped = chunk.strip lsi << chunk unless stripped.empty? || stripped.split.size == 1 end lsi.build_index summaries = lsi.highest_relative_content count summaries.map(&:strip).join(separator) end ``` How can I resolve this? If you propose a fix, please make it concise.

cardmagic · 2025-12-28T21:46:09Z

@greptile-apps Fixed - stored chunk.strip in variable. Please re-review.

greptile-apps

_{3 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2025-12-28T21:48:54Z

lib/classifier/lsi/summary.rb

  def perform_lsi(chunks, count, separator)
    lsi = Classifier::LSI.new auto_rebuild: false
-    chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
+    chunks.each do |chunk|
+      stripped = chunk.strip
+      lsi << chunk unless stripped.empty? || stripped.split.size == 1
+    end


style: nested conditional - use early return instead

Suggested change

def perform_lsi(chunks, count, separator)

lsi = Classifier::LSI.new auto_rebuild: false

chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }

chunks.each do |chunk|

stripped = chunk.strip

lsi << chunk unless stripped.empty? || stripped.split.size == 1

end

def perform_lsi(chunks, count, separator)

lsi = Classifier::LSI.new auto_rebuild: false

chunks.each do |chunk|

stripped = chunk.strip

next if stripped.empty? || stripped.split.size == 1

lsi << chunk

end

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 40:45 Comment: **style:** nested conditional - use early return instead ```suggestion def perform_lsi(chunks, count, separator) lsi = Classifier::LSI.new auto_rebuild: false chunks.each do |chunk| stripped = chunk.strip next if stripped.empty? || stripped.split.size == 1 lsi << chunk end ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2025-12-28T21:48:54Z

lib/classifier/lsi/summary.rb

    lsi.build_index
-    summaries = lsi.highest_relative_content count
-    summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator)
+    lsi.highest_relative_content(count).map(&:strip).join(separator)


style: redundant filter - highest_relative_content already returns valid summaries

Suggested change

lsi.highest_relative_content(count).map(&:strip).join(separator)

lsi.highest_relative_content(count).map(&:strip).join(separator)

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: lib/classifier/lsi/summary.rb Line: 47:47 Comment: **style:** redundant filter - `highest_relative_content` already returns valid summaries ```suggestion lsi.highest_relative_content(count).map(&:strip).join(separator) ``` Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! How can I resolve this? If you propose a fix, please make it concise.

Sentence splitting now handles common edge cases: - Abbreviations (Mr., Dr., Inc., Corp., etc.) - Decimal numbers ($3.50) - Sentences without trailing spaces Also supports pragmatic_segmenter gem for higher accuracy when installed. Paragraph splitting improved to handle: - Unix (\n\n) and Windows (\r\n\r\n) line endings - Multiple consecutive newlines Closes #68

cardmagic · 2025-12-28T22:45:06Z

@greptile-apps Applied next guard clause. Ready for final review.

greptile-apps

Great! I can see you've applied the next guard clause improvement in the perform_lsi method. The code now properly skips empty chunks and single-word chunks using early returns, which is much cleaner than the previous nested conditional logic.

The implementation looks solid:

✅ Guard clause properly implemented - Using next to skip processing of invalid chunks
✅ Logic preserved - Same filtering behavior (empty strings and single-word chunks)
✅ Readability improved - Clear early returns instead of nested conditions
✅ Comprehensive test coverage - 13 new tests covering all the edge cases for sentence and paragraph splitting

The PR successfully addresses the sentence and paragraph splitting issues while maintaining clean, readable code. The optional integration with pragmatic_segmenter for high-accuracy sentence boundary detection is a nice touch for users who need it.

This is ready to merge! 🚀

cardmagic force-pushed the fix/summary-sentence-splitting branch from 493fb0b to 276f790 Compare December 28, 2025 21:20