Skip to content

Conversation

@cardmagic
Copy link
Owner

Summary

Improves text summarization by fixing sentence and paragraph splitting edge cases.

Changes

Sentence splitting now handles:

  • Abbreviations: Mr., Mrs., Dr., Prof., Inc., Corp., etc.
  • Decimal numbers: $3.50, 2.5, etc.
  • No trailing space: "Hello.World" splits correctly

Optional high-accuracy mode:

If pragmatic_segmenter gem is installed, uses it for 98%+ accuracy on edge cases.

Paragraph splitting improved:

  • Handles Unix (\n\n) and Windows (\r\n\r\n) line endings
  • Handles multiple consecutive newlines

Test plan

  • Added 13 new tests for sentence/paragraph splitting
  • Tests cover abbreviations, decimals, exclamations, mixed line endings
  • All 221 tests pass

Closes #68

Sources:

@cardmagic cardmagic force-pushed the fix/summary-sentence-splitting branch from 493fb0b to 276f790 Compare December 28, 2025 21:20
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 28, 2025

Greptile Summary

Fixes sentence and paragraph splitting edge cases in text summarization by handling abbreviations (Mr., Dr., Inc., etc.), decimal numbers ($3.50, 2.5), and mixed line endings. Adds optional high-accuracy mode via pragmatic_segmenter gem.

  • Replaces primitive regex patterns with abbreviation-aware splitting logic
  • Handles both Unix and Windows line endings in paragraph splitting
  • Adds 13 comprehensive tests covering all edge cases
  • Simplifies redundant filtering logic in perform_lsi:47
  • Contains nested conditional in perform_lsi:42-45 that violates style guide (should use early return)

Confidence Score: 4/5

  • Safe to merge with minor style improvements recommended
  • Core logic is sound with comprehensive test coverage. Two style issues exist (nested conditional, redundant filter) but don't affect functionality. The improvements address real edge cases that were previously broken.
  • lib/classifier/lsi/summary.rb needs minor refactoring for cleaner conditional structure

Important Files Changed

Filename Overview
lib/classifier/lsi/summary.rb Improved sentence/paragraph splitting with abbreviation and decimal handling. Contains nested conditional and redundant filter that should be simplified.
test/lsi/summary_test.rb Comprehensive test coverage for new splitting logic covering abbreviations, decimals, and line endings. All tests are well-structured.
README.md Added documentation for text summarization feature with optional pragmatic_segmenter gem integration. Clear and accurate.

Sequence Diagram

sequenceDiagram
    participant User
    participant String
    participant Summary as split_sentences_regex
    participant LSI as Classifier::LSI
    
    User->>String: text.summary(count)
    String->>String: split_sentences()
    alt PragmaticSegmenter available
        String->>String: pragmatic_segment()
        Note over String: Uses PragmaticSegmenter gem
    else Fallback to regex
        String->>Summary: split_sentences_regex()
        Summary->>Summary: Replace abbreviations with placeholder
        Note over Summary: Mr., Dr., Inc. → <<<DOT>>>
        Summary->>Summary: Replace decimals with placeholder
        Note over Summary: 3.50 → 3<<<DOT>>>50
        Summary->>Summary: Split on sentence boundaries
        Note over Summary: /(?<=[.!?])(?:\s+|(?=[A-Z]))/
        Summary->>Summary: Restore dots from placeholder
        Summary-->>String: sentences array
    end
    String->>LSI: perform_lsi(sentences, count, separator)
    LSI->>LSI: Create new LSI instance
    loop each sentence
        LSI->>LSI: Filter empty/single-word chunks
        LSI->>LSI: Add chunk to index
    end
    LSI->>LSI: build_index() - SVD computation
    LSI->>LSI: highest_relative_content(count)
    LSI-->>String: top sentences
    String-->>User: joined summary string
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 6 comments

Edit Code Review Agent Settings | Greptile

# License:: LGPL

class String
# Common abbreviations that shouldn't trigger sentence breaks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: remove comment - constant name is self-explanatory

Suggested change
# Common abbreviations that shouldn't trigger sentence breaks
ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 6:6

Comment:
**style:** remove comment - constant name is self-explanatory

```suggestion
  ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

def split_sentences_regex
abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|')
text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') }
text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: remove comment - restates what the code shows

Suggested change
text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals
text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2')

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 36:36

Comment:
**style:** remove comment - restates what the code shows

```suggestion
    text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2')
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|')
text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') }
text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals
# Split on punctuation followed by space, or punctuation followed by capital letter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: remove comment - obvious from the regex pattern

Suggested change
# Split on punctuation followed by space, or punctuation followed by capital letter
sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 37:37

Comment:
**style:** remove comment - obvious from the regex pattern

```suggestion
    sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

require_relative '../test_helper'

class SummaryTest < Minitest::Test
# Sentence splitting tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: remove section comment - test method names are self-documenting

Suggested change
# Sentence splitting tests
def test_split_sentences_basic

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/summary_test.rb
Line: 4:4

Comment:
**style:** remove section comment - test method names are self-documenting

```suggestion
  def test_split_sentences_basic
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

assert_includes sentences[0], 'etc.'
end

# Paragraph splitting tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: remove section comment - test method names are self-documenting

Suggested change
# Paragraph splitting tests
def test_split_paragraphs_double_newline

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/summary_test.rb
Line: 70:70

Comment:
**style:** remove section comment - test method names are self-documenting

```suggestion
  def test_split_paragraphs_double_newline
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

assert_equal 3, paragraphs.size
end

# Integration tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: remove section comment - test method names are self-documenting

Suggested change
# Integration tests
def test_summary_returns_string

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/summary_test.rb
Line: 104:104

Comment:
**style:** remove section comment - test method names are self-documenting

```suggestion
  def test_summary_returns_string
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@cardmagic cardmagic force-pushed the fix/summary-sentence-splitting branch from 276f790 to d44e11e Compare December 28, 2025 21:25
@cardmagic
Copy link
Owner Author

@greptile-apps Removed all comments as suggested. Please re-review.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. lib/classifier/lsi/summary.rb, line 45 (link)

    logic: summaries.select { |chunk| summaries.include?(chunk) } is redundant - filtering an array by items that are already in it returns the same array

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|')
text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') }
text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2')
sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: regex pattern (?:\s+|(?=[A-Z])) is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior.

Suggested change
sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)
sentences = text.split(/(?<=[.!?])\s+/)
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 36:36

Comment:
**logic:** regex pattern `(?:\s+|(?=[A-Z]))` is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior.

```suggestion
    sentences = text.split(/(?<=[.!?])\s+/)
```

How can I resolve this? If you propose a fix, please make it concise.

@cardmagic cardmagic force-pushed the fix/summary-sentence-splitting branch from d44e11e to f101ddd Compare December 28, 2025 21:42
@cardmagic
Copy link
Owner Author

@greptile-apps Fixed the redundant .select call.

Regarding the regex suggestion to remove (?=[A-Z]):

The capital letter lookahead is intentional - it handles cases where sentences are concatenated without spaces (e.g., "Dogs.This text"). This is needed because the existing test data uses .join without separators.

The "inconsistent" behavior is actually correct:

  • "Hello.World" → splits (capital after period = likely new sentence)
  • "Hello.world" → doesn't split (lowercase after period = likely abbreviation or domain)

Removing it would break tests. Please re-review.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile


def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: chunk.strip called twice - store in variable

Suggested change
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
stripped = chunk.strip
chunks.each { |chunk| lsi << chunk unless stripped.empty? || stripped.split.size == 1 }

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 42:42

Comment:
**style:** `chunk.strip` called twice - store in variable

```suggestion
    stripped = chunk.strip
    chunks.each { |chunk| lsi << chunk unless stripped.empty? || stripped.split.size == 1 }
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 40 to 50
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
lsi.build_index
summaries = lsi.highest_relative_content count
summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator)
summaries.map(&:strip).join(separator)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: calls chunk.strip twice per iteration - store result

Suggested change
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
lsi.build_index
summaries = lsi.highest_relative_content count
summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator)
summaries.map(&:strip).join(separator)
end
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each do |chunk|
stripped = chunk.strip
lsi << chunk unless stripped.empty? || stripped.split.size == 1
end
lsi.build_index
summaries = lsi.highest_relative_content count
summaries.map(&:strip).join(separator)
end
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 40:46

Comment:
**style:** calls `chunk.strip` twice per iteration - store result

```suggestion
  def perform_lsi(chunks, count, separator)
    lsi = Classifier::LSI.new auto_rebuild: false
    chunks.each do |chunk|
      stripped = chunk.strip
      lsi << chunk unless stripped.empty? || stripped.split.size == 1
    end
    lsi.build_index
    summaries = lsi.highest_relative_content count
    summaries.map(&:strip).join(separator)
  end
```

How can I resolve this? If you propose a fix, please make it concise.

@cardmagic cardmagic force-pushed the fix/summary-sentence-splitting branch from f101ddd to dc2f031 Compare December 28, 2025 21:46
@cardmagic
Copy link
Owner Author

@greptile-apps Fixed - stored chunk.strip in variable. Please re-review.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 40 to 47
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
chunks.each do |chunk|
stripped = chunk.strip
lsi << chunk unless stripped.empty? || stripped.split.size == 1
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: nested conditional - use early return instead

Suggested change
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
chunks.each do |chunk|
stripped = chunk.strip
lsi << chunk unless stripped.empty? || stripped.split.size == 1
end
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each do |chunk|
stripped = chunk.strip
next if stripped.empty? || stripped.split.size == 1
lsi << chunk
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 40:45

Comment:
**style:** nested conditional - use early return instead

```suggestion
  def perform_lsi(chunks, count, separator)
    lsi = Classifier::LSI.new auto_rebuild: false
    chunks.each do |chunk|
      stripped = chunk.strip
      next if stripped.empty? || stripped.split.size == 1
      lsi << chunk
    end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

lsi.build_index
summaries = lsi.highest_relative_content count
summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator)
lsi.highest_relative_content(count).map(&:strip).join(separator)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: redundant filter - highest_relative_content already returns valid summaries

Suggested change
lsi.highest_relative_content(count).map(&:strip).join(separator)
lsi.highest_relative_content(count).map(&:strip).join(separator)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 47:47

Comment:
**style:** redundant filter - `highest_relative_content` already returns valid summaries

```suggestion
    lsi.highest_relative_content(count).map(&:strip).join(separator)
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Sentence splitting now handles common edge cases:
- Abbreviations (Mr., Dr., Inc., Corp., etc.)
- Decimal numbers ($3.50)
- Sentences without trailing spaces

Also supports pragmatic_segmenter gem for higher accuracy when installed.

Paragraph splitting improved to handle:
- Unix (\n\n) and Windows (\r\n\r\n) line endings
- Multiple consecutive newlines

Closes #68
@cardmagic cardmagic force-pushed the fix/summary-sentence-splitting branch from dc2f031 to 6a8cb25 Compare December 28, 2025 22:36
@cardmagic
Copy link
Owner Author

@greptile-apps Applied next guard clause. Ready for final review.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I can see you've applied the next guard clause improvement in the perform_lsi method. The code now properly skips empty chunks and single-word chunks using early returns, which is much cleaner than the previous nested conditional logic.

The implementation looks solid:

Guard clause properly implemented - Using next to skip processing of invalid chunks
Logic preserved - Same filtering behavior (empty strings and single-word chunks)
Readability improved - Clear early returns instead of nested conditions
Comprehensive test coverage - 13 new tests covering all the edge cases for sentence and paragraph splitting

The PR successfully addresses the sentence and paragraph splitting issues while maintaining clean, readable code. The optional integration with pragmatic_segmenter for high-accuracy sentence boundary detection is a nice touch for users who need it.

This is ready to merge! 🚀

@cardmagic cardmagic merged commit b562e6c into master Dec 28, 2025
7 checks passed
@cardmagic cardmagic deleted the fix/summary-sentence-splitting branch December 28, 2025 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve sentence/paragraph splitting in Summary

2 participants