-
Notifications
You must be signed in to change notification settings - Fork 123
fix(lsi): improve sentence and paragraph splitting in Summary #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
493fb0b to
276f790
Compare
Greptile SummaryFixes sentence and paragraph splitting edge cases in text summarization by handling abbreviations (Mr., Dr., Inc., etc.), decimal numbers ($3.50, 2.5), and mixed line endings. Adds optional high-accuracy mode via
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant String
participant Summary as split_sentences_regex
participant LSI as Classifier::LSI
User->>String: text.summary(count)
String->>String: split_sentences()
alt PragmaticSegmenter available
String->>String: pragmatic_segment()
Note over String: Uses PragmaticSegmenter gem
else Fallback to regex
String->>Summary: split_sentences_regex()
Summary->>Summary: Replace abbreviations with placeholder
Note over Summary: Mr., Dr., Inc. → <<<DOT>>>
Summary->>Summary: Replace decimals with placeholder
Note over Summary: 3.50 → 3<<<DOT>>>50
Summary->>Summary: Split on sentence boundaries
Note over Summary: /(?<=[.!?])(?:\s+|(?=[A-Z]))/
Summary->>Summary: Restore dots from placeholder
Summary-->>String: sentences array
end
String->>LSI: perform_lsi(sentences, count, separator)
LSI->>LSI: Create new LSI instance
loop each sentence
LSI->>LSI: Filter empty/single-word chunks
LSI->>LSI: Add chunk to index
end
LSI->>LSI: build_index() - SVD computation
LSI->>LSI: highest_relative_content(count)
LSI-->>String: top sentences
String-->>User: joined summary string
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 6 comments
lib/classifier/lsi/summary.rb
Outdated
| # License:: LGPL | ||
|
|
||
| class String | ||
| # Common abbreviations that shouldn't trigger sentence breaks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: remove comment - constant name is self-explanatory
| # Common abbreviations that shouldn't trigger sentence breaks | |
| ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 6:6
Comment:
**style:** remove comment - constant name is self-explanatory
```suggestion
ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi/summary.rb
Outdated
| def split_sentences_regex | ||
| abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|') | ||
| text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') } | ||
| text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: remove comment - restates what the code shows
| text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals | |
| text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 36:36
Comment:
**style:** remove comment - restates what the code shows
```suggestion
text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2')
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi/summary.rb
Outdated
| abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|') | ||
| text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') } | ||
| text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') # Preserve decimals | ||
| # Split on punctuation followed by space, or punctuation followed by capital letter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: remove comment - obvious from the regex pattern
| # Split on punctuation followed by space, or punctuation followed by capital letter | |
| sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 37:37
Comment:
**style:** remove comment - obvious from the regex pattern
```suggestion
sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/)
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
test/lsi/summary_test.rb
Outdated
| require_relative '../test_helper' | ||
|
|
||
| class SummaryTest < Minitest::Test | ||
| # Sentence splitting tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: remove section comment - test method names are self-documenting
| # Sentence splitting tests | |
| def test_split_sentences_basic |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/summary_test.rb
Line: 4:4
Comment:
**style:** remove section comment - test method names are self-documenting
```suggestion
def test_split_sentences_basic
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
test/lsi/summary_test.rb
Outdated
| assert_includes sentences[0], 'etc.' | ||
| end | ||
|
|
||
| # Paragraph splitting tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: remove section comment - test method names are self-documenting
| # Paragraph splitting tests | |
| def test_split_paragraphs_double_newline |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/summary_test.rb
Line: 70:70
Comment:
**style:** remove section comment - test method names are self-documenting
```suggestion
def test_split_paragraphs_double_newline
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
test/lsi/summary_test.rb
Outdated
| assert_equal 3, paragraphs.size | ||
| end | ||
|
|
||
| # Integration tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: remove section comment - test method names are self-documenting
| # Integration tests | |
| def test_summary_returns_string |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: test/lsi/summary_test.rb
Line: 104:104
Comment:
**style:** remove section comment - test method names are self-documenting
```suggestion
def test_summary_returns_string
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.276f790 to
d44e11e
Compare
|
@greptile-apps Removed all comments as suggested. Please re-review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
lib/classifier/lsi/summary.rb, line 45 (link)logic:
summaries.select { |chunk| summaries.include?(chunk) }is redundant - filtering an array by items that are already in it returns the same array
3 files reviewed, 2 comments
| abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|') | ||
| text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') } | ||
| text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') | ||
| sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: regex pattern (?:\s+|(?=[A-Z])) is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior.
| sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/) | |
| sentences = text.split(/(?<=[.!?])\s+/) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 36:36
Comment:
**logic:** regex pattern `(?:\s+|(?=[A-Z]))` is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior.
```suggestion
sentences = text.split(/(?<=[.!?])\s+/)
```
How can I resolve this? If you propose a fix, please make it concise.d44e11e to
f101ddd
Compare
|
@greptile-apps Fixed the redundant Regarding the regex suggestion to remove The capital letter lookahead is intentional - it handles cases where sentences are concatenated without spaces (e.g., The "inconsistent" behavior is actually correct:
Removing it would break tests. Please re-review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 2 comments
lib/classifier/lsi/summary.rb
Outdated
|
|
||
| def perform_lsi(chunks, count, separator) | ||
| lsi = Classifier::LSI.new auto_rebuild: false | ||
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: chunk.strip called twice - store in variable
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } | |
| stripped = chunk.strip | |
| chunks.each { |chunk| lsi << chunk unless stripped.empty? || stripped.split.size == 1 } |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 42:42
Comment:
**style:** `chunk.strip` called twice - store in variable
```suggestion
stripped = chunk.strip
chunks.each { |chunk| lsi << chunk unless stripped.empty? || stripped.split.size == 1 }
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def perform_lsi(chunks, count, separator) | ||
| lsi = Classifier::LSI.new auto_rebuild: false | ||
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } | ||
| lsi.build_index | ||
| summaries = lsi.highest_relative_content count | ||
| summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator) | ||
| summaries.map(&:strip).join(separator) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: calls chunk.strip twice per iteration - store result
| def perform_lsi(chunks, count, separator) | |
| lsi = Classifier::LSI.new auto_rebuild: false | |
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } | |
| lsi.build_index | |
| summaries = lsi.highest_relative_content count | |
| summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator) | |
| summaries.map(&:strip).join(separator) | |
| end | |
| def perform_lsi(chunks, count, separator) | |
| lsi = Classifier::LSI.new auto_rebuild: false | |
| chunks.each do |chunk| | |
| stripped = chunk.strip | |
| lsi << chunk unless stripped.empty? || stripped.split.size == 1 | |
| end | |
| lsi.build_index | |
| summaries = lsi.highest_relative_content count | |
| summaries.map(&:strip).join(separator) | |
| end |
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 40:46
Comment:
**style:** calls `chunk.strip` twice per iteration - store result
```suggestion
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each do |chunk|
stripped = chunk.strip
lsi << chunk unless stripped.empty? || stripped.split.size == 1
end
lsi.build_index
summaries = lsi.highest_relative_content count
summaries.map(&:strip).join(separator)
end
```
How can I resolve this? If you propose a fix, please make it concise.f101ddd to
dc2f031
Compare
|
@greptile-apps Fixed - stored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 2 comments
| def perform_lsi(chunks, count, separator) | ||
| lsi = Classifier::LSI.new auto_rebuild: false | ||
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } | ||
| chunks.each do |chunk| | ||
| stripped = chunk.strip | ||
| lsi << chunk unless stripped.empty? || stripped.split.size == 1 | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: nested conditional - use early return instead
| def perform_lsi(chunks, count, separator) | |
| lsi = Classifier::LSI.new auto_rebuild: false | |
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } | |
| chunks.each do |chunk| | |
| stripped = chunk.strip | |
| lsi << chunk unless stripped.empty? || stripped.split.size == 1 | |
| end | |
| def perform_lsi(chunks, count, separator) | |
| lsi = Classifier::LSI.new auto_rebuild: false | |
| chunks.each do |chunk| | |
| stripped = chunk.strip | |
| next if stripped.empty? || stripped.split.size == 1 | |
| lsi << chunk | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 40:45
Comment:
**style:** nested conditional - use early return instead
```suggestion
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each do |chunk|
stripped = chunk.strip
next if stripped.empty? || stripped.split.size == 1
lsi << chunk
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| lsi.build_index | ||
| summaries = lsi.highest_relative_content count | ||
| summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator) | ||
| lsi.highest_relative_content(count).map(&:strip).join(separator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: redundant filter - highest_relative_content already returns valid summaries
| lsi.highest_relative_content(count).map(&:strip).join(separator) | |
| lsi.highest_relative_content(count).map(&:strip).join(separator) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 47:47
Comment:
**style:** redundant filter - `highest_relative_content` already returns valid summaries
```suggestion
lsi.highest_relative_content(count).map(&:strip).join(separator)
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.Sentence splitting now handles common edge cases: - Abbreviations (Mr., Dr., Inc., Corp., etc.) - Decimal numbers ($3.50) - Sentences without trailing spaces Also supports pragmatic_segmenter gem for higher accuracy when installed. Paragraph splitting improved to handle: - Unix (\n\n) and Windows (\r\n\r\n) line endings - Multiple consecutive newlines Closes #68
dc2f031 to
6a8cb25
Compare
|
@greptile-apps Applied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I can see you've applied the next guard clause improvement in the perform_lsi method. The code now properly skips empty chunks and single-word chunks using early returns, which is much cleaner than the previous nested conditional logic.
The implementation looks solid:
✅ Guard clause properly implemented - Using next to skip processing of invalid chunks
✅ Logic preserved - Same filtering behavior (empty strings and single-word chunks)
✅ Readability improved - Clear early returns instead of nested conditions
✅ Comprehensive test coverage - 13 new tests covering all the edge cases for sentence and paragraph splitting
The PR successfully addresses the sentence and paragraph splitting issues while maintaining clean, readable code. The optional integration with pragmatic_segmenter for high-accuracy sentence boundary detection is a nice touch for users who need it.
This is ready to merge! 🚀
Summary
Improves text summarization by fixing sentence and paragraph splitting edge cases.
Changes
Sentence splitting now handles:
Optional high-accuracy mode:
If
pragmatic_segmentergem is installed, uses it for 98%+ accuracy on edge cases.Paragraph splitting improved:
\n\n) and Windows (\r\n\r\n) line endingsTest plan
Closes #68
Sources: