Releases: apache/stormcrawler
stormcrawler-3.5.0
Summary
Apache StormCrawler 3.5.0 decouples Selenium from the core module (#1604), improving modularity and reducing unnecessary dependencies. The release also introduces an advanced metadata filtering system (#1647) that supports complex logical operations like key=>val OR (key2=>val2 AND key3=>val3), addressing issue #711.
Additionally, multiple dependencies were upgraded, core tests improved, and deprecated code cleaned up, enhancing overall stability and maintainability.
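To make the example expression above concrete, here is a minimal Java sketch of how key=>val OR (key2=>val2 AND key3=>val3) evaluates against a document's metadata. It is illustrative only and is not StormCrawler's MetadataFilter implementation: the class name MetadataExpressionSketch, the helper kv, and the use of a plain single-valued Map<String, String> for metadata are all assumptions made for this example.

```java
import java.util.Map;
import java.util.function.Predicate;

/** Illustrative only: NOT StormCrawler's MetadataFilter; all names here are hypothetical. */
public class MetadataExpressionSketch {

    /** Hypothetical helper: true when the metadata map holds exactly this value for the key. */
    public static Predicate<Map<String, String>> kv(String key, String value) {
        return md -> value.equals(md.get(key));
    }

    /** Builds the predicate for: key=>val OR (key2=>val2 AND key3=>val3). */
    public static Predicate<Map<String, String>> expression() {
        return kv("key", "val").or(kv("key2", "val2").and(kv("key3", "val3")));
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> filter = expression();
        // Left branch of the OR matches on its own.
        System.out.println(filter.test(Map.of("key", "val"))); // true
        // Both conditions of the AND group are satisfied.
        System.out.println(filter.test(Map.of("key2", "val2", "key3", "val3"))); // true
        // Only half of the AND group is present, and the left branch is missing.
        System.out.println(filter.test(Map.of("key2", "val2"))); // false
    }
}
```

The parentheses matter: the AND group only matches when both of its conditions hold, while the single condition on the left of the OR is sufficient by itself.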
Breaking Changes
Users who upgrade and use Selenium must now add the new Maven module stormcrawler-selenium to their pom.xml as follows:
<dependency>
  <groupId>org.apache.stormcrawler</groupId>
  <artifactId>stormcrawler-selenium</artifactId>
  <version>3.5.0</version>
</dependency>
What's Changed
- Bump testcontainers.version from 1.21.2 to 1.21.3 by @dependabot[bot] in #1584
- #1580 [Improvement] Convert Tags in GH actions into SHA by @Evergreenies in #1581
- Bump com.microsoft.playwright:playwright from 1.52.0 to 1.53.0 by @dependabot[bot] in #1582
- Bump junit.version from 5.13.1 to 5.13.2 by @dependabot[bot] in #1583
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1585
- Make MultiProxyManagerTest.testMultiProxyManagerConstructorFile() OS independent by @sigee in #1589
- Bump dependencies version (Commons CLI 1.9.0, OpenSearch 2.19.2) by @sigee in #1586
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.5.0 to 3.6.0 by @dependabot[bot] in #1593
- Bump junit.version from 5.13.2 to 5.13.3 by @dependabot[bot] in #1591
- Bump selenium.version from 4.33.0 to 4.34.0 by @dependabot[bot] in #1594
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1595
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1596
- Bump tika.version from 3.2.0 to 3.2.1 by @dependabot[bot] in #1599
- Bump com.github.ben-manes.caffeine:caffeine from 3.2.1 to 3.2.2 by @dependabot[bot] in #1600
- #108 Replace custom HttpHeaders constants with the org.apache.http.HttpHeaders ones by @sigee in #1587
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1602
- Clean up core test assertions by @sigee in #1603
- Create consistent API for Metadata by @sigee in #1598
- Bump com.github.crawler-commons:crawler-commons from 1.4 to 1.5 by @dependabot[bot] in #1590
- Bump org.netpreserve:jwarc from 0.31.1 to 0.32.0 by @dependabot[bot] in #1606
- Bump aws.version from 1.12.787 to 1.12.788 by @dependabot[bot] in #1605
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.6.0 to 3.6.1 by @dependabot[bot] in #1607
- Bump okhttp.version from 4.12.0 to 5.1.0 by @dependabot[bot] in #1601
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1610
- Fix java 5 language level issues by @sigee in #1609
- #1560 - Test coverage for Solr cloud by @mvolikas in #1608
- Bump org.apache.solr:solr-solrj from 9.8.1 to 9.9.0 by @dependabot[bot] in #1616
- Bump com.microsoft.playwright:playwright from 1.53.0 to 1.54.0 by @dependabot[bot] in #1615
- Bump opensearch.version from 2.19.2 to 2.19.3 by @dependabot[bot] in #1613
- Bump junit.version from 5.13.3 to 5.13.4 by @dependabot[bot] in #1614
- Bump commons-cli:commons-cli from 1.9.0 to 1.10.0 by @dependabot[bot] in #1618
- Bump dev.langchain4j:langchain4j-open-ai from 1.1.0 to 1.2.0 by @dependabot[bot] in #1620
- Bump actions/cache from 4.2.3 to 4.2.4 by @dependabot[bot] in #1622
- Bump storm-client.version from 2.8.1 to 2.8.2 by @dependabot[bot] in #1619
- Bump dev.langchain4j:langchain4j from 1.1.0 to 1.2.0 by @dependabot[bot] in #1621
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1617
- Bump dev.langchain4j:langchain4j-open-ai from 1.2.0 to 1.3.0 by @dependabot[bot] in #1623
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1624
- Bump tika.version from 3.2.1 to 3.2.2 by @dependabot[bot] in #1625
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1626
- Bump actions/checkout from 4.2.2 to 5.0.0 by @dependabot[bot] in #1627
- Fix NOP logger configuration in core/pom.xml by @HrishikeshUchake in #1628
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1629
- Bump selenium.version from 4.34.0 to 4.35.0 by @dependabot[bot] in #1634
- chore: remove magic number from filterPathRepeat method by @TamimEhsan in #1631
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.11.2 to 3.11.3 by @dependabot[bot] in #1633
- Bump org.mockito:mockito-core from 5.18.0 to 5.19.0 by @dependabot[bot] in #1632
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1635
- Bump actions/setup-java from 4.7.1 to 5.0.0 by @dependabot[bot] in #1636
- #1597 replace deprecated use of URL constructor by @TamimEhsan in #1630
- Bump org.jsoup:jsoup from 1.21.1 to 1.21.2 by @dependabot[bot] in #1637
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1638
- Fix java7 issues by @sigee in #1639
- Bump dev.langchain4j:langchain4j-open-ai from 1.3.0 to 1.4.0 by @dependabot[bot] in #1641
- Bump dev.langchain4j:langchain4j from 1.3.0 to 1.4.0 by @dependabot[bot] in #1640
- Bump com.microsoft.playwright:playwright from 1.54.0 to 1.55.0 by @dependabot[bot] in #1643
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1644
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1645
- #1604 - Externalise Selenium by @jnioche in #1646
- Improve MetadataFilter by @sigee in #1647
- Bump org.jetbrains:annotations from 26.0.2 to 26.0.2-1 by @dependabot[bot] in #1654
- Bump aws.version from 1.12.788 to 1.12.791 by @dependabot[bot] in #1653
- #1650 - Manage Commons Compress to avoid runtime error with Tika 3.2.2 by @rzo1 in #1652
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1655
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.3 to 3.5.4 by @dependabot[bot] in #1657
- Bump tika.version from 3.2.2 to 3.2.3 by @dependabot[bot] in #1658
- Regenerated License file after dependency upgrades by @github-actions[bot] in #1659
New Contributors
- @Evergreenies made their first contribution in #1581
- @HrishikeshUchake made their first contribution in #1628
- @TamimEhsan made their first contribution in #1631
Full Changelog: stormcrawler-3.4.0...stormcrawler-3.5.0
stormcrawler-3.4.0
⚠️ Breaking Change: TextExtractor Renamed and Refactored
Applies to: Users who directly used, extended, or overrode TextExtractor via textextractor.class in crawler.yaml.
What Changed:
- TextExtractor has been renamed and is now an interface.
- The default implementation is now called JSoupTextExtractor.
- If you previously specified TextExtractor via textextractor.class, you must now use the fully qualified name of the new class:
  textextractor.class: "org.apache.stormcrawler.parse.JSoupTextExtractor"
  or just remove the line, as it is the default anyway.
No Action Needed If:
- You did not override textextractor.class in your crawler.yaml.
- You did not directly extend the old TextExtractor class.
Migration Notes:
- Update custom implementations to implement the new TextExtractor interface.
- Update any references to the old TextExtractor class to JSoupTextExtractor if applicable.
What's Changed
- Rel stormcrawler 3.3.0 rc1 by @tballison in #1507
- Bump junit.version from 5.12.0 to 5.12.1 by @dependabot in #1498
- Bump org.apache:apache from 33 to 34 by @dependabot in #1506
- Bump com.microsoft.playwright:playwright from 1.50.0 to 1.51.0 by @dependabot in #1504
- Bump org.apache.solr:solr-solrj from 9.8.0 to 9.8.1 by @dependabot in #1500
- Bump org.mockito:mockito-core from 5.16.0 to 5.16.1 by @dependabot in #1499
- Bump selenium.version from 4.29.0 to 4.30.0 by @dependabot in #1505
- Regenerated License file after dependency upgrades by @github-actions in #1508
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.2 to 3.5.3 by @dependabot in #1509
- #621 Async queries in Solr by @mvolikas in #1488
- Update README and compiler target to Java 17 in several plugins by @rzo1 in #1518
- #1516 - Add config options to change the response buffer size in OpenSearch by @rzo1 in #1517
- Bump de.thetaphi:forbiddenapis from 3.8 to 3.9 by @dependabot in #1513
- Bump org.jacoco:jacoco-maven-plugin from 0.8.12 to 0.8.13 by @dependabot in #1511
- Bump selenium.version from 4.30.0 to 4.31.0 by @dependabot in #1510
- Bump org.mockito:mockito-core from 5.16.1 to 5.17.0 by @dependabot in #1512
- Regenerated License file after dependency upgrades by @github-actions in #1519
- Bump junit.version from 5.12.1 to 5.12.2 by @dependabot in #1520
- Bump com.ibm.icu:icu4j from 76.1 to 77.1 by @dependabot in #1501
- Regenerated License file after dependency upgrades by @github-actions in #1521
- Fixes Update NOTICE File to Reflect 2025 by @rzo1 in #1522
- #1298 - Re-enable hold on failure (on coverage fail) by @rzo1 in #1523
- Bump testcontainers.version from 1.20.6 to 1.21.0 by @dependabot in #1524
- Bump org.jsoup:jsoup from 1.19.1 to 1.20.1 by @dependabot in #1530
- Regenerated License file after dependency upgrades by @github-actions in #1531
- Bump aws.version from 1.12.782 to 1.12.783 by @dependabot in #1529
- Bump com.microsoft.playwright:playwright from 1.51.0 to 1.52.0 by @dependabot in #1527
- Bump selenium.version from 4.31.0 to 4.32.0 by @dependabot in #1526
- Bump org.wiremock:wiremock from 3.12.1 to 3.13.0 by @dependabot in #1525
- Regenerated License file after dependency upgrades by @github-actions in #1534
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.1 to 3.4.0 by @dependabot in #1535
- Bump org.apache.maven.archetype:archetype-packaging from 3.3.1 to 3.4.0 by @dependabot in #1536
- Bump org.mockito:mockito-core from 5.17.0 to 5.18.0 by @dependabot in #1540
- Bump selenium.version from 4.32.0 to 4.33.0 by @dependabot in #1539
- Regenerated License file after dependency upgrades by @github-actions in #1541
- Remove Incubating references since we have graduated by @rzo1 in #1538
- Fix versions of SC in the READMEs + added instructions in RELEASING by @jnioche in #1543
- #1545 Use same version of URLFrontier as in the module by @jnioche in #1546
- Bump testcontainers.version from 1.21.0 to 1.21.1 by @dependabot in #1549
- Bump junit.version from 5.12.2 to 5.13.0 by @dependabot in #1548
- Bump aws.version from 1.12.783 to 1.12.785 by @dependabot in #1551
- Bump junit.version from 5.13.0 to 5.13.1 by @dependabot in #1550
- Bump tika.version from 3.1.0 to 3.2.0 by @dependabot in #1547
- Bump com.github.ben-manes.caffeine:caffeine from 3.2.0 to 3.2.1 by @dependabot in #1553
- Regenerated License file after dependency upgrades by @github-actions in #1554
- #1555 - Storm 2.8.1 by @rzo1 in #1556
- Regenerated License file after dependency upgrades by @github-actions in #1557
- Bump aws.version from 1.12.785 to 1.12.787 by @dependabot in #1563
- Bump org.apache:apache from 34 to 35 by @dependabot in #1562
- Bump org.wiremock:wiremock from 3.13.0 to 3.13.1 by @dependabot in #1561
- #1246 - Make ProxyManager return an Optional in case no proxy is used by @quangdutran in #1532
- Regenerated License file after dependency upgrades by @github-actions in #1564
- Enable GH discussions by @rzo1 in #1565
- #621 add batching for cloud updates, fix cloud requests by @mvolikas in #1544
- #1558 - Add a LLM-based TextExtractor (OpenAI API compatible) by @rzo1 in #1559
- Bump testcontainers.version from 1.21.1 to 1.21.2 by @dependabot in #1568
- Bump org.codehaus.mojo:license-maven-plugin from 2.5.0 to 2.6.0 by @dependabot in #1567
- Regenerated License file after dependency upgrades by @github-actions in #1569
- Bump dev.langchain4j:langchain4j from 1.0.1 to 1.1.0 by @dependabot in #1574
- Regenerated License file after dependency upgrades by @github-actions in #1575
- Bump org.jsoup:jsoup from 1.20.1 to 1.21.1 by @dependabot in #1576
- Regenerated License file after dependency upgrades by @github-actions in #1577
New Contributors
- @quangdutran made their first contribution in #1532
Full Changelog: stormcrawler-3.3.0...stormcrawler-3.4.0
stormcrawler-3.3.0
What's Changed
- Rel stormcrawler 3.2.0 rc3 by @tballison in #1438
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.11.1 to 3.11.2 by @dependabot in #1437
- Update RELEASING.md by @tballison in #1413
- Bump aws.version from 1.12.779 to 1.12.780 by @dependabot in #1439
- Regenerated License file after dependency upgrades by @github-actions in #1441
- Bump log4j2.version from 2.24.1 to 2.24.3 by @dependabot in #1440
- Regenerated License file after dependency upgrades by @github-actions in #1443
- Save energy by avoiding building multiple times by @rzo1 in #1442
- Remove log4j override by @rzo1 in #1444
- Regenerated License file after dependency upgrades by @github-actions in #1446
- Bump org.mockito:mockito-core from 5.14.2 to 5.15.2 by @dependabot in #1447
- Bump junit.version from 5.11.3 to 5.11.4 by @dependabot in #1445
- Regenerated License file after dependency upgrades by @github-actions in #1449
- Bump org.jetbrains:annotations from 26.0.1 to 26.0.2 by @dependabot in #1454
- Bump com.github.ben-manes.caffeine:caffeine from 3.1.8 to 3.2.0 by @dependabot in #1450
- Regenerated License file after dependency upgrades by @github-actions in #1455
- Storm 2.8.0 by @rzo1 in #1457
- Regenerated License file after dependency upgrades by @github-actions in #1458
- Bump org.apache.solr:solr-solrj from 9.7.0 to 9.8.0 by @dependabot in #1452
- Bump selenium.version from 4.27.0 to 4.28.1 by @dependabot in #1451
- Bump org.wiremock:wiremock from 3.10.0 to 3.11.0 by @dependabot in #1461
- Regenerated License file after dependency upgrades by @github-actions in #1462
- Bump tika.version from 3.0.0 to 3.1.0 by @dependabot in #1460
- Updated README files to use version 3.2.0 by @Brijeshthummar02 in #1463
- Regenerated License file after dependency upgrades by @github-actions in #1464
- Remove reference to SNAPSHOT in Solr readme by @mvolikas in #1465
- Bump com.microsoft.playwright:playwright from 1.49.0 to 1.50.0 by @dependabot in #1466
- Regenerated License file after dependency upgrades by @github-actions in #1467
- Bump org.yaml:snakeyaml from 2.3 to 2.4 by @dependabot in #1472
- Bump org.wiremock:wiremock from 3.11.0 to 3.12.0 by @dependabot in #1471
- Bump aws.version from 1.12.780 to 1.12.781 by @dependabot in #1469
- Bump opensearch.version from 2.18.0 to 2.19.0 by @dependabot in #1470
- Regenerated License file after dependency upgrades by @github-actions in #1473
- Bump aws.version from 1.12.781 to 1.12.782 by @dependabot in #1483
- Bump junit.version from 5.11.4 to 5.12.0 by @dependabot in #1482
- Bump org.awaitility:awaitility from 4.2.2 to 4.3.0 by @dependabot in #1481
- Bump testcontainers.version from 1.20.4 to 1.20.5 by @dependabot in #1480
- Bump selenium.version from 4.28.1 to 4.29.0 by @dependabot in #1479
- Regenerated License file after dependency upgrades by @github-actions in #1485
- Bump org.apache.maven.plugins:maven-compiler-plugin from 3.13.0 to 3.14.0 by @dependabot in #1478
- #1474 -- catch OpenSearchExceptions to prevent percolation of Runtime… by @tballison in #1476
- #1475 -- Regular crawling should work when autodiscovery of sitemaps is turned off by @tballison in #1477
- fix(docs, lint): Fixed broken image in README, updated official site link, and resolved major lint issues across the repository by @Brijeshthummar02 in #1487
- Bump opensearch.version from 2.19.0 to 2.19.1 by @dependabot in #1489
- Regenerated License file after dependency upgrades by @github-actions in #1490
- Bump org.jsoup:jsoup from 1.18.3 to 1.19.1 by @dependabot in #1492
- Bump org.wiremock:wiremock from 3.12.0 to 3.12.1 by @dependabot in #1493
- Bump testcontainers.version from 1.20.5 to 1.20.6 by @dependabot in #1495
- Bump org.mockito:mockito-core from 5.15.2 to 5.16.0 by @dependabot in #1494
- #1411 Add "Task" to ISSUE_TEMPLATE and include issue type YAML files by @Brijeshthummar02 in #1491
- Regenerated License file after dependency upgrades by @github-actions in #1496
New Contributors
- @Brijeshthummar02 made their first contribution in #1463
Full Changelog: stormcrawler-3.2.0...stormcrawler-3.3.0
stormcrawler-3.2.0
What's Changed
- Release 3.1.0 by @rzo1 in #1316
- Bump Apache Storm from 3.1.1 to 2.6.4 & archetype 3.0 to 3.1.0 by @kunalpal97 in #1319
- #1299 - Add DISCLAIMER to JAR files by @rzo1 in #1320
- #1300 - Fix "files in jars have odd dates" by @rzo1 in #1321
- Bump org.yaml:snakeyaml from 2.2 to 2.3 by @dependabot in #1307
- Bump org.awaitility:awaitility from 4.2.0 to 4.2.2 by @dependabot in #1310
- Bump org.jacoco:jacoco-maven-plugin from 0.8.11 to 0.8.12 by @dependabot in #1305
- Bump org.netpreserve:jwarc from 0.29.0 to 0.30.0 by @dependabot in #1304
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.2.1 to 3.5.0 by @dependabot in #1308
- Bump aws.version from 1.12.663 to 1.12.772 by @dependabot in #1302
- Bump org.apache.solr:solr-solrj from 9.6.1 to 9.7.0 by @dependabot in #1309
- Bump com.microsoft.playwright:playwright from 1.46.0 to 1.47.0 by @dependabot in #1306
- Bump org.wiremock:wiremock from 3.5.4 to 3.9.1 by @dependabot in #1311
- Bump selenium.version from 4.24.0 to 4.25.0 by @dependabot in #1314
- #1323 Update archetype Storm version from 2.6.4 by @mvolikas in #1325
- Regenerated License file after dependency upgrades by @github-actions in #1322
- Bump OpenSearch to 2.17 + fix archetype version in README by @jnioche in #1324
- Bump org.mockito:mockito-core from 5.13.0 to 5.14.0 by @dependabot in #1334
- Bump junit.version from 5.11.0 to 5.11.1 by @dependabot in #1333
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.2.1 to 3.3.0 by @dependabot in #1332
- Bump org.apache.maven.archetype:archetype-packaging from 3.2.1 to 3.3.0 by @dependabot in #1330
- Regenerated License file after dependency upgrades by @github-actions in #1326
- Regenerated License file after dependency upgrades by @github-actions in #1335
- Bump log4j2.version from 2.23.0 to 2.24.1 by @dependabot in #1328
- Regenerated License file after dependency upgrades by @github-actions in #1337
- Bump org.jetbrains:annotations from 24.1.0 to 25.0.0 by @dependabot in #1331
- Regenerated License file after dependency upgrades by @github-actions in #1338
- Bump com.github.crawler-commons:urlfrontier-API from 2.3.1 to 2.4 by @dependabot in #1327
- Regenerated License file after dependency upgrades by @github-actions in #1340
- Store metadata as WARC Metadata records by @jnioche in #1341
- Improve robustness of WARC generation by @jnioche in #1342
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.0 to 3.5.1 by @dependabot in #1350
- Bump junit.version from 5.11.1 to 5.11.2 by @dependabot in #1345
- Fix configuration for Github's linguist by @mvolikas in #1344
- Bump testcontainers.version from 1.20.1 to 1.20.2 by @dependabot in #1346
- Bump org.mockito:mockito-core from 5.14.0 to 5.14.1 by @dependabot in #1349
- Bump aws.version from 1.12.772 to 1.12.773 by @dependabot in #1351
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.10.0 to 3.10.1 by @dependabot in #1347
- Regenerated License file after dependency upgrades by @github-actions in #1352
- #1354 Fix: fix some typos in project by @psxjoy in #1355
- Fix #1312 "Sha512 hash of source release is missing the file part " by @rzo1 in #1356
- Bump de.thetaphi:forbiddenapis from 3.7 to 3.8 by @dependabot in #1359
- Bump org.jetbrains:annotations from 25.0.0 to 26.0.0 by @dependabot in #1358
- Regenerated License file after dependency upgrades by @github-actions in #1360
- Trivial: version number in warc/README fix #1317 by @jnioche in #1363
- Bugfix nofollow instructions in rel tags ignored by @jnioche in #1362
- Bump org.jetbrains:annotations from 26.0.0 to 26.0.1 by @dependabot in #1368
- Bump com.microsoft.playwright:playwright from 1.47.0 to 1.48.0 by @dependabot in #1366
- Connect to a remote instance using web sockets by @jnioche in #1361
- Bump aws.version from 1.12.773 to 1.12.776 by @dependabot in #1367
- Bump org.mockito:mockito-core from 5.14.1 to 5.14.2 by @dependabot in #1369
- Regenerated License file after dependency upgrades by @github-actions in #1370
- Bump tika.version from 2.9.2 to 3.0.0 by @dependabot in #1365
- Apache Storm 2.7.0 by @rzo1 in #1371
- Regenerated License file after dependency upgrades by @github-actions in #1372
- #1353 Fix for URLFrontier spout not taking into account the crawl ID by @klockla in #1373
- Bump junit.version from 5.11.2 to 5.11.3 by @dependabot in #1375
- Bump com.ibm.icu:icu4j from 75.1 to 76.1 by @dependabot in #1376
- Bump aws.version from 1.12.776 to 1.12.777 by @dependabot in #1377
- Bump org.wiremock:wiremock from 3.9.1 to 3.9.2 by @dependabot in #1378
- Bump testcontainers.version from 1.20.2 to 1.20.3 by @dependabot in #1379
- Remove references to ES in OpenSearch module by @jnioche in #1374
- Regenerated License file after dependency upgrades by @github-actions in #1380
- Fix #1313 "Exclude "__files" from Source Release Artifacts"" by @rzo1 in #1384
- #1301 - add build doc for the source release by @rzo1 in #1383
- [1385] bugfix - check for null before the for-each loop by @jnioche in #1386
- Sync conf files in root and archetype + explicit values for sniff conf by @jnioche in #1388
- Detect multi addresses separated by ; in a single String. Fixes #1382 by @jnioche in #1387
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.0 to 3.3.1 by @dependabot in #1390
- Bump selenium.version from 4.25.0 to 4.26.0 by @dependabot in #1393
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.1 to 3.5.2 by @dependabot in #1392
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.10.1 to 3.11.1 by @dependabot in #1394
- Bump org.apache.maven.archetype:archetype-packaging from 3.3.0 to 3.3.1 by @dependabot in #1395
- Regenerated License file after dependency upgrades by @github-actions in #1398
- #620 Add support for shards - SolrSpout by @mvolikas in #1343
- #1403 - Downgrade log4j2 to Storm's version. Fixes #1403 by @tballison in #1404
- #140...
Apache StormCrawler 3.1.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our second release after joining the ASF Incubator as a podling. It contains the new Playwright module, which can be used for scraping dynamic content.
What's Changed
- send email if CI build fails by @pjfanning in #1217
- Fixes #1214 - "Update Release Docs with Feedback from 3.0 RC2 Vote" by @rzo1 in #1218
- Fix #1223 - Remove declareOutputFields from Solr StatusUpdaterBolt by @mvolikas in #1224
- Apache StormCrawler 3.0 (Incubating) by @rzo1 in #1225
- Fix #1226 "Add FileSpout TestCase for Custom Metadata Injections" by @rzo1 in #1227
- 1024 Playwright protocol implementation, fixes #1024 by @jnioche in #1228
- Fix #1230: Set sitemap key before outlink processing by @mvolikas in #1231
- #1220 - Add disclaimer for binary test artifacts by @rzo1 in #1234
- #1221 - Switch Source to tar.gz by @rzo1 in #1233
- #1215 - Update RAT exclusions. Fixes licenses by @rzo1 in #1235
- #1236 - Fix Typos in StormCrawler by @rzo1 in #1237
- #1222 - Fix Release Docs by @rzo1 in #1232
- #1238 - Avoid use of star imports by @rzo1 in #1239
- Fix #1244 "Migrate to JUnit 5" by @rzo1 in #1245
- Fix #1216 - Add RAT Exclusion File for standalone RAT by @rzo1 in #1243
- #1248 - Use pre-compiled patterns for mime type matching in TikaParser by @rzo1 in #1249
- #1251 - Update to Storm 2.6.3 by @rzo1 in #1252
- #626: Add routing field in metadata - Solr StatusUpdaterBolt by @mvolikas in #1242
- #851 Merge branch 851 into main by @mvolikas in #1256
- #1259 - Enable Dependabot by @rzo1 in #1260
- #1261 - Automatically generate THIRD-PARTY.txt via GitHub Action by @rzo1 in #1262
- #1257 - Update to Storm 2.6.4 by @rzo1 in #1258
- #1162 - Replace Coveralls with JaCoCo by @sigee in #1255
- Bump testcontainers.version from 1.19.7 to 1.20.1 by @dependabot in #1277
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.5.0 to 3.10.0 by @dependabot in #1267
- Bump actions/setup-java from 3 to 4 by @dependabot in #1264
- Bump actions/checkout from 3 to 4 by @dependabot in #1265
- Bump org.jsoup:jsoup from 1.17.2 to 1.18.1 by @dependabot in #1271
- Regenerated License file after dependency upgrades by @github-actions in #1280
- Bump tika.version from 2.9.1 to 2.9.2 by @dependabot in #1269
- Bump com.ibm.icu:icu4j from 74.2 to 75.1 by @dependabot in #1272
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.4.1 to 3.5.0 by @dependabot in #1289
- Bump org.apache.maven.plugins:maven-jar-plugin from 3.3.0 to 3.4.2 by @dependabot in #1288
- Bump org.apache.maven.plugins:maven-compiler-plugin from 3.11.0 to 3.13.0 by @dependabot in #1285
- Bump org.apache.rat:apache-rat-plugin from 0.15 to 0.16.1 by @dependabot in #1283
- Bump org.apache:apache from 31 to 33 by @dependabot in #1275
- Bump junit.version from 5.10.2 to 5.11.0 by @dependabot in #1278
- Bump org.apache.solr:solr-solrj from 9.5.0 to 9.6.1 by @dependabot in #1281
- Bump org.apache.maven.archetype:archetype-packaging from 2.4 to 3.2.1 by @dependabot in #1287
- Bump org.mockito:mockito-core from 5.10.0 to 5.13.0 by @dependabot in #1279
- Bump com.microsoft.playwright:playwright from 1.43.0 to 1.46.0 by @dependabot in #1268
- Bump selenium.version from 4.18.1 to 4.24.0 by @dependabot in #1266
- Bump log4j2.version from 2.23.0 to 2.24.0 by @dependabot in #1284
- Regenerated License file after dependency upgrades by @github-actions in #1282
- Fix #1290 "Add close/cleanup method to ParseFilters" by @rzo1 in #1291
- Bump opensearch.version from 2.12.0 to 2.16.0 by @dependabot in #1276
- Regenerated License file after dependency upgrades by @github-actions in #1292
- Aligned version of OpenSearch in test with recent upgrade to 2.16 by @jnioche in #1293
- Bump actions/cache from 3 to 4 by @dependabot in #1263
- Revert "Bump log4j2.version from 2.23.0 to 2.24.0" by @rzo1 in #1294
- #1295 - Add workflow to publish SNAPSHOTS to repository.a.o by @rzo1 in #1296
- Regenerated License file after dependency upgrades by @github-actions in #1297
New Contributors
- @sigee made their first contribution in #1255
- @github-actions made their first contribution in #1280
Full Changelog: stormcrawler-3.0...stormcrawler-3.1.0
Apache StormCrawler 3.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our first release after joining the ASF Incubator as a podling. It is a breaking release: the group IDs have been renamed and the Elasticsearch module has been removed.
What's Changed
- Handling of DateTimeParseException in WARCSpout by @michaeldinzinger in #1140
- Generate THIRD-PARTY.txt file, fixes #1145 by @jnioche in #1146
- Remove coveralls maven plugin, fixes #1148 by @jnioche in #1149
- OpenSearch - better handling of mappings by @jnioche in #1155
- Delete CODE_OF_CONDUCT.md by @pjfanning in #1158
- Create DISCLAIMER by @pjfanning in #1159
- Update NOTICE by @pjfanning in #1160
- Changed package names to org.apache by @jnioche in #1165
- Create .asf.yaml by @pjfanning in #1161
- Fix #1174 - Exclude optional artifact from storm-hdfs by @rzo1 in #1175
- Fix #1164 - Change license headers by @rzo1 in #1173
- Removed devs section from pom.xml by @jnioche in #1181
- Fix #1167 - Remove Elasticsearch module by @rzo1 in #1182
- Remove hyphens in storm-crawler by @jnioche in #1177
- Fixes #1178 "Set version to 3.0-SNAPSHOT" by @rzo1 in #1183
- Fixes #1169 - Use Apache Parent POM & Enable RAT by @rzo1 in #1180
- Removed ref to Discord in README by @jnioche in #1184
- Fix #1168 - Add a modified version of CONTRIBUTING.md by @rzo1 in #1186
- Fix #1163 - Change the GitHub templates for PRs to be more ASF specific by @rzo1 in #1185
- Upgrade to Storm 2.6.2, fix #1188 by @jnioche in #1189
- link to ASF web site .asf.yaml by @pjfanning in #1192
- Update README.md by @jnioche in #1195
- 1200 - Fix license headers by @jnioche in #1201
- #1197 - Allow to disable SSL/TLS verification in OpenSearchConnection by @rzo1 in #1199
- Fix #1202 - Add release documentation and comply with source package naming requirements by @rzo1 in #1203
- #1207 -- add forbidden-apis by @tballison in #1208
- #1209 fix for emulation error in tests run on silicon by @joshfischer1108 in #1210
- Resolves #1211 "Fix License Header" by @rzo1 in #1212
- #1205 update archetype in README by @joshfischer1108 in #1206
- Introduce "skip.format.code" to skip code formatting by default by @rzo1 in #1213
New Contributors
- @pjfanning made their first contribution in #1158
- @tballison made their first contribution in #1208
- @joshfischer1108 made their first contribution in #1210
Full Changelog: 2.11...stormcrawler-3.0
StormCrawler 2.11
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Upgrade to OpenSearch 2.11 #1113 by @jnioche in #1114
- Use mock server for selenium tests, fix #1116 by @jnioche in #1119
- Issue #728: Adding asterisk for metadata transfer by @michaeldinzinger in #1117
- WARCSpout loads inputs using HDFS by @jnioche in #1122
- Fix wrong most recent date was set by @chhsiao90 in #1126
- Glob field mapping for indexer.md.mapping by @jnioche in #1130
- Add committer statement by @michaeldinzinger in #1134
- Implement configurable getDocumentID in DeletionBolt by @chhsiao90 in #1135
- Add two tests for SiteMapParserBolt by @michaeldinzinger in #1138
- dependency upgrades by @jnioche in #1139
New Contributors
- @chhsiao90 made their first contribution in #1126
Full Changelog: 2.10...2.11
What's new in StormCrawler 2.10
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Selenium test by @jnioche in #1093
- refactoring timeouts Selenium by @jnioche in #1102
- Improvements and fixes to HttpRobotRulesParser when following redirects by @sebastian-nagel in #1103
and a lot more!
Full Changelog: 2.9...2.10
See https://digitalpebble.blogspot.com/2023/10/focus-on-protocol-improvements-in.html for more details on the protocol improvements
What's new in StormCrawler 2.9
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Change HttpProtocol to defer to configured values for retryOnConnectionFailure and followRedirects by @ndtreviv in #1056
- Cache redirected robots.txt for target host only if path is /robots.txt and query is empty by @sebastian-nagel in #1057
- Issue #1043: Fixing problems after restart of Frontier service by @michaeldinzinger in #1054
- #1049 Replace "Collapse and Expand Results" Solr query with "Result Grouping" query. by @syefimov in #1053
- OpenSearch 2.7.0 + renamed OpenSearchConnection by @jnioche in #1064
- BasicURLNormalizer .unmangleQueryString() returns invalid results if "&" symbol in a parents path #1059 by @syefimov in #1062
- Dependency upgrades. fixes #1066 by @jnioche in #1067
- Automatic creation of index definitions should use the bolt type by @jnioche in #1069
- mechanism to retrieve more generic value of configuration by @jnioche in #1071
- Create DeletionBolt.java for Solr. #1050 by @syefimov in #1073
- Increase the number of redirects to 5 for Robots.txt fetching by @michaeldinzinger in #1074
- Issue #1042: Adapt parsing of robots.txt files by @michaeldinzinger in #1055
- Test URL Filtering from the command line by @jnioche in #1081
New Contributors
- @michaeldinzinger made their first contribution in #1054
- @syefimov made their first contribution in #1053
Full Changelog: 2.8...2.9
What's new in StormCrawler 2.8
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Enforce Java 11 in archetypes by @msghasan in #1029
- Fix #1027: Ensure SC can be built with Java 17 by @rzo1 in #1030
- Indexer ES document id by @Mikwiss in #1028
- JsoupFilter as Interface by @Mikwiss in #1026
- Create method to add SearchHit info to metadata by @Mikwiss in #1034
- Status ES document id by @Mikwiss in #1036
- Limit the amount of text to be returned by the text extraction, #1038 by @jnioche in #1039
- Allow override on HttpProtocol's method addHeadersToRequest by @Mikwiss in #1041
- Fixes #1045. Remove range syntax from snakeyaml by @rzo1 in #1046
- Fix #1032: Catch the exception inside the loop to avoid breaking if one remote instance is misbehaving by @rzo1 in #1047
New Contributors
Full Changelog: 2.7...2.8