2024-09-09 Interim Release
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Compatibility Note
Checkpoints and crawl state created with older versions of Heritrix cannot be loaded by this release because kryo has been significantly updated. Replaying the recovery log may be an alternative in some cases.
New Features
- JDK 22 support
- Added ConfigurableExtractorJS for more flexible JavaScript extraction. (#602)
- Added HostnameQueueAssignmentPolicyWithLimits with optional name length limits. (#598)
- ExtractorHTML can now extract more variants of alternative resolution image URLs. (#605)
  - Attributes are now matched case-insensitively (previously src and SRC worked but not Src)
  - New <img> attributes: data-full-src, data-lazy-srcset, data-src-small, data-src-medium
  - New <link> attribute: imagesrcset
- ExtractorHTTP can now be configured with extra inferred paths (#597)
- ExtractorYoutubeDL metadata records can now be optionally logged to crawl.log (#593)
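The case-insensitive attribute matching noted for ExtractorHTML can be illustrated with a small standalone sketch. This is not Heritrix's implementation: the class name AttributeMatchSketch, the method isImageUrlAttribute, and the contents of the set are assumptions made for illustration, using only the attribute names listed in these notes plus plain src.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AttributeMatchSketch {
    // Illustrative set only (an assumption, not Heritrix's internal list):
    // the attribute names mentioned in the release notes, stored in lower case.
    private static final Set<String> IMAGE_URL_ATTRIBUTES = Set.of(
            "src", "data-full-src", "data-lazy-srcset",
            "data-src-small", "data-src-medium", "imagesrcset");

    /** Returns true if the attribute name is in the set, ignoring case. */
    public static boolean isImageUrlAttribute(String name) {
        // Normalize with Locale.ROOT so the result does not depend on the
        // default locale of the JVM.
        return IMAGE_URL_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Mixed-case spellings such as "Src" match once the name is normalized.
        for (String name : List.of("src", "SRC", "Src", "Data-Full-Src", "alt")) {
            System.out.println(name + " -> " + isImageUrlAttribute(name));
        }
    }
}
```

Normalizing the name once before the lookup treats every spelling the same way, which avoids the situation described above where only the all-lowercase and all-uppercase forms were recognized.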
Removals
- Removed ExtractorChrome from contrib (#601)
Fixes
- Reduced false positive speculative URLs from meta tags (#595)
- Fixed BdbModule resource leak on job teardown (f428001)
- Corrected function name in ScriptedProcessor Javadoc. (#599)
- Updated Maven builds to use HTTPS for resolving dependencies.
- Reset CrawlURI status for hasPrerequisite() so that it isn't preserved between attempts (#600)
- Fixed older junit3 tests not being run (#592)
- Increased DiskSpaceMonitor default pause threshold to 8 GiB to avoid BDB issue (#499)
- Stopped logging authentication failures when auth header is missing (#539)
- Fixed console still showing job running after crash (#549)
Dependency Upgrades
- Transitioned PDFParser and ExtractorPDF to pdfbox (#575)
- Transitioned ExtractorYoutubeDL to yt-dlp
- commons-net 3.9.0
- com.rabbitmq:amqp-client 5.18.0
- dnsjava 3.6.0
- groovy 4.0.21
- kryo 5.6.0
- spring-expression 5.3.39