Release 2024-09-09 Interim Release · internetarchive/heritrix3

Full Changelog | Javadoc | Maven Central

Compatibility Note

Checkpoints and crawl state created with older versions of Heritrix cannot be loaded by this release, because the Kryo serialization library has been significantly updated. Replaying the recovery log may be an alternative in some cases.

New Features

JDK 22 support
Added ConfigurableExtractorJS for more flexible JavaScript extraction. (#602)
Added HostnameQueueAssignmentPolicyWithLimits with optional name length limits. (#598)
ExtractorHTML can now extract more variants of alternative resolution image URLs. (#605)
- Attributes are now matched case-insensitively (previously src and SRC worked but not Src)
- New <img> attributes: data-full-src, data-lazy-srcset, data-src-small, data-src-medium
- New <link> attribute: imagesrcset
ExtractorHTTP can now be configured with extra inferred paths (#597)
ExtractorYoutubeDL metadata records can now be optionally logged to crawl.log (#593)
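
The case-insensitive attribute matching noted for ExtractorHTML can be illustrated with a small standalone sketch. This is not Heritrix's implementation: the class name `AttrMatchSketch` and the method `isImageUrlAttribute` are hypothetical, and only the attribute names come from the notes above. It shows one common way to make `src`, `SRC`, and `Src` all match: normalize the name to lower case before looking it up.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of the Heritrix API.
public class AttrMatchSketch {
    // <img> attribute names mentioned in the release notes, stored in lower case.
    private static final Set<String> IMG_URL_ATTRS = Set.of(
            "src", "data-full-src", "data-lazy-srcset",
            "data-src-small", "data-src-medium");

    // Returns true if the attribute name matches a known name, ignoring case.
    public static boolean isImageUrlAttribute(String name) {
        return name != null
                && IMG_URL_ATTRS.contains(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String n : new String[] {"src", "SRC", "Src", "Data-Full-Src", "alt"}) {
            System.out.println(n + " -> " + isImageUrlAttribute(n));
        }
    }
}
```

Lower-casing with `Locale.ROOT` keeps the comparison independent of the JVM's default locale, which matters for names containing the letter `i` under a Turkish locale.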

Removals

Removed ExtractorChrome from contrib (#601)

Fixes

Reduced false positive speculative URLs from meta tags (#595)
Fixed BdbModule resource leak on job teardown (f428001)
Corrected function name in ScriptedProcessor Javadoc. (#599)
Updated Maven builds to use HTTPS for resolving dependencies.
Reset CrawlURI status for hasPrerequisite() so that it isn't preserved between attempts (#600)
Fixed older junit3 tests not being run (#592)
Increased DiskSpaceMonitor default pause threshold to 8 GiB to avoid BDB issue (#499)
Stopped logging authentication failures when auth header is missing (#539)
Fixed console still showing job running after crash (#549)

Dependency Upgrades

Transitioned PDFParser and ExtractorPDF to pdfbox (#575)
Transitioned ExtractorYoutubeDL to yt-dlp
commons-net 3.9.0
com.rabbitmq:amqp-client 5.18.0
dnsjava 3.6.0
groovy 4.0.21
kryo 5.6.0
spring-expression 5.3.39
