Releases: Unstructured-IO/unstructured
0.16.16
Enhancements
Features
- Vectorize layout (inferred, extracted, and OCR) data structure. Use `np.ndarray` to store a group of layout elements or text regions instead of a list of objects. This improves memory efficiency and compute speed around layout merging and deduplication.
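To illustrate why an array layout helps with deduplication, here is a minimal sketch, not the library's implementation. It assumes bounding boxes are stored as rows of `(x1, y1, x2, y2)` with positive area; the names `pairwise_iou` and `drop_duplicates` and the `0.9` threshold are hypothetical.

```python
import numpy as np


def pairwise_iou(boxes: np.ndarray) -> np.ndarray:
    """Return the (n, n) intersection-over-union matrix for n boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    # Broadcasting compares every box with every other box in one pass.
    inter_w = np.clip(
        np.minimum(x2[:, None], x2[None, :]) - np.maximum(x1[:, None], x1[None, :]),
        0,
        None,
    )
    inter_h = np.clip(
        np.minimum(y2[:, None], y2[None, :]) - np.maximum(y1[:, None], y1[None, :]),
        0,
        None,
    )
    inter = inter_w * inter_h
    union = areas[:, None] + areas[None, :] - inter
    return inter / union


def drop_duplicates(boxes: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Drop every box that overlaps strongly with any earlier box."""
    iou = pairwise_iou(boxes)
    # Keep only the upper triangle so each pair (i, j) with i < j is checked once.
    is_duplicate = np.triu(iou > threshold, k=1).any(axis=0)
    return boxes[~is_duplicate]


boxes = np.array(
    [[0, 0, 10, 10], [0, 0, 10, 10.5], [20, 20, 30, 30]], dtype=float
)
unique_boxes = drop_duplicates(boxes)
```

With a list of objects, the same comparison needs a nested Python loop; the array form moves that loop into NumPy.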
Fixes
- Add auto-download for NLTK in the Python environment. When a user imports `tokenize`, NLTK data is downloaded automatically from the `tokenize.py` file. Added an `AUTO_DOWNLOAD_NLTK` flag in `tokenize.py` to download `NLTK_DATA`.
- Correctly patch pdfminer to avoid PDF repair. The patch applied to pdfminer's parser caused it to occasionally split tokens in content streams, throwing `PDFSyntaxError`. Repairing these PDFs sometimes failed (since they were not actually invalid), resulting in unnecessary OCR fallback.
- Drop usage of ndjson dependency
0.16.15
0.16.14
Enhancements
Features
Fixes
- Fix an issue with multiple values for `infer_table_structure` when partitioning email with image attachments. The kwargs passed into `partition` to partition the image already contain `infer_table_structure`. Now the `partition` function checks whether the `kwarg` already has `infer_table_structure`
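The failure mode here is Python's "got multiple values for keyword argument" `TypeError`. The sketch below reproduces it and shows one way to guard against it. The functions are hypothetical stand-ins, not the library's code, and the `False` default is an assumption.

```python
def partition_image(filename, infer_table_structure=False, **kwargs):
    """Hypothetical stand-in for the partitioner that handles the image."""
    return {"filename": filename, "infer_table_structure": infer_table_structure}


def partition_buggy(filename, **kwargs):
    # Passing the option explicitly AND inside **kwargs raises TypeError
    # whenever the caller has already supplied infer_table_structure.
    return partition_image(filename, infer_table_structure=False, **kwargs)


def partition(filename, **kwargs):
    # Only fill in the default when the caller has not supplied the option.
    kwargs.setdefault("infer_table_structure", False)
    return partition_image(filename, **kwargs)


try:
    partition_buggy("attachment.png", infer_table_structure=True)
    buggy_raised = False
except TypeError:
    buggy_raised = True

fixed_result = partition("attachment.png", infer_table_structure=True)
```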
0.16.13
Enhancements
- Add character-level filtering for tesseract output. It is controllable via the `TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD` environment variable.
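A minimal sketch of what character-level filtering could look like. It assumes OCR output arrives as `(character, confidence)` pairs, with confidence on a 0 to 1 scale and a default threshold of `0.0`. Only the environment variable name comes from the release note; the helper `filter_low_confidence` is hypothetical.

```python
import os


def filter_low_confidence(chars, threshold=None):
    """Join the characters whose confidence meets the threshold.

    chars: iterable of (character, confidence) pairs (assumed 0-1 scale).
    """
    if threshold is None:
        # Assumed default of 0.0 keeps every character when the variable is unset.
        threshold = float(
            os.environ.get("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", "0.0")
        )
    return "".join(ch for ch, conf in chars if conf >= threshold)


ocr_chars = [("H", 0.95), ("~", 0.10), ("i", 0.80)]
cleaned = filter_low_confidence(ocr_chars, threshold=0.5)
```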
Features
Fixes
- Fix NLTK Download to use nltk assets in docker image
- Removed the ability to automatically download the nltk package if missing
0.16.12
Enhancements
- Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature so a custom or override partitioner can be registered without code changes.
- Add NDJSON file type support.
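The idea of a uniform call signature with registration can be sketched as a small registry. This illustrates the pattern only and is not the library's API; `register_partitioner`, `auto_partition`, and `partition_ndjson_stub` are all hypothetical names, and dispatching on the file extension is a simplifying assumption.

```python
from typing import Callable, Dict, List

Partitioner = Callable[..., List[str]]
_REGISTRY: Dict[str, Partitioner] = {}


def register_partitioner(file_type: str, partitioner: Partitioner) -> None:
    """Register (or override) the partitioner used for a file type."""
    _REGISTRY[file_type] = partitioner


def auto_partition(filename: str, **kwargs) -> List[str]:
    """Dispatch on the file extension; every partitioner shares one signature."""
    file_type = filename.rsplit(".", 1)[-1].lower()
    try:
        partitioner = _REGISTRY[file_type]
    except KeyError:
        raise ValueError(f"no partitioner registered for {file_type!r}") from None
    return partitioner(filename=filename, **kwargs)


def partition_ndjson_stub(filename: str, **kwargs) -> List[str]:
    # Placeholder body; a real partitioner would parse the file.
    return [f"parsed {filename}"]


register_partitioner("ndjson", partition_ndjson_stub)
result = auto_partition("data.ndjson")
```

Because the dispatcher only knows the shared signature, adding a new file type is a registration call rather than a change to the dispatcher.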
Features
Fixes
- Base image has been updated.
- Upgrade ruff to latest. Previously the ruff version was pinned to <0.5. Remove that pin and fix the handful of lint items that resulted.
- CSV with asserted XLS content-type is correctly identified as CSV. Resolves a bug where a CSV file with an asserted content-type of `application/vnd.ms-excel` was incorrectly identified as an XLS file.
- Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements.
- Improve element-type mapping for HTML. Fixes bug where certain non-title elements were classified as `Title`.
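One way to tell a mislabeled CSV from a real XLS file is to check for the OLE2 compound-file signature that legacy `.xls` files start with. This is a sketch under that assumption and not necessarily how the library resolves the type; `resolve_file_type` is a hypothetical helper.

```python
# Legacy XLS files are OLE2 compound documents and begin with this signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def resolve_file_type(content_type: str, head: bytes) -> str:
    """Pick a file type from the asserted content-type and the first bytes."""
    if content_type == "application/vnd.ms-excel":
        # Trust the asserted type only when the bytes back it up; otherwise
        # assume the sender mislabeled a CSV (a simplifying assumption).
        return "xls" if head.startswith(OLE2_MAGIC) else "csv"
    if content_type == "text/csv":
        return "csv"
    return "unknown"


mislabeled = resolve_file_type("application/vnd.ms-excel", b"a,b,c\n1,2,3\n")
genuine = resolve_file_type("application/vnd.ms-excel", OLE2_MAGIC + b"\x00\x00")
```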
0.16.11
Enhancements
- Enhance quote standardization tests with additional Unicode scenarios
- Relax table segregation rule in chunking. Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
- Compute chunk length based solely on `element.text`. Previously `.metadata.text_as_html` was also considered and, since it is always longer than the text (due to HTML tag overhead), it was the effective length criterion. Remove text-as-html from the length calculation such that text length is the sole criterion for sizing a chunk.
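The two chunking changes above can be sketched together. This is a simplified illustration, not the library's chunker: the `Element` class, the `pre_chunk` function, and the one-character separator are assumptions, and splitting of oversized elements is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    text: str
    category: str = "Text"
    text_as_html: Optional[str] = None  # deliberately ignored when sizing


def pre_chunk(elements: List[Element], max_characters: int) -> List[List[Element]]:
    """Greedily pack elements into chunks, measuring only len(element.text)."""
    chunks: List[List[Element]] = []
    current: List[Element] = []
    current_len = 0
    for el in elements:
        sep = 1 if current else 0  # assumed one-character separator
        if current and current_len + sep + len(el.text) > max_characters:
            chunks.append(current)
            current, current_len, sep = [], 0, 0
        # Tables are packed like any other element when space allows.
        current.append(el)
        current_len += sep + len(el.text)
    if current:
        chunks.append(current)
    return chunks


elements = [
    Element("Intro"),
    Element("a b c", "Table", "<table><tr><td>a</td><td>b</td><td>c</td></tr></table>"),
    Element("Outro"),
]
chunks = pre_chunk(elements, max_characters=11)
```

Here the table shares a chunk with the preceding text because only its 5-character `text` counts toward the limit, not its much longer HTML.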
Features
Fixes
- Fix IPv4 regex to correctly include octets of up to three digits.
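The release note does not show the library's pattern. As an illustration, here is one common way to write an IPv4 pattern whose octets accept one to three digits in the range 0 to 255; the names `OCTET` and `IPV4_RE` are hypothetical.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Four octets separated by dots, not embedded in a longer run of digits.
IPV4_RE = re.compile(rf"(?<!\d){OCTET}(?:\.{OCTET}){{3}}(?!\d)")

found = IPV4_RE.findall("hosts 192.168.100.255 and 10.0.0.1.")
```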
0.16.10
0.16.9
What's Changed
- chore: fix CHANGELOG formatting by @cragwolfe in #3800
- Use native nltk download by @vangheem in #3796
Full Changelog: 0.16.8...0.16.9
0.16.8
0.16.7
Enhancements
- Add image_alt_mode to partition_html. Adds an `image_alt_mode` parameter to `partition_html()` to control how alt text is extracted from images in HTML documents for `html_parser_version=v2`. The parameter can be set to `to_text` to extract alt text as text from `<img>` HTML tags
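To show what extracting alt text as text means in practice, here is a standard-library sketch. It is not how `partition_html()` is implemented; `AltTextCollector` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect visible text and the alt text of <img> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt.strip())

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> list:
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.parts


parts = html_to_text('<p>Sales chart:</p><img src="c.png" alt="Q3 revenue by region"/>')
```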