ENH: add incremental capability to PdfWriter #117

Open
wants to merge 47 commits into
base: main
47 commits
4bd54bd
DEV: Test against Python 3.13 (#2776)
stefan6419846 Jul 28, 2024
d4df20d
STY: Remove boolean value comparison (#2779)
j-t-1 Jul 31, 2024
3ad9234
ROB: Handle images with empty data when processing an image from byte…
williamgagnonpoka Aug 2, 2024
582557e
SEC: Fix GitHub workflow vulnerable to script injection (#2787)
diogoteles08 Aug 2, 2024
38f3925
MAINT: Remove unused paeth_predictor (#2773)
j-t-1 Aug 5, 2024
09f9b7e
MAINT: Remove unused AnnotationFlag
j-t-1 Aug 5, 2024
b2d7204
BUG: Handle Sequence as an IndirectObject when extracting text with l…
owurman Aug 5, 2024
5abd590
STY: Refactor b_ (#2772)
j-t-1 Aug 7, 2024
219eb13
MAINT: Drop Python 3.7 support (#2793)
pubpub-zz Aug 12, 2024
46c89dd
MAINT: Remove b_ and str_ (#2792)
pubpub-zz Aug 12, 2024
a9758ae
MAINT: Improve test coverage (#2796)
pubpub-zz Aug 12, 2024
cf7fcfd
ENH: Compress PDF files merging identical objects (#2795)
pubpub-zz Aug 13, 2024
2eb565d
ROB: Fix extract_text() issues on damaged PDFs (#2760)
pubpub-zz Aug 13, 2024
d9a8c54
ENH: Report PdfReadError instead of RecursionError (#2800)
pubpub-zz Aug 14, 2024
799630d
BUG: Fix sheared image (#2801)
pubpub-zz Aug 15, 2024
454a62a
MAINT: Fix mypy type output (#2799)
pubpub-zz Aug 15, 2024
0c81f3c
ENH: Accept utf strings for metadata (#2802)
pubpub-zz Aug 16, 2024
d2d520b
MAINT: Remove unused code (#2805)
pubpub-zz Aug 22, 2024
9f08cd0
ROB: Raise PdfReadError when missing /Root in trailer (#2808)
BertrandBordage Aug 23, 2024
b7b3c8c
MAINT: Improve wording of set_data error message (#2810)
stefan6419846 Aug 23, 2024
fba73a4
ENH: add incremental capability to PdfWriter
pubpub-zz Aug 23, 2024
0543709
fix test
pubpub-zz Aug 24, 2024
29030d4
fixes + first test
pubpub-zz Aug 25, 2024
1067b74
coverage
pubpub-zz Aug 25, 2024
f1d3fbe
coverage
pubpub-zz Aug 25, 2024
ae97bc7
cope with multiple level pages
pubpub-zz Aug 26, 2024
d9a99d9
test + doc
pubpub-zz Aug 26, 2024
3c4cfdc
coverage
pubpub-zz Aug 26, 2024
38d4b35
coverage
pubpub-zz Aug 26, 2024
79eca73
coverage
pubpub-zz Aug 26, 2024
290c5a6
coverage
pubpub-zz Aug 26, 2024
173578d
coverage
pubpub-zz Aug 26, 2024
b2b0c9e
Merge branch 'main' into incremental
pubpub-zz Aug 26, 2024
1a6eda5
simplification
pubpub-zz Aug 26, 2024
d43d25b
coverage
pubpub-zz Aug 27, 2024
f55d332
ENH: Robustify on missing font for Tf operator in text_extract() (#2…
pubpub-zz Aug 27, 2024
7e2e74d
Merge branch 'main' into incremental
pubpub-zz Aug 27, 2024
38ea8c5
ENH: Add UniGB-UTF16 encodings (#2819)
pubpub-zz Aug 28, 2024
708e449
Merge branch 'main' into incremental
pubpub-zz Aug 28, 2024
82eac7e
ROB: Robustify .set_data() (#2821)
pubpub-zz Aug 29, 2024
ff76e02
Merge remote-tracking branch 'py-pdf/main' into incremental
pubpub-zz Sep 1, 2024
14a93f1
move to X-reference stream for increment
pubpub-zz Sep 1, 2024
53e141f
coverage
pubpub-zz Sep 1, 2024
b4b7c1b
coverage
pubpub-zz Sep 1, 2024
7bc3abd
coverage
pubpub-zz Sep 1, 2024
ffa2f0c
fix
pubpub-zz Sep 1, 2024
b072952
mypy
pubpub-zz Sep 1, 2024
14 changes: 7 additions & 7 deletions .github/workflows/github-ci.yaml
@@ -57,12 +57,12 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13-dev"]
use-crypto-lib: ["cryptography"]
include:
- python-version: "3.7"
- python-version: "3.8"
use-crypto-lib: "pycryptodome"
- python-version: "3.7"
- python-version: "3.8"
use-crypto-lib: "none"
steps:
- name: Update APT packages
@@ -83,14 +83,14 @@ jobs:
key: cache-downloaded-files
- name: Setup Python
uses: actions/setup-python@v5
if: matrix.python-version == '3.7' || matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
if: matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
cache-dependency-path: '**/requirements/ci.txt'
- name: Setup Python (3.11+)
uses: actions/setup-python@v5
if: matrix.python-version == '3.11' || matrix.python-version == '3.12'
if: matrix.python-version == '3.11' || matrix.python-version == '3.12' || matrix.python-version == '3.13-dev'
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: true
@@ -102,11 +102,11 @@ jobs:
- name: Install requirements (Python 3)
run: |
pip install -r requirements/ci.txt
if: matrix.python-version == '3.7' || matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
if: matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
- name: Install requirements (Python 3.11+)
run: |
pip install -r requirements/ci-3.11.txt
if: matrix.python-version == '3.11' || matrix.python-version == '3.12'
if: matrix.python-version == '3.11' || matrix.python-version == '3.12' || matrix.python-version == '3.13-dev'
- name: Remove pycryptodome and cryptography
run: |
pip uninstall pycryptodome cryptography -y
7 changes: 5 additions & 2 deletions .github/workflows/release.yaml
@@ -12,6 +12,9 @@ on:
permissions:
contents: write

env:
HEAD_COMMIT_MESSAGE: ${{ github.event.head_commit.message }}

jobs:
build_and_publish:
name: Publish a new version
@@ -24,15 +27,15 @@ jobs:
- name: Extract version from commit message
id: extract_version
run: |
VERSION=$(echo "${{ github.event.head_commit.message }}" | grep -oP '(?<=REL: )\d+\.\d+\.\d+')
VERSION=$(echo "$HEAD_COMMIT_MESSAGE" | grep -oP '(?<=REL: )\d+\.\d+\.\d+')
echo "version=$VERSION" >> $GITHUB_OUTPUT

- name: Extract tag message from commit message
id: extract_message
run: |
VERSION="${{ steps.extract_version.outputs.version }}"
delimiter="$(openssl rand -hex 8)"
MESSAGE=$(echo "${{ github.event.head_commit.message }}" | sed "0,/REL: $VERSION/s///" )
MESSAGE=$(echo "$HEAD_COMMIT_MESSAGE" | sed "0,/REL: $VERSION/s///" )
echo "message<<${delimiter}" >> $GITHUB_OUTPUT
echo "$MESSAGE" >> $GITHUB_OUTPUT
echo "${delimiter}" >> $GITHUB_OUTPUT
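
The change in this file hands the commit message to the shell through the `HEAD_COMMIT_MESSAGE` environment variable instead of expanding `${{ github.event.head_commit.message }}` inside the script text, so the message is treated as data rather than as shell code. As a rough illustration only, the same extraction logic could be sketched in Python as below. This is not part of the workflow, and the helper names `extract_version` and `extract_tag_message` are hypothetical.

```python
import os
import re
from typing import Optional

# Same pattern as the workflow's `grep -oP`: X.Y.Z right after the literal "REL: ".
VERSION_RE = re.compile(r"(?<=REL: )\d+\.\d+\.\d+")


def extract_version(commit_message: str) -> Optional[str]:
    """Return the first X.Y.Z version that follows 'REL: ', or None."""
    match = VERSION_RE.search(commit_message)
    return match.group(0) if match else None


def extract_tag_message(commit_message: str, version: str) -> str:
    """Drop the first 'REL: <version>' marker (roughly what the sed call does)."""
    return commit_message.replace(f"REL: {version}", "", 1)


if __name__ == "__main__":
    # The message is read from the environment and never pasted into code,
    # so shell metacharacters inside it are never evaluated.
    message = os.environ.get("HEAD_COMMIT_MESSAGE", "")
    print(extract_version(message))
```

A message such as `REL: 4.3.1` followed by text containing `$(...)` is parsed as plain text here, which is the property the workflow change is after.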
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
@@ -19,6 +19,7 @@ history and [GitHub's 'Contributors' feature](https://github.com/py-pdf/pypdf/gr
* [ediamondscience](https://github.com/ediamondscience)
* [Ermeson, Felipe](https://github.com/FelipeErmeson)
* [Freitag, François](https://github.com/francoisfreitag)
* [Gagnon, William G.](https://github.com/williamgagnon)
* [Górny, Michał](https://github.com/mgorny)
* [Grillo, Miguel](https://github.com/Ineffable22)
* [Gutteridge, David H.](https://github.com/dhgutteridge)
8 changes: 3 additions & 5 deletions docs/modules/PageObject.rst
@@ -6,14 +6,12 @@ The PageObject Class
:undoc-members:
:show-inheritance:

.. autoclass:: pypdf._utils.ImageFile
.. autoclass:: pypdf._page.VirtualListImages
:members:
:undoc-members:
:show-inheritance:
:exclude-members: IndirectObject

.. autoclass:: pypdf._utils.File
.. autoclass:: pypdf._page.ImageFile
:members:
:inherited-members: File
:undoc-members:
:show-inheritance:
:exclude-members: IndirectObject
20 changes: 7 additions & 13 deletions docs/user/file-size.md
@@ -9,23 +9,17 @@ Some PDF documents contain the same object multiple times. For example, if an
image appears three times in a PDF it could be embedded three times. Or it can
be embedded once and referenced twice.

-This can be done by reading and writing the file:
+When adding data to a PdfWriter, the data is copied while respecting the original format.
+For example, if two pages include the same image which is duplicated in the source document, the object will be duplicated in the PdfWriter object.

-```python
-from pypdf import PdfReader, PdfWriter
-
-reader = PdfReader("big-old-file.pdf")
-writer = PdfWriter()
+Additionally, when you delete objects in a document, pypdf cannot easily identify whether the objects are used elsewhere or not or if the user wants to keep them in. When writing the PDF file, these objects will be hidden within (part of the file, but not displayed).

-for page in reader.pages:
-    writer.add_page(page)
+In order to reduce the file size, use a compression call: `writer.compress_identical_objects(remove_identicals=True, remove_orphans=True)`

-if reader.metadata is not None:
-    writer.add_metadata(reader.metadata)
+* `remove_identicals` enables/disables compression merging identical objects.
+* `remove_orphans` enables/disables suppression of unused objects.

-with open("smaller-new-file.pdf", "wb") as fp:
-    writer.write(fp)
-```
+It is recommended to apply this process just before writing to the file/stream.

It depends on the PDF how well this works, but we have seen an 86% file
reduction (from 5.7 MB to 0.8 MB) within a real PDF.
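
The two options described in this documentation change can be pictured with a toy model. The sketch below is not pypdf's implementation: the object table (object id mapped to a payload and a list of referenced ids) and the function `compress` are hypothetical stand-ins. It only illustrates what merging identical objects and dropping unreachable ones mean.

```python
from typing import Dict, List, Tuple

# Toy object table: object id -> (payload, ids of the objects it references).
ObjTable = Dict[int, Tuple[bytes, List[int]]]


def compress(
    objects: ObjTable,
    root: int,
    remove_identicals: bool = True,
    remove_orphans: bool = True,
) -> ObjTable:
    objs = {oid: (data, list(refs)) for oid, (data, refs) in objects.items()}

    # Merge duplicates; repeat until stable, because merging two leaves can
    # make the objects that reference them identical as well.
    while remove_identicals:
        first_seen: Dict[Tuple[bytes, Tuple[int, ...]], int] = {}
        remap: Dict[int, int] = {}
        for oid in sorted(objs):
            data, refs = objs[oid]
            key = (data, tuple(refs))
            if key in first_seen:
                remap[oid] = first_seen[key]
            else:
                first_seen[key] = oid
        if not remap:
            break
        objs = {
            oid: (data, [remap.get(r, r) for r in refs])
            for oid, (data, refs) in objs.items()
            if oid not in remap
        }
        root = remap.get(root, root)

    # Drop everything that cannot be reached from the root object.
    if remove_orphans:
        reachable = set()
        stack = [root]
        while stack:
            oid = stack.pop()
            if oid not in reachable:
                reachable.add(oid)
                stack.extend(objs[oid][1])
        objs = {oid: entry for oid, entry in objs.items() if oid in reachable}
    return objs
```

In this model, an image stored twice collapses to one entry, and an object left behind by a deletion disappears once nothing references it.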
34 changes: 14 additions & 20 deletions pypdf/_cmap.py
@@ -3,11 +3,10 @@
from typing import Any, Dict, List, Tuple, Union, cast

from ._codecs import adobe_glyphs, charset_encoding
from ._utils import b_, logger_error, logger_warning
from ._utils import logger_error, logger_warning
from .generic import (
DecodedStreamObject,
DictionaryObject,
IndirectObject,
NullObject,
StreamObject,
)
@@ -127,6 +126,8 @@ def build_char_map_from_dict(
"/ETenms-B5-V": "cp950",
"/UniCNS-UTF16-H": "utf-16-be",
"/UniCNS-UTF16-V": "utf-16-be",
"/UniGB-UTF16-H": "gb18030",
"/UniGB-UTF16-V": "gb18030",
# UCS2 in code
}

@@ -258,8 +259,8 @@ def prepare_cm(ft: DictionaryObject) -> bytes:
tu = ft["/ToUnicode"]
cm: bytes
if isinstance(tu, StreamObject):
cm = b_(cast(DecodedStreamObject, ft["/ToUnicode"]).get_data())
elif isinstance(tu, str) and tu.startswith("/Identity"):
cm = cast(DecodedStreamObject, ft["/ToUnicode"]).get_data()
else: # if (tu is None) or cast(str, tu).startswith("/Identity"):
# the full range 0000-FFFF will be processed
cm = b"beginbfrange\n<0000> <0001> <0000>\nendbfrange"
if isinstance(cm, str):
@@ -448,34 +449,27 @@ def compute_space_width(
en: int = cast(int, ft["/LastChar"])
if st > space_code or en < space_code:
raise Exception("Not in range")
if w[space_code - st] == 0:
if w[space_code - st].get_object() == 0:
raise Exception("null width")
sp_width = w[space_code - st]
sp_width = w[space_code - st].get_object()
except Exception:
if "/FontDescriptor" in ft and "/MissingWidth" in cast(
DictionaryObject, ft["/FontDescriptor"]
):
sp_width = ft["/FontDescriptor"]["/MissingWidth"] # type: ignore
sp_width = ft["/FontDescriptor"]["/MissingWidth"].get_object() # type: ignore
else:
# will consider width of char as avg(width)/2
m = 0
cpt = 0
for x in w:
if x > 0:
m += x
for xx in w:
xx = xx.get_object()
if xx > 0:
m += xx
cpt += 1
sp_width = m / max(1, cpt) / 2

if isinstance(sp_width, IndirectObject):
# According to
# 'Table 122 - Entries common to all font descriptors (continued)'
# the MissingWidth should be a number, but according to #2286 it can
# be an indirect object
obj = sp_width.get_object()
if obj is None or isinstance(obj, NullObject):
return 0.0
return obj # type: ignore

if sp_width is None or isinstance(sp_width, NullObject):
sp_width = 0.0
return sp_width
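
The fallback order in this hunk (width of the space glyph, then `/MissingWidth`, then half the average of the positive widths) can be sketched on plain numbers. This is a simplified illustration, not the pypdf function: `estimate_space_width` and its parameters are hypothetical, and indirect-object resolution via `get_object()` is left out.

```python
from typing import Optional, Sequence


def estimate_space_width(
    widths: Sequence[float],
    first_char: int,
    last_char: int,
    space_code: int = 32,
    missing_width: Optional[float] = None,
) -> float:
    # 1) Use the entry for the space character when it is in range and non-zero.
    if first_char <= space_code <= last_char:
        idx = space_code - first_char
        if idx < len(widths) and widths[idx] != 0:
            return float(widths[idx])
    # 2) Otherwise fall back on the font descriptor's missing width, if any.
    if missing_width is not None:
        return float(missing_width)
    # 3) Otherwise take half the average of the positive widths.
    positive = [w for w in widths if w > 0]
    return sum(positive) / max(1, len(positive)) / 2
```

The `max(1, ...)` guard mirrors the hunk: an empty or all-zero width list yields 0.0 instead of a division by zero.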


71 changes: 61 additions & 10 deletions pypdf/_doc_common.py
@@ -49,7 +49,6 @@
from ._page import PageObject, _VirtualList
from ._page_labels import index2label as page_index2page_label
from ._utils import (
b_,
deprecate_with_replacement,
logger_warning,
parse_iso8824_date,
@@ -66,9 +65,7 @@
from .constants import FieldDictionaryAttributes as FA
from .constants import PageAttributes as PG
from .constants import PagesAttributes as PA
from .errors import (
PdfReadError,
)
from .errors import PdfReadError, PyPdfError
from .generic import (
ArrayObject,
BooleanObject,
@@ -255,6 +252,8 @@ class PdfDocCommon:

_encryption: Optional[Encryption] = None

_readonly: bool = False

@property
@abstractmethod
def root_object(self) -> DictionaryObject:
@@ -350,7 +349,7 @@ def get_num_pages(self) -> int:
return self.root_object["/Pages"]["/Count"] # type: ignore
else:
if self.flattened_pages is None:
self._flatten()
self._flatten(self._readonly)
assert self.flattened_pages is not None
return len(self.flattened_pages)

@@ -367,10 +366,47 @@ def get_page(self, page_number: int) -> PageObject:
A :class:`PageObject<pypdf._page.PageObject>` instance.
"""
if self.flattened_pages is None:
self._flatten()
self._flatten(self._readonly)
assert self.flattened_pages is not None, "hint for mypy"
return self.flattened_pages[page_number]

def _get_page_in_node(
self,
page_number: int,
) -> Tuple[DictionaryObject, int]:
"""
Retrieve the node and the position within its /Kids that contain the page.
If page_number is greater than the number of pages, return the top node and -1.
"""
top = cast(DictionaryObject, self.root_object["/Pages"])

def recurs(node: DictionaryObject, mi: int) -> Tuple[Optional[PdfObject], int]:
ma = cast(int, node.get("/Count", 1)) # default 1 for /Page types
if node["/Type"] == "/Page":
if page_number == mi:
return node, -1
# else:
return None, mi + 1
if (page_number - mi) >= ma: # not in nodes below
if node == top:
return top, -1
# else
return None, mi + ma
for idx, kid in enumerate(cast(ArrayObject, node["/Kids"])):
kid = cast(DictionaryObject, kid.get_object())
n, i = recurs(kid, mi)
if n is not None: # page has just been found ...
if i < 0: # ... just below!
return node, idx
# else: # ... at lower levels
return n, i
mi = i
raise PyPdfError("abnormal, can not find the node")

node, idx = recurs(top, 0)
assert isinstance(node, DictionaryObject)
return node, idx

@property
def named_destinations(self) -> Dict[str, Any]:
"""
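
The idea behind `_get_page_in_node` in this hunk, using `/Count` to skip whole subtrees while looking for the `/Kids` slot that holds a page, can be sketched on plain dictionaries. This is an iterative illustration under stated assumptions, not the method from the diff: `find_page_slot` and the dict-based node layout are hypothetical, and rejecting negative page numbers is a choice made for this sketch.

```python
from typing import Any, Dict, Tuple

Node = Dict[str, Any]


def page_count(node: Node) -> int:
    """A leaf page counts as 1; an inner node reports its /Count."""
    return 1 if node["Type"] == "Page" else node["Count"]


def find_page_slot(top: Node, page_number: int) -> Tuple[Node, int]:
    """Return (parent node, index in its Kids), or (top, -1) when out of range."""
    if page_number < 0:
        raise ValueError("page_number must be non-negative")
    if page_number >= top["Count"]:
        return top, -1
    node = top
    first = 0  # index of the first page located below the current kid
    while True:
        for idx, kid in enumerate(node["Kids"]):
            n = page_count(kid)
            if page_number < first + n:
                if kid["Type"] == "Page":
                    return node, idx
                node = kid  # descend into the subtree that holds the page
                break
            first += n  # skip this whole subtree
        else:
            raise ValueError("inconsistent Count values in page tree")
```

Returning the parent together with the index is what makes insertion or removal possible without flattening the whole tree first.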
@@ -1083,10 +1119,19 @@ def page_mode(self) -> Optional[PagemodeType]:

def _flatten(
self,
list_only: bool = False,
pages: Union[None, DictionaryObject, PageObject] = None,
inherit: Optional[Dict[str, Any]] = None,
indirect_reference: Optional[IndirectObject] = None,
) -> None:
"""
Prepare the document pages to ease searching.
Args:
list_only: will only list the pages within _flatten_pages.
pages,
inherit,
indirect_reference: used recursively to flatten the /Pages object.
"""
inheritable_page_attributes = (
NameObject(PG.RESOURCES),
NameObject(PG.MEDIABOX),
@@ -1122,15 +1167,21 @@ def _flatten(
obj = page.get_object()
if obj:
# damaged file may have invalid child in /Pages
self._flatten(obj, inherit, **addt)
try:
self._flatten(list_only, obj, inherit, **addt)
except RecursionError:
raise PdfReadError(
"Maximum recursion depth reached during page flattening."
)
elif t == "/Page":
for attr_in, value in list(inherit.items()):
# if the page has its own value, it does not inherit the
# parent's value:
if attr_in not in pages:
pages[attr_in] = value
page_obj = PageObject(self, indirect_reference)
page_obj.update(pages)
if not list_only:
page_obj.update(pages)

# TODO: Could flattened_pages be None at this point?
self.flattened_pages.append(page_obj) # type: ignore
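
Two things happen in this hunk: a `RecursionError` raised during flattening is converted into a read error, and a `list_only` mode skips copying the page content. A minimal sketch on plain dictionaries follows. Several points are assumptions: `PageTreeError` stands in for `PdfReadError`, only the two inheritable names visible in the diff are used, and `list_only` here simply records the original page nodes, which may differ from what pypdf does internally.

```python
from typing import Any, Dict, List, Optional

Node = Dict[str, Any]
INHERITABLE = ("Resources", "MediaBox")  # only the names visible in the diff


class PageTreeError(Exception):
    """Stand-in for PdfReadError in this sketch."""


def flatten(
    node: Node,
    list_only: bool = False,
    inherit: Optional[Node] = None,
    out: Optional[List[Node]] = None,
) -> List[Node]:
    inherit = dict(inherit or {})
    out = [] if out is None else out
    if node["Type"] == "Pages":
        # Inner node: remember inheritable attributes, then walk the kids.
        for name in INHERITABLE:
            if name in node:
                inherit[name] = node[name]
        for kid in node["Kids"]:
            try:
                flatten(kid, list_only, inherit, out)
            except RecursionError:
                # A cyclic or absurdly deep tree becomes a domain error.
                raise PageTreeError(
                    "Maximum recursion depth reached during page flattening."
                ) from None
    elif list_only:
        out.append(node)  # record the page without copying anything
    else:
        out.append({**inherit, **node})  # the page's own values win
    return out
```

A page tree whose `Kids` list points back at itself triggers `PageTreeError` instead of surfacing a raw `RecursionError` to the caller.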
@@ -1154,7 +1205,7 @@ def remove_page(
or destinations to reference a detached page.
"""
if self.flattened_pages is None:
self._flatten()
self._flatten(self._readonly)
assert self.flattened_pages is not None
if isinstance(page, IndirectObject):
p = page.get_object()
@@ -1258,7 +1309,7 @@ def xfa(self) -> Optional[Dict[str, Any]]:
if isinstance(f, IndirectObject):
field = cast(Optional[EncodedStreamObject], f.get_object())
if field:
es = zlib.decompress(b_(field._data))
es = zlib.decompress(field._data)
retval[tag] = es
return retval

6 changes: 3 additions & 3 deletions pypdf/_encryption.py
@@ -43,7 +43,7 @@
rc4_encrypt,
)

from ._utils import b_, logger_warning
from ._utils import logger_warning
from .generic import (
ArrayObject,
ByteStringObject,
@@ -78,7 +78,7 @@ def encrypt_object(self, obj: PdfObject) -> PdfObject:
elif isinstance(obj, StreamObject):
obj2 = StreamObject()
obj2.update(obj)
obj2.set_data(self.stm_crypt.encrypt(b_(obj._data)))
obj2.set_data(self.stm_crypt.encrypt(obj._data))
for key, value in obj.items(): # Don't forget the Stream dict.
obj2[key] = self.encrypt_object(value)
obj = obj2
@@ -96,7 +96,7 @@ def decrypt_object(self, obj: PdfObject) -> PdfObject:
data = self.str_crypt.decrypt(obj.original_bytes)
obj = create_string_object(data)
elif isinstance(obj, StreamObject):
obj._data = self.stm_crypt.decrypt(b_(obj._data))
obj._data = self.stm_crypt.decrypt(obj._data)
for key, value in obj.items(): # Don't forget the Stream dict.
obj[key] = self.decrypt_object(value)
elif isinstance(obj, DictionaryObject):