pillow is not working properly #42

Redsandro · 2022-02-20T16:36:59Z

Using -J pillow results in terrible images. It looks like the image is resampled 4 to 1.

recode_pdf -v --dpi 300 \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow.pdf

Here is the -J pillow foreground layer:

For comparison, here is -J kakadu:

The resulting files are approximately similar in size. Is pillow really absurdly bad, or does it need different compression parameters? I wanted to try this out, but recode_pdf doesn't like the documented compression flags and throws an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

Test scan to experiment with

test_1.png.zip

Suggested actionables

Use sane defaults for pillow so quality is reasonable.
Show clear distinct error message so user doesn't get ambiguous ValueError when following the docs.
Update documentation with Pillow compression flags.


MerlijnWajer · 2022-04-03T07:44:09Z

What version did you try this with? I recently updated some of the compression parameters to be more in line with the kakadu ones. Could you retry with the latest version?

Pillow should be the same as openjpeg.

MerlijnWajer · 2022-05-02T18:12:35Z

I think Kakadu is doing a better job adapting to the input images, at least with my default parameters. With Pillow/OpenJPEG it's just a standard reduction, whereas I think Kakadu might do something more clever. You could experiment with other values like the -q flags, instead of -r.

MerlijnWajer · 2022-05-02T18:19:45Z

If you use the build from issue #41 you could toy around some with it, but I tried again to use a single value for q and it ends up real ugly at the same filesize as kakadu. I agree the foreground layer could be better - but does it make a big difference in the mrc-combined final result?

Redsandro · 2022-05-02T20:22:38Z

Thanks for the tip. I will experiment more with the latest version when I have a moment and let you know, although the previous reduction I observed does not make me confident there will be any interesting results.

does it make a big difference in the mrc-combined final result?

You mean optically speaking, right?

I will get back to you.

MerlijnWajer · 2022-05-02T20:28:43Z

Right, I meant whether the resulting PDF optically looks much worse. kakadu definitely seems to be better, but there are probably ways to make OpenJPEG better; I just haven't invested a lot of time in trying all the different knobs.

Redsandro · 2022-05-04T18:03:14Z

I tried using internetarchivepdf 1.4.13 and verified that -J pillow looks very bad by default. Not 'a bit worse', but extremely bad. The mask makes the text readable, but the colors are smudged. There is hardly any high frequency data at all.

If it was simply super compressed, there would be a use case somewhere for someone, but it has about the same compression ratio as kakadu so it makes you wonder: How does pillow waste so much space if it doesn't show any detail beyond low frequency smudges?

The initially reported problem still exists, so I cannot use -r or experiment with -q.

recode_pdf doesn't like the documented compression-flags and will throw an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

The error message does not give me a clue about the problem. I'm using different variants, but adding the space is what the documentation suggests.

Do you get similar results or is it just me? In case of the former, if pillow can be tweaked to look half decent, I would suggest adding some pillow-specific defaults. If not, I'd give pillow a label or warning message: "Bad quality, for testing purposes only."

mara004 · 2022-05-04T18:30:47Z

If not, I'd give pillow a label or warning message: "Bad quality, for testing purposes only."

Can you please state which version of Pillow you are using?

python3 -m pip show pillow |grep Version

If recent versions of Pillow do not provide reasonable JP2 quality, perhaps someone should file an issue requesting that they improve their encoder?

MerlijnWajer · 2022-05-04T18:39:57Z

I think Pillow uses OpenJPEG so that might not help. I think we can get better quality with Pillow/OpenJPEG and Grok, but I just didn't invest the time in trying to find the right flags. Maybe see what happens with multi-layer encoding, as the help options also suggest?

Redsandro · 2022-05-04T19:44:11Z

Can you please state which version of Pillow you are using?

$ python3 -m pip show pillow | grep Version
Version: 8.3.2

I think Pillow uses OpenJPEG so that might not help.

Once I get #41 working I can do some comparisons.

I think we can get better quality with Pillow/OpenJPEG and Grok

I was interested in Grok because it sounds promising, but I couldn't get Grok to build or install on Ubuntu, so that's the one I haven't tried yet.

Maybe see what happens with multi-layer encoding, as the help options also suggest?

Could you show me where exactly I can read about this?

MerlijnWajer · 2022-05-04T19:49:13Z

I was thinking of this:

-r <compression ratio>,<compression ratio>,...
    Different compression ratios for successive layers.
    The rate specified for each quality level is the desired
    compression factor (use 1 for lossless)
    Decreasing ratios required.
      Example: -r 20,10,1 means
            quality layer 1: compress 20x,
            quality layer 2: compress 10x
            quality layer 3: compress lossless
    Options -r and -q cannot be used together.
-q <psnr value>,<psnr value>,<psnr value>,...
    Different psnr for successive layers (-q 30,40,50).
    Increasing PSNR values required, except 0 which can
    be used for the last layer to indicate it is lossless.
    Options -r and -q cannot be used together.

MerlijnWajer · 2022-05-04T19:54:08Z

The error message does not give me a clue about the problem. I'm using different variants, but adding the space is what the documentation suggests.

Right, so the flags for Pillow are unfortunately different. For Pillow you can do this:

quality_mode:"rates";quality_layers:[500]

MerlijnWajer · 2022-05-04T20:36:12Z

You can see all the supported flags here: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#jpeg-2000
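
As a rough illustration of what those flags mean at the Pillow level, the sketch below passes the same two options directly to Image.save. The helper names jp2_save_kwargs and save_jp2 are made up for this example and are not part of recode_pdf; quality_mode and quality_layers are the save options described on the Pillow page linked above.

```python
def jp2_save_kwargs(rate):
    """Build Pillow JPEG 2000 save options for a single quality layer."""
    # With quality_mode "rates", each layer value is an approximate
    # size-reduction factor, so a smaller number keeps more detail.
    return {"quality_mode": "rates", "quality_layers": [rate]}


def save_jp2(src_path, dst_path, rate=500):
    """Re-encode an image as JPEG 2000 (hypothetical helper).

    Requires Pillow built with OpenJPEG support.
    """
    from PIL import Image  # imported lazily so jp2_save_kwargs works without Pillow

    with Image.open(src_path) as im:
        im.save(dst_path, format="JPEG2000", **jp2_save_kwargs(rate))
```

Calling save_jp2("in.png", "fg.jp2", rate=500) would then be roughly what the flag string quality_mode:"rates";quality_layers:[500] asks Pillow to do, assuming recode_pdf forwards the options unchanged.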

Redsandro · 2022-05-05T19:43:40Z

Thank you @MerlijnWajer this helps.

Right, so the flags for Pillow are unfortunately different. For Pillow you can do this: quality_mode:"rates";quality_layers:[500]

It works! I'm not getting the error. No spaces allowed. So to address the second part of the initial issue, perhaps you can catch ValueError for all implementation dependent compression flags, and output an error message something like this:

Invalid compression flags for {implementation}.
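
A minimal sketch of that idea, assuming the Pillow flags are a ;-separated list of key:value pairs as in the example quoted above. The function name parse_pillow_flags, the use of json.loads for the values, and the exact wording of the message are assumptions for illustration, not the actual code in internetarchivepdf/jpeg2000.py.

```python
import json


def parse_pillow_flags(flags, implementation="pillow"):
    """Turn 'key:value;key:value' into a dict of keyword arguments."""
    kwargs = {}
    for entry in flags.split(";"):
        try:
            # An entry without ':' (such as ' -r 750') yields a single
            # element here, which is what triggers the bare ValueError.
            key, value = entry.split(":", maxsplit=1)
            # json.JSONDecodeError is a ValueError subclass, so malformed
            # values end up in the same except branch.
            kwargs[key.strip()] = json.loads(value)
        except ValueError as exc:
            raise ValueError(
                f"Invalid compression flags for {implementation}: {flags!r}"
            ) from exc
    return kwargs
```

With this sketch, the flag string from the comment above parses into a dict, while ' -r 750' raises a ValueError that names the implementation and echoes the offending string instead of the unpacking error.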

Turns out pillow is just really quite bad at lower quality settings but cleans up with some better quality. To me it becomes acceptable at around 220:

recode_pdf -v --dpi 300 -J pillow \
  --fg-compression-flags 'quality_layers:[220]' \
  -I in.png --hocr-file in.hocr -o out-pillow-r220.pdf

So to address the first part of the initial issue, you could set these as the default fg flags if the user doesn't set otherwise, so users won't think it's broken like I did. 😅

pillow default: 👎

pillow quality_layers:[220]: 👍

kakadu default: 👍

Redsandro · 2022-05-05T19:54:38Z

I wanted to do a simple PR for the help output, but I'm not sure how, so I've added a 3rd checkbox to the initial issue instead.

Right now recode_pdf tells us:

Default for kakadu is '-slope 44250',default for grok/openjpeg is '-r 500'. Pass with quoted and with a space at the start:' --flag value'

There is no space at the start in the examples, and with pillow the space causes the error. I think this text is outdated.

Suggestion:

Defaults are kakadu: '-slope 44250'; grok/openjpeg: '-r 500'; pillow: 'quality_layers:[220]'. Pass with quotes.

MerlijnWajer · 2022-05-05T21:34:36Z

Right, pillow flags aren't documented there and those should not start with a space. The thing with the space is that if you do something like --bg-compression-flags '--this-flag', then Python starts parsing --this-flag as a flag. That's why the quotes with a space are required - there's no easy way around that unfortunately.

Regarding the default pillow/openjpeg flags, could you compare the filesizes? My suspicion is that now the resulting PDFs will be quite a bit larger than the kakadu ones. I tried to have similar file sizes, rather than similar quality (which I agree might not have been the best idea).

Redsandro · 2022-05-06T01:29:05Z

if you do something like --bg-compression-flags '--this-flag', then Python starts parsing --this-flag as a flag. That's why the quotes with a space are required

Oh now I get it! It's exclusively for double dashes. That's why -slope and -r work fine and are documented without a space, which makes the instructions unclear for people who didn't know this (such as myself).

Regarding the default pillow/openjpeg flags, could you compare the filesizes?

Yes you are correct, 145kb kakadu size vs 210kb pillow size. I understand the rationale for targeting the same size. It's just that pillow doesn't perform acceptably at such low quality, so without usable defaults the user will always have to figure out how to change the default.

MerlijnWajer · 2022-05-06T07:57:08Z

Yes you are correct, 145kb kakadu size vs 210kb pillow size. I understand the rationale for targeting the same size. It's just that pillow doesn't perform acceptably at such low quality, so without usable defaults the user will always have to figure out how to change the default.

Ok, that is fair enough, I guess that's sensible reasoning. Reminds me again that maybe having some "compression profiles" makes sense, so like:

standard: where kakadu/pillow/openjpeg look the same, but do not have the same file sizes
kakadu-roi-standard: as above, but kakadu only, with roi
aggressive: really aggressive compression
quality: quality over compression (mostly)

And there could also be profiles for specific content, like:

books
comicbooks
scanned film material
etc

MerlijnWajer · 2022-05-06T08:00:16Z

Oh now I get it! It's exclusively for double dashes. That's why -slope and -r work fine and are documented without a space, which makes the instructions unclear for people who didn't know this (such as myself).

https://stackoverflow.com/questions/16174992/cant-get-argparse-to-read-quoted-string-with-dashes-in-it

Rereading this thread, I think the better solution is to use --bg-compression-flags='--foo'

I will test this, update the documentation, and remove the space stripping hack.

MerlijnWajer · 2022-05-06T08:02:33Z

Yeah, that works:

recode_pdf --dpi 300 --bg-compression-flags='-q 25' --fg-compression-flags='-q 26' -J openjpeg -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-openjpeg.pdf
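
The behaviour discussed here can be reproduced with a few lines of plain argparse. This is a standalone sketch with a single made-up parser, not recode_pdf's real argument setup; it only shows how argparse treats a value that starts with dashes in the three forms mentioned in this thread.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--bg-compression-flags")


def parse(argv):
    """Return the parsed flag value, or None if argparse rejects the input."""
    try:
        return parser.parse_args(argv).bg_compression_flags
    except SystemExit:
        # argparse reports the problem on stderr and then exits
        return None


# Separate argument that looks like an option: rejected.
rejected = parse(["--bg-compression-flags", "--foo"])
# Same value attached with '=': accepted as-is.
attached = parse(["--bg-compression-flags=--foo"])
# Leading-space workaround: accepted, but the space is part of the value.
spaced = parse(["--bg-compression-flags", " --foo"])
```

The '=' form avoids both the rejection and the stray leading space, which is why no space stripping is needed with it.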

Redsandro · 2022-05-06T20:19:03Z

Reminds me again that maybe having some "compression profiles" makes sense

Absolutely. It saves users a lot of time. When the h264 encoder got presets (fast, slow etc) and content profiles (grainy film, cartoon etc) it became a lot more pleasurable to use.

Takes a lot of effort to set up though. So if the default makes sense, that's a good start. You may want to take time to hone presets and keep them undocumented until you are happy with the result.

MerlijnWajer · 2022-05-07T12:14:52Z

I have created an issue for this feature request #48

If I add the flags that you recommend, then I think we can close this bug, right?

Maybe we should ask for help on the openjpeg mailing list - they might have some tips/advice.

Redsandro · 2022-05-07T14:33:57Z

I have created an issue for this feature request #48

If I add the flags that you recommend, then I think we can close this bug, right?

Here are my recommended tasks from the original issue. Feel free to close this issue.

Suggested actionables

Use sane defaults for pillow so quality is reasonable. (Create better presets for users with quality-comparable options for openjpeg/grok/pillow and kakadu #48)

Show clear distinct error message so user doesn't get ambiguous ValueError when following the docs.

Update documentation with Pillow compression flags. (pillow is not working properly #42 (comment))

(with "documentation" I actually meant recode_pdf --help)

Redsandro · 2022-05-25T13:44:46Z

@MerlijnWajer is it possible to re-recode PDFs that were done using pillow with kakadu? Not that it would increase quality, but the images are a lot bigger while at the same time so terrible that the only thing that saves them is the mask.

I think re-doing them with kakadu may remove half the filesize with minimal to no quality loss since the images are already so blurry.

MerlijnWajer · 2022-05-26T14:23:37Z

You could try to render them to a page (combining the MRC into a normal page), and then recompressing them. I don't have a tool to do this exact thing, but it should not be too hard with pymupdf. Maybe mutool can just render the final pages to images, and then you can try to recompress them.
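
A rough sketch of the mutool route, assuming MuPDF's mutool is installed and on the PATH. The helper names and the output pattern are invented for this example; only mutool draw with its -r (resolution) and -o (output pattern) options is relied on, and the recompression step itself is left out.

```python
import subprocess


def mutool_draw_command(pdf_path, out_pattern="page-%d.png", dpi=300):
    """Build a `mutool draw` command that renders every page to an image."""
    # -r sets the resolution in DPI, -o the output file pattern
    # (%d is replaced by the page number).
    return ["mutool", "draw", "-r", str(dpi), "-o", out_pattern, pdf_path]


def render_pages(pdf_path, out_pattern="page-%d.png", dpi=300):
    """Run the command; requires mutool (MuPDF) to be installed."""
    subprocess.run(mutool_draw_command(pdf_path, out_pattern, dpi), check=True)
```

The rendered pages are flattened images, so the original MRC mask and layers are not preserved; they would be regenerated when the pages are recoded.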

Redsandro · 2022-05-26T15:48:02Z

Thank you for pointing me in the right direction. I think I should keep the mask as generated by recode-pdf, and just re-encode the pillow jp2 with kakadu externally. PyMuPDF can indeed do just that: PyMuPDF-Utilities/image-replacement. It's slightly more complicated than it sounds though. I need to read up on the xrefs to understand what the example is doing, or find a tool that automates the xref business so I can just take care of re-encoding all jp2 images, preferably including some way to distinguish between what recode-pdf intended to be foreground and background.

mara004 · 2022-05-26T16:24:00Z

is it possible to re-recode PDFs that were done using pillow with kakadu? Not that it would increase quality, but the images are a lot bigger while at the same time so terrible that the only thing that saves them is the mask.

Provided you still have the original input data, wouldn't it be better to use that directly to avoid additional quality loss?

Redsandro · 2022-05-26T16:52:49Z

Provided you still have the original input data, wouldn't it be better to use that directly to avoid additional quality loss?

Absolutely, always, 100%.

But the original data is gone and so is the analog paper. It's just that the pillow images are so inconceivably bad compared to their file size, which is about 150kb per page. I think those jp2 can be re-compressed with kakadu at a third of their size with hardly any quality loss because pillow turned them into blurry smudges.

Maybe the end result is 85 kb versus 150 kb but it adds up if you have scanned and destroyed a lot of material before realizing kakadu was not used by default when available.

Redsandro mentioned this issue May 6, 2022

openjpeg is not working properly #41

Closed

3 tasks

MerlijnWajer mentioned this issue May 7, 2022

Create better presets for users with quality-comparable options for openjpeg/grok/pillow and kakadu #48

Open
