pillow is not working properly #42

Open
1 of 3 tasks
Redsandro opened this issue Feb 20, 2022 · 27 comments

Comments

@Redsandro
Contributor

Redsandro commented Feb 20, 2022

Using -J pillow results in terrible images. It looks like the image is resampled 4 to 1.

recode_pdf -v --dpi 300 \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow.pdf

Here is the -J pillow foreground layer:
[image: pillow foreground layer]

For comparison, here is -J kakadu:
[image: kakadu foreground layer]

The resulting files are approximately similar in size. Is pillow really absurdly bad, or does it need different compression parameters? I wanted to try this out, but recode_pdf doesn't accept the documented compression flags and throws an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

Test scan to experiment with

test_1.png.zip

Suggested actionables

  • Use sane defaults for pillow so quality is reasonable.
  • Show a clear, distinct error message so the user doesn't get an ambiguous ValueError when following the docs.
  • Update documentation with Pillow compression flags.
@MerlijnWajer
Collaborator

What version did you try this with? I recently updated some of the compression parameters to be more in line with the kakadu ones. Could you retry with the latest version?

Pillow should be the same as openjpeg.

@MerlijnWajer
Collaborator

I think Kakadu is doing a better job adapting to the input images, at least with my default parameters. It's just a standard reduction, whereas I think Kakadu might do something more clever. You could experiment with other values like the -q flags, instead of -r.

@MerlijnWajer
Collaborator

If you use the build from issue #41 you could toy around some with it, but I tried again to use a single value for q and it ends up real ugly at the same filesize as kakadu. I agree the foreground layer could be better - but does it make a big difference in the mrc-combined final result?

@Redsandro
Contributor Author

Thanks for the tip. I will experiment more with the latest version when I have a moment and let you know, although the previous reduction I observed does not make me confident there will be any interesting results.

does it make a big difference in the mrc-combined final result?

You mean optically speaking, right?

I will get back to you.

@MerlijnWajer
Collaborator

Right, I meant if the resulting PDF optically looks much worse. kakadu definitely seems to be better, but there are probably ways to make OpenJPEG better, I just haven't invested a lot of time in trying all the different knobs.

@Redsandro
Contributor Author

I tried using internetarchivepdf 1.4.13 and verified that -J pillow looks very bad by default. Not 'a bit worse', but extremely bad. The mask makes the text readable, but the colors are smudged. There is hardly any high frequency data at all.

If it was simply super compressed, there would be a use case somewhere for someone, but it has about the same compression ratio as kakadu so it makes you wonder: How does pillow waste so much space if it doesn't show any detail beyond low frequency smudges?

The initially reported problem still exists, so I cannot use -r or experiment with -q.

recode_pdf doesn't like the documented compression-flags and will throw an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

The error message does not give me a clue about the problem. I'm using different variants, but adding the space is what the documentation suggests.

Do you get similar results or is it just me? In case of the former, if pillow can be tweaked to look half decent, I would suggest adding some pillow-specific defaults. If not, I'd give pillow a label or warning message: "Bad quality, for testing purposes only."

@mara004
Contributor

mara004 commented May 4, 2022

If not, I'd give pillow a label or warning message: "Bad quality, for testing purposes only."

Can you please state which version of Pillow you are using?

python3 -m pip show pillow |grep Version

If recent versions of Pillow do not provide reasonable JP2 quality, perhaps someone should file an issue requesting that they improve their encoder?

@MerlijnWajer
Collaborator

I think Pillow uses OpenJPEG so that might not help. I think we can get better quality with Pillow/OpenJPEG and Grok, but I just didn't invest the time in trying to find the right flags. Maybe see what happens with multi-layer encoding, as the help options also suggest?

@Redsandro
Contributor Author

Can you please state which version of Pillow you are using?

$ python3 -m pip show pillow | grep Version
Version: 8.3.2

I think Pillow uses OpenJPEG so that might not help.

Once I get #41 working I can do some comparisons.

I think we can get better quality with Pillow/OpenJPEG and Grok

I was interested in Grok because it sounds promising, but I couldn't get Grok to build or install on Ubuntu, so that's the one I haven't tried yet.

Maybe see what happens with multi-layer encoding, as the help options also suggest?

Could you show me where exactly I can read about this?

@MerlijnWajer
Collaborator

I was thinking of this:

-r <compression ratio>,<compression ratio>,...
    Different compression ratios for successive layers.
    The rate specified for each quality level is the desired
    compression factor (use 1 for lossless)
    Decreasing ratios required.
      Example: -r 20,10,1 means
            quality layer 1: compress 20x,
            quality layer 2: compress 10x
            quality layer 3: compress lossless
    Options -r and -q cannot be used together.
-q <psnr value>,<psnr value>,<psnr value>,...
    Different psnr for successive layers (-q 30,40,50).
    Increasing PSNR values required, except 0 which can
    be used for the last layer to indicate it is lossless.
    Options -r and -q cannot be used together.
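To make the ordering rules in that help text concrete, here is a minimal Python sketch that checks a layer list against them. The function name `check_layers` is hypothetical and is not part of recode_pdf or OpenJPEG, and reading "decreasing" and "increasing" as strict orderings is an assumption.

```python
def check_layers(flag, values):
    """Check a layer list against the ordering rules quoted in the help text.

    flag is "-r" (compression ratios) or "-q" (PSNR values).
    Returns the values unchanged, or raises ValueError.
    """
    if not values:
        raise ValueError(f"{flag} needs at least one value")
    if flag == "-r":
        # Ratios must decrease from layer to layer; 1 means lossless.
        if not all(a > b for a, b in zip(values, values[1:])):
            raise ValueError("-r expects decreasing ratios, e.g. 20,10,1")
    elif flag == "-q":
        # PSNR must increase, except a trailing 0 marking a lossless last layer.
        body = values[:-1] if len(values) > 1 and values[-1] == 0 else values
        if not all(a < b for a, b in zip(body, body[1:])):
            raise ValueError("-q expects increasing PSNR values, e.g. 30,40,50")
    else:
        raise ValueError(f"unknown flag {flag!r}")
    return values
```

With this, `check_layers("-r", [20, 10, 1])` passes, while `check_layers("-r", [10, 20])` is rejected before any encoder is invoked.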

@MerlijnWajer
Collaborator

The error message does not give me a clue about the problem. I'm using different variants, but adding the space is what the documentation suggests.

Right, so the flags for Pillow are unfortunately different. For Pillow you can do this:

quality_mode:"rates";quality_layers:[500]

@MerlijnWajer
Collaborator

You can see all the supported flags here: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#jpeg-2000
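As a rough illustration of how the two flag styles relate, the sketch below translates an OpenJPEG-style `-r` or `-q` string into the keyword arguments that Pillow's JPEG 2000 writer documents (`quality_mode` of "rates" or "dB", plus `quality_layers`). The helper `opj_to_pillow_kwargs` is hypothetical and is not how recode_pdf handles flags; mapping `-q` to Pillow's "dB" mode is an assumption based on both being described as signal-to-noise values.

```python
import shlex

# Hypothetical mapping from OpenJPEG-style flags to Pillow's quality_mode.
_MODES = {"-r": "rates", "-q": "dB"}


def opj_to_pillow_kwargs(flags):
    """Translate e.g. '-r 20,10,1' into Pillow JPEG 2000 save() kwargs."""
    tokens = iter(shlex.split(flags))
    kwargs = {}
    for flag in tokens:
        if flag not in _MODES:
            raise ValueError(f"unsupported flag: {flag!r}")
        if kwargs:
            raise ValueError("only one of -r or -q may be given")
        values = next(tokens, None)
        if values is None:
            raise ValueError(f"{flag} needs a comma-separated value list")
        kwargs["quality_mode"] = _MODES[flag]
        kwargs["quality_layers"] = [float(v) for v in values.split(",")]
    return kwargs
```

The result could then be passed on as `img.save("out.jp2", **kwargs)`, assuming a Pillow build with OpenJPEG support.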

@Redsandro
Contributor Author

Redsandro commented May 5, 2022

Thank you @MerlijnWajer this helps.

Right, so the flags for Pillow are unfortunately different. For Pillow you can do this: quality_mode:"rates";quality_layers:[500]

It works! I'm not getting the error. No spaces allowed. So to address the second part of the initial issue, perhaps you can catch ValueError for all implementation dependent compression flags, and output an error message something like this:

Invalid compression flags for {implementation}.
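A minimal sketch of what such error handling could look like, assuming the Pillow flags are `key:value` pairs separated by `;` as in the working example. The names `pillow_flags_to_kwargs` and `CompressionFlagError` are hypothetical, and parsing each value as JSON is an assumption for illustration, not the actual recode_pdf implementation.

```python
import json


class CompressionFlagError(ValueError):
    """Raised when a compression flag string cannot be parsed."""


def pillow_flags_to_kwargs(flags, implementation="pillow"):
    """Parse 'key:value;key:value' into a dict of keyword arguments."""
    kwargs = {}
    for entry in flags.split(";"):
        if not entry.strip():
            continue  # tolerate a trailing ';'
        try:
            key, raw = entry.split(":", maxsplit=1)
            kwargs[key.strip()] = json.loads(raw)
        except ValueError as exc:
            # Covers both the failed unpacking and json.JSONDecodeError,
            # which is a ValueError subclass.
            raise CompressionFlagError(
                f"Invalid compression flags for {implementation}: "
                f"expected key:value pairs separated by ';', got {entry!r}"
            ) from exc
    return kwargs
```

Under these assumptions, `' -r 750'` produces a message that names the implementation and the offending entry instead of the bare unpacking error.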

Turns out pillow is just really quite bad at lower quality settings but cleans up with some better quality. To me it becomes acceptable at around 220:

recode_pdf -v --dpi 300 -J pillow \
  --fg-compression-flags 'quality_layers:[220]' \
  -I in.png --hocr-file in.hocr -o out-pillow-r220.pdf

So to address the first part of the initial issue, you could set these as the default fg flags if the user doesn't set otherwise, so users won't think it's broken like I did. 😅

pillow default: 👎
[image: pillow default]

pillow quality_layers:[220]: 👍
[image: pillow quality_layers:[220]]

kakadu default: 👍
[image: kakadu default]

@Redsandro
Contributor Author

Redsandro commented May 5, 2022

I wanted to do a simple PR for the help output, but I'm not sure how, so I've added a third checkbox to the initial issue instead.

Right now recode_pdf tells us:

Default for kakadu is '-slope 44250',default for grok/openjpeg is '-r 500'. Pass with quoted and with a space at the start:' --flag value'

There is no space at the start in the examples, and with pillow the space causes the error. I think this text is outdated.

Suggestion:

Defaults are kakadu: '-slope 44250'; grok/openjpeg: '-r 500'; pillow: 'quality_layers:[220]'. Pass with quotes.

@MerlijnWajer
Collaborator

Right, pillow flags aren't documented there and those should not start with a space. The thing with the space is that if you do something like --bg-compression-flags '--this-flag', then Python starts parsing --this-flag as a flag. That's why the quotes with a space are required - there's no easy way around that unfortunately.

Regarding the default pillow/openjpeg flags, could you compare the filesizes? My suspicion is that now the resulting PDFs will be quite a bit larger than the kakadu ones. I tried to have similar file sizes, rather than similar quality (which I agree might not have been the best idea).

@Redsandro
Contributor Author

if you do something like --bg-compression-flags '--this-flag', then Python starts parsing --this-flag as a flag. That's why the quotes with a space are required

Oh now I get it! It's exclusively for double dashes. That's why -slope and -r work fine and are documented without a space, which makes the instructions unclear for people who didn't know this (such as myself).

Regarding the default pillow/openjpeg flags, could you compare the filesizes?

Yes you are correct, 145kb kakadu size vs 210kb pillow size. I understand the rationale for targeting the same size. It's just that pillow doesn't perform acceptably at such low quality, so without usable defaults the user will always have to figure out how to change the default.

@MerlijnWajer
Collaborator

Yes you are correct, 145kb kakadu size vs 210kb pillow size. I understand the rationale for targeting the same size. It's just that pillow doesn't perform acceptably at such low quality, so without usable defaults the user will always have to figure out how to change the default.

Ok, that is fair enough, I guess that's a sensible reasoning. Reminds me again that maybe having some "compression profiles" makes sense, so like:

  • standard: where kakadu/pillow/openjpeg look the same, but do not have the same file sizes
  • kakadu-roi-standard: as above, but kakadu only, with roi
  • aggressive: really aggressive compression
  • quality: quality over compression (mostly)

And there could also be profiles for specific content, like:

  • books
  • comicbooks
  • scanned film material
  • etc
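One way the profile idea could be laid out is a table from profile name to per-encoder flags. The sketch below is purely illustrative: the `PROFILES` table, the profile name "default", and `flags_for` are hypothetical, and the only values filled in are the ones mentioned in this thread (the current kakadu and grok/openjpeg defaults and the suggested pillow setting). They are not tuned to look alike, and no values are given for the other proposed profiles.

```python
# Hypothetical profile table; only values quoted in this thread are filled in.
PROFILES = {
    "default": {
        "kakadu": "-slope 44250",
        "grok": "-r 500",
        "openjpeg": "-r 500",
        "pillow": "quality_layers:[220]",
    },
}


def flags_for(profile, encoder):
    """Look up the compression flags for one encoder within a profile."""
    try:
        return PROFILES[profile][encoder]
    except KeyError:
        known = ", ".join(sorted(PROFILES))
        raise ValueError(
            f"no flags for profile {profile!r} and encoder {encoder!r} "
            f"(known profiles: {known})"
        ) from None
```

Keeping the flags as plain strings means each encoder keeps its own syntax, at the cost of not being able to validate them in one place.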

@MerlijnWajer
Collaborator

Oh now I get it! It's exclusively for double dashes. That's why -slope and -r work fine and are documented without a space, which makes the instructions unclear for people who didn't know this (such as myself).

https://stackoverflow.com/questions/16174992/cant-get-argparse-to-read-quoted-string-with-dashes-in-it

Rereading this thread, I think the better solution is to use --bg-compression-flags='--foo'

I will test this, update the documentation, and remove the space stripping hack.

@MerlijnWajer
Collaborator

Yeah, that works:

recode_pdf --dpi 300 --bg-compression-flags='-q 25' --fg-compression-flags='-q 26' -J openjpeg -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-openjpeg.pdf
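The argparse behavior behind this can be reproduced in isolation. The following sketch uses a stand-in parser with a single option (it is not recode_pdf's real parser) and compares the three ways of passing a value that starts with dashes.

```python
import argparse


def build_parser():
    # Stand-in parser with just the one option under discussion.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--bg-compression-flags", default=None)
    return parser


def try_parse(argv):
    """Return the parsed flag value, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv).bg_compression_flags
    except SystemExit:
        # argparse prints a usage error and exits on invalid input.
        return None


# Value as a separate argument: '--foo' is mistaken for another option.
separate = try_parse(["--bg-compression-flags", "--foo"])
# The leading-space workaround: accepted, but the space ends up in the value.
leading_space = try_parse(["--bg-compression-flags", " --foo"])
# The '=' form: the value is attached to the option and taken as-is.
equals_form = try_parse(["--bg-compression-flags=--foo"])
```

Here `separate` is `None` (argparse reports "expected one argument"), `leading_space` is `" --foo"`, and `equals_form` is `"--foo"`, which is why the `=` form makes the space-stripping workaround unnecessary.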

@Redsandro
Contributor Author

Reminds me again that maybe having some "compression profiles" makes sense

Absolutely. It saves users a lot of time. When the h264 encoder got presets (fast, slow etc) and content profiles (grainy film, cartoon etc) it became a lot more pleasurable to use.

It takes a lot of effort to set up, though. So if the default makes sense, that's a good start. You may want to take time to hone presets and keep them undocumented until you are happy with the result.

@MerlijnWajer
Collaborator

I have created an issue for this feature request #48

If I add the flags that you recommend, then I think we can close this bug, right?

Maybe we should ask for help on the openjpeg mailing list - they might have some tips/advice.

@Redsandro
Contributor Author

I have created an issue for this feature request #48

If I add the flags that you recommend, then I think we can close this bug, right?

Here are my recommended tasks from the original issue. Feel free to close this issue.

Suggested actionables

(with "documentation" I actually meant recode_pdf --help)

@Redsandro
Contributor Author

@MerlijnWajer is it possible to re-recode PDFs that were done using pillow with kakadu? Not that it would increase quality, but the images are a lot bigger while at the same time so terrible that the only thing that saves them is the mask.

I think re-doing them with kakadu may remove half the filesize with minimal to no quality loss since the images are already so blurry.

@MerlijnWajer
Collaborator

You could try to render them to a page (combining the MRC into a normal page), and then recompressing them. I don't have a tool to do this exact thing, but it should not be too hard with pymupdf. Maybe mutool can just render the final pages to images, and then you can try to recompress them.

@Redsandro
Contributor Author

Redsandro commented May 26, 2022

Thank you for pointing me in the right direction. I think I should keep the mask as generated by recode_pdf, and just re-encode the pillow jp2 with kakadu externally. PyMuPDF can indeed do just that: PyMuPDF-Utilities/image-replacement. It's slightly more complicated than it sounds though. I need to read up on the xrefs to understand what the example is doing, or find a tool that automates the xref business so I can just take care of re-encoding all jp2 images, preferably including some way to distinguish between what recode_pdf intended to be foreground and background.
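As a possible first step for the xref bookkeeping, the sketch below lists the JPEG 2000 image streams on each page together with their soft-mask xref, using PyMuPDF's `Page.get_images(full=True)`. The helper names are hypothetical, and whether the presence of a soft mask reliably marks the foreground layer in recode_pdf output is an unverified assumption that would need checking against real files.

```python
def jpx_entries(image_tuples):
    """Keep only JPEG 2000 entries from PyMuPDF's full image listing.

    Each tuple is (xref, smask, width, height, bpc, colorspace,
    alt_colorspace, name, filter, referencer); JPEG 2000 streams
    carry the PDF filter name "JPXDecode".
    Returns a list of (xref, smask) pairs.
    """
    return [(t[0], t[1]) for t in image_tuples if t[8] == "JPXDecode"]


def list_jpx_images(pdf_path):
    """Map page number -> [(xref, smask), ...] for JPEG 2000 images."""
    import fitz  # PyMuPDF; imported here so jpx_entries has no dependency

    found = {}
    with fitz.open(pdf_path) as doc:
        for page in doc:
            found[page.number] = jpx_entries(page.get_images(full=True))
    return found
```

Each xref could then be fed to `doc.extract_image(xref)` to get the raw jp2 bytes for external re-encoding; writing the result back into the PDF is the part the linked image-replacement example covers.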

@mara004
Contributor

mara004 commented May 26, 2022

is it possible to re-recode PDFs that were done using pillow with kakadu? Not that it would increase quality, but the images are a lot bigger while at the same time so terrible that the only thing that saves them is the mask.

Provided you still have the original input data, wouldn't it be better to use that directly to avoid additional quality loss?

@Redsandro
Contributor Author

Provided you still have the original input data, wouldn't it be better to use that directly to avoid additional quality loss?

Absolutely, always, 100%.

But the original data is gone and so is the analog paper. It's just that the pillow images are so inconceivably bad compared to their file size, which is something like 150kb per page. I think those jp2 files can be re-compressed with kakadu at a third of their size with hardly any quality loss, because pillow turned them into blurry smudges.

Maybe the end result is 85 kb versus 150 kb but it adds up if you have scanned and destroyed a lot of material before realizing kakadu was not used by default when available.
