pdfcomp: new tool, discussion, compression questions #51
Moving from ocrmypdf/OCRmyPDF#541:
I think the I have started a new github action build for the version that uses kakadu here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2484980161 |
Unfortunately no change whatsoever in compression factor:
however
I'll raw copy the latest version. |
An improvement with the new raw copied version of compress-pdf-images
|
DjVu uses about 25 dpi for the foreground-picture: https://www.cs.tufts.edu/~nr/cs257/archive/leon-bottou/jei-1998.ps.gz |
I've run the bank statement through the fully open-source didjvu. The result has neither the faint text nor the strange artifact caused by bad foreground/background choices.
robert@robert-virtual-machine:~/didjvu$ ./didjvu encode ../bankstatement.tiff -d 600 --lossy -o bankstatementdi.djvu
The resulting numbers:
All BG44 slices together form the background picture. By default, the foreground picture has half the resolution of the background. When I reduce the foreground further, to only 1/4 of the background resolution, I see no visual difference in the result:
robert@robert-virtual-machine:~/didjvu$ ./didjvu encode ../bankstatement.tiff -d 600 --lossy --fg-subsample 12 -o bankstatementdifg12.djvu
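To make the resolution arithmetic in this comment concrete, here is a minimal Python sketch. It assumes a background subsample factor of 3 and a default foreground factor of 6; neither is stated in the thread, both are inferred from "half the resolution" combined with `--fg-subsample 12` giving 1/4. The helper names `layer_dpi` and `layer_pixels` are illustrative and not part of didjvu, and rounding layer dimensions up is also an assumption.

```python
def layer_dpi(scan_dpi: int, subsample: int) -> float:
    """Effective resolution of a layer stored at 1/subsample of the scan."""
    if subsample < 1:
        raise ValueError("subsample must be >= 1")
    return scan_dpi / subsample


def layer_pixels(width_px: int, height_px: int, subsample: int) -> int:
    """Pixel count of a subsampled layer, rounding each dimension up (assumed)."""
    if subsample < 1:
        raise ValueError("subsample must be >= 1")
    width = -(-width_px // subsample)   # ceiling division
    height = -(-height_px // subsample)
    return width * height


if __name__ == "__main__":
    scan_dpi = 600      # matches the -d 600 passed to didjvu above
    bg_subsample = 3    # assumption, inferred from the ratios in the comment
    print("background:", layer_dpi(scan_dpi, bg_subsample), "dpi")
    for fg_subsample in (6, 12):  # 6 is the assumed default, 12 is the value tried above
        ratio = bg_subsample / fg_subsample
        print(
            f"foreground at subsample {fg_subsample}:",
            layer_dpi(scan_dpi, fg_subsample), "dpi,",
            f"{ratio:.2f} of the background resolution",
        )
```

Because the pixel count falls with the square of the subsample factor, doubling the factor from 6 to 12 leaves roughly a quarter of the foreground pixels to encode.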
FORM:DJVU [45815]
These numbers translated back to JBIG2 and JPEG 2000 (via djvutoy, but probably straightforward to rewrite):
For some reason the 13k foreground-picture in the resulting PDF is still quite big (64k) for such a small picture. There must be room for improvement. |
Yeah, something seems wrong there. Can you share the changes you made to get to this point? Alternatively, you can just use
For my understanding, what is your end goal / use case specifically? To get DjVu-like compression in PDFs? |
I only added fg_downsample=12 inside compress-pdf-images close to fg_downsample=3
I think the most convenient goal for me would be to scan in paperwork that is handed to me or sent to me by mail, so that I can redistribute it and store it electronically without clogging a lot of mailboxes.
Usually it concerns letters with some message and some logo, and sometimes even an autograph.
DjVu shows how small you can go at what quality, so it shows how much room for improvement there is, as far as PDF permits.
There are cases where some AI would probably perform better, but DjVu has had quite a lot of development and fine-tuning in the past, so it could serve as an example as far as patents and copyright permit.
|
I just tried some bg_slope values, and 43000 results in this:
Where the bg image has a proportional size to the other 2 images and the logo on the left improves a lot. |
Right, there's room to toy with some of this. It depends on how well the binarisation algorithm works, if it knows about the DPI, and in general the specific content being compressed. The default settings for all archive.org books are:
This should mostly correspond to the current defaults of the tool. In some cases where we might want higher accuracy at the cost of compression, we use:
|
I saw the didjvu generate layers, that's definitely interesting for certain content. The hard part (for me) is automatically knowing what content I am dealing with -- which is why the parameters are generalised for all cases, which means they're not optimal for any particular use case, but don't need tweaking. |
Hi, I'm interested in the My interest is sparked by realizing that But, so, it would be nice to take PDFs with HOCR rendered/positioned by tesseract (perhaps using tesseract's As a side note, I think in some discussion of extracting the MRC functionality, perhaps over in the OCRMyPDF repo, there was some consideration of supporting MRC using alternate compression algorithm so |
Hi @jrochkind - I'll try to get back to you with some instructions on how to try it out. A few brief answers right now:
|
jbig2enc has to be compiled manually against the right version of Libleptonica, as there is no packaged version. As far as I can see, jbig2enc has been updated up to Libleptonica 1.83, as of January 9th:
agl/jbig2enc@ea05019
|
Thanks for the quick answer!
I was also curious -- if the pdf we are giving to (And as an aside not relevant for this ticket, but I wasn't sure where to ask it -- I'm curious if anyone has managed to get archive-pdf-tools installed on MacOS, or has any idea of whether that might even be feasible. I have had no luck, and was guessing that it's not intended for that and not feasible without a lot of work). |
As macOS is a kind of Unix, I would expect all components to be compilable, since all sources are available. But I don't know whether anyone has spent the effort to make it a smooth process, and as I possess no Mac or Hackintosh I can't try. As there are efforts to enable Linux on M1, there might be shorter virtualization routes. If you don't fear the size of the result, there is a way OCRMyPDF can keep a JPG and add OCR'ed text. Then you won't need MRC.
|
MacOS supports these via Homebrew: https://ocrmypdf.readthedocs.io/en/latest/jbig2.html |
Yep, jbig2 wasn't actually the problem on MacOS. I'll open a separate issue about that, just to keep track of it for any other interested parties, since it's really a separate thing, sorry for bringing it up here. |
I suppose somewhat, but the whole process is lossy anyway. The better the input image quality, the better the output will be. Feels a bit like garbage in, garbage out. I am not sure if there is a way to fix this.
I'm happy to try to help you get set up with this. We do build MacOS wheels, but I've personally never tested them. (I only use Linux). Maybe in a separate issue? |
The tool needs command line arguments much like recode_pdf (which we might want to rename), and those flags probably ought to be shared for the most part.
Let's also use this issue to discuss problems that people run into while testing pdfcomp now.
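One way the flag sharing suggested here could look is an `argparse` parent parser that both command line tools inherit from. This is only a sketch under stated assumptions: the flag names `--fg-downsample`, `--bg-downsample`, `--bg-slope` and `--hocr-file`, the `*-sketch` program names, and the helper functions are all hypothetical and are not taken from the actual recode_pdf or pdfcomp interfaces.

```python
import argparse


def shared_flags() -> argparse.ArgumentParser:
    """Compression flags that both tools could inherit (hypothetical names)."""
    parent = argparse.ArgumentParser(add_help=False)
    parent.add_argument("--fg-downsample", type=int, default=None,
                        help="foreground downsample factor (hypothetical flag)")
    parent.add_argument("--bg-downsample", type=int, default=None,
                        help="background downsample factor (hypothetical flag)")
    parent.add_argument("--bg-slope", type=int, default=None,
                        help="background compression slope (hypothetical flag)")
    return parent


def build_pdfcomp_like(shared: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Parser for a pdfcomp-style tool: input PDF in, compressed PDF out."""
    parser = argparse.ArgumentParser(prog="pdfcomp-sketch", parents=[shared])
    parser.add_argument("input_pdf")
    parser.add_argument("output_pdf")
    return parser


def build_recode_like(shared: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Parser for a recode_pdf-style tool with one tool-specific option."""
    parser = argparse.ArgumentParser(prog="recode-sketch", parents=[shared])
    parser.add_argument("--hocr-file", default=None,
                        help="hOCR input (hypothetical flag)")
    return parser


if __name__ == "__main__":
    shared = shared_flags()
    args = build_pdfcomp_like(shared).parse_args(
        ["in.pdf", "out.pdf", "--fg-downsample", "12"]
    )
    print(args)
```

Keeping the compression options in one parent parser means a new tuning flag only has to be declared once, and both tools accept it with the same spelling and default.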