Adding Differential Binarization model from PaddleOCR to Keras3 #1739
base: master
Conversation
Let's split this up. Start with ResNetVD backbone? Some notes...
@gowthamkpr is the PR ready for review?
Thanks for the PR! I have left a reorganization comment.
example for structuring the code - https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models/sam
Hi @gowthamkpr! Can you please refactor the code to KerasHub style?
Thanks Gowtham! left a few comments!
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py (resolved review thread)
        56,
        256,
    ),
    run_mixed_precision_check=False,
does the mixed precision check pass?
No. I tried adding an explicit `dtype` argument, but the problem remains that the mixed precision check runs against each sublayer of the model. The ResNet backbone, which is instantiated separately, therefore ends up with the wrong dtype.
keras_hub/src/models/differential_binarization/differential_binarization_test.py (resolved review threads)
keras_hub/src/models/differential_binarization/differential_binarization.py (resolved review threads)
Thanks for the PR Gowtham! Left a few comments. Can you please also add a demo colab in the PR description to verify the model is working before merging?
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py (resolved review threads)
@keras_hub_export("keras_hub.layers.DifferentialBinarizationImageConverter")
class DifferentialBinarizationImageConverter(ImageConverter):
    backbone_cls = DifferentialBinarizationBackbone
there should be some resizing/rescaling ops here right?
Depends. Basically these image operations are implemented in the superclass, `ImageConverter`, and can be used as shown in the demo colab I've added to the PR description. Dedicated code in this class might make sense for resizing to resolutions that are multiples of 32, which the model requires. On the other hand, it might be confusing for the user if the predicted masks have a different resolution than the input.
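Since the model requires input resolutions that are multiples of 32, here is a minimal sketch of the rounding involved (plain Python; `round_up_to_multiple` and `target_size` are hypothetical helper names, not KerasHub API):

```python
def round_up_to_multiple(value, multiple=32):
    # Smallest multiple of `multiple` that is >= `value` (ceiling division).
    return -(-value // multiple) * multiple


def target_size(height, width, multiple=32):
    # Pad both spatial dims up so a stride-32 backbone downsamples evenly.
    return (
        round_up_to_multiple(height, multiple),
        round_up_to_multiple(width, multiple),
    )


print(target_size(481, 640))  # (512, 640)
```

Resizing to the padded size (rather than cropping) is what an `ImageConverter`-style preprocessor would typically do; the open question in this thread is whether the predicted masks should then be resized back to the original resolution.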
You might want to look into SegFormer for this; the output masks will need to be resized as well.
@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
We need to add a new base class for OCR; I don't think ImageSegmenter is a good one. Do you have a specific reason you chose to subclass ImageSegmenter?
Actually, you suggested subclassing `ImageSegmenter` (here), if I understood correctly. Technically, the task is somewhat similar to segmentation tasks. We can of course add a separate base class for it to capture the semantic differences, but I would rather name it "scene text detection".
Yeah I think it is better to add a new base class for OCR.
Sure. I suggest creating an `ImageTextDetector` base class and including in it the code (from the notebook) for translating the segmentation mask output into polygons, which are often needed in such applications. I'll try to get rid of the OpenCV and Shapely dependencies in this code; I believe we don't have them in our requirements.
Does this work for you?
sounds good!
@gowthamkpr what's the high-level user journey we'd expect this to be used with? Is this eventually going to be part of an OCR pipeline? What would the full OCR pipeline code look like? E.g.
ocr = keras_hub.OCR.from_preset("something_paddle_paddle")
ocr.predict(image)  # returns text and polygons for text?
Or does this look like a two stage thing where we first go image -> polygons, and second go image and polygons -> text?
I think the thing we are missing is the high level look of the user journey we are trying to cover. Once we have that figured out, the rest of this will become a lot clearer.
I would suggest exposing this two-stage process to the user, i.e. image -> polygons and image+polygons -> text.
Differential Binarization is not able to perform the actual OCR, so we'll have to staple another model to it for the full pipeline. Additionally, it might be useful to have extra preprocessing steps in between, e.g. to rotate a region if a flat but tilted bounding box has been detected. This is too much detail to hide from the user, imho.
Maybe something like

# prediction
text_detector = keras_hub.OCR.from_preset("diffbin")
map_output = text_detector.predict(image)
# get polygons
polygons = text_detector.postprocess_to_polygons(map_output)
# OCR (not yet available)
ocr = keras_hub.OCR.from_preset("something")
for polygon in polygons:
    image_patch = keras_hub.utils.extract_patch(image, polygon)
    text = ocr.predict(image_patch)

Alternatively, we could combine `predict()` and `postprocess_to_polygons()` into something like

# prediction
text_detector = keras_hub.OCR.from_preset("diffbin")
polygons = text_detector.predict_polygons(image)
# OCR (not yet available)
ocr = keras_hub.OCR.from_preset("something")
for polygon in polygons:
    image_patch = keras_hub.utils.extract_patch(image, polygon)
    text = ocr.predict(image_patch)
Some rules for designing here.
- We should aim for simple usage that stays close to regular Keras abstractions.
- A task should define a type of input and output data for any model that we build for that task. The goal is standardization: swap the model you are using without changing the high-level code using the model. So in this case, using OCR for two different types of inputs and outputs is not good.
- Task class name is always `SomeModelSomeTask(SomeTask)`. So if the task is called `ImageTextExtractor` the subclass is `DiffBinImageTextExtractor`. If the task is `OCR` the subclass is `SomeModelOCR`. I don't think we need exceptions to this.
- We can bake preprocessing and postprocessing into the preprocessor layer, and leave it configurable there.
Questions.
- What's the format of a polygon list? Do we keep it as a padded tensor?
- What's the format of a patch? It's not rectangular, right? Is it an image plus a mask?
- What will training or fine-tuning look like? What's the format of the input data?
Seems like our main options are splitting detection and extraction, as you are doing so far...
# Inference.
detector = keras_hub.TextDetector.from_preset("diff_bin_small")
# Some form of masking here so these can stay square tensors?
polygons = detector.predict(images)
extractor = keras_hub.TextExtractor.from_preset("other_thing")
texts = extractor.predict({
    "images": images,
    "polygons": polygons,
})
# Fine-tuning.
# Load some data. Think this through! What is the format? Do we need masking?
images, polygons, texts = ...
detector.fit(x=images, y=polygons)
extractor.fit(x={"images": images, "polygons": polygons}, y=texts)
Or trying to bake this all into one.
# Inference.
ocr = keras_hub.OCR.from_preset("diff_bin_and_other_thing")
# Some form of masking here so these can stay square tensors?
texts, polygons = ocr.predict(images)
# Fine-tuning. All at once? Split?
I think the two stage process is fine if it's standard, but we might want to clean it up a bit.
Some rules for designing here.
- We should aim for simple usage that stays close to regular Keras abstractions.
- A task should define a type of input and output data for any model that we build for that task. The goal is standardization: swap the model you are using without changing the high-level code using the model. So in this case, using OCR for two different types of inputs and outputs is not good.
- Task class name is always `SomeModelSomeTask(SomeTask)`. So if the task is called `ImageTextExtractor` the subclass is `DiffBinImageTextExtractor`. If the task is `OCR` the subclass is `SomeModelOCR`. I don't think we need exceptions to this.
- We can bake preprocessing and postprocessing into the preprocessor layer, and leave it configurable there.
Changed.
* What's the format of a polygon list. Do we keep it as a padded tensor?
At the moment I return a list (batch) of lists (images) of lists (polygons) of 2D tuples (points in a polygon). Given this nested structure, and given that the number of vertices in a polygon might vary wildly, I think it makes sense to avoid forcing it into a tensor.
I'd have to check whether it would even be possible to rewrite the contour-finding algorithms to use only tensor operations. When using OpenCV as the contour finder (which I currently optionally allow), it of course isn't.
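Purely to illustrate the nested structure described above (all coordinates are made up for the example):

```python
# batch -> images -> polygons -> (x, y) vertex tuples
batch_polygons = [
    [  # image 0: a quad and a pentagon
        [(10, 12), (48, 12), (48, 30), (10, 30)],
        [(60, 5), (90, 8), (95, 26), (70, 31), (58, 20)],
    ],
    [  # image 1: a single quad
        [(0, 0), (31, 0), (31, 15), (0, 15)],
    ],
]

# Polygon counts vary per image, and vertex counts vary per polygon,
# which is why forcing this into a dense tensor is awkward.
polygons_per_image = [len(image) for image in batch_polygons]
print(polygons_per_image)  # [2, 1]
```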
* What's the format of a patch? It's not rectangular, right? Is it an image plus a mask?
The text recognition model will accept a rectangular patch that only contains the text region. Within this patch, we could possibly mask out regions that are outside the polygon.
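As a sketch of how such a rectangular patch could be derived from a polygon (hypothetical helper, plain Python; a real implementation would also clip to the image bounds):

```python
def polygon_bbox(polygon):
    # Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon
    # given as a list of (x, y) vertex tuples; this rectangle is the patch
    # a text recognition model would consume.
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


print(polygon_bbox([(10, 12), (48, 12), (48, 30), (10, 30)]))  # (10, 12, 48, 30)
```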
* What will training or fine-tuning look like? What's the format of the input data?
Currently the segmentation masks need to be provided as input for fine-tuning, just as the model outputs segmentation masks before postprocessing. Changing this to accept a polygon representation for training should be possible, but then I'd prefer to override `train_step` to do the conversion. I'm not sure how well this can be performed in the computation graph, and I'm also not sure how sophisticated Keras' support for ragged tensors is for representing this nested structure of points.
Seems like our main options are splitting detection and extraction as you are doing so far:

# Inference.
detector = keras_hub.TextDetector.from_preset("diff_bin_small")
# Some form of masking here so these can stay square tensors?
polygons = detector.predict(images)
extractor = keras_hub.TextExtractor.from_preset("other_thing")
texts = extractor.predict({
    "images": images,
    "polygons": polygons,
})
# Fine-tuning.
# Load some data. Think this through! What is the format? Do we need masking?
images, polygons, texts = ...
detector.fit(x=images, y=polygons)
extractor.fit(x={"images": images, "polygons": polygons}, y=texts)

Or trying to bake this all into one:

# Inference.
ocr = keras_hub.OCR.from_preset("diff_bin_and_other_thing")
# Some form of masking here so these can stay square tensors?
texts, polygons = ocr.predict(images)
# Fine-tuning. All at once? Split?
As a user, I'd have a strong preference for the split approach. Both model types are pretty complex, and I'd like to see what comes out of the first one, possibly identifying problems, before passing it on to the second.
Mostly questions and some style stuff.
I am curious what to do with the task here. Does this output a segmentation mask? Maybe most importantly, does this fit into a bigger picture of an OCR system? If so, how do we expect the whole thing to work?
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py (resolved review thread)
keras_hub/src/models/differential_binarization/differential_binarization_ocr.py (resolved review thread)
@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
What's the output of the task? Seems like this is not quite OCR right? Not text output?
Is this just a piece of what we would need for a full OCR setup?
    backbone=backbone
)

detector(input_data)
What does the output of the task look like?
We get the probability map, threshold map, and binary map in the last dimension of the model output. I've added some documentation here.
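For reference, the binary map in the DB paper is an approximate (differentiable) binarization of the probability map P against the threshold map T: B = 1 / (1 + exp(-k(P - T))), with amplification factor k = 50. A per-pixel scalar sketch, not the KerasHub implementation:

```python
import math


def approximate_binarization(prob, thresh, k=50.0):
    # Differentiable step function from the DB paper: saturates toward 1
    # where prob > thresh and toward 0 where prob < thresh.
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


print(approximate_binarization(0.9, 0.3))  # close to 1.0
print(approximate_binarization(0.1, 0.3))  # close to 0.0
```

In the model this is applied elementwise to the probability and threshold map tensors, which is why all three maps share the output's last dimension.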
keras_hub/src/models/differential_binarization/differential_binarization_presets.py (resolved review thread)
`ImageTextDetector` tasks wrap a `keras_hub.models.Task` and
a `keras_hub.models.Preprocessor` to create a model that can be used for
image segmentation.
this is false right? This task will detect polygons in an image?
Yes, sure. Fixed.
can be used to load a pre-trained config and weights.

Args:
    detection_thresh: The value for thresholding predicted mask outputs.
Are these general concepts that will apply to other models? We should only leave common stuff here. If this would not apply to a non-diff bin model, we should leave it out.
It's pretty typical for such models to output a mask, which then needs to be translated into polygon form, but there are also models that follow different approaches. So as you prefer - we can move it into `diffbin_textdetector.py` or even create a separate file for the postprocessing.
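As a minimal sketch of what `detection_thresh` does before contour finding (nested lists stand in for tensors; not the actual KerasHub implementation):

```python
def binarize_mask(prob_map, detection_thresh=0.3):
    # Threshold a 2D probability map into a 0/1 mask; the connected regions
    # of 1s are what a contour finder then turns into polygons.
    return [
        [1 if p >= detection_thresh else 0 for p in row]
        for row in prob_map
    ]


print(binarize_mask([[0.1, 0.9], [0.4, 0.2]]))  # [[0, 1], [1, 0]]
```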
    )
    return config

def postprocess_to_polygons(self, masks, contour_finder="simple"):
We might want to consider rolling this into the preprocessing layer. That is what we do for generation.
Not sure what you mean here. Could you please elaborate or provide pointers for how it is implemented for generation?
@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
@gowthamkpr if you have addressed the comments can you please respond or resolve - so we can merge this PR
@keras_hub_export("keras_hub.models.DiffBinPreprocessor")
class DiffBinPreprocessor(ImageSegmenterPreprocessor):
If this model is going to return polygons, we might need to change the parent class.
Yes. The functionality that we require here is pretty similar, though (resize the output mask in addition to the model's output). So basically, we'd have to copy `ImageSegmenterPreprocessor`.
This adds the Differentiable Binarization model for scene text detection.
I implemented the architecture based on ResNet50_vd from PaddleOCR and ported the weights.
Demo colab: https://colab.research.google.com/gist/gowthamkpr/bd4a7f7742e92e66cfc57827052b8619/keras_paddleocr_v3.ipynb