
Adding Differential Binarization model from PaddleOCR to Keras3 #1739

Status: Open. Wants to merge 33 commits into base: master.

Conversation

@gowthamkpr (Collaborator) commented Aug 6, 2024:

This adds the Differentiable Binarization model for scene text detection.

I implemented the architecture based on ResNet50_vd from PaddleOCR and ported the weights.

Demo colab: https://colab.research.google.com/gist/gowthamkpr/bd4a7f7742e92e66cfc57827052b8619/keras_paddleocr_v3.ipynb

@mattdangerw changed the base branch from master to keras-hub on August 6, 2024, 17:36.
@mattdangerw (Member):

Let's split this up. Start with ResNetVD backbone?

Some notes...

  • Remove the aliases. One ResNetVDBackbone can handle all of these with different presets (see the sketch after this list).
  • Conversion scripts as scripts, not colabs.
  • Follow the local style for backbones as closely as possible. See some comments on Add VGG16 and VGG19 backbone #1737.
  • Keep models in a flat directory. No backbones/xx etc.
  • Add some tests.
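
Roughly, the idea (the preset name here is just a placeholder, not a published one):

import keras_hub

# One backbone class covers all variants via presets, so no per-variant
# alias classes are needed.
backbone = keras_hub.models.ResNetVDBackbone.from_preset("resnet50_vd")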

@divyashreepathihalli (Collaborator):

@gowthamkpr is the PR ready for review?

@divyashreepathihalli (Collaborator) left a comment:

Thanks for the PR! I have left a reorganization comment.

example for structuring the code - https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models/sam

keras_nlp/src/models/diffbin/diffbin.py (2 outdated review threads, resolved)
keras_nlp/src/models/diffbin/losses.py (outdated review thread, resolved)
@gowthamkpr changed the base branch from keras-hub to master on October 22, 2024, 20:24.
@divyashreepathihalli (Collaborator):

Hi @gowthamkpr! Can you please refactor the code to KerasHub style?

  • Add a preprocessor flow
  • Subclass the image segmenter model for the task class
  • Add a preset class
  • Add standard test routines

@divyashreepathihalli (Collaborator) left a comment:

Thanks Gowtham! Left a few comments!

56,
256,
),
run_mixed_precision_check=False,
Collaborator:

Does the mixed precision check pass?

Collaborator (Author):

No. I tried adding an explicit dtype argument, but the problem remains: the mixed precision check verifies each sublayer of the model, and the ResNet backbone, which is instantiated separately, therefore has the wrong dtype. A minimal illustration follows.
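
Minimal illustration with generic Keras layers (not this PR's code):

import keras

# A layer created before the mixed-precision policy is set keeps its
# original dtype policy, so a per-sublayer check flags it.
backbone = keras.layers.Dense(8)
keras.mixed_precision.set_global_policy("mixed_float16")
head = keras.layers.Dense(2)
print(backbone.dtype_policy.name)  # float32
print(head.dtype_policy.name)  # mixed_float16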

@divyashreepathihalli (Collaborator) left a comment:

Thanks for the PR, Gowtham! Left a few comments. Can you please also add a demo colab to the PR description to verify the model is working before merging?


@keras_hub_export("keras_hub.layers.DifferentialBinarizationImageConverter")
class DifferentialBinarizationImageConverter(ImageConverter):
backbone_cls = DifferentialBinarizationBackbone
Collaborator:

There should be some resizing/rescaling ops here, right?

Collaborator (Author):

It depends. These image operations are implemented in the superclass, ImageConverter, and can be used as shown in the demo colab I've added to the PR description. Dedicated code in this class might make sense for resizing to resolutions that are multiples of 32, which the model requires; see the sketch below. On the other hand, it might be confusing for the user if the predicted masks have a different resolution than the input.
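
For illustration, a helper along these lines could bring inputs to the required resolution (the helper is hypothetical, and zero-padding is just one option besides resizing):

import math

import numpy as np

def pad_to_multiple_of_32(image):
    # Zero-pad height and width up to the next multiple of 32, as the
    # model requires; predicted maps then come back at the padded size.
    h, w = image.shape[0], image.shape[1]
    new_h = math.ceil(h / 32) * 32
    new_w = math.ceil(w / 32) * 32
    padded = np.zeros((new_h, new_w) + image.shape[2:], dtype=image.dtype)
    padded[:h, :w] = image
    return padded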

Collaborator:

You might want to look into Segformer for this. The output masks will need to be resized as well.



@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
Collaborator:

We need to add a new base class for OCR; I don't think ImageSegmenter is a good one. Do you have a specific reason you chose to subclass ImageSegmenter?

Collaborator (Author):

Actually, you suggested subclassing ImageSegmenter (here), if I understood correctly. Technically, the task is somewhat similar to segmentation tasks. We can of course add a separate base class for it to capture the semantic differences, but I would rather name it "scene text detection".

Collaborator:

Yeah, I think it is better to add a new base class for OCR.

Collaborator (Author):

Sure. I suggest creating an ImageTextDetector base class and including in it the code (from the notebook) for translating the segmentation mask output into polygons, which are often needed in such applications; a rough skeleton is sketched below. I'll try to get rid of the OpenCV and Shapely dependencies in this code, since I believe we don't have them in our requirements.
Does this work for you?
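
Rough skeleton (everything beyond the class name is up for discussion):

import keras_hub

class ImageTextDetector(keras_hub.models.Task):
    """Base class for scene text detection tasks."""

    def postprocess_to_polygons(self, masks):
        # Translate predicted score maps into per-image lists of
        # polygons (lists of (x, y) points), without OpenCV/Shapely.
        raise NotImplementedError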

Collaborator:

Sounds good!

Member:

@gowthamkpr what's the high-level user journey we'd expect this to be used with? Is this eventually going to be part of an OCR pipeline? What would the full OCR pipeline code look like? E.g.

ocr = keras_hub.OCR.from_preset("something_paddle_paddle")
ocr.predict(image) # returns text and polygons for text?

Or does this look like a two-stage thing where we first go image -> polygons, and second go image and polygons -> text?

I think the thing we are missing is the high level look of the user journey we are trying to cover. Once we have that figured out, the rest of this will become a lot clearer.

@gowthamkpr (Collaborator, Author) commented Dec 3, 2024:

I would suggest exposing this two-stage process to the user, i.e. image -> polygons and image + polygons -> text.

Differential Binarization is not able to perform the actual OCR, so we'll have to staple another model to it for the full pipeline. Additionally, it might be useful to have additional preprocessing steps in between, e.g. to rotate a region if a flat but tilted bounding box has been detected. This is too much detail to hide from the user, imho.

Maybe something like

# prediction
text_detector = keras_hub.OCR.from_preset("diffbin")
map_output = text_detector.predict(image)
# get polygons
polygons = text_detector.postprocess_to_polygons(map_output)

# OCR (not yet available)
ocr = keras_hub.OCR.from_preset("something")
for polygon in polygons:
  image_patch = keras_hub.utils.extract_patch(image, polygon)
  text = ocr.predict(image_patch)

Alternatively, we could combine predict() and postprocess_to_polygons() into something like

# prediction
text_detector = keras_hub.OCR.from_preset("diffbin")
polygons = text_detector.predict_polygons(image)

# OCR (not yet available)
ocr = keras_hub.OCR.from_preset("something")
for polygon in polygons:
  image_patch = keras_hub.utils.extract_patch(image, polygon)
  text = ocr.predict(image_patch)

Member:

Some rules for designing here.

  • We should aim for simple usage that stays close to regular Keras abstractions.
  • A task should define a type of input and output data for any model that we build for that task. The goal is standardization: swap the model you are using without changing the high-level code that uses the model. So in this case, using OCR for two different types of inputs and outputs is not good.
  • Task class name is always SomeModelSomeTask(SomeTask). So if the task is called ImageTextExtractor the subclass is DiffBinImageTextExtractor. If the task is OCR the subclass is SomeModelOCR. I don't think we need exceptions to this.
  • We can bake preprocessing and postprocessing into the preprocessor layer, and leave it configurable there.

Questions.

  • What's the format of a polygon list? Do we keep it as a padded tensor?
  • What's the format of a patch? It's not rectangular, right? Is it an image plus a mask?
  • What will training or fine-tuning look like? What's the format of the input data?

Seems like our main options are: splitting detection and extraction, as you are doing so far...

# Inference.
detector = keras_hub.TextDetector.from_preset("diff_bin_small")
# Some form of masking here so these can stay square tensors?
polygons = detector.predict(images)
extractor = keras_hub.TextExtractor.from_preset("other_thing")
texts = extractor.predict({
    "images": images,
    "polygons": polygons,
})

# Fine-tuning.
# load some data. think this through! what is the format? do we need masking?
images, polygons, texts = ...
detector.fit(x=images, y=polygons)
extractor.fit(x={"images": images, "polygons": polygons}, y=texts)

Or trying to bake this all into one.

# Inference.
ocr = keras_hub.OCR.from_preset("diff_bin_and_other_thing")
# Some form of masking here so these can stay square tensors?
texts, polygons = ocr.predict(images)

# Fine-tuning. All at once? Split?

Member:

I think the two-stage process is fine if it's standard, but we might want to clean it up a bit.

@gowthamkpr (Collaborator, Author) commented Jan 27, 2025:

Some rules for designing here.

* We should aim for simple usage that stays close to regular Keras abstractions.

* A task should define a type of input and output data for any model that we build for that task. The goal is standardization: swap the model you are using without changing the high-level code that uses the model. So in this case, using OCR for two different types of inputs and outputs is not good.

* Task class name is always `SomeModelSomeTask(SomeTask)`. So if the task is called `ImageTextExtractor` the subclass is `DiffBinImageTextExtractor`. If the task is `OCR` the subclass is `SomeModelOCR`. I don't think we need exceptions to this.

Changed.

* What's the format of a polygon list? Do we keep it as a padded tensor?

At the moment I return a list (batch) of lists (image) of lists (polygon) of 2D tuples (points in a polygon); see the example below. Considering this nested structure, and considering that the number of vertices in a polygon might vary wildly, I think it makes sense to avoid forcing this structure into a tensor.

I'd have to check whether it would even be possible to rewrite the contour-finding algorithms to use only tensor operations. When using OpenCV as the contour finder (which I currently allow as an option), it of course isn't.
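
For concreteness, the returned structure looks like this (coordinates are made up):

polygons = [  # batch
    [  # image 0: two detected text regions
        [(10, 12), (54, 12), (54, 30), (10, 30)],  # 4-point polygon
        [(70, 40), (95, 38), (98, 60), (66, 52)],  # another polygon
    ],
    [],  # image 1: no text detected
]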

* What's the format of a patch? It's not rectangular, right? Is it an image plus a mask?

The text recognition model will accept a rectangular patch that only contains the text region. Within this patch, we could possibly mask out regions that are outside the polygon.

* What will training or fine-tuning look like? What's the format of the input data?

Currently the segmentation masks need to be provided as input for fine-tuning, just as the model outputs segmentation masks before postprocessing. Changing this to accept a polygon representation for training should be possible, but then I'd prefer to override train_step to do the conversion; a rough sketch follows. I'm not sure how well this can be performed in the computation graph, and I'm also not sure how sophisticated Keras' support for ragged tensors is for representing this nested structure of points.
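
Roughly (polygons_to_masks is a hypothetical helper that rasterizes the polygons; it is not part of this PR):

class DiffBinImageTextDetector(ImageTextDetector):
    def train_step(self, data):
        images, polygons = data
        # Rasterize the polygon targets into the segmentation masks the
        # loss expects. Whether this conversion can run inside the
        # compiled graph is the open question mentioned above.
        masks = polygons_to_masks(polygons, mask_shape=images.shape[1:3])
        return super().train_step((images, masks))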

Seems like our main options are: splitting detection and extraction, as you are doing so far... Or trying to bake this all into one.

As a user, I'd have a strong preference for the split approach. Both model types are pretty complex, and when using them I'd like to see what comes out of the first one, and possibly identify problems, before passing the results on to the second model.

@mattdangerw (Member) left a comment:

Mostly questions and some style stuff.

I am curious what to do with the task here. Does this output a segmentation mask? Maybe most importantly, does this fit into a bigger picture of an OCR system? If so, how do we expect the whole thing to work?



@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
Member:

What's the output of the task? Seems like this is not quite OCR, right? Not text output?

Is this just a piece of what we would need for a full OCR setup?

backbone=backbone
)

detector(input_data)
Member:

What does the output of the task look like?

Collaborator (Author):

The model outputs the probability map, threshold map, and binary map, stacked along the last dimension; an illustrative snippet follows. I've added some documentation here.
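
Illustratively (the channel order shown is an assumption; the documentation is authoritative):

outputs = detector(input_data)  # shape: (batch, height, width, 3)
probability_map = outputs[..., 0]
threshold_map = outputs[..., 1]
binary_map = outputs[..., 2]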

keras_hub/src/models/differential_binarization/losses.py (3 outdated review threads, resolved)
@divyashreepathihalli added the kokoro:force-run (Runs Tests on GPU) label on Dec 10, 2024.
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Dec 10, 2024.
@gowthamkpr added the kokoro:force-run (Runs Tests on GPU) label on Dec 17, 2024.
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Dec 17, 2024.

`ImageTextDetector` tasks wrap a `keras_hub.models.Task` and
a `keras_hub.models.Preprocessor` to create a model that can be used for
image segmentation.
Member:

This is false, right? This task will detect polygons in an image?

@gowthamkpr (Collaborator, Author) commented Jan 27, 2025:

Yes, sure. Fixed.

can be used to load a pre-trained config and weights.

Args:
detection_thresh: The value for thresholding predicted mask outputs.
Member:

Are these general concepts that will apply to other models? We should only leave common stuff here. If this would not apply to a non-diff bin model, we should leave it out.

Collaborator (Author):

It's pretty typical for such models to output a mask, which then needs to be translated into polygon form, but there are also models that follow different approaches. So, as you prefer: we can move it into diffbin_textdetector.py or even create a separate file for the postprocessing.

)
return config

def postprocess_to_polygons(self, masks, contour_finder="simple"):
Member:

We might want to consider rolling this into the preprocessing layer. That is what we do for generation.

Collaborator (Author):

Not sure what you mean here. Could you please elaborate or provide pointers for how it is implemented for generation?




@divyashreepathihalli added the kokoro:force-run (Runs Tests on GPU) label on Jan 22, 2025.
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Jan 22, 2025.
@divyashreepathihalli (Collaborator) left a comment:

@gowthamkpr if you have addressed the comments, can you please respond or resolve them, so we can merge this PR?



@keras_hub_export("keras_hub.models.DiffBinPreprocessor")
class DiffBinPreprocessor(ImageSegmenterPreprocessor):
Collaborator:

If this model is going to be returning polygons, we might need to change the parent class.

Collaborator (Author):

Yes. The functionality we require here is pretty similar, though (resizing the target masks in addition to the input images). So basically, we'd have to copy ImageSegmenterPreprocessor; a rough sketch follows.
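
Sketch (the class name is hypothetical; it mirrors the resize-images-and-masks behavior described above):

import keras
import keras_hub

class ImageTextDetectorPreprocessor(keras_hub.models.Preprocessor):
    def __init__(self, image_converter=None, **kwargs):
        super().__init__(**kwargs)
        self.image_converter = image_converter

    def call(self, x, y=None, sample_weight=None):
        if self.image_converter is not None:
            x = self.image_converter(x)
            # y (the target masks) would be resized here to match x.
        return keras.utils.pack_x_y_sample_weight(x, y, sample_weight)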
