OCR dataset results #46

Open
cagey-squirrel opened this issue Jan 7, 2025 · 5 comments

Comments

@cagey-squirrel
Contributor

cagey-squirrel commented Jan 7, 2025

Could you please share the results of using OCR text-extraction approaches on the datasets shared on HF? I am having trouble with PPOCR and would like to replicate the results with text for RAG.

What would work best for me is the output of calling eval on ChartQA, InfoVQA, MP-DocVQA and SlideVQA datasets but with their text content being used for RAG instead of images (model used and OCR method does not matter too much, would prefer the one with best results).

@tcy6
Collaborator

tcy6 commented Jan 11, 2025

@cagey-squirrel

Could you please share the results of using OCR text-extraction approaches on the datasets shared on HF? I am having trouble with PPOCR and would like to replicate the results with text for RAG.

I’m very sorry, but we couldn’t find the OCR results after conducting the experiments.

What would work best for me is the output of calling eval on ChartQA, InfoVQA, MP-DocVQA and SlideVQA datasets but with their text content being used for RAG instead of images (model used and OCR method does not matter too much, would prefer the one with best results).

Could you please explain this in more detail? I’m having trouble understanding it.

@cagey-squirrel
Contributor Author

Hi, thanks for the response.

When the eval.sh script is called, it produces embeddings.corpus, embeddings.query, test_result.log and test.*.trec files.
It would be great if you had the test.*.trec files that are generated after running eval.sh on the OCR versions of the ChartQA, InfoVQA, MP-DocVQA and SlideVQA datasets.
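
For anyone comparing such files once they are available, here is a minimal sketch of reading a run file. It assumes the files follow the standard six-column TREC run format (`query_id Q0 doc_id rank score run_tag`); this thread does not confirm that eval.sh writes exactly this layout, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse lines assumed to be in the standard TREC run format:
    query_id Q0 doc_id rank score run_tag

    Returns {query_id: [(doc_id, score), ...]}, sorted by descending score.
    """
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, _rank, score, _tag = parts
        run[qid].append((doc_id, float(score)))
    for qid in run:
        run[qid].sort(key=lambda pair: pair[1], reverse=True)
    return dict(run)


# Hypothetical sample rows, not real results from the repository.
sample = [
    "q1 Q0 doc_a 1 0.92 ocr_run",
    "q1 Q0 doc_b 2 0.35 ocr_run",
    "q2 Q0 doc_c 1 0.80 ocr_run",
]
run = parse_trec_run(sample)
```

To read an actual file, pass an open file handle instead of the sample list, since iterating over a handle yields one line at a time.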

@tcy6
Collaborator

tcy6 commented Jan 11, 2025

Hi, thanks for the response.

When the eval.sh script is called, it produces embeddings.corpus, embeddings.query, test_result.log and test.*.trec files. It would be great if you had the test.*.trec files that are generated after running eval.sh on the OCR versions of the ChartQA, InfoVQA, MP-DocVQA and SlideVQA datasets.

Let me try to find it~
BTW, which OCR are you referring to?

@cagey-squirrel
Contributor Author

Not too important; if there are multiple, then the one with the best results.
Also, I wanted to know how you partitioned the text into chunks for RAG with OCR. Did you extract text page by page, or did you group the chunks differently?

@tcy6
Collaborator

tcy6 commented Jan 12, 2025

Not too important; if there are multiple, then the one with the best results. Also, I wanted to know how you partitioned the text into chunks for RAG with OCR. Did you extract text page by page, or did you group the chunks differently?

We extract text page by page.
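
For readers trying to reproduce this, page-level chunking could look roughly like the sketch below, where each page's OCR text becomes one retrieval unit. This is only an illustration under assumptions: the function name, the ID scheme, and the whitespace normalization are hypothetical and are not taken from the repository.

```python
def pages_to_corpus(doc_id, page_texts):
    """Turn per-page OCR text into one corpus entry per page.

    Pages whose OCR output is empty are dropped; the ID scheme
    "<doc_id>_page<N>" is a hypothetical convention for illustration.
    """
    corpus = []
    for page_number, text in enumerate(page_texts, start=1):
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if not cleaned:
            continue
        corpus.append({"id": f"{doc_id}_page{page_number}", "text": cleaned})
    return corpus


# Hypothetical OCR output for a three-page document (page 2 is blank).
ocr_pages = ["Revenue  by quarter\nQ1 10  Q2 12", "", "Summary of findings"]
corpus = pages_to_corpus("report", ocr_pages)
```

Keeping the page number in the ID means a retrieved text chunk can be mapped back to the page image it came from.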
