text-extraction

Here are 273 public repositories matching this topic...

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated Feb 8, 2025
Python

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated May 16, 2024
Python

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

golang pdf signing text-extraction pdf-generator pdf-generation pdf-reader pdf-manipulation pdf-library pdf-document-processor pdf-compression pdf-sign pdf-reports

Updated Jan 26, 2025
Go

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated Apr 14, 2024
Python

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

machine-learning awesome ocr deep-learning text-extraction text-recognition deep-learning-algorithms convolutional-neural-networks text-detection scene-texts

Updated Sep 17, 2023

miso-belica / jusText

Heuristic based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated May 9, 2024
Python

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf text-extraction pdf-files pdf-invoice unidoc pdf-library

Updated May 23, 2019
Go

ICIJ / datashare

A self-hosted search engine for documents.

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated Feb 13, 2025
Java

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

r text-extraction rstats pdf-files r-package poppler pdf-format poppler-library pdftools

Updated Feb 7, 2025
C++

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Mar 19, 2024
Python

shixzie / nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

nlp go golang natural-language-processing parse text text-extraction

Updated Sep 18, 2017
Go

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping news-crawler commoncrawl web-corpus news-scraping cc-news

Updated Feb 13, 2025
Python

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

python pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model extract-text parsr pd3f

Updated Oct 13, 2023
HTML

py-pdf / benchmarks

Benchmarking PDF libraries

pdf benchmark text-extraction mupdf data-extraction pypdf2 poppler-utils

Updated Oct 31, 2023
Python

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

text-extraction pdf-parser document-parser pdf-to-markdown

Updated Feb 8, 2025
Python

bookieio / breadability

A reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

python text-mining text-extraction html-parsing html-extraction html-extractor

Updated May 9, 2024
HTML

weareprestatech / hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

python pdf text-extraction text-search

Updated Dec 15, 2024
Python

Goldziher / kreuzberg

A text extraction library supporting PDFs, images, office documents and more

pdf ocr text-extraction asyncio docx

Updated Feb 12, 2025
Python

SapienzaNLP / extend

Entity Disambiguation as text extraction (ACL 2022)

nlp natural-language-processing acl pytorch text-extraction entity-linking entity-disambiguation entity-disambiguation-models acl2022

Updated Apr 17, 2022
Python

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

pdf ocr aws-lambda lambda-functions tesseract text-extraction searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python
