https://www.reddit.com/r/textdatamining/

https://en.m.wikipedia.org/wiki/Binary-to-text_encoding#Base58

https://lobste.rs/s/7ttwt8/aho_corasick_string_search

https://blog.floydhub.com/language-translator/

http://jalammar.github.io/illustrated-transformer/

https://explained.ai/decision-tree-viz/index.html

https://www.zverovich.net/2021/06/16/safe-formatting-api.html

https://news.ycombinator.com/item?id=27536447

https://mewo2.com/notes/markov-history/

https://news.ycombinator.com/item?id=28658297

https://github.com/apankrat/notes/tree/master/fast-case-conversion

https://news.ycombinator.com/item?id=28854808

Search

https://github.com/pyjarrett/septum Context-based code search tool, Ada

Boyer-Moore Fast String Searching Algorithm

https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

https://news.ycombinator.com/item?id=26910982

https://yurichev.com/news/20210421_boyer_moore/

https://news.ycombinator.com/item?id=26900640

https://www.linuxjournal.com/article/6652 How to Index Anything

https://github.com/valeriansaliou/sonic

https://news.ycombinator.com/item?id=33315237

https://blog.sqlitecloud.io/real-time-full-text-site-search-with-sqlite-fts5-extension

https://news.ycombinator.com/item?id=35975355

https://neuml.github.io/txtai/workflow/

UTF-8

https://blog.pgdp.net/2021/06/01/cha%e1%b9%9b%e1%be%80%cf%82t%ce%adr-%e2%99%ad%e1%bf%a7ilding-character-building/

https://nullprogram.com/blog/2017/10/06/

Unicode

http://cldr.unicode.org/

http://tapiov.net/unicodetiles.js/

Crissov/unicode-proposals#410

https://news.ycombinator.com/item?id=26900749

https://github.com/qntm/base65536

https://news.ycombinator.com/item?id=14468818

https://rolisz.com/the-best-text-classification-library-for-a-quick-baseline/

https://news.ycombinator.com/item?id=27583185

https://devlog.hexops.com/2021/unicode-sorting-why-browsers-added-special-emoji-matching

https://baturin.org/blog/life-before-unicode/ ru

https://zig.news/dude_the_builder/unicode-string-operations-536e

https://heistak.github.io/your-code-displays-japanese-wrong/

https://news.ycombinator.com/item?id=29022906

https://gregtatum.com/writing/2021/diacritical-marks/

https://lobste.rs/s/jkay7p/diacritical_marks_unicode

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

https://blog.unicode.org/2022/09/announcing-unicode-standard-version-150.

https://shapecatcher.com/

https://mcilloni.ovh/2023/07/23/unicode-is-hard/

https://news.ycombinator.com/item?id=36865287

ASCII

https://bestasciitable.com/

https://news.ycombinator.com/item?id=34399598

http://www.figlet.org/fontdb.cgi

https://queue.acm.org/detail.cfm?id=1871406 To move forward with programming languages we need to break free from the tyranny of ASCII.

https://news.ycombinator.com/item?id=27649431

http://www.network-science.de/ascii/

https://news.ycombinator.com/item?id=28736997

https://blog.asciinema.org/post/smaller-faster/

https://news.ycombinator.com/item?id=29387761

https://madned.substack.com/p/ascii-double-murder

https://news.ycombinator.com/item?id=35004503

https://blogs.oracle.com/mysql/mysql%3a-character-sets%2c-unicode%2c-and-uca-compliant-collations

https://codewords.recurse.com/issues/seven/data-driven-literary-analysis

Encoding

https://datatracker.ietf.org/doc/draft-faltstrom-base45/

https://news.ycombinator.com/item?id=27603173

https://kunststube.net/encoding/

https://news.ycombinator.com/item?id=30384223

Generator

https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/

https://news.ycombinator.com/item?id=27443528

https://github.com/google/jax

Summary

https://github.com/gregdurrett/berkeley-doc-summarizer

https://news.ycombinator.com/item?id=27637902

https://github.com/LIAAD/yake

https://medium.com/besedo-engineering/text-summarization-part-2-state-of-the-art-ae900e2ac55f

https://labs.kagi.com/ai/sum

https://news.ycombinator.com/item?id=34646389

https://news.ycombinator.com/item?id=36470297

Spell corrector

https://norvig.com/spell-correct.html

https://news.ycombinator.com/item?id=28551468

Editor

https://twitter.com/dm_0ney/status/1414742742530498566

https://news.ycombinator.com/item?id=27926758

https://code.visualstudio.com/blogs/2021/09/29/bracket-pair-colorization

https://news.ycombinator.com/item?id=28692470

Wrap

https://www.ctrl.blog/entry/text-wrap-balance.html

https://news.ycombinator.com/item?id=28887008

News

https://adi.earth/apps/duplex/

https://news.ycombinator.com/item?id=42540397 Latin

https://news.ycombinator.com/item?id=41797271

https://www.bibtex.com/e/entry-types/

https://eggcorns.lascribe.net/

https://news.ycombinator.com/item?id=40720548

https://news.ycombinator.com/item?id=40530719

https://news.ycombinator.com/item?id=40254384

https://news.ycombinator.com/item?id=39614816

https://news.ycombinator.com/item?id=38427343

https://www.embopress.org/doi/full/10.15252/msb.202211325

https://news.ycombinator.com/item?id=37216702

https://learn.microsoft.com/en-us/windows/powertoys/text-extractor

https://saeedesmaili.com/demystifying-text-data-with-the-unstructured-python-library/

https://news.ycombinator.com/item?id=36616799

https://evanhahn.com/utf-21/

https://news.ycombinator.com/item?id=36269343

https://ionathan.ch/2023/06/06/angarr.html

https://news.ycombinator.com/item?id=36369553

https://www.oilshell.org/blog/2023/06/surrogate-pair.html

https://thephd.dev/cuneicode-and-the-future-of-text-in-c

https://news.ycombinator.com/item?id=36224893

https://stephenramsay.net/posts/groff-mom.html

https://news.ycombinator.com/item?id=35971338

https://www.stefanjudis.com/today-i-learned/how-to-split-javascript-strings-with-intl-segmenter/

https://news.ycombinator.com/item?id=35650699

https://buttondown.email/hillelwayne/archive/tag-systems/

https://news.ycombinator.com/item?id=35597934

https://blog.adacore.com/introduction-to-vss-library

https://github.com/pop-os/cosmic-text

https://news.ycombinator.com/item?id=35004705

https://github.com/neuml/paperetl

https://inventlikeanowner.com/blog/the-story-behind-asins-amazon-standard-identification-numbers/

https://news.ycombinator.com/item?id=34501344

https://www.calligrapher.ai/

https://news.ycombinator.com/item?id=34530011

https://rhodesmill.org/brandon/2012/one-sentence-per-line/

https://news.ycombinator.com/item?id=34438665

https://github.com/christianvoigt/argdown

https://news.ycombinator.com/item?id=34428680

https://github.com/open-taggy

https://news.ycombinator.com/item?id=34454713

https://www.openstenoproject.org/plover/ steno

https://news.ycombinator.com/item?id=34298063

https://www.linode.com/docs/guides/differences-between-grep-sed-awk/

https://news.ycombinator.com/item?id=34280281

https://lemire.me/blog/2022/12/30/quickly-checking-that-a-string-belongs-to-a-small-set/

https://news.ycombinator.com/item?id=34184627

https://raphlinus.github.io/text/2020/10/26/text-layout.html

https://news.ycombinator.com/item?id=34173290

https://en.wikipedia.org/wiki/Overlapping_markup

https://news.ycombinator.com/item?id=33951613

https://daniel.haxx.se/blog/2022/12/06/faster-base64-in-curl/

https://news.ycombinator.com/item?id=33877374

https://news.ycombinator.com/item?id=33767301

https://resoomer.com/

https://libs.suckless.org/libgrapheme/

https://arxiv.org/abs/2211.05166 Grammatical Error Correction: A Survey of the State of the Art

https://raphlinus.github.io/text/2022/11/08/minikin.html

https://github.com/qntm/base2048 twitter

https://github.com/kohlschutter/boilerpipe

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=ThQGwioAAAAJ&sortby=pubdate&citation_for_view=ThQGwioAAAAJ:u-x6o8ySG0sC

https://en.wikipedia.org/wiki/Cistercian_numerals

https://logseq.com/?

https://news.ycombinator.com/item?id=33218561

https://omniglot.com/conscripts/fakoo.htm

https://news.ycombinator.com/item?id=33092239

https://blog.unicode.org/2022/09/announcing-icu4x-10.html

https://twitter.com/jonty/status/1571615998335123457

https://news.ycombinator.com/item?id=32896989

https://github.com/bartp5/libtexprintf

https://lwn.net/Articles/908032/

https://news.ycombinator.com/item?id=32842207

https://kinzler.com/me/align/

https://github.com/simdutf/simdutf

https://news.ycombinator.com/item?id=32700315

https://benhoyt.com/writings/count-words/

https://news.ycombinator.com/item?id=32214419

https://languagetool.org/en/dev Open-source Grammarly alternative

https://news.ycombinator.com/item?id=32236608

https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

https://news.ycombinator.com/item?id=31858311

https://www.gnu.org/software/recutils/

https://news.ycombinator.com/item?id=31832564

https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/

https://news.ycombinator.com/item?id=31793143

https://news.ycombinator.com/item?id=31779260 Ask HN: I created a news shortening algorithm and am not sure how to utilize it

https://google-research.github.io/self-organising-systems/2022/diff-fsm/

https://news.ycombinator.com/item?id=31663702

https://dl.acm.org/doi/pdf/10.1145/3152823 FontCode: Embedding Information in Text Documents Using Glyph Perturbation

https://github.com/birchb1024/frangipanni test2tree

https://news.ycombinator.com/item?id=26622548

https://lemire.me/blog/2022/04/05/string-representations-are-not-unique-learn-to-normalize/

https://www.mcclimon.org/blog/writing-text-with-flag-emojis/

https://github.com/wolfgarbe/SymSpell Spelling correction & Fuzzy search

https://serhack.me/articles/unveiling-anonymous-author-stylometry-techniques/

https://news.ycombinator.com/item?id=30571932

https://www.norvig.com/spell-correct.html

https://news.ycombinator.com/item?id=30575416

https://blog.opensyllabus.org/about-the-open-syllabus-project/

https://github.com/neuml/txtai

https://github.com/larrykollar/Unix-Text-Processing

https://news.ycombinator.com/item?id=30396667

https://www.revk.uk/2022/02/crlf-has-long-history.html

https://news.ycombinator.com/item?id=30253968

https://arxiv.org/abs/2202.00848 Some Reflections on Drawing Causal Inference using Textual Data: Parallels Between Human Subjects and Organized Texts

https://drewdevault.com/2022/01/28/Implementing-mime-in-xxxx.html

https://github.com/Uzay-G/espial/blob/main/ARCHITECTURE.md

https://cendyne.dev/posts/2022-01-23-base64.html

https://davidamos.dev/why-cant-you-reverse-a-flag-emoji/

https://news.ycombinator.com/item?id=30104292

https://www.wired.com/story/kingdom-of-characters-jing-tsu-china-language-information/

https://news.ycombinator.com/item?id=30086441

https://quickwit.io/blog/quickwit-0.2/

https://news.ycombinator.com/item?id=29904607

https://blog.adamchalmers.com/nom-chars/

https://news.ycombinator.com/item?id=29897328

https://newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python

https://troff.org/

http://transcultura.org/?q=node%2F8

https://opus.nlpl.eu/

https://news.ycombinator.com/item?id=28179877

https://www.carolemieux.com/arvada_ase21.pdf Learning Highly Recursive Input Grammars

http://defoe.sourceforge.net/folio/knuth-plass.html

https://news.ycombinator.com/item?id=28537923

https://users.cecs.anu.edu.au/~Peter.Christen/publications/tr-cs-06-02.pdf TR-CS-06-02 A Comparison of Personal Name Matching: Techniques and Practical Issues

https://github.com/minimaxir/big-list-of-naughty-strings

https://web.stanford.edu/~jurafsky/slp3/ Speech and Language Processing

https://news.ycombinator.com/item?id=28891230