Text normalization for reading Wikipedia articles #283
xenotropic
started this conversation in General
Replies: 1 comment
-
Pretty sure ChatGPT or GPT-3.5 can do text normalization for you. I'd look into that.
-
So I ran a few Wikipedia articles through Tortoise, just to see how it does, and because reading articles aloud seems like a real problem domain where Tortoise might be used. You can listen to the outputs here, which is a podcast RSS feed. It does quite well, but the main gap seems to be a need for "text normalization": turning "300m" into "300 meters" and "$1.2 billion" into "one point two billion dollars". Listening to Tortoise read "65 kn (120 km/h; 75 mph)" in the Hurricane article is quite disorienting!
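To make concrete what the simpler end of this looks like, here is a minimal rule-based sketch covering just the unit and currency patterns mentioned above. The function name, the regexes, and the tiny unit table are all my own illustration, not from any library; a real normalizer needs far broader coverage (dates, ranges, ordinals, abbreviations, and so on).

```python
import re

# Spoken forms for a handful of abbreviated units (illustrative only).
UNIT_WORDS = {
    "m": "meters",
    "km/h": "kilometers per hour",
    "mph": "miles per hour",
    "kn": "knots",
}

def normalize_units(text: str) -> str:
    # "$1.2 billion" -> "1.2 billion dollars"
    text = re.sub(r"\$([\d.]+)\s+(billion|million|trillion)",
                  r"\1 \2 dollars", text)
    # "300m", "120 km/h", "75 mph", "65 kn" -> number + spelled-out unit.
    # Longer unit tokens come first so "km/h" is not split as "km" + "h".
    def unit_sub(match: re.Match) -> str:
        return f"{match.group(1)} {UNIT_WORDS[match.group(2)]}"
    return re.sub(r"(\d+(?:\.\d+)?)\s*(km/h|mph|kn|m)\b", unit_sub, text)

print(normalize_units("65 kn (120 km/h; 75 mph)"))
# -> 65 knots (120 kilometers per hour; 75 miles per hour)
```

Rules like these are cheap and predictable but brittle, which is exactly why a trained normalizer is attractive for whole articles.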
NVIDIA's NeMo seems to be the main open-source option I'm seeing around GitHub, although it is slow, has a lot of dependencies, and on a few test articles it produced at least some errors (rendering "kmh" as "Kilometers per H", in particular). Before I dive into learning it, I want to ask: is anyone aware of other candidates worth looking at? There was a big Google Kaggle competition on text normalization a few years ago, but at a quick look the entries seem optimized for the contest itself rather than adapted for ready application to real-world use. Thanks -- Joe
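The "one point two" reading of "1.2" is the other half of the job: verbalizing the number itself. A digit-wise sketch (my own illustration, not any library's API) shows the idea:

```python
# Hypothetical digit-wise verbalizer for decimal numbers, the kind of
# rule a TTS normalizer applies to turn "1.2" into "one point two".
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six", "7": "seven",
               "8": "eight", "9": "nine", ".": "point"}

def spell_decimal(number: str) -> str:
    # Reads each character individually. Fine for decimal fractions,
    # but whole numbers need real verbalization ("120" should become
    # "one hundred twenty", not "one two zero") -- a library such as
    # num2words handles that case.
    return " ".join(DIGIT_WORDS[ch] for ch in number)

print(spell_decimal("1.2"))  # -> one point two
```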