I'm writing because I'm trying to learn about Zipformers, and I really like the way they've performed in my testing so far. At some point I'd like to fine-tune a model for healthcare (once I understand them better), specifically something that could be used for dictation without internet access (more specifically, for emergency or disaster relief scenarios). I've been trying different models, and so far the one that performs far and away the best for me is csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-21. But there are a few that appear to use newer architectures (like csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26). When I compare them, 06-21 gets a WER of around 12%, while 06-26 tends to run over 20%. My dataset is audio from medical lectures, medical podcasts, and some actual doctor-patient interactions (from publicly available datasets). I was just curious about the difference in accuracy: could it come from how the models were trained, or from something in the architecture itself? It looked to me like 06-21 was trained on LibriSpeech and GigaSpeech, while 06-26 used only LibriSpeech. Just curious if you have any ideas, or if anyone else has run across this before.
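For context, here is roughly how I'm scoring each model per clip. This is a minimal sketch using the sherpa-onnx Python API plus jiwer for WER; the model file names, wav path, and reference transcript below are placeholders for my actual test data, and I'm assuming 16 kHz mono wavs:

```python
import numpy as np
import soundfile as sf
import jiwer
import sherpa_onnx

# Load a streaming transducer model (paths are placeholders; point them
# at the files from the downloaded model directory, e.g. 06-21 or 06-26).
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="tokens.txt",
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    num_threads=2,
    sample_rate=16000,
    feature_dim=80,
)

# Assumes a 16 kHz mono wav; soundfile returns float32 samples in [-1, 1].
samples, sample_rate = sf.read("test-clip.wav", dtype="float32")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
# Tail padding so the model flushes its final frames before we stop.
stream.accept_waveform(sample_rate, np.zeros(int(0.66 * sample_rate), dtype=np.float32))
stream.input_finished()

while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

hypothesis = recognizer.get_result(stream)
reference = "the reference transcript for this clip"  # placeholder
print("hyp:", hypothesis)
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```

I normalize both reference and hypothesis (lowercase, strip punctuation) before scoring, and average WER across the whole set, so the 12% vs. 20%+ numbers above are apples-to-apples between the two models.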