Write tool which can convert translated files back to PO #30
Comments
I'd be happy to take a look at this! It would be a separate bin crate then?
Hejsa! Thanks for looking at this! I was thinking of using some of the Rust book translations to start with, e.g., pick one here which is somewhat current and which you can read 🙂 The Rust Embedded book is also translated into a few languages, which is where the idea came from.

You might be able to reuse some of the logic in

let msgids = ["msgid_1", "msgid_2", "msgid_1"];
let msgstrs = ["msgstr_1", "msgstr_2", "msgstr_1"];

Finally, it zips those lists and outputs the

One feature your tool could have would be to (attempt to) synchronize the translations on various points. So if the source text is

# The Foo Project
Welcome to the Foo projects...
## Getting Started
To install Foo, do ...

and the translated document is

# Foo-projektet
Velkommen til Foo-projektet...
## Komme godt igang
For at installere Foo, ...

then the conversion tool could align on the

Similarly, lists and block quotes could perhaps also be synchronization points. This is just an idea, I'm not sure it's useful 😄
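The heading-based synchronization idea could be sketched roughly as follows. This is only an illustration under simplifying assumptions: it parses line by line instead of using the crate's real Markdown handling, it recognizes only ATX headings (`#` to `######`) and does not skip fenced code blocks, and the names `Section`, `split_sections`, and `align_by_headings` are made up for this example. Sections are paired by position when the heading levels match, and blocks are paired only within sections whose block counts agree, so a mismatch in one section does not spill into the next.

```rust
/// A heading (or the text before the first heading) plus the paragraphs below it.
#[derive(Debug, PartialEq)]
struct Section {
    level: usize,        // 0 means "before the first heading"
    blocks: Vec<String>, // heading text first, then paragraphs
}

/// Returns 1..=6 for an ATX heading line, 0 otherwise (naive check).
fn heading_level(line: &str) -> usize {
    let hashes = line.chars().take_while(|&c| c == '#').count();
    if (1..=6).contains(&hashes) && line[hashes..].starts_with(' ') {
        hashes
    } else {
        0
    }
}

/// Split a document into sections; headings and blank lines end a paragraph.
fn split_sections(doc: &str) -> Vec<Section> {
    let mut sections = vec![Section { level: 0, blocks: Vec::new() }];
    let mut para = String::new();
    // The trailing empty line flushes the last paragraph.
    for line in doc.lines().chain(std::iter::once("")) {
        let level = heading_level(line);
        if level > 0 || line.trim().is_empty() {
            if !para.is_empty() {
                sections.last_mut().unwrap().blocks.push(std::mem::take(&mut para));
            }
            if level > 0 {
                let title = line[level..].trim().to_string();
                sections.push(Section { level, blocks: vec![title] });
            }
        } else {
            if !para.is_empty() {
                para.push(' ');
            }
            para.push_str(line.trim());
        }
    }
    sections.retain(|s| !s.blocks.is_empty());
    sections
}

/// Pair blocks section by section; sections with differing block counts are skipped.
fn align_by_headings(source: &str, translation: &str) -> Vec<(String, String)> {
    let (a, b) = (split_sections(source), split_sections(translation));
    let mut pairs = Vec::new();
    // Give up entirely if the heading skeletons differ.
    if a.len() != b.len() || a.iter().zip(&b).any(|(x, y)| x.level != y.level) {
        return pairs;
    }
    for (x, y) in a.iter().zip(&b) {
        if x.blocks.len() == y.blocks.len() {
            pairs.extend(x.blocks.iter().cloned().zip(y.blocks.iter().cloned()));
        }
    }
    pairs
}
```

With the two example documents above, this pairs "The Foo Project" with "Foo-projektet" and "Getting Started" with "Komme godt igang". If the translation had an extra paragraph under the first heading, only that section would be dropped.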
I missed this part, I think it could just be a new binary next to the others to start with. If people like the tool but somehow don't want to see |
Yes, that's what I mean by "separate bin crate". I agree that we might do a separate "package" later on if needed. Additionally, I'm a bit held up at the moment with other work, but I'll keep working on this whenever I have the time.
From that link, how do I get the actual translated .md file? It seems that all that's available is either .po or .html files. |
Hejsa!
I see, great!
That's completely fine, I know the feeling 😄
Clicking through to the Swedish translation (to keep a Scandinavian theme: I'm from Denmark and I guess you're from Sweden) leads me to https://github.com/sebras/book/tree/master/src where I see a bunch of translated files. So I would use these two files as a starting point:
I'm not sure if the idea actually works... will 10% of the paragraphs line up, or 80%? Just from skimming those two files, it looks like a lot of text lines up very nicely. |
Hejsan! I've (finally) started looking into this. The code required for reading, parsing and extracting is pretty straightforward. The hard part is the syncing. Having looked at a couple of different translations of the book, they can differ a lot, a little, or not at all. Some examples:

ch00-00-introduction.en.md: 45
ch01-02-hello-world.en.md: 45
ch02-00-guessing-game-tutorial.en.md: 177

Trying to line them up might work in really simple scenarios, where the structure is still intact. But in the PT case above, the EN version is really far ahead.

The main problem I see is the lack of context in the MD files. If a single heading (or other sync point of choice) is left out in either file, the remaining message pairs will be garbage. Even with a small difference like 42 vs 45, there is no way to tell if the diff was at message 1 or 40.

So what is "good enough" here? Should we fail as soon as something appears to be off? Or greedily try to do what we can? What's the threshold? Strict or lax? I need some guidance here.
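To make the strict-versus-lax question concrete, here is a minimal sketch of what such a policy switch could look like. The `Policy` enum and `pair_messages` function are hypothetical names for this example, not part of the crate, and the inputs are assumed to be messages that have already been extracted from each file. Note that the lax mode only bounds the count difference; it cannot tell where the documents diverged, so pairs after that point may still be wrong.

```rust
/// How tolerant the pairing step is to differing message counts (hypothetical).
enum Policy {
    /// Fail unless both files yield the same number of messages.
    Strict,
    /// Accept up to `max_diff` surplus messages and pair the common prefix.
    Lax { max_diff: usize },
}

/// Pair messages by position, subject to the chosen policy.
fn pair_messages<'a>(
    source: &[&'a str],
    translation: &[&'a str],
    policy: &Policy,
) -> Result<Vec<(&'a str, &'a str)>, String> {
    let diff = source.len().abs_diff(translation.len());
    let allowed = match policy {
        Policy::Strict => 0,
        Policy::Lax { max_diff } => *max_diff,
    };
    if diff > allowed {
        return Err(format!(
            "message counts differ by {diff} ({} vs {})",
            source.len(),
            translation.len()
        ));
    }
    // `zip` stops at the shorter list, so surplus messages are dropped.
    Ok(source.iter().copied().zip(translation.iter().copied()).collect())
}
```

With 45 source messages and 42 translated ones, `Policy::Strict` would reject the file, while `Policy::Lax { max_diff: 3 }` would emit 42 pairs of unknown quality.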
That is a very good point and I don't think there is a clear right answer here. Our When we normalize each mdbook-i18n-helpers/i18n-helpers/src/normalize.rs Lines 203 to 220 in 2b14491
If there are extra messages in the
You mention a "sync point"... which is something we haven't looked into before! For the use case of generating a
and perhaps others? I haven't thought this through, but I could imagine using a longest common subsequence or diffing algorithm to find common parts of the Markdown AST between the two documents. The flow would be something like this:
At the end of the day, one document will be the source. So regardless of what the translated document contains, the source will win: if there are 10 paragraphs in the source document, then there should also be 10 paragraphs in the translation. This is kind of a built-in limitation of the way we attempt to chop up the source document when we do translations.
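As a rough illustration of the longest-common-subsequence idea, the sketch below aligns two documents by the kind of each block only. The `Kind` enum is a stand-in invented for this example (a real tool would derive it from the Markdown parser's events), and `lcs_pairs` is a textbook dynamic-programming LCS, not code from this repository.

```rust
/// Simplified block kinds; an assumption standing in for real Markdown AST nodes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Heading(u8),
    Paragraph,
    ListItem,
    CodeBlock,
}

/// Returns index pairs (i, j) where source[i] is matched with translation[j]
/// along one longest common subsequence of the two kind sequences.
fn lcs_pairs(source: &[Kind], translation: &[Kind]) -> Vec<(usize, usize)> {
    let (n, m) = (source.len(), translation.len());
    // len[i][j] = LCS length of source[i..] and translation[j..].
    let mut len = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            len[i][j] = if source[i] == translation[j] {
                len[i + 1][j + 1] + 1
            } else {
                len[i + 1][j].max(len[i][j + 1])
            };
        }
    }
    // Walk the table from the start to recover one optimal matching.
    let (mut i, mut j) = (0, 0);
    let mut pairs = Vec::new();
    while i < n && j < m {
        if source[i] == translation[j] {
            pairs.push((i, j));
            i += 1;
            j += 1;
        } else if len[i + 1][j] >= len[i][j + 1] {
            i += 1; // skip an unmatched source block
        } else {
            j += 1; // skip an unmatched translation block
        }
    }
    pairs
}
```

Kinds alone cannot say which of two adjacent paragraphs is the extra one, so headings and code blocks act as the real anchors here. Since the source wins, any source block left unmatched would simply end up with an empty translation.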
Nope, I would not expect this — everything in this library should be completely in-process and not spawn anything. It only does text to text transformations. The |
Hi @memark, let me unassign this for now — we can assign it back to you or someone else later when activity picks up. |
This idea is from rust-embedded/book#326: we should write a converter tool which takes two Markdown files as input and outputs a PO file.
More concretely, the tool should take an en/foo.md and xx/foo.md file and output an xx.po file. The tool will call extract_messages on both files and line up the results. It will use the messages from en/foo.md as the msgid and the corresponding message from xx/foo.md as the msgstr.
The output is marked fuzzy to ensure that a human translator double-checks it all before publication.
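The output step described here could look roughly like the sketch below. It assumes the messages have already been extracted and lined up into (msgid, msgstr) pairs; `po_escape` and `to_fuzzy_po` are hypothetical helpers written with the standard library only, and a real tool would more likely use a PO-writing crate. Repeated msgids are emitted once, since gettext tools reject duplicate entries in a catalog.

```rust
use std::collections::HashSet;

/// Escape a message for use inside a double-quoted PO string.
fn po_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\t' => out.push_str("\\t"),
            _ => out.push(c),
        }
    }
    out
}

/// Render lined-up (msgid, msgstr) pairs as PO entries, each flagged fuzzy.
fn to_fuzzy_po(pairs: &[(&str, &str)]) -> String {
    let mut seen = HashSet::new();
    let mut po = String::new();
    for &(msgid, msgstr) in pairs {
        // Keep only the first occurrence of each msgid.
        if !seen.insert(msgid) {
            continue;
        }
        po.push_str("#, fuzzy\n");
        po.push_str(&format!("msgid \"{}\"\n", po_escape(msgid)));
        po.push_str(&format!("msgstr \"{}\"\n\n", po_escape(msgstr)));
    }
    po
}
```

The `#, fuzzy` flag is what signals to translators and PO editors that each entry still needs review.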