How to write machine-translation-ready source content
Machine translation (MT) has become an indispensable tool in many domains and for many kinds of text: from technical documentation to user-generated content, from websites to e-commerce, from tourism to software. If you’re still debating whether MT is the right solution for you, we recommend you read “A Machine Translation Checklist for New Users.”
The quality of an MT engine’s raw output, and of the final result after post-editing, depends on various factors, one of which is the quality of the source content. For this reason, pre-editing a text is as important as post-editing it.
Writing for MT: Back to Basics
Since the earliest days of MT, companies, governmental bodies and agencies using rule-based MT (RbMT) learnt that it was essential to write content in a way that was fully translatable and, most importantly, machine-translatable. This was the case, for example, with IBM’s user documentation and the weather reports from the Canadian METEO System.
Content adaptation for rule-based machine translation brought about the development of the first controlled languages, like Caterpillar Fundamental English and Simplified Technical English.
You don’t need to develop and implement your own controlled language to achieve controlled authoring, though. You can just follow basic content strategies and writing rules to attain the three main elements of content adaptation for MT: plainness, consistency and conciseness.
The combination of plainness, consistency and conciseness translates into a few easy-to-follow rules that are still valid today:
- Keep sentences short and simple, i.e. limit each sentence to one clause and avoid conjunctions.
- Use the active voice.
- Avoid colloquial phrases and idioms: machine translation engines have difficulty translating them, and their meaning can be hard to grasp or inappropriate for international readers. Disambiguation is also an as yet unresolved problem in MT systems.
- Be direct and use words consistently, i.e. avoid synonyms and jargon, and don’t be afraid of repetition. A glossary that also contains short recurring sentences can prove very useful here, for example for safety warnings, which are made up of short sentences. For software localization, you might also want to include strings (as the first Microsoft glossary did).
- Always check your text for spelling mistakes and grammar errors and, where possible, run a readability test.
For examples and more information, we recommend reading the article by Uwe Muegge on CLOUT (Controlled Language Optimized for Uniform Translation).
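Two of the rules listed above, short sentences and no conjunctions, are mechanical enough to check automatically. The following is a minimal sketch in Python, not a real controlled-language checker: the 20-word limit, the conjunction list and the function names are all assumptions chosen for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical threshold and word list, chosen for illustration only.
MAX_WORDS = 20
CONJUNCTIONS = {"and", "but", "or", "so", "yet", "because", "although", "while"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_sentence(sentence, max_words=MAX_WORDS):
    """Return a list of warnings for a single sentence."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    warnings = []
    if len(words) > max_words:
        warnings.append(f"too long ({len(words)} words)")
    found = sorted(CONJUNCTIONS.intersection(words))
    if found:
        warnings.append("conjunction(s): " + ", ".join(found))
    return warnings


def lint(text, max_words=MAX_WORDS):
    """Map each flagged sentence to its list of warnings."""
    report = {}
    for sentence in split_sentences(text):
        warnings = check_sentence(sentence, max_words)
        if warnings:
            report[sentence] = warnings
    return report


if __name__ == "__main__":
    sample = "Press the red button. The pump stops and the valve closes."
    for sentence, warnings in lint(sample).items():
        print(sentence, "->", "; ".join(warnings))
```

A check like this only flags candidates for rewriting. Rules such as using the active voice or avoiding idioms need real linguistic analysis and are left out of the sketch.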
Neural Machine Translation: The Need for Structure and Context
With the advent of neural machine translation (NMT), context has become king. NMT engines translate entire sentences and can therefore handle long sentences (some experts say you can use up to 60 words), although consistent terminology is still a hurdle.
The main difference between a human brain and a machine translation engine is that the human brain can guess. Mistakes like misspellings or incorrect punctuation will not prevent a human reader from understanding a text, but they can still trouble an MT engine, however sophisticated.
Therefore, writing for NMT means taking all the rules defined in the early days, when RbMT was in use, to the extreme. Most importantly, to be machine-ready a text needs to be intrinsically coherent and consistent.
A coherent text is a text where every paragraph makes sense and is clear and easy for the reader to understand. Consistency, on the other hand, depends on the writing style.
If you use words that vary in spelling but have the same meaning (center/centre, color/colour) interchangeably, that isn’t consistent. If you use some short, choppy paragraphs and then move into longer, more elegant text, that’s not consistent either.
Consistency of information is also important: a car can’t be brown in one paragraph and green in the next. This matters all the more because NMT engines recognize patterns: the intrinsic coherence of a text contributes to the quality of the output.
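Mixed spelling variants of the kind mentioned above (center/centre, color/colour) are easy to detect automatically. Here is a minimal sketch in Python, assuming you maintain your own list of variant pairs; the pair list and the function name are hypothetical and only meant as an illustration.

```python
import re
from collections import Counter

# Hypothetical variant pairs for illustration; a real list would come from
# your own style guide or glossary.
VARIANT_PAIRS = [("center", "centre"), ("color", "colour")]


def find_mixed_variants(text, pairs=VARIANT_PAIRS):
    """Return (first, second, count_first, count_second) for every pair
    whose two spellings both occur in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    mixed = []
    for first, second in pairs:
        if counts[first] and counts[second]:
            mixed.append((first, second, counts[first], counts[second]))
    return mixed


if __name__ == "__main__":
    sample = "The color chart shows each colour. The center is fixed."
    print(find_mixed_variants(sample))
```

The sketch matches whole words only, so inflected forms such as "colors" are not counted. Extending it would mean adding those forms to the pair list or normalizing words first.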
Useful Tools to Produce MT-Ready Texts
There are many tools available for controlled authoring, depending on your language needs.
Besides the more traditional tools, like spellcheckers, grammar checkers and glossaries, Smart Compose is worth a try. It is an AI-powered feature available, for example, in the business version of Gmail and Google Docs. It offers personalized suggestions tailored to the user’s writing style. If integrated into a translation platform, it could help reuse resources like translation memories and glossaries.
One final element to keep in mind is the readability index, which is an estimate of how difficult a text is to read. The index is usually calculated by measuring attributes such as word length, sentence length, syllable count, and so on. An academic text will be more complex than, let’s say, a user manual. On grade-level indices, where a higher score means a harder text, an academic text will therefore score higher than more general content, like movie reviews or articles on sports. (On scales such as Flesch Reading Ease the direction is reversed: a higher score means an easier text.)
By the way, the readability score for this article is 44.7. Check it yourself.
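To show how such an index is put together from sentence length and syllable counts, here is a minimal sketch of one widely used formula, Flesch Reading Ease. The article does not say which index produced the score quoted above, so the choice of formula is an assumption. The syllable counter is a rough vowel-group heuristic, so results will differ somewhat from those of dedicated tools. On this scale, higher scores mean easier text.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (minimum 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


if __name__ == "__main__":
    simple = "The cat sat on the mat. The dog ran."
    print(round(flesch_reading_ease(simple), 1))
```

Short sentences made of short words push the score up, which is why the writing rules for MT and a good readability score tend to go hand in hand.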
In conclusion, writing for machine translation means, in the words of William Zinsser, writing well.