I promise this is the last one for today ... there is so much going on
and to be written these days ... crazy.
Well have a read - tomorrow the second part of this will hopefully be
ready.
Cheers, Sabine
*****
Translating Wikipedia articles ...
... into less resourced languages. Well, the time has come to start
thinking about how to create contents faster for the many small
Wikipedias. As you all know, often we have just a handful of people
creating, translating and then adapting articles. Well ... by combining
various Open Source and Open Content projects we can now take a further
step in the direction of fast contents creation, but that does not
mean: stub upload. This is a completely different way of doing things.
Apertium is a machine translation tool that works really well with
similar languages. About a year ago I had a translation from Spanish
to Catalan done by Apertium through the online interface
(http://xixona.dlsi.ua.es/apertium/) and asked some people of the
Catalan Wikipedia to have a look at it. They told me that of course it
was not perfect, but that it would be easy to proofread it and much
faster than actually translating it. In March I made a similar test
during a master's course in translation studies in Pisa. I asked one of
the students, who was bilingual in Spanish and Catalan, to have a look
at the outcome of the machine translation of a general text. The grammar
was almost perfect, and so was the terminology. There were just 5
corrections in a bit more than half a page (A4).
Now what does this mean for us: if we have a bilingual wordlist for two
similar languages under a free license, we can pass it on to the
Apertium people. From there we are a step closer to getting machine
translation for that specific language combination on its way.
One note in between for the Apertium people who might read this: please
don't mind me not using specific terminology to describe what needs to
be done. It could become too techy.
So the next step is to identify what each term is and how it needs to
be handled. For example, a verb needs to be declared as a verb and then
given a tag that indicates which conjugation scheme applies to it. This
needs doing for all word types, that is verbs, nouns, adjectives etc.
After that, grammar rules need to be considered. Step by step the
correctness level will improve. The wordlists will be available as a
Google Docs spreadsheet, and the time invested in completing them and
adding all the additional information will save a lot of time later.
That is: right now it takes longer, but once the engine has "learnt" how
to deal with the terminology and grammar for that specific language
combination, creating contents will become much faster. This will help
the small projects because the few editors can concentrate on
proofreading and adapting, and contents of quite high quality will grow
faster.
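To make the annotation step a bit more concrete, here is a minimal
sketch in Python of what checking such a wordlist could look like. The
four-column layout, the word type names, the scheme tags such as
"conj-ar" and the function names are all assumptions made up for
illustration. This is not the format Apertium itself uses, just a way to
see which rows of a spreadsheet export still need work.

```python
from dataclasses import dataclass

# Hypothetical set of word types; the real list would be longer.
WORD_TYPES = {"verb", "noun", "adjective"}


@dataclass(frozen=True)
class Entry:
    source: str     # term in the source language
    target: str     # term in the target language
    word_type: str  # e.g. "verb"
    scheme: str     # hypothetical tag naming the inflection scheme


def parse_row(row: str) -> Entry:
    """Turn one tab-separated row into an Entry, or raise ValueError."""
    fields = [field.strip() for field in row.split("\t")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    source, target, word_type, scheme = fields
    if not source or not target:
        raise ValueError("source and target term are both required")
    if word_type not in WORD_TYPES:
        raise ValueError(f"unknown word type: {word_type!r}")
    if not scheme:
        raise ValueError("missing inflection scheme tag")
    return Entry(source, target, word_type, scheme)


def rows_needing_work(rows):
    """Return (row, reason) pairs for rows that are not fully annotated."""
    todo = []
    for row in rows:
        try:
            parse_row(row)
        except ValueError as error:
            todo.append((row, str(error)))
    return todo
```

Run over a spreadsheet exported as tab-separated text, rows_needing_work
would give volunteers a simple to-do list: every row that still lacks a
word type or a scheme tag shows up with the reason why.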
This project, which is going to take care of less resourced languages,
will be one of the first run through Vox Humanitatis. Should you be
interested in helping with the wordlists, please let us know which
language combination you would like to work on (for now starting from
English, and step by step from other languages, since most of the
terminology is there in English). We will get you access to the
online document. If you need to work offline, please let us know. You
can contact me by e-mail: s.cretella (at) voxhumanitatis.org
I just received a list of the supported language combinations as well as
an example for Catalan-Occitan and some notes on evaluating machine
translation in co-operation with a Wikipedia community. This means I
have quite some further stuff to tell you. I'll post that info tomorrow,
otherwise this blog post would become too long.
Please also note that the documents will be released under a CC-BY
license and therefore they can be integrated into any Wiktionary.
--
Posted By Sabine Cretella to words & more at 12/06/2007 07:27:00 PM