Automating word list translations, e.g. English to Pinyin






Sarah and I are both big fans of Tim Ferriss and check out what he's writing on his fourhourworkweek (FHWW) blog from time to time. He has fascinating views on language learning, and I found a great article of his that gives two "top 100" English word lists, one written and one spoken, both of which I saved. I'd been thinking about making a script to translate single English words into Pinyin, to save me the time of going to a website and looking each one up.


Googling around got me to this Chinese translation site. What I needed was a URL that would accept a word and return, preferably, a single translation as text, so that I could manipulate it and display just what I wanted. Looking around the site I settled on this page, which conveniently gave mostly single results in testing; the main translation page gave over 20 results for "house". The only trouble was that the submit buttons were driven by JavaScript and the submitted URL was not visible in my browser's URL bar. The "Live HTTP headers" Firefox plugin came in useful yet again: capturing a search gave me the exact string I was looking for. For convenience I swapped round some of the fields (taking care to preserve the position of the "&"s) and, for a search of "tractor", ended up with:
http://usa.mdbg.net/chindict/chindict.php?page=translate&trst=0&trlang=0&wddmtm=1&trqs=tractor
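As a rough sketch of how that captured string can be reused for any word: keep everything up to the final field name in a variable and append the search term. The helper name build_url is only for illustration and is not part of my script; the sketch also assumes plain single words that need no URL encoding.

```shell
#!/bin/sh
# Query string captured with Live HTTP headers, with the search field (trqs) moved to the end.
url='http://usa.mdbg.net/chindict/chindict.php?page=translate&trst=0&trlang=0&wddmtm=1&trqs'

# build_url is a hypothetical helper: it appends "=<word>" to the captured string.
# Assumption: the word is a single plain English word, so no URL encoding is done.
build_url() {
    printf '%s=%s\n' "$url" "$1"
}

build_url tractor
```

Keeping the search field last is what makes the simple "append the word" approach possible.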
Now all I had to do was manipulate the output from elinks and put that into a script. (I prefer elinks because it offers a lot of flexibility when downloading and preserves page layout as faithfully as a text browser can, which can be handy when grepping.) After about an hour of experimenting and tweaking, I was able to pull out the Chinese characters and Pinyin reliably for whatever English word I chose, though a few non-essential words were missing or the translation seemed a poor choice. The resulting line is shown here.
elinks -dump -force-html -dump-width 1600 -no-numbering -no-references "${url}=${word}" | sed -n "/pronouncedetailsgoogle/s/.*pronouncedetailsgoogle\(.*\)${word}/\1/pi" | cut -d' ' -f1-30
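To see what the sed and cut stages do without hitting the website, the same filter can be run over a canned line. The sample line below is a made-up stand-in for one line of the elinks dump (the real page text will look different); only the marker "pronouncedetailsgoogle" and the filter itself come from the command above. The "i" flag relies on GNU sed.

```shell
#!/bin/sh
word=tractor

# Hypothetical stand-in for the one interesting line of the elinks dump.
sample='   some page text pronouncedetailsgoogle 拖拉机 tuō lā jī tractor'

# Same filter as in the one-liner: on the line containing the marker, keep only
# what sits between the marker and the English word (matched case-insensitively),
# then trim the result to at most 30 space-separated fields.
extract() {
    sed -n "/pronouncedetailsgoogle/s/.*pronouncedetailsgoogle\(.*\)${word}/\1/pi" | cut -d' ' -f1-30
}

printf '%s\n' "$sample" | extract
```

Run on the sample, this prints the characters and Pinyin with the marker and the English word stripped off. Lines without the marker are dropped entirely because of sed's -n.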

I put that into a script called pinyin-lookup (you can see that line in there) and for good measure made a quick and simple GUI with zenity, using that script as the backend. So that's the scripts done.
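For anyone wanting to try the same thing, a zenity front end along these lines would do the job. This is a sketch rather than my actual GUI script: the function name lookup_gui, the LOOKUP_CMD variable and the "--gui" switch are made up for the example, and it assumes pinyin-lookup is on the PATH and prints its result on standard output.

```shell
#!/bin/sh
# Backend command; assumed to take one English word and print the translation.
LOOKUP_CMD=${LOOKUP_CMD:-pinyin-lookup}

# lookup_gui is a hypothetical name for the front end.
lookup_gui() {
    # zenity --entry shows a one-line text box and prints what was typed.
    gui_word=$(zenity --entry --title="Pinyin lookup" --text="English word:") || return 1
    [ -n "$gui_word" ] || return 1

    gui_result=$($LOOKUP_CMD "$gui_word")

    # zenity --info shows the result, or a fallback message if nothing came back.
    zenity --info --title="Pinyin lookup" --text="${gui_result:-No translation found for $gui_word}"
}

# Only open the dialogs when called with --gui, so the file can also be sourced.
if [ "${1:-}" = "--gui" ]; then
    lookup_gui
fi
```

Cancelling the entry dialog makes zenity exit non-zero, so the function simply returns without looking anything up.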

Then I took the saved list "100-most-common-words-written", cut out the numbers, and looped through the words, passing each one to the pinyin-lookup script, which gave the following one-liner.
for w in $(awk '{print $2}' 100-most-common-words-written) ; do pinyin-lookup "$w" >> 100-most-common-words-written-pinyin ; done
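If you want something a bit more forgiving than a one-liner, the same loop can be wrapped in a function that pauses between requests and reports words that fail. This is only a sketch: translate_list, LOOKUP and DELAY are names invented for the example, and it assumes each line of the list looks like "number word", which is what the awk '{print $2}' step relies on.

```shell
#!/bin/sh
# LOOKUP and DELAY are hypothetical knobs, not part of the original one-liner.
LOOKUP=${LOOKUP:-pinyin-lookup}
DELAY=${DELAY:-1}

# translate_list LIST OUTFILE
# LIST is assumed to hold lines like "1 the"; results are appended to OUTFILE.
translate_list() {
    awk '{print $2}' "$1" | while read -r w; do
        # Skip lines that had no second column.
        [ -n "$w" ] || continue
        # </dev/null stops the lookup command from reading the rest of the word list.
        $LOOKUP "$w" </dev/null >> "$2" || echo "lookup failed: $w" >&2
        # Small pause so the translation site is not hammered.
        sleep "$DELAY"
    done
}
```

It would be called as: translate_list 100-most-common-words-written 100-most-common-words-written-pinyin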

BTW the actual output can be seen in yesterday's post "Howto: Pinyin on Ubuntu" and here is part of the output:
after 以后 yǐ hòu
again 再 zài
all 所有 suǒ yǒu
Now the hard part: learning these words! If you use any of these scripts, or apply the techniques to other languages, I'd be delighted to hear about it. Have fun!

Comments

  1. Using the Firefox "Live HTTP headers" plugin again, I found the underlying request string that generates translated text on Babelfish, e.g.
    http://babelfish.yahoo.com/translate_txt?ei=UTF-8&doit=done&fr=bf-home&intl=1&tt=urltext&trtext=eat&lp=en_fr&btnTrTxt=Translate

  2. For example....
    [kiat@kiat-t61-uk pinyin]$ for word in $(cat 100-most-common-words-spoken | awk '{print $2}'); do export lang=fr ; echo -n "${word} = " ; elinks -dump -force-html -dump-width 1600 -no-numbering -no-references "http://babelfish.yahoo.com/translate_txt?ei=UTF-8&doit=done&fr=bf-res&intl=1&tt=urltext&lp=en_${lang}&btnTrTxt=Translate&trtext=${word}" | sed -n '1,/Search the web with this text/p' | tail -n 2 | head -1 ; done
    a = a
    after = ensuite
    again = encore
    all = tous
    almost = presque
    also = aussi
    always = toujours
    and = et
    because = parce que
    before = avant
    big = grand
    but = mais


Kiat