Monolingual Corpora

For the set of 3 languages (Danish, English, Italian) and the set of 12 languages (Bulgarian, Czech, Danish, German, Greek, English, Spanish, Finnish, French, Hungarian, Italian, and Swedish), monolingual corpora were constructed from a combination of text from the Leipzig Corpora Collection and Europarl. For all other languages, only text from Leipzig was used. The full tokenized corpora can be downloaded below.

Download (12 languages, Europarl+Leipzig): Swedish | Danish | Bulgarian | Hungarian | Finnish | Greek | Czech | Spanish | French | German | English | Italian

Download (All other languages, Leipzig): Tar archive

Parallel Corpora

Parallel text was taken from Europarl, Wikipedia titles, and parallel news commentary. Our tokenized versions can be downloaded below.

Download: Swedish-English | Danish-English | Bulgarian-English | Hungarian-English | Finnish-English | Greek-English | Czech-English | Spanish-English | French-English | German-English | Italian-English

Bilingual Dictionaries

For the 3- and 12-language sets, parallel dictionaries were created by thresholding word alignment probabilities from the parallel corpora at 0.1. For all other languages, dictionaries were formed by translating the 20k most common words in the English monolingual corpus with Google Translate, ignoring translation pairs with identical surface forms. All dictionaries can be downloaded below.

Download (12 languages): Swedish-English | Danish-English | Bulgarian-English | Hungarian-English | Finnish-English | Greek-English | Czech-English | Spanish-English | French-English | German-English | Italian-English

Download (All other languages): Single tar archive | Individual downloads
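
The two dictionary-building recipes described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the released dictionaries: the function names, the in-memory format of the alignment probabilities (a mapping from word pairs to probabilities), and the choice to keep pairs whose probability is greater than or equal to the threshold are all assumptions. The Google Translate step is omitted, since it depends on an external service.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def dictionary_from_alignments(
    probs: Dict[Pair, float], threshold: float = 0.1
) -> List[Pair]:
    """Keep word pairs whose alignment probability reaches the threshold.

    Assumption: `probs` maps (source_word, target_word) to a probability,
    and a pair exactly at the threshold is kept.
    """
    return sorted(pair for pair, p in probs.items() if p >= threshold)


def most_common_words(tokens: Iterable[str], k: int = 20000) -> List[str]:
    """Return the k most frequent tokens in a tokenized corpus."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def drop_identical_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Discard translation pairs whose two surface forms are identical."""
    return [(src, tgt) for src, tgt in pairs if src != tgt]


# Hypothetical toy data, for illustration only.
toy_probs = {("hund", "dog"): 0.82, ("hund", "the"): 0.03, ("kat", "cat"): 0.1}
toy_dictionary = dictionary_from_alignments(toy_probs)

toy_translations = [("taxi", "taxi"), ("dog", "hund")]
toy_filtered = drop_identical_pairs(toy_translations)
```

Filtering identical surface forms removes entries such as names and numerals that the translation system copies through unchanged, which would otherwise add trivial pairs to the dictionary.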

Trained Embeddings

Download (3 languages): multiCluster (size=40) | multiCluster (size=512) | multiCCA (size=40) | multiCCA (size=512) | multiSkip (size=40) | multiSkip (size=512) | translation invariance (size=40) | translation invariance (size=512)

Download (12 languages): multiCluster (size=40) | multiCluster (size=512) | multiCCA (size=40) | multiCCA (size=512) | multiSkip (size=40) | multiSkip (size=512) | translation invariance (size=40) | translation invariance (size=512)

Download (59 languages): multiCluster (size=40) | multiCluster (size=512) | multiCCA (size=40) | multiCCA (size=512)