Comparable Corpora BootCaT (CCBC)
Adam Kilgarriff, Avinesh PVS
Lexical Computing Ltd
BootCaT
Bootstrapping Corpora and Terms
Translators
  – Know the language
  – Not domain experts
  – Can interpret domain terms but can’t guess them
Instant domain corpus from the web
Marco Baroni and Silvia Bernardini (2004)
BootCaT method
Piggyback on a search engine
  – Google, Yahoo, Bing
Set of seed terms
Repeat
  – Take 3 random seeds
  – Send to search engine
  – Gather ‘search hits’ pages
Remove duplicates, find terms
  – Can iterate
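The query-building loop above can be sketched roughly as follows. This is a minimal illustration, not the BootCaT implementation: the names `quote`, `make_queries` and `collect_urls` are hypothetical, the search engine is passed in as a caller-supplied function rather than a real API, and duplicate removal is simplified to dropping repeated URLs.

```python
import random
from typing import Callable, Iterable, List, Optional, Sequence


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def make_queries(
    seeds: Sequence[str],
    n_queries: int = 10,
    tuple_size: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Build n_queries query strings, each from tuple_size randomly chosen seeds."""
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        combo = rng.sample(list(seeds), tuple_size)  # distinct seeds per query
        queries.append(" ".join(quote(t) for t in combo))
    return queries


def collect_urls(
    queries: Iterable[str],
    search: Callable[[str], Iterable[str]],
) -> List[str]:
    """Send each query to `search` (assumed: query -> hit URLs) and
    keep each URL once, in first-seen order."""
    seen = set()
    urls = []
    for query in queries:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

To iterate, the terms extracted from the downloaded pages would be fed back in as the next round's `seeds`; the term-extraction step itself is not shown here.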
WebBootCaT
Web interface
Improved cleaning, duplicate removal
Integrated with corpus tool (Sketch Engine)
Going multilingual
Google-translate
  – English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
  – French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques
And do the same thing for French
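The seed-translation step might look something like the sketch below. Everything here is an assumption made for illustration: `translate_seeds` is a hypothetical helper, the translator is a caller-supplied function (no real translation API is called), and the toy lookup table pairs a few English and French terms as they appear to correspond in the two lists above.

```python
from typing import Callable, List, Sequence

# Toy English -> French pairs, read off the two seed lists (assumed alignment).
TOY_EN_FR = {
    "volcanology": "volcanologie",
    "volcanologist": "vulcanologue",
    "volcanic eruption": "éruption volcanique",
    "seismographs": "sismographes",
    "tephra": "tephra",
    "magma": "magma",
}


def toy_translate(term: str) -> str:
    """Stand-in translator: look the term up, fall back to the term itself."""
    return TOY_EN_FR.get(term, term)


def translate_seeds(
    seeds: Sequence[str],
    translate: Callable[[str], str],
) -> List[str]:
    """Translate each seed, dropping empty results and repeated translations."""
    translated = []
    for term in seeds:
        target = translate(term).strip()
        if target and target not in translated:
            translated.append(target)
    return translated


french_seeds = translate_seeds(
    ["volcanology", "magma", "volcanic eruption"], toy_translate
)
```

The translated list then serves as the seed set for a second, French run of the same corpus-building procedure.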
By July 2011
  – All steps integrated
  – Propose bilingual terminology