AT Language Solutions attends EMNLP 2018 and WMT18

The annual Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) and the Third Conference on Machine Translation (WMT18) were held in Brussels from 31 October to 4 November. WMT18 is one of the most important international events covering research and development in machine translation (MT). Participants from businesses and universities come together there to present and discuss the latest advances in the field. This year, AT Language Solutions successfully took part in the WMT18 shared task on parallel corpus filtering.

Data-driven machine translation

The task dealt with the problem of cleaning noisy parallel corpora. This is a common scenario in the development of current data-driven machine translation systems, which require huge amounts of training data to work well. Training data can be obtained through web crawling, for instance, but this kind of procedure tends to produce noisy data: parallel corpora obtained this way can contain sentences in a third language, mismatched sentence pairs, incorrect or incomplete translations, and so on. At WMT18, participants in the shared task on parallel corpus filtering were asked to design a method for selecting valid translation pairs from an extremely noisy German-English corpus obtained through web crawling, and to submit the resulting subset of clean sentence pairs. The submissions were assessed by measuring the quality of machine translation systems trained on the selected data.
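To make the idea of corpus filtering concrete, here is a minimal sketch in Python of a few rule-based checks often used as a first pass on crawled sentence pairs. It is only an illustration and not the method described in our article: the function names (is_plausible_pair, filter_corpus) and the thresholds (max_ratio, min_tokens, max_tokens) are assumptions chosen for the example, and the sketch makes no attempt to detect third-language sentences or incorrect translations.

```python
from typing import Iterable, List, Tuple


def is_plausible_pair(
    src: str,
    tgt: str,
    max_ratio: float = 2.0,   # assumed threshold, for illustration only
    min_tokens: int = 1,      # assumed threshold, for illustration only
    max_tokens: int = 100,    # assumed threshold, for illustration only
) -> bool:
    """Return True if a sentence pair passes a few simple noise checks."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())

    # Reject empty or overly long sentences on either side.
    if not (min_tokens <= src_len <= max_tokens):
        return False
    if not (min_tokens <= tgt_len <= max_tokens):
        return False

    # Reject pairs where the "translation" is just a copy of the source.
    if src.strip().lower() == tgt.strip().lower():
        return False

    # Reject pairs whose lengths differ too much (likely misaligned
    # or incomplete translations).
    ratio = max(src_len, tgt_len) / max(1, min(src_len, tgt_len))
    return ratio <= max_ratio


def filter_corpus(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass the checks above."""
    return [(src, tgt) for src, tgt in pairs if is_plausible_pair(src, tgt)]


noisy_pairs = [
    ("Das ist ein Haus .", "This is a house ."),
    ("Hallo", "Hello , this is a very long unrelated sentence ."),
    ("Copyright 2018", "Copyright 2018"),
    ("", "Empty source"),
]
clean_pairs = filter_corpus(noisy_pairs)
```

Simple rules like these only remove the most obvious noise. Deciding whether two sentences are really translations of each other generally needs a learned scoring model on top.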

Participation of AT Language Solutions

In our submission, we approached the problem as a machine learning task in which the aim is to estimate how well two sentences in different languages match and can therefore be considered translations of each other. The article presented at the conference, which contains all the technical details, is publicly available here. The presentation was given in a group session where the different participants presented their approaches. Our presentation can be seen here. The score we obtained placed us in the top third of all participants, just a few points away from the top-scoring systems. The detailed results of the task can be viewed here.