Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, Sriram Venkatapathy
The 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Republic of Korea, July 8-14, 2012.
Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose. It is well known that the larger the parallel corpus, the better the expected accuracy of the resulting system. However, creating parallel data is costly and time-intensive, so a prior assessment of how much human translation must be produced to reach a satisfactory accuracy level would be very useful. Predicting the required size of the parallel corpus is our primary goal here. In this work, we predict a learning curve that plots the expected accuracy of a machine translation system against the size of the parallel corpus. We consider two scenarios: (1) a monolingual corpus sample in the source language is available, and (2) a small amount of parallel data is available. We propose methods for predicting learning curves in both scenarios, as well as for combining the two scenarios to obtain a more accurate learning curve.