Speech recognition and machine translation have entered the lives of millions of people but, to make the machine learning (ML) algorithms behind them work well, it takes a lot of annotated, structured data. One way to obtain this data is to create it yourself using specialized tools, an approach for which Christopher Ré coined the term ‘Data Programming’. We now contrast corpora annotated by hand by humans, the ‘Gold Standard’, with ‘Silver Standard’ data created semi-automatically by such artificial means. While Ré’s group has produced its own set of tools for this (called ‘Snorkel’), we decided to address the problem from the angle of programming languages.
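To make the idea concrete, here is a minimal sketch in Python of what data programming looks like: a few heuristic ‘labeling functions’ each cast a noisy vote on an example, and the votes are combined into a silver-standard label. Everything here is illustrative, the function names are invented and the majority-vote combiner stands in for Snorkel’s actual learned label model.

```python
# Minimal data-programming sketch: heuristic labeling functions produce
# noisy "silver standard" labels, combined here by simple majority vote.
# (Illustrative only; Snorkel combines votes with a learned generative model.)
from collections import Counter

ABSTAIN, POSITIVE, NEGATIVE = None, 1, 0

def lf_contains_great(text):          # hypothetical heuristic rule
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_awful(text):          # hypothetical heuristic rule
    return NEGATIVE if "awful" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_great, lf_contains_awful]

def silver_label(text):
    """Combine the labeling functions' votes; abstentions are ignored."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    for sentence in ["A great little film.", "An awful script.", "It exists."]:
        print(sentence, "->", silver_label(sentence))
```

The appeal of the approach is that each rule can be cheap and imperfect: it is the aggregation over many rules and many examples that yields a usable training corpus.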
Having spent many years doing research on formal grammars, I watched these so-called symbolic methods gradually decline in favour of statistical approaches. This changed with the advent of Data Programming which, rather paradoxically, gave these lacklustre linguistic rules a new lease of life, simply because detecting word patterns requires applying context-free rules. Whilst thinking about this new opening for rule engines, and working alongside machine learning scientists who needed annotated data, it seemed to make sense to combine the two. My colleagues were looking not just for annotated corpora in which recurrent patterns of words are identified and labelled, but also for synthetic corpora of text generated by following specific grammars. For machine translation they were also keen on ‘noisy corpora’, where known errors are artificially introduced so that the model can learn to recognize them and translate them correctly.
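As an illustration of those last two ideas, here is a small Python sketch that generates synthetic sentences from a toy context-free grammar and then injects controlled ‘noise’ (adjacent-character swaps) to build a noisy counterpart. The grammar, the noise operation and all the names are hypothetical, chosen only to show the shape of the technique; they are not Tamgu’s own syntax.

```python
import random

# A toy context-free grammar: each non-terminal maps to a list of
# alternative expansions (sequences of terminals and non-terminals).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["cat"], ["dog"], ["translator"]],
    "V":  [["sees"], ["follows"]],
}

def generate(symbol="S"):
    """Expand a symbol recursively, picking one alternative at random."""
    if symbol not in GRAMMAR:                 # terminal: return as-is
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    return [word for sym in expansion for word in generate(sym)]

def add_noise(sentence, rate=0.2):
    """Introduce known errors: swap two adjacent characters in some words."""
    noisy = []
    for word in sentence:
        if len(word) > 3 and random.random() < rate:
            i = random.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return noisy

if __name__ == "__main__":
    clean = generate()                        # synthetic sentence
    print(" ".join(clean))                    # e.g. "the dog sees the cat"
    print(" ".join(add_noise(clean)))         # noisy counterpart
```

Because the errors are introduced deliberately, each noisy sentence comes paired with its clean original, exactly the kind of supervision a translation model needs to learn to correct them.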
Instead of creating a specialized tool, I created a new programming language: one that could help annotate a corpus but also generate new text, while offering users the widest possible range of instructions and freedom. Welcome to ‘Tamgu’.