TAMGU: A new open source programming language to help create, annotate and augment corpora and data. - Naver Labs Europe

Speech recognition and machine translation have entered the lives of millions of people but, to make the machine learning (ML) algorithms behind them work better, it takes a lot of annotated and structured data. One way to get this data is to create your own with specialized tools, an approach for which Christopher Ré coined the term ‘Data Programming’. Corpora annotated by hand are now called ‘Gold Standard’ data, as opposed to ‘Silver Standard’ data created semi-automatically. While Ré’s group has produced its own set of tools to do this (called ‘Snorkel’), we decided to address the problem from the angle of programming.

Having spent many years doing research on formal grammars, I watched these so-called symbolic methods gradually decline in favour of statistical approaches. This changed with the advent of Data Programming which, rather paradoxically, gave these lacklustre linguistic rules a new lease of life simply because, to detect word patterns, you need to apply context-free rules. While I was thinking about this new space in which to implement rule engines, I was working alongside machine learning scientists who needed annotated data, so it seemed to make sense to combine the two. My colleagues were looking not just for annotated corpora in which recurrent patterns of words were identified and labelled, but also for synthetic corpora with text generated by following specific grammars. In machine translation they were also keen on ‘noisy corpora’, where known errors are artificially introduced so that the model can learn to recognize them and translate them correctly.
Instead of creating a specialized tool I created a new programming language. A language that could help annotate a corpus, but also generate new text, while offering the widest possible range of instructions and freedom to users. Welcome to ‘Tamgu’.

Functional, Imperative and Logical

Tamgu (which means ‘research’ or ‘exploration’ in Korean) is a FIL programming language, i.e. Functional, Imperative and Logical. The majority of languages today are ‘IF’: Imperative with a touch of Functional, like Kotlin or Swift. Others are frankly Functional, like Haskell or Lisp, and there are some purely Logical languages such as Prolog.

Functional Module

A FIL language poses particular interoperability problems. However, if you look at what all these approaches manipulate, you can see they have something in common: they work on more or less the same objects, i.e. numbers, strings and containers. For the functional part I took inspiration from Haskell, which proved to be an ideal language for writing compact and efficient lambda functions. I adapted it by allowing the language to manipulate Tamgu dictionaries (maps) directly or to iterate on external variables as you can see below:

[Figure: functional module]

Logical Module

Although Prolog has long been the language of choice for building text generators, with a syntax that allows rich and complex grammars to be written in a few lines, it does have some limitations when it comes to managing large lexicons. This can be solved by embedding the logical part in a FIL environment and delegating lexicon management to suitable objects.

For the Logical part, I had to slightly modify the syntax of the language to make it interoperable with the rest of Tamgu. For example, variables are identified in Prolog by a capital letter at the beginning of their name, while lowercase words are considered immutable atoms. This is problematic in a FIL language that has none of these restrictions, so I was inspired by the SPARQL syntax, where variables in logical expressions start with a “?”. By identifying the variables differently, I was able to replace the atoms with character strings. As for the Tamgu vectors, they’re transparently reinterpreted as Prolog vectors. More precisely, Tamgu vectors have the ability to be unified in a Prolog execution. The default unification is reduced to a simple comparison of equality between two objects, which then allows any Tamgu object to be used in a logical expression (see below).

[Figure: logical module]
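To make the idea of unification more concrete, here is a minimal Python sketch (not Tamgu code, and not how Tamgu is implemented). It assumes, as described above, that variables are strings starting with “?”, that lists are unified element by element, and that any other pair of objects unifies when they compare equal. The names `is_var`, `walk` and `unify` are hypothetical, and the sketch omits the occurs check for brevity.

```python
def is_var(term):
    """A logical variable is a string starting with '?', SPARQL-style."""
    return isinstance(term, str) and term.startswith("?")


def walk(term, bindings):
    """Follow bindings until we reach a value or an unbound variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings=None):
    """Return the extended bindings if a and b unify, otherwise None."""
    bindings = dict(bindings or {})
    a, b = walk(a, bindings), walk(b, bindings)
    if is_var(a):
        if a != b:
            bindings[a] = b
        return bindings
    if is_var(b):
        bindings[b] = a
        return bindings
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return None
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    # Default case: plain equality between two arbitrary objects.
    return bindings if a == b else None


print(unify(["likes", "?x", "pizza"], ["likes", "mary", "?y"]))
```

Because the fallback is a plain equality test, numbers, strings or any other comparable object can sit inside a logical expression without special treatment.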

Language Primitives

The “Imperative” part of the language is composed of the traditional modules of most existing languages. We can declare variables, functions, threads (micro-threads even), and classes:

[Figure: language primitives (1)]

Unlike Python, Tamgu needs variables to be declared with a type. I also tried to simplify the handling of threads by implicitly protecting all potentially dangerous variables (mainly containers). Tamgu offers a very rich range of different types: strings, floats, integers, vectors, maps.

[Figure: language primitives (2)]

Character string management

Tamgu provides an impressive arsenal for manipulating your strings. First of all, it dynamically recognizes the encoding of a string, even if it accidentally mixes different encodings.

There are many ways to access the content of a string:

* With indexes: str[1], str[2:4], str[-2:]

* With sub-strings: str["beg":"end"]

* With regular expressions: str[r"ab%d+"].

You can also chain the descriptions: str["a":"e"][1:-1]

But, more importantly, you can modify the content of a string in this way:

[Figure: character string management]
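For readers more familiar with Python, the access forms listed above can be approximated as follows. This is only a sketch under stated assumptions: the helpers `between`, `first_match` and `replace_between` are hypothetical names, the sub-string form is assumed to return the span including both delimiters, and the `%d` of the Tamgu pattern is assumed to mean a digit, written `\d` in Python. Python strings are immutable, so the "modification" here returns a new string.

```python
import re


def between(text, start, end):
    """Span from the first `start` to the next `end`, delimiters included."""
    i = text.find(start)
    if i == -1:
        return ""
    j = text.find(end, i + len(start))
    if j == -1:
        return ""
    return text[i:j + len(end)]


def first_match(text, pattern):
    """First substring matching a regular expression, or "" if none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


def replace_between(text, start, end, new):
    """Return a copy of text with the delimited span replaced by `new`."""
    span = between(text, start, end)
    return text.replace(span, new, 1) if span else text


s = "xx beg middle end yy"
print(s[1], s[2:4], s[-2:])           # access by index
print(between(s, "beg", "end"))       # access by sub-strings
print(first_match("id ab42 ok", r"ab\d+"))  # access by regular expression
print(between("a table", "a", "e")[1:-1])   # chained access
print(replace_between(s, "beg", "end", "X"))
```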

Glossaries and Rules

Last, but not least, Tamgu offers a lexicon mechanism based on transducers. Offering both compactness and speed of access, they’re the best way to encode a lexicon. The version implemented in Tamgu also lets you identify a word by traversing the transducer with an edit distance. That makes it possible to recognize words with common errors such as two swapped characters, a missing character or, on the contrary, an extra character.
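The kind of tolerance described here can be illustrated with a small Python sketch. It is not a transducer: it simply compares a word against every entry of a plain list using the optimal string alignment distance, a variant of edit distance that counts an insertion, a deletion, a substitution or a swap of two adjacent characters as one error each. The names `osa_distance` and `lookup` and the sample lexicon are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def lookup(word, lexicon, max_distance=1):
    """Lexicon entries within max_distance of word, in sorted order."""
    return sorted(w for w in lexicon if osa_distance(word, w) <= max_distance)


lexicon = ["pizza", "pasta", "burger"]
for typo in ["pziza", "piza", "pizzza"]:
    print(typo, lookup(typo, lexicon))
```

A transducer gets the same result without scanning the whole lexicon, since the edit budget is spent while traversing shared prefixes; the brute-force loop above is only meant to show which errors are accepted.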

But above all, these lexicons can be coupled with context-free rules, which can be written directly in the code. You can write your own vocabulary, add general lexicons of English or French if necessary and then write rules to identify complex patterns. In the example below, we define a few words to which we associate the label _food_. We then create a simple rule that detects the sequence _the food_. We create an _annotator_ that will automatically be associated with these rules and we apply it to the sentence.

[Figure: glossaries and rules]

So, in just a few lines, we can describe a lexicon coupled with rules to detect, in the text, the positions of the textual elements that interest us.
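As a rough Python analogue of the example described above (not Tamgu's annotator syntax), the sketch below labels a few words as food, then applies a single hard-coded rule that detects the word "the" followed by a food word and reports character offsets. The lexicon content, the sample sentence and the names `LEXICON`, `tokenize` and `annotate` are all assumptions made for illustration.

```python
import re

# A tiny hand-written lexicon: word -> label.
LEXICON = {"pizza": "food", "burger": "food", "pasta": "food"}


def tokenize(text):
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def annotate(text, lexicon=LEXICON):
    """Find each 'the' followed by a food word; return (label, start, end)."""
    tokens = tokenize(text)
    spans = []
    for (w1, s1, _), (w2, _, e2) in zip(tokens, tokens[1:]):
        if w1.lower() == "the" and lexicon.get(w2.lower()) == "food":
            spans.append(("food", s1, e2))
    return spans


text = "I ate the pizza and the burger."
for label, start, end in annotate(text):
    print(label, start, end, repr(text[start:end]))
```

In Tamgu the rule is written declaratively and compiled together with the lexicon, whereas this sketch hard-codes one pattern in a loop.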

The example’s pretty simple, but you can increase the vocabulary over time. It’s compiled on the fly as a transducer, together with the rules. Again, there’s full interoperability between this mechanism and the FIL language.

Libraries

The implementation of an external library in Tamgu obeys exactly the same rules (the same derivation) as an internal object. In other words, implementing an embedding mechanism based on Word2Vec corresponds more or less to the implementation of the _string_ type. Unlike Java or Python, where you can only implement external methods, in Tamgu there’s a direct correspondence between a Tamgu object and its implementation as a C++ object. Tamgu offers libraries that encapsulate cURL, liblinear, word2vec, SQLite, FLTK (GUI) and Wapiti (CRF). You can produce a template to create your own library using the very simple script we provide.

Conclusion

Tamgu has been designed to make cleaning or creating corpora as simple as possible. You can identify complex patterns in a few lines of code or generate text by applying grammars. Because it follows the conventions of well-known programming languages, it’s easy to learn yet offers all the power needed to give your machine learning algorithms the data they need. Go try it out!

About the author: Claude Roux is a senior research engineer in the Systemic AI team. Other contributors to Tamgu are Caroline Brun, Alexandre Bérard, Ioan Calapodescu and Julien Perez.