Tamgu (which means ‘research’ or ‘exploration’ in Korean) is a FIL programming language, i.e. Functional, Imperative and Logical. Most languages today are ‘IF’: Imperative with a touch of Functional, like Kotlin or Swift. Others are frankly Functional, like Haskell or Lisp, and a few are purely Logical, such as Prolog.
A FIL language poses particular interoperability problems. However, all three approaches manipulate more or less the same objects: numbers, strings and containers. For the functional part, I drew on Haskell, which proved to be an ideal language for writing compact and efficient lambda functions. I adapted it so that it can manipulate Tamgu dictionaries (maps) directly and iterate over external variables.
Prolog has long been the language of choice for building text generators, with a syntax that allows rich and complex grammars to be written in a few lines, but it has some limitations when it comes to managing large lexicons. Integrating Prolog into a FIL environment solves this by delegating lexicon management to more suitable objects.
For the Logical part, I had to slightly modify the syntax of the language to make it interoperable with the rest of Tamgu. For example, variables are identified in Prolog by a capital letter at the beginning of their name, while lowercase words are considered immutable atoms. This is problematic in a FIL language that has none of these restrictions, so I borrowed from the SPARQL syntax, where variables in logical expressions start with a “?”. By identifying the variables differently, I was able to replace the atoms with character strings. As for Tamgu vectors, they are transparently reinterpreted as Prolog lists. More precisely, Tamgu vectors can be unified in a Prolog execution. The default unification is reduced to a simple equality comparison between two objects, which allows any Tamgu object to be used in a logical expression.
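The variable convention and the default unification described above can be illustrated outside Tamgu. The following Python sketch is not Tamgu's implementation: the helper names `is_var`, `walk` and `unify` are hypothetical, Python lists stand in for Tamgu vectors, and the occurs check is omitted for brevity.

```python
def is_var(term):
    """A variable is a string starting with '?', as in the SPARQL-style convention."""
    return isinstance(term, str) and term.startswith("?")


def walk(term, bindings):
    """Follow variable bindings until reaching a value or an unbound variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings=None):
    """Return the extended bindings if a and b unify, otherwise None."""
    bindings = {} if bindings is None else bindings
    a, b = walk(a, bindings), walk(b, bindings)
    if is_var(a):
        # Same unbound variable on both sides: nothing to add.
        return bindings if a == b else {**bindings, a: b}
    if is_var(b):
        return {**bindings, b: a}
    if isinstance(a, list) and isinstance(b, list):
        # Vectors unify element by element and must have the same length.
        if len(a) != len(b):
            return None
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    # Default case: plain equality between two arbitrary objects.
    return bindings if a == b else None


result = unify(["parent", "?x", "mary"], ["parent", "john", "?y"])
```

Because the fallback is ordinary equality, numbers, strings or any other comparable object can appear in a term without special treatment, which is the property the paragraph above relies on.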
The “Imperative” part of the language offers the constructs found in most existing languages: you can declare variables, functions, threads (even micro-threads) and classes.
Unlike Python, Tamgu needs variables to be declared with a type. I also tried to simplify the handling of threads by implicitly protecting all potentially dangerous variables (mainly containers). Tamgu offers a very rich range of different types: strings, floats, integers, vectors, maps.
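How Tamgu protects containers internally is not described here. As a rough illustration of the general idea only, the Python sketch below wraps a list so that every access takes the same lock. The class `GuardedList` and the function `fill` are hypothetical names introduced for this example.

```python
import threading


class GuardedList:
    """A list whose read and write operations all take the same lock."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def snapshot(self):
        """Return a copy so callers never iterate over the shared list itself."""
        with self._lock:
            return list(self._items)


def fill(shared, start, count):
    """Append `count` consecutive integers, beginning at `start`."""
    for n in range(start, start + count):
        shared.append(n)


shared = GuardedList()
workers = [threading.Thread(target=fill, args=(shared, k * 100, 100)) for k in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

In Python the locking has to be written by hand as above. The point of the design described in the text is that the language takes care of it implicitly.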
Character string management
Tamgu provides an impressive arsenal for manipulating your strings. First of all, it dynamically recognizes the encoding of a string, even if it accidentally mixes different encodings.
There are many ways to access the content of a string:
* With indexes: str, str[2:4], str[-2:]
* With sub-strings: str["beg":"end"]
* With regular expressions: str[r"ab%d+"]
You can also chain the descriptions: str["a":"e"][1:-1]
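For readers more familiar with Python, the three access styles can be approximated as follows. This is a sketch, not Tamgu code: `between` and `first_match` are hypothetical helpers, the Tamgu pattern `%d` is written `\d` in Python, and the assumption that a sub-string access includes its delimiters is inferred from the chained `[1:-1]` above rather than confirmed.

```python
import re


def between(text, start, end):
    """Return the span from the first `start` to the next `end`, delimiters included.

    Returns an empty string if either delimiter is missing (an assumption).
    """
    i = text.find(start)
    if i == -1:
        return ""
    j = text.find(end, i + len(start))
    if j == -1:
        return ""
    return text[i : j + len(end)]


def first_match(text, pattern):
    """Return the first substring matching the regular expression, or ''."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


s = "0123beg-mid-end ab42!"

by_index = s[2:4]                          # index access
tail = s[-2:]                              # negative index access
by_substring = between(s, "beg", "end")    # sub-string access
by_regex = first_match(s, r"ab\d+")        # regular-expression access
chained = between(s, "beg", "end")[1:-1]   # chaining two descriptions
```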
More importantly, the same access expressions can be used to modify the content of a string.
Glossaries and Rules
Last, but not least, Tamgu offers a lexicon mechanism based on transducers. Offering both compactness and speed of access, they’re the best way to encode a lexicon. The version implemented in Tamgu also lets you identify a word by traversing the transducer with an edit distance. That makes it possible to recognize words containing common errors, such as two swapped characters, a missing character or, conversely, an extra character.
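The three error types mentioned above (swap, omission, insertion) are the ones covered by the optimal string alignment distance. The Python sketch below illustrates that matching criterion only. It scans a plain list linearly instead of traversing a transducer, so it says nothing about how Tamgu does it. The names `osa_distance` and `fuzzy_lookup` and the sample words are assumptions made for the example.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, swap adjacent."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def fuzzy_lookup(word, lexicon, max_distance=1):
    """Return the lexicon entries within `max_distance` edits of `word`, sorted."""
    return sorted(entry for entry in lexicon if osa_distance(word, entry) <= max_distance)


lexicon = ["cheese", "bread", "chess"]
candidates = fuzzy_lookup("chesee", lexicon)
```

A transducer performs the same kind of tolerant match while walking its arcs, which avoids comparing the input against every entry as this sketch does.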
But above all, these lexicons can be coupled with context-free rules, which can be written directly in the code. You can write your own vocabulary, add general lexicons of English or French if necessary, and then write rules to identify complex patterns. As an illustration, we can define a few words and associate them with the label _food_, then create a simple rule that detects the sequence _the food_. An _annotator_ is automatically associated with these rules, and we apply it to a sentence.
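The idea of a labeled lexicon combined with a rule can be mimicked in Python as follows. This is only an analogy, not Tamgu's annotator syntax: the set `FOOD`, its entries, the function `annotate` and the sample sentence are all hypothetical, and the single rule "the followed by a _food_ word" is hard-coded rather than compiled from a rule language.

```python
import re

# Hypothetical lexicon: words carrying the label "food".
FOOD = {"pizza", "fondue", "cheese"}


def annotate(sentence, lexicon=FOOD):
    """Return (start, end) character offsets of every 'the <food>' sequence."""
    tokens = [(m.group(0).lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", sentence)]
    spans = []
    # Slide over adjacent token pairs and apply the single rule.
    for (w1, start, _), (w2, _, end) in zip(tokens, tokens[1:]):
        if w1 == "the" and w2 in lexicon:
            spans.append((start, end))
    return spans


sentence = "The fondue was better than the pizza."
spans = annotate(sentence)
matches = [sentence[s:e] for s, e in spans]
```

The result is a list of positions in the text, which can then be used to extract or tag the matching fragments.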
So, in just a few lines, we can describe a lexicon coupled with rules to detect, in the text, the positions of the textual elements that interest us.
The example is pretty simple, but you can extend the vocabulary over time. It is compiled into a transducer on the fly, along with the rules. Again, there is full interoperability between this mechanism and the FIL language.
The implementation of an external library in Tamgu obeys exactly the same rules (the same derivation) as an internal object. In other words, implementing an embedding mechanism based on Word2Vec corresponds more or less to implementing the _string_ type. Unlike Java or Python, where you can only implement external methods, in Tamgu there is a direct correspondence between a Tamgu object and its implementation as a C++ object. Tamgu offers libraries that encapsulate cURL, liblinear, word2vec, SQLite, FLTK (GUI) and Wapiti (CRF). You can generate a template for your own library using the very simple script we provide.
Tamgu has been designed to make cleaning or creating corpora as simple as possible. You can identify complex patterns in a few lines of code or generate text by applying grammars. Because it follows the conventions of familiar programming languages, it is easy to learn, yet it offers all the power needed to give your machine learning algorithms the data they need. Go try it out!