This is the timetable of the general research seminar in language technology. Presentations are given primarily by doctoral students, but also by other researchers in language technology, including external guests.
In this session I'll present a continuously learning word-concept model that processes a text stream, categorizes tokens into concepts, and incrementally updates the existing model, which is stored in a feature vector space.
For each token (word), the model computes a syntactic co-occurrence feature vector, which assigns the word to a concept category with a probability estimate. The concept category can be a previously discovered category, a manually defined one, or a newly found candidate.
The hypothesis is that a concept category can be assigned even to a partial reference (synonymy, anaphora) in the sample text. To demonstrate the functionality and evaluation of word-concept similarity, I am using the Stanford Parser with the English PCFG language model, and extracted abstracts from Medline articles as text samples. I'll explain the vector space model, which maps both the word tokens and their syntactic relation-attribute pairs into a common vector space, and discuss the benefits of this representation.
I will further discuss metrics of syntactic and semantic similarity, and ideas for improving and better understanding these results. These include machine learning approaches such as self-organizing maps and deep learning methods.
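The abstract does not specify the exact feature scheme or similarity metric. As a minimal sketch, assuming raw counts of dependency (relation, attribute) pairs as features and cosine as the similarity metric, the shared vector space might look like this (the observations below are hypothetical parser output, not data from the talk):

```python
import math
from collections import Counter, defaultdict

def add_observation(space, word, relation, attribute):
    """Record one syntactic co-occurrence: word seen with (relation, attribute)."""
    space[word][(relation, attribute)] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in set(u) & set(v))
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Toy dependency observations (hypothetical, in the spirit of Medline abstracts)
space = defaultdict(Counter)
add_observation(space, "aspirin", "dobj_of", "administer")
add_observation(space, "aspirin", "nmod", "dose")
add_observation(space, "ibuprofen", "dobj_of", "administer")
add_observation(space, "ibuprofen", "nmod", "dose")
add_observation(space, "patient", "nsubj_of", "receive")

print(cosine(space["aspirin"], space["ibuprofen"]))  # near 1.0: identical contexts
print(cosine(space["aspirin"], space["patient"]))    # 0.0: no shared features
```

Words sharing many syntactic contexts end up close in this space, which is the property a concept-assignment step can exploit.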
From a technological perspective, the experiment also aims to develop a limited-memory model that grows sub-linearly with corpus size and provides fast seek times even for large text sets.
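The abstract does not say how sub-linear growth is achieved. One standard technique with this property is the hashing trick: features are hashed into a fixed-size index space, so memory is bounded by a constant regardless of vocabulary growth, and lookups are constant-time hash computations. A sketch, purely as an illustration of the idea:

```python
import hashlib

DIM = 2 ** 20  # fixed dimensionality, independent of vocabulary size

def feature_index(relation, attribute):
    """Hash a (relation, attribute) feature into a fixed-size index space.

    Memory stays bounded by DIM however large the corpus grows, and the
    lookup is an O(1) hash computation rather than a vocabulary search.
    """
    key = f"{relation}\t{attribute}".encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % DIM

i = feature_index("nmod", "dose")
print(i == feature_index("nmod", "dose"))  # True: deterministic index
```

The trade-off is occasional hash collisions, which are tolerable for count-based similarity models at sufficiently large DIM.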
The presentation briefly discusses some new ways to apply the morphological two-level model, including historical linguistics (see http://www.ep.liu.se/ecp/article.asp?issue=087&article=004&volume=) and the possibility of discovering two-level rules either informally by a linguist or even automatically (see http://jlm.ipipan.waw.pl/index.php/JLM/article/view/62). The benefits of including morphophonemic alternations in representations are pointed out. Alternations arise directly from the alignment, and they carry information beyond traditional underlying phonemic representations, often from historical stages of the language. The presentation also discusses the utility of collecting adequate sets of examples as a starting point before one begins to formulate the rules.
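To make the "examples first" idea concrete: a two-level example is an alignment of lexical and surface symbols, with "0" as the empty symbol, and a collection of such aligned pairs already determines the attested symbol correspondences before any rule is written. A hypothetical sketch (the Finnish gradation example is standard; the representation is an illustration, not the talk's actual format):

```python
# A two-level example: lexical:surface symbol pairs, "0" = empty symbol.
# Finnish consonant gradation kukka ~ kukan ("flower" ~ "flower's"):
# the second k of the geminate is deleted before the closed syllable.
examples = [
    [("k", "k"), ("u", "u"), ("k", "k"), ("k", "0"), ("a", "a"), ("n", "n")],
]

def lexical(aligned):
    """Read off the lexical string, dropping the empty symbol."""
    return "".join(l for l, _ in aligned if l != "0")

def surface(aligned):
    """Read off the surface string, dropping the empty symbol."""
    return "".join(s for _, s in aligned if s != "0")

def correspondences(exs):
    """The set of lexical:surface symbol pairs attested in the examples."""
    return {pair for ex in exs for pair in ex}

print(lexical(examples[0]))   # kukkan
print(surface(examples[0]))   # kukan
print(("k", "0") in correspondences(examples))  # True: gradation pair attested
```

A rule-discovery procedure, manual or automatic, then only needs to state the contexts in which pairs such as k:0 occur.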
Methods of using morphology and syntax to improve NMT
Post-editing has been shown to be more efficient than translation from scratch in various languages and domains. However, efficiency gains are often offset by a suboptimal interplay between translation workbenches (frontend) and machine translation systems (backend), meaning that translators cannot take advantage – or are not even aware – of what state-of-the-art machine translation technology can offer.
In this talk, I will summarise lessons learned from implementing post-editing workflows in the automotive and software localisation industry, as well as compare and contrast them with findings from academic research. I will focus on recent work on interactive machine translation protocols in particular, pointing out open research problems and opportunities in the intersection of machine translation, human–computer interaction and translation process research.

Short bio:
Samuel Läubli holds a Master’s degree in Artificial Intelligence from the University of Edinburgh. From 2014 to 2016, he designed and deployed machine translation systems as a Senior Computational Linguist at Autodesk. The systems were primarily used for software localisation through post-editing. In August 2016, Samuel started a PhD at the University of Zurich, focussing on interactive machine translation with neural networks.
Classical formal language theory is concerned with string sets. Between string sets and string relations there are interesting kinds of sets that consist of aligned multistrings. The aligned multistrings may be generated by an infinite code, instead of a finite alphabet. Still, the code itself builds on a finite alphabet and can be a regular language. The regular closure of such a code is closed under all Boolean operations, unlike regular relations. Due to this property, two-level relations over aligned multistrings can be described with local constraints and rules, and recognized with recurrent devices such as finite transducers and neural networks. The operation of contraction on multistring languages is a generalization of projection for regular relations, and it can produce crossing alignments even if the original relation does not have them. For these crossing alignments, there is a context-free code inspired by interval graphs. This finding may open a search for codes that encode various more complex linguistic graphs and support the implementation of Bayesian networks over multiple string variables, as proposed by Cotterell, Peng and Eisner (2015).
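As a concrete illustration of the objects involved, an aligned multistring over n tapes can be represented as a sequence of n-tuples of symbols, with an empty symbol on tapes that contribute nothing at a position; projection then keeps a subset of tapes. This representation is a hypothetical sketch for intuition, not the formalism of the talk:

```python
# An aligned multistring over n tapes: a sequence of n-tuples,
# with EPS as the empty symbol on a tape.
EPS = "0"

# Three tapes (toy example): lexical form, surface form, morph tag.
multistring = [
    ("k", "k", EPS),
    ("u", "u", EPS),
    ("k", "k", EPS),
    ("k", EPS, EPS),
    ("a", "a", EPS),
    ("n", "n", "+Gen"),
]

def project(ms, tapes):
    """Keep only the given tape indices (cf. projection of relations)."""
    return [tuple(col[i] for i in tapes) for col in ms]

def read_tape(ms, i):
    """Concatenate the symbols on tape i, skipping the empty symbol."""
    return "".join(col[i] for col in ms if col[i] != EPS)

print(read_tape(multistring, 0))        # kukkan
print(read_tape(multistring, 1))        # kukan
two_level = project(multistring, (0, 1))  # drop the tag tape
```

The interesting cases discussed in the talk arise when contraction of such languages yields crossing alignments, which this flat columnwise picture cannot express directly.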
The applied and less formal section of the talk overviews the linguistic relevance of multistring relations. The applications include phonological representations, two-level morphological relations, syntactic codes, coreferential annotation, text alignment, and semantics. Recent advances in sequence labeling and neural machine translation via deep LSTM recurrent neural networks have further increased the potential relevance of sequential representations of structures such as syntactic trees, alignments, and morphological segmentations. The applications suggest that multistrings are a natural class of objects prompting further research.