Research Seminar in Language Technology - Autumn 2016

Course Information

This is the timetable of the general research seminar in language technology. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests.

Time:

  • Thursdays 14-16 (different locations!)

Registration and Announcements:

  • Please subscribe to our e-mail list lt-research by sending a message to majordomo@helsinki.fi with the following text in the body of your e-mail:

    subscribe lt-research your.email@helsinki.fi

Tentative Schedule:

8 September
Place: U40 sali 12
Title: Information and News
Let's use the first session to inform each other about recent and upcoming events. Many of us participate in various conferences and workshops, and during this seminar we can collect highlights and research trends to keep everyone up to date with ongoing work. We will also use the remaining time to mention upcoming events and other news related to our research group.
22 September: Seppo Nyrkkö
Place: U40, room 25
Title: Word-concept identification and reference resolution in text stream
Abstract:

In this session I'll show a continuously learning word-concept model, which can process a text stream and categorize tokens into concepts and further train the existing word-concept model, which is stored in a feature vector space.

For each token (word), the model computes a syntactic co-occurrence feature vector, which assigns the word to a concept category with a probability estimate. The concept category can be previously found, manually defined, or a newly found candidate.

By hypothesis, a concept category can be assigned even to a partial (synonymy, anaphora) reference in the sample text. To demonstrate the functionality and the evaluation of word-concept similarity, I use the Stanford Parser with the English PCFG language model, and abstracts extracted from Medline articles as text samples. I'll explain the vector space model, which maps both the word tokens and their syntactic relation-attribute pairs into a common vector space, and discuss the benefits of this representation.

I will further discuss metrics of syntactic and semantic similarity, and ideas on how to improve and understand these results further. These will include machine learning, self-organizing maps and deep learning methods.

From a technological perspective, the experiment also aims at developing a limited-memory model that grows sub-linearly with the corpus size and provides fast seek times even for large text sets.
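
The concept assignment step described in this abstract (a feature vector per token, mapped to a concept category with a probability estimate, or flagged as a new candidate) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the speaker's implementation: the sparse dictionary features, the cosine similarity, the softmax used as a probability estimate, and the names assign_concept, temperature and min_similarity are all hypothetical.

```python
import math

NEW_CONCEPT = "<new candidate>"  # hypothetical label for an unseen concept


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_concept(features, concepts, temperature=0.1, min_similarity=0.2):
    """Return (concept, probability estimate) for one token.

    features: sparse co-occurrence vector of the token, keyed by
              "relation:attribute" strings (an assumed encoding).
    concepts: dict mapping concept names to centroid vectors.
    If no centroid is similar enough, a new candidate is proposed
    and no probability is returned.
    """
    sims = {name: cosine(features, centroid) for name, centroid in concepts.items()}
    best = max(sims, key=sims.get, default=None)
    if best is None or sims[best] < min_similarity:
        return NEW_CONCEPT, None
    # Softmax over similarities as a simple stand-in for a probability estimate.
    weights = {name: math.exp(s / temperature) for name, s in sims.items()}
    return best, weights[best] / sum(weights.values())


if __name__ == "__main__":
    concepts = {
        "drug": {"dobj:administer": 1.0, "nsubj:inhibit": 1.0},
        "disease": {"dobj:treat": 1.0, "nsubj:cause": 1.0},
    }
    print(assign_concept({"dobj:administer": 2.0, "nsubj:inhibit": 1.0}, concepts))
    print(assign_concept({"amod:blue": 1.0}, concepts))
```

A continuously learning variant would additionally update the matched centroid with the new feature vector; that step is left out here to keep the sketch short.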

6 October: CANCELED!
Place: U40 sali 12
13 October: Kimmo Koskenniemi
Place: U40 sali 12
Title: Alignment and two-level relations

The presentation briefly discusses some new ways to apply the morphological two-level model, including historical linguistics (see http://www.ep.liu.se/ecp/article.asp?issue=087&article=004&volume=) and the possibility of discovering two-level rules either informally by a linguist or even automatically (see http://jlm.ipipan.waw.pl/index.php/JLM/article/view/62). The benefits of including morphophonemic alternations in representations are pointed out. Alternations arise directly from the alignment, and they carry information beyond traditional underlying phonemic representations, often from earlier historical stages. The presentation also discusses the utility of collecting adequate sets of examples as the starting point before one begins to think of the rules.

20 October: BAULT seminar - Austin Matthews, CMU
Place: U40 sali 12
Title: Leveraging Linguistic Knowledge in Neural Machine Translation

Methods of using morphology and syntax to improve NMT

27 October: BAULT seminar - Samuel Läubli, University of Zürich
Place: U40 sali 12
Title: Post-editing of Machine Translation in Practice: Is it the Way Forward?
Abstract:

Post-editing has been shown to be more efficient than translation from scratch in various languages and domains. However, efficiency gains are often offset by a suboptimal interplay between translation workbenches (frontend) and machine translation systems (backend), meaning that translators cannot take advantage – or are not even aware – of what state-of-the-art machine translation technology can offer.

In this talk, I will summarise lessons learned from implementing post-editing workflows in the automotive and software localisation industry, as well as compare and contrast them with findings from academic research. I will focus on recent work on interactive machine translation protocols in particular, pointing out open research problems and opportunities in the intersection of machine translation, human–computer interaction and translation process research.

Short Bio:

Samuel Läubli holds a Master’s degree in Artificial Intelligence from the University of Edinburgh. From 2014 to 2016, he designed and deployed machine translation systems as a Senior Computational Linguist at Autodesk. The systems were primarily used for software localisation through post-editing. In August 2016, Samuel started a PhD at the University of Zurich, focussing on interactive machine translation with neural networks.

10 November: CANCELED!
Place: U40 sali 28
17 November: Anssi Yli-Jyrä
Place: U40, common room at the 6th floor (B610)
Title: Towards a (formal) language theory of aligned multistrings and structured strings with relevance to linguistics and graphical annotations
Abstract:

Classical formal language theory is concerned with string sets. Between string sets and string relations there are interesting kinds of sets that consist of aligned multistrings. The aligned multistrings may be generated by an infinite code, instead of a finite alphabet. Still the code itself builds on a finite alphabet and can be a regular language. Regular closure of such a code is closed under all Boolean operations, unlike regular relations. Due to this property, two-level relations over aligned multistrings can be described with local constraints and rules and recognized with recurrent devices such as finite transducers and neural networks. The operation of contraction of multistring languages is a generalization of projection for regular relations, and it can produce crossing alignments even if the original relation does not have them. For these crossing alignments, there is a context-free code that is inspired by interval graphs. This advanced finding might be a beginning for a search for codes that encode various more complex linguistic graphs and support implementation of Bayesian networks over multiple string variables as proposed by Cotterell, Peng and Eisner (2015).

The applied and less formal section of the talk gives an overview of the linguistic relevance of multistring relations. The applications include phonological representations, two-level morphological relations, syntactic codes, coreferential annotation, text alignment and semantics. The recent advances in sequence labeling and neural machine translation via deep LSTM recurrent neural networks have further increased the possible relevance of sequential representations of structures such as syntactic trees, alignments and morphological segmentation. The applications suggest that multistrings are a natural class of objects prompting further research.

24 November: Robert Östling
Place: SSKH 209 (SOC&KOM, Snellmaninkatu 12?)
Title: How many languages can a language model model?
Abstract: This is a presentation that will be given at VarDial 2016. One of the purposes of the VarDial workshop series is to encourage research into NLP methods that treat human languages as a continuum, by designing models that exploit the similarities between languages and variants. In my work, I am using a continuous vector representation of languages that allows modeling and exploring the language continuum in a very direct way. The basic tool for this is a character-based recurrent neural network language model conditioned on language vectors whose values are learned during training. By feeding the model Bible translations in a thousand languages, not only does the learned vector space capture language similarity, but by interpolating between the learned vectors it is possible to generate text in unattested intermediate forms between the training languages.
8 December: Jörg Tiedemann
Place: U40, common room at the 6th floor (B610)
Title: Climbing Mount BLEU: The Strange World of Reachable High-BLEU Translations
Abstract: At EAMT 2016, we presented a method for finding oracle BLEU translations in phrase-based statistical machine translation using exact document-level scores. Experiments are shown where the BLEU score of a candidate translation is directly optimised in order to examine the properties of reachable translations with very high BLEU scores. This is achieved by running the document-level decoder Docent in BLEU-decoding mode, where proposed changes to the translation of a document are only accepted if they increase BLEU. The results confirm that the reference translation cannot in most cases be reached by the decoder, which is limited by the set of phrases in the phrase table, and demonstrate that high-BLEU translations are often of poor quality.
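
The acceptance rule mentioned in the abstract (a proposed change is kept only if it increases document-level BLEU) can be illustrated with a small hill-climbing sketch. This is not Docent's code, and several simplifications are assumptions made for illustration: the score is a toy BLEU that only uses unigrams and bigrams, each sentence has a fixed hypothetical list of candidate translations standing in for the phrase table, and the only proposal operation is swapping in another candidate for one sentence.

```python
import math
import random
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def doc_bleu(hyps, refs, max_n=2):
    """Toy document-level BLEU: clipped n-gram matches are pooled over all
    sentences before precisions and the brevity penalty are computed."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


def hill_climb(options, refs, steps=200, seed=0):
    """Start from the first candidate of every sentence and accept a random
    single-sentence change only if the document score strictly increases."""
    rng = random.Random(seed)
    current = [candidates[0] for candidates in options]
    best = doc_bleu(current, refs)
    history = [best]
    for _ in range(steps):
        i = rng.randrange(len(options))
        proposal = current[:i] + [rng.choice(options[i])] + current[i + 1:]
        score = doc_bleu(proposal, refs)
        if score > best:  # reject everything that does not improve BLEU
            current, best = proposal, score
            history.append(best)
    return current, history


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "it was very happy"]
    options = [
        ["a cat is on mat", "the cat sat on a mat", "cat the mat"],
        ["it happy was", "it was happy", "it was very glad"],
    ]
    translation, scores = hill_climb(options, refs)
    print(translation, scores)
```

Because the reference sentences are not among the candidates, the search in this toy setting can raise the score but cannot reach the reference, which loosely mirrors the reachability limitation described in the abstract.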