INTERACT planning document

Lauri Carlson

Department of General Linguistics

October 2000
 
 

Theory: Dialogue management architectures

The HLT Survey (http://cslu.cse.ogi.edu/HLTsurvey/ch6node2.html) treats dialogue modeling in two sections, general and spoken dialogue modeling. The general section establishes a classification of dialogue systems rather like the one presented here in terms of computational architectures:
 
Type    Architecture                       Characteristics     Example
FSA     transition networks                menu driven         CSLUrp
CF      recursive transition networks      subdialogues        Nuance
ATN/UG  augmented transition networks      information states  Philips, Trindi
AI      planning, inference, optimisation  dialogue games      SRI Autoroute, Trains
SM      statistics, machine learning       best practice       ATT, Verbmobil

A related classification is in terms of bottom up (data or goal driven) and top down (script or plan-driven) architectures. In a bottom-up system, the system uses its partner as a source of data to fill in a form. In a top-down system, the system follows a script or plan for the dialogue. The difference is one of degree: a bottom up system has a loose script and narrow success conditions; a top down system has a narrow script and loose success conditions. Bottom up (client talks, system reacts) and top down (system talks, client reacts) strategies have different applications, and the strategy can vary with the client and the listening conditions.

The dialogue models described in http://cslu.cse.ogi.edu/HLTsurvey/ch6node5.html, dialogue grammars on the one hand and plan-based models or joint action theories of dialogue on the other, appear to exemplify the FSA/CF and AI architectures, respectively. While the first two seem too rigid, the drawback of the AI approach is that plan recognition and planning are combinatorially intractable in the worst case and in some cases undecidable.

The TRINDI system appears to exemplify an intermediate type, where the dialogue is driven by the current and preferred information states of the participants. There is scant work to date on applying soft computing methods to dialogue management.

Spoken dialogue systems

The terms dialogue component and dialogue management system come more specifically from speech research (http://cslu.cse.ogi.edu/HLTsurvey/ch6node5.html#SECTION63): "The need for a dialogue component in a system for human-machine interaction arises for several reasons. Often the user does not express his requirement with a single sentence, because that would be impractical; assistance is then expected from the system, so that the interaction may naturally flow in the course of several dialogue turns. Moreover, a dialogue manager should take care of identifying, and recovering from, speech recognition and understanding errors.

The studies on human-machine dialogue have historically followed two main theoretical guidelines traced by research on human-human dialogue. Discourse analysis, developed from studies on speech acts [Sea76], views dialogue as a rational cooperation and assumes that the speakers' utterances be well-formed sentences. Conversational analysis, on the other hand, studies dialogue as a social interaction in which phenomena such as disfluencies, abrupt shift of focus, etc., have to be considered [Lev83]. Both theories have contributed to the design of human-machine dialogue systems; in practice, freedom of design has to be constrained so as to find an adequate match with the other technologies the system rests on. For example, dialogue strategies for speech systems should recover from word recognition errors."

The research focus in INTERACT could be, in accordance with the above and the structure of the consortium, the combination of symbolic and soft computing methods to improve the robustness of speech recognition and the adaptivity of dialogue strategy to user response. Put more concretely, the system will seem to understand a wide vocabulary because it switches recognition vocabulary when the input falls outside the current one, and it will seem to adapt to different speaking styles and speaking conditions because it varies its dialogue strategy according to what is spoken about (topic tracking) and how well it hears and understands (adaptive speech recognition and dialogue strategy).

More generally, we may want to frame the entire circuit from speech recognition to speech synthesis in stochastic or quantifiable terms, allowing choice between symbolic and soft computing methods anywhere in the circuit.

Desiderata: dialogue system ticklist

For black box evaluation of human-machine dialogue systems, feature checklists (ticklists) have been compiled inductively (Bohlin et al. 1999: Survey of existing interactive systems, http://www.ling.gu.se/research/projects/trindi).

In this section, the TRINDI checklist is organised along a model of idealised dialogue (Carlson 1983) to estimate what it takes in each case to achieve given behavior. The requirements are divided between speech technology (ST), language technology (LT), database design (DB), dialogue management (DM), and soft computing (SC).

Players


Q2. Is utterance interpretation sensitive to deictic context?

Example: now = system date

ST: recognition of indexical words

LT: recognition of indexical phrases, reduction rules to basic indexicals here, now, I, you, this

DM: Access to and conversion to/from system date/other indexical features (location, players)

Q12. Can the system recognise noisy input?

Example: (traffic noise in background)

ST: assessment of speech recognition success

DM: change of strategy, e.g. from bottom up to top down strategy

Q19. Is it possible to get connected to a human operator?

Q20. Does the system explicitly make it clear that it is not a human?

Epistemic alternatives

Q1. Is utterance interpretation sensitive to dialogue context?

Example: the bus = last mentioned bus (bus 10), other buses = not bus 10

ST: recognition of relevant words

LT: recognition of relevant phrases, interpretation rules

DM: Access to move history

Q6. Can the system deal with ambiguous designators?

Example: the downtown bus = bus to/from downtown

LT: recognition and representation of ambiguity

DM: resolution of ambiguity, possibly backtracking?

Q7. Can the system deal with negatively specified information?

Example: Not before Sunday = on or after Sunday

ST: recognition of negation words

LT: inference rules from negative to positive information

DB: negation in query language

Q9. Can the system deal with inconsistent information?

Example: today is Monday, tomorrow is Sunday

LT: interpretation rules check consistency

DM: Ask for correction, revise form accordingly

Q10. Can the system deal with belief revision?

Example: I want to go in the morning ... no, in the afternoon

DM: Acknowledge change, revise form

Q21. Is the domain adequately covered, i.e. are all aspects of the domain that the user might want information about covered?

LT: grammar of recognised fragment

SC: recognition of other domain related talk

Q23. Can the system keep track of several entities of the same type at the same time?

DM: representation of alternative threads of dialogue, backtracking

Preferences

Q14. Can the system deal with sub-dialogues concerning domain issues initiated by the user?

DM: representation of subdialogue hierarchy

SC: recognition of topic and topic shift

Q15. Can the system deal with sub-dialogues concerning system functions and abilities initiated by the user?

SC: recognition of metadialogue

Q16. Is it possible to get a system tutorial concerning the system constraints?

DM: include help along with dialogue (type "please answer by number only")

Q22. Can more than one type of information be obtained from the system?

Strategies

Q3. Can the system deal with answers to questions that give more information than was requested?

LT: form driven IE from answer

DM: bottom up dialogue strategy

Q4. Can the system deal with answers to questions that give different information than was requested?

LT: form driven IE from answer

DM: bottom up dialogue strategy

SC: topic recognition as fallback

Q5. Can the system deal with answers to questions that give less information than was requested?

LT: form driven IE from answer

DM: form driven dialogue planning

Q11. Can the system deal with no answer to a question at all?

DM: top-down dialogue strategy after timeout (system takes initiative)

Q13. Does the system give different feedback depending on the quality of recognised speech?

DM: top-down dialogue strategy at bad recognition levels

SC: topic tracking on basis of recognition rate

Q17. Can the system repeat an utterance on request?

DM: bottom-up dialogue strategy (system listens for user acknowledgement or requests for repetition)

Solution

Q8. Does the system only ask appropriate follow-up questions?

Example: When do you want to go? - I want to go to B. - Where do you want to go?

DM: Bottom up dialogue strategy

SC: Topic tracking

Q18. Can the system reformulate an utterance on request?

LT: Nondeterministic generation

In this paper, it is suggested that the project should pay attention to the underlined desiderata. How to do so is discussed in the next section.

Demo: System components and partner roles

The INTERACT demo is a spoken dialogue manager for querying Tampere bus timetables. The following system components are required, prima facie roles on the right:
 
 
Speech recognition  TaY/CS (speech recognizer), HY/ling (language model), TAIK, HUT (soft computing)
NL parser           HY/ling (off shelf tagging/parsing + ad hoc IE)
Database            TaY/CS (off shelf)
NL generator        HY (off shelf + ad hoc)
Speech synthesis    HY (speech project?)
Dialogue manager    TaY/CS, HY (symbolic processing), TAIK, HUT (soft computing)

Table 1

The Jaspis development architecture has been chosen as a common development platform (http://www.cs.uta.fi/hci/SUI/SUI2000/materiaali/jaspis.pdf).

Speech recognition

Given that speech recognition succeeds best with small preselected vocabularies, and that the recognition result is a weighted list of alternatives, vocabulary selection on the basis of heuristic topic tracking is a likely target for soft computing. When the speech recognition rate against the current vocabulary drops drastically, the system ought to consider the possibility of a topic shift and retry recognition with another, better fitting vocabulary and its associated dialogue plan. (One method of topic detection could be calibration against a number of topic-specific recognition vocabularies.)
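The calibration idea in the parenthesis can be sketched roughly as follows. This is only an illustration under stated assumptions: the topic names, the word lists, the threshold and the function names are all hypothetical, and the recognition rate is reduced to the fraction of hypothesised words that fall inside a vocabulary.

```python
# Hypothetical topic vocabularies; topics and words are illustrative only.
TOPIC_VOCABULARIES = {
    "timetable": {"bussi", "lähtee", "milloin", "seuraava", "hervantaan"},
    "tickets": {"lippu", "maksaa", "paljonko", "alennus", "hinta"},
    "help": {"apua", "ohje", "mitä", "sanoa", "toista"},
}


def coverage(words, vocabulary):
    """Fraction of hypothesised words found in the vocabulary
    (a crude stand-in for the recognition rate)."""
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


def select_topic(words, current_topic, threshold=0.5):
    """Keep the current topic unless its coverage drops below the threshold;
    in that case switch to the best-scoring topic vocabulary."""
    if coverage(words, TOPIC_VOCABULARIES[current_topic]) >= threshold:
        return current_topic
    return max(TOPIC_VOCABULARIES,
               key=lambda t: coverage(words, TOPIC_VOCABULARIES[t]))
```

In a real system the score would come from the recogniser's weighted list of alternatives rather than from word membership, and each topic would bring its own dialogue plan with it.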

http://www.itl.nist.gov/speech/tests/

http://www.itl.nist.gov/iaui/894.01/publications/darpa99/index.htm

The timetable database

What the system can talk about depends on what it knows. It is assumed here that the bus timetable database consists minimally of facts (records) of the following primitive kind:

at(Bus,Stop,Time)

saying that a given bus (possibly identified as bus(Line,Departure)) is scheduled to be at stop Stop at time Time. The facts can be more limited (e.g. only terminal stops are scheduled) or indirect (e.g. only average time between departures is given), but in either case they can be reduced to the above.
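As a minimal sketch of how an indirect fact might be reduced to the primitive kind, the following expands "line L passes this stop every N minutes" into explicit at(Bus,Stop,Time) records. The representation of time as minutes after midnight, the "line/departure" naming of a bus and the function name are assumptions made for the example; the stop is simply taken to be the departure terminal.

```python
from typing import NamedTuple


class At(NamedTuple):
    """One timetable fact at(Bus, Stop, Time).
    Time is minutes after midnight (an assumption of this sketch)."""
    bus: str
    stop: str
    time: int


def expand_headway(line, stop, first, last, headway):
    """Reduce an indirect fact (a departure every `headway` minutes between
    `first` and `last`) to explicit at(Bus, Stop, Time) facts.
    Each bus is named bus(Line, Departure), written here as 'line/departure'."""
    return [At(f"{line}/{t}", stop, t)
            for t in range(first, last + 1, headway)]


# Line 10 at hervanta every 30 minutes from 6:00 to 7:00.
facts = expand_headway("10", "hervanta", first=360, last=420, headway=30)
```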

Arbitrary routes can be built up by sequencing such facts. Route scheduling will not be addressed in this project, which means that we restrict attention at most to comparisons of individual pairs of facts.

Here is a list of entities, queries and facts that can be expressed (in fake SQL):
 
the bus stops at Hervanta              at(Bus,hervanta,T)
the bus goes via Hervanta              at(Bus,hervanta,T)
end station of the bus                 select stop from at where bus = Bus and time = (select max(time) from at where bus = Bus)
the bus goes from station to Hervanta  at(Bus,station,T1) and at(Bus,hervanta,T2) and T1<T2
what time is it?                       now
where are we?                          here
which bus is this?                     this
the next bus                           select bus from at where stop = here and time = (select min(time) from at where stop = here and time > now)
the next stop                          select stop from at where bus = this and time = (select min(time) from at where bus = this and time > now)
how many buses?                        select count(distinct bus) from at

Table 2
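To check that queries of this kind run against a real engine, a few of them can be tried with an in-memory SQLite database. The sketch below is an illustration only: the sample rows, the encoding of time as minutes after midnight and the function names are assumptions, and the table name is quoted to keep it clear of SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "at" (bus TEXT, stop TEXT, time INTEGER)')
conn.executemany(
    'INSERT INTO "at" VALUES (?, ?, ?)',
    [
        ("10", "station", 600), ("10", "hervanta", 630),
        ("20", "station", 610), ("20", "pispala", 625),
    ],
)


def next_bus(here, now):
    """'the next bus': the bus with the earliest time at `here` after `now`."""
    row = conn.execute(
        'SELECT bus FROM "at" WHERE stop = :here AND time = '
        '(SELECT MIN(time) FROM "at" WHERE stop = :here AND time > :now)',
        {"here": here, "now": now},
    ).fetchone()
    return row[0] if row else None


def end_station(bus):
    """'end station of the bus': the stop with the latest time for `bus`."""
    row = conn.execute(
        'SELECT stop FROM "at" WHERE bus = :bus AND time = '
        '(SELECT MAX(time) FROM "at" WHERE bus = :bus)',
        {"bus": bus},
    ).fetchone()
    return row[0] if row else None


def goes_from_to(bus, origin, destination):
    """'the bus goes from origin to destination': two facts with T1 < T2."""
    row = conn.execute(
        'SELECT 1 FROM "at" a JOIN "at" b ON a.bus = b.bus '
        'WHERE a.bus = ? AND a.stop = ? AND b.stop = ? AND a.time < b.time',
        (bus, origin, destination),
    ).fetchone()
    return row is not None
```

The indexicals now, here and this appear as ordinary parameters, which is where the dialogue manager would supply the system date and location.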

NL parser/generator

The list of putative queries compiled by Tommi Jauhiainen (link) could be worked into a small (overgenerating) grammar:

Query := Subject, Predicate, Source?, Goal?, Loc?, LocTime?, TimeFrame?, Duration?, AbsFreq?

Subject := Attr Bus Number

Attr := Place GEN | numeron Number | seuraava, viimeinen, kutosen, monesko, mikä, montako

Predicate := NEG Verb kO?

Verb := on, seisoo, odottaa, lähtee, tulee, pysähtyy, menee, ajaa, kulkee, liikennöi, …

Bus := bussi, linja-auto, auto, vuoro, nysse…

Number := 1,2,…

Source := Place (ELA|ABL)

Goal := Place (ILL|ALL)

Loc := Place (INE|ADE)| Place GEN kautta

Place := asema, yliopisto, Hervanta, Amur, Pispala, Mannerheiminkatu, …

LocTime := DateTime (INE|ADE|ESS) | ennen DateTime PTV | DateTime GEN jälkeen | Number TimeUnits GEN kuluttua | arkipäivisin, pyhinä, aattona, milloin, kuinka pian, miten pian …

DateTime := ClockTime|TimeofDay|DayofWeek|Date…

TimeFrame := Number TimeUnits INE | miten nopeasti, nopeimmin, kuinka pian, missä ajassa, mihin mennessä

Duration := Number TimeUnits | kuinka kauan, miten pitkään

AbsFreq := Number kertaa | Number STI| monestiko

RelFreq := AbsFreq TimeFrame | Number TimeUnits GEN välein | miten usein, miten tiheästi
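The Source, Goal and Loc rules lend themselves to a simple slot-filling step, sketched below. It assumes that a morphological analyser has already turned each word into a (lemma, case tag) pair; that input format, the small place list and the function name are assumptions of the sketch, not part of the grammar.

```python
# Case tags from the grammar, mapped to the slot they signal for a Place.
CASE_TO_SLOT = {
    "ELA": "Source", "ABL": "Source",  # elative, ablative: from
    "ILL": "Goal", "ALL": "Goal",      # illative, allative: to
    "INE": "Loc", "ADE": "Loc",        # inessive, adessive: in, at
}

# A few Place lemmas taken from the grammar above.
PLACES = {"asema", "yliopisto", "Hervanta", "Pispala"}


def fill_slots(tokens):
    """tokens: (lemma, case_tag) pairs, assumed to come from a morphological
    analyser. Returns the Source/Goal/Loc slots found; a later filler for the
    same slot overwrites an earlier one."""
    slots = {}
    for lemma, case in tokens:
        if lemma in PLACES and case in CASE_TO_SLOT:
            slots[CASE_TO_SLOT[case]] = lemma
    return slots


# "lähtee asemalta Hervantaan" (leaves from the station for Hervanta)
example = fill_slots([("lähteä", None), ("asema", "ABL"), ("Hervanta", "ILL")])
```

The "Place GEN kautta" alternative of Loc is not covered; it would need a look at the following word as well.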

Indirect speech acts

The foregoing grammar can be mapped rather directly to the database. There is a large number of indirect speech acts that can also be converted to these types. There could be a separate rewriting component for such cases.

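ehtiikö, pääseekö bussilla B -> Meneekö B [does one make it, can one get there on bus B -> does B go there]

on tulossa, menossa, matkalla -> tulee, menee nyt [is coming, going, on its way -> comes, goes now]

haluaisin/pitäisi päästä -> meneekö bussia, milloin menee bussi, mikä bussi menee [I would like to / I need to get to -> is there a bus going, when does a bus go, which bus goes]

kauanko joutuu odottamaan bussia B -> milloin bussi B tulee [how long does one have to wait for bus B -> when does bus B come]

ajaa A:n ja B:n väliä -> ajaa A:sta B:hen [runs between A and B -> runs from A to B]

mikä on nopein keino päästä B:hen? -> mikä bussi on ensinnä B:ssä? [what is the fastest way to get to B? -> which bus is first at B?]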
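A separate rewriting component could start out as an ordered list of pattern and replacement pairs. The sketch below transcribes two of the schematic rules above as regular expressions; treating A and B as single surface words is an assumption made for brevity, and a real component would have to work on morphologically analysed input instead.

```python
import re

# (pattern, replacement) pairs transcribed from two of the schematic rules.
# A and B are assumed to be single-word bus or place names.
REWRITE_RULES = [
    (re.compile(r"kauanko joutuu odottamaan bussia (\w+)"),
     r"milloin bussi \1 tulee"),
    (re.compile(r"ajaa (\w+):n ja (\w+):n väliä"),
     r"ajaa \1:sta \2:hen"),
]


def rewrite(utterance):
    """Apply the first matching rule; return the utterance unchanged
    if no rule matches."""
    for pattern, replacement in REWRITE_RULES:
        if pattern.search(utterance):
            return pattern.sub(replacement, utterance)
    return utterance
```

Keeping the rules in a list makes their order explicit, which matters as soon as two patterns can match the same utterance.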

Ellipses, anaphors

Ellipses, or sentence fragments, form a large class of cases to handle. Many correction moves belong here:

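Ei, vaan keskustasta. [No, from the centre instead.]

Mitä muita busseja sinne menee? [What other buses go there?]

Entä Hervannasta? [What about from Hervanta?]

Seuraava bussi Hervantaan kiitos. [The next bus to Hervanta, please.]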

Other queries/speech acts

The system ought to be prepared for other types of queries and other speech acts: at a minimum, it should recognise such topics or moves as extraneous and decline or avoid them meaningfully. This means that there must be nested levels of analysis: more detailed analysis in the form filling paradigm for in-dialogue moves, and less detailed topic tracking, possibly in a soft computing paradigm, for off-dialogue moves.

In fact the parsing or IE problem itself, i.e. the mapping of recognised bits of speech onto information states, can also be framed in its entirety as a soft computing problem: finding out how given patterns of recognised input affect the probabilities of given information states.

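Missä on yliopisto? Miten pääsen suorinta tietä yliopistolle? [Where is the university? How do I get to the university by the most direct route?]

Mitä kautta/reittiä bussi B kulkee? Mikä on nopein yhteys Hervantaan? [Which way/route does bus B go? What is the fastest connection to Hervanta?]

Onko linjalla matalalattiabussia? Mistä ovesta mahtuvat lastenvaunut? [Is there a low-floor bus on the line? Through which door does a pram fit?]

Paljonko maksaa? Mistä lippuja saa ostaa? Onko alennuksia? Saako bussissa maksaa? Minkä yhtiön bussi B on? [How much does it cost? Where can one buy tickets? Are there discounts? Can one pay on the bus? Which company runs bus B?]

Mistä (miltä laiturilta) lähtee seuraava bussi Hervantaan? [Where (from which platform) does the next bus to Hervanta leave?]

Missä on taksiasema (neuvonta, vessa, kioski, ruokakauppa...)? [Where is the taxi stand (information desk, toilet, kiosk, grocery shop...)?]

Talar ni svenska? Do you speak English? Govorite po-russki? [Do you speak Swedish? Do you speak English? Do you speak Russian?]

Mitäs tästä voi kysyä? No mitä mun nyt pitää sanoa? Maksaaks tää jotain? [What can one ask this thing? So what am I supposed to say now? Does this cost anything?]

Mitä? Miten se oli? Anteeksi, ei kuulu. Toistakko vielä. [What? How was that? Sorry, I can't hear. Would you repeat that once more.]

Haloo? Kuuluuko? Onko siellä ketään? Tää meni ihan mykäks. Toimiiks tää? [Hello? Can you hear me? Is anyone there? This went completely silent. Is this working?]

No niin, koetetaas uuelleen. Ainiin, mä en painanu tosta. [All right, let's try again. Oh right, I didn't press that.]

Hei, älä mee, oota mä koklaan tätä. Hehe. (Älä viitti. Tuu jo!) [Hey, don't go, wait, I'll try this out. Hehe. (Come off it. Come on already!)]

Voi voi, eihän tästä mitään tule. Eikös tähän ole minkäänlaista käyttöohjetta? [Oh dear, this is getting nowhere. Aren't there any instructions for this at all?]

Vitsi tää ohjelma on syvältä. Turpa kiinni. Arvaa. (kirosanoja) [Man, this program sucks. Shut up. Guess. (swear words)]

Mulla ois nyt seuraava ongelma. Sitten vielä toinen kysymys. [I've got the following problem now. Then one more question.]

Hetkinen. Tota noin, ootas vähän. Annas olla. [Just a moment. Well, um, hang on a bit. Never mind.]

Kiitos, näkemiin. [Thank you, goodbye.]

Apua! Poliisi! [Help! Police!]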

Dialogue manager

Given the theory survey above, the architecture of the dialogue manager should be sought in the direction of the intermediate type of model represented by TRINDI: a rather flexible dialogue grammar, mainly driven by the information goal of the system, which in turn is inferred from the topic derived from the client's initial request.

Form filling paradigm

Roughly, the client's initial question selects a query form for the system to fill, and the ensuing dialogue is primarily geared to filling out the rest of the form. Some simple form of information extraction (IE) is applied to the client's subsequent responses to fill out fields in the query form. This bottom up strategy produces more intelligent follow-up questions when answers provide useful extra information.

At the same time, the system must be on the lookout for possible topic shifts in the answer: if the recognition rate is markedly low relative to the current vocabulary, a fallback strategy that searches for a possible topic shift is applied, with a corresponding change in recognition vocabulary and dialogue plan.
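The form filling loop can be illustrated with a small sketch. The form fields, the canned questions and the function names are assumptions made for the example; the IE step is represented only by its result, a dictionary of extracted field values.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class QueryForm:
    """A hypothetical query form for a timetable question."""
    source: Optional[str] = None
    goal: Optional[str] = None
    time: Optional[str] = None


# One canned question per field, asked only while the field is empty.
QUESTIONS = {
    "source": "Where are you leaving from?",
    "goal": "Where do you want to go?",
    "time": "When do you want to travel?",
}


def update(form, extracted):
    """Bottom up: copy every field the IE step found, not only the one
    the system asked about."""
    for name, value in extracted.items():
        if hasattr(form, name) and value is not None:
            setattr(form, name, value)
    return form


def next_question(form):
    """Ask about the first field that is still unfilled."""
    for f in fields(form):
        if getattr(form, f.name) is None:
            return QUESTIONS[f.name]
    return None  # form complete, ready to query the database
```

Because update accepts any field, an answer that gives more than was asked for shortens the dialogue, and the system never asks for something it has already been told.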

Speech recognition - topic tracking circuit

The basic principle of adapting dialogue strategy to talking conditions is simple. In general, it is more polite for the system to offer help, i.e. listen to the client and answer their questions; consequently, a bottom up strategy ought to be the primary strategy. However, when recognition deteriorates, the system may have to switch to a top down strategy, where the system asks the client questions and the client answers. This pares down the search space, as the recognition set is reduced to the set of possible answers to the questions asked. At the same time, the system may switch from no confirmation, or only indirect confirmation, of understanding to explicit confirmation moves. (Cf. Litman, D., Kearns, M., Singh, S., Walker, M. Automatic Optimization of Dialogue Management. COLING 18, Saarbrücken 2000)

Thus one task is to design a feedback circuit from speech recognition success to dialogue manager's dialogue strategy choice.
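One possible shape for such a feedback circuit is a pair of thresholds on a recognition confidence score. The score range, the threshold values and the function name below are assumptions; the gap between the two thresholds is a design choice of the sketch, meant to keep the system from changing strategy on every small fluctuation.

```python
def choose_strategy(current, confidence, low=0.4, high=0.7):
    """Return (strategy, confirmation) for the next turn.

    `confidence` is a recognition score in [0, 1]. Below `low` the system
    takes the initiative, above `high` it hands it back, and in between it
    keeps whatever strategy it is currently using."""
    if confidence < low:
        strategy = "top-down"    # system asks, client answers
    elif confidence > high:
        strategy = "bottom-up"   # client asks, system answers
    else:
        strategy = current       # inside the band: no change
    confirmation = "explicit" if strategy == "top-down" else "implicit"
    return strategy, confirmation
```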

Dialogue manager - database query circuit

The choice of follow-up questions is guided by interim queries to the database. The answer set of the final query must be small enough to produce a manageable voice response (assuming that the system only has a speech synthesiser channel and no facility to show timetables on screen). When the putative answer set is too large, the system tells the client so and tries to narrow it down by further questions to the user. A complementary device is to frame the answer in the form of an explanatory dialogue, interpolating requests for confirmation between parts of the answer.

Thus another task is to design a feedback circuit from the database manager to the dialogue manager's questioning strategy choice.
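A minimal sketch of this second circuit, assuming that the interim query returns rows as dictionaries and that the dialogue manager knows which form fields are still open. The size limit, the heuristic for picking the next question and the function name are all assumptions.

```python
MAX_SPOKEN_ANSWERS = 3  # assumed limit on what a voice response can carry


def respond(candidates, open_fields):
    """candidates: rows returned by the interim query, as dicts.
    open_fields: form fields the client has not yet specified.
    Returns ('answer', rows) or ('ask', field)."""
    if len(candidates) <= MAX_SPOKEN_ANSWERS or not open_fields:
        # Small enough to speak, or nothing left to ask about. Splitting a
        # long answer into confirmed chunks is left out of this sketch.
        return ("answer", candidates)
    # Ask about the open field with the most distinct values among the
    # candidates, on the heuristic that its answer narrows the set most.
    best = max(open_fields,
               key=lambda f: len({row[f] for row in candidates}))
    return ("ask", best)
```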

Information state

A third task is to design the internal representation of a dialogue situation and its history. In a form based approach, the information state of the system is primarily represented by a (partially filled) query form; the information goal is a completed form, and a dialogue plan is generated from the gaps in the form.

More generally, the information state, characterised as a choice from a set of partially filled query forms, can be considered a random variable with a probability distribution over the set of such states.
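Read this way, each recognised input updates the distribution over information states. A minimal sketch using Bayes' rule follows; the two candidate states, the likelihood values and the function name are invented for the illustration.

```python
def bayes_update(prior, likelihood, observation):
    """prior: {state: P(state)}.
    likelihood: {state: {observation: P(observation | state)}}.
    Returns the posterior {state: P(state | observation)}."""
    unnormalised = {
        s: p * likelihood[s].get(observation, 0.0) for s, p in prior.items()
    }
    total = sum(unnormalised.values())
    if total == 0:
        # The observation is impossible under every state in this model,
        # so it tells us nothing: keep the prior.
        return dict(prior)
    return {s: v / total for s, v in unnormalised.items()}


# Two candidate information states, named by their goal field.
prior = {"goal=Hervanta": 0.5, "goal=Pispala": 0.5}
likelihood = {
    "goal=Hervanta": {"hervantaan": 0.8, "pispalaan": 0.1},
    "goal=Pispala": {"hervantaan": 0.2, "pispalaan": 0.7},
}
posterior = bayes_update(prior, likelihood, "hervantaan")
```

With these made-up numbers, hearing "hervantaan" raises the probability of the Hervanta state from 0.5 to 0.8.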

It is an open question whether the system needs to maintain a separate representation of the client's information state.

Dialogue history

To appear halfway intelligent, the system also has to keep track of what questions it has asked and what responses it has obtained, i.e. some representation of the dialogue history. A design question is the form or forms in which the dialogue history is registered and used, from the very surface to more digested forms: speech recognition input/output, word recognition and parsing results, information states.

For instance, ellipsis and anaphora recognition requires access to both the syntax of a previous move and its semantics (in the form filling paradigm):

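A: Bussi 10 lähtee Hervantaan kahden minuutin kuluttua. [Bus 10 leaves for Hervanta in two minutes.]

Q: Mitä muita busseja sinne menee? [What other buses go there?]

Q: Entä takaisinpäin? [What about the way back?]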

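The system ought to back up gracefully from failed disambiguation, e.g.

Q: Miten pääsen maanantaina Hervantaan? [How do I get to Hervanta on Monday?]

A: Hervantaan lähtee maanantaina 31.2. kaksisataa vuoroa. Mihin aikaan päivästä haluat matkustaa? [Two hundred services leave for Hervanta on Monday 31.2. What time of day do you want to travel?]

Q: Tarkoitin huomenaamuna. [I meant tomorrow morning.]

A: Hervantaan lähtee huomenna sunnuntaina 30.2. kello 6 ja 9 välillä 10 vuoroa 15 minuutin välein. Haluatko tarkempaa tietoa? [Tomorrow, Sunday 30.2., 10 services leave for Hervanta between 6 and 9 o'clock at 15-minute intervals. Do you want more detailed information?]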

At the same time, the system must not exhibit memory or planning that is too machine-like: it should avoid reference to several moves back in the dialogue, pushing into and popping from extended subdialogues, or backtracking to earlier choice points. A question-counter-question-answer-reply sequence to depth one is a likely upper limit. The system ought to accept user repairs gracefully.

Q: Miten pääsen maanantaina Hervantaan? [How do I get to Hervanta on Monday?]

A: Hervantaan lähtee maanantaina 31.2. kaksisataa vuoroa. Mihin aikaan päivästä haluat matkustaa? [Two hundred services leave for Hervanta on Monday 31.2. What time of day do you want to travel?]

Q: Aamulla. [In the morning.]

A: Kello 6 ja 9 välillä lähtee Hervantaan 20 vuoroa 5 minuutin välein. Tarkennanko? [Between 6 and 9 o'clock, 20 services leave for Hervanta at 5-minute intervals. Shall I be more specific?]

Dialogue planning

Although explicit plan recognition is prohibitively expensive, a hand-coded version contributes to perceived intelligence: the system volunteers information, so that an answer addresses not only the literal question but also the likely underlying information need. For instance:

Q: Milloin lähtee seuraava bussi Hervantaan? [When does the next bus to Hervanta leave?]

A: Paikallisvuoro 10 lähtee Hervantaan nyt ja on perillä tunnin kuluttua. Pikavuoro 10x Hervantaan lähtee 5 minuutin kuluttua ja on perillä 20 minuutin kuluttua. [Local service 10 leaves for Hervanta now and arrives in an hour. Express service 10x to Hervanta leaves in 5 minutes and arrives in 20 minutes.]

What happens here is that the original query is reinterpreted as the query

Q: Mikä bussi menee nopeimmin Hervantaan? [Which bus gets to Hervanta fastest?]

revealing a supposed plan to get to Hervanta as soon as possible. It is best to leave such plan recognition implicit, as it is in natural human conversation.
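A hand-coded rule of this kind might look as follows. The tuple layout and the function name are assumptions; the sample figures in the test are those of the Hervanta example.

```python
def answer_next_bus(departures):
    """departures: (bus, leaves_in, arrives_in) tuples in minutes from now,
    for buses to the requested destination. Returns the buses to mention:
    the next one to leave and, if it is a different one, the one that
    arrives first."""
    if not departures:
        return []
    next_to_leave = min(departures, key=lambda d: d[1])
    first_to_arrive = min(departures, key=lambda d: d[2])
    mention = [next_to_leave]
    if first_to_arrive != next_to_leave:
        mention.append(first_to_arrive)  # volunteered information
    return mention
```

The supposed plan (getting there as soon as possible) never appears as a data structure; it is built into the choice of what to mention.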

Adaptive learning of dialogue strategy

A possible domain for adaptive machine learning methods is the optimisation of dialogue strategy while the system is in (test) use, on the basis of feedback from speech recognition rate and other measures of success or failure of the system to meet its goals (Cf. Litman, D., Kearns, M., Singh, S., Walker, M. Automatic Optimization of Dialogue Management. COLING 18, Saarbrücken 2000).
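As a deliberately simple stand-in for such learning, and not the method of the paper cited above, the choice between strategies could be treated as an epsilon-greedy bandit problem. The class name, the reward definition and the parameter values below are assumptions.

```python
import random


class StrategyLearner:
    """Epsilon-greedy choice between dialogue strategies, based on the
    average reward observed for each strategy so far."""

    def __init__(self, strategies, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in strategies}
        self.means = {s: 0.0 for s in strategies}

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))  # explore
        return max(self.means, key=self.means.get)     # exploit

    def record(self, strategy, reward):
        """reward: any success measure, e.g. 1.0 if the dialogue reached
        its goal and 0.0 otherwise."""
        self.counts[strategy] += 1
        n = self.counts[strategy]
        # Incremental update of the running mean.
        self.means[strategy] += (reward - self.means[strategy]) / n
```

A bandit ignores the state of the dialogue; conditioning the choice on recognition rate or turn number would be the next step.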

Speech synthesis

The timetable demo is a potential platform for testing the prosody synthesis studied in the USIX speech synthesis project. The information state and move history from which the dialogue system's responses are generated can readily supply the synthesis input with markup showing which parts of the information are old, new, or contrastive.
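How such markup might be derived can be sketched by comparing a response with what the dialogue history last recorded for each field. The flat field-value representation, the three status labels as strings and the function name are assumptions.

```python
def mark_information_status(response, history):
    """response: {field: value} about to be spoken.
    history: {field: value} as last mentioned in the dialogue.
    Returns [(field, value, status)], where status is 'old', 'new'
    or 'contrastive'."""
    marked = []
    for field, value in response.items():
        if field not in history:
            status = "new"
        elif history[field] == value:
            status = "old"
        else:
            status = "contrastive"  # same field, different value than before
        marked.append((field, value, status))
    return marked


marked = mark_information_status(
    {"bus": "10x", "goal": "Hervanta", "time": "5 min"},
    {"bus": "10", "goal": "Hervanta"},
)
```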

Map information

As a relatively independent addition, we may consider adding map or GIS information about locations not explicitly mentioned in the timetable database; for instance, to allow queries like Which buses go downtown?

Links:

http://www.ling.gu.se/~sl/dialogue_links.html