Symbol Layer


Datatypes for Symbols and Symbol Alphabets

typedef Symbol Symbol
 A handle for a symbol name, i.e. for a string.
typedef SymbolSet SymbolSet
 A set of symbols aka an alphabet of symbols.
typedef SymbolIterator SymbolIterator
 Iterator over the symbols in a SymbolSet.
typedef SymbolPair SymbolPair
 A pair of symbols representing a transition in a transducer.
typedef SymbolPairSet SymbolPairSet
 A set of symbol pairs aka an alphabet of symbol pairs.
typedef SymbolPairIterator SymbolPairIterator
 Iterator over the set of symbol pairs in a SymbolPairSet.
typedef KeyTable KeyTable
 A table for storing Key-to-Symbol associations.

Defining and Using Symbols

Symbol define_symbol (const char *s)
 Define a symbol with name s.
bool is_symbol (const char *s)
 Whether the string s indicates a name for a symbol.
Symbol get_symbol (const char *s)
 Find the symbol for the symbol name s.
const char * get_symbol_name (Symbol s)
 Find the symbol name for the symbol s.
bool is_equal (Symbol s1, Symbol s2)
 Whether the symbol s1 is identical to symbol s2.

Defining and Using Alphabets of Symbols

SymbolSet * create_empty_symbol_set ()
 Define an empty set of symbols.
SymbolSet * insert_symbol (Symbol s, SymbolSet *Si)
 Insert s into the set of symbols Si and return the updated set.
bool has_symbol (Symbol s, SymbolSet *Si)
 Whether symbol s is a member of the set of symbols Si.

Iterators over Symbols

SymbolIterator begin_sigma_symbol (SymbolSet *Si)
 Beginning of the iterator for the symbol set Si.
SymbolIterator end_sigma_symbol (SymbolSet *Si)
 End of the iterator for the symbol set Si.
size_t size_sigma_symbol (SymbolSet *Si)
 Size of the iterator for the symbol set Si.
Symbol get_sigma_symbol (SymbolIterator Si)
 Get the symbol represented by the symbol iterator Si.

Defining and Using Symbol Pairs

SymbolPair * define_symbolpair (Symbol s1, Symbol s2)
 Define a symbol pair with input symbol s1 and output symbol s2.
Symbol get_input_symbol (SymbolPair *s)
 Get the input symbol of SymbolPair s.
Symbol get_output_symbol (SymbolPair *s)
 Get the output symbol of SymbolPair s.

Defining and Using Alphabets of Symbol Pairs

SymbolPairSet * create_empty_symbolpair_set ()
 Define an empty set of symbol pairs.
SymbolPairSet * insert_symbolpair (SymbolPair *p, SymbolPairSet *Pi)
 Insert p into the set of symbol pairs Pi and return the updated set.
bool has_symbolpair (SymbolPair *p, SymbolPairSet *Pi)
 Whether symbol pair p is a member of the set of symbol pairs Pi.

Iterators over Symbol Pairs

SymbolPairIterator begin_pi_symbol (SymbolPairSet *Pi)
 Beginning of the iterator for the symbol pair set Pi.
SymbolPairIterator end_pi_symbol (SymbolPairSet *Pi)
 End of the iterator for the symbol pair set Pi.
size_t size_pi_symbol (SymbolPairSet *Pi)
 Size of the iterator for the symbol pair set Pi.
SymbolPair * get_pi_symbolpair (SymbolPairIterator pi)
 Get the symbol pair represented by the symbol pair iterator pi.

Defining the connection between symbols and transducer keys

The 1:N relation between keys and symbols is useful for dealing with equivalence classes of symbols.

KeyTable * create_key_table ()
 Create an empty key table.
bool is_key (Key i, KeyTable *T)
 Whether i indicates an existing key in key table T.
bool is_symbol (Symbol s, KeyTable *T)
 Whether s indicates an existing symbol in key table T.
void associate_key (Key i, KeyTable *T, Symbol s)
 Associate the key i in the key table T with the symbol s.
Key get_key (Symbol s, KeyTable *T)
 Find the key for the symbol s in key table T.
Key get_unused_key (KeyTable *T)
 Return a Key which hasn't been associated to any symbol in key table T.
Symbol get_key_symbol (Key i, KeyTable *T)
 Find a symbol for the key i in key table T.
KeySet * get_key_set (KeyTable *T)
 A set of keys in key table T.
KeyTable * read_symbol_table (istream &is, bool binary=false)
 Read a symbol table from istream is and transform it to a key table. binary defines whether the symbol table is in binary or text format.
void write_symbol_table (KeyTable *T, ostream &os, bool binary=false)
 Transform the key table T to a symbol table and write it to ostream os. binary defines whether the symbol table is written in binary or text format.
KeyTable * gather_flag_diacritic_table (KeyTable *kt)
 Return a new key table only including those key/symbol pairs which correspond to flag-diacritic symbol names.

Reading Symbol Strings and Transducers

Read transducers (1) in text format from pair strings and input streams and (2) in binary format from files and input streams so that the keys used in the transducer are harmonized according to a key table.

TransducerHandle longest_match_tokenizer (TransducerHandle t, Key marker)
 Create a longest match tokenizer based on paths of transducer t.
TransducerHandle longest_match_tokenizer (KeySet *ks, KeyTable *kt)
TransducerHandle longest_match_tokenizer2 (KeySet *ks, KeyTable *kt)
TransducerHandle longest_match_tokenizer2 (KeyTable *kt)
KeyTable * recode_key_table (KeyTable *kt, const char *epsilon_replacement)
 Replace the epsilon in kt with epsilon_replacement.
KeyPairVector * tokenize_string_pair (TransducerHandle tokeniser, const char *upper, const char *lower, KeyTable *inputKeys)
 Change two strings to a transducer aligned character by character according to tokenisation by tokeniser. The path(s) of the result of composing the strings' UTF-8 representations against tokeniser are paired up to a new tokeniser from beginning to end. Empty positions at the end are filled with ε's.
KeyVector * tokenize_string (TransducerHandle tokeniser, const char *string, KeyTable *inputKeys)
 Change a string into an identity pair transducer as tokenised by tokeniser.
KeyVector * longest_match_tokenize (TransducerHandle tokenizer, const char *string, KeyTable *inputKeys)
KeyPairVector * longest_match_tokenize_pair (TransducerHandle tokenizer, const char *string1, const char *string2, KeyTable *inputKeys)
KeyPairVector * tokenize_pair_string (TransducerHandle tokeniser, char *pairs, KeyTable *inputKeys)
 Tokenise with tokeniser the string pairs, which consists of individual characters and colon-separated pairs, into a transducer.
TransducerHandle pairstring_to_transducer (const char *str, KeyTable *T)
 Create a one-path transducer as defined in pairstring form in str using the symbols defined in key table T.
TransducerHandle read_transducer_text (istream &is, KeyTable *T, bool sfst=false)
 Make a transducer as defined in text form in file using the key-to-printname relations defined in key table T. The parameter sfst defines whether SFST text format is used, otherwise AT&T format is used.
bool has_symbol_table (istream &is)
 Whether the transducer coming from istream is has a symbol table stored with it.
TransducerHandle read_transducer (istream &is, KeyTable *T)
TransducerHandle harmonize_transducer (TransducerHandle t, KeyTable *T_old, KeyTable *T_new)
void print_key_table (KeyTable *T)
TransducerHandle read_transducer (char *filename, KeyTable *T)
 Read a binary transducer from file filename and harmonize it according to the key table T.

Writing Symbol Strings and Transducers

Write transducers (1) in text format into symbol pair strings and output streams and (2) in binary format to output streams and files so that the print names associated to keys are stored with the transducer.

char * transducer_to_pairstring (TransducerHandle t, KeyTable *T, bool spaces=true)
 A pairstring representation of one-path transducer t using the symbols defined in key table T. spaces defines whether pairs are separated by spaces.
void print_transducer (TransducerHandle t, KeyTable *T, bool print_weights=false, ostream &ostr=std::cout, bool old=false)
 Print transducer t in text format using the symbols defined in key table T. The parameter print_weights indicates whether weights are included, the output stream ostr indicates where printing is directed. Parameter old indicates whether transducer t should be printed in old SFST text format instead of AT&T format.
void write_transducer (TransducerHandle t, KeyTable *T, ostream &os=std::cout, bool backwards_compatibility=false)
 Write t in binary form to output stream os. Key table T is stored with the transducer.
void write_transducer (TransducerHandle t, const char *filename, KeyTable *T, bool backwards_compatibility=false)
 Write transducer t to file filename. Key table T is stored with the transducer.

Detailed Description

Datatypes and functions related to symbols and the relation between symbols and keys.

Typedef Documentation

typedef KeyTable KeyTable

A table for storing Key-to-Symbol associations.

One key can be associated to several symbols but one symbol is associated to only one key.

Definition at line 57 of file symbol-layer.h.

typedef Symbol Symbol

A handle for a symbol name, i.e. for a string.

Symbol is the type of a handle for a symbol that can occur in a cell of an input or output tape, or as an input or output label of a transition in a transducer. It also covers special-use symbols that do not occur on tapes but only as input or output transition labels with a special interpretation (e.g. any, default, failure), which is indicated by an attribute of the transducer.

There is a global, session-specific table of Symbol-to-string relations, called the global symbol cache. In the symbol cache, each Symbol is associated with exactly one string and each string is represented by exactly one Symbol, i.e. the relation between strings and Symbols is one-to-one.
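The one-to-one relation between strings and handles can be illustrated with a small interning table. This is only a sketch in standard C++, not the library's implementation: SymbolId, SymbolCache and its member functions are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the library's Symbol handle type.
typedef std::size_t SymbolId;

// A toy symbol cache: each distinct string gets exactly one handle,
// and each handle maps back to exactly one string.
class SymbolCache {
public:
    // Return the handle for name, creating a new one if name is unknown.
    SymbolId define(const std::string& name) {
        std::unordered_map<std::string, SymbolId>::const_iterator it =
            ids_.find(name);
        if (it != ids_.end()) return it->second;
        SymbolId id = names_.size();
        names_.push_back(name);
        ids_[name] = id;
        return id;
    }

    // Whether name has already been defined.
    bool contains(const std::string& name) const {
        return ids_.count(name) != 0;
    }

    // Look up the name of an existing handle.
    const std::string& name_of(SymbolId id) const {
        assert(id < names_.size());
        return names_[id];
    }

private:
    std::unordered_map<std::string, SymbolId> ids_;
    std::vector<std::string> names_;
};
```

Defining the same name twice yields the same handle, which is what makes comparing two handles equivalent to comparing their names.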

Definition at line 34 of file symbol-layer.h.

typedef SymbolIterator SymbolIterator

Iterator over the symbols in a SymbolSet.

Definition at line 40 of file symbol-layer.h.

typedef SymbolPair SymbolPair

A pair of symbols representing a transition in a transducer.

Definition at line 43 of file symbol-layer.h.

typedef SymbolPairIterator SymbolPairIterator

Iterator over the set of symbol pairs in a SymbolPairSet.

Definition at line 49 of file symbol-layer.h.

typedef SymbolPairSet SymbolPairSet

A set of symbol pairs aka an alphabet of symbol pairs.

Definition at line 46 of file symbol-layer.h.

typedef SymbolSet SymbolSet

A set of symbols aka an alphabet of symbols.

Definition at line 37 of file symbol-layer.h.


Function Documentation

void associate_key ( Key  i,
KeyTable *  T,
Symbol  s 
)

Associate the key i in the key table T with the symbol s.

The symbol that is first associated with a key becomes the primary symbol for that key. If key i has already been associated with one or more symbols not equal to s, the symbol s becomes a parallel symbol for the key i.
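The primary/parallel distinction can be sketched with two maps, one per direction. The code below is a model in standard C++, not the library's KeyTable: KeyId, SymbolId and ToyKeyTable are hypothetical names, and two behaviours are assumptions of the sketch rather than documented facts, namely that re-associating a symbol with a different key is rejected and that the unused key returned is the smallest one.

```cpp
#include <cstddef>
#include <map>
#include <vector>

typedef unsigned short KeyId;   // hypothetical stand-in for Key
typedef std::size_t SymbolId;   // hypothetical stand-in for Symbol

class ToyKeyTable {
public:
    // Associate key with symbol. The first symbol given to a key is its
    // primary symbol; later ones are kept as parallel symbols.
    // Assumption of this sketch: a symbol that already belongs to a
    // different key is rejected (returns false).
    bool associate(KeyId key, SymbolId symbol) {
        std::map<SymbolId, KeyId>::const_iterator it = key_of_.find(symbol);
        if (it != key_of_.end()) return it->second == key;
        key_of_[symbol] = key;
        symbols_of_[key].push_back(symbol);
        return true;
    }

    bool has_key(KeyId key) const { return symbols_of_.count(key) != 0; }
    bool has_symbol(SymbolId s) const { return key_of_.count(s) != 0; }

    // Primary symbol of key. Precondition: has_key(key).
    SymbolId primary_symbol(KeyId key) const {
        return symbols_of_.find(key)->second.front();
    }

    // Key of symbol. Precondition: has_symbol(s).
    KeyId key_of(SymbolId s) const { return key_of_.find(s)->second; }

    // Assumption of this sketch: the smallest key without any symbol.
    KeyId unused_key() const {
        KeyId k = 0;
        while (has_key(k)) ++k;
        return k;
    }

private:
    std::map<KeyId, std::vector<SymbolId> > symbols_of_;
    std::map<SymbolId, KeyId> key_of_;
};
```

Keeping the symbols of a key in insertion order is what lets a lookup by key return the symbol that was associated first.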

SymbolPairIterator begin_pi_symbol ( SymbolPairSet *  Pi  )

Beginning of the iterator for the symbol pair set Pi.

SymbolIterator begin_sigma_symbol ( SymbolSet *  Si  )

Beginning of the iterator for the symbol set Si.

SymbolSet* create_empty_symbol_set (  ) 

Define an empty set of symbols.

SymbolPairSet* create_empty_symbolpair_set (  ) 

Define an empty set of symbol pairs.

KeyTable* create_key_table (  ) 

Create an empty key table.

The result has no associations defined between symbols and keys.

Symbol define_symbol ( const char *  s  ) 

Define a symbol with name s.

SymbolPair* define_symbolpair ( Symbol  s1,
Symbol  s2 
)

Define a symbol pair with input symbol s1 and output symbol s2.

SymbolPairIterator end_pi_symbol ( SymbolPairSet *  Pi  )

End of the iterator for the symbol pair set Pi.

SymbolIterator end_sigma_symbol ( SymbolSet *  Si  )

End of the iterator for the symbol set Si.

KeyTable* gather_flag_diacritic_table ( KeyTable *  kt  )

Return a new key table only including those key/symbol pairs which correspond to flag-diacritic symbol names.

Symbol get_input_symbol ( SymbolPair *  s  )

Get the input symbol of SymbolPair s.

Key get_key ( Symbol  s,
KeyTable *  T
)

Find the key for the symbol s in key table T.

KeySet * get_key_set ( KeyTable *  T  )

A set of keys in key table T.

Symbol get_key_symbol ( Key  i,
KeyTable *  T
)

Find a symbol for the key i in key table T.

If there are several symbols associated with the key, the primary symbol (the symbol that was first associated with the key) is returned.

Symbol get_output_symbol ( SymbolPair *  s  )

Get the output symbol of SymbolPair s.

SymbolPair* get_pi_symbolpair ( SymbolPairIterator  pi  ) 

Get the symbol pair represented by the symbol pair iterator pi.

Symbol get_sigma_symbol ( SymbolIterator  Si  ) 

Get the symbol represented by the symbol iterator Si.

Symbol get_symbol ( const char *  s  ) 

Find the symbol for the symbol name s.

Precondition:
s must refer to a symbol name. Use is_symbol to check this if you are not sure.

const char* get_symbol_name ( Symbol  s  ) 

Find the symbol name for the symbol s.

Key get_unused_key ( KeyTable *  T  )

Return a Key which hasn't been associated to any symbol in key table T.

bool has_symbol ( Symbol  s,
SymbolSet *  Si
)

Whether symbol s is a member of the set of symbols Si.

bool has_symbol_table ( istream &  is  ) 

Whether the transducer coming from istream is has a symbol table stored with it.

Precondition:
The transducer is in valid format and the end of stream has not been reached. Use read_format to check this.

bool has_symbolpair ( SymbolPair *  p,
SymbolPairSet *  Pi
)

Whether symbol pair p is a member of the set of symbol pairs Pi.

SymbolSet* insert_symbol ( Symbol  s,
SymbolSet *  Si
)

Insert s into the set of symbols Si and return the updated set.

SymbolPairSet* insert_symbolpair ( SymbolPair *  p,
SymbolPairSet *  Pi
)

Insert p into the set of symbol pairs Pi and return the updated set.

bool is_equal ( Symbol  s1,
Symbol  s2 
)

Whether the symbol s1 is identical to symbol s2.

bool is_key ( Key  i,
KeyTable *  T
)

Whether i indicates an existing key in key table T.

bool is_symbol ( Symbol  s,
KeyTable *  T
)

Whether s indicates an existing symbol in key table T.

bool is_symbol ( const char *  s  ) 

Whether the string s indicates a name for a symbol.

TransducerHandle longest_match_tokenizer ( TransducerHandle  t,
Key  marker 
)

Create a longest match tokenizer based on paths of transducer t.

Transducer t should be an acyclic identity pair transducer whose paths contain UTF-8 arc sequences of possible symbols. The resulting transducer is a cyclic transducer compatible with the tokenize* functions.
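The idea of longest-match tokenization can be shown on plain strings, without any transducers. The following is a sketch under that simplification: longest_match_split and utf8_length are hypothetical helper names, the symbol inventory is a std::set of names instead of the paths of t, and falling back to a single UTF-8 character when no symbol matches is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence that starts with lead byte b.
std::size_t utf8_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;  // invalid lead byte: consume a single byte
}

// Split text greedily: at each position take the longest known symbol,
// or one UTF-8 character if no known symbol starts there.
std::vector<std::string> longest_match_split(
        const std::string& text, const std::set<std::string>& symbols) {
    std::size_t max_len = 0;
    for (std::set<std::string>::const_iterator it = symbols.begin();
         it != symbols.end(); ++it) {
        max_len = std::max(max_len, it->size());
    }

    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t remaining = text.size() - pos;
        std::size_t len = std::min(max_len, remaining);
        // Try candidates from longest to shortest.
        while (len > 0 && symbols.count(text.substr(pos, len)) == 0) --len;
        if (len == 0) {
            unsigned char lead = static_cast<unsigned char>(text[pos]);
            len = std::min(utf8_length(lead), remaining);
        }
        tokens.push_back(text.substr(pos, len));
        pos += len;
    }
    return tokens;
}
```

With the symbols c, a, t and +pl in the set, the string cat+pl is split into c, a, t, +pl, because +pl is preferred over the single character +.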

TransducerHandle pairstring_to_transducer ( const char *  str,
KeyTable *  T
)

Create a one-path transducer as defined in pairstring form in str using the symbols defined in key table T.

The transitions must be written one after another separated by a space. (For automatic tokenization of symbols, see tokenize_pair_string.) If the input and output symbols are not equal, they are separated by a colon. If the backslash '\' and colon ':' are part of a symbol name, they must be escaped as "\\" and "\:".

For example, the string "a:\: cd:e" represents a transducer with consecutive transitions mapping "a" to ":" and "cd" to "e".
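The pairstring syntax described above can be made concrete with a small splitter that works on symbol names only. This is a sketch, not the library's parser: split_pairstring and NamePair are hypothetical names, and the result is a list of name pairs instead of a transducer.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> NamePair;

// Split a pairstring into (input, output) symbol-name pairs.
// Pairs are separated by spaces, input and output by an unescaped ':',
// and a backslash makes the following character literal.
std::vector<NamePair> split_pairstring(const std::string& text) {
    std::vector<NamePair> result;
    std::string input, output;
    bool in_output = false;  // true once an unescaped ':' was seen
    bool has_token = false;  // true if the current pair is non-empty

    for (std::size_t i = 0; i <= text.size(); ++i) {
        bool at_end = (i == text.size());
        char c = at_end ? ' ' : text[i];  // treat the end as a separator
        if (!at_end && c == '\\' && i + 1 < text.size()) {
            // Escaped character: copy the next character literally.
            (in_output ? output : input) += text[++i];
            has_token = true;
        } else if (!at_end && c == ':') {
            in_output = true;
            has_token = true;
        } else if (c == ' ') {
            if (has_token) {
                // Without a colon the pair is an identity pair.
                result.push_back(NamePair(input, in_output ? output : input));
            }
            input.clear();
            output.clear();
            in_output = false;
            has_token = false;
        } else {
            (in_output ? output : input) += c;
            has_token = true;
        }
    }
    return result;
}
```

Applied to the example string, the splitter yields the pairs ("a", ":") and ("cd", "e"); a token without a colon, such as b, yields the identity pair ("b", "b").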

See also:
transducer_to_pairstring

void print_transducer ( TransducerHandle  t,
KeyTable *  T,
bool  print_weights = false,
ostream &  ostr = std::cout,
bool  old = false 
)

Print transducer t in text format using the symbols defined in key table T. The parameter print_weights indicates whether weights are included, the output stream ostr indicates where printing is directed. Parameter old indicates whether transducer t should be printed in old SFST text format instead of AT&T format.

In HFST the print_weights parameter is ignored.

In AT&T and SFST format, the newline, horizontal tab, carriage return, vertical tab, formfeed, bell character, backspace, backslash and space are printed as "\n", "\t", "\r", "\v", "\f", "\a", "\b", "\\" and "\0x20". In SFST format, the colon and angle brackets are printed as "\:", "\<" and "\>".
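The escaping rules listed above amount to a character-by-character mapping. A minimal sketch follows; escape_symbol_name is a hypothetical helper written for this illustration, not a function of the library.

```cpp
#include <cstddef>
#include <string>

// Escape a symbol name for the text formats described above.
// When sfst is true, ':' '<' and '>' are escaped as well.
std::string escape_symbol_name(const std::string& name, bool sfst) {
    std::string out;
    for (std::size_t i = 0; i < name.size(); ++i) {
        char c = name[i];
        switch (c) {
            case '\n': out += "\\n"; break;
            case '\t': out += "\\t"; break;
            case '\r': out += "\\r"; break;
            case '\v': out += "\\v"; break;
            case '\f': out += "\\f"; break;
            case '\a': out += "\\a"; break;
            case '\b': out += "\\b"; break;
            case '\\': out += "\\\\"; break;
            case ' ':  out += "\\0x20"; break;
            case ':':
            case '<':
            case '>':
                if (sfst) out += '\\';  // only special in SFST format
                out += c;
                break;
            default:
                out += c;
        }
    }
    return out;
}
```

The space is escaped because the fields of a line are separated by whitespace, so an unescaped space inside a symbol name would be read as a field boundary.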

See also:
read_transducer_text

KeyTable* read_symbol_table ( istream &  is,
bool  binary = false 
)

Read a symbol table from istream is and transform it to a key table. binary defines whether the symbol table is in binary or text format.

Key table and symbol table are two ways of representing key-to-string mappings. Key tables are used during a session and symbol tables when moving or storing information between sessions.

During a session, a key table associates keys to symbol handles and the global symbol cache associates symbol handles to strings.

Between sessions, a symbol table associates keys directly to strings, as there is no symbol cache.

A symbol table in OpenFst text format lists each symbol name and its associated key on one line. The symbol name and the associated key are separated by a tabulator. If several symbol names are associated to the same key, the one listed first is considered the primary print name for that key.


KeyTable          Global symbol cache      Symbol table            Symbol table in text format     
--------          -------------------      ------------            ---------------------------

Key  Symbol       Symbol    string         Key   string            <> TAB 0
                                                                   <eps> TAB 0
 0     0, 1         0         "<>"          0      "<>", "<eps>"   a TAB 1 
 1     2            1         "<eps>"       1      "a"             b TAB 2
 2     4            2         "a"           2      "b"             c TAB 3
 3     5            3         "A"           3      "c" 
                    4         "b"
                    5         "c"
                    6         "d"
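Reading the text format shown in the rightmost column can be sketched as follows. The code is an illustration in standard C++, not the library's reader: ToySymbolTable and read_toy_symbol_table are hypothetical names, and the table maps each key directly to its list of names, the first of which is the primary print name.

```cpp
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Key -> symbol names in the order they were listed.
typedef std::map<unsigned, std::vector<std::string> > ToySymbolTable;

// Read lines of the form "name TAB key".
ToySymbolTable read_toy_symbol_table(std::istream& in) {
    ToySymbolTable table;
    std::string line;
    while (std::getline(in, line)) {
        // The key follows the last tab on the line.
        std::string::size_type tab = line.rfind('\t');
        if (tab == std::string::npos) continue;  // skip malformed lines
        std::string name = line.substr(0, tab);
        // Note: strtoul yields 0 for a non-numeric key field.
        unsigned key = static_cast<unsigned>(
            std::strtoul(line.c_str() + tab + 1, 0, 10));
        table[key].push_back(name);
    }
    return table;
}
```

For the five lines of the example, key 0 ends up with the names "<>" and "<eps>", with "<>" as the primary one because it is listed first.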

TransducerHandle read_transducer ( char *  filename,
KeyTable *  T
)

Read a binary transducer from file filename and harmonize it according to the key table T.

Precondition:
The transducer that is read must have a key table stored with it.

TransducerHandle read_transducer_text ( istream &  is,
KeyTable *  T,
bool  sfst = false 
)

Make a transducer as defined in text form in file using the key-to-printname relations defined in key table T. The parameter sfst defines whether SFST text format is used, otherwise AT&T format is used.

In AT&T and SFST format, the newline, horizontal tab, carriage return, vertical tab, formfeed, bell character, backspace, backslash and space must be escaped as "\n", "\t", "\r", "\v", "\f", "\a", "\b", "\\" and "\0x20". In SFST format, the colon and angle brackets must be escaped as "\:", "\<" and "\>".

An example of a transducer file:

AT&T                               AT&T UNWEIGHTED              SFST

0      0                           0                            final  0
0      1      a      aa     0.3    0      1      a      aa      0      a:aa   1
0      2      b      b      0      0      2      b      b       0      b      2
1      0      c      C      0.5    1      0      c      C       1      c:C    0
2      1      \n     c      0      2      1      \n     c       2      \n:c   1
2      0      a      A      1.2    2      0      a      A       2      a:A    0
2      2      d      D      1.65   2      2      d      D       2      d:D    2
2      0.5                         2                            final  2

The syntax of the lines in the text format is one of the following in the AT&T format:
- originating_node TAB destination_node TAB input_symbol TAB output_symbol (TAB transition_weight)
- final_node (TAB final_weight)

and one of the following in SFST format:
- originating_node TAB input_symbol:output_symbol TAB destination_node
- final TAB final_node

When AT&T format is used in HFST, weights are ignored. When SFST or AT&T unweighted format is used in HWFST, weights are set to zero.
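The two AT&T line shapes can be told apart by counting tab-separated fields. The sketch below shows that idea only; AttLine, split_tabs and parse_att_line are hypothetical names, symbol names are kept as strings instead of being looked up in a key table, and unescaping is left out.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// One parsed line: either a transition or a final-state declaration.
struct AttLine {
    bool is_final;
    int source;            // originating node, or the final node
    int target;            // destination node (transitions only)
    std::string input;
    std::string output;
    double weight;         // 0 when the optional weight is absent
    AttLine() : is_final(false), source(0), target(0), weight(0.0) {}
};

// Split a line at tab characters.
std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) fields.push_back(field);
    return fields;
}

// Parse one AT&T-format line. Returns false if the field count
// matches neither line shape.
bool parse_att_line(const std::string& line, AttLine& out) {
    std::vector<std::string> f = split_tabs(line);
    out = AttLine();
    if (f.size() == 1 || f.size() == 2) {
        out.is_final = true;
        out.source = std::atoi(f[0].c_str());
        if (f.size() == 2) out.weight = std::atof(f[1].c_str());
        return true;
    }
    if (f.size() == 4 || f.size() == 5) {
        out.source = std::atoi(f[0].c_str());
        out.target = std::atoi(f[1].c_str());
        out.input = f[2];
        out.output = f[3];
        if (f.size() == 5) out.weight = std::atof(f[4].c_str());
        return true;
    }
    return false;
}
```

Defaulting the weight to zero when it is absent mirrors the statement above that unweighted input gets zero weights in the weighted setting.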

Precondition:
All printnames used in the text format representation of the transducer must be in the key table T.
Postcondition:
file is not closed. Contents of file are not changed.
Returns:
A transducer as defined in file. If end of file is reached, NULL.
See also:
print_transducer

KeyTable* recode_key_table ( KeyTable *  kt,
const char *  epsilon_replacement 
)

Replace the epsilon in kt with epsilon_replacement.

When tokenizing input-strings, the strings should never contain a substring matching the symbol name of the epsilon key in the KeyTable used in tokenization. Therefore the epsilons in the tokenizer should be replaced by an internal epsilon-symbol, which is unlikely to occur in real input-strings.

recode_key_table returns a KeyTable, which is the same as kt, except the key 0 corresponds to the internal epsilon symbol name epsilon_replacement and the original epsilon symbol name corresponds to the first unused key in kt.
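The effect of the recoding can be sketched on a plain map from keys to print names. This is a model, not the library's code: ToyNames and recode_toy_table are hypothetical names, only one name per key is kept, and "first unused key" is read here as the smallest key not present in the table, which is an assumption.

```cpp
#include <map>
#include <string>

// Key -> print name (one name per key in this simplified model).
typedef std::map<unsigned, std::string> ToyNames;

// Return a copy of table in which key 0 carries replacement and the
// original name of key 0 has moved to the smallest unused key.
ToyNames recode_toy_table(const ToyNames& table,
                          const std::string& replacement) {
    ToyNames result(table);
    ToyNames::const_iterator eps = table.find(0);
    if (eps != table.end()) {
        unsigned unused = 0;
        while (result.count(unused) != 0) ++unused;
        result[unused] = eps->second;  // old epsilon name becomes ordinary
    }
    result[0] = replacement;
    return result;
}
```

The input table is left untouched, matching the description of the function as returning a key table rather than modifying kt.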

size_t size_pi_symbol ( SymbolPairSet *  Pi  )

Size of the iterator for the symbol pair set Pi.

size_t size_sigma_symbol ( SymbolSet *  Si  )

Size of the iterator for the symbol set Si.

KeyPairVector* tokenize_pair_string ( TransducerHandle  tokeniser,
char *  pairs,
KeyTable *  inputKeys
)

Tokenise with tokeniser the string pairs, which consists of individual characters and colon-separated pairs, into a transducer.

E.g. a string cat+pl:s will be made to c a t +pl:s given that tokeniser creates such tokens.

Parameters:
tokeniser A transducer that, upon composing leftwards against transducer made of UTF-8 characters of string, results in acyclic tokenisation(s) of original path.
pairs UTF-8 encoded string for transducer
inputKeys KeyTable that matches mapping of UTF-8 characters on input side of tokeniser.
Returns:
Transducer that contains as paths all possible aligned tokenisation(s) of upper : lower.
Todo:
Does not support ambiguous tokenisations (i.e. with more than one path).

KeyVector* tokenize_string ( TransducerHandle  tokeniser,
const char *  string,
KeyTable *  inputKeys
)

Change a string into an identity pair transducer as tokenised by tokeniser.

E.g. a string cat will be tokenised as transducer c a t, given that tokeniser creates tokens for c, a, and t.

Parameters:
tokeniser A transducer that, upon composing leftwards against transducer made of UTF-8 characters of string, results in acyclic tokenisation(s) of original path.
string UTF-8 encoded string for transducer pairs.
inputKeys KeyTable that matches mapping of UTF-8 characters on input side of tokeniser.
Returns:
Transducer that contains as paths the tokenisation(s) of string produced by tokeniser.
Todo:
Does not support ambiguous tokenisations (i.e. with more than one path).

KeyPairVector* tokenize_string_pair ( TransducerHandle  tokeniser,
const char *  upper,
const char *  lower,
KeyTable *  inputKeys
)

Change two strings to a transducer aligned character by character according to tokenisation by tokeniser. The path(s) of the result of composing the strings' UTF-8 representations against tokeniser are paired up to a new tokeniser from beginning to end. Empty positions at the end are filled with ε's.

E.g. strings cat dog are aligned as c:d a:o t:g. Strings ääliö ääliöitä are aligned as ä ä l i ö ε:i ε:t ε:ä, and talo+NOUN+SINGULAR+NOMINATIVE talo as t a l o +NOUN:ε +SINGULAR:ε +NOMINATIVE:ε, given that the tokeniser and key table contain those symbols.

If specific alignment is required, it is possible to specify ε’s manually using the string for ε that is defined in inputKeys.

A tokeniser may be built manually or with functions such as longest_match_tokenizer.
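The position-by-position alignment with ε padding at the end can be sketched on already tokenised strings. The code is an illustration only: Tokens, TokenPairs and align_tokens are hypothetical names, the tokens are plain strings instead of keys, and the ε name is passed in by the caller.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::string> Tokens;
typedef std::vector<std::pair<std::string, std::string> > TokenPairs;

// Pair the i-th upper token with the i-th lower token; once the
// shorter side runs out, fill its positions with epsilon.
TokenPairs align_tokens(const Tokens& upper, const Tokens& lower,
                        const std::string& epsilon) {
    TokenPairs pairs;
    std::size_t n = upper.size() > lower.size() ? upper.size()
                                                : lower.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::string& u = i < upper.size() ? upper[i] : epsilon;
        const std::string& l = i < lower.size() ? lower[i] : epsilon;
        pairs.push_back(std::make_pair(u, l));
    }
    return pairs;
}
```

Because padding is only added at the end, a different alignment has to be requested by placing ε tokens explicitly in the shorter input.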

Parameters:
tokeniser A transducer that, upon composing leftwards against transducer made of UTF-8 characters of string, results in acyclic tokenisation(s) of original path.
upper UTF-8 encoded string for input side of transducer.
lower UTF-8 encoded string for output side of transducer.
inputKeys KeyTable that matches mapping of UTF-8 characters on input side of tokeniser.
Returns:
Transducer that contains as paths all possible aligned tokenisation(s) of upper : lower.
Todo:
Does not support ambiguous tokenisations (i.e. with more than one path).

char* transducer_to_pairstring ( TransducerHandle  t,
KeyTable *  T,
bool  spaces = true 
)

A pairstring representation of one-path transducer t using the symbols defined in key table T. spaces defines whether pairs are separated by spaces.

The transitions are printed one after another, separated by spaces if so requested. If the input and output symbols are not equal, they are separated by a colon. If the backslash '\' and colon ':' are part of a symbol print name, they are escaped as "\\" and "\:".

The empty transducer is represented by "\empty_transducer" and the epsilon transducer as "EPS" where EPS is the symbol name for epsilon (pairstring_to_transducer recognizes "" as the epsilon transducer, but "EPS" is a more user-friendly notation). If the symbol name for epsilon is not defined, "\epsilon" is returned.
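The printing rules for ordinary pairs can be sketched on a list of symbol-name pairs. This is a simplified model: join_pairstring, escape_pair_symbol and NamePair are hypothetical names, and the special cases for the empty and the epsilon transducer described above are not covered.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> NamePair;

// Escape backslash and colon, the two characters with special meaning.
std::string escape_pair_symbol(const std::string& name) {
    std::string out;
    for (std::size_t i = 0; i < name.size(); ++i) {
        if (name[i] == '\\' || name[i] == ':') out += '\\';
        out += name[i];
    }
    return out;
}

// Print pairs one after another; identity pairs are written once,
// other pairs as input:output.
std::string join_pairstring(const std::vector<NamePair>& pairs,
                            bool spaces) {
    std::string out;
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        if (i > 0 && spaces) out += ' ';
        out += escape_pair_symbol(pairs[i].first);
        if (pairs[i].first != pairs[i].second) {
            out += ':';
            out += escape_pair_symbol(pairs[i].second);
        }
    }
    return out;
}
```

Without separating spaces the output can become ambiguous when symbol names are longer than one character, which is one reason to keep the default.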

See also:
pairstring_to_transducer

void write_symbol_table ( KeyTable *  T,
ostream &  os,
bool  binary = false 
)

Transform the key table T to a symbol table and write it to ostream os. binary defines whether the symbol table is written in binary or text format.

See also:
read_symbol_table

void write_transducer ( TransducerHandle  t,
const char *  filename,
KeyTable *  T,
bool  backwards_compatibility = false 
)

Write transducer t to file filename. Key table T is stored with the transducer.

Parameters:
backwards_compatibility Whether the transducer is written in SFST/OpenFst compatible format.

void write_transducer ( TransducerHandle  t,
KeyTable *  T,
ostream &  os = std::cout,
bool  backwards_compatibility = false 
)

Write t in binary form to output stream os. Key table T is stored with the transducer.

Parameters:
backwards_compatibility Whether transducer is written in SFST/OpenFst compatible format.


Generated on Fri Mar 27 12:56:17 2009 for Helsinki Finite-State Transducer Technology (HFST) interface by  doxygen 1.5.6