University of Helsinki
Omorfi–Open Morphology for Finnish language

Contact information

Department of General Linguistics
P.O. Box 9 (Siltavuorenpenger 20 A)
00014 University of Helsinki

Phone +358 (09) 1911
Fax +358 (09) 191 29307

This package contains a free and open source implementation of morphological analysis for the Finnish language. It uses the GPL-licensed SFST as its implementation language. The package is licensed under the GNU GPL, LGPL and AGPL version 3, but not necessarily any later version. The licence texts are in the files COPYING.*. Other licences are possible and can be granted by the authors listed in the AUTHORS file.

Downloading

Omorfi is hosted on the Gna! service. The Omorfi download directory contains the release packages. The development version is available from the Gna! SVN server.

A stable version of the materials by the University of Helsinki, RILF and others is available on the servers of CSC (the Center for Scientific Computing). The development version is in the CVS repository on corpus.csc.fi, under /c/appl/ling/koskenni/cvsrepo.

On Gentoo Linux, omorfi can be installed from the science overlay using Portage:

layman -a science
layman -s science
emerge omorfi

The package can correspondingly be installed from the same overlay using paludis.

Dependencies

Installation requires:

  • SFST, at least version 1.1, or compatible:

    • fst-compiler-utf8, fst-compact, and fst-lowmem executables are needed
  • kotus-sanalista, version 1a or later

  • tr

  • sed

  • XSLT processor supporting XSLT 2.0, tested with Saxon8:

    • java must be able to access net.sf.saxon.Transform in the existing environment, or
    • a script named saxon, saxon8 or saxon9 must execute it

The final transducer can be used with SFST 1.1 or compatible.
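
The build requirements listed above can be checked before running configure. The following is a minimal sketch, not part of omorfi: it only looks for the named executables on PATH, and the names missing_tools, REQUIRED and SAXON_SCRIPTS are made up for this example. It does not detect kotus-sanalista or a Saxon installation that is reachable only through java.

```python
import shutil

# Executables named in the dependency list.
REQUIRED = ["fst-compiler-utf8", "fst-compact", "fst-lowmem", "tr", "sed"]
# Any one of these wrapper scripts is enough for the XSLT 2.0 step.
SAXON_SCRIPTS = ["saxon", "saxon8", "saxon9"]


def missing_tools(which=shutil.which):
    """Return the names of required executables that are not on PATH.

    `which` is a lookup function returning a path or None; it defaults
    to shutil.which and can be replaced for testing.
    """
    missing = [name for name in REQUIRED if which(name) is None]
    if not any(which(name) for name in SAXON_SCRIPTS):
        # java with net.sf.saxon.Transform available would also do;
        # this sketch does not try to detect that case.
        missing.append(" or ".join(SAXON_SCRIPTS))
    return missing
```

Calling missing_tools() with no arguments returns an empty list when everything was found on the current PATH.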

Installation

Installation uses the standard autotools system:

./configure && make && make install

If configure cannot find an XSLT 2.0 processor, SFST or kotus-sanalista, their locations must be supplied using configure parameters. For more information, run:

./configure --help

The autotools system also supports installation to other locations, e.g. your home directory:

./configure --prefix=${HOME}

When building from a CVS or SVN checkout, you must first generate the necessary autotools files on the host system:

autoreconf -i

It is common practice not to store generated autotools files in a version control system.

For further instructions, see INSTALL, the GNU standard install instructions for autotools systems.

For example, a typical installation session on corpus3.csc.fi:

[tpirinen@corpus3 omorfi]$ autoreconf -i
[tpirinen@corpus3 omorfi]$ ./configure --prefix=$HOME --with-kotus-sanalista=$HOME/kotus-sanalista-1a.xml --enable-guesser
[tpirinen@corpus3 omorfi]$ make
[tpirinen@corpus3 omorfi]$ make install
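
The same session can be scripted. The sketch below only assembles the command lines and runs nothing. The configure options are the ones used in the session above, while the function names configure_command and build_steps are hypothetical helpers introduced for this example.

```python
def configure_command(prefix, kotus_sanalista=None, enable_guesser=False):
    """Build the ./configure argument list for an omorfi source tree."""
    args = ["./configure", "--prefix=" + prefix]
    if kotus_sanalista is not None:
        # Path to the kotus-sanalista XML file, if configure cannot find it.
        args.append("--with-kotus-sanalista=" + kotus_sanalista)
    if enable_guesser:
        args.append("--enable-guesser")
    return args


def build_steps(prefix, kotus_sanalista=None, enable_guesser=False):
    """Return the build commands in order, without running them."""
    return [
        ["autoreconf", "-i"],  # only needed in a CVS or SVN checkout
        configure_command(prefix, kotus_sanalista, enable_guesser),
        ["make"],
        ["make", "install"],
    ]
```

Each step can then be passed to subprocess.run(step, check=True, cwd=source_dir), where source_dir is the omorfi source directory.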

Usage

The final installation contains the transducers omorfi and guesser in the directory specified by the configure command, by default $prefix/share/omorfi/, which on a typical Linux system is /usr/local/share/omorfi/. The installed files have the suffixes .sfsta, .sfstc and .sfstl, corresponding to the standard, compact and lowmem transducers. The first of these works with all transducer applications, while the other two are an order of magnitude more efficient for analysis but cannot be used for generation.

A tokenised file, with one word per line, can be analysed with:

fst-infl2 ${prefix}/share/omorfi/omorfi.sfstc tiedosto.words

It is also possible to use fst-infl with omorfi.sfsta or fst-infl3 with omorfi.sfstl, but these are slower.
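
The pairing of transducer variants and analysis tools described above can be written down as a small lookup table. This is a sketch for illustration only: the suffixes and tool names come from the text, but analysis_command, VARIANTS and the variant keys are names invented here.

```python
import posixpath

# Suffix and analysis tool for each installed transducer variant.
VARIANTS = {
    "standard": (".sfsta", "fst-infl"),
    "compact": (".sfstc", "fst-infl2"),
    "lowmem": (".sfstl", "fst-infl3"),
}


def analysis_command(prefix, wordfile, variant="compact", name="omorfi"):
    """Return the argument list for analysing a one-word-per-line file."""
    suffix, tool = VARIANTS[variant]
    transducer = posixpath.join(prefix, "share", "omorfi", name + suffix)
    return [tool, transducer, wordfile]
```

For example, analysis_command("/usr/local", "tiedosto.words") yields the fst-infl2 invocation shown above with the prefix filled in.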

An interactive interface can be launched with:

fst-mor ${prefix}/share/omorfi/omorfi.sfsta

All known word forms can be generated with:

fst-generate ${prefix}/share/omorfi/omorfi.sfsta

The guesser, guesser.sfst{a,c,l}, takes arbitrary input and tries to guess all of its morphological data. It can also generate arbitrary forms from given base forms and morphological data. The guesser is not built or installed by default, because compiling it takes a very long time. It can be included with the configure parameter --enable-guesser.

Python interface to transducers

SFST transducers can be used from Python via pysfst. The lib/ directory contains an omorfi class and a few scripts that use it. The class is installed with omorfi into the system’s site-packages and can then be used in Python by creating an instance with the omorfi(filename) constructor. The usage examples in lib/ should be enough to get started.
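
Independently of the omorfi class, whose methods are not described here, a Python script can also post-process the text output of the command-line analysers. The sketch below assumes the output layout of SFST's fst-infl tools: a line "> word" for each input word, followed either by one analysis per line or by "no result for word". That layout, the function name parse_analyses and the placeholder analysis string in the usage note are assumptions of this example and should be checked against real output.

```python
def parse_analyses(output):
    """Group analyser output lines by input word.

    Returns a dict mapping each word to its list of analyses
    (an empty list when the analyser reported no result).
    """
    results = {}
    current = None
    for line in output.splitlines():
        line = line.rstrip()
        if line.startswith("> "):
            # A new input word starts; a repeated word replaces its
            # earlier entry instead of duplicating analyses.
            current = line[2:]
            results[current] = []
        elif current is not None and line and not line.startswith("no result for "):
            results[current].append(line)
    return results
```

Given the text "> talo\ntalo<PLACEHOLDER>\n> xyzzy\nno result for xyzzy\n", the function maps talo to a one-element list and xyzzy to an empty list.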

On character codings

Omorfi prefers the Unicode character set over legacy ASCII, and with the current SFST implementation this also means that text must be encoded as UTF-8. The characters that need particular attention are apostrophes and hyphens: U+2019 RIGHT SINGLE QUOTATION MARK and U+2010 HYPHEN are preferred over the legacy 0x27 APOSTROPHE and 0x2D HYPHEN-MINUS, although the latter two may occasionally work as well.
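
Input can be normalised to the preferred characters before analysis. The following is a minimal sketch under the assumption that every apostrophe and hyphen-minus in the input is meant as a word-internal apostrophe or hyphen. The names PREFERRED, normalise and to_utf8_lines are invented for this example.

```python
# Legacy ASCII characters mapped to the Unicode characters omorfi prefers.
PREFERRED = str.maketrans({
    "\u0027": "\u2019",  # APOSTROPHE -> RIGHT SINGLE QUOTATION MARK
    "\u002d": "\u2010",  # HYPHEN-MINUS -> HYPHEN
})


def normalise(word):
    """Replace legacy apostrophes and hyphens in a single word."""
    return word.translate(PREFERRED)


def to_utf8_lines(words):
    """Normalise words and encode them as UTF-8, one word per line."""
    return "".join(normalise(w) + "\n" for w in words).encode("utf-8")
```

The resulting bytes can be written to a file opened in binary mode and passed to the analyser. A blanket replacement like this would also change a hyphen-minus used as a minus sign, so numeric tokens may need to be excluded first.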

Programming and project management

Omorfi rule sets and code are free/libre open source, and may be modified and redistributed by anyone. Contributors are advised to follow the conventions common to most free and open source projects, such as the GNU project style guide and the Autobook (esp. § 9.1.1), as well as the instructions in the project’s HACKING file.