Kari Pitkänen & Timo Järvinen
(Information on the corpora was collected in the early 1990s.)

Helsinki Corpora I


1. Swedish
2. Russian
3. Swahili
4. English
5. Finnish
6. Latin
7. German
8. Word Lists
9. Others

*******************************************************************************

1. Swedish:

*******************************************************************************

Corpus: Helsinki / the Department of General Linguistics

Location: /corp/swe/snt /corp/swe/pre
Files: 26 (snt)
Words: 1 782 441 (snt)
Size: 9 857 657 (snt)
Description: snt, pre -- various texts, books, etc.

*******************************************************************************

Corpus: FISC / HBL 1991

Location: /corp/swe/hbl-sf/hbl
Files: 24 (run)
Words: 746 334 (run)
Size: 4 969 242 (run)
Description: run, pre, snt -- Newspaper

*******************************************************************************

Corpus: FISC / Vasabladet 1991

Location: /corp/swe/hbl-sf/vasabladet
Files: 7 (dos)
Words: 314 579 (dos)
Size: 1 883 104 (dos)
Description: dos, pre -- Newspaper

*******************************************************************************

Corpus: FISC / Books

Location: /corp/swe/hbl-sf/Nordica/books
Files: 6 (txt), 2 (dos)
Words: 408 637 [322 153 (txt), 86 484 (dos)]
Size: 1 836 455 (txt), 487 962 (dos)
Description: txt, dos, also: snt, pre -- Fin. Swe. Lit.

*******************************************************************************

Corpus: FISC / FINLAG

Location: /corp/swe/hbl-sf/Nordica/finlag.dir
Files: 16 (txt)
Words: 297 359 (txt)
Size: 2 895 106 (txt)
Description: txt, (pre) -- Finnish Law, legal texts

*******************************************************************************

Corpus: Helsinki / the Department of General Linguistics / Stora Focus

Location: /corp/swe/hbl-sf
Files: 27 (txt)
Words: 1 225 202 (txt)
Size: 8 455 801 (txt)
Description: txt, snt, pre -- Encyclopaedia

*******************************************************************************

Corpus: TAGGED HELSINKI CORPUS

Location: /corp/swe/hbl-sf/correct.tag/...
Files: -
Words: 270 183
Size: -
Description: swetwol -- Various texts -- tagged, corrected

*******************************************************************************

Corpus: aldre nysvensk

Location: /corp/swe/hbl-sf/nysvensk
Files: 1
Words: 67 911
Size: 536 650
Description: swetwol -- Various texts

*******************************************************************************

Swedish: TOTAL: 5 112 646 (words)
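
The total above can be checked against the per-corpus word counts. The following is a minimal sketch in Python, not part of the original inventory; the helper name parse_count and the shortened dictionary keys are made up for illustration. It assumes only that the figures use a space as the thousands separator, as they do throughout this listing.

```python
def parse_count(field: str) -> int:
    """Convert a space-grouped figure such as '1 782 441' to an int."""
    return int(field.replace(" ", ""))


# Word counts copied from the Swedish entries above
# (keys are shortened labels, chosen here for readability).
swedish_words = {
    "Helsinki / General Linguistics": "1 782 441",
    "FISC / HBL 1991": "746 334",
    "FISC / Vasabladet 1991": "314 579",
    "FISC / Books": "408 637",
    "FISC / FINLAG": "297 359",
    "Stora Focus": "1 225 202",
    "Tagged Helsinki Corpus": "270 183",
    "aldre nysvensk": "67 911",
}

swedish_total = sum(parse_count(v) for v in swedish_words.values())

# Print the sum in the same space-grouped style as the listing.
print(f"{swedish_total:,}".replace(",", " "))  # 5 112 646
```

The same helper can be applied to the Words and Size figures of the other sections.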

*******************************************************************************

2. Russian:

*******************************************************************************

Corpus: Russian Magazines -- Novoe Vremja, Ogonek and Sputnik

Location: /corp/rus/journal/snt
Files: 26 (snt)
Words: 110 962 (snt)
Size: 709 002 (snt)
Description: snt -- Russian magazines

*******************************************************************************

Russian: TOTAL: 110 962 (words)

*******************************************************************************

3. Swahili:

*******************************************************************************

Corpus: Dialectal Swahili -- Hurskainen / Swahili

Location: /corp/swa/dialects/dahe
Files: 13 (txt)
Words: 72 310 (txt)
Size: 441 187 (txt)
Description: txt (2 versions) -- Interviews; dialects of coastal Tanzania

*******************************************************************************

Corpus: Standard Swahili -- NEWSPAPER ARTICLES 1988-1991 -- Hurskainen / Swahili

Location: /corp/swa/standard/articles
Files: 8 (snt)
Words: 110 212 (snt)
Size: 681 314 (snt)
Description: snt, res -- Newspapers

*******************************************************************************

Corpus: Standard Swahili -- BOOKS -- Hurskainen / Swahili

Location: /corp/swa/standard/books
Files: 16 (snt)
Words: 669 262 (snt)
Size: 4 019 388 (snt)
Description: snt -- Books

*******************************************************************************

Swahili: TOTAL: 851 784 (words)

*******************************************************************************

4. English:

*******************************************************************************

Corpus: Brown Corpus

Location: /corp2/eng/brown
Files: 7
Words: 1 138 000
Size: 6 140 615
Description: snt, tag, frq

*******************************************************************************

Corpus: London-Oslo/Bergen

Location: /corp2/eng/lob
Files: 7
Words: 1 154 402
Size: 6 020 269
Description: sentence per line, tagged, frequency counts

*******************************************************************************

Corpus: Grolier Encyclopaedia

Location: /corp2/eng/grolier
Files: 1
Words: 1 630 012
Size: 10 345 431
Description: running text

*******************************************************************************

Corpus: London-Lund corpus of Spoken English

Location: /corp2/eng/london-lund
Files: 1
Words: 653 579
Size: 4 194 304
Description: broad phonetic transcription

*******************************************************************************

Corpus: The Susanne Corpus

Location: /corp2/eng/susanne
Files: 64
Words: 156 396
Size: 5 317 748
Description: tagged

*******************************************************************************

Corpus: Wall Street Journal

Location: /corp2/eng/wsj
Files: 41
Words: 6 837 293
Size: 41 066 725
Description: running text

*******************************************************************************

Corpus: miscellaneous, Project Gutenberg texts

Location: /corp2/eng/misc
Files: 20, 5 subdirectories
Words: 3 597 204
Size: 23 160 762
Description: running text

*******************************************************************************

English: TOTAL: 15 166 886 (words)

*******************************************************************************

5. Finnish:

*******************************************************************************

Corpus: Helsinki Corpus of spoken Finnish (1972-74)

Location: /corp/spoken/fin/hesa
Files: 126
Words: 457 270
Size: 3 137 308
Description: running text, broad phonetic transcription

*******************************************************************************

Corpus: Reverse dictionary

Location: /corp/fin/ksk
Files: 3
Words: 72 875
Size: 1 048 694
Description: noncompound entries from Nykysuomen Sanakirja

*******************************************************************************

Corpus: HKV-CORPUS

Location: /corp/fin/hkv
Files: 3
Words: 70 000
Size: 721 591
Description: running text; tagged; sentence per line, tagged

*******************************************************************************

Corpus: Helsingin Sanomat

Location: /corp/fin/hs
Files: 3
Words: 243 146
Size: 2 193 107
Description: running text

*******************************************************************************

Corpus: Suomen Kuvalehti 1975 (magazine)

Location: /corp/fin/sk75/txt
Files: 28
Words: 840 671
Size: 7 545 329
Description: running text

*******************************************************************************

Corpus: Suomen Kuvalehti 1987 (magazine)

Location: /corp/fin/sk87/snt
Files: 49
Words: 1 257 716
Size: 10 837 996
Description: running text; sentence per line

*******************************************************************************

Corpus: Otava (novels)

Location: /corp/fin/otava/txt
Files: 7
Words: 224 603
Size: 1 714 602
Description: running text

*******************************************************************************

Corpus: WSOY (novels)

Location: /corp/fin/wsoy/txt
Files: 13
Words: 841 406
Size: 8 746 945
Description: running text; sentence per line

*******************************************************************************

Corpus: Suomen kulttuurihistoria

Location: /corp/fin/skh/
Files: 5
Words: 304 683
Size: 2 259 903
Description: running text; sentence per line

*******************************************************************************

Corpus: Tiede 2000 (magazine)

Location: /corp/fin/misc
Files: 1
Words: 68 067
Size: 468 563
Description: sentence per line

*******************************************************************************

Finnish: TOTAL: 4 380 437 (words)

*******************************************************************************

6. Latin:

*******************************************************************************

Corpus: American Philological Association / LAT-1

Location: /lat/apa
Files: 68
Words: 2 663 504
Size: 22 525 851
Description: running text

*******************************************************************************

7. German:

*******************************************************************************

Corpus: mk

Location: /corp/ger/mk
Files: 15
Words: 3 220 958
Size: 20 100 743
Description: snt txt

*******************************************************************************

8. Word Lists:

*******************************************************************************

Location: /corp/words

Files: 8

Dutch

Words 178 430
Size 1 998 881

Finnish

Words 264 654
Size 3 171 148

French

Words 138 257
Size 1 524 757

German

Words 160 086
Size 2 060 734

Italian

Words 60 453
Size 561 982

Norwegian

Words 61 843
Size 589 234

Swedish

Words 13 328
Size 117 685

Finnish Surnames

Words 714
Size 4 488

Total

Words 877 765
Size 10 028 909

*******************************************************************************

9. Others:

*******************************************************************************

Biblical Hebrew (Old Testament)
Files: 38
Description: tagged, lemmatized

*******************************************************************************

New Testament Greek (Novum Testamentum Graecum)
Description: tagged, lemmatized

*******************************************************************************

Finnish Bible 1938, 1992
Description: tagged, lemmatized

*******************************************************************************

All Corpora: TOTAL: 53 084 589 words

*******************************************************************************


P.S., Sep 2, 1998