Using SALAMA (Swahili Language Manager) in corpus work

Arvi.Hurskainen@helsinki.fi

 

0. Introduction

These instructions are for those who have permission to use SALAMA in working with the Swahili corpus.

If you are not acquainted with the Unix/Linux environment, you should first learn at least the basics of working in it. Such information can be found at the following addresses:

http://www.routledge.com/linguistics/introduction.html

Read especially Chapter 5.

http://www.aakkl.helsinki.fi/cameel/corpus/rmecorp.htm

The instructions at the latter address contain basic information on Unix/Linux and on how to use the Swahili corpus with general tools.

SALAMA, on the other hand, is a far more sophisticated approach to corpus work, and it facilitates accurate and comprehensive management of information. The idea behind this approach is that we first carry out linguistic analysis of the text and then make use of the analysis result while working with the corpus text. In some sense this approach is similar to working with a tagged corpus. There are significant differences, however. When working with a tagged corpus, we have access to a text that is already provided with morphological and perhaps also other kinds of tags. In the approach presented here, tagging is done on the fly, so to speak, and the user does not necessarily even notice that he or she is in fact dealing with a tagged corpus.

The tags used in SALAMA are described in the document Tagset of SWATWOL.

The reasons for not providing the user with a readily tagged corpus are manifold:

(1) A tagged corpus takes at least ten times more space than the raw text. It is not feasible to maintain such huge text files on the corpus server.

(2) The user can do the tagging with the programs provided in SALAMA, if a tagged version of the corpus is needed.

(3) There are several levels and modes of tagging, depending on the purpose for which the corpus is used. If we provided an already tagged corpus, we would in fact need to provide several versions of it, which would require still more space. The user can also use the untagged corpus for tasks for which a tagged corpus cannot be used.

(4) Working with an untagged corpus is vastly more flexible than working with a corpus that has already been tagged. These points will be elaborated below.

1. The processing sequence

Texts in the Swahili corpus are available in two formats. Some texts are plain ASCII text without any kind of pre-processing. Most texts, however, are in the pre-processed format. Before running the programs described below you should always check the format of the text, so that you know which version of the program to use. The rule is:

If your source text is in pre-processed format, you should use a program with the -snt extension. The programs that expect raw text as input do not have this extension in their names.

1.1. Programs that expect raw text as input

The basic processing sequence is described below. The name of the program is given and its function is described briefly.

1.1.1. pre-process

It carries out a number of operations on the text in order to make the text optimal for linguistic analysis. The program does not tag the text in any way.

1.1.2. analyze

It carries out the basic linguistic analysis of the text. There are several versions of the analyzer. The default is that the analyzer carries out full morphological analysis and gives also English glosses to the words. It also treats verbal extensions as separate words, so that each extended verb has the extended stem as base form, or lemma.

1.1.3. analyze-simple-lemma

It carries out the morphological analysis as analyze does, but for extended verbs it gives the non-extended base form as lemma.

1.1.4. analyze-no-glosses

It carries out the morphological analysis as analyze does, but it leaves the English glosses out.

1.1.5. analyze-no-glosses-simple-lemma

It functions as analyze, but leaves English glosses out and gives only the simple verb base as lemma for extended verbs.

1.1.6. analyze-only

It performs the morphological analysis only. The input text should be pre-processed. (It functions as analyze-snt, listed below, but for logical reasons it is also listed here under a different name.)

1.1.7. analyze-only-simple-lemma

It performs morphological analysis only but gives as output a simple verb base as lemma for extended verbs.

1.1.8. disambiguate

It resolves ambiguous analyses of words. It tries to leave only one analysis for each word. However, there are words for which the disambiguator is not (yet) able to find a solution.

1.1.9. disambiguate-only

It resolves ambiguity in morphological analysis just as disambiguate does. The difference in the function is that disambiguate-only expects that the text has already been analyzed. You should not try to run this for raw text!

1.1.10. disambiguate-simple-lemma

It resolves ambiguous analyses of words. It functions as disambiguate, but gives as output a simple verb base as lemma for extended verbs.

1.1.11. one-line-format

It moves the word-form and its analysis to the same line. In this format it is easy to keep each word with its analysis together. Sorting and counting lines is possible without mixing words and their analysis.

1.1.12. one-line-format-simple-lemma

It functions as one-line-format but gives as output a simple verb base as lemma for extended verbs.

1.1.13. one-line-format-only

It puts the word and its analysis on the same line. It expects that the text has already been analyzed and disambiguated. Do not try to analyze raw text with this command!

1.1.14. list-count-lemmas

It gives only the frequency of each lemma in the text. Note that when using this option, the ambiguity of word-forms is not taken into account. A word with different interpretations is counted as a single word.

1.1.15. list-count-lemmas-analyze

It gives the frequency of each lemma in the text, together with some morphological information and English glosses. This format is suitable for dictionary compilation.

1.1.16. list-count-lemmas-simple

It gives only the frequencies of each lemma in text and gives a simple base form for extended verbs as lemma.

1.1.17. list-count-lemmas-analyze-simple

It gives the frequency of each lemma in the text and gives a simple base form for extended verbs as lemma. It also gives some morphological information, as well as English glosses. This format is suitable for dictionary compilation.

Note that all the programs (1.1.1. - 1.1.17.) expect raw text as input, if no other information is given. In other words, each program performs the whole sequence of analysis starting from pre-processing. You do not need to put different programs into the pipe yourself. For example, the program called disambiguate performs phases 1 - 3 (pre-processing, analysis, and disambiguation), the program called list-count-lemmas performs phases 1 - 5, and so on.

Note, however, that the programs with -only in the name perform a restricted sequence of commands, normally only one operation. Use these programs with care, because you get error messages if they are used wrongly. They are meant for more experienced users who understand the whole process of analysis. By using these commands you may build your own sequences of commands, by putting them into a pipe. For example, the command:

venus$ cat test.txt | /corp/swa/bin/pre-process | /corp/swa/bin/analyze-only-simple-lemma | /corp/swa/bin/disambiguate-only | /corp/swa/bin/one-line-format-only > test.res

performs the same operation as:

venus$ cat test.txt | /corp/swa/bin/one-line-format-simple-lemma > test.res
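When building such pipes by hand, it may help to keep the program directory in a variable and to print the command before running it. The following is only a sketch: the variable SALAMA_BIN and the function build_pipe are hypothetical helper names and not part of SALAMA, and the function prints the command line without executing anything.

```shell
#!/bin/sh
# Directory of the SALAMA programs, as given in this document.
SALAMA_BIN=/corp/swa/bin

# build_pipe (hypothetical helper): print a pipeline made of the named
# programs, each with its full path. Nothing is executed.
build_pipe() {
    pipe=""
    for prog in "$@"; do
        if [ -z "$pipe" ]; then
            pipe="$SALAMA_BIN/$prog"
        else
            pipe="$pipe | $SALAMA_BIN/$prog"
        fi
    done
    printf '%s\n' "$pipe"
}

# Print the four-stage pipe used in the example above.
build_pipe pre-process analyze-only-simple-lemma disambiguate-only one-line-format-only
```

The printed line can be checked by eye and then typed after cat test.txt | as in the example above.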

 

1.2. Filter programs

There are also some filters that remove such information that is not needed in a particular task.

prune-tags          It removes such tags that are not needed in compiling a dictionary.

remove-num          It removes numbers found in text.

remove-propname     It removes proper names found in text.

remove-heur         It removes such words that have not been recognized by the parser, but for which the heuristic guesser has given an interpretation.

remove-lemma        It removes the lemma form of the analysis.

remove-token        It removes the word-form token from the analysis.

These filter programs should be run after the disambiguation of text has been carried out. They should not be run earlier, because the disambiguation rules need the full text and full analysis to work satisfactorily. You may think of these filter programs as post-processing tools, in the same way as the Linux filters sort, uniq, rev, etc.
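Ordinary Linux filters can be used in the same post-processing position. The sketch below applies grep to a few lines of one-line-format output copied from Hint 3 of this document. It does not reproduce the SALAMA filters themselves; it only assumes that tags are separated by spaces, as in the output shown there.

```shell
#!/bin/sh
# A few lines of one-line-format output, copied from Hint 3 in this document.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"<*makala>" "makala"  N 9/10-0-PL  ' article (written) ' &
"<za>" "a"  GEN-CON 9/10-PL  &
"<*semina>" "semina"  N 9/10-0-SG  &
"<*kiswahili>" "*kiswahili"  PROPNAME 7/8-SG  &
"<*fasihi>" "fasihi"  N 9/10-0-SG  ' literature ' &
EOF

# Keep only the lines that carry the noun tag N (surrounded by spaces).
nouns=$(grep ' N ' "$sample")

# Count the lines that do not carry the tag PROPNAME.
not_propname=$(grep -vc ' PROPNAME ' "$sample")

printf '%s\n' "$nouns"
printf 'lines without PROPNAME: %s\n' "$not_propname"

rm -f "$sample"
```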

1.3. Programs that expect the source text to be in pre-processed format

Each of the programs in 1.1. has a counterpart program, which expects that the source text is already pre-processed. This is to help the user, because some of the corpus texts are available only in pre-processed format, while others are in the original format. You should check which format the text has before you try to run a program.

Such programs are:

    analyze-snt

    analyze-simple-lemma-snt

    analyze-no-glosses-snt

    analyze-no-glosses-simple-lemma-snt

    disambiguate-snt

    disambiguate-simple-lemma-snt

    one-line-format-snt

    one-line-format-simple-lemma-snt

    list-count-lemmas-snt

    list-count-lemmas-analyze-snt

 

Note that each program that expects pre-processed text as input has the extension -snt.

The functions of these programs are described above (1.1.1. - 1.1.17.).
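The naming rule can be written down as a small shell function. This is a sketch only: snt_name is a hypothetical helper, not a SALAMA program, and it accepts just the ten program names whose -snt counterparts are listed in this section.

```shell
#!/bin/sh
# snt_name (hypothetical helper): print the name of the counterpart program
# for pre-processed text, or fail if section 1.3 lists no such counterpart.
snt_name() {
    case "$1" in
        analyze|analyze-simple-lemma|analyze-no-glosses|\
        analyze-no-glosses-simple-lemma|disambiguate|\
        disambiguate-simple-lemma|one-line-format|\
        one-line-format-simple-lemma|list-count-lemmas|\
        list-count-lemmas-analyze)
            printf '%s-snt\n' "$1" ;;
        *)
            printf 'no -snt counterpart listed for %s\n' "$1" >&2
            return 1 ;;
    esac
}

snt_name one-line-format      # prints one-line-format-snt
snt_name pre-process || true  # not in the list, so an error is reported
```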

 

1.4. Programs for dictionary compilation and for producing other kinds of lexical lists

Above we mentioned that the scripts list-count-lemmas-analyze and list-count-lemmas-analyze-simple are suitable for dictionary compilation. They produce the basic raw analyzed list of lemmas, each with a different format for verbs. There are more such scripts that take the processing even further.

translate

It analyzes the text and produces a vertical form of it, with each word provided with the lexical information normally found in dictionaries, as well as a gloss in English. In order to improve readability, many morphological tags have been removed. The lemma form is deleted in the output of this program.

vocabulary

It analyzes text and produces an alphabetical list of lemmas found in the text. Such lexical information that is normally included in dictionaries is included. Also glosses in English as well as the etymological tags are included. This format is the best approximation of the final dictionary that the system can produce automatically.

vocabulary-count

This performs the same as vocabulary, but it adds the frequency number in front of each lemma. These numbers can then be transformed into frequency classes by a program and moved to the end of entries in the final dictionary.
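Such a transformation could be sketched with awk as follows. The sketch rests on assumptions: the input format (a count followed by the entry on each line), the class labels A, B and C, and the class limits 100 and 10 are all invented for illustration, and add_freq_class is a hypothetical helper. The real output of vocabulary-count may need a different pattern.

```shell
#!/bin/sh
# add_freq_class (hypothetical helper): read lines of the form
# "COUNT ENTRY..." and print "ENTRY... [CLASS]".
# The class limits (100 and 10) are arbitrary illustration values.
add_freq_class() {
    awk '{
        count = $1
        if (count >= 100)     class = "A"
        else if (count >= 10) class = "B"
        else                  class = "C"
        sub(/^[ \t]*[0-9]+[ \t]+/, "")   # drop the leading count
        print $0 " [" class "]"
    }'
}

# Invented sample input; real vocabulary-count output may look different.
printf '%s\n' '250 kitabu' '12 semina' '3 dibaji' | add_freq_class
```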

vocabulary-less-top500

vocabulary-less-top1000

vocabulary-less-top1500

vocabulary-less-top2000

vocabulary-less-top2500

vocabulary-less-top3000

These are programs that work as vocabulary, but they cut out the most common words in Swahili. The number at the end of the command name indicates how many words are cut from the top of the general frequency list of Swahili. For example, vocabulary-less-top1000 cuts the 1000 most common words and gives a lemma list of the remaining words, with all relevant lexical information and a gloss in English. These programs are suited to Swahili learners, who gradually improve their command of vocabulary. They can choose the level of vocabulary they need.
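The principle of cutting the most common words can be illustrated with a short awk sketch. It is not the implementation used by these programs. The helper less_top is hypothetical, and both file formats are assumptions: a frequency list with one lemma per line, most frequent first, and a lemma list with the lemma as the first field of each line.

```shell
#!/bin/sh
# less_top (hypothetical helper): less_top N FREQLIST LEMMALIST
# FREQLIST:  one lemma per line, most frequent first (assumed format).
# LEMMALIST: one entry per line, lemma as first field (assumed format).
# Prints the entries whose lemma is not among the first N lines of FREQLIST.
less_top() {
    awk -v n="$1" '
        NR == FNR { if (FNR <= n) common[$1] = 1; next }
        !($1 in common)
    ' "$2" "$3"
}

workdir=$(mktemp -d)
# Invented sample data.
printf '%s\n' na ya kitabu > "$workdir/freq"
printf '%s\n' 'kitabu book' 'na and' 'semina seminar' > "$workdir/lemmas"

# Cut the two most common lemmas (na, ya) from the lemma list.
result=$(less_top 2 "$workdir/freq" "$workdir/lemmas")
printf '%s\n' "$result"

rm -r "$workdir"
```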

2. Hints on use

Hint 1:

The programs described above are not available to all users of the corpus server. Therefore, while using the programs, you should specify the full path of the programs. The programs described here are located in the directory /corp/swa/bin/ .

venus$ cat /corp/swa/standard/articles/alasiri/ala-all | /corp/swa/bin/one-line-format > ala-all.r

This command analyzes the text, disambiguates it, puts the result into one line format, and finally directs the result into a file called ala-all.r .

Hint 2:

Be careful not to create files that are too big. If you need such files, you should quickly move them to your own computer. Recall that even a disambiguated file is ten times bigger than the original file.

There are methods to handle the size problem:

1. Zip the file by using

venus$ gzip file

The resulting compressed file will be named file.gz .

You can unzip the file with the command

venus$ gunzip file.gz

The file gets its original form.

You can also use the zipped file without unzipping it. The file is decompressed on the fly.

venus$ zcat file.gz | commands...

2. If you do not need the analyzed words in the order in which they occur in the text, you can down-size the resulting file considerably by sorting and removing duplicate lines.

venus$ cat /corp/swa/standard/articles/alasiri/ala-all | /corp/swa/bin/one-line-format | sort | uniq -c | sort -nr > ala-all.rsn
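Both methods can be tried out safely on a small invented file before they are applied to real results. In the sketch below the file sample.r and its contents are made up for the test, and gzip -dc is used in place of zcat, which does the same thing.

```shell
#!/bin/sh
# Work in a temporary directory so that nothing is left behind.
workdir=$(mktemp -d)

# Invented stand-in for an analyzed file, one item per line.
printf '%s\n' na ya na wa na ya > "$workdir/sample.r"

# Compress the file: creates sample.r.gz and removes sample.r.
gzip "$workdir/sample.r"

# Read the compressed file directly and build a frequency list,
# most frequent line first.
freq=$(gzip -dc "$workdir/sample.r.gz" | sort | uniq -c | sort -nr)
printf '%s\n' "$freq"

rm -r "$workdir"
```

The six input lines shrink to three lines, each preceded by its count.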

Hint 3:

It is wise to test the function of a program, or of a sequence of programs, first. A handy method of testing is to use the unix commands head and tail to get the first or last ten lines of the file. When you are sure that the program does what you want, you can process larger amounts of text. For example, if you want to see whether the file books-all is pre-processed or not, you may invoke the command:

[1039] venus$ head /corp/swa/standard/books/books-all

<FAS> *makala za *semina ya *kimataifa ya *waandishi wa *kiswahili *iii *fasihi *taasisi ya *uchunguzi wa *kiswahili *chuo *kikuu cha *dar_es_*salaam , 1983

<FAS> *dibaji $

<FAS> *kitabu hiki , ambacho kinatolewa katika juzuu tatu , ni matokeo ya semina mbili za kimataifa za waandishi wa *kiswahili .$

<FAS> *semina ya kwanza ilifanyika *dar_es_*salaam , *tanzania , kuanzia tarehe 16 / 9 / 78 hadi 7 / 10 / 78 .$

<FAS> *semina ya pili vilevile ilifanyika *dar_es_*salaam kuanzia tarehe 12 / 5 / 1980 hadi 24 / 5 / 1980 .$

<FAS> *semina zote mbili zilisimamiwa na *taasisi ya *uchunguzi wa *kiswahili , *chuo *kikuu cha *dar_es_*salaam , na kugharamiwa na *shirika la *kimataifa la *maendeleo la *sweden ( s*i*d*a ) kupitia *shirika la *umoja wa *mataifa la *elimu *sayansi na *utamaduni ( u*nesco )$

<FAS> *washiriki wa semina zote mbili walitoka katika nchi za *afrika *mashariki na *kati zenye wazungumzaji wa *kiswahili , *nchi zilizoalikwa ni *burundi , *kenya , *ethiopia , *malagasi , *malawi , *msumbiji , *ngazija , *rwanda , *somalia , *sudan , *tanzania , *uganda na *zaire .$

<FAS> *zote , isipokuwa *ethiopia , *malawi , *ngazija na *zaire , zilileta wajumbe wao .$

<FAS> *kadhalika , semina zilihudhuriwa na watajamaji kutoka *sweden ( s*i*d*a ) na u*nesco .$

<FAS> *semina zote mbili zilikuwa na shabaha za jumla zifuatazo :$

This is a pre-processed text. You can then test a program on it:

[1040] venus$ head /corp/swa/standard/books/books-all | /corp/swa/bin/one-line-format-snt

"<<FAS>>" &

"<*makala>" "makala"  N 9/10-0-PL  ' article (written) ' &

"<za>" "a"  GEN-CON 9/10-PL  &

"<*semina>" "semina"  N 9/10-0-SG  &

"<ya>" "a"  GEN-CON 9/10-SG  &

"<*kimataifa>" "taifa"  ADV ADV:ki  5a/6-PL  ' nation. (ar) ' &

"<ya>" "a"  GEN-CON 9/10-SG  &

"<*waandishi>" "mwandishi"  N 1/2-PL  ' 1 recorder, secretary. 2 writer, author ' &

"<wa>" "a"  GEN-CON 1/2-PL  &

"<*kiswahili>" "*kiswahili"  PROPNAME 7/8-SG  &

"<*iii>" "iii" <CAP> <Heur> PROPNAME SG &

"<*fasihi>" "fasihi"  N 9/10-0-SG  ' literature ' &

"<*taasisi>" "taasisi"  N 9/10-0-SG  ' institute. (ar) ' &

"<ya>" "a"  GEN-CON 9/10-SG  &

"<*uchunguzi>" "uchunguzi"  N 11-SG  DER:zi HC  &

"<wa>" "a"  GEN-CON 11-SG  &

"<*kiswahili>" "*kiswahili"  PROPNAME 7/8-SG  &

"<*chuo>" "chuo"  N 7/8-SG  &

"<*kikuu>" "kuu"  ADJ A-INFL 7/8-SG  ' great, important, eminent; main, major, chief ' &

"<cha>" "a"  GEN-CON 7/8-SG  &

"<*dar_es_*salaam>" "*dar_es_*salaam"  PROPN  &

...

 

Hint 4:

We will make another test, now with a newspaper file.

[1051] venus$ head /corp/swa/standard/articles/alasiri/ala-all

[ Alasiri 26th *january 1999 ] ' *bomu ' jengo la *p*p*f laitikisa *dar lwanajeshi , polisi walizingira *jumanne , *januari 26,1999 *na *agnether *kasenene , *jijini *hofu , kiwewe na mshikemshike leo zilitawala katikati ya *jiji hasa maeneo ya mitaa ya *samora na ile ya jirani kutokana na taarifa zilizotolewa polisi kwamba , kuna bomu limetegwa kwenye jengo jipya la *p*p*f na lingelipuliwa saa nne kamili .

*jengo hilo lenye ghorofa 14 linalomilikiwa na *mfuko wa *pensheni wa *mashirika ya *umma , lijulikanalo kama *p*p*f *house , lipo kwenye makutano ya barabara za *samora na *morogoro , jirani kabisa na ofisi za *tume ya *jiji .

*kamanda wa *polisi wa *mkoa wa *dar es *salaam , *bw .

*alfred *gewe amesema kuwa baada ya kupewa taarifa hizo kuwa kuna bomu , alimwaga *askari *wataalam wa masuala hayo , *polisi wa kawaida na *askari wa *jeshi la *wananchi *tanzania kwa ajili ya kupambana na hali hiyo .

*askari hao wamesema kuwa walipofika kwenye tukio , waliwazuia wafanyakazi na wapangaji wote wa jengo hilo kuingia .

*wamesema ilibidi wapekue jengo zima chumba hadi chumba kukagua kama kuna bomu ili walitegue kama watalikuta .

*baadhi ya watu wamesema sakata hilo lilianza jana baada ya mfanyabiashara mmoja kutoa taarifa kituo cha *polisi cha *kati kuwa , ana wasiwasi kuna bomu limetegwa kwenye jengo hilo .

*alipohojiwa na gazeti hili , mfanyabiashara huyo mwenye ofisi ya uwakala wa *bima amekiri kutoa taarifa hizo polisi .

*mfanyabiashara huyo , ofisi ya kampuni yake iko ghorofa ya tano jina lake halisi ni *bw .

*sanjay *suchak .

 

It is also a pre-processed text.

The following sequence of commands puts each word and its analysis on one line, sorts the lines alphabetically, then deletes duplicate lines while retaining their count, and finally sorts the result numerically so that the biggest count is at the beginning of the file. The larger the source file is, the greater the advantage of this kind of post-processing. Only the beginning of the result is shown.

[1048] venus$ cat /corp/swa/standard/articles/alasiri/ala-all | /corp/swa/bin/one-line-format | sort | uniq -c | sort -nr > ala-all.rsn

  19554    "<,>" ","  COMMA &

   8230    "<na>" "na"  CC @CC ' and '  &

   5901    "<ya>" "a"  GEN-CON 9/10-SG  &

   3724    "<kwa>" "kwa"  PREP ' at, to, for ' &

   3128    "<wa>" "a"  GEN-CON 1/2-SG  &

   3101    "<kuwa>" "kuwa"  CONJ **CLB ' that ' &

   2854    "<">" """  DOUBLE-QUOTE &

   2673    "<ni>" "ni"  DEF-V:ni ' be '  ""  SG1-SP  &

   2495    "<katika>" "katika"  PREP ' in, at ' &

   2342    "<za>" "a"  GEN-CON 9/10-PL  &

   2233    "<la>" "a"  GEN-CON 5/6-SG  &

   2131    "<hiyo>" "hi-o"  PRON DEM :hV  ASS-OBJ 9/10-SG ' this '  &

   2119    "<wa>" "a"  GEN-CON 11-SG  &

   2070    "<ya>" "a"  GEN-CON 5/6-PL  &

   2051    "<wa>" "a"  GEN-CON 1/2-PL  &

   2016    "<cha>" "a"  GEN-CON 7/8-SG  &

   1718    "<na>" "na"  AG-PART ' by '  &

   1478    "<huyo>" "hu-o"  PRON DEM :hV  ASS-OBJ 1/2-SG' this '  &

   1453    "<na>" "na"  PREP ' with '  &

   1413    "<'>" "'"  SINGLE-QUOTE &

   1285    "<kama>" "kama"  ADV AR ' like , such as '  &

   1231    "<hilo>" "hi-o"  PRON DEM :hV  ASS-OBJ 5/6-SG ' this '  &

   1145    "<baada_ya>" "baada_ya"  PREP @ADVL ' after '  &

   1101    "<wa>" "a"  GEN-CON 3/4-SG  &

   1056    "<*na>" "na"  CC @CC ' and '  &

   1017    "<*bw>" "*bw"  PROPNAME SG  &
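Because each analysis is counted separately, one word-form may occur on several lines of such a list. The counts can be added up with awk, as in the following sketch. The helper sum_form is hypothetical, and the input lines are copied from the list above.

```shell
#!/bin/sh
# sum_form (hypothetical helper): add up the counts of all lines whose
# second field is the given word-form, e.g. "<wa>".
sum_form() {
    awk -v form="\"<$1>\"" '$2 == form { total += $1 } END { print total + 0 }'
}

# The lines for wa and za, copied from the list above.
list=$(mktemp)
cat > "$list" <<'EOF'
   3128    "<wa>" "a"  GEN-CON 1/2-SG  &
   2342    "<za>" "a"  GEN-CON 9/10-PL  &
   2119    "<wa>" "a"  GEN-CON 11-SG  &
   2051    "<wa>" "a"  GEN-CON 1/2-PL  &
   1101    "<wa>" "a"  GEN-CON 3/4-SG  &
EOF

wa_total=$(sum_form wa < "$list")
printf 'wa: %s\n' "$wa_total"    # 3128 + 2119 + 2051 + 1101 = 8399

rm -f "$list"
```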

 

Hint 5:

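If we want to sort the result according to the lemma, we use the following sequence of commands. The sort key starts at the third field, because the count and the word-form precede the lemma on each line (older versions of sort accept +2 for the same key). Here again, only a small extract is given.

[1050] venus$ cat /corp/swa/standard/articles/alasiri/ala-all | /corp/swa/bin/one-line-format | sort | uniq -c | sort -k 3 > ala-all.rsl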

      1    "<waliandikisha>" "andikisha"  V 1/2-PL3-SP VFIN PAST  SV SVO ' SVOO ' write '  CAUS   &

      5    "<waliojiandikisha>" "andikisha"  V 1/2-PL3-SP VFIN PAST 1/2-PL-REL REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &

      1    "<waliojiandikisha>" "andikisha"  V 1/2-PL3-SP VFIN PAST 3/4-SG-REL REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &

      2    "<wamejiandikisha>" "andikisha"  V 1/2-PL3-SP VFIN PERF:me REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &

      1    "<nikajiandikisha>" "andikisha"  V 1/2-SG1-SP VFIN NARR:ka REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &

      6    "<kuandikisha>" "andikisha"  V INF  SV SVO ' SVOO ' write '  CAUS   &

      3    "<kujiandikisha>" "andikisha"  V INF REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &

      1    "<hawajiandikishi>" "andikisha"  V NEG-a VFIN 1/2-PL3-SP REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &

      1    "<wakaandikishana>" "andikishana"  V 1/2-PL3-SP VFIN NARR:ka  SV SVO ' SVOO ' write '  CAUS  REC   &

      1    "<wakiandikishwa>" "andikishwa"  V 1/2-PL3-SP VFIN COND:ki  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<waliandikishwa>" "andikishwa"  V 1/2-PL3-SP VFIN PAST  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<walioandikishwa>" "andikishwa"  V 1/2-PL3-SP VFIN PAST 1/2-PL-REL  SV SVO ' SVOO ' write '  CAUS  PASS   &

      3    "<wameandikishwa>" "andikishwa"  V 1/2-PL3-SP VFIN PERF:me  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<niliandikishwa>" "andikishwa"  V 1/2-SG1-SP VFIN PAST  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<aliandikishwa>" "andikishwa"  V 1/2-SG3-SP VFIN PAST  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<ameandikishwa>" "andikishwa"  V 1/2-SG3-SP VFIN PERF:me  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<ziliandikishwa>" "andikishwa"  V 9/10-PL-SP VFIN PAST  SV SVO ' SVOO ' write '  CAUS  PASS   &

      5    "<kuandikishwa>" "andikishwa"  V INF  SV SVO ' SVOO ' write '  CAUS  PASS   &

      1    "<hawaandikishwi>" "andikishwa"  V NEG-a VFIN 1/2-PL3-SP  SV SVO ' SVOO ' write '  CAUS  PASS   &

      2    "<waandikishwe>" "andikishwa"  V SBJN VFIN 1/2-PL2-OBJ  SV SVO ' SVOO ' write '  CAUS  PASS   "andikishwa"  V SBJN VFIN 1/2-PL3-OBJ  SV SVO ' SVOO ' write '  CAUS  PASS   "andikishwa"  V SBJN VFIN 1/2-PL3-SP  SV SVO ' SVOO ' write '  CAUS  PASS   &

...
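A list of this kind can also be condensed into one total per lemma. The following awk sketch adds up the counts in the first field for each lemma in the third field. The helper sum_lemmas is hypothetical, and the input is a handful of lines copied from the extract above.

```shell
#!/bin/sh
# sum_lemmas (hypothetical helper): add up the counts (field 1) for each
# lemma (field 3) and print "TOTAL LEMMA", sorted by lemma.
sum_lemmas() {
    awk '{ sum[$3] += $1 } END { for (l in sum) print sum[l], l }' |
        LC_ALL=C sort -k 2
}

# Five lines copied from the extract above.
extract=$(mktemp)
cat > "$extract" <<'EOF'
      6    "<kuandikisha>" "andikisha"  V INF  SV SVO ' SVOO ' write '  CAUS   &
      3    "<kujiandikisha>" "andikisha"  V INF REFL-SG-OBJ  SV SVO ' SVOO ' write '  CAUS   &
      1    "<wakaandikishana>" "andikishana"  V 1/2-PL3-SP VFIN NARR:ka  SV SVO ' SVOO ' write '  CAUS  REC   &
      3    "<wameandikishwa>" "andikishwa"  V 1/2-PL3-SP VFIN PERF:me  SV SVO ' SVOO ' write '  CAUS  PASS   &
      5    "<kuandikishwa>" "andikishwa"  V INF  SV SVO ' SVOO ' write '  CAUS  PASS   &
EOF

totals=$(sum_lemmas < "$extract")
printf '%s\n' "$totals"

rm -f "$extract"
```

For these five lines the totals are 9 for andikisha, 1 for andikishana and 8 for andikishwa.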