HELSINKI CORPUS OF SWAHILI (HCS)

 

 

INSTRUCTIONS

 

1. GENERAL INSTRUCTIONS

 

The Helsinki Corpus of Swahili, earlier located at the Helsinki University Language Corpus Server, has been moved to the corpus server of the Scientific Center for Computing (CSC). Whereas the corpus material was earlier in plain text form without linguistic tags, the current corpus is in XML format and is provided with a set of linguistic tags. The corpus was analyzed automatically using the Swahili Language Manager (SALAMA), and the new format makes it possible to perform a variety of searches.

 

Access to the HCS can be granted for scientific research on application. The application form is at:

http://www.csc.fi/english/customers/university/useraccounts/languagebank

 

The form should be filled in, signed and sent to the address given on the page.

 

Researchers with use permission can use the WWW-Lemmie 2.0 interface. Because this is a general environment for all corpus users on the corpus server of CSC, users of the HCS should read the general Manual, which can be opened from the top bar of Lemmie.

 

Here are instructions on how to find the web interface Lemmie for using the Helsinki Corpus of Swahili:

 

Go to: http://www.csc.fi

Select: English

Then on the top bar click E-services. Then under Scientist's User Interface click sui.csc.fi

You are asked to log in.

Then near Home button click the button Services. Under it select Science applications and then Lemmie.

 

Here you have the user interface for doing various kinds of searches.

The Manual explains how to work with the Lemmie interface.

 

The tagged corpus of Swahili contains mostly news materials from various news sources, and also extracts from books. The extracts are fairly short so as to avoid violating the good practice in using copyrighted printed material.

 

The HCS also contains transliterated speech material that was collected in 1989-1991 from the coastal area of Tanzania and from the islands of Zanzibar, Pemba and Mafia. The material was collected and transliterated in a joint project between the Institute of Kiswahili Research (University of Dar-es-Salaam) and the University of Helsinki. This material is not tagged, and it can be accessed only through the normal Linux interface. Access to the speech material is not automatically included in the use permission; it is given only on separate request.

 

Because the transliterated speech corpus cannot be used through Lemmie, the users should use the standard utilities of Linux and the programs described below.

 

 

2. HOW TO WORK IN LINUX

 

In Linux, commands are typed at the command prompt. There is more than one way of working in this environment, and no claim is made here about which is the 'best' way. What I write below just reflects my way of working. In order to be able to browse files and previously executed commands freely later on, it is good to execute the commands from within the emacs editor. To get the command prompt inside emacs, do the following.

 

At the prompt ($ stands for the prompt here), open the editor by typing:

$ emacs

(or other versions of emacs)

 

Then open a shell prompt by giving the command Esc-x shell (press and release Esc, then press x, then type shell and press enter; alternatively, hold down Alt while pressing x).

 

To view the first lines of a file, type

 

$ head filename

 

The first lines of the file will be printed. Now when the command prompt is called from within emacs, the prompt is in fact a buffer and the contents of the buffer can be browsed as any other buffer in the editor.

 

If you want to see more lines, type

 

$ head -100 filename

 

The first 100 lines will be printed.

 

The last lines will be printed with the command

 

$ tail filename

 

The command

 

$ tail -150 filename

 

will give the last 150 lines of the file.

 

To print the entire file to the buffer type

 

$ cat filename

 

You can also print several files one after another with cat (cat stands for concatenate).

 

$ cat filename1 filename2

 

The concatenation result can also be saved as a file

 

$ cat filename1 filename2 > catfile
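
The commands of this section can be tried out on a small scratch file before they are used on the corpus. The following is only a sketch: the file names sample.txt, part2.txt and catfile, their contents, and the temporary directory are made up for illustration. The form head -n 3 is the longer spelling of head -3.

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create two small sample files (hypothetical content).
printf 'moja\nmbili\ntatu\nnne\ntano\n' > sample.txt
printf 'sita\nsaba\n' > part2.txt

# First three lines and last two lines of the first file.
first=$(head -n 3 sample.txt)
last=$(tail -n 2 sample.txt)
echo "$first"
echo "$last"

# Concatenate both files, save the result, and count its lines.
cat sample.txt part2.txt > catfile
lines=$(wc -l < catfile)
echo "catfile has $lines lines"
```

Note that > overwrites an existing file called catfile without asking.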

 

 

3. GETTING HELP

 

In the Linux system there is an electronic manual. You can call this manual by typing man and the name of the command or process you need information about.

 

$ man grep

 

gives you information about a Linux tool called grep. 

 

To view the results of the command man, it is better to execute the command from the standard prompt (not inside emacs), because not all emacs versions display the result properly.

 

 

4. THE DIRECTORY STRUCTURE OF THE CORPUS SERVER

 

The corpus files of the HCS have the following path:

/l/kielipankki/hcs/articles

/l/kielipankki/hcs/books

/l/kielipankki/hcs/dialects

           

As mentioned above, the material in articles and books is in XML format and it is linguistically tagged. You may take a look at its format here, but keep in mind that it is intended to be browsed with Lemmie.

 

The directory dialects contains normal text, and the retrieving tools described below can be used for retrieving information from it.

 

To move up and down in the directory tree, you can use the command cd with various arguments. Examples of usage of cd:

 

$ cd /l/kielipankki/hcs/articles

 

will take you to the directory, where the news articles are located.

 

$ cd ..

 

will take you up one level in the directory tree. For example, if your current working directory (the branch where you 'are' right now) is

 

/l/kielipankki/hcs/articles

 

and you give the command cd .., the working directory after this will be /l/kielipankki/hcs/

 

Note that there is a space between cd and '..'. 

 

$ cd /

 

will take you to the root directory, to the very beginning of the tree.

 

The command pwd ('print working directory') is very useful, when you feel you are lost in the 'jungle' of directories. It prints on the screen the whole path of the current working directory. The command dirs does the same thing. 

 

$ pwd

/corp/swa/standard

 

cd without any options takes you back to your own home directory. Try it, and then check the path of your home directory with pwd.

 

'~' in the path means the home directory of the user. For example, if you want to change from the Swahili corpus directory to a subdirectory called swamat in your home directory, just give the command

 

$ cd ~/swamat

 

Read more about cd and pwd from the electronic manual.
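
The effect of cd and pwd can be checked safely in a scratch directory tree. This is a sketch only; the directory names below merely imitate the corpus path and are created in a temporary location.

```shell
# Build a small directory tree in a scratch location (names are made up).
top=$(mktemp -d)
mkdir -p "$top/hcs/articles"

cd "$top/hcs/articles"
here=$(pwd)          # full path of the working directory
echo "$here"

cd ..                # one level up: now in .../hcs
parent=$(pwd)
echo "$parent"

cd /                 # the root directory
root=$(pwd)
echo "$root"
```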

 

NOTE1 Since lower case letters and capital letters are two different sets of characters in the Linux system, it is important that you use lower case letters when so written and capital letters when so written.

 

NOTE2 The Linux system saves the commands you type during the session in a 'command buffer'. By pressing the up-arrow and down-arrow keys you can go through the commands you have used during the session. So you do not always have to type the whole command again; just press the up-arrow key and find the proper command. You can also edit the command line with the usual editing commands:

 

Ctrl-a         moves the cursor to the beginning of the line.

Ctrl-e         moves the cursor to the end of the line.

 

Arrow keys move the cursor left and right on the line.

Delete and Backspace delete the character to the left of the cursor.

 

Ctrl-k         deletes the line from the cursor to the right.

Ctrl-y         returns the last deletion made by Ctrl-k.

 

NOTE3 If you execute the commands from within the emacs editor, you can browse the previously executed commands. With the key combination alt-p you browse backwards, and with alt-n forwards. Up to 30 most recent commands are kept in memory.

 

 

5. LISTING OF FILES

 

The command ls gives you a list of the files and directories in the working directory. 

 

$ ls

 

READ.ME       README.beta   bin/          standard/

README        README~       dialects/     swa-tmp/

 

The files with a '~' after the name are backup copies (created, for example, by the emacs editor when a file is edited). As with the cd command and most other Linux commands, you can use options with the command ls. Typing the command ls with the option -l gives you a list of the files with more information.

 

$ ls -l

total 102

-rw-r--r--   1 ahurskai swahil       707 Oct 20  1995 READ.ME

-rw-r--r--   1 ahurskai uhswa      34077 Mar 17 14:38 README

-rw-r--r--   1 ahurskai swahil      1148 Dec 21  1992 README.beta

-rw-r--r--   1 ahurskai swahil     33587 Oct 21  1998 README~

drwxr-xr-x   2 ahurskai swahil      8192 Nov  6  1998 bin/

drwxr-x---   3 ahurskai swahil      8192 Jan  8  1997 dialects/

drwxr-xr-x   7 ahurskai swahil      8192 Feb  5 18:37 standard/

drwxr-x---   2 ahurskai swahil      8192 Jan  9  1997 swa-tmp/

 

To get this listing you may alternatively use the command: 

 

$ dir

 

The first column of the list shows the protection of the file. A d at the beginning of the line means that the entry is a directory. (For the other letters, see section 6 below.)

The number in the second column is the number of links to the file or directory; it is usually 1 for files. For directories the link count reflects the number of subdirectories in that directory. The third column shows the owner of the file, and the fourth the group to which the file belongs.

The fifth column shows the size of the file. The sixth, seventh and eighth columns show the date and time when the file or directory was last modified. The last column shows the name of the file or directory.

If you want a list of ALL the files in a directory (including the special files that are not shown when using just ls), use the command ls -a. It lists the files whose names start with a dot '.' in addition to the files that would be listed with the command ls.

 

$ ls -a

./            READ.ME       README.beta   bin/          standard/

../           README        README~       dialects/     swa-tmp/

 

You can also combine the two previous listings: 

$ ls -la

total 118

drwxrwxr-t   6 ahurskai uhswa       8192 Mar 17 14:38 ./

drwxr-xr-x  35 root     system      8192 Dec  2 11:12 ../

-rw-r--r--   1 ahurskai swahil       707 Oct 20  1995 READ.ME

-rw-r--r--   1 ahurskai uhswa      34077 Mar 17 14:38 README

-rw-r--r--   1 ahurskai swahil      1148 Dec 21  1992 README.beta

-rw-r--r--   1 ahurskai swahil     33587 Oct 21  1998 README~

drwxr-xr-x   2 ahurskai swahil      8192 Nov  6  1998 bin/

drwxr-x---   3 ahurskai swahil      8192 Jan  8  1997 dialects/

drwxr-xr-x   7 ahurskai swahil      8192 Feb  5 18:37 standard/

drwxr-x---   2 ahurskai swahil      8192 Jan  9  1997 swa-tmp/

 

Read more about ls from the electronic manual.
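
The difference between ls and ls -a is easy to see in a scratch directory. The sketch below uses made-up names (README, .hidden, dialects) in a temporary directory.

```shell
# Create a scratch directory with one ordinary file, one dot file and one subdirectory.
dir=$(mktemp -d)
cd "$dir"
touch README .hidden
mkdir dialects

plain=$(ls)          # the dot file is not listed
all=$(ls -a)         # lists . and .. and .hidden as well
echo "$plain"
echo "$all"

ls -l                # long listing: protection, links, owner, group, size, date, name
```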

 

6. FILE PROTECTION

 

The letters in the first column of the file listing indicate the protection of the files, i.e. who is allowed to use these files.

Letters 2-4 show the permissions of the owner of the file, letters 5-7 the permissions of the group, and letters 8-10 the permissions of the world, i.e. all the other users. 

The following four letters are used to indicate the type of protection the file or directory has: 

 

r          permission to read the file/directory (readable)

w          permission to edit and delete the file/directory (writable)

x          permission to execute the file or, for a directory, to enter it (executable)

t          shown in place of the last x on a directory: users with write permission may create files there, but may delete or rename only files they own

 

For example 

drwxrwxr-x

 

tells you that the file is a directory, the owner of the file may read, write and execute it, other members of the same group may read, write and execute it, and other people who do not belong to the same group may read and execute it. 

 

-rw-rw-rw-

tells you that everyone (owner, group and world) may read and write the file.
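
The three groups of letters can be picked out of a protection string with the standard utility cut. This is a small sketch using the example string drwxrwxr-x from above; the variable names are made up.

```shell
perm='drwxrwxr-x'

kind=$(printf '%s\n' "$perm" | cut -c1)       # d = directory, - = ordinary file
owner=$(printf '%s\n' "$perm" | cut -c2-4)    # letters 2-4
group=$(printf '%s\n' "$perm" | cut -c5-7)    # letters 5-7
world=$(printf '%s\n' "$perm" | cut -c8-10)   # letters 8-10

echo "type=$kind owner=$owner group=$group world=$world"
# prints: type=d owner=rwx group=rwx world=r-x
```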

 

7. MOVING, COPYING AND DELETING

 

To create a new subdirectory, give the command mkdir ('make directory') in the directory where you want the new directory, for example in your home directory.

 

$ mkdir directoryname

 

It is recommended that you create directories for the different types of material that you have in your home directory, to help keep it manageable and in good order.

 

To move a file or directory to another directory give the command

 

$ mv filename1 directoryname/filename1

 

If you also want to change the name of the file give the new name instead of typing the old name again, as in 

 

$ mv filename1 directoryname/filename2

 

If the files are in the same directory the specification of the directory is unnecessary. 

 

The command cp ('copy') works in the same way as mv ('move'). 

 

Remove a file: 

$ rm filename

 

Remove an empty directory:

$ rmdir directoryname
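
The whole cycle of creating, copying, moving and removing can be rehearsed in a scratch directory. The sketch below uses made-up names (notes, draft.txt, copy.txt, final.txt) and a temporary directory, so nothing in your home directory is affected.

```shell
work=$(mktemp -d)
cd "$work"

mkdir notes                       # new subdirectory
echo 'habari' > draft.txt         # a small file to work with

cp draft.txt copy.txt             # copy within the same directory
mv draft.txt notes/final.txt      # move and rename in one step
moved=$(cat notes/final.txt)      # the content has travelled with the file

rm copy.txt                       # remove a file
rm notes/final.txt                # empty the directory first ...
rmdir notes                       # ... because rmdir removes only empty directories

remaining=$(ls | wc -l)
echo "$remaining"
```

Note that rm deletes files without asking for confirmation, so it is worth checking the file name before pressing enter.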

 

 

8. EMACS EDITOR

 

In Linux there is an editor called Emacs. Here are only some basic commands of Emacs to get started with. For more information see Emacs manuals (on-line manual while in Emacs by pressing C-h).

 

To launch the editor type the command

 

$ emacs filename

 

If the file named on the command line does not exist it will be created on opening the editor. (To be precise, only the buffer with that name will be created. You still need to save the buffer, until there is a file with that name in the directory.)

 

Some basic Emacs commands (C stands for the ctrl key; C-x C-f means press ctrl-x and then ctrl-f, whereas C-x o means press ctrl-x and then plain o):

C-x C-f          open file (if the file does not exist, a buffer for it will be created when executing this command. You need to save the buffer for the file to be created.)

C-x C-s          save file (with the name of the buffer)

C-x C-w          write file (e.g. with a different name than the buffer name)

C-x C-c          exit emacs

C-h               help

C-x 2            split the screen 

C-x 1            kill the other window(s)

C-x o            move to another window

C-g                 interrupt the command

C-x u            undo the previous action

C-l                   scroll so that the line with the cursor is at the center of the screen

 

Moving in the file:

Esc <            beginning of the file

Esc >            end of the file

C-n                 following line (also arrow keys work)

C-p                 previous line (also arrow keys work)

C-a                 beginning of the line

C-e                 end of the line

C-v                 one screen downwards (also pgdn key works)

Esc v            one screen upwards (also pgup key works)

C-b                 one character backwards (also arrow keys work)

C-f                 one character forwards (also arrow keys work)

 

Searching in a file:

C-s                 search forwards

C-r                 search backwards

 

After the command you will be asked to type in the search string. C-g takes you back to the place where you started the search.

 

Editing:

C-x d            invite directory editing buffer (dired)

                        Here you can do the following:

            e          open the file at cursor position (also enter works)

            d          mark a file for deleting

            x          delete the marked file(s)

C-k                 kill and save the line to the right from the cursor

C-@                 put mark (C-Space also works)

Esc-w            save the text between the mark and cursor

C-w                 kill and save the text between the mark and cursor

C-y                 yank the saved text back to the cursor position

Esc-%            start query replace

C-x (            start defining keyboard macro

C-x )            end defining keyboard macro

C-x e            execute keyboard macro

 

 

9. TOOLS FOR RETRIEVING

 

9.1 kwic

 

Perhaps the most useful retrieving tool with context is kwic. The user can modify its behaviour in various ways. See examples below:

 

The default behaviour of kwic without any switches is below:

 

$ head -100 _dahe.all | kwic 'sema '

ba wangu, ndo ntanila." - "Akasema kuna mama mmoja, sawa ndo

                       TS M Kasema inafundisha nini?

 huyo nyoka akarudi nyuma, akasema umenikata kimoja lakini b

                       TS M Kasema inafunza nini?

moja wa mapacha yule, yeye alisema mimi ntakuwa mwindaji. Ba

 chakula, yanamvalisha, we unasema za kwako." Akasema, "Aah,

a kuondoka tu, yule mjomba anasema nyie ningojeni, ningojeni

                       TS M Kasema inafundisha nini?

 

It aligns the hits and gives the default number of characters on each side as context.

 

If you add the switch -i with a numerical argument, you get the following:

 

$ head -100 _dahe.all | kwic -i6 'sema '

 JE M ba wangu, ndo ntanila." - "Akasema kuna mama mmoja, sawa ndo

 TS M                             Kasema inafundisha nini?

 AA M  huyo nyoka akarudi nyuma, akasema umenikata kimoja lakini b

 TS M                             Kasema inafunza nini?

 SS M moja wa mapacha yule, yeye alisema mimi ntakuwa mwindaji. Ba

 VJ F  chakula, yanamvalisha, we unasema za kwako." Akasema, "Aah,

 VJ F a kuondoka tu, yule mjomba anasema nyie ningojeni, ningojeni

 TS M                             Kasema inafundisha nini?

 

On the left, the first six characters are reserved for index codes. In this case, the first code is an abbreviation for the speaker, and the second code is for the sex.

 

You can also modify the length of the context. In the following example the context is 40 characters on both sides of the hit.

 

$ head -100 _dahe.all | kwic -i6 -l40 -r40 'sema '

 JE M mwage mjomba wangu, ndo ntanila." - "Akasema kuna mama mmoja, sawa ndo ntanila."

 TS M                                       Kasema inafundisha nini?

 AA M wa kimoja, huyo nyoka akarudi nyuma, akasema umenikata kimoja lakini bado kingin

 TS M                                       Kasema inafunza nini?

 SS M le mtoto mmoja wa mapacha yule, yeye alisema mimi ntakuwa mwindaji. Basi akakuba

 VJ F io yanampa chakula, yanamvalisha, we unasema za kwako." Akasema, "Aah, ngojea le

 VJ F le yanataka kuondoka tu, yule mjomba anasema nyie ningojeni, ningojeni, yale mad

 TS M                                       Kasema inafundisha nini?

 

In all examples above, the hits are in the order they had in the source text. The hits can also be sorted by the context to the right of the hit, using the switch -s.

 

$ head -100 _dahe.all | kwic -i6 -l40 -r40 -s 'sema '

 TS M                                       Kasema inafundisha nini?

 TS M                                       Kasema inafundisha nini?

 TS M                                       Kasema inafunza nini?

 JE M mwage mjomba wangu, ndo ntanila." - "Akasema kuna mama mmoja, sawa ndo ntanila."

 SS M le mtoto mmoja wa mapacha yule, yeye alisema mimi ntakuwa mwindaji. Basi akakuba

 VJ F le yanataka kuondoka tu, yule mjomba anasema nyie ningojeni, ningojeni, yale mad

 AA M wa kimoja, huyo nyoka akarudi nyuma, akasema umenikata kimoja lakini bado kingin

 VJ F io yanampa chakula, yanamvalisha, we unasema za kwako." Akasema, "Aah, ngojea le

 

Regular expressions are also supported.

 

$ head -1000 _dahe.all | kwic -i6  'sem(a|e|i) '

 CH M mke anaolewa kwa hadithi." Akasema ndio ninayemtaka mimi. Ka

 CH M aji si ya mvua si ya bomba. Kasema haya jibu lako tutakutafu

 TS M                             Tuseme ukilinganisha, haya malez

 X1 F  Na kama yanabebwa watoto hawasemi ukweli. Ee, kama mimi wan


 

9.2 kw-alg

 

Another tool that can be used for aligning hits is kw-alg.

 

$ head -100 _dahe.all | kw-alg 'sema '

23:                             TS M Kasema inafundisha nini?

84:                             TS M Kasema inafundisha nini?

46:                             TS M Kasema inafunza nini?

19:  mjomba wangu, ndo ntanila." - "Akasema kuna mama mmoja, sawa ndo ntanila."

56: oto mmoja wa mapacha yule, yeye alisema mimi ntakuwa mwindaji. Basi akakuba

79: nataka kuondoka tu, yule mjomba anasema nyie ningojeni, ningojeni, yale mad

39: moja, huyo nyoka akarudi nyuma, akasema umenikata kimoja lakini bado kingin

79: nampa chakula, yanamvalisha, we unasema za kwako." Akasema, "Aah, ngojea le

 

This program aligns the hits, sorts them by the context to the right, and gives the line number of each hit.

 

kw-alg also supports regular expressions.

 

An annoying feature with kw-alg is that by default it produces only a limited number of hits. If you want to get more hits, perhaps all of them, increase the limit. This can be done in the following way:

 

$ MAXOCC=99999  (or any other number of hits you want)

$ export MAXOCC

 

This definition will be valid for the whole session, but not after logout.
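
The export step matters because only exported variables are passed on to the programs you start. The sketch below shows this with a second shell standing in for the program; that kw-alg reads MAXOCC in the same way is taken from the description above.

```shell
MAXOCC=99999
export MAXOCC            # can also be written in one step: export MAXOCC=99999

# A child process (here a second shell) now sees the value.
seen=$(sh -c 'echo "$MAXOCC"')
echo "$seen"
```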

 

9.3 kw-snt

 

In case the source text is in sentence-per-line format, it might be convenient to retrieve the whole sentence as a context. This can be done with kw-snt.

 

$ head -100 _dahe.all | kw-snt 'sema'

 79: VJ F <<Akasema,>> "Aah, ngojea leo twende tuone, kama hazikuja nakuua."

 84: TS M <<Kasema>> inafundisha nini?

 

Note that the standard utility egrep in Linux does the same thing, but it does not mark the match.
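
A sketch of the egrep alternative follows. It uses grep -E, which is the current spelling of egrep, and the option -n to print line numbers. The sample file is made up for illustration (two of its lines imitate the hits shown above), so the line numbers differ from those of the corpus file.

```shell
# Create a three-line sample file in sentence-per-line format (hypothetical content).
work=$(mktemp -d)
cat > "$work/sample.txt" <<'EOF'
TS M Kasema inafundisha nini?
SS M Basi.
VJ F Akasema, "Aah, ngojea leo twende tuone, kama hazikuja nakuua."
EOF

# -E selects extended regular expressions; -n prefixes each hit with its line number.
hits=$(grep -E -n 'sema' "$work/sample.txt")
echo "$hits"
```

Lines 1 and 3 of the sample file are printed in full, each preceded by its line number and a colon.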

 

10. REGULAR EXPRESSIONS

 

A 'regular expression', or regexp, is a way of describing classes of strings. The simplest regular expression describes itself. For example 

'abc' matches the string abc.

The regular expressions get more complicated when special characters are added.

The special characters of regular expressions:

^          matches the beginning of the string or the beginning of a line within the string. For example:

^Chapter

 

matches the string Chapter at the beginning of a string. This character can be used to identify the beginning of a line. (NOTE that in 'Linux world' everything that is between two hard returns is considered as one line, even though the line continues to the next 'line' on the screen. This is called 'a long line' and at the end of each 'line' on screen there is a backslash to tell the user that the line continues.)

 

$          is similar to '^', but it matches only at the end of a string or the end of a line within the string. For example

 

o$

matches a string that ends with a vowel 'o'.

 

.          matches any single character except a newline. For example

 

.P

matches any single character followed by a 'P' in a string. Using concatenation you can make regular expressions like

 

U.A

which matches any sequence of three characters that begins with 'U' and ends with 'A', for example

'USA'

 

[...]            Square brackets form a 'character set'. It matches any of the characters enclosed in the square brackets. For example

 [MVX]

matches any of the characters 'M', 'V', or 'X' in a string. Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in square brackets. For example

 [0-9]

matches any digit.

[^...] This is a 'complemented character set'. The first character after the '[' must be a '^'. It matches any characters except those in the square brackets. For example

 [^0-9]

matches any character that is not a digit.

 

|          This is the 'alternation operator' and it is used to specify alternatives. For example

^P|[0-9]

matches any string that matches either '^P' or '[0-9]', i.e. any string that starts with 'P' or contains a digit. The alternation applies to the largest possible regular expression on either side.

 

(...)            Parentheses are used in regular expressions for grouping. They can be used to concatenate regular expressions containing the alternation operator '|'. For example

possibilit(y|ies)

matches either 'possibility' or 'possibilities'.

 

*          means that the preceding regular expression is to be repeated 0-n times. For example

ph*

applies the '*' symbol to the preceding 'h' and looks for matches to one 'p' followed by any number of 'h's (including zero). The '*' repeats the 'smallest' possible preceding expression. If you need to repeat a larger expression, use parentheses. For example

(abc)*

matches '', 'abc', 'abcabc', 'abcabcabc', etc.

 

+          is similar to '*', but the preceding expression must be matched at least once. This means that

wh+y

would match 'why' and 'whhy' but not 'wy', whereas 'wh*y' would match all three of these strings.

 

?          is similar to '*', but the preceding expression can be matched once or not at all. For example

fe?d

will match 'fed' or 'fd', but nothing else.

 

\          is a backslash. It has two functions: it is used to suppress the special meaning of a character (including '\' itself), and it introduces additional special constructs (see below). For example

 

\$

matches the character '$'.

 

Inside a character set, to match '-', write it as '---', which is a range containing only '-'. You may also give '-' as the first or last character in the set. To match '^', put it anywhere except as the first character of the set. To match ']', make it the first character in the set. For example

 

 []d^]

matches either ']', 'd' or '^'.

 

Usually '\' followed by any character matches only that character. However, there are several exceptions: some characters form special regular expression constructs when preceded by '\'.

 

\|       specifies an alternative. Two regular expressions a and b with '\|' in between form an expression that matches anything that either a or b will match. Thus, 'uta\|ona' matches either 'uta' or 'ona' but no other string. '\|' applies to the largest possible surrounding expressions. Only a surrounding '\( ... \)' grouping can limit the grouping power of '\|'. Full backtracking capability exists to handle multiple uses of '\|'.

 

\( ... \) is a grouping construct that serves three purposes:        

1. To enclose a set of '\|' alternatives for other operations. Thus, '\(uta\|ona\)x' matches either 'utax' or 'onax'.

 

2. To enclose a complicated expression for the postfix '*' to operate  on. Thus, 'ba\(na\)*' matches 'bananana', etc. with any (zero or  more) number of 'na' strings.  

 

3. To mark a matched substring for future reference. This last application is not a consequence of the idea of a  parenthetical grouping; it is a separate feature which happens to be assigned as a second meaning to the same '\( ... \)' construct because there is no conflict in practice between the two meanings.

\`       matches an empty string, provided it is at the beginning of the buffer.

\'       matches an empty string, provided it is at the end of the buffer.

\b       matches an empty string, provided it is at the beginning or end of a word. Thus, '\butu\b' matches any occurrence of 'utu' as a separate word. '\bbahari?\b' matches 'bahari' or 'bahar' as a separate word.

\B       matches an empty string, provided it is NOT at the beginning or end of a word.

\<       matches an empty string, provided it is at the beginning of a word.

\>       matches an empty string, provided it is at the end of a word.

In regular expressions, the operators '*', '+', and '?' have the highest precedence, followed by concatenation, and finally by '|'. As in arithmetic, parentheses can change how operators are grouped.
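
The operators can be tried out with the standard utility grep. The sketch below uses grep -E, which accepts the unescaped forms (a|b), + and ? described in the first part of this section; the helper function matches is made up for illustration and simply reports whether a pattern matches a string.

```shell
# Print "yes" or "no" depending on whether the extended regexp $1 matches the string $2.
matches() {
    if printf '%s\n' "$2" | grep -E -q -- "$1"; then echo yes; else echo no; fi
}

matches 'possibilit(y|ies)' 'possibilities'   # yes: alternation inside a group
matches 'wh+y' 'wy'                           # no:  + needs at least one h
matches 'wh*y' 'wy'                           # yes: * also accepts zero h's
matches '^fe?d$' 'feed'                       # no:  ? allows at most one e
matches '^P|[0-9]' 'room 7'                   # yes: the second alternative finds a digit
```

Without the anchors ^ and $ a pattern is searched for anywhere in the line, so 'fe?d' alone would also be found inside a longer word.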

 

Whereas you can normally undo the special meaning of a character by placing a backslash in front of it, the apostrophe cannot be treated this way inside single quotes. You can avoid this problem by using double quotes ("...") instead of single quotes ('...') to delimit the string. Then you can find words such as ng'ombe, because the program no longer interprets the apostrophe inside the word as the end mark of the string, which it would do when using single quotes.

 

You may also use a dot '.'  in place of the apostrophe or any other character that causes problems. In regular expressions the dot stands for any single character. Thus you can use one of the following alternatives:

 

$ kw-alg " ng'ombe " <textfile >results

or

$ kw-alg ' ng.ombe ' <textfile >results
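
 

The same two ways of quoting work with any program that takes its search string from the command line. The sketch below shows them with the standard utility grep, where -c counts the matching lines; the two-line sample file is made up for illustration.

```shell
# Create a sample file in which one line contains the word ng'ombe (hypothetical content).
work=$(mktemp -d)
printf '%s\n' "Aliuza ng'ombe wake." "Alinunua ngamia." > "$work/text.txt"

# Double quotes let the apostrophe stand inside the pattern.
with_quotes=$(grep -c "ng'ombe" "$work/text.txt")

# A dot matches any single character, including the apostrophe.
with_dot=$(grep -c 'ng.ombe' "$work/text.txt")

echo "$with_quotes $with_dot"
```

Both counts are 1 here. On real text the version with the dot may find slightly more, because the dot matches any character in that position, not only the apostrophe.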

 

 

Last modified: 1.12. 2007, A. Hurskainen