Helsinki Corpus of Swahili


The Center for Scientific Computing (CSC) hosts a number of language corpora, including the Helsinki Corpus of Swahili (HCS). Access to the server is granted to university researchers by application and signed contract. These precautions are taken because of copyright regulations.


An application form can be downloaded here. Corpus-specific instructions on how to retrieve information from the HCS using the Web interface Lemmie are available.


A general introduction to the materials of the Helsinki Corpus of Swahili is also available.


For those who wish to work through the Linux interface, instructions on using Linux are in rmecorp.pdf.

The Helsinki Corpus of Swahili now offers the possibility of using SALAMA (Swahili Language Manager) in corpus work. SALAMA is based on language analysis, and it offers more accurate and wider-ranging possibilities for working with corpus texts than traditional string search. Permission to use SALAMA is not granted automatically to corpus users. This facility is in a testing phase, and those interested are asked to get in touch directly. An introduction to the use of SALAMA is available in salamainfo.pdf. The linguistic tags used in SALAMA are listed in swatags.pdf.


Information on string-based information extraction, as well as on information extraction based on SALAMA, is in corptrain.pdf. The superiority of SALAMA is shown through several tests. A sketchier but at the same time more comprehensive presentation of similar issues is in cameel2002.pdf.


Please note that it is not possible to get access to the materials and tools directly from this page.


Those interested should contact: