(Undergrad independent study project) Focused web crawler aiming to discover documents written in isiXhosa on the web

Restioson/isixhosa-crawler

isixhosa-crawler

Simple focused web crawler for discovering documents written in isiXhosa. This was produced as part of an undergraduate independent research project under the supervision of Professor Hussein Suleman during my B.Sc Computer Science & Xhosa Communication at the University of Cape Town.

Disclosure

This research was partially funded by the National Research Foundation of South Africa (Grant number: 129253) and the University of Cape Town. The authors acknowledge that opinions, findings and conclusions or recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.

Publication

Results associated with the crawler were published at the SAICSIT 2023 conference.

The final paper is available from SpringerLink, and a pre-print version is available for free from UCT CS's publications archive.

The dataset itself is available here.

Citation

Please cite as follows:

@InProceedings{10.1007/978-3-031-39652-6_2,
  author="Marquard, Cael and Suleman, Hussein",
  editor="Gerber, Aurona and Coetzee, Marijke",
  title="Focused Crawling for Automated IsiXhosa Corpus Building",
  booktitle="South African Institute of Computer Scientists and Information Technologists",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="19--31",
  abstract="IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corpora. This makes it difficult to perform many kinds of research with the language. This paper examines the use of focused Web crawling for automatic corpus generation. The resulting corpus is characterised using statistical methods: its vocabulary growth has been found to fit Heaps' Law, and its word frequency has been found to be heavy-tailed. In addition, as expected, the corpus statistics did not match expectations from non-agglutinative languages.",
  isbn="978-3-031-39652-6"
}
