Statistical corpus of 20th century Basque
The statistical corpus of 20th century Basque consists of 4,658,036 forms. Its main, and almost sole, function is to show Basque as it has been and is actually used, not to put forward a model of the language.
The corpus is based on an exhaustive inventory of publications in Basque in the 20th century. A random sample was drawn from the parent population of these publications, resulting in 6,351 extracts from published works.
The project began in 1987 and was carried out in two stages: the first covering the period 1900-1987, and the second the period 1988-1999. Although it was initially designed as an open corpus and was therefore updated annually, at the end of the century it became a closed corpus showing how Basque was used throughout the 20th century. The corpus contains written rather than spoken Basque, although spoken Basque has been included to the extent that it has been transcribed and published.
The corpus is implemented using an ORACLE relational database.
Classification criteria:
- Period: 20th century publications are divided into four periods:
  - 1900-1939: from the start of the century to the end of the Spanish Civil War.
  - 1940-1968: from the post-Civil War period up to the appearance of standard Basque.
  - 1969-1990: from the first changes produced by standard Basque up to the publication of Euskaltzaindia's proposals and rules (i.e. up to the publication of the Hauta-Lanerako Euskal Hiztegia by Ibon Sarasola).
  - 1991-1999: subsequent to the new standards.
- Dialect:
  - Vizcaíno
  - Guipuzcoano
  - Suletino
  - Labortano and Navarro
  - Standard Basque
  - Unclassified: press and periodical publications that were inventoried as a whole rather than article by article, so that different dialects may appear in the same publication.
- Genre:
  - Non-literary prose articles: articles from "significant" periodical publications, such as Euskera, Egan, Euzko Gogoa, and Jakin, which have their own files in the inventory.
  - Administrative texts
  - Textbooks
  - Essays (non-literary prose)
  - Literary prose
  - Poetry
  - Theatre
  - Verse
  - Research
  - Literature for children and adolescents
  - Oral: transcriptions
  - Liturgy
  - Newspapers
  - Periodical publications
This classification not only describes the parent population and serves as the basis for drawing the statistical sample; it also supports queries on the use of a form in one or more dialects, periods and/or genres. For example, you could search for the headword pastoral while restricting the search to the Suletino dialect, or for erdu in all dialects except Vizcaíno.
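The two example queries can be sketched in Python as a simple filter over classified records. This is only an illustration: the Occurrence record, the query function and the four sample rows are hypothetical and invented for the example, and they do not reflect the actual database schema or the contents of the corpus.

```python
from dataclasses import dataclass

# Hypothetical record layout; the real database schema is not described here.
@dataclass(frozen=True)
class Occurrence:
    headword: str
    dialect: str
    period: str
    genre: str

# Invented toy rows, for illustration only (not taken from the corpus).
OCCURRENCES = [
    Occurrence("pastoral", "Suletino", "1900-1939", "Theatre"),
    Occurrence("pastoral", "Standard Basque", "1991-1999", "Newspapers"),
    Occurrence("erdu", "Vizcaíno", "1900-1939", "Poetry"),
    Occurrence("erdu", "Guipuzcoano", "1940-1968", "Literary prose"),
]

def query(occurrences, headword, include_dialects=None,
          exclude_dialects=(), periods=None, genres=None):
    """Return the occurrences of a headword that pass every given filter.

    A filter left as None (or empty, for exclude_dialects) is not applied.
    """
    results = []
    for occ in occurrences:
        if occ.headword != headword:
            continue
        if include_dialects is not None and occ.dialect not in include_dialects:
            continue
        if occ.dialect in exclude_dialects:
            continue
        if periods is not None and occ.period not in periods:
            continue
        if genres is not None and occ.genre not in genres:
            continue
        results.append(occ)
    return results

# "pastoral", restricted to the Suletino dialect.
suletino_pastoral = query(OCCURRENCES, "pastoral", include_dialects={"Suletino"})

# "erdu" in all dialects except Vizcaíno.
erdu_not_vizcaino = query(OCCURRENCES, "erdu", exclude_dialects={"Vizcaíno"})
```

Period and genre restrictions combine with the dialect filters in the same way, since every filter that is supplied must be satisfied.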
Each work or article carries information about its author (or authors) and title. This data cannot be used in queries, however, mainly because the texts are extracts from statistically selected works and the selected pages are not continuous (so as to reflect the greatest possible lexical variety within a given work). The corpus therefore contains a large number of authors and titles, but they offer little help in searching.
Extracts from works are given in SGML (Standard Generalized Markup Language).
In addition to the features described above, the corpus offers an added value: it is lemmatised, that is, each form is assigned a standard headword, which makes queries easier. For example, because all declined forms and variants are grouped under a single headword, the etxe entry contains forms such as etxe, etxea, etxien, echeco, and etchetik. There is therefore no risk of missing a form or variant, as the headword covers all of them.
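The effect of lemmatisation on searching can be illustrated with a small Python sketch. The form-to-headword pairs are the etxe examples given above; the helper functions and the token list are hypothetical (the tokens are just a sequence of forms, not a real sentence), and the sketch says nothing about how the corpus is actually stored or queried.

```python
from collections import defaultdict

# (form, headword) pairs, taken from the etxe example in the text.
LEMMATISED_FORMS = [
    ("etxe", "etxe"),
    ("etxea", "etxe"),
    ("etxien", "etxe"),
    ("echeco", "etxe"),
    ("etchetik", "etxe"),
]

def build_index(pairs):
    """Group every attested form under its headword."""
    index = defaultdict(set)
    for form, headword in pairs:
        index[headword].add(form)
    return index

def search(tokens, headword, index):
    """Return the positions of all tokens that belong to the headword."""
    forms = index.get(headword, set())
    return [i for i, token in enumerate(tokens) if token.lower() in forms]

index = build_index(LEMMATISED_FORMS)

# Hypothetical token sequence used only to exercise the lookup.
tokens = ["etxea", "hala", "echeco", "etchetik"]
positions = search(tokens, "etxe", index)
```

A search on the literal string etxe would match none of these tokens, whereas the headword lookup finds the standard form etxea and the older spellings echeco and etchetik alike.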
Lemmatisation is not restricted to simple headwords; it also covers compounds, derivatives and other complex (multiword) lexical units. In addition to the etxe noted above, you can also find etxe orratz, etxe-abere, etxe-tresna, etxeko, etxeko jaun, etxekoandre, etxepe, etxetxo, etxeño, etxezain, etc. Together with hala, you will also find hala ere, hala eta guztiz ere, hala... nola, hala nola.
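One common way to recognise multiword headwords in running text is greedy longest-match lookup, sketched below in Python. This is an assumption chosen for illustration, not a description of how the corpus was lemmatised. The headword set is a small subset of the examples above, the function name and token list are hypothetical, and the sketch assumes tokens already appear in headword form. Discontinuous units such as hala... nola are not handled.

```python
# A small subset of the headwords mentioned in the text.
HEADWORDS = {
    "etxe", "etxeko", "etxeko jaun",
    "hala", "hala ere", "hala eta guztiz ere", "hala nola",
}

# Length, in words, of the longest headword in the set.
MAX_WORDS = max(len(headword.split()) for headword in HEADWORDS)

def match_headwords(tokens, headwords=HEADWORDS, max_words=MAX_WORDS):
    """Greedy longest match: prefer the longest headword starting at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in headwords:
                matches.append(candidate)
                i += n
                break
        else:
            i += 1  # no headword starts here: skip the token
    return matches

# Hypothetical token sequence (a list of words, not a real sentence).
tokens = ["hala", "eta", "guztiz", "ere", "etxeko", "jaun", "hala"]
matched = match_headwords(tokens)
```

With this approach hala eta guztiz ere is reported as a single unit rather than as hala followed by unrelated words, while an isolated hala is still matched on its own.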
The user thus has 101,585 distinct headwords available, which makes queries easy to formulate and, most importantly, makes the results reliable.