A historical corpus of the Welsh language

The Historical Corpus of the Welsh Language 1500-1850 is a collection of Welsh texts from the period 1500-1850 in an electronic format. It is the result of a project to encode Welsh texts of the period funded by the Arts and Humanities Research Board (AHRB Resource Enhancement Award RE11900) in the Department of Linguistics at the University of Cambridge between 2001 and 2004. The project's Principal Investigator was David Willis, while Ingo Mittendorf was the project's Research Associate. The aim of the project was to begin to provide an electronically searchable resource for use in linguistic, literary and historical research, of a kind similar to existing corpora already available for languages such as English, French, German and Irish. The Cambridge project dealt with the early modern Welsh period. Other projects at the University of Wales have provided or are providing similar materials for earlier periods. Although the project came to an end in 2004, it is hoped that resources will become available to allow future extension of the corpus.

The corpus is a planned corpus, and aims to reflect the rich diversity of the texts attested in Welsh during the period 1500-1850 by including texts and samples of texts from different stylistic levels and of varying geographical provenance. A number of the texts included are not available in adequate modern editions or are available only in modernised form, hence the corpus also provides access to a number of texts in an easily available form for the first time. It is hoped that this will encourage further linguistic, literary and historical research on these texts.

The corpus is encoded using Extensible Markup Language (XML) in a format that conforms to the standards of the Text Encoding Initiative (TEI). This should ensure its long-term preservation, and also allows flexibility in the way the texts of the corpus can be displayed and used. The corpus files can be viewed online, and are also available for download in a number of formats: as plain XML files; as viewable HTML documents in two formats (diplomatic and edited); as corpus files designed for use with the Concordance software package; and as web-based indexes and concordances. Although the corpus contains no grammatical tagging, the XML files contain some encoding designed to make the corpus more useful as a source for linguistic research. This concerns mainly spelling and graphical variation. Original spelling is maintained, but tagging for scribal errors and extreme orthographic variation is included, and is used in the indexes and concordances. Other editorial conventions are documented on the project site.
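As a rough illustration of how a diplomatic and an edited view can be derived from a single encoded file, the sketch below reads a small TEI-style fragment and renders it twice. It is only a sketch under stated assumptions: the element names (`choice`, `sic`, `corr`, `orig`, `reg`) follow general TEI P5 practice and are not confirmed as the markup used in this corpus, the sample sentence is made up rather than taken from a corpus text, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up fragment, NOT taken from the corpus. The element names are an
# assumption based on TEI P5; the corpus's actual tag set may differ.
# The TEI namespace is omitted to keep the example short.
SAMPLE = (
    "<p>Ac <choice><sic>yu</sic><corr>yn</corr></choice> y "
    "<choice><orig>dechreu</orig><reg>dechrau</reg></choice> yr oedd y gair</p>"
)

# Elements whose content is left out of each view.
DIPLOMATIC_SKIP = {"corr", "reg"}   # keep the source's own readings
EDITED_SKIP = {"sic", "orig"}       # keep corrected / regularised readings


def render(elem, skip):
    """Return the text of elem, omitting the content of skipped elements."""
    parts = []
    if elem.tag not in skip:
        parts.append(elem.text or "")
        for child in elem:
            parts.append(render(child, skip))
    # Text following an element belongs to the parent, so it is always kept.
    parts.append(elem.tail or "")
    return "".join(parts)


def diplomatic_text(xml_string):
    """Hypothetical helper: text with original spellings and errors kept."""
    return render(ET.fromstring(xml_string), DIPLOMATIC_SKIP)


def edited_text(xml_string):
    """Hypothetical helper: text with corrections and regularisations applied."""
    return render(ET.fromstring(xml_string), EDITED_SKIP)


if __name__ == "__main__":
    print(diplomatic_text(SAMPLE))
    print(edited_text(SAMPLE))
```

Because both readings are stored side by side in the same file, an index or concordance could be built over either view without altering the transcription itself.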

The corpus is arranged into different groups of text types in order to represent the stylistic diversity of the Welsh language, while allowing for differences in the specific range of text types actually available at different periods. The texts therefore include drama, personal letters, ballads, political (didactic) prose, scripture, historical narrative, narrative prose, and religious prose. For each text a representative sample of approximately 15,000 words is included. With texts whose total length is less than around 20,000 words, and also in the case of dramatic texts (the interludes), we have generally chosen to include the entire text. Overall the corpus contains around 420,000 words from 30 texts.

arts-humanities.net

Principal investigator
Dr David Willis
Principal project staff
Dr David Willis
Start date
Thursday, February 1, 2001
Completion date
Sunday, February 1, 2004
Digital resources created
A collection of Welsh texts from the period 1500-1850 in an electronic format. An electronically searchable resource for use in linguistic, literary and historical research, of a kind similar to existing corpora already available for languages such as English, French, German and Irish. The corpus is encoded using Extensible Markup Language (XML) in a format that conforms to the standards of the Text Encoding Initiative (TEI). The corpus files can be viewed online, and are also available for download in a number of formats: as plain XML files; as viewable HTML documents in two formats (diplomatic and edited); as corpus files designed for use with the Concordance software package; and as web-based indexes and concordances. Although the corpus contains no grammatical tagging, the XML files contain some encoding designed to make the corpus more useful as a source for linguistic research. This concerns mainly spelling and graphical variation. Original spelling is maintained, but tagging for scribal errors and extreme orthographic variation is included, and is used in the indexes and concordances. Other editorial conventions are documented on the site.
Source material
The final set of texts included in the corpus consisted of the following (ordered by approximate date): a selection of poetry (‘hen gwndidau’) by Tomas ab Ieuan ap Rhys (c. 1510-c. 1560); ‘Rhyfeddodau’r Ynys’ (Peniarth 163 ii) (1548); a selection of free-metre poetry from Cardiff 6 (c. 1550); religious plays (‘Y dioddefaint’ and ‘Tri Brenin o Gwlen’) from BL. Add. 14986; John Mirk, ‘Darn o’r Ffestifal’ (Hafod 22) (after 1525, ms. 1550-75); ‘Cronicl Hywel ap Syr Mathew’ (Peniarth 168) (1568-90); Testament Newydd (1567), a selection of parts translated by William Salesbury; Testament Newydd (1567), all parts translated by Thomas Huet; histories (‘Hanes Taliesin’, ‘Ystori y gwr moel o Sythia’, ‘Ystori Alexander a Lodwig’ and ‘Ystori’r llong foel’) from NLW 13075B; 1588 Bible translation; Huw Lewys, Perl mewn adfyd (1595); Ifan Llwyd ap Dafydd, Ystorie Kymru (NLW 13B) (1567-1609); a selection of free-metre poetry from Peniarth 218 (1605-10); Edward James, Pregethau a osodwyd allan… (1606) (‘Llyfr y Homilïau’); ‘Troelus a Chressyd’ (Peniarth 106) (1613, 1622); a selection of free-metre poetry from Llyfr Gwyn Mechell (NLW 823E) (late 17th c.); Huw Morus, ‘Y Rhyfel Cartrefol’ (Cwrtmawr 42) (after 1660); Ellis Wynne, Gweledigaetheu y Bardd Cwsc (1703); Theophilus Evans, Drych y Prif Oesoedd (1716); ‘Y Brenin Llur’ (Cwrtmawr 212) (c. 1700-50); a selection of the letters of Goronwy Owen (BL. Add. 15037, NLW 17B) (1752-7); ballads by Hugh Jones (Llangwm) (1750s-60s); James Albert Ukawsaw Groniosaw, Berr hanes o’r pethau hynod ym mywyd James Albert Ukawsaw Groniosaw (1779); a selection of ballads by Ellis Robert (Elis y Cowper) (1782-94); Thomas Edwards (Twm o’r Nant), Tri Chryfion Byd (1789); letters to Welsh settlers in the United States from south Carmarthenshire (NLW 14873E) (1797-1807); Matthew Williams, Hanes holl grefyddau’r byd (1799); letters by Welsh settlers from south Merionethshire in the United States (NLW 2722E) (1816-18); Legh Richmond, Crefydd mewn bwthyn (1819); B. W. Chidlaw, Yr American (1840).
Publications

Mittendorf, Ingo and Willis, David. 2003. Ein historisches Korpus der kymrischen Sprache. In Keltologie heute: Themen und Fragestellungen, edited by Erich Poppe. Münster: Nodus, 135–42.