Romansh digitisation project breaks new ground
Records of witch trials, memoirs of mercenaries and personal letters up to 450 years old have recently made their way from the library shelf to the computer thanks to a pioneering digitisation project with potential far beyond Switzerland.
They are all to be found in the Rätoromanische Chrestomathie, regarded as a mine of information for historians and linguists.
A collection of texts written between about 1560 and 1910 in all the varieties of Romansh, Switzerland’s fourth national language, it was published around the turn of the 20th century.
A facsimile version in 15 volumes came out in the 1980s. But valuable though it is, it has never been particularly easy to work with.
Now, thanks to specially developed software and a huge amount of goodwill by volunteers, it is available to anyone who wants to consult it, and in a much more user-friendly form.
The project was the brainchild of Professor Jürgen Rolshoven, a computational linguist in the Department of Linguistic Data Processing at the University of Cologne in Germany, and Wolfgang Schmitz, head of the Cologne University Library.
Rolshoven explained to swissinfo.ch that he had started his academic career as a philologist with a special interest in Romance – or Latin-based – minority languages, including Romansh, so he had personal reasons for wanting to digitise the work.
With both enthusiasms, he is the perfect bridge between the two fields. “Often people working in that domain don’t like computers, and often people working with computers don’t like minority languages,” he laughed.
Digitising old texts is not new, but – as anyone who has tried to read them on the computer will know – the end version is often full of mistakes. It is normally a two-stage procedure: first the original printed text is scanned to produce a photographic copy, which is then converted by the process of optical character recognition, or OCR, into computer-readable text.
The word chrestomathy – the English form – is derived from two Greek words, meaning “useful” and “learn”.
It is a collection of texts, selected with a didactic purpose, usually for the study of language development or literature.
The didactic purpose is what differentiates it from an anthology.
The Rätoromanische Chrestomathie was published around the turn of the 20th century, and consists of texts in different variants of Romansh spanning about 350 years.
For the digitisation project the printed text was first scanned in Hanover, and this was converted into computer-readable text by the OCR process.
A team at the Department of Linguistic Data Processing at the University of Cologne developed software to enable a community of helpers to correct the text.
They see on their screen both the OCR version, and the scanned image.
When a corrector clicks on a word in the OCR version, a frame appears around it on the scanned image version, enabling the corrector to compare the two and make any necessary changes.
They can also add comments, but at present these are not available to online readers.
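The link between a clicked word and its frame on the scan can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the Cologne team's software: the `OcrWord` class, the function names, the pixel coordinates and the sample words are all invented here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in image pixels


@dataclass
class OcrWord:
    """One recognised word and where it sits on the scanned page (hypothetical model)."""
    text: str                        # what the OCR step produced
    box: Box                         # position of the word on the scanned image
    corrected: Optional[str] = None  # a volunteer's correction, if any
    comments: List[str] = field(default_factory=list)

    def current_text(self) -> str:
        """Return the correction if one exists, otherwise the raw OCR text."""
        return self.corrected if self.corrected is not None else self.text


def frame_for_click(words: List[OcrWord], index: int) -> Box:
    """Return the rectangle to draw on the scan when word `index` is clicked."""
    return words[index].box


def apply_correction(words: List[OcrWord], index: int, new_text: str,
                     comment: Optional[str] = None) -> None:
    """Store a correction, and optionally a comment, without losing the OCR text."""
    word = words[index]
    word.corrected = new_text
    if comment:
        word.comments.append(comment)


# Invented sample page: two words with made-up coordinates.
page = [
    OcrWord("la", (40, 100, 30, 22)),
    OcrWord("cusa", (84, 100, 52, 22)),  # pretend OCR misreading
]

frame = frame_for_click(page, 1)  # rectangle to highlight on the scan
apply_correction(page, 1, "casa", comment="compared with the scanned image")
```

Keeping the raw OCR text next to the correction means a second volunteer can still see what was changed, which suits a workflow in which helpers check one another.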
The development of the software in Cologne was funded by the German Research Foundation, which provided about €160,000 (CHF196,000); the organisation of collaborative work in Graubünden was funded by the canton of Graubünden, the Legat A. Cadonau Foundation, and the Graubünden Institute for Cultural Research.
Rolshoven’s department went one step further: it developed a web-based system in which the OCR version can be compared with the scanned version. This system enables collaborators to add corrections, commentaries and cross-references.
“This was important, because there are about 8,000 pages, and one person can’t do all the work. On the one hand it was necessary to distribute it, and on the other hand people should check one another,” he explained.
“This is a pioneering work,” said Florentin Lutz, a Romansh-speaking linguist based in Bern, who helped Rolshoven with contacts, knowledge and fundraising in Switzerland. “It’s the first time that ordinary people have been brought in to help.”
These correctors are all volunteers. Without community involvement the project couldn’t have succeeded. The bulk of the funding that Rolshoven and Lutz have been able to raise is needed to pay for the software developers at the university.
About 150 people from all over the world – most of them Romansh speakers – registered to help with the correction. Not all of them ended up doing so and some made only a few contributions. The bulk of the work ended up being done by quite a small group, dominated by one man: Michele Badilatti, who comes from the Engadine, and is a student at the University of Zurich. He told swissinfo.ch that he wouldn’t like to think how many hours he had put in.
Nevertheless, for him the fact that the group ended up being quite small had advantages.
“What I noticed as I worked was that when there are not so many collaborators, it’s easier to achieve consistency,” he told swissinfo.ch.
To err is human
Badilatti admits that it could sometimes be “mega-dull”. “All the religious stuff – which is actually most of it: for us today it’s just awful to read. But at least linguistically they were interesting.”
“And then there were lots of very fascinating things – the ones I probably liked best were accounts written by mercenaries.”
It wasn’t only the OCR version that contained mistakes. Caspar Decurtins, the original compiler at the turn of the 20th century, transcribed many of the texts from manuscripts. As a speaker of the Sursilvan variant of Romansh, he sometimes made errors in other variants. And the printed version was typeset in Germany, by printers who didn’t know the language at all.
The correctors decided to leave most of these errors, rather than to take a subjective decision as to what was right. They only corrected systematic errors, such as when the German-speaking printers had muddled ‘n’ with ‘u’ – one being the upside-down version of the other.
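A systematic confusion of this kind lends itself to a simple check, sketched below in Python. This is an assumption about how such cases might be flagged for a human corrector, not a description of what the volunteers actually used: the function names are invented, and the tiny lexicon is a toy stand-in rather than real Romansh data.

```python
from itertools import product
from typing import Iterator, List, Set


def nu_variants(token: str) -> Iterator[str]:
    """Yield every spelling obtained by swapping one or more 'n'/'u' letters.

    Only lowercase 'n' and 'u' are considered in this sketch.
    """
    # Each 'n' or 'u' may be either letter; all other characters stay fixed.
    options = [("n", "u") if ch in "nu" else (ch,) for ch in token]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != token:
            yield candidate


def suggest_nu_fixes(token: str, lexicon: Set[str]) -> List[str]:
    """Suggest known words reachable from an unknown token by n/u swaps."""
    if token in lexicon:
        return []  # already a known word, nothing to flag
    return sorted(v for v in nu_variants(token) if v in lexicon)


# Toy lexicon, invented for illustration only.
toy_lexicon = {"casa", "una", "luna"}

suggestions = suggest_nu_fixes("luua", toy_lexicon)
```

The sketch only proposes candidates. Deciding whether a flagged word really is a printer's slip would still be left to a corrector, in line with the volunteers' choice to avoid subjective changes.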
Computational linguistics uses computers to try to model natural language.
It started in the US in the 1950s with attempts to develop automatic translation tools.
The applications include machine translation and information retrieval as well as intelligent text processing.
A huge amount of effort has gone into the project, but what’s in it, and for whom?
The developers don’t have any specific figures on the number of users, but Lutz is in no doubt that the digital edition has lots of advantages.
“You can find and download lots of texts very quickly, and you can find answers in them to all sorts of questions. Perhaps you want information about superstitions, or you’re looking for old recipes. Someone was interested in different words for ‘butterfly’ for a radio programme. It’s all very easy with our search function, which we are still expanding.”
Rolshoven also appreciates it from quite a different angle.
“It’s useful for my students, who are learning how to write specialised software. And we also learned that it is indeed possible to find a community and to integrate them to do all these jobs.”
Of course the experience is not tied to any particular language or area.
“We might try to collaborate with the Africanists at Cologne University, for example. There’s a great demand for doing this for languages where the books are only in the libraries of Paris or London, not in the countries where the languages are spoken,” he suggested.
And there are plans to create a digital Romansh library, if funding can be found. In some cases there are copyright issues, but literature in minority languages faces a problem that doesn’t apply to larger ones: when a work is out of print, it’s not worthwhile to reissue it.
Lutz hopes that the system can be used for print-on-demand, enabling people to buy works that disappeared long ago from bookshops.
And the volunteers? What’s in it for them? Badilatti is glad it is finished and that he can live more of a normal life. He assured swissinfo.ch that he didn’t do it with an eye to enhancing his CV, and laughed at the suggestion that he has at least earned himself a good reputation in certain circles.
“Some might think that, but others think I’m crazy!”