Step 1: Structure of data and archives

Structure of Language Data

Taking into consideration the amount of curation and organization of language materials, we can talk about three kinds of organizational units: assemblages, collections, and corpora.

An assemblage is the sum total of all the material collected and created in the course of your project, research, or work. Depending on how you worked, there may be plenty of working files, intermediate versions, drafts, and various ephemera. People working in low-electricity, low-internet zones sometimes make periodic backups of their entire working directories onto external hard drives to avoid major data loss. All of the materials in these drives would be part of the assemblage, but naturally, no one would want to archive three identical copies of the same directory just because they are on three different external hard drives. The job of curation is to select from the assemblage those parts which merit archiving, to organize them into a logical arrangement, and to produce enough metadata and description of the materials so that the user knows what they are and can find things within this set. In short, curation makes collections. A collection can (some would say should) have different kinds of materials: recorded conversations, recorded songs, procedural texts, photographs, videos, and analytical documents.
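One part of this curation work, spotting identical copies that are spread across several backup drives, can be partly automated. The following is a minimal sketch in Python, not a prescribed workflow: it assumes each backup is reachable as an ordinary directory, and the function names `file_digest` and `find_duplicates` are hypothetical names chosen for illustration. It groups files by a SHA-256 hash of their contents, so files count as duplicates only when they are byte-for-byte identical, whatever they are called.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def find_duplicates(roots):
    """Group files under the given directories by content.

    Returns a dict mapping each digest to the list of paths that share it,
    keeping only digests that occur more than once.
    """
    by_digest = defaultdict(list)
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                by_digest[file_digest(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

A report like this only identifies candidates for removal. Deciding which copy to keep, and which files merit archiving at all, remains a curatorial judgment.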

Two examples of a collection, including some of their contents, can be seen below in Figures 9 and 10.

Figure 9:

Image of the collection record and partial folder listing for the language Quechua as recorded in Peru

Figure 10:

Image of the collection record and partial folder listing for the language Sochiápam Chinantec as recorded in Mexico

A special type of collection is a corpus. This is a set of materials (most commonly recorded texts with their annotations, but many kinds of corpora exist) that is designed for a purpose. It may serve a single purpose, such as a collection of words recorded for a phonetic study or a series of interviews with poets about their work, or it may be intended for reuse in many different ways, like the Brown Corpus (an annotated set of a million English words), pictured in Figure 11. Corpora can be much larger than that, and they can also be much smaller.

Figure 11:

A table demonstrating the text categories in the Brown Corpus, organized by genre group, category, content of category, and number of texts.

Note also that a collection is not always an intermediate step on the path to building a corpus. Some projects set out from the beginning to create corpora and never create a heterogeneous curated collection. Conversely, many collections are never uniformly processed into a corpus yet are still of great worth. Such collections may be less likely to be reused in natural language processing research, but their individual recordings can still be accessed by community members interested in their content and topics, or inspected for their grammatical features.

Different kinds of digital repositories

For the purposes of our discussion, we present three different types of digital repository: a digital language archive, a data repository, and an institutional repository. When thinking about how to prepare your materials for archiving, it can be good to think about which of these types of digital repository you will use. Most digital repositories have some sort of technical method for restricting files; these methods might include granular access, graded access and time embargoes. Granular access means that certain files in a folder of related files may be restricted while other files in the same folder are not. Graded access implies different levels of access depending on the files themselves or the role of the user. A time embargo restricts access to a file or folder of files for a limited and predetermined amount of time. Each digital repository, no matter its type, will have its own policies and procedures that must be followed by both depositors and users. In the following sections, we draw some distinctions between digital language archives, data repositories, and institutional repositories.
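The three restriction methods described above can be illustrated with a small model. This is a sketch under stated assumptions, not the access system of any actual repository: the role names in `LEVELS`, the `ArchivedFile` class, and the `can_access` function are all hypothetical, and the rule that a depositor can always open embargoed files is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical user roles, ordered from least to most privileged.
LEVELS = {"public": 0, "registered": 1, "community": 2, "depositor": 3}


@dataclass
class ArchivedFile:
    name: str
    min_level: str = "public"             # graded access: lowest role allowed
    embargo_until: Optional[date] = None  # time embargo: closed before this date


def can_access(item: ArchivedFile, user_level: str, today: date) -> bool:
    """Decide whether a user with the given role may open the file today."""
    embargoed = item.embargo_until is not None and today < item.embargo_until
    # Assumption for this sketch: only the depositor bypasses an embargo.
    if embargoed and user_level != "depositor":
        return False
    return LEVELS[user_level] >= LEVELS[item.min_level]


# Granular access: files in the same folder each carry their own rule.
folder = [
    ArchivedFile("story.wav"),
    ArchivedFile("ceremony.wav", min_level="community"),
    ArchivedFile("fieldnotes.pdf", embargo_until=date(2030, 1, 1)),
]
```

Because the restriction is stored on each file rather than on the folder, one folder can mix open, role-restricted, and embargoed materials. Real repositories implement these ideas in their own ways, so check the policies of the repository you plan to use.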

Digital language archives

An archive is “a trusted repository created and maintained by an institution with a demonstrated commitment to permanence and the long-term preservation of archived resources” (Johnson 2004:143). Extending the definition further, a digital language archive is a digital repository whose primary purpose is to preserve materials (files) that are in or about specific languages, usually Indigenous or minority languages that are under-documented, under-resourced, under-described, and less commonly taught. According to Peter K. Austin (2011), regional archives are used more by language communities for “cultural, historical or language-learning purposes,” but other archives are used primarily by researchers.

Figure 12:

Logo for DELAMAN, The Digital Endangered Languages and Musics Archives Network

The Digital Endangered Languages and Musics Archives Network (DELAMAN, www.delaman.org, Figure 12) is a network of archives that specialize in endangered and/or Indigenous languages. The member archives have given special consideration to the handling and description of language documentation materials. Each of the DELAMAN archives has its own policies, procedures, and specializations, and all or most of them have some technical process for restricting access to certain files. Some of the DELAMAN archives can also provide granular access, graded access, or both.

Full Members of DELAMAN are:

The Archive of the Indigenous Languages of Latin America (AILLA, University of Texas at Austin, https://www.ailla.utexas.org)
Alaska Native Language Archive (ANLA, University of Alaska Fairbanks, https://www.uaf.edu/anla/)
California Language Archive (CLA, University of California, Berkeley, https://cla.berkeley.edu)
Endangered Languages Archive (ELAR, SOAS University of London, https://www.elararchive.org/)
Kaipuleohone (University of Hawaiʻi Digital Ethnographic Archive, http://ling.hawaii.edu/kaipuleohone-language-archive/)
Native American Languages Collection at the Sam Noble Museum of Natural History (https://samnoblemuseum.ou.edu/collections-and-research/native-american-languages/native-american-languages-collections/)
Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC, http://www.paradisec.org.au)
Repository and Workspace for Austroasiatic Intangible Heritage (RWAAI, Lund University, https://projekt.ht.lu.se/rwaai)
SIL International Language and Culture Archives (https://www.sil.org/resources/language-culture-archives)
The Language Archive (TLA, Max Planck Institute for Psycholinguistics, https://tla.mpi.nl)
The Library of the American Philosophical Society (APS, https://www.amphilsoc.org/library/CNAIR)
World Oral Literature Project (http://www.oralliterature.org)

Associate Members of DELAMAN are:

The Archive of Languages and Oral Resources of Africa (ALORA, CERDOTOLA, https://www.delaman.org/members/alora/)
Computational Resource for South Asian Languages (CoRSAL, University of North Texas, https://corsal.unt.edu)
Digital Himalaya (http://www.digitalhimalaya.com)
Language Archive Cologne (LAC, University of Cologne, https://lac.uni-koeln.de)
Pangloss Collection (https://pangloss.cnrs.fr/index_en.html)
Rosetta Project (http://rosettaproject.org)
Standing Rock Sioux Tribe Language and Culture Institute (This archive is under construction at the time of writing and will soon be accessible at http://www.standingrockiyapi.org)

For more information about these archives, and to see the most updated version of this list, visit https://www.delaman.org/members/.

Data repositories

Digital data repositories are generally intended to accommodate all kinds of data from different disciplines. As such, they usually handle the assortment of different files typically found in language collections quite well, though their metadata fields may not be particularly well-suited for handling language data. Most data repositories that are associated with universities only accept data from their faculty, staff, and students, but some data repositories are public, and welcome donations of data from anyone whether or not they are associated with a university. One example of a public data repository dedicated to linguistic datasets is the Tromsø Repository of Language and Linguistics (https://dataverse.no/dataverse/trolling, Figure 13).

Figure 13:

Logo for TROLLING, the Tromsø Repository of Language and Linguistics

Data repositories may be especially good for corpora that are meant to be consulted or used in aggregate: for example, a collection of transcriptions meant to be used for natural language processing tasks, or numerous short audio recordings used to measure acoustic features of a language’s vowels. Since data repositories often require less metadata for individual files, they may be especially useful for collections with materials that are not likely to be used individually in the future.

Institutional repositories

Many universities have digital institutional repositories where faculty, staff, and students can deposit their works (e.g., Figure 14). Some institutional repositories are data repositories like those we just discussed, but more often they are designed to hold only academic literature such as reports, articles, white papers, conference proceedings and presentations, theses, and dissertations, all of which can exist as single, simple digital objects (usually PDF files). They are usually not configured to hold collections that include datasets or audio-visual recordings, which require more space and produce more complex collections. Some institutional repositories are open to anyone on the World Wide Web, while others are available only to affiliates of the host university.

Figure 14:

Logo for Texas ScholarWorks at the University of Texas at Austin Libraries

Each university’s institutional repository will have different capabilities and requirements. For example, if you are a student you may need a faculty sponsor to deposit your materials, and some institutional repositories are not accessible to people who are not affiliated with the host university. Institutional repositories, much more so than digital language archives and data repositories, often support only a few file formats. You should inquire about these policies before you decide to deposit materials at your institution’s repository.

Archiving for the Future