Step 1: Data and archives

Introductory video: Organizing for Personal vs Archival Workflows

You may or may not be able to choose the digital archive that will be the permanent home of the language documentation that you (help to) collect. Some people are required to deposit research data into a particular archive. For example, if you are awarded a grant from the Endangered Language Documentation Program (ELDP), you will be required to put that body of research data into the affiliated Endangered Languages Archive (ELAR). Similarly, some Tribes and research or academic institutions require their affiliates to put language documentation data into their Tribal archive or institutional repository. However, many people will have to (1) find a digital archive or repository that will accept their language documentation data and (2) make their own arrangements with that archive. The following information illustrates the purposes and procedures of digital archives to help you understand what archives are, how they operate, and how archival materials are exposed for reuse. Even if you are required to use a particular archive, you still need to understand the archival processes that will affect how the data that you collect will be preserved and shared with the world.

Please do not forget to prioritize community access to the materials. In addition to archiving language documentation data in a digital repository or archive, be sure to share a copy of the materials and data with the speech community. There is no single, correct way to share the materials, so exactly how you do this is something that you will have to establish in collaboration with members of the speech community. We suggest that you consider depositing a copy in an institution such as a school, library, archive, museum, or cultural office that is located in the community’s town, region, state, or country. While these activities are not the focus of this course, we view them as an important part of the work of language documentation.

What is data?

“Data” means different things to different people, and people may have narrower or broader ideas of what “data” is. In the narrowest of senses, data is a collection of numerical or textual observations which can be cleaned, processed, used, and analyzed for many purposes, for example, to test hypotheses or to make visual displays. Perhaps the clearest example of this kind of data would be numerical data laid out in tables, such as the handwritten weather records below in Figure 5.

Figure 5:

In the world of digital data repositories, “data” are materials or files that result from research and that are used to produce documents like reports, scholarly articles, pedagogical materials, theses, and dissertations. Ideally, research data should be accompanied by supporting documentation (also known as “metadata”) that helps to explain what the data are about and how to interpret them, and that provides keys to any codes that were used to prepare or analyze the data. In language archiving, we use the term “data” very broadly to include any recorded observations of language, including sounds, images, and writing, that can be transcribed, translated, glossed, watched, read, or measured for analysis. Many language documenters also distinguish between primary data (raw audio or video recordings or written observations of spoken or signed language that are used for analyses, including narratives, oral histories, elicitation, and experimental protocols) and secondary data (transcriptions*, translations, morpheme breakdowns, glosses, and other types of annotation that require some level of preliminary analysis to create). Both primary and secondary data are used to produce published documents like the Quechua lexicon shown in Figure 6 below. In this course, “data” will be used very broadly and nearly synonymously with “materials”.

Figure 6:

Displays a vocabulary of Quechua as spoken by the indigenous people of Peru in 1560.

Reusing archival data

It’s important to bear in mind why language materials are archived and who the potential audiences for these collections are now (i.e., at the time of archiving) and in the future. Thinking about who will see and perhaps reuse your collection can help you think through sensible ways to arrange and describe your materials.

Archives, including digital language archives, preserve materials so they can be accessed by future users. Future users of archival collections can include the collectors themselves, for example when the collector references citable public versions of a transcript or needs to demonstrate their research activities. Heritage language users may consult materials when reclaiming their languages. Language revitalizers may delve into archival materials to build useful pedagogical materials. Community members may use language archives as a source of local history or as a record of the culture at a particular time. Your collection may be an unexpected source of information for future historians, genealogists, governmental agencies, NGOs (non-governmental organizations), and academic linguists and anthropologists. Archived language recordings have even inspired and been used by visual artists and composers (e.g., see Figure 7).

Figure 7:

The sheet music for Ishi's Song by Martin Bresnick based on the Daidepayahi (Maidu) Doctor's Song.

In the field of language documentation, much of the discussion around archival reuse focuses on the revitalization, reclamation, and/or analysis of endangered or sleeping languages; however, it is important to note that archival collections have great value even in situations where a language is—and is expected to remain—quite vital. Language archives provide snapshots of languages and their communities at a point in time, and these snapshots can be informative and useful for all kinds of purposes.

Good metadata allows current and future users of archives to determine if a collection could be useful for them and helps them find materials they might be interested in using.
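To make this concrete, here is a minimal sketch of how descriptive metadata lets a user filter a collection for the items they need. Everything in it is an assumption made for illustration: the field names, the `ItemMetadata` class, the `find_items` function, and the sample file names are hypothetical and do not reflect the schema of any particular archive.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ItemMetadata:
    """Hypothetical descriptive record for one archived file."""
    filename: str
    title: str
    language: str
    data_type: str  # "primary" (raw recording) or "secondary" (annotation)
    keywords: List[str] = field(default_factory=list)


def find_items(
    records: List[ItemMetadata],
    language: Optional[str] = None,
    data_type: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[ItemMetadata]:
    """Return the records that match every filter that was supplied."""
    matches = []
    for record in records:
        if language is not None and record.language != language:
            continue
        if data_type is not None and record.data_type != data_type:
            continue
        if keyword is not None and keyword not in record.keywords:
            continue
        matches.append(record)
    return matches


# Invented sample records; the names and values are placeholders.
collection = [
    ItemMetadata("story01.wav", "Narrative recording",
                 "Example language", "primary", ["oral history"]),
    ItemMetadata("story01.txt", "Transcription of narrative",
                 "Example language", "secondary", ["oral history"]),
    ItemMetadata("wordlist01.wav", "Elicited word list",
                 "Example language", "primary", ["vocabulary"]),
]

# A user looking only for raw recordings filters on data_type.
for item in find_items(collection, data_type="primary"):
    print(item.filename, "-", item.title)
```

Without the `data_type` and `keywords` fields, a user would have to open every file to learn what it contains; with them, the recordings and their annotations can be located from the description alone.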

Structure and Purpose of Archives

Archives are meant to provide long-term preservation of, and access to, materials, specifically materials that are not likely to be edited or modified. They are designed to allow users to access the stored files, but they are not meant to be temporary storage platforms for files that are still being edited or modified. Because digital archives have this commitment to making materials available over the long term, archives do not work like some other digital content management, storage, or sharing platforms you may be familiar with. Many people use cloud-based platforms like Google Drive (Figure 8), Box, Dropbox, Microsoft OneDrive, etc. to share files with other people.

Figure 8:

Image depicting 17 anonymous individuals accessing a Google document. Each user is assigned a random animal to represent them.

While these platforms work well for sharing files that are still being edited or revised, they are not appropriate for the long-term preservation of files or data that are stable and unlikely to be changed. Additionally, they are not stable distribution platforms since the files can be modified, replaced, or deleted by anyone with edit access. Video and audio streaming platforms such as YouTube, Vimeo, and SoundCloud are frequently used to publish and distribute audio/video content; however, even though these platforms are free to use and easily accessible, they make no commitment to the long-term preservation of the content. Furthermore, non-archival websites with user-generated content might be discontinued or bought out by another company; when this happens, the content might be deleted or the terms of use and access might change.

Digital archives and repositories are not intended to be used as temporary storage. All data that is put into the archive will be preserved, so materials should be in a stable and citable state. Materials do not need to be in their final state (a collection of text transcriptions may be revised over time, for example). However, minor edits to materials might not merit preservation; where revisions do matter, later versions can be included in the archive alongside the originals. You should not archive materials you would not want anyone to reuse. While some archival software systems support versioning, many do not, and data cannot be deleted or replaced past a certain point in the archival processing. Many archival systems that allow versioning will keep all versions of a file available, making the task of keeping track of your data more difficult. Thus, it might be preferable or necessary to wait to archive some material (e.g., the secondary data) until after you are satisfied that no significant revisions will be necessary.

Digital archives do not make ideal distribution platforms. Many archives that can accept video files do not have built-in video viewers for streaming those files. Furthermore, other functions of common video sharing platforms are not possible in digital archives. For example, it may not be possible to allow a user to submit their own captions for a video, embed a video in another webpage, or receive suggestions for other related videos they may be interested in. These kinds of functions are often incompatible with archives’ need to preserve materials over the long term and (in some cases) provide for culturally appropriate access to the materials.

* Note that transcripts are sometimes considered to be primary data (see Thieberger and Berez 2012). We consider transcripts to be secondary data because, in many cases, their production is dependent on some preliminary analysis of the phonemes of the language. This distinction is not critical to this course.


Archiving for the Future