Step 8: Managing growing collections with progressive archiving

In the past, language documentation was donated to an archive at the end of a researcher’s career, if at all. In contrast, language documentation collections today are being archived much closer to the time of their creation. While funders will require language documentation data to be archived prior to the end of the project, some language documenters archive language materials while they are still in the field (Robinson 2006) or shortly after each field trip. This kind of archiving, where portions of a collection are archived in stages, has different names, including progressive archiving and incremental archiving. Here we use the term progressive archiving.

In a progressive or incremental archiving model, a depositor or documentation team adds materials to an archive as soon as possible after they are collected. Typically the primary data (audio/video recordings, photographs, notes, etc.) collected during field trips or other data collection phases of a research project (for instance, when the “field” is your own university or place of residence) goes into the archive first. Secondary data (e.g., transcriptions, translations, and educational materials created from the primary data) and additional files (e.g., analyses, theses, and collection guides; see Step 9) are added as they are completed, which in some cases could be years after the primary data is archived. This cycle repeats during or after each period of data collection (see Figure 59). If a project is ongoing, the archiving can and should be done on a regular basis.

Figure 59:

Details the steps of progressive archiving.

What follows is an extremely simple example of progressive archiving, but it is enough to give you the general idea of how this process works. Let’s imagine that you are working on a project studying ethnomedicinal practices in your community. In your first round of data collection you conduct videotaped interviews and take photographs. You have assurances that there is no sensitive material within these files, and you have permission to archive them. Since this primary data is unlikely to be changed, you can go ahead and archive these materials. The image below in Figure 60 shows a small collection of ethnomedicine materials that has two folders of files. “Santos Interview” has one video and two photographs, and the “Ortiz Interview” has one video.

Figure 60:

A collection with some video and image files.

Several weeks or months later, you complete transcriptions of the video files, which you then archive alongside their corresponding videos. In the image in Figure 61 below, each folder now has a transcription (.eaf) file, which is secondary data, in addition to the primary data that was archived previously. These new additions are indicated in the image by the pointing fingers.

Figure 61:

A collection with some video, image, and transcription (text) files.

You can continue to work like this, progressively adding new or revised annotations or other materials to folders in the archived collection and adding folders of new materials, including reports and articles synthesizing the work. New additions in Figure 62 below, again indicated with pointing fingers, include a translation file in each of the original folders, as well as a new folder, “Hot and Cold Cures,” containing a journal article. Because the primary data (the videos) and the secondary data (the transcriptions) were already archived, you were able to directly cite those materials in your journal article, which you later also deposited in the archive.

Figure 62:

A collection with video, image, .eaf, and .pdf files.

Now imagine this process at scale. If you have collected tens, hundreds, or thousands of audio and video files, it could take you years to transcribe, translate, and otherwise annotate that many files. Rather than just keeping all of those audio and video files stored on hard drives or cloud storage for years, it is better to archive them as soon as possible and add the annotation files as they are completed.
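At that scale, it helps to work out which files are new since your last deposit. The following is a minimal sketch, assuming you keep a plain list of the paths you have already archived and that every path has the form "folder/file". The function name and the example file names are hypothetical and are not part of any archive's software.

```python
from collections import defaultdict


def new_files_by_folder(local: list[str], archived: list[str]) -> dict[str, list[str]]:
    """Group local files that are not yet archived by their top-level folder.

    Assumption: every path looks like "folder/file".
    """
    already = set(archived)
    pending: dict[str, list[str]] = defaultdict(list)
    for path in sorted(local):
        if path not in already:
            folder, _, name = path.partition("/")
            pending[folder].append(name)
    return dict(pending)


# Hypothetical example mirroring the ethnomedicine collection described above.
archived = ["Santos Interview/santos.mp4", "Ortiz Interview/ortiz.mp4"]
local = archived + ["Santos Interview/santos.eaf", "Ortiz Interview/ortiz.eaf"]
print(new_files_by_folder(local, archived))
```

Here the two transcription files are reported as pending, one per folder, which corresponds to the second round of deposits in the example.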

Technical limitations on progressive archiving

Not all repositories can handle progressive archiving, and for others, practicing it will require some modifications to your workflow. For example, as of this writing, the system used by the California Language Archive does not allow new files to be added to existing folders (as we saw in the earlier example), and the system used by the Archive of the Indigenous Languages of Latin America (AILLA) only allows files under 1 GB to be added to existing folders. AILLA’s limitation makes it difficult to add audio files and impossible to add video files to existing folders, but small files can be added easily. If you intend to use a progressive archiving model, check with your repository about the feasibility of your archiving plan so that you can adjust it if necessary. Even if the archive cannot ingest your files incrementally, you should still plan to organize your files, pick an arrangement strategy, and document your processes after every data collection trip or phase. This will facilitate the archiving process when it is time to submit your entire collection and the associated metadata to your chosen repository.
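If your repository has a size limit like the 1 GB limit mentioned above, you can check in advance which files fall under it. The sketch below is illustrative only: the function name is hypothetical, the file sizes are invented, and it assumes a decimal gigabyte (1,000,000,000 bytes), which may not match how a given archive counts.

```python
ONE_GB = 1_000_000_000  # assumption: decimal gigabyte; an archive may count differently


def partition_by_size(sizes: dict[str, int], limit: int = ONE_GB) -> tuple[list[str], list[str]]:
    """Split file names into those under the limit and those at or above it."""
    under = sorted(name for name, size in sizes.items() if size < limit)
    over = sorted(name for name, size in sizes.items() if size >= limit)
    return under, over


# Invented sizes in bytes, for illustration only.
sizes = {"santos.eaf": 250_000, "photo1.jpg": 4_000_000, "santos.mp4": 3_200_000_000}
under, over = partition_by_size(sizes)
print("Can be added to an existing folder:", under)
print("Needs a different plan:", over)
```

For real files, the sizes could be gathered with `pathlib.Path(name).stat().st_size`.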

Deletion

Archives are reluctant to simply delete archived data for two important reasons. First, being able to examine data that was used in an analysis is an important aspect of reproducible and verifiable research (Berez-Kroeker, Andreassen, et al. 2018; Berez-Kroeker, Gawne, et al. 2018). So if a file (i.e., data) has been available to the public, then it might have been cited in an article, book, thesis, etc. A reader of that work might want or need to examine the data, so they go to the archive to find it (which they can do because the data was appropriately cited and listed in the references), only to learn that it is no longer there. Now the research is neither reproducible nor verifiable. Second, digital preservation involves backing up data and metadata across various media types that are stored in multiple locations, so deleting a file from an archive is not about just deleting it from one place on a server. It can actually be a complex process to delete the file from all storage and preservation media and locations.

Thus, you may not be able to delete a file once it has been archived. Therefore, it is vital to check your materials for issues like sensitivity or anonymity before you deposit them in an archive. If you discover you have archived something that you should not have, many archives will allow you to restrict access to only authorized users or for a predetermined amount of time.

Versioning

Versioning refers to the management of changes to a file. If you have already archived a file but later make changes to it, you may not be able to delete or replace it, for the reasons described above in the subsection on deletion. Instead, you will need to archive the newer version of the file as well. If different versions of the same file end up in the archive, users will need to know how the versions differ. Some archives allow innate versioning, where later versions of the same media file are added to the same item record with some documentation of the version history. However, not all archives allow innate versioning. If your chosen repository does not, you need to establish a method to indicate additional versions of files. A simple way to do this is to add a "v" for version and a version number with a leading 0 (e.g., v01, v02) to the end of the filename; another option is to make other meaningful additions to the ends of filenames. To keep related files together in sorted lists, it is better to add suffixes to filenames, not prefixes.

Suppose a video and its ELAN transcription (file extension .eaf) were archived in 2018, and in 2019 translations were added to the annotation file. In a repository that allows innate versioning, both ELAN files can be archived with the same name; the software will keep the versions distinct, and in many cases older versions will still be accessible to users. If innate versioning is not supported, the different versions of the ELAN annotation file will require different names. In this case, the two ELAN files could be labeled with either sequential version numbers or meaningful suffixes, as shown in Table 6 below.

Table 6:

Table demonstrating some file versioning methods over time and through the transcription and translation process.
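The sequential version-number approach can be sketched in a few lines of code. This is a minimal illustration, not a convention prescribed by any archive: the "_vNN" pattern, the function names, and the example file names are all assumptions made for the example.

```python
import re
from pathlib import PurePath

# Hypothetical convention: "<stem>_vNN<ext>", e.g. "santos_interview_v01.eaf".
VERSION_SUFFIX = re.compile(r"_v(\d+)$")


def versioned_name(filename: str, version: int) -> str:
    """Return the filename with a zero-padded version suffix before the extension."""
    path = PurePath(filename)
    stem = VERSION_SUFFIX.sub("", path.stem)  # drop any existing version suffix
    return f"{stem}_v{version:02d}{path.suffix}"


def next_version_name(filename: str, archived: list[str]) -> str:
    """Return the name for the next version, given the names already archived."""
    path = PurePath(filename)
    stem = VERSION_SUFFIX.sub("", path.stem)
    highest = 0
    for name in archived:
        other = PurePath(name)
        match = VERSION_SUFFIX.search(other.stem)
        same_file = other.suffix == path.suffix and VERSION_SUFFIX.sub("", other.stem) == stem
        if match and same_file:
            highest = max(highest, int(match.group(1)))
    return versioned_name(filename, highest + 1)


archived = ["santos_interview.mp4", "santos_interview_v01.eaf"]
print(next_version_name("santos_interview.eaf", archived))
```

Because the version marker is a suffix with a leading zero, an alphabetical listing keeps each file's versions next to each other and in order, at least up to version 99.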

Benefits of progressive archiving

Even though the progressive archiving model requires the depositor to think about archiving more frequently, the benefits are substantial. If you are required to archive the results of a language documentation project by your grant funder (e.g., the National Science Foundation in the US or any private or public funder in any country) or academic department, doing the archiving work in incremental stages will prevent you from missing crucial deadlines and facilitate the process so that you are not rushing to archive the entire collection all at once. By starting the archiving process from the beginning of a project, you will not have to untangle large amounts of data and reestablish relationships between the different materials that you have collected, and you will have sufficient time to create a well-organized, fully-contextualized, complete, and accurate collection. If you (alone or with collaborators) fail to satisfactorily archive data that results from your grant, you will be less likely to be awarded funding in the future. Overall, the process will be much less stressful if you do it progressively.

Furthermore, by practicing progressive archiving, you will make the results of your grant-funded research available to the public sooner than they would be if you waited until the end of the project. This increases the likelihood that the primary and secondary data will have a greater impact on language maintenance, revitalization, and education, as well as on public policy for language planning, education and funding. Publicly accessible archived materials help advance scientific and research endeavors both by providing primary and secondary data for additional research purposes, such as natural language processing and linguistic analysis, and by facilitating the citation of these same language documentation data (Berez-Kroeker et al. 2017). The diligent citation of archived data helps to ensure the reproducibility of research (Berez-Kroeker, Gawne, et al. 2018). This means that as you analyze and publish your findings, you should cite the data you have archived, so the sooner you archive the primary and secondary data, the sooner these sources will be available for citation purposes.

In the digital age, archiving and preserving data are considered a regular part of the research data management lifecycle (Briney 2015; Berez-Kroeker, Collister & Kung 2017). By practicing progressive archiving, you will develop better research data management practices overall. One of the most important lessons to learn when it comes to research data management is that it is much easier to manage your data if you do it regularly and in small batches. Similarly, it is much easier to organize and archive a language documentation collection in small and regular increments than it is to try to organize and archive a decade or more of language documentation all at once.

Finally, practicing progressive archiving will help you with your own academic or educational advancement. As soon as you have archived primary data in a digital repository, you can list that archived collection on your curriculum vitae (Johnson 2004). Even if you have only a small amount of data archived, that will still look better on your CV than if you have nothing archived. Archived collections can also be included in your grant proposals, and having previous experience with archiving any sort of data, including language documentation data, will increase your chances of getting a grant that requires that the resulting data be archived. Archived collections of language documentation should also be included in your tenure and promotion portfolio (Berez-Kroeker, Gawne, et al. 2018). Again, archiving your data as you progress through your project will help to ensure that you get the work done and reap the benefits in a timely manner.
