Step 7: Arrangement & Discoverability

Introductory video: Arrangement

Now that you have completed some or all of your planned data collection, it is time to prepare the data for deposit in your chosen digital repository. Revisit what you learned about that repository back in Step 1 so that you can prepare your deposit according to its requirements. In arranging your collection for deposit, you will need to do the following tasks:

Arrange your files into folders that can be accommodated by the structure of your chosen repository;
Increase discoverability of your files with transparent folder names; and
Possibly convert some of your file formats.

Step 7 is dedicated to carrying out these tasks.

File arrangement

File arrangement refers to how digital files are organized, ordered, and grouped together in a digital environment such as a personal hard drive or a digital archive. Digital archives typically have three units of organization. These go by many names, but in this course we call them collections, folders, and media files. Media files are the digital data or analysis objects themselves: the things you can stream or download and reuse. Folders are digital objects that contain media files, and collections are digital objects that contain folders. Folders typically contain materials that are closely related in some way, and collections contain folders of materials that pertain to a single depositor or project. A few language archives and many physical archives also have subcollections or series of folders grouped together. The terminology used for each of these levels varies from repository to repository, as seen in Table 5 below.

Table 5:

A table showing what terminology various repositories use for the words "collection" and "folder".

File Structure

Beyond terminology, archives also differ in whether they have a flat structure or allow nested folders. Many digital repositories have a flat structure in which folders may only contain media files and cannot contain other folders. This is true of most digital language archives (with the exception of The Language Archive), and may or may not be the case for an institutional or data repository.

Many people create deeply nested file structures, like the one shown in Figure 50 below, for storing their own materials. In digital language archives, however, this kind of deep structure is unwieldy for the user to navigate, and it poses challenges for the archive with respect to maintenance, preservation, and future software migrations. Furthermore, each intermediate folder would need to have its own metadata record.

Figure 50:

A graphic demonstrating a nested file structure

In contrast to the nested folder structure, most archives use a flat structure, shown in Figure 51 below, where there is just one level of folders between the collection level and the media files themselves. A flat structure facilitates archival processing, preservation, and migration, as well as user navigation.

Figure 51:

Graphic demonstrating a flat file structure

It is often much easier for a user to get a sense of the content of a collection from flat lists of folders than from nested structures. The two images below illustrate the ease of navigation in a flat structure. Looking at the top image (Figure 52), which is from the main page of a PARADISEC collection, we can immediately see that the collection has 59 folders, and we can read their titles. If the archive user clicks on a View button to the right of the first item on the list, Alor elicitation session, they will see the files inside that folder (Figure 53).

Figure 52:

A list of folders within a PARADISEC collection with flat structure.

Figure 53:

A list of files within a folder of a PARADISEC collection with flat structure.

If your chosen repository utilizes a flat structure, but you do not, you will need to arrange your files into folders that conform to this flat three-tier structure. What follows are some suggested arrangements to organize your files into folders to fit this structure.
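Before turning to those arrangements, here is a minimal Python sketch of the conversion from a nested layout to the flat three-tier structure. It assumes a hypothetical naming rule, not prescribed by any repository, in which the intermediate directory names are joined with underscores to form a single folder name. The function name plan_flat_layout and the example paths are likewise illustrative. The sketch only computes a plan and does not move any files.

```python
from pathlib import PurePosixPath


def plan_flat_layout(nested_paths, collection):
    """Map nested relative paths onto collection/folder/file (three tiers).

    Assumption: joining the intermediate directory names with "_" gives an
    acceptable folder name; adapt this rule to your repository's requirements.
    """
    plan = {}
    used_targets = set()
    for raw in nested_paths:
        path = PurePosixPath(raw)
        # Collapse every intermediate directory into one folder name.
        folder = "_".join(path.parent.parts) or "unsorted"
        target = str(PurePosixPath(collection) / folder / path.name)
        if target in used_targets:
            raise ValueError(f"two files would end up at {target}")
        used_targets.add(target)
        plan[raw] = target
    return plan


# Hypothetical nested layout on a personal hard drive.
nested = [
    "2017/Kabale/LBN/interview.wav",
    "2017/Kabale/LBN/interview.eaf",
    "2017/Kabale/WKN/interview.wav",
]
for source, target in plan_flat_layout(nested, "Rukiga_Collection").items():
    print(source, "->", target)
```

Because the plan is just a mapping from old paths to new ones, you can review it, and catch name collisions, before copying anything.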

Arrangement Strategies

While there are endless arrangement strategies, or possible ways to arrange your files into cohesive units for archiving, there are a few that are quite common. These include arrangement based on:

recording session;
language variety;
location;
speaker;
experimental protocols;
questionnaires;
replication data; and
media format.

In what follows, we discuss the “recording session” arrangement strategy in detail. Though we mention the others, a detailed discussion of more than one arrangement strategy would be redundant.

The predominant strategy used in many digital language archives to arrange collections of language documentation projects is to group together all the media files associated with a single recording session or speech event. This strategy is so prevalent that one digital language archive, The Language Archive (TLA), calls its folders “sessions”. This type of arrangement is especially well suited for materials that will be consulted individually, as the content of each recording will likely be reflected in the folder-level metadata. This arrangement results in mixed media folders that contain a variety of media and data types, including some primary audio-visual recordings along with their annotations and any other media associated with the event. For example, a recorded interview about ethnobotany might be accompanied by a transcript of the interview, photographs of plants discussed by the interviewee, and a tabular dataset indicating the plants’ names, taxonomic classifications, and reported uses.

Figure 54:

Image shows a file structure where mixed media bundles keep related files of different file types together.

In the example shown above in Figure 54, the collection “Rukiga Collection of Lydia Linguee” has two session folders, 20171226_Kabale_LBN and 20171226_Kabale_WKN. Each folder contains four files of different media formats, one each of WAV, MP4, EAF, and JPG. The WAV and MP4 files contain the primary data, which is the audio and video recordings; the EAF (ELAN) file contains the time-aligned transcription, translation, and annotation; and the JPG file is a photograph of the session participants.

In cases in which multiple audio and video files result from a single recording session (e.g., a new track is written each time a recorder is stopped or paused; very long recordings are automatically divided into multiple files), it is not necessary to create a separate folder for each discrete recording. In these situations, all audio and/or video files that result from a single recording session should be arranged together in one folder. For example, the PARADISEC folder shown in Figure 55 below contains four primary video recordings (MP4 files), plus their associated annotations and derivative files, demonstrating the different steps in the process of preparing kava.

Figure 55:

Recording session arrangement with 33 files of various formats, including 4 MP4 files.
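
The session-based arrangement described above can be sketched in Python as follows. The sketch assumes a hypothetical filename convention of date, place, and speaker initials separated by underscores, optionally followed by a part number, modeled on the folder names in Figure 54. The function name group_by_session and the filenames are illustrative only.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_session(filenames, key_fields=3):
    """Group files into session folders by the leading fields of their names.

    Assumption: the first `key_fields` underscore-separated fields of each
    filename (here date, place, speaker) identify the recording session.
    """
    sessions = defaultdict(list)
    for name in filenames:
        stem = PurePosixPath(name).stem
        session_id = "_".join(stem.split("_")[:key_fields])
        sessions[session_id].append(name)
    # Sort folders and their contents so the result is stable and readable.
    return {sid: sorted(files) for sid, files in sorted(sessions.items())}


# Hypothetical files; the second session was split into two audio tracks.
files = [
    "20171226_Kabale_LBN.wav",
    "20171226_Kabale_LBN.mp4",
    "20171226_Kabale_LBN.eaf",
    "20171226_Kabale_LBN.jpg",
    "20171226_Kabale_WKN_01.wav",
    "20171226_Kabale_WKN_02.wav",
]
for folder, members in group_by_session(files).items():
    print(folder, members)
```

Note that the two tracks from the second session land in a single folder rather than in one folder per recording.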

Other arrangement strategies include grouping files together according to language variety, location, or speaker. Arrangements like this may be good for collections covering diverse language varieties or locations.

Data collected through experiments or questionnaires may be organized according to the protocol, prompt, wordlist, etc. that was used to gather the materials. All data that are used for a particular analysis or study can be grouped together, which is especially helpful if there was a resulting publication. This type of arrangement is especially well-suited for data repositories, where such folders may be identified as “replication data” since they contain all the data and supporting documentation needed to produce the same results obtained in the corresponding article.

If you want to arrange your media files so that they may be easily used as data for natural language processing tasks, you should group only files of the same format and language together in a folder. In this arrangement, all WAV audio files would be grouped together in one folder, all video files of the same format would be grouped together in another folder, and so on.
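
A format-based arrangement of this kind might be planned with a short Python sketch like the one below. It assumes you already know the language of each file and can supply it alongside the filename. The function name, the folder naming pattern (language, underscore, extension), and the example files are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_language_and_format(files):
    """Group (filename, language) pairs into one folder per language and format.

    Assumption: the file extension is a reliable indicator of the format.
    """
    folders = defaultdict(list)
    for filename, language in files:
        # Normalize the extension so "A.WAV" and "b.wav" share a folder.
        ext = PurePosixPath(filename).suffix.lstrip(".").lower() or "noext"
        folders[f"{language}_{ext}"].append(filename)
    return dict(folders)


# Hypothetical input: each file is paired with the language it contains.
files = [
    ("story_01.wav", "Rukiga"),
    ("story_02.WAV", "Rukiga"),
    ("story_01.mp4", "Rukiga"),
]
for folder, members in group_by_language_and_format(files).items():
    print(folder, members)
```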

Finally, bear in mind that you might choose to use different arrangement strategies for different parts of your data, especially if you used different data collection methods in the field. For example, stimulus-based elicitation sessions might be arranged according to the task, protocol, or speaker while traditional stories might be arranged according to the recording session.

Discoverability

Having detailed metadata at all levels of the archive’s structure will improve the discoverability of the materials in your collection. Not only does this help users find and retrieve items, it also helps them see the connections between files. Additionally, writing clear, concise, and relevant titles is important because materials with good titles are more easily found and therefore more widely reused. For academic articles, “good” titles are shorter titles that describe research results (Paiva et al., 2012).

Creators of language collections often rely on filenames to indicate the relatedness of files. Many linguists, for example, are trained to give annotations and other derivative files the same name as the recording they are related to, creating filenames that differ only in their extensions. So an original video recording (CAM20548_01.mpg) could be accompanied by a file containing the isolated audio of the video (CAM20548_01.wav), a free translation (CAM20548_01.txt), and a time-aligned transcription (CAM20548_01.eaf). The relatedness of these files is implicit from their similar filenames. If related files are placed in separate folders or have very different filenames, relationships between files should be made explicit in the descriptive metadata.
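
As a rough illustration of this check, the Python sketch below lists the files whose relationship to a recording is not implicit in their name and folder, and which would therefore need explicit relationships in the descriptive metadata. The set of extensions treated as primary recordings, the function name, and the example paths are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Assumption: these extensions mark primary recordings in your collection.
PRIMARY_EXTENSIONS = {".wav", ".mp4", ".mpg"}


def files_needing_explicit_relations(paths):
    """Return derivative files with no same-named recording in their folder."""
    parsed = [PurePosixPath(p) for p in paths]
    # Every (folder, base name) pair that has a primary recording.
    primary_keys = {
        (p.parent, p.stem) for p in parsed if p.suffix.lower() in PRIMARY_EXTENSIONS
    }
    return [
        str(p)
        for p in parsed
        if p.suffix.lower() not in PRIMARY_EXTENSIONS
        and (p.parent, p.stem) not in primary_keys
    ]


# Hypothetical collection: one translation was renamed and moved elsewhere.
paths = [
    "session1/CAM20548_01.mpg",
    "session1/CAM20548_01.wav",
    "session1/CAM20548_01.eaf",
    "notes/translation_final.txt",
]
print(files_needing_explicit_relations(paths))
```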

When giving titles to the folders and items in your collection, take care not to rely on particularities of your repository’s current display to identify items. For example, one collection at AILLA contains 59 folders with the same title “Swadesh list and OKMA special list”. As you can see in Figure 56 below, the original arrangers were relying on the repository’s display to distinguish between the different folders through the alphanumeric order of the resource IDs (KEK002R061, etc., KJB003R001, etc. in the leftmost column) that were used to sort the folders in the collection, as well as through the display of the language names (Q’eqchi’, Q’anjob’al in the middle column).

Figure 56:

Collection arranged according to the alphanumeric display of the repository.

After the repository was migrated to new software, folder titles were displayed alphabetically, and it was no longer possible to display the language name or resource ID alongside them. The result, shown in Figure 57 below, was that a user browsing the collection had no easy way to determine which of the 59 folders with the title “Swadesh list and OKMA special list” contained the Q’anjob’al lists they were looking for without opening each folder to check the metadata within. The moral of this story is that you should disambiguate your folder-level titles as much as possible.

Figure 57:

The same collection from above is no longer arranged according to the alphanumeric ordering and it now appears to have multiple folders of the same name.
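
One way to catch this problem before deposit is to check your folder titles for duplicates. The Python sketch below is a minimal illustration, assuming each folder is described by a hypothetical record with id, language, and title fields. It appends the language name and resource ID to any title that occurs more than once. This is just one possible disambiguation scheme, not a repository requirement.

```python
from collections import Counter


def disambiguate_titles(folders):
    """Return {folder id: title}, making repeated titles unique.

    Assumption: each folder is a dict with "id", "language", and "title".
    """
    counts = Counter(folder["title"] for folder in folders)
    titles = {}
    for folder in folders:
        title = folder["title"]
        if counts[title] > 1:
            # Put the distinguishing details into the title itself, so it
            # no longer depends on how the repository displays the list.
            title = f"{title} ({folder['language']}, {folder['id']})"
        titles[folder["id"]] = title
    return titles


# Hypothetical records modeled on the example above.
folders = [
    {"id": "KEK002R061", "language": "Q’eqchi’", "title": "Swadesh list and OKMA special list"},
    {"id": "KJB003R001", "language": "Q’anjob’al", "title": "Swadesh list and OKMA special list"},
]
for folder_id, title in disambiguate_titles(folders).items():
    print(folder_id, title)
```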

Another consideration when naming folders is to aim for titles that can be easily understood by archive users. Sometimes folder titles in language archives contain abbreviations that users cannot readily interpret. For example, in the selection of folder titles in Figure 58 below, the user may not know that “EV”, “VA”, “CW”, and “TT” are the initials of the person being recorded, or that “Tapalla”, “Villaflor”, and “Villafranca” are the names of the communities where the recording took place. Since this information will ideally also appear in other metadata fields, it may be better to leave it out of folder titles to free up space for more descriptive titles.

Figure 58:

Screenshot of some of the folders in the Yauyos Quechua collection which use place and interviewee identifiers as titles.


Archiving for the Future