Step 4: Language metadata and tags

Introductory video: Language Metadata in AILLA

Language Metadata

It is especially important that your metadata indicate all of the languages that are represented in your collection. In addition to identifying the language(s) that are the focus of the collection, you should also identify any other languages that appear in the materials, such as any lingua franca or national or contact languages that you used to conduct the research or complete the analysis. That way, users can know which languages appear in written or spoken translations or in notes, or were otherwise used in the creation of the collection.

Languages are almost always indexed in metadata records using an authority file of standard codes developed by the International Organization for Standardization (ISO, iso.org). Many libraries and repositories use either the two-letter codes of the ISO 639-1 standard or the three-letter codes of the ISO 639-2 standard. Both of these standards are useful for identifying the languages of most published books, but neither covers many of the Indigenous or lesser-known languages that are likely to appear in language data collections. A more fine-grained standard is ISO 639-3 (see Figures 37 and 38 below). This more detailed standard can precisely identify languages such as Isthmus Zapotec (zai) or Warlpiri (wbp), which under ISO 639-2 would have to be indicated by a less precise language family label (zap for Zapotec languages) or geographic label (aus for “Australian languages”), respectively.

Figure 37:

Example of an ISO 639-3 language code, "bao", in use to classify the language Bará.

However, be aware that some languages might not have an ISO 639-3 language code, or there might be a problem with the code that has been assigned. A language may not have been classified, especially if it has been identified only recently, or there may not be a good one-to-one match between what you consider the language to be and the ISO 639-3 code applied to it (for example, a single ISO 639-3 code could be applied to what you consider two distinct languages, or two different ISO 639-3 codes might have been applied to two language varieties that you consider to be the same language). The ISO 639-3 registry is maintained by SIL International, and the code tables are publicly available at https://iso639-3.sil.org. This organization also maintains Ethnologue (https://www.ethnologue.com/), a subscription-based online encyclopedic database of information about the languages identified in the registry. Some freely accessible online encyclopedic databases that report information about languages along with their assigned ISO 639-3 codes include Wikipedia and Glottolog (https://glottolog.org).

When collecting the metadata about your own collection, you should identify each language that appears in it by specifying its ISO 639-3 language code as well as its name. Including the language code in your metadata will also help to disambiguate between different languages that have the same or similar names (e.g., many languages in the Vaupés region along the border of Colombia and Brazil may be called by a form of the pejorative name Maku) or to identify a language that has multiple names (e.g., Popti’, a Mayan language of Guatemala, is also called Jakalteko, which used to be spelled Jacalteco). Identifying the ISO 639-3 code can also be very helpful for digital repositories that are not dedicated language archives, since they likely will not have staff familiar enough with the materials to accurately identify the languages and their codes without your assistance.
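As a rough illustration of this advice, the sketch below records each language with both its ISO 639-3 code and its name, and checks only that each code has the expected shape (three lowercase letters). The record layout, the `role` field, and the names `LanguageEntry` and `has_iso639_3_shape` are hypothetical and do not reflect any particular archive's schema; the pairing of Isthmus Zapotec with Spanish as a contact language is likewise just an example. A shape check cannot confirm that a code is actually assigned in the registry, so codes should still be looked up in the published code tables.

```python
import re
from dataclasses import dataclass

# ISO 639-3 codes are three lowercase ASCII letters.
CODE_SHAPE = re.compile(r"[a-z]{3}")


@dataclass(frozen=True)
class LanguageEntry:
    code: str  # ISO 639-3 code, e.g. "zai"
    name: str  # the name you use for the language
    role: str  # hypothetical field: "subject" or "contact"


def has_iso639_3_shape(code: str) -> bool:
    """True if the string looks like an ISO 639-3 code (shape only)."""
    return CODE_SHAPE.fullmatch(code) is not None


# Illustrative collection: one focus language and one contact language.
languages = [
    LanguageEntry("zai", "Isthmus Zapotec", "subject"),
    LanguageEntry("spa", "Spanish", "contact"),
]

# Entries whose code is missing or badly formed need attention before deposit.
badly_formed = [entry for entry in languages if not has_iso639_3_shape(entry.code)]
```

Storing the code alongside the name means a record stays unambiguous even if a repository later displays a different name or spelling for the same language.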

Figure 38:

Image shows an archive listing for a recording of an explanation of weaving tools, and demonstrates how two languages appear in the recording and are both represented by their ISO 639-3 codes.

Tags

Your archive may allow you to create tags (also known as keywords or labels) for your materials. With a user-generated tagging system, you can add keywords to folders or media files (see Figures 39 and 40 below); if other materials (even materials outside your own collection) have that same tag applied, a user can then navigate among them. Some depositors prefer having the freedom to apply exactly the labels they choose to their own materials rather than having to depend solely on a possibly restrictive controlled vocabulary, but systems relying on user-generated tags have their own pitfalls.

Figure 39:

Screenshot of a search being performed on a database. On the left, tags are called "keyword term".

Figure 40:

Screenshot of an archive listing. Here, tags are labeled "keyword".

User-generated tags often must match exactly in order for linking between them to work, so care must be taken to always write tags in exactly the same way, avoiding typos (“language” vs. “langauge”), differences in style (“Sun and Moon legend” vs. “Sun & Moon legend”), and possibly even differences in leading or trailing spaces (“myth” vs. “myth ”) or capitalization (“tortillas” vs. “Tortillas”).
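One way to catch such mismatches before depositing is to compare tags after normalizing them. The sketch below is a minimal example, assuming you have your tags available as a plain list of strings; `normalize_tag` and `group_variants` are hypothetical helper names, and the normalization rules (ignore case, trim spaces, treat "&" as "and") are illustrative choices rather than the behavior of any archive's system. It does not detect typos such as "langauge", which still need to be checked by eye.

```python
import re


def normalize_tag(tag: str) -> str:
    """Return a comparison key for a tag (hypothetical helper)."""
    key = tag.strip().casefold()       # ignore surrounding spaces and case
    key = key.replace("&", " and ")    # treat "&" and "and" alike
    return re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace


def group_variants(tags):
    """Map each normalized key to the distinct spellings found for it,
    keeping only keys that were written in more than one way."""
    groups = {}
    for tag in tags:
        groups.setdefault(normalize_tag(tag), set()).add(tag)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# Illustrative tags, including the kinds of variation described above.
tags = ["myth", "myth ", "Tortillas", "tortillas",
        "Sun and Moon legend", "Sun & Moon legend", "weaving"]
inconsistent = group_variants(tags)
```

Each entry in `inconsistent` lists spellings that a reader would treat as the same tag but that an exact-match system would keep apart, so you can pick one form and apply it throughout.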

Bear in mind, however, that user-generated tags are often not very useful for navigation across collections, since tags can be so precise that they apply to only a few items. One study of tags used at ELAR and TLA found that 40.3% and 47.8% of tags, respectively, were used only once (Sullivant 2020). Even so, you may find useful ways to use tags within your own collection. For example, the status of materials related to an ongoing project could be indicated with tags such as “transcribed”, “translated”, or “revised”.
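To get a sense of how scattered the tags in your own collection are, you could count what share of your distinct tags is applied to only one item. The sketch below is one simple way to do that, assuming each item's tags are available as a list of strings; the function name `singleton_share` and the sample data are hypothetical, and this is not necessarily how the figures in the study cited above were computed.

```python
from collections import Counter


def singleton_share(tags_per_item):
    """Fraction of distinct tags that are applied to exactly one item."""
    # Use set() so a tag repeated on the same item is counted once for it.
    counts = Counter(tag for tags in tags_per_item for tag in set(tags))
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Illustrative data: three items, each with its own list of tags.
tags_per_item = [
    ["transcribed", "weaving"],
    ["transcribed", "translated"],
    ["translated", "myth"],
]
share = singleton_share(tags_per_item)  # 2 of the 4 distinct tags occur once
```

A high share suggests that many tags describe single items rather than linking related materials, which may be a cue to merge near-duplicates or to adopt a smaller, more consistent set of tags.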
