Step 3: Audio and video formats

Audio formats

For spoken languages, audio files are often the focus of a language data collection. For the most part, following best practices for recording audio files (e.g., Margetts and Margetts 2012; Casey and Gordon 2007) will result in files that are suitable for archiving—specifically, uncompressed WAV files (Pulse Code Modulation, or PCM, encoding) with an appropriate number of channels, sample rate, and bit depth, as described below.

Compressed vs. uncompressed formats

Audio files can be either compressed or uncompressed. Whenever possible, people recording spoken language should set their digital recorders to produce uncompressed WAV files rather than compressed MP3 files. Compressed files are produced by performing transformations on the original signal and discarding some information, producing a lossy file (i.e., a file from which some data has been permanently discarded). Since compression algorithms tend to discard information that listeners are unlikely to hear, many lossy compressed files are fine for most purposes; for example, most podcasts use compressed MP3 files. Language documentation recordings, on the other hand, are often used or reused for acoustic analysis, in which computer systems inspect an audio signal to measure properties such as voicing, phonation type, pitch, or intensity. Uncompressed audio is preferred for this kind of analysis, as parts of the acoustic signal may be discarded or distorted by compression algorithms, which can lead to less reliable measurements. Thus, uncompressed audio formats are preferred over compressed audio formats for language archives, despite their larger size.

Note that since information is discarded in lossy compression, converting a compressed audio file into an uncompressed format will not improve the audio signal (and, according to Fuchs and Maxwell (2016: 526), may make it less reliable for some acoustic measures) since the discarded information cannot be recovered, even though the resulting uncompressed file will likely be larger than its compressed source. Therefore, if only compressed MP3 files were made for a given session, for example, you should not convert those MP3 files to WAV files unless directed to do so by your archive.
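
If you want to verify what your recorder actually produced before depositing files, a WAV file's header can be inspected with a few lines of code. Below is a minimal sketch using Python's built-in wave module, which only opens uncompressed PCM WAV files; the file name is a placeholder.

```python
import wave

def describe_wav(path):
    """Print the properties of an uncompressed PCM WAV file.

    Python's built-in wave module only opens PCM-encoded WAV files,
    so a successful open is itself a rough check that the file is
    not compressed.
    """
    with wave.open(path, "rb") as w:
        print(path)
        print("  channels:   ", w.getnchannels())              # 1 = mono, 2 = stereo
        print("  sample rate:", w.getframerate(), "Hz")
        print("  bit depth:  ", w.getsampwidth() * 8, "bits")  # bytes per sample * 8
        print("  duration:   ", round(w.getnframes() / w.getframerate(), 1), "s")

describe_wav("session_001.wav")  # hypothetical file name
```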

Channels

Audio files typically contain either a single audio stream (mono, Figure 29) or two audio streams (stereo, Figure 30), usually intended to be played out of different speakers (left and right). The effect of stereo sound comes from slight differences between the two audio channels. This can be the result of intentional decisions made in producing a recording of a song (e.g., making backing vocals more prominent on the left or right channel) or the result of multiple microphones placed in different locations. In some cases the microphones may be close together and record two largely similar audio channels. In other cases, the microphones may be configured to record completely different signals. One example of this in language documentation fieldwork is when two speakers in conversation each wear a lapel microphone meant to capture their voice alone, with one speaker recorded on the left channel and the other on the right. The drawback of recording in stereo is that an uncompressed stereo recording will be about twice as big as a single-channel mono recording of the same length.

Figure 29:

Image depicts a mono stream of audio.

Figure 30:

Image depicts two audio channels, which together produce one stereo recording.

While there may be very good reasons not to convert stereo files to mono files (e.g., the each-speaker-on-a-different-channel example mentioned earlier), recording and storing stereo audio files in many cases results in much larger files with little improvement in how those files will be heard or reused. Additionally, many workflows for processing audio files (such as forced aligners or speech recognizers) do not work with stereo files, or will convert them to mono before processing.
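
For the two-lapel-microphone scenario described above, each channel can be pulled out into its own mono file. Here is a minimal sketch, assuming the third-party soundfile library (pip install soundfile) and hypothetical file names:

```python
import soundfile as sf  # third-party; reads WAV data into a NumPy array

# Hypothetical conversation recorded with one speaker per channel.
data, rate = sf.read("conversation_stereo.wav")

if data.ndim == 2 and data.shape[1] == 2:
    # Write one mono file per speaker.
    sf.write("speaker_left.wav", data[:, 0], rate)
    sf.write("speaker_right.wav", data[:, 1], rate)

    # Or average the channels into a single mono file (only
    # appropriate when the two channels are near-duplicates).
    sf.write("downmix_mono.wav", data.mean(axis=1), rate)
```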

Sample rate

The sample rate (Figure 31) of an audio recording refers to the number of snapshots of sound recorded in each second of audio, similar to the frame rate of film and video recordings. Just as a low frame rate can produce a video that is jerky and difficult to watch, an audio recording with a low sample rate can be poor quality. Three of the most common sample rates (measured in kilohertz, or kHz) are 44.1 kHz (the CD audio standard), 48 kHz (the DVD audio standard), and 96 kHz (the Blu-ray Disc and HD DVD audio standard). Each of these sample rates allows a recording to capture frequencies up to half the stated rate (the Nyquist frequency): a file with a 96 kHz sample rate will capture sound frequencies up to 48 kHz, a sample rate of 48 kHz will capture frequencies up to 24 kHz, and so on. Since humans cannot hear sounds above about 20 kHz, a sampling rate of 44.1 kHz will capture all frequencies that people can hear. For most speech science purposes, there is little advantage to recording and storing audio files at sample rates above 48 kHz, especially since higher rates produce much larger files containing information about frequencies that are irrelevant to human speech production and perception.

Figure 31:

Depicts a sound wave crossing the x-axis. Widely spaced vertical lines above the axis represent a low sample rate; closely spaced vertical lines below the axis represent a high sample rate.
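
The halving relationship described above is simple enough to check directly. A trivial sketch of the arithmetic:

```python
# The Nyquist relationship: a given sample rate captures frequencies
# only up to half its value. Human hearing tops out near 20 kHz.
for rate_khz in (44.1, 48, 96):
    print(f"{rate_khz} kHz sampling captures frequencies up to {rate_khz / 2} kHz")
```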

Bit depth

The bit depth refers to how much information is captured at each sample, similar to the resolution of an image. A bit depth of 16 means that 16 bits of information are captured at each sample (65,536 possible amplitude values), a bit depth of 24 captures 24 bits per sample, and so on. The advantage of higher bit depths is that a greater dynamic range can be recorded (roughly 6 dB per bit, so about 96 dB at 16 bits and 144 dB at 24 bits), which makes for better recordings. For most purposes, 16-bit and 24-bit recordings are excellent, especially for recordings where the focus is human speech. While higher bit depths (such as 32) are theoretically better, any benefit to the recording will be marginal, whereas there will be a marked increase in the size of the audio files. Table 4 below shows estimated file sizes in megabytes (MB) for one hour of uncompressed single-channel (mono) WAV audio at various sample rates and bit depths. Stereo recordings will be twice as large as the numbers shown in the table.

Table 4:

Estimated WAV file sizes for 1 hour of mono audio at common sample rates and bit depths (1 MB = 1,000,000 bytes):

Sample rate    16-bit      24-bit      32-bit
44.1 kHz       317.5 MB    476.3 MB    635.0 MB
48 kHz         345.6 MB    518.4 MB    691.2 MB
96 kHz         691.2 MB    1,036.8 MB  1,382.4 MB
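
These estimates follow from simple arithmetic: an uncompressed PCM file stores sample rate × (bit depth ÷ 8) bytes per channel per second. A short sketch that reproduces the table's numbers:

```python
# Uncompressed PCM audio size: sample_rate * (bit_depth / 8) bytes
# per channel per second; here 1 MB is taken as 1,000,000 bytes.
SECONDS_PER_HOUR = 3600

for rate_hz in (44_100, 48_000, 96_000):
    for bit_depth in (16, 24, 32):
        size_mb = rate_hz * (bit_depth / 8) * SECONDS_PER_HOUR / 1_000_000
        print(f"{rate_hz / 1000:>4.1f} kHz, {bit_depth}-bit mono: {size_mb:7.1f} MB per hour")
```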

The same information is also shown in the graph in Figure 32 below, where the file size in MB is shown on the y-axis, the sample rate in kilohertz is shown on the x-axis, and the three different bit depths are represented by the colored lines on the graph.

Figure 32:

Graphical representation of the data in Table 4.

It is possible to produce audio files at sampling rates greater than the 44.1 kHz or 48 kHz rates suggested by most dedicated language archives. An audio file produced at a greater sampling rate will be significantly larger, with little to no advantage for the file's use in most applications, including acoustic analysis. Similarly, it is likely unnecessary to preserve audio files with a bit depth greater than 24 bits. While you are unlikely to produce any single audio file that is too big to be supported by most repository software, a collection with many high-sample-rate audio files will place a greater burden on the archive's infrastructure and storage costs. This is why many digital repositories charge fees or place limits on the size of individual files, the size of the total collection, or both.
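
If you do end up with recordings at a higher rate than your archive wants, they can be downsampled before deposit. A minimal sketch, assuming the third-party soundfile and scipy libraries and placeholder file names:

```python
import soundfile as sf                   # pip install soundfile
from scipy.signal import resample_poly   # pip install scipy

# Hypothetical 96 kHz recording to be converted to 48 kHz.
data, rate = sf.read("elicitation_96k.wav")
assert rate == 96_000

# resample_poly filters and resamples by a rational factor;
# 48000/96000 reduces to 1/2. axis=0 handles mono or stereo data.
converted = resample_poly(data, up=1, down=2, axis=0)
sf.write("elicitation_48k.wav", converted, 48_000)
```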


Video formats

While the landscape of audio formats is fairly stable, video formats change relatively quickly, and digital cameras are likely to produce files in proprietary formats that are not supported by your archive's software. Even more so than with audio files, you should familiarize yourself with what your equipment produces and what your archive can accept. The US National Archives considers AVI, MOV, WMV, MP4, MPG (MPEG-2), and MXF to be acceptable formats for their digital video collection. Seyfeddinipur and Rau (2020) recommend widely supported versions of MP4 and MOV video files for language documentation and archiving. You may have to convert your video files to one of these or another supported format before delivering them to your archive.

Video files tend to be much larger than audio or text-based files; a 30-minute video file can easily be 6 GB or larger, depending on the camera and recording settings. You should become familiar with the various options on your video camera for manipulating the frame rate, bitrate, and resolution. While the higher settings will produce better quality video, they will also result in much larger files. Also, the longer the recording lasts, the larger the file will be; thus, many video cameras limit recording time to 30 or 60 minutes per file, depending on the selected recording parameters.
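
The arithmetic behind such sizes is straightforward: file size is roughly bitrate × duration ÷ 8. A sketch of the estimate, assuming a hypothetical camera bitrate of 28 Mbps:

```python
# Rough video size: bitrate (bits per second) * duration / 8 bytes.
bitrate_bps = 28_000_000          # hypothetical 28 Mbps camera setting
duration_s = 30 * 60              # a 30-minute recording
size_gb = bitrate_bps * duration_s / 8 / 1_000_000_000
print(f"about {size_gb:.1f} GB")  # about 6.3 GB
```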

Note that video compression is less of an issue for most language documentation collections than audio compression is. The details of a video stream are less likely to be studied pixel by pixel than the details of an audio signal are to be studied bit by bit in acoustic analysis. A notable exception is language documentation collections focusing on signed languages or gesture, where small details of moving hands or non-manual gestures are likely to be studied. Especially for these kinds of materials, any compressed or converted video should be checked to make sure that the image (and especially the moving parts of the image) is not excessively blurry.

Size and resolution

When considering an appropriate archive for your collection, you need to be aware that the archive might impose limits on the total size of your collection or the number of files it can contain. There may be a limit to the number of files that can be placed inside a folder, or there may be a hard or soft limit to the size of files an archive can support. 

This will most likely be an issue for video files with high resolutions and/or long running times. Today's language documenter goes to work with cameras that produce very large files that can be costly to archive and bothersome to process en masse, even if they can be supported technically. When producing video, you may want to consider sending the archive smaller versions of the video files you are collecting. Reducing the overall size of a file could be achieved by decreasing the frame size, or resolution, of the image (see Figure 33 for some common frame sizes), lowering the frame rate or bitrate, making shorter recordings, or encoding the video file with a different codec (see the sketch after Figure 33). You should practice manipulating your recording settings and durations before you start to collect video data.

Figure 33:

Image compares common video resolutions
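
One common way to reduce frame size is re-encoding with the free ffmpeg tool. The following is a hedged sketch, assuming ffmpeg is installed and using placeholder file names and settings; check your archive's guidance before choosing output parameters.

```python
import subprocess

# Downscale a hypothetical 4K recording to 1280 pixels wide
# (scale=1280:-2 keeps the aspect ratio and an even height),
# re-encoding the video as H.264 and copying the audio as-is.
subprocess.run(
    [
        "ffmpeg", "-i", "interview_4k.mp4",
        "-vf", "scale=1280:-2",
        "-c:v", "libx264", "-crf", "23",  # libx264's default quality level
        "-c:a", "copy",
        "interview_720p.mp4",
    ],
    check=True,
)
```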

At the same time, most audio recording equipment is capable of producing files with resolutions much greater than what is necessary for most uses. While you are unlikely to produce a single audio file too big for archival software to handle, it is important to weigh the advantages of larger files against the costs and impracticalities of storing and handling the (often much) larger files. Oftentimes, applications that linguists may use to process their audio files, such as speech recognizers, are incompatible with higher-resolution audio. For most collections, audio files at either 16-bit/44.1 kHz or 24-bit/48 kHz offer a good balance between quality and size. Resolutions below 16-bit/44.1 kHz should be avoided.

Whether and how audio and video data is compressed is determined by the file's codecs. Codecs vary widely, and files with the same extension can be encoded with very different codecs. File extensions only specify the container format, and the files themselves can contain different kinds of data streams encoded by various codecs. If container formats like MOV and MP4 are like the body of a car, codecs are the engine. Re-encoding a file with a different codec is like swapping out the engine of a car: it will still look the same on the outside (i.e., it has the same extension), people seeing it drive around may or may not notice a difference (the file still plays in various media players), but there can be differences in the weight of the car and its performance (file size and compression quality). Some cameras produce video files that use a non-standard codec for their container format, resulting in a much larger file than is needed for many purposes. Re-encoding such a file with a more widely supported codec can bring its size down considerably and avoid playback issues, with little to no visible difference in audio and video quality on most players and screens.
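
To look under the hood and see which codecs a container actually holds, the ffprobe tool that ships with ffmpeg can list each stream's codec. A sketch, again assuming the tool is installed and using a placeholder file name:

```python
import subprocess

# Print the codec of every stream (video, audio, ...) in the file.
result = subprocess.run(
    [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=codec_name,codec_type",
        "-of", "csv=p=0",
        "recording.mov",
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # e.g. "h264,video" and "aac,audio"
```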

Your archive may have guidance on what audio and video codecs they support or prefer. If they do not, you can follow general suggestions for language documentation collections, which recommend producing files in either the MPEG-4 Part 14 (MP4) or QuickTime (MOV) container using the widely supported H.264/MPEG-4 AVC codec. You may also consider producing a separate WAV audio file alongside an MP4 video file, since the MP4 file's audio is typically in the lossy (though high-quality) AAC format. For further detailed reading on video file formats for language documentation research, see Seyfeddinipur and Rau (2020).
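
Under those suggestions, converting a camera original to a widely supported H.264 MP4 and extracting a separate WAV of the audio track might look like the following sketch (assuming ffmpeg is installed; file names and parameter values are placeholders to adapt):

```python
import subprocess

# Re-encode a camera file as H.264 video with AAC audio in an MP4 container.
subprocess.run(
    [
        "ffmpeg", "-i", "camera_original.mov",
        "-c:v", "libx264", "-crf", "18",  # visually near-lossless H.264
        "-c:a", "aac",
        "narrative_01.mp4",
    ],
    check=True,
)

# Also extract the audio as an uncompressed 48 kHz mono PCM WAV,
# since the MP4's AAC audio is lossy.
subprocess.run(
    [
        "ffmpeg", "-i", "camera_original.mov",
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "48000", "-ac", "1",
        "narrative_01.wav",
    ],
    check=True,
)
```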

Figure 34:

Shows two people in front of a white board, identifying the important features of file types: open, lossless, unencrypted, unstructurable, and non-executable.
