Step 7: Converting file formats
Even if you did your best to select open, enduring formats before you started your data collection, you might nevertheless have files (especially video files) in a format that is not supported by your chosen repository. If you do have unsupported file types, you will need to convert those files into supported formats before you submit them to the repository. The sections below discuss the following file types:
- Audio
- Video
- Text
- Images
- Databases and tabular data
- Zipped files
Audio formats
Audio formats are quite stable, and most audio recorders create WAV files, MP3 files, or both. Most modern equipment will produce files that are at or above the minimum specifications discussed in Step 3, but you may find yourself with a set of recordings that do not conform to these specifications. For example, your remote fieldwork may have relied on a messaging app that only produces compressed audio files, you may have a collection of lower-resolution audio files (below 16/44.1 kHz) created by an earlier digitization project, or you may have audio files recorded at a very high resolution (such as 48/96 kHz or above). In cases like these, your archive may be able to do some conversions for you, and free applications for converting between audio formats are also available. One very good application for audio-visual formats is FFmpeg. Although FFmpeg is a command-line tool with its own commands and options, there are applications with graphical interfaces that can simplify the process, such as Axiom and HandBrake (among others). If you are already comfortable writing simple scripts in a programming language, you may be able to find tools for running FFmpeg commands from that language so that you can, for example, write a short Python script to select, process, and rename all the audio files in a directory rather than process files one by one.
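As a rough illustration of the kind of short script described above, the following Python sketch selects every WAV file in a directory, builds an FFmpeg command that rewrites it as 16-bit PCM at 44.1 kHz, and gives the output a new name. The function names, the "_44k" suffix, and the target specification are assumptions made for this example, not requirements of any particular archive; check your repository's guidelines before choosing a target format. FFmpeg must be installed separately for convert_all to work.

```python
from pathlib import Path
import subprocess


def build_ffmpeg_command(src: Path, dst: Path, sample_rate: int = 44100) -> list[str]:
    """Return an FFmpeg command that writes `src` as 16-bit PCM WAV at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-n",                      # never overwrite an existing output file
        "-i", str(src),            # input file
        "-ar", str(sample_rate),   # output sampling rate in Hz
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM audio
        str(dst),
    ]


def plan_conversions(in_dir: Path, out_dir: Path, suffix: str = "_44k") -> list[tuple[Path, Path]]:
    """Pair every WAV file in `in_dir` with a renamed output path in `out_dir`."""
    sources = sorted(p for p in in_dir.iterdir() if p.suffix.lower() == ".wav")
    return [(src, out_dir / f"{src.stem}{suffix}.wav") for src in sources]


def convert_all(in_dir: Path, out_dir: Path) -> None:
    """Run FFmpeg once per planned conversion (requires FFmpeg on the PATH)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src, dst in plan_conversions(in_dir, out_dir):
        subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Keeping the command construction separate from the step that runs FFmpeg lets you print and inspect the planned commands before any file is touched.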
When converting audio files, you should take care to avoid creating larger files that do not actually have an improved audio quality. This means you should not create uncompressed audio files from lossy originals, and you should not create higher-resolution audio files from lower-resolution files. To say this in the most basic terms, do not convert a lossy, low-resolution MP3 file into an uncompressed, high-resolution WAV file; the result will be a much larger file with no improvement in the sound quality.
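If you are scripting your conversions, the rule above can be written down as a small check that runs before any file is converted. This is only a sketch: the AudioSpec class and the function name are hypothetical, and you would need to fill in the values for each file yourself (for example, from your recorder settings or a media inspection tool).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSpec:
    lossy: bool        # True for compressed formats such as MP3
    sample_rate: int   # in Hz, e.g. 44100
    bit_depth: int     # bits per sample (a nominal value for lossy files)


def is_pointless_upconversion(source: AudioSpec, target: AudioSpec) -> bool:
    """True if the target would be larger than the source without sounding better."""
    lossy_to_uncompressed = source.lossy and not target.lossy
    higher_rate = target.sample_rate > source.sample_rate
    higher_depth = target.bit_depth > source.bit_depth
    return lossy_to_uncompressed or higher_rate or higher_depth
```

For instance, converting a 22.05 kHz MP3 into a 96 kHz WAV would be flagged, whereas reducing a 96 kHz WAV to 44.1 kHz would not.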
Video formats
While audio file formats are relatively standardized, video container formats, and especially the codecs used to encode video, change much more frequently. Your archive will be able to advise you about which formats to produce, and may be able to do some conversions if your files do not conform to its guidelines. As with audio files, there are free tools such as FFmpeg, Axiom, and HandBrake (among many others) that you can use to convert between video formats. As with audio conversion, take care not to produce files that are much larger without any improvement in audio or video quality.
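Before asking your archive which video formats to produce, it helps to know which codecs your files actually use, since the file extension only tells you the container. The sketch below assumes that ffprobe, the inspection tool distributed with FFmpeg, is installed; the function names are hypothetical. It asks ffprobe for a JSON description of each stream and groups the codec names by stream type.

```python
import json
import subprocess


def ffprobe_command(path: str) -> list[str]:
    """Build an ffprobe command that reports each stream's type and codec as JSON."""
    return [
        "ffprobe",
        "-v", "error",                                    # print only real errors
        "-show_entries", "stream=codec_type,codec_name",  # limit output to these fields
        "-of", "json",                                    # machine-readable output
        path,
    ]


def summarize_streams(ffprobe_json: str) -> dict[str, list[str]]:
    """Group codec names by stream type, e.g. {"video": ["h264"], "audio": ["aac"]}."""
    streams = json.loads(ffprobe_json).get("streams", [])
    summary: dict[str, list[str]] = {}
    for stream in streams:
        kind = stream.get("codec_type", "unknown")
        summary.setdefault(kind, []).append(stream.get("codec_name", "unknown"))
    return summary


def probe(path: str) -> dict[str, list[str]]:
    """Run ffprobe on one file and summarize its streams (requires ffprobe on the PATH)."""
    result = subprocess.run(ffprobe_command(path), capture_output=True, text=True, check=True)
    return summarize_streams(result.stdout)
```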
Text formats
Any textual documents you have that are in proprietary formats should be converted into a non-proprietary format so that future users will be able to open and read the files correctly. Exactly what format a file should be converted to will depend on the nature of the document. Many documents that are meant to be read (e.g., most Microsoft Word documents) can be saved as a text file (TXT), saved in a format like the Rich Text Format (RTF), or converted to an archivable PDF file (PDF/A). The last of these methods will preserve the layout of the document and its pagination, making it ideal for ensuring legibility of the document, but its formatted text can make the text difficult to reuse. If the document is something that you expect users to want to copy and adapt (for example, experimental stimuli, interlinear glossed text, or bilingual dictionaries), you may want to consider also archiving a copy of the document in an alternate format (TXT, RTF, XML, etc.) for greater portability and reusability. When converting a word processing file that contains annotations (tracked changes, comments, and highlighting), check whether those annotations are visible in the converted document, and remove them first unless you want to preserve your conversations with your co-authors!
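If you have many Word documents to convert, a script can flag the ones that still carry annotations. The sketch below relies on the fact that a .docx file is a zip package whose main text lives in word/document.xml and whose comments, if any, live in word/comments.xml. It uses simple string matching and assumes the conventional "w:" namespace prefix that Word writes, so treat it as a quick screening heuristic rather than a full parser; the function name is hypothetical.

```python
import zipfile


def find_docx_annotations(path) -> dict[str, bool]:
    """Report whether a .docx file still carries comments or tracked changes."""
    with zipfile.ZipFile(path) as docx:
        names = set(docx.namelist())
        body = docx.read("word/document.xml").decode("utf-8")
    return {
        "comments": "word/comments.xml" in names,
        # The trailing space avoids matching unrelated tags such as <w:insideH> or <w:delText>.
        "tracked_insertions": "<w:ins " in body,
        "tracked_deletions": "<w:del " in body,
    }
```

Any document for which one of the values is True is worth opening and cleaning by hand before you export it.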
Slideshow presentations (such as those made with Microsoft PowerPoint) may be converted to a non-proprietary presentation format or to PDF/A. Presentation software often allows users to add text to a notes field that is meant to be seen only by the editor or presenter of the file, and not by the audience during a presentation. This text may be lost when converting to a different format. If there are speaker notes that you would like to preserve, consider replicating the text in a separate text document or exporting the slideshow into a format that includes the notes text, such as some of the handout export formats in Microsoft PowerPoint.
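One way to replicate speaker notes in a separate text document is to pull them out of the presentation file with a script. The sketch below assumes a .pptx file, which is a zip package that stores notes in parts named ppt/notesSlides/notesSlideN.xml, with the visible text held in DrawingML text elements. The function name is hypothetical, the numbering of notes parts does not necessarily match slide order, and the extracted text may include extras such as slide-number fields, so review the output before archiving it.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Fully qualified name of the DrawingML <a:t> element, which holds runs of text.
DRAWINGML_TEXT = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"


def extract_speaker_notes(path) -> dict[str, str]:
    """Map each notes part in a .pptx file to the plain text it contains."""
    notes = {}
    with zipfile.ZipFile(path) as pptx:
        for name in sorted(pptx.namelist()):
            if re.fullmatch(r"ppt/notesSlides/notesSlide\d+\.xml", name):
                root = ET.fromstring(pptx.read(name))
                runs = [element.text or "" for element in root.iter(DRAWINGML_TEXT)]
                notes[name] = " ".join(text for text in runs if text.strip())
    return notes
```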
Image formats
Digital repositories often accept more kinds of image files than audio or video files, since viewers for static images are often more readily available than players for audio or video files. Nevertheless, you may have to convert some of your image files into another format. As with audio-visual files, there are free applications, such as GIMP, that can be used to convert and manipulate image files, and the image viewers already installed on your computer may be able to convert image files from one format to another.
Databases & tabular data
Digital archives are generally not designed to house dynamic data or complex databases, but are ideal places to store the data tables comprising such databases. While most tabular data is stored in text files such as Comma Separated Values (CSV) or Tab Separated Values (TSV) files, you may have tabular data stored in proprietary software, for example Microsoft Excel. These kinds of documents can usually be saved as CSV, TSV, or other kinds of non-proprietary archivable files, but please be aware that there can be limitations to how these non-proprietary files work. For example, a CSV file may use a less robust character encoding, and it will not preserve formatting or formulas, so you should check the output of the conversion to make sure it is still acceptable. In addition, since these text files can only contain one “sheet” of data, whereas Excel workbooks can contain many sheets, you may have to create multiple CSV files from a single Excel workbook. Finally, note that XLSX files are compressed, so CSV files produced from them may be notably larger.
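The one-file-per-sheet pattern and the encoding check can be sketched in Python using only the standard library. This example assumes the workbook contents have already been read into a dictionary that maps sheet names to rows (a spreadsheet library would be needed for that step, which is not shown); the function names and the file-naming scheme are hypothetical. Each sheet is written as a separate UTF-8 CSV file, and read_csv lets you read a file back to confirm that special characters survived.

```python
import csv
from pathlib import Path


def write_sheets_as_csv(sheets: dict[str, list[list]], out_dir: Path, stem: str) -> list[Path]:
    """Write one UTF-8 CSV file per sheet and return the paths that were created."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sheet_name, rows in sheets.items():
        # Replace spaces and punctuation so the sheet name is safe in a file name.
        safe_name = "".join(c if c.isalnum() else "_" for c in sheet_name)
        target = out_dir / f"{stem}_{safe_name}.csv"
        with target.open("w", encoding="utf-8", newline="") as handle:
            csv.writer(handle).writerows(rows)
        written.append(target)
    return written


def read_csv(path: Path) -> list[list[str]]:
    """Read a CSV file back as rows of strings, for checking the conversion."""
    with path.open(encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))
```

Note that values read back from CSV are always strings, which is one reason to compare the output against the original workbook before you archive it.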
Zipped files
Some archives do not support zipped files. Some data repositories (including some Dataverse configurations) will automatically extract the contents of zipped files as individual media files, whereas others simply cannot add them to their repositories. The contents of zipped files are often suppressed and are not machine-readable, so some file metadata (including the number of objects within the zipped file or the length of audio/video recordings inside) will not be calculated by automatic software that repositories use to extract technical metadata. It is very important that the size and contents of a zipped folder are explicitly identified in your metadata. Make sure that you check with your chosen repository regarding their policies on zipped files.
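If your repository does accept zipped files, you can generate the figures for your metadata with a short script instead of counting by hand. The following sketch uses Python's standard zipfile module to report the number of files in an archive, their total size before and after compression, and their names; the function name and the shape of the returned dictionary are assumptions made for this example.

```python
import zipfile


def summarize_zip(path) -> dict:
    """Count the files in a zip archive and total their sizes for use in metadata."""
    with zipfile.ZipFile(path) as archive:
        # Skip directory entries so that only real files are counted.
        files = [info for info in archive.infolist() if not info.is_dir()]
    return {
        "file_count": len(files),
        "uncompressed_bytes": sum(info.file_size for info in files),
        "compressed_bytes": sum(info.compress_size for info in files),
        "names": sorted(info.filename for info in files),
    }
```

This does not report the duration of audio or video recordings inside the archive; for that, you would need to extract the files and inspect them with a media tool.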
Submit your materials and metadata to your digital repository
At this point, you are ready to submit some or all of your files to your chosen digital repository. Congratulations!
Most modern digital data repositories and language archives require depositors to practice some form of self-archiving, i.e., depositors must do some or all of the work of adding their files (or folders of files) and the corresponding metadata to the repository. The term self-archiving was made popular in the early 2000s by the open access (OA) movement, and it refers to the practice in which authors are allowed (by journals) or are required (by their institutions) to upload preprints of their journal articles, as well as other literature such as conference proceedings, white papers, etc., to their university’s institutional repository (BOAI 2002). Today this model is better known as green OA (Suber 2008). This self-service archiving model is now used by almost all digital repositories, though it might be called by different names. For example, the Archive of the Indigenous Languages of Latin America calls this process self-depositing, and the California Language Archive calls it pre-archiving.
The details of exactly how self-archiving works vary from repository to repository, but the basics are the same: you will be required to enter your metadata directly into the repository or a software program either before, during, or after uploading your files/folders. At some repositories, you can build your collection in a private workspace that is accessible only to you and your team until you are ready to publish the collection; once it is published, others can see it. At other repositories, your collection will be visible to the public the entire time you are building it, though you will likely be able to restrict access to your files until you have entered the corresponding metadata. Make sure that you understand the method that your chosen archive uses. And remember that in both scenarios, the processes of metadata entry and file or folder uploading can be time-consuming, so you should build plenty of time into your schedule to do this work.
If you are submitting your materials to a language archive, you will likely have to submit donation, deposit, and/or licensing forms to the repository before you can start self-archiving. This is not the case at most data repositories, but you will still be required to create an account and agree to the terms and conditions of use. Some data repositories will give you the option to apply a license or rights statement to each folder or file; others will have a standard license that applies to all data in the repository.
Finally, make sure that you review all instructions provided by your chosen repository before you start, and if you have any questions at any point in the process, do not hesitate to ask a representative of the repository.