The following appears on a website that isn’t my personal blog here.
Below is a print version of a presentation I gave Friday, 10/22 at the Vision Midwest conference in Madison, Wisconsin. The timing was good for a number of reasons, not least of which was the 21st Century Communications and Video Accessibility Act of 2010 that the president signed into law on October 8th. Among other things, the law requires the FCC to create a committee within 60 days of 10/8 to advise on the technical challenges of adding audio description to online video and to present a report within 18 months. I’ve spent the last year studying this problem and coming up with my own solution (that deploys within two months, not two years) and hope that whoever ends up on the committee doesn’t settle on requiring something that’s unworkable or inadequate. Now’s a good time to throw some voices from the trenches out there.
My name is Liam Moran. I work for a digital media unit at the University of Illinois. I wear a lot of hats: I’m a videographer and a programmer, I run our audio studio, configure our streaming servers, and try to keep up with the latest technology to stay at the forefront of delivering the best quality of service possible on the budget we have to work with. As of August of this year, my unit exists as a partnership between the College of LAS and the Office of Continuing Education in order to devote more of our resources to developing video content for online and blended courses for LASOnline, one of the programs for providing online courses at the University of Illinois. I’ve been thinking about audio description for about a year and a half, feel confident that the plan I came up with for producing and delivering audio descriptions is a good solution, and was allocated time during my work week over the past few months to implement it. In this presentation, I walk through the different possible ways to produce and deliver audio descriptions and try to convince you that what we’re doing at the University of Illinois makes the most sense for the various stakeholders involved: the faculty members who provide the content, the media units that produce the content, the server administrators who host the content, and the students who must learn the content.
Video is being used in higher education at an increasing rate: in traditional classrooms, for blended learning, and in online courses. This is a good thing: video is an informationally dense source of curriculum material, and presenting learning material in different ways can’t be a bad thing. There are certain situations where video is necessary: where a demonstration is too dangerous or expensive to perform in a classroom with students present, or where an individual can’t schedule time to be present in a classroom but a camera crew and interviewer can meet with them. However, using video without taking care to make it accessible deprives students with various disabilities of the learning materials they are expected to master. When a video is presented that is not fully accessible, especially to the blind and those with low vision, it is not a decision made in malice; it is that the infrastructure and standards to generate and deliver accessible video simply do not exist.

Let’s take a step back and think a little bit about what accessibility is in the grand scheme. Accessibility has two major aspects: usability and equivalent content. Usability has to do with how easy it is to anticipate how to navigate and control whatever it is that is to be accessible. The provision of equivalent content means that information is presented in a way that everyone can acquire it. Building codes require that structures be built accessibly with respect to usability by specifying how high off the ground and how far from a doorway a light switch should be located, where handrails should be installed, and how a door can be opened by a visitor in a wheelchair. Buildings are accessible with respect to equivalent content if signs have braille equivalents for the text, elevator buttons have a tactile means of indicating which floor the button will take you to, and so on. A webpage is accessible with respect to usability if it uses high-contrast colors and is structured in a way that screen-reader software can index it and provide a method of navigating its different content areas. A webpage is accessible with respect to equivalent content if, for example, its images have alt text describing what each image shows.
Jumping back to accessible media: the player controls have to be usable by blind and low-vision users. They have to be of appropriate size, use high-contrast colors, and be navigable and operable with a screen reader and a keyboard. The media also has to provide equivalent content for users with disabilities. Most of the time, when media professionals talk about accessibility, they’re really talking about captions. Captions are great, of course, and a mainstream part of life now: captions are usually turned on at bars and restaurants so patrons can watch different television programs on different televisions without interfering with one another or the music being played, for one example. Captions are only half of the game, though: they provide the equivalent content only for those who can’t hear the audio portion of the video. The motion picture part of the video contains content that needs to be made accessible too; otherwise we wouldn’t buy all the expensive cameras we have. So how do we provide equivalent content for that portion of the video material?
The solution came in the 1970s from Margaret Pfanstiehl, an avid fan of the theater who lost her eyesight in her early 30s. She cultivated a group of volunteers who would describe the visual aspects of a performance for her and other blind and low-vision patrons, eventually developing an infrastructure of radio transmitters and headset receivers for broadcasting to those patrons in the audience. That approach has become the standard for providing equivalent content for theatrical performances and is the basis for all other forms of making visual media accessible to those with visual impairments.
The current best-practice method for including the equivalent content is to play a second audio track over the video, synchronized with it, exactly like a director’s commentary track on a DVD (the DVD specification actually includes a provision designed for audio description, but it hasn’t been widely exploited, unfortunately). Two resources for guidelines on producing the descriptions are the Audio Description Coalition (with free registration, you’ll receive a PDF file containing the standards used by the theatrical audio description community, along with a useful set of ethical guidelines) and Joe Clark’s AD Principles, a concise and clear presentation.
My own distillation of the guidelines for educational content:
- Describe what you observe
- Keep interpretation to a minimum
- Respect your audience
The objective is to be the user’s eyes, not their brain. Good descriptions are not annotations of the video. You have to keep your ego in check and try not to be too helpful. A good rule of thumb to keep you from being too helpful is to watch the video with your audio descriptions turned on and verify that the descriptions add no content that isn’t readily apparent from the video. I acknowledge that this is impossible at times; some human judgment is often needed to determine what information students who can see the visual aid are likely getting from it. I suggest that it’s better to err on the side of completeness, describing the apparent content as fully as possible, for reasons that will be clear later. Finally, make sure to resolve ambiguous references that rely on visual cues: “the second figure on the left”, “this one right here”, “the green arrow.”
Here is a clip from the Ribbon of Sand sample made by Audio Description Solutions, chosen to demonstrate this method of audio description because it is both beautifully shot and beautifully described. Note that the production house that produced this video did something interesting: Meryl Streep’s narration is in one channel (left) and the description audio is in the other, so you could, in theory, turn the AD off if you wanted to and if the player provided a method to do so. QuickTime does not allow you to control the panning or balance of the audio.
The question is whether this technique would work for educational video. For the following clip, I made the audio track by watching the video, noting at what time a description should occur and what the description should say, recording the descriptions in my studio, editing them in a multi-track editor so they would start when I noted they should, and exporting the result as a mono file. Here’s the first demo using this method:
The standard method clearly isn’t going to work all the time for educational video: the lecturer has a finite amount of time with students and has to provide as much information as possible in that time, so natural pauses are few and short. The visual aids used are often dense with information and require extensive description in order to provide equivalent content. In video shot for entertainment, the camera does a lot of the exposition; it tells a good portion of the story. That’s not always the case for some types of educational video.
Since my first mission in making our media accessible was captioning, that was my only tool when I turned to audio description: it was a hammer and every problem looked like a nail. My first attempt to improve on the standard practice was to leverage the screen reader’s typically fast speech rate by essentially presenting the text of the descriptions to the screen reader to synthesize at the appropriate time. Captions standardly work in Flash by being provided in a W3C-standard dfxp.xml timed text file containing paragraph (p) nodes whose values are the text of the captions, with begin and end attributes giving the timestamps at which each caption should appear and disappear, respectively. To get the screen reader to synthesize the text displayed, you need to use the Accessibility.updateProperties() method to force the screen reader to refresh its buffer containing the accessible objects in the movie (including the new text), then force focus onto the text box where the description is printed with the FocusManager.setFocus() method. I can comprehend the synthesized speech in JAWS with a rate setting of 85, so if you have JAWS available, turn it on, set the JAWS Cursor Dialog speech rate there (or higher), and watch Demo #2.
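For reference, a minimal caption file in that format looks something like this (the namespace varies between versions of the timed text spec, and the text and timings here are just placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/2006/10/ttaf1">
  <body>
    <div>
      <!-- Each p node is one caption; begin/end say when it appears and disappears. -->
      <p begin="00:00:05.00" end="00:00:08.00">Welcome to the first lecture.</p>
      <p begin="00:00:08.50" end="00:00:12.00">Today we'll cover the course overview.</p>
    </div>
  </body>
</tt>
```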
Demo #2 worked better than Demo #1, but it’s still not good enough. JAWS spoke over the audio native to the video; JAWS couldn’t keep up with the pace of the video, even when speaking at a high word rate; and seizing keyboard focus from the user would be problematic in a properly functioning user interface.
WGBH noticed that a natural pause into which a description can be inserted without interfering with the native audio frequently isn’t available, and so suggests “extended descriptions,” where the video pauses as needed to allow time for the description.
Their example is the WGBH “All Systems Go” Extended Description demo. As much as it pains me to criticize the great work they do at WGBH, I have problems with this particular implementation. First, the descriptions in the video aren’t descriptions but annotations: there’s information presented in the extended descriptions that is not readily apparent from the motion picture. Also, the player doesn’t actually pause; it merely displays the same frame repeatedly until the descriptive audio finishes, which means the extended version is a second, longer video file that has to be produced and hosted alongside the original.
This is problematic for a number of reasons: we’d more than double our disk usage and costs in order to deliver accessible video in this manner, the media producers would have to make two different versions of each video we make, web designers would have to come up with a clever way to reliably route users to the correct version of the video for them (if that’s even knowable), and users would have no way to skip the description if they didn’t feel it necessary to listen to (the objective is to provide equivalent content, not to extend the amount of time they are required to consume content) or to switch to the other version of the video if they were routed to the wrong one.
Flash does more than play video—we can push the technology to do what we need it to do, exactly how we need it to do it. This is one of the benefits of using Flash over an embedded commercial player. Extended descriptions are going to be necessary in educational video, so a mechanism is needed to pause the video stream when necessary and resume it either when the description has finished playing or the user chooses to skip the remainder of the description.
Partly because my only tool was closed captioning, the audio descriptions are controlled via an XML file extended from the timed text standard, with two new attributes available for p nodes: pause, which takes a boolean value (defaulting to false), and href, which tells the player where to find the audio to be played at the time specified by the begin attribute. The player doesn’t interact much with the screen reader, since I intended it to behave the same everywhere, except to detect that a screen reader is in use and to turn on the audio descriptions. It only makes that check once, then stops checking, in case the user doesn’t want them on. This has been a little confusing for beta-testers so far, so I’m working out under what conditions to let the screen reader handle the UI and when to let the built-in controls take over. The goal is for the thing to behave in an expected manner, and my beta-testers know a lot better than I do what is expected.
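To make that concrete, here is a rough sketch of what a description file in this extended format looks like (the timings, description text, and mp3 file names are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/2006/10/ttaf1">
  <body>
    <div>
      <!-- A short description spoken over a natural pause; the video keeps playing. -->
      <p begin="00:00:12.00" pause="false" href="descriptions/desc001.mp3">The professor points to a bar chart of enrollment by year.</p>
      <!-- A dense visual aid; the video pauses until this audio finishes or the user skips it. -->
      <p begin="00:01:45.00" pause="true" href="descriptions/desc002.mp3">The slide lists the four stages of the process, from sample collection through analysis.</p>
    </div>
  </body>
</tt>
```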
Note that the buttons are large and use high-contrast colors; that screen real estate is limited since the buttons are large; and that captions overlaid on the picture are only useful when their location indicates the speaker, so keeping the captions off the picture where they can’t block the motion picture content is preferable from my point of view (and that of others on campus). So when captions are turned on, the controls panel flips to reveal the captions and a button to flip back. Keyboard controls still work in the predictable manner with captions on. When you play the video, observe that it pauses when it has to and keeps playing when it can. It’s easy to skip a description you don’t want to hear by pressing the spacebar or the play button. I need to add forward and back buttons, both to go back to the last description if you skipped it and then found you missed something and for general navigation control.
When it comes down to it, getting audio descriptions used widely on campus will depend on a cost-benefit analysis. The benefits are fixed: audio descriptions have to be provided by law for government online materials in Illinois. The trick is to reduce costs enough that the decision isn’t made to just get rid of online video, which would be bad for me, since making it is my profession, and bad for everyone, because online video is a valuable type of learning material for students and the public at large. Costs can be measured either in cash or in how much each stakeholder’s way of doing things needs to change.
Nothing much changes for server administrators: they don’t have to double their disk installations for media servers. There are a few mp3 files that are very small relative to video and another xml file that’s hosted on our video content management system.
Media producers need to make the xml files I intend to use as a standard method of delivering audio descriptions and to record the audio files. I estimate that AD generation takes about twice realtime overall. If we were to adopt the standard workflow, where we produce an audio file to play in sync with the video, the way I’d produce them would be to watch the video and take notes of when a description would need to play and what the description would be, then watch the video a second time with a headset on and record the descriptions from my notes at the right times. For an hour-long video, that would take about two hours. With the workflow I’m proposing, I estimated that it took me a little under twice the length of the video to type up and time-sync the descriptions using SubtitleWorkshop, but quite a while to record the audio because I’m not a competent voice actor. That estimate does, of course, depend on the amount and complexity of the visual aids presented.
Furthermore, since the format is timed text, the infrastructure to generate the files is already in place for captioning and can be re-purposed, and some of the skills that people with captioning expertise possess carry over, all of which should make adoption of the process more palatable. All that is needed is a post-processing script to convert the standard timed-text file to the proposed extension of the timed-text format. (More on that later.)
Also, many of the videos my unit makes are storyboarded: we know from the early planning phase what each shot is meant to communicate and translating the storyboards to descriptions is straightforward. Since almost all television shows and feature films are storyboarded, the excuse for them not to provide audio descriptions once they have the infrastructure in place to deliver them is flimsy.
For faculty and other instructors, the routine is the same if they’re working with us. In my experience, faculty new to video often report that working with us to tighten up their presentation for scripted video forced them to re-think the way they present content, benefiting their teaching methods in positive, career-changing ways. It’s fair to assume that they already think hard about what visual aids to use and have a good idea of what they intend them to communicate, so if they produce their own videos, producing their own audio descriptions shouldn’t be a stretch and might become just “part of the process”. I assume that when people do things in inaccessible ways, like failing to structure a PDF file so that a screen reader can index or even read it, they do so simply because they don’t know how to do it right. I only learned that (very minor) skill a few months ago and now it’s simply how I do things; doing it any other way would be doing it wrong and creating work for myself down the road.
There are secondary benefits to implementing audio description as we will be doing at the University of Illinois. The timed text files I suggest using include the text of each description within the p node; even though the player doesn’t directly use it at this time, it’s good to have it in there for future-proofing, in case Flash 11 includes a built-in speech synthesizer. It’s also immediately useful for search: with a fully accessible video, you can search for a term and navigate to the point where it’s spoken or described. Searching the motion picture part of the video simply isn’t possible with inaccessible video, so that’s a significant advantage. That’s another reason why I suggest erring on the side of too much descriptive audio rather than too little, especially when descriptions are easy to skip and to navigate back to as needed.
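As a rough sketch of what that search could look like, here is a small script that scans the caption and description files for a term and prints the timestamps where it occurs (the file names are placeholders, and it assumes timed text files like the ones described above):

```python
import sys
import xml.etree.ElementTree as ET

def search_timed_text(path, query):
    """Return (begin, text) pairs for every p node whose text mentions the query."""
    hits = []
    for node in ET.parse(path).iter():
        # Match p nodes regardless of which timed-text namespace the file declares.
        if node.tag.split('}')[-1] == 'p' and node.text and query.lower() in node.text.lower():
            hits.append((node.get('begin'), node.text.strip()))
    return hits

if __name__ == '__main__':
    term = sys.argv[1]
    # Placeholder file names: the caption file and the audio description control file.
    for path in ('lecture.dfxp.xml', 'lecture.ad.xml'):
        for begin, text in search_timed_text(path, term):
            print(path, begin, text)
```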
Those secondary benefits are critical to mainstreaming audio description the way that closed captions are now mainstream and expected. Once students without vision impairments notice that descriptions are available, I’d hope that they start using them to multi-task: playing the video with the descriptions on in the background while reading or typing their notes, cooking dinner, whatever… The descriptions become valuable to the way those students learn, too, and they come to expect them as well. Even though I’m a video producer, I don’t watch much online video myself because I’m usually too busy to do just one thing at a time. Sometimes I’ll start a video, then switch to another tab to do something else and get lost immediately, since I’m depriving myself of the motion picture content. If audio description becomes as mainstream as I’d like it to be, I would watch more online video.
The method I propose structurally reduces the costs about as much as I can see possible, with a few exceptions.
Recording the audio is expensive in time and labor, so it would be preferable from a cost-savings perspective if we could synthesize the descriptions. The workflow I have in mind for this is to have the tool that translates the standard dfxp.xml file to my proposed ad.xml file format also synthesize the audio, by piping the text through a synthesizer like Festival or whatever we have handy on campus. It’s been a few years since I worked on speech synthesis, but when I last did, the hot area was prosody and communicating emotional states and the like, which is needed for dramatic presentations but much less so for plain descriptions. The simplest way to do it would be to generate a caption file in SubtitleWorkshop (which needs a patch to export timed text XML, in the current version at least) and, for each description, set the end time equal to the begin time if you don’t want the video to pause, or to some later time (so it displays for a non-zero duration) if you do. The script that translates the XML would then know what value to give the pause attribute by comparing the times, and could assign the href values based on where it writes the synthesizer’s output. I’ll be using a synthesizer in any case, since that workflow produces descriptions faster than waiting for studio time: it’ll just be a matter of replacing the synthesized speech with recorded human speech later. If it can be experimentally demonstrated that the synthesized speech presents the material just as well, though, we can devote all of our resources to creating the descriptions instead of splitting them between typing the descriptions up and then recording them, which would mean more audio-described video gets made per dollar. We know that audio description aids learning, but I need to know under what parameters that effectiveness is maximized so we can best position our resources.
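Here is a minimal sketch of what that translation script might look like, assuming Festival’s text2wave command-line tool is installed and accepts text on standard input (the file names, output layout, and wav-instead-of-mp3 output are placeholder choices, not a finished tool):

```python
import os
import subprocess
import xml.etree.ElementTree as ET

def dfxp_to_ad(dfxp_path, out_xml='lecture.ad.xml', audio_dir='descriptions'):
    """Translate a standard timed text file into the extended description format,
    synthesizing one audio file per description with Festival's text2wave."""
    os.makedirs(audio_dir, exist_ok=True)
    tt = ET.Element('tt', {'xml:lang': 'en', 'xmlns': 'http://www.w3.org/2006/10/ttaf1'})
    div = ET.SubElement(ET.SubElement(tt, 'body'), 'div')

    p_nodes = [n for n in ET.parse(dfxp_path).iter() if n.tag.split('}')[-1] == 'p']
    for i, p in enumerate(p_nodes):
        text = (p.text or '').strip()
        if not text:
            continue
        begin, end = p.get('begin'), p.get('end')
        wav = os.path.join(audio_dir, 'desc%03d.wav' % i)

        # Pipe the description text through the synthesizer.
        # (Convert the wav to mp3 afterwards if smaller files are needed.)
        subprocess.run(['text2wave', '-o', wav], input=text.encode('utf-8'), check=True)

        node = ET.SubElement(div, 'p', {
            'begin': begin,
            # end == begin is the convention for "play over the video";
            # a later end time means the player should pause until the audio finishes.
            'pause': 'false' if end == begin else 'true',
            'href': wav,
        })
        node.text = text

    ET.ElementTree(tt).write(out_xml, encoding='utf-8', xml_declaration=True)

if __name__ == '__main__':
    dfxp_to_ad('lecture.dfxp.xml')  # placeholder input file name
```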
There’s a possibility that the descriptions themselves could be automatically generated. Since the objective is for the descriptions to be free from interpretation, there’s a good chance that some fancy image recognition and OCR could produce the descriptions without continuing human supervision, or at least with limited supervision. In the simplest case, where the visual aid is a PowerPoint deck and the professor has provided alt text in it, the task of producing much of the descriptions would be trivial. A pair of ECE professors at the University of Illinois are working on a lecture-capture system that automatically performs the mix between the camera video source and the projected video source based on the professor’s gestures as identified by the camera system; it could be re-purposed to sync the descriptions if their system proves reliably successful.
The question of whether either of these would be acceptable isn’t a policy issue to be decided by someone well credentialed; it’s an empirical issue (and, in the latter case, an engineering issue) that needs to be answered by learning-comprehension studies testing how well the different methods of generating descriptions present the equivalent content to students.
An outstanding issue is how to let students report when an audio description file is missing for a video they want one for. For captions, this is straightforward: if the Ensemble server that catalogs our media reports to the player that no captions are available, the player displays a message in the caption area explaining how to report that captions aren’t available. It’s not clear how to provide the same information for reporting missing audio descriptions, but it’s not an unresolvable problem.
That’s the end of the presentation.
If I could go back and add anything, I’d probably talk about how relatively easy it would be to do audio description for live streaming video. To caption live video, our regular procedure is to hire a professional caption writer and a very bright WILL broadcast engineer named Matt Jones to add line-21 captions to the video feed (which would probably be aired live on UI7, the campus television station, anyway), then decode them with a PCD-88 before capturing the video for encoding and streaming up to the server.
For audio description, we’d just need to send an audio only stream to the server, make sure it’s synced up on the downstream side, and have the player use the same controls for both streams. An alternative method would be to send two video streams, one with the descriptive audio mixed in, and re-purpose the dynamic bitrate switching machinery to swap between the two on demand.
Since I think extended descriptions are of real benefit to users, I think it would be best if whatever standard the television broadcasters adopt lets viewers with DVRs allow the video to pause while the descriptions play, as needed and until the buffer runs out. I assume that alternative audio tracks are extra audio streams in an mp4 container, so the video stream and the description audio stream would have to go increasingly out of sync as the program played; that might be possible with a DVR, but more thinking is needed.