Wednesday, September 1

Voice-Over in eLearning

Over the past couple of months, Dr. Joel Harband has been teaching me all about Using Text-to-Speech in eLearning. This has been a great way for me to learn about the topic.

However, there was a comment on one of my posts that made me realize that the discussion of the use of voice-over in eLearning was far beyond the conversation that Joel and I were having. The comment was:
Even the best Text-to-Speech can only do one thing - receive text and spit it back out. There is no substitute for a professional voice talent, who can interpret the meaning and message of your e-learning scripts. A good voice talent knows how and when to change up the tone or feel of a read when things are getting overly technical or have gone on a while. The most sophisticated text-to-speech cannot approach a real voice person for e-learning. Why do text-to-speech when the cost of a good voice talent will more than pay for itself with satisfied clients and learners?
If you step back, there's a set of broader questions that I've often struggled with:
  • When does it make sense to use voice-over in your eLearning course?
  • Given the range of solutions for voice-over, from text-to-speech to home-grown human voice-over to professional voice-over, how do you decide what's right for your course?
  • How do you justify the budget and how does that factor into your choice of solution?
  • Are there places where text-to-speech makes sense?
  • Given relatively low-cost recording and editing solutions, does anyone use a studio anymore? When/why?
  • And, last but not least, I've read a lot of conflicting information about the right way to use voice-over in a course. How do you do it right? Can you have the same text on the screen? Can you have text on the screen or diagrams/animations only?

The September Question is:

Effective voice-over in eLearning?

This is one of the bigger big questions. I'm hoping that we can use this to collect up some pretty good information to help eLearning professionals to make smart choices about voice-over in eLearning.

How to Respond:

Option 1 - Simply put your thoughts in a comment below.

Option 2 -

Step 1 - Post in your blog (please link to this post).
Step 2 - Put a comment in this blog with an HTML-ready link that I can simply copy and paste (an HTML anchor tag). I will only copy and paste, so I would also recommend you include your NAME immediately before your link. So, it should look like:

Tony Karrer - e-Learning 2.0

or you could also include your blog name with something like:

Tony Karrer - e-Learning 2.0 : eLearningTechnology

Posts so far (and read comments as well):


escalante blogger said...

Welcome back.

Jeffrey Kafer said...

I'm so glad you made this post. As a professional voice actor specializing in long form narration such as elearning, it's often an uphill battle getting people to understand the importance of real voice acting. Not only is it great for those with disabilities, it makes a real honest connection with the listener. TTS can't deliver emotion or emphasis. Only humans can make true emotional contact with other humans, and certainly that should be your goal with elearning.

Jason said...

Jason McDonald - Maybe You Should Read the Manual

Alex Taylor - TJ Taylor Milan said...

In my opinion a useful parallel is automatic translation. It will certainly get A message across, but certainly not a polished message, and often proves a turn-off. It will help us understand something from the text if the text is virtually incomprehensible, but that's ignoring a more fundamental problem with the text.
As we read faster than a voice can narrate, let's leave voice-over to when it really adds cognitive value to the learning process.
Now, when are those situations? That I don't have the knowledge to answer, but I suspect it's not all that often...
Alex Taylor

Mike Harrison said...

I agree completely with Jeffrey Kafer, and not only because I, too, am a professional voice actor. I firmly believe there is no substitute for the real, true human voice. I've posted on this issue more than several times on various eLearning blogs. Repeating one from March of this year:

All one need do is remember their school days, and how easy it was to 'zone out' or doze off as a live yet lifeless teacher droned on about one thing or another. Because it is entirely possible to hear without listening, one needs to actively listen in order to evaluate and respond accordingly to what is being said. What separates a skilled actor or narrator from a lay-person is the ability to read and interpret words written by someone else and breathe life into them, making the words believable, interesting.

Natural speech has many nuances, such as inflection; the placing of emphasis on the correct syllables or words. Projecting slightly louder, or dropping the volume. Raising and lowering the pitch of their voice, increasing the pace for excitement... or pulling back for uncertainty. This is the natural 'music' of speech. The power that makes human speech engaging.

Everyone knows how to inflect the proper words. When in everyday conversation we talk about things we know intimately, the key words get the proper inflection automatically. We don't even think about it. There are those who might be called professionals because their jobs require the use of their voices, but they may not have the skill necessary to interpret words written by someone else and apply the proper inflection. They may have pleasant speaking voices, but without the proper inflection, what they say does not have the impact it otherwise would.

We have all seen or heard unskilled speakers; those who are perhaps forced to address a group. They are most times quite monotone and seem unsure of themselves. An audience senses this, and it most times makes them uneasy, wishing for the speech to be over. But, on the other hand, we have also witnessed some who speak very comfortably and confidently. And that's what wins our attention. We merely accept the sound of GPS navigation devices and other such things because they deliver usually very short phrases every so often.

When we read silently to ourselves, we are free to interpret the words as we wish, and create the appropriate mental images to color the story. Text-to-speech can do nothing more than assemble and project the strings of syllables it recognizes in a file. Because it cannot interpret ('feel') words, TTS cannot breathe life into them and make them any more interesting than the bits of data they are. The true test of education is not what is spoken, but what is retained to the degree that it becomes useful. We can add coloring to water to make it look like coffee. But it's not coffee.

We all thought our lives were fast-paced years ago. That's even more true today, and we place a greater importance on time. Because of that, our time allotment for patience has grown shorter. We want satisfaction sooner, and we tire of things sooner. If we don't get what we want within our own time constraint, we move on. This makes us less apt to sit through boring, lifeless speech. Even if we know we need to listen, we will do so grudgingly, or we will find every excuse to do something else. A non-motivated learner.

I can see how TTS could be deemed acceptable for shorter content for those wanting to learn. And it would be interesting to see how people of various ages retain lessons of various subjects taught by an engaging human speaker, as opposed to the lifeless drone of assembled bits & bytes delivering the same material. But there is no question that I would prefer real human speech and its ability to make things sound interesting any day.

Tony Karrer said...

These are great comments. And a great post by Jason.

Jeff - I'm not sure I agree that only humans can deliver emotion or emphasis. They can do it better, right? Just having voice gives some level of connection/emotion?

Mike - please point us to your other posts on this topic. Your comment is great. Not sure how I can capture the "rules" that result.

We seem to be quickly coming to the heart of the issue. How do you navigate the complex question of the use of voice?

This should be a good conversation!

Inge (Ignatia) de Waard said...

Ignatia / Inge de Waard's thoughts on when to use TTS or the human voice in eLearning.

Jason said...

Tony asked in his comment, how do you navigate the complex question of the use of voice?

An important factor is to help learners identify with another person who exemplifies expertise in the subject. This would again suggest that we should use real voices rather than text-to-speech. If you only look at the possible cognitive effects (a la Clark and Mayer), text-to-speech may have some benefits, but I don't think you can only look at the issue from the cognitive perspective. I don't know many people who identify with an electronically-generated voice.

Parenthetically, one could also argue on a philosophical basis that you should also avoid voice acting in favor of real experts, even if voice actors give you a product with higher production values. But that's a complex issue I won't go into here. :)

vtlau said...

One hint for when to use Text to Speech is that the learner cannot rely on his vision. Maybe he is busy with other things, or maybe he is practicing live and needs guidance. When he cannot rely on his vision, it may also be convenient for him to respond in audio. Thus, to me, Text to Speech and Speech to Text are complementary. Effective use of one cannot leave out the other.

Cathy Moore said...

In my blog: Studies suggest that learner control and silent text work better than narration.

Jeff Goldman said...

Jeff's Response - Narration in e-Learning, Sometimes

Shaun said...

Shaun - Perception of value

Karl Kapp said...

Kapp Notes: Audio in E-Learning

Jennifer Zapp said...

I actually use it in all my courses, when they are in a review stage. I have found that many people can't really "visualize" with just the text of a script. So using text to speech in the review cycle gives a close enough result. When the script is edited and receives final approval, I go to voice over talent and really minimize the amount of re-work with the talent.

Jim Everett said...

These are just a few "it depends" thoughts, based on my own work and experience...

Quality and listenability of voice is one aspect. Another is that the talent ideally knows the content area, the location or the industry, so as to get pronunciation right. Small inaccuracies in emphasis can diminish credibility. For example, calling Maryland "Mary-land", or GPS as "G-P-S", rather than "Jeepy-ess".

An important consideration is where the person will be listening. If they are sitting at home, learning new skills such as software, they will need a different degree of vocal projection (softer and reassuring, rather than full-on ESPN-style) than if the material is designed to be listened to in an industrial environment (where they will need strength and clarity).

Intonation can also be generationally-specific. If the audience is over 40, then having a more mature-sounding voice with a dropping inflection and appropriate vowel pronunciation is preferred. If the audience is under 25, then it may be better to have a talent with the appropriate inflection and vowel pronunciation for that demographic.

Anthony said...

Anthony Montalvo

I think voice-over is a much-overused resource, more related to inflated development costs for the client than instructional effectiveness. Given that people read much faster than they speak, most of us will have finished reading the text on the screen long before the narration is finished (unless, of course, the evil instructional designer has synchronized the text to the audio, forcing us to read at a predetermined pace).

The only situations where I think voice-over is justified and desirable are to transmit emotional content in dialog (which, in all truth, can also be done effectively through imagery or illustration), to imply indirect meaning in dialog ("reading between the lines"), and to highlight cultural or geographic differences through accents. Moreover, I believe none of these can be adequately done through text-to-speech engines.

Regarding this last point, the only case I know of where text-to-speech is truly beneficial is in learning correct pronunciation. The "Hear it" tool used by GlobalEnglish in its business English capability development solution allows the user to quickly learn the correct pronunciation of any word or phrase in English, which can come in handy when non-native speakers need to confirm pronunciation of an unfamiliar word or phrase.

Andy said...

Thanks all for the great comments and insightful tidbits. I can definitely appreciate the greater control that comes with TTS, but in my areas of focus (higher ed and healthcare) there really is no substitute yet for the humanity and inflection of an actor, subject expert, or teacher.

If we have a healthy budget and a larger project (and of course a true need for narration), it has worked well for us to cultivate a stable pool of professional voice actors. Changing voices for different sections / modules can actually be an effective way to capture learner attention. In recording a more challenging script with items such as proper names, using a good voice actor will actually result in much shorter hands-on development time than TTS because of a higher trust in the quality. I can receive the audio files from a voice actor and integrate them directly into the presentation with just a quick spot check vs. generating the TTS files myself, carefully reviewing the quality, making any fixes, and then integrating. In terms of long term maintenance, I've found that most presentations become outdated and need a complete overhaul long before you lose contact with an actor.

For shorter projects or tighter budgets, we have had good success as well recording in-house. The trick is to set up a consistent recording environment and lock down the many variables.

Anonymous said...

I'm concerned with the focus on the bottom line that I'm seeing from purveyors of TTS software. Budgets should always be a concern, but not at the expense of pedagogical goals and learning outcomes. Most of the case studies don't pit human voice vs. TTS. The marketing also aims only to appeal to companies' wallets and at best comes to the conclusion that TTS is "better than nothing".