Selecting voice talent for speech dialogue systems: laypersons or professionals?

The voice is the figurehead of an IVR. The “Hear & Feel” is what is perceived by the caller, and not the technology behind it. Often, the most reliable and high-quality software and hardware is purchased, only to operate a system that has scrimped on the speaker. But the voice should also exude high quality and reliability. Nobody would buy a new BMW and say to the manufacturer: “You don’t need to paint the body, I have a friend who is a painter, and another friend who is good at sewing; she can do the upholstery for the interior. The main thing is that the engine is good.”

With an untrained ear, one may not necessarily ‘hear’ or understand the value of a professional voice immediately, but everyone feels it; language works very directly. A ‘cheap’ and unprofessional voice portal that emanates a certain do-it-yourself charm is not necessarily what operators seek, but frequently, this is exactly what is presented to the customer.

A common misconception is that opting for a layperson over a professional speaker will reduce costs. Additionally, there is often the belief that a layperson will be more readily available, and this contributes significantly to the decision not to choose a professional speaker. This is especially true if said layperson is an internal employee.

The selection criteria then applied in the search for a suitable voice are typically the following: the person has a ‘good voice’, and their day job involves a lot of speaking, e.g., in a call centre.

However, the findings are:

  • A layperson or untrained speaker is more expensive than a professional.
  • The availability of laypersons is unpredictable and often difficult to manage.
  • A ‘good voice’ is not a quality criterion, nor is it objective, but more a matter of opinion.
  • A call centre agent, or any other employee, is not a professional speaker, but a customer advisor.

These points will be explored in more detail below.


Costs / Expenses

Even if the layperson does not ask for a fee, the process still costs more. Studio time when recording with laypersons is considerably longer: in our experience, you need at least three times as long in the studio as you would with a professional speaker. (Note: a studio hour costs between 70 and 120 euros, plus the working hours of the voice coaches.) There are many reasons why a professional is faster. Firstly, a professional will usually read a text with minimal errors, whereas a layperson will typically need several attempts before the actual recording can start. Secondly, a professional does not need extensive instructions on how to intonate and pronounce a text. Finally, their voice is more resilient and requires fewer pauses.
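The studio-time argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the factor of three and the range of 70 to 120 euros per studio hour come from the text above, while the four-hour professional session and the function names are assumptions chosen for the example. Voice coach hours, post-processing and speaker fees are deliberately left out.

```python
def studio_cost(hours: float, hourly_rate: float) -> float:
    """Cost of studio time alone, in euros."""
    return hours * hourly_rate


def layperson_extra_studio_cost(
    pro_hours: float, hourly_rate: float, time_factor: float = 3.0
) -> float:
    """Additional studio cost when a layperson needs `time_factor` times
    as long as a professional for the same script."""
    layperson_hours = pro_hours * time_factor
    return studio_cost(layperson_hours, hourly_rate) - studio_cost(pro_hours, hourly_rate)


# Hypothetical session length for a professional (assumption, not from the text).
PRO_HOURS = 4.0

for rate in (70.0, 120.0):  # lower and upper studio rate quoted in the text
    extra = layperson_extra_studio_cost(PRO_HOURS, rate)
    print(f"Studio rate {rate:.0f} EUR/h: extra studio cost with a layperson = {extra:.0f} EUR")
```

Under these assumptions, the extra studio time alone has to be weighed against whatever fee the professional would have charged, before coaching and editing time are even counted.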

Additionally, the increased post-processing time should not be underestimated. Audio files recorded with professionals require significantly less editing time, and it is seldom necessary to piece together an audio file from different takes, which is common with laypersons. Since a layperson typically moves a lot in front of the microphone due to lack of experience, additional manual work is required to adjust both the volume and the sound quality.

In relative terms, a good professional speaker doesn’t cost much. The speaker’s fees are almost negligible, even for ‘stars’, compared to the total budget of an IVR project. The cost of a consulting man-day is usually much higher, and normally you only need a speaker once or twice, for a session of a couple of hours, to cover an entire project. Even if regular updates of the IVR and its recordings are necessary, this will not escalate costs unreasonably. Professional speakers spend the large majority of their time in studios every day and can “quickly” record a sentence or two anywhere. Technical solutions such as music taxi or APTX (long-established standards) enable voice recordings in real time in DVD quality via unified communications platforms. Today, voice artists subscribe to and invest in high-quality recording equipment to maintain continuity, and are not even required to be physically present in a recording studio.



Of course, a professional speaker will have work scheduled, and the best of them are often fully booked. But likewise, a layperson usually has other activities and is not necessarily available at any time. Furthermore, quick and direct availability is not necessary. The majority of projects have a duration of at least one month, while a recording session is typically completed within half a day. With good project planning, busy speakers can be booked in good time. Even where that is not possible, fast availability and high flexibility are key to the role and are commonly integrated into the modus operandi of a professional speaker. At Sovran, we have the optimum in voice-powered automation platforms to respond dynamically to change requests.

There is also another aspect of availability that may be even more important, which will be referred to as “internal availability”. The performance of laypersons is strongly dependent on the condition of their voice. Of course, the professional is only human and also has “bad days”, but these fluctuations, which can only be noticed during the recording session, are barely noticeable in the end product.

The human voice is extremely permeable to emotions, and the human ear is specialised in perceiving the finest nuances in the sound of a voice. The human ear is most sensitive in the frequency range that corresponds to the spectrum of the human voice. A professional can “mask” nervousness, a cold, a bad mood, etc., whereas the layperson is usually at the mercy of their sensitivities. Even professionals are nervous at the beginning of a recording session, but over the course of their career they have developed strategies to deal with that. They are familiar with techniques, such as so-called ‘support’ and correct breathing, which enable the voice talent to sound solid and natural even under pressure and tension. The layperson’s voice ‘flutters’, the breath is shallow, the voice usually becomes higher as recording time increases, the quality of articulation deteriorates, and the mouth becomes dry. Being in an unfamiliar situation, such as a recording studio, does not help this tension dissipate quickly: the speaker is sitting in a sterile room, a microphone in front of their face, wearing headphones that play their own voice back into their ears. On the table in front of them are stacks of paper, and behind a window are two or three people who utter well-intentioned instructions like “just be natural”, which rob them of the last spark of hope that a word will ever again flow fluently from their lips.

Speaking professionally is a profession. “That sounds good”, “The voice is nice and personal”, “They have a great voice” are not the criteria that make one voice more suitable for an IVR than another. As mentioned above, even the best voice will fail in a studio situation if the speaker has not learned how to train, condition, and handle it. The central question that should be asked when testing a voice for its suitability for use in IVRs is the following: Is the voice talent receptive to the direction of the voice coach in order to create contact with the caller and become truly conversational? The directness and naturalness of prompts determine whether callers will enter into a natural conversation, forgetting that they are talking to a system.

We find that, with prompts spoken by laypersons, callers do not always know when they can speak (turn taking) because they do not get a clear signal (landmark) from the speaker. It’s about more than just providing information: the caller has to feel when to speak and intuitively know what to say without having to think about how to say it. Within an IVR that adopts a natural conversational style, callers tend to formulate answers instinctively, which in turn increases the accuracy of the speech recognition. In everyday spoken communication, it is often necessary to adapt one’s style of speech to the conversational partner; it is standard practice to meet ‘in the middle’. If these compromises are not made, the listener will be immediately aware, and the conversation loses any chance of feeling natural. Such situations are commonplace in everyday life; for example, when someone speaks too loudly or too quickly in a given situation, or someone remains rigid in their dialect or language without adjusting to the other person. In our experience, a layperson or semi-professional is not flexible enough to always strike the right balance and find an appropriate compromise. They usually stick to “their” tone, with a poor ability to anticipate what will be appropriate in the given circumstances, and will also struggle to apply any understanding of the context linguistically to the underlying texts. This uncertainty in the voice of the speaker is subconsciously apparent to the listener and will negatively impact their experience.

It is also imperative to consider in which situations people will call the IVR. For example, the caller may have an urgent problem and require instant help. In this case, a speaker who responds too slowly may evoke frustration and aggression. Moreover, too friendly a tone may not be well received. Experience has shown that a layperson is not trained to recognise or adapt to these nuances. Typically, they focus on pronouncing the text without errors. A layperson usually doesn’t know much about the techniques that would allow them to manipulate their voice to convey differing moods and tones, nor do they have experience of applying such techniques. In these circumstances, even the most experienced voice coach can do little to help. A lot of experience is required to bring a text to life so that it no longer sounds like a narration, but instead becomes conversational and thereby triggers the language instinct of the listener directly. The text is only the medium with which the ‘message’ is sent; the ‘what’. The ‘how’ of the message is crucial. The word ‘and’, for example, can be reduced to its simple function of a connective word, but it can also tell a whole story:

My friend drank far too much the evening before. We meet the next morning and I ask: “And?”. Do I ask gleefully, hatefully, or worriedly?

Speech is always context-related, but the context does not become apparent of its own volition. It must be created, manufactured, and therefore we need a voice coach to instruct the speaker.

When recording with an amateur, overly strong fluctuations in energy, volume, pitch, tempo, etc. occur not only within the micro-setting of a single recording session, but also across several recording sessions over time. In IVRs, we often deal with prompts that are concatenated from different audio files. To preserve a stable and solid quality, consistency is essential. Professionals know their vocal position exactly (pitch, phrasing, intonation, etc.) and can immediately reproduce the required style. They are able to record long series of variables, such as numbers, without any loss of quality. An inexperienced speaker will almost never succeed, and the inconsistency will become painfully clear.
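One of the fluctuations named above, volume, is easy to check mechanically before prompts are concatenated. The following is a minimal sketch, not a description of any particular production workflow: it assumes each prompt is already available as a list of samples normalised to the range -1.0 to 1.0, and the function names, the 3 dB tolerance and the synthetic test tones are all hypothetical choices for illustration. Pitch, tempo and phrasing are not covered.

```python
import math
import statistics


def rms_dbfs(samples):
    """Root-mean-square level in dB relative to full scale (1.0)."""
    if not samples:
        raise ValueError("prompt contains no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def inconsistent_prompts(prompts, tolerance_db=3.0):
    """Return the names of prompts whose level deviates from the median
    level of the whole set by more than `tolerance_db` (assumed value)."""
    levels = {name: rms_dbfs(samples) for name, samples in prompts.items()}
    reference = statistics.median(levels.values())
    return sorted(
        name for name, level in levels.items()
        if abs(level - reference) > tolerance_db
    )


def tone(amplitude, freq_hz=440.0, rate_hz=8000, seconds=1.0):
    """Synthetic stand-in for a recorded prompt (illustration only)."""
    count = int(rate_hz * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


# Three fragments of one concatenated prompt; the last was recorded much quieter.
prompts = {
    "your_balance_is": tone(0.5),
    "twenty": tone(0.5),
    "euros": tone(0.2),
}
print(inconsistent_prompts(prompts))
```

A check like this only flags level differences after the fact; it does not replace a speaker who can reproduce the same delivery from session to session.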

There are also some physical factors that play an important role in assessing the suitability of a voice for an IVR. Due to the narrow bandwidth of the telephone channel, frequency components that are important for the transmission of speech are cut off. In order to preserve understandable audio quality, the density of the voice is one of the most important attributes, alongside precise articulation. When the vocal muscles transform air into vibrating air (= sound), any false air that is produced will only result in white noise after down-sampling. Usually, laypersons do not have a dense, stable, and well-balanced voice, as this is above all a matter of extensive practice. The phenomenon of sigmatism or S-errors, known in its strong form as a “lisp”, is also detrimental to the comprehensibility of down-sampled audio. There are studies that claim that this phenomenon applies to approximately one third of the population (in its mild form it is almost imperceptible). Where necessary, professionals have trained to get rid of this sigmatism, as microphones reproduce even slight S-errors as loud, unpleasant sibilants. Examples are a hissing S that turns into an emphasised lisping noise, or a whistle over the phone due to the low bandwidth.
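The effect of the narrow telephone channel can be illustrated with a very rough sketch. It assumes the conventional narrowband telephone passband of roughly 300 to 3400 Hz (a figure not stated in the text above) and treats the channel as an ideal filter that simply drops everything outside that band. The component lists for a vowel and for an S sound are illustrative values, not measurements, and the function names are hypothetical.

```python
# Assumed narrowband telephone passband in Hz (not taken from the text).
TELEPHONE_BAND_HZ = (300.0, 3400.0)


def surviving_components(components, band=TELEPHONE_BAND_HZ):
    """Keep only the (frequency_hz, amplitude) pairs inside the band.
    Idealised: a real channel attenuates gradually rather than cutting off."""
    low, high = band
    return [(freq, amp) for freq, amp in components if low <= freq <= high]


def surviving_energy_share(components, band=TELEPHONE_BAND_HZ):
    """Fraction of the signal energy (sum of squared amplitudes) that
    remains after band-limiting."""
    total = sum(amp * amp for _, amp in components)
    if total == 0:
        return 0.0
    kept = sum(amp * amp for _, amp in surviving_components(components, band))
    return kept / total


# Illustrative spectra (assumed values): a vowel concentrated at low and mid
# frequencies, and an S sound concentrated at high frequencies.
vowel = [(700.0, 1.0), (1200.0, 0.5), (2600.0, 0.25)]
s_sound = [(3000.0, 0.1), (5000.0, 1.0), (7000.0, 0.8)]

print(f"vowel energy kept:   {surviving_energy_share(vowel):.0%}")
print(f"S sound energy kept: {surviving_energy_share(s_sound):.0%}")
```

In this simplified picture the vowel passes through unchanged while most of the S sound is lost, which is why clean articulation of sibilants matters more on the phone than in a full-bandwidth recording.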

The influence of dialect on speech, which affects almost all people, can only be discarded through years of training. Often such influences are not desired, as many dialects carry connotations relating to social status or prejudice. Such stereotypes may come into play when a particular dialect is apparent in the speaker, often resulting in disturbance or discomfort during the interaction with the IVR.

Not all professional speakers (voice talents) are suitable for recording prompts. This is not about the distinction between good and bad professionals, but rather about the fact that speakers come from different professional backgrounds and fields of work and are more or less specialised. Speakers recording audio books rely on the emotionality of the voice, speakers who work on commercials focus on seductive and persuasive qualities, and news anchors on neutrality. Personal experience has shown that the best results are achieved when working with those from an acting background. They have the greatest flexibility, knowledge of dialogical speaking, and the necessary empathy. Radio presenters, for example, are only suitable to a limited extent: their monologue approach differs too much from the function of a voice in an IVR that asks questions and gives answers conversationally. It takes much more than a ‘good voice’ for a speech application to enter into a conversational style. The clarity and comprehensibility of a system output depend not only on the quality of the text, but above all on the ability of the speaker and voice coach to anticipate a dialogue.

