Multisensory contributions to speech perception

Why can understanding someone get much better if we can see them?

Image from Weikum et al. 2007

The speech signal is naturally multisensory, because it is simultaneously available to the ear and to the eye. The case of speech is particularly rich and interesting because of the different nested layers of information involved in communication (from semantics to phonetics). For example, the study of how articulatory gestures from the speaker's mouth are bound to their correlated sounds has become one of the most popular test models for studying multisensory processes in humans. It has long been known that the availability of facial movements from the speaker improves speech perception under noisy conditions. On the other hand, the classic McGurk illusion (many demonstrations are available on the web; see ours from the Cosmocaixa demos: Video1 and Video2) speaks to the profound influence of visual speech information on the way we perceive the spoken message. Our interest is to go beyond the expression of audio-visual integration in speech and attempt to grasp the mechanisms that underlie this particularly important multisensory capacity of humans.

Audio-visual speech integration and predictive coding

One interesting idea currently under debate is the possible role of visual information in facilitating the processing of auditory speech via predictive mechanisms. Put simply, visual information might provide advance information about the possible speech sounds, thereby constraining (and speeding up) the parsing of the spoken sounds. This framework capitalizes on the fact that, for biomechanical reasons, the speaker's mouth takes shape in advance of sound production, and that this information could be used very efficiently by the perceiver. We have attempted to provide some evidence for such a mechanism by showing behavioral advantages and ERP correlates for predictive vs. non-predictive visual context in auditory speech processing. In one of our most recent studies, we have extended this idea to the realm of the hand gestures that accompany speech. In this case, gestures would provide reliable advance cues for important words. Using EEG time-frequency analyses on the perceiver's side, we have found that hand gestures tend to synchronize low-frequency (theta) neural activity, and desynchronize alpha activity, in preparation for upcoming words.

The case of audio-visual integration in the second language

Many of us have experienced the frustration of trying to hold a conversation over the phone in a second language, just when we thought we were becoming fluent! Why is it so much easier to understand a second language when you can see the speaker? The integration of heard and seen speech has received a good deal of attention in monolingual contexts and its benefits have been repeatedly documented. Yet, the potential benefits of multisensory integration when dealing with speech in a second language have not received nearly as much attention. This is especially striking given that most of the world's population lives in multilingual societies. From the mere perspective of sensory and perceptual processing, it is widely accepted that the speakers of a language (monolingual or bilingual) will make use of as many cues as are available to them in order to effectively decode the spoken signal. We have hypothesized that audio-visual integration can help make up for the impoverished speech recognition in second languages. At present, we have performed two studies where we use fMRI to compare brain activity related to the audio-visual perception of the native language with that of a second language. Our results converge on the idea that audio-visual integration of a second language does not rely on a different network than the first language, but it does modulate the weight of the different nodes in a common network.

Representative publications:

Biau E, Soto-Faraco S.  2013.  Beat gestures modulate auditory integration in speech perception. Brain and Language. 124(2):143-152.

Barrós-Loscertales A, Ventura-Campos N, Bustamante JC, Alsius A, Calabresi M, Soto-Faraco S, Avila C.  2009.  Neural correlates of audio-visual integration in bilinguals. NeuroImage. 47(1):S39-S41.

Soto-Faraco S, Alsius A.  2007.  Conscious access to the unisensory components of a cross-modal illusion. Neuroreport. 18(4):347-350.

Weikum WM, Vouloumanos A, Navarra J, Soto-Faraco S, Sebastián-Gallés N, Werker JF.  2007.  Visual language discrimination in infancy. Science. 316(5828):1159.