Speech perception as categorization
Perception and Psychophysics, Jul 2010 by Holt, Lori L, Lotto, Andrew J
Speech perception (SP) most commonly refers to the perceptual mapping from the highly variable acoustic speech signal to a linguistic representation, whether it be phonemes, diphones, syllables, or words. This is an example of categorization, in that potentially discriminable speech sounds are assigned to functionally equivalent classes. In this tutorial, we present some of the main challenges to our understanding of the categorization of speech sounds and the conceptualization of SP that has resulted from these challenges. We focus here on issues and experiments that define open research questions relevant to phoneme categorization, arguing that SP is best understood as perceptual categorization, a position that places SP in direct contact with research from other areas of perception and cognition.
Spoken syllables may persist in the world for mere tenths of a second. Yet, as adult listeners, we are able to gather a great deal of information from these fleeting acoustic signals. We may apprehend the physical location of the speaker, the speaker’s gender, regional dialect, age, emotional state, or identity. These spatial and indexical factors are conveyed by the acoustic speech signal in parallel with the linguistic message of the speaker (Abercrombie, 1967). Although these factors are of much interest in their own right, speech perception (SP) most commonly refers to the perceptual mapping from acoustic signal to some linguistic representation, such as phonemes, diphones, syllables, words, and so forth.
Most of the research in the field of SP has focused on the mapping from the acoustic speech signal to phonemes, the smallest linguistic unit that changes meaning within a particular language (e.g., /r/ and /l/ as in rake vs. lake), with the often implicit assumption that phoneme representations are a necessary step in the comprehension of spoken language. The transformation from acoustics to phonemes occurs so rapidly and automatically that it mostly escapes our notice (Näätänen & Winkler, 1999). Yet this apparent ease masks the complexity of the speech signal and the remarkable challenges inherent in phoneme perception.
As a starting point, one might presume that phoneme perception is accomplished by detecting characteristics in the acoustic signal that correspond to each phoneme or by comparing a phoneme template in memory with segments of the incoming signal. In fact, this was the presumption in the early days of SP, starting in the 1940s (see Liberman, 1996), and it led to the hope that machine speech recognition was on the horizon. However, it became clear rather quickly that SP was not a simple detection or match-to-pattern task (Liberman, Delattre, & Cooper, 1952). Although there has been a wealth of studies documenting the acoustic “cues” that can signal the identity of different phonemes (see Stevens, 2000, for a review), there is significant variability in the relationship of these cues to the intended phonemes of a speaker and the perceived phonemes of a listener. The variability is due to a multitude of sources, including differences in speaker anatomy and physiology (Fant, 1966), differences in speaking rate (Gay, 1978; Miller & Baer, 1983), effects of the surrounding phonetic context (Kent & Minifie, 1977; Öhman, 1966), and effects of the acoustic environment such as noise or reverberation (Houtgast & Steeneken, 1973). The end result of all of these sources of variability is that there appear to be few or no invariant acoustic cues to phoneme identity (Cooper, Delattre, Liberman, Borst, & Gerstman, 1952; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; but see Blumstein & Stevens, 1981, for a possible exception). This means that listeners cannot accomplish SP by simply detecting the presence or absence of cues.
In place of a simple match-to-sample or detection approach, SP is now often conceived of as a complex categorization task accomplished within a highly multidimensional space. One can conceptualize a segment of the speech signal as a point in this space representing values across multiple acoustic dimensions. In most cases, the dimensions of this space are continuous acoustic variables such as fundamental frequency, formant frequency, formant transition duration, and so forth. That is, speech stimuli are represented by continuous values, as opposed to binary values of the presence or absence of some feature. SP is the process that maps from this space onto representations of phonemes or linguistic features that subsequently define the phoneme (Jakobson, Fant, & Halle, 1952). This is an example of categorization, in that potentially discriminable sounds are assigned to functionally equivalent classes (Massaro, 1987).
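As a minimal sketch of this view (not a model proposed in the text), a speech token can be treated as a point with continuous values on several acoustic dimensions and assigned to the category whose prototype lies nearest. The prototype values, the choice of F1 and F2 as the two dimensions, the log-frequency distance, and the function name `categorize` are all illustrative assumptions.

```python
import math

# Hypothetical category prototypes in a two-dimensional acoustic space
# (F1, F2 in Hz). Values are illustrative assumptions, not data from the text.
PROTOTYPES = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def categorize(token, prototypes=PROTOTYPES):
    """Return the label of the prototype nearest to the token.

    Distance is Euclidean on log-frequency, an assumed choice that lets
    F1 and F2 contribute comparably despite their different ranges in Hz.
    """
    def dist(p, q):
        return math.sqrt(sum((math.log(a) - math.log(b)) ** 2 for a, b in zip(p, q)))
    return min(prototypes, key=lambda label: dist(token, prototypes[label]))

# Two acoustically distinct tokens are mapped to the same class.
print(categorize((290.0, 2200.0)))
print(categorize((320.0, 2400.0)))
```

Both tokens differ in their continuous acoustic values yet receive the same label, which is the sense in which discriminable sounds are treated as functionally equivalent.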
An early example of such an acoustic space representation for phoneme classes is present in Peterson and Barney (1952), where vowel productions by adult males and females and children were displayed in terms of first and second formant (F1 and F2) frequencies. This simple distribution map demonstrates that exemplars of particular phonemes tend to cluster together in acoustic space (e.g., instances of the vowel /i/ as in heat tend to have low F1s and high F2s), but there is a tremendous amount of overlap among the distributions of different vowels owing to variability in speech productions (see also Hillenbrand, Getty, Clark, & Wheeler, 1995, for an update on these vowel measures, and Lisker & Abramson, 1964, for overlap in consonant voicing distributions). Presumably, listeners have to determine boundaries in order to parse these acoustic spaces and perceive the intended phonemes despite acoustic variability.
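The consequence of overlapping distributions can be illustrated with a small sketch. It assumes two hypothetical vowel categories whose F1 values are normally distributed with equal variance and equal prior probability. The means and standard deviation are invented for illustration and are not taken from Peterson and Barney (1952). Under these assumptions the best single boundary lies midway between the two means, and some tokens still fall on the wrong side of it.

```python
from statistics import NormalDist

# Hypothetical F1 distributions (Hz) for two neighboring vowel categories.
# Means and standard deviation are illustrative assumptions.
cat_a = NormalDist(mu=400.0, sigma=50.0)
cat_b = NormalDist(mu=500.0, sigma=50.0)

# With equal variances and equal priors, the optimal boundary is the midpoint.
boundary = (cat_a.mean + cat_b.mean) / 2

# Proportion of each category's tokens that fall on the wrong side.
miss_a = 1 - cat_a.cdf(boundary)  # category A tokens above the boundary
miss_b = cat_b.cdf(boundary)      # category B tokens below the boundary
error_rate = (miss_a + miss_b) / 2

# Overlap coefficient: area shared by the two density curves.
overlap = cat_a.overlap(cat_b)

print(f"boundary = {boundary:.0f} Hz")
print(f"error rate = {error_rate:.3f}")
print(f"overlap = {overlap:.3f}")
```

With means one standard deviation from the boundary on either side, about 16 percent of tokens are misassigned even by the best single-dimension boundary, which suggests why listeners would need to draw on more than one acoustic dimension.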