Koninklijke Vereniging voor Nederlandse Muziekgeschiedenis

Acoustic Vowel Space in the Speech-to-Song Illusion

Anna Hiemstra recently graduated with an M.A. in musicology from the University of Amsterdam. This blog post is about her thesis, which investigated an interesting phenomenon at the intersection of language and music cognition.

The story of how Diana Deutsch, Professor of Psychology at the University of California, San Diego, discovered the Speech-to-Song Illusion has achieved something of a legendary status in music cognition circles. While editing spoken commentary for a CD, Deutsch looped a single phrase – “sometimes behave so strangely” – and inadvertently left it playing while she attended to another task. Suddenly, it seemed to her that the words were being sung rather than spoken. The change was so stark that she looked around for a moment to check whether someone had entered the room and started singing. No one had, but a perceptual transformation had occurred after repeated listening to the same phrase: a transformation from speech to song.



In this video we see a group of children react to the phenomenon.


Listeners can usually tell without effort whether they are hearing speech or song. That the two modes are perceived so differently suggests that there are clear acoustical differences between them. In the Speech-to-Song Illusion, however, those acoustical distinctions evaporate: spoken sentences subject to the illusion are initially perceived as speech but sound like song after repetition. The illusion shows that, in some circumstances, the distinction between speech and song is purely perceptual rather than acoustical. In other words, the workings of our brain are solely responsible for what we are hearing.

Among auditory illusions – think of the recent “Laurel” or “Yanny” phenomenon – the Speech-to-Song Illusion is particularly fascinating because it has the potential to inform us about the differences between language and music processing. There is considerable discussion about the mechanisms behind both. In particular, researchers disagree about whether music and language are processed independently or through shared circuitry. Testing theories of shared or domain-specific neural representation is complicated by the difficulty of finding comparable musical and linguistic stimuli. Sentences subject to the Speech-to-Song Illusion make for perfectly matched stimuli: an acoustical signal that, in identical form, can be perceived as either speech or music.

Deutsch et al. (2011) showed that repetition is key to the illusion. A phrase can transform from speech to song by virtue of repetition alone; the sound signal stays exactly the same. In fact, when the researchers made small changes to the repeated phrase, the illusion no longer occurred. Since repetition is at the root of the Speech-to-Song Illusion, one might expect the same effect for any looped section of speech. But later research has shown that some sentences do not transform to song, even after many repetitions (Tierney et al., 2013). It is intriguing that some spoken sentences easily convince our brain that we are listening to song, while others do not. It seems that some sentences have characteristics that invite or inhibit the perceptual change: properties that hint to our brain that we are listening to song or to speech. Some sentences may have more “musical” properties, relating to rhythm or pitch, that allow for a quick switch to music processing mode. Conversely, other sentences may sound so strongly speech-like that they resist the transformation. Because our brain is trained to differentiate between speech and song through countless hours of exposure to both modes, the cues pointing it toward music or speech may be subtle enough to escape our conscious detection. For my master’s thesis in musicology at the University of Amsterdam, I was interested in finding out whether such subtle acoustic cues could influence the transformation effect of the Speech-to-Song Illusion.

In my research, I examined whether the properties of vowels affect the Speech-to-Song Illusion. I compared the acoustic properties of vowels in a large dataset of sentences that had previously been shown either to transform strongly to song or to remain reliably stable as speech. Specifically, I isolated three vowel groups in this dataset and measured their formant frequencies (the frequency components of a vowel that are shaped by the vocal tract) to draw up a two-dimensional representation of the formant frequencies in both groups of sentences. I then compared the size of this overall “vowel space area” between the non-transforming and transforming groups (Figure 1). In general, a larger vowel space area is associated with greater acoustic distinction between vowel categories. I found that the overall vowel space was larger in the speech-like (non-transforming) stimuli than in the song-like (transforming) stimuli. One possible reason for this difference is that listeners learn to associate large vowel spaces with speech and small vowel spaces with song through exposure to formant differences between spoken and sung vowels. The expanded vowel space of the non-transforming sentences may be perceptually associated with speech, thereby activating speech processing circuitry in the brain that inhibits the perceptual transformation to song.


Figure 1 - The 'vowel space area'
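
For readers who want to see how a vowel space area comparison like the one described above could be computed, here is a minimal sketch in Python. It is not the analysis code from the thesis. It assumes that the two-dimensional representation uses the first two formants (F1 and F2, in Hz), that the three vowel groups are the corner vowels /i/, /a/ and /u/, and that the area is taken as the triangle spanned by the mean formant values of each vowel. All numbers are invented for illustration and are not results from the study.

```python
from statistics import mean


def mean_formants(tokens):
    """Average (F1, F2) in Hz over a list of measured vowel tokens."""
    f1 = mean(token[0] for token in tokens)
    f2 = mean(token[1] for token in tokens)
    return (f1, f2)


def polygon_area(points):
    """Area of a simple polygon (shoelace formula), vertices given in order."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def vowel_space_area(tokens_by_vowel):
    """Area (in Hz^2) of the polygon spanned by the mean formants per vowel."""
    corners = [mean_formants(tokens) for tokens in tokens_by_vowel.values()]
    return polygon_area(corners)


# Hypothetical (F1, F2) measurements in Hz; NOT data from the thesis.
speech_like = {
    "i": [(280, 2250), (300, 2350)],
    "a": [(750, 1300), (770, 1340)],
    "u": [(310, 850), (330, 890)],
}
song_like = {
    "i": [(310, 2080), (330, 2120)],
    "a": [(690, 1330), (710, 1370)],
    "u": [(340, 930), (360, 970)],
}

if __name__ == "__main__":
    print("speech-like area:", vowel_space_area(speech_like))
    print("song-like area:", vowel_space_area(song_like))
```

With these made-up values the speech-like triangle comes out larger than the song-like one, which mirrors the direction of the finding reported above. In practice the formant values would come from acoustic measurements of the recorded sentences rather than being typed in by hand.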


How does the brain decide whether we are listening to speech or song? As in many perceptual processes, our previous experiences help our brain figure out what is going on around us. We grow up hearing countless hours of speaking and singing, and we extract many small pieces of information from these experiences to categorize what comes in. Deutsch’s “sometimes behave so strangely” quickly transforms into a catchy tune, easily convincing our brain that yes, this is music. Perhaps this sentence has rhythmic qualities that subconsciously remind us of a tune to bob our head to, or a melodic contour that is reminiscent of a children’s song. Listeners clearly rely on unmistakable cues such as semantic meaning, pitch and rhythm when encoding music and speech, but the results of my thesis suggest that indications as subtle as differences in vowel formant frequencies may also play a role in this process.


Deutsch, D., Henthorn, T., & Lapidis, R. (2011). Illusory transformation from speech to song. The Journal of the Acoustical Society of America, 129, 2245–2252. https://doi.org/10.1121/1.3562174

Tierney, A., Dick, F., Deutsch, D., & Sereno, M. (2013). Speech versus song: Multiple pitch-sensitive areas revealed by a naturally occurring musical illusion. Cerebral Cortex. https://doi.org/10.1093/cercor/bhs003