Phonetics and Artificial Intelligence: ready for the paradigm shift?

Photo by Andy Kelly on Unsplash


Modern phonetics has relied, to a large extent, on researchers’ ability to extract patterns from visual representations of speech. In this respect, if linguists were medical doctors, phoneticians would be radiologists. Speaking of radiologists, recent progress in artificial intelligence has made it possible for certain deep learning algorithms to outperform human pathologists at detecting abnormalities in medical images (Litjens et al., 2017). If the analogy holds, it is fair to ask whether artificial intelligence can beat phoneticians at their own game or, at least, constitute a significant addition to their toolbox. My contention is that the advent of deep learning opens up a whole new research programme for the humanities in general, and phonetics in particular. While deep neural networks (DNNs) have been duly praised for bringing about a major breakthrough in applied fields like automatic speech (Hinton et al., 2012) or image (Simonyan & Zisserman, 2015) recognition, we are only just starting to realize how fundamental research in our field can benefit from them (Ferragne et al., 2019; Pellegrini & Mouysset, 2016). There are at least three reasons why DNNs will trigger a paradigm shift in phonetics. Firstly, unlike other quantitative techniques, DNNs can extract relevant representations from the speech signal without the need for a human expert to provide the system with hand-picked features (Goodfellow et al., 2016, for a comprehensive account of DNN properties). As a result, typical workflows now boast improved reproducibility; the possibility is raised that previously unnoticed parameters can be brought to light; and manual segmentation – a major bottleneck in phonetic analysis – is no longer needed in some cases. Secondly, deep learning will contribute to bringing the old parsimony-driven paradigm to a close. There is a whole record of experimental research that demonstrates that mental phonetic representations are detailed and multidimensional (Pierrehumbert, 2016). So, now that the high-dimensionality taboo has been broken, and that increasingly powerful and cheap computing resources have become available, the time is just right for the emergence of DNNs in phonetics, with their rich inputs and outputs. Thirdly, the current focus on explicability in the deep learning community has led to effective methods to visualize what DNNs learn (Chattopadhay et al., 2018). My claim here is that scientific findings based on visuals are key to bridging the divide between the hard sciences and humanities. And “visible speech”, the powerful synaesthetic cornerstone of contemporary phonetics, is more than ever legitimatized by DNN-based methods. Moreover, such techniques undoubtedly represent the logical alternative to the modern unreasonable urge to (over-) use inferential statistics and its misleading probability values. I will illustrate these claims with examples taken from on-going work in this nascent research field. I will more specifically focus on how convolutional neural networks used in image recognition and computer vision can be adapted to the study of phonetics. I will discuss the advantages and shortcomings of this novel approach, and I hope to show that while deep learning lies at the intersection of experimental and corpus phonetics, it offers the best of both worlds.

Jun 5, 2019 2:00 PM
Aix-en-Provence, Laboratoire Parole et Langage, 5 Avenue Pasteur


Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018). Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. WACV, Lake Tahoe.

Ferragne, E., Gendrot, C., & Pellegrini, T. (2019). Towards Phonetic Interpretability in Deep Learning Applied to Voice Comparison. ICPhS, Melbourne.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, Massachusetts: The MIT Press.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82-97.

Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J., van Ginneken, B., & Sánchez, C. I. (2017). A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis, 42, 60-88.

Pellegrini, T., & Mouysset, S. (2016). Inferring Phonemic Classes from CNN Activation Maps Using Clustering Techniques. Interspeech, San Francisco.

Pierrehumbert, J. B. (2016). Phonological Representation: Beyond Abstract Versus Episodic. Annual Review of Linguistics, 2(1), 33-52.

Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015, San Diego.