Deep learning and voice comparison: phonetically-motivated vs. automatically-learned features

Broadband spectrograms of the French vowels /ɑ̃/, /a/, /ɛ/, /e/, /i/, /ə/, and /ɔ/, extracted from radio broadcast corpora, were used to train a deep convolutional neural network (CNN) to recognize 45 speakers. The same network was also trained with 62 phonetic parameters in order to i) see whether the resulting confusions were identical to those made by the CNN trained with spectrograms, and ii) understand which acoustic parameters were used by the network. The two networks produced identical discrimination results for 68% of the data. For 22% of the data, the network trained with spectrograms discriminated successfully while the network trained with phonetic parameters failed; the reverse held for 10% of the data. We display the relevant phonetic parameters, both as raw values and as values relative to the speakers’ means, and show cases that lead to poor discrimination. When the network trained with spectrograms failed to discriminate between some tokens, parameters related to f0 proved significant.
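The three-way split reported above (identical results, spectrogram network only, phonetic-parameter network only) could be computed along the following lines. This is a minimal sketch, not the authors' procedure: the function name `agreement_breakdown`, the category labels, and the toy data are hypothetical, and it assumes that "identical discrimination results" means both networks either succeed or fail on the same token.

```python
from collections import Counter


def agreement_breakdown(true_speakers, pred_spectrogram, pred_phonetic):
    """Return the share of tokens falling into each outcome category.

    Assumption: two networks have "identical results" on a token when
    both identify the speaker correctly or both fail to do so.
    """
    if not (len(true_speakers) == len(pred_spectrogram) == len(pred_phonetic)):
        raise ValueError("all three sequences must have the same length")
    if not true_speakers:
        raise ValueError("at least one token is required")

    counts = Counter()
    for truth, spec, phon in zip(true_speakers, pred_spectrogram, pred_phonetic):
        spec_ok = spec == truth
        phon_ok = phon == truth
        if spec_ok == phon_ok:
            counts["same_outcome"] += 1      # both succeed or both fail
        elif spec_ok:
            counts["spectrogram_only"] += 1  # only the spectrogram network succeeds
        else:
            counts["phonetic_only"] += 1     # only the phonetic network succeeds

    total = len(true_speakers)
    categories = ("same_outcome", "spectrogram_only", "phonetic_only")
    return {name: counts[name] / total for name in categories}


# Toy data (invented for illustration): ten tokens from one speaker "A".
truth = ["A"] * 10
spec_preds = ["A", "A", "A", "A", "A", "A", "A", "B", "B", "B"]
phon_preds = ["A", "A", "A", "A", "A", "B", "B", "A", "B", "B"]
shares = agreement_breakdown(truth, spec_preds, phon_preds)
```

Because every token falls into exactly one category, the three shares sum to 1, which matches the 68% + 22% + 10% split reported in the abstract.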

International Congress of Phonetic Sciences