Rob Clark, Google, « Text-to-speech in 2023 »
We take a look at the implications for TTS systems and speech researchers as the TTS landscape changes drastically with fast-paced developments in machine learning. We see how TTS has changed, and show that while some TTS problems are being actively solved by large-scale ML approaches, others are being ignored, and new problems are arising due to the changing landscape.
Rob Clark joined the University of Edinburgh in 1995, first as an MSc student, after his experience as an undergraduate in mathematics sparked an interest in speech. Rob stayed in Edinburgh for 19 years in total at the Centre for Speech Technology Research, in a number of roles, mostly working on the Festival TTS system. At Edinburgh, he received his PhD in 2003, and at one point ended up running the MSc programme that he himself had taken. Rob joined Google in 2015 and took over leading TTS research, which led to collaborations on the development of TTS models such as WaveNet and Tacotron and the systems that currently serve millions of queries a day to Google users. Rob’s own primary research interest has always been prosody in TTS, where his research has focused on both the expressiveness/emotion side of prosody and the correct realization of prominence patterns. Rob continues to work in this area today.
Alejandrina Cristia, CNRS/LSCP Paris, « Babble everywhere: A cross-linguistic approach to understanding how early speech experiences impact language development »
A shocking recent report found that 53% of papers published in mainstream language development journals focused on English, and that fewer than 1% of the world’s languages had been represented in even a single language development paper (Kidd & Garcia, 2022). Empirical and theoretical conclusions that appear incontrovertible may actually be true only of the small fraction of the world’s population that has captured most of our scientific attention. In this talk, I describe how emergent technologies are helping us better represent the diversity in languages and child-rearing practices that characterizes human populations. Data come from long-form audio recordings that capture children’s input and production from an ego perspective. Combining unique speech technology approaches with citizen science support, we are gaining new insights into the diversity of children’s spoken input and the relative stability of their own vocal productions. In addition, self-supervised machine learning is employed to attempt to reverse-engineer the language acquisition process, revealing tantalizing differences between human children’s and AI’s development.
Alejandrina Cristia is a senior researcher at the Centre National de la Recherche Scientifique (CNRS) and leader of the Language Acquisition Across Cultures team at the Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), co-hosted by the Ecole Normale Supérieure, EHESS, and PSL. Her long-term aim is to shed light on child language development, both descriptively and mechanistically. To this end, her team draws on methods and insights from linguistics, psychology, anthropology, economics, and speech technology. This interdisciplinary approach has resulted in over 100 publications in international journals and conferences. With an interest in cumulative, collaborative, and transparent science, she co-founded the first meta-meta-analysis platform (metalab.stanford.edu) and several international networks, including DAylong Recordings of Children’s Language Environment (darcle.org) and the Consortium on Language Variation in Input Environments around the World (LangVIEW), which aims to increase participant and researcher diversity in language development studies. She received the 2017 James S. McDonnell Scholar Award in Understanding Human Cognition, the 2020 Médaille de Bronze CNRS Section Linguistique, and an ERC Consolidator Award (2021-2026) for the [ExELang](exelang.fr) project.
Chloé Clavel, Telecom-Paris, « Socio-conversational AI: Modelling the socio-emotional component of interactions using neural models »
A single lapse in social tact on the part of a conversational system (chatbot, voice assistant, social robot) can cause the user’s trust in and engagement with the interaction to drop. This lack of social intelligence limits the willingness of a large audience to view conversational systems as acceptable. To understand the state of the user, the affective/social computing research community has drawn on research in artificial intelligence and the social sciences. However, in recent years the trend has shifted towards a monopoly of deep learning methods, which are powerful but opaque, hungry for annotated data, and less suitable for integrating social science knowledge. I will present the research we are doing within the Social Computing team at Telecom-Paris to develop machine/deep learning models of the social component of interactions. In particular, I will focus on research aimed at improving the explainability of the models as well as their transferability to new data and new socio-emotional phenomena.
Main references for this talk:
Aina Garí Soler, Matthieu Labeau and Chloé Clavel (2022). One Word, Two Sides: Traces of Stance in Contextualized Word Representations. Proceedings of the 29th International Conference on Computational Linguistics (COLING), Gyeongju, Korea, October 12-17.
Raphalen, Y., Clavel, C. & Cassell, J. (2022). « You might think about slightly revising the title »: identifying hedges in peer-tutoring interactions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Long Papers.
Gaël Guibon, Matthieu Labeau, Hélène Flamein, Luce Lefeuvre and Chloé Clavel (2021). Few-Shot Emotion Recognition in Conversation with Sequential Prototypical Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Nicolas Rollet and Chloé Clavel (2020). « Talk to you later »: Doing social robotics with conversation analysis. Towards the development of an automatic system for the prediction of disengagement. Interaction Studies, 21(2), 269–293.
Léo Hemamou, Arthur Guillon, Jean-Claude Martin and Chloé Clavel (2021). Multimodal Hierarchical Attention Neural Network: Looking for Candidates Behaviour which Impact Recruiter’s Decision. IEEE Transactions on Affective Computing.
I am a Professor of Affective Computing at LTCI, Telecom-Paris, Institut Polytechnique de Paris, where I coordinate the Social Computing team. My research interests are in the areas of affective computing and artificial intelligence, and lie at the crossroads of multiple disciplines, including speech and natural language processing, machine learning, multimodal interaction, and social robotics. I study computational models of socio-emotional behaviors (e.g., sentiments, social stances, engagement, trust) in interactions, be they human-human interactions (social networks, job interviews) or human-agent interactions (conversational agents, social robots). I am motivated by application areas such as health and education, where research in affective computing and artificial intelligence is dedicated to empowering people and improving their quality of life.