Program at a glance

Timetable

CEST | Saturday 26/8 | Sunday 27/8 | Monday 28/8
 | Rob Clark | Alejandrina Cristia | Oral session 5
10:15 | Coffee Break | Coffee Break | Coffee Break
10:30 | Oral session 1 | Oral session 3 | Keynote: Chloé Clavel
12:00 | Lunch Break | Lunch Break | Lunch Break
13:30 | Poster session for regular papers | Oral session 4 | Poster session for late breaking reports
15:00 | Coffee Break | Coffee Break |
15:15 | Oral session 2 | Roundtable on Ethics & Generative AI (Patrick Kuban, Jeannette Gorzala & Ambre Davant) | Oral session 6
16:45 | | | General assembly
18:00 – 20:30 | Welcome reception (wine & cheese at the venue) | |
19:30 – 23:30 | | Social event (buffet & musical concert at Fort de la Bastille) |

Oral 1: Text encoding for TTS

O1 | Advocating for text input in multi-speaker text-to-speech systems | Gérard Bailly, Martin Lenglet, Olivier Perrotin and Esther Klabbers
O2 | Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations | Jason Fong, Hao Tang and Simon King
O3 | A Comparative Analysis of Pretrained Language Models for Text-to-Speech | Marcel Granero Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet and Thomas Drugman
O4 | Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection | Phat Do, Matt Coler, Jelske Dijkstra and Esther Klabbers

Oral 2: Evaluation

O5 | Importance of Human Factors in Text-To-Speech Evaluations | Lev Finkelstein, Joshua Camp and Rob Clark
O6 | Re-examining the quality dimensions of synthetic speech | Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach and Petra Wagner
O7 | Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation | Ambika Kirkland, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely and Joakim Gustafson
O8 | MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module | Ondřej Plátek and Ondrej Dusek

Oral 3: Beyond text-to-speech

O9 | Cross-lingual transfer using phonological features for resource-scarce text-to-speech | Johannes Abraham Louw
O10 | Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion | Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi and Hiroshi Saruwatari
O11 | Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS | Harm Lameris, Ambika Kirkland, Joakim Gustafson and Eva Szekely
O12 | Synthesising turn-taking cues using natural conversational data | Johannah O’Mahony, Catherine Lai and Simon King

Oral 4: Voice conversion

O13 | StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings | Arnab Das, Suhita Ghosh, Tim Polzehl, Ingo Siegert and Sebastian Stober
O14 | PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder | Kou Tanaka, Hirokazu Kameoka and Takuhiro Kaneko
O15 | Federated Learning for Human-in-the-Loop Many-to-Many Voice Conversion | Ryunosuke Hirai, Yuki Saito and Hiroshi Saruwatari
O16 | HiFi-VC: High Quality ASR-based Voice Conversion | Anton Kashkin, Ivan Karpukhin and Svyatoslav Shishkin

Oral 5: Expressivity, emotion & styles

O17 | EmoSpeech: guiding FastSpeech2 towards Emotional Text to Speech | Daria Diatlova and Vitalii Shutov
O18 | Controllable Emphasis with zero data for text-to-speech | Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman and Elena Sokolova
O19 | Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control | Martin Lenglet, Olivier Perrotin and Gérard Bailly
O20 | Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody | Sofoklis Kakouros, Juraj Šimko, Martti Vainio and Antti Suni

Oral 6: Long form, multimodal & multi-speaker TTS

O21 | An analysis on the effects of speaker embedding choice in non auto-regressive TTS | Adriana Stan and Johannah O’Mahony
O22 | Audiobook synthesis with long-form neural text-to-speech | Weicheng Zhang, Cheng-Chieh Yeh, Will Beckman, Tuomo Raitio, Ramya Rasipuram, Ladan Golipour and David Winarsky
O23 | Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling | Tuomo Raitio, Javier Latorre, Andrea Davis, Tuuli Morrill and Ladan Golipour
O24 | Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis | Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Eva Szekely and Gustav Eje Henter

Poster session for regular papers

P1 | Diffusion Transformer for Adaptive Text-to-Speech | Haolin Chen and Philip N. Garner
P2 | On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis | Siyang Wang, Gustav Eje Henter, Joakim Gustafson and Eva Szekely
P3 | Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus | David Guennec, Lily Wadoux, Aghilas Sini, Nelly Barbot and Damien Lolive
P4 | Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping | Ravi Shankar and Archana Venkataraman
P5 | Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data | Jarod Duret, Yannick Estève and Titouan Parcollet
P6 | Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests | Kishor Kayyar, Christian Dittmar, Nicola Pia and Emanuel Habets
P7 | Better Replacement for TTS Naturalness Evaluation | Sajad Shirali-Shahreza and Gerald Penn
P8 | The Impact of Pause-Internal Phonetic Particles on Recall in Synthesized Lectures | Mikey Elmers and Eva Szekely
P9 | SPTK4: An Open-Source Software Toolkit for Speech Signal Processing | Takenori Yoshimura, Takato Fujimoto, Keiichiro Oura and Keiichi Tokuda
P10 | FiPPiE: A Computationally Efficient Differentiable method for Estimating Fundamental Frequency From Spectrograms | Lev Finkelstein, Chun-an Chan, Vincent Wan, Heiga Zen and Rob Clark
P11 | Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | Biel Tura Vecino, Adam Gabrys, Daniel Matwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu and Jaime Lorenzo-Trueba
P12 | Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis | Ibrahim Ibrahimov, Gabor Gosztolya and Tamas Gabor Csapo

Poster session for late breaking reports (LBR)

LBR1 | Universal Approach to Multilingual Multispeaker Child Speech Synthesis | Shaimaa Alwaisi, Mohammed Salah Al-Radhi and Géza Németh
LBR2 | Towards Speaker-Independent Voice Conversion for Improving Dysarthric Speech Intelligibility | Seraphina Fong, Marco Matassoni, Gianluca Esposito and Alessio Brutti
LBR3 | Exploring the multidimensional representation of individual speech acoustic parameters extracted by deep unsupervised models | Maxime Jacquelin, Maeva Garnier, Laurent Girin, Rémy Vincent and Olivier Perrotin
LBR4 | SarcasticSpeech: Speech Synthesis for Sarcasm in Low-Resource Scenarios | Zhu Li, Xiyuan Gao, Shekhar Nayak and Matt Coler
LBR5 | Recovering Discrete Prosody Inputs via Invert-Classify | Nicholas Sanders and Korin Richmond
LBR6 | Using a Large Language Model to Control Speaking Style for Expressive TTS | Atli Thor Sigurgeirsson and Simon King
LBR7 | NaijaTTS: A pitch-controllable TTS model for Nigerian Pidgin | Emmett Strickland, Dana Aubakirova, Dorin Doncenco, Diego Torres and Marc Evrard