Program at a glance

Time table

CEST	Saturday 26/8	Sunday 27/8	Monday 28/8
8:00	Registration
8:30	Opening
9:00	Keynote: Rob Clark	Keynote: Alejandrina Cristia	Oral session 5
10:15	Coffee Break	Coffee Break	Coffee Break
10:30	Oral session 1	Oral session 3	Keynote: Chloé Clavel
12:00	Lunch Break	Lunch Break	Lunch Break
13:30	Poster session for regular papers	Oral session 4	Poster session for late breaking reports
15:00	Coffee Break	Coffee Break
15:15	Oral session 2	Roundtable on Ethics & Generative AI (Patrick Kuban, Jeannette Gorzala & Ambre Davant)	Oral session 6
16:45	General assembly SynSIG
18:00 – 20:30	Welcome reception (wine & cheese at the venue)
19:30 – 23:30		Social event (buffet & musical concert at Fort de la Bastille)

Oral 1: Text encoding for TTS

O1	Advocating for text input in multi-speaker text-to-speech systems	Gérard Bailly, Martin Lenglet, Olivier Perrotin and Esther Klabbers
O2	Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations	Jason Fong, Hao Tang and Simon King
O3	A Comparative Analysis of Pretrained Language Models for Text-to-Speech	Marcel Granero Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet and Thomas Drugman
O4	Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection	Phat Do, Matt Coler, Jelske Dijkstra and Esther Klabbers

Oral 2: Evaluation

O5	Importance of Human Factors in Text-To-Speech Evaluations	Lev Finkelstein, Joshua Camp and Rob Clark
O6	Re-examining the quality dimensions of synthetic speech	Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach and Petra Wagner
O7	Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation	Ambika Kirkland, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely and Joakim Gustafson
O8	MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module	Ondřej Plátek and Ondřej Dušek

Oral 3: Beyond text-to-speech

O9	Cross-lingual transfer using phonological features for resource-scarce text-to-speech	Johannes Abraham Louw
O10	Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion	Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi and Hiroshi Saruwatari
O11	Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS	Harm Lameris, Ambika Kirkland, Joakim Gustafson and Eva Szekely
O12	Synthesising turn-taking cues using natural conversational data	Johannah O’Mahony, Catherine Lai and Simon King

Oral 4: Voice conversion

O13	StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings	Arnab Das, Suhita Ghosh, Tim Polzehl, Ingo Siegert and Sebastian Stober
O14	PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder	Kou Tanaka, Hirokazu Kameoka and Takuhiro Kaneko
O15	Federated Learning for Human-in-the-Loop Many-to-Many Voice Conversion	Ryunosuke Hirai, Yuki Saito and Hiroshi Saruwatari
O16	HiFi-VC: High Quality ASR-based Voice Conversion	Anton Kashkin, Ivan Karpukhin and Svyatoslav Shishkin

Oral 5: Expressivity, emotion & styles

O17	EmoSpeech: guiding FastSpeech2 towards Emotional Text to Speech	Daria Diatlova and Vitalii Shutov
O18	Controllable Emphasis with zero data for text-to-speech	Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman and Elena Sokolova
O19	Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control	Martin Lenglet, Olivier Perrotin and Gérard Bailly
O20	Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody	Sofoklis Kakouros, Juraj Šimko, Martti Vainio and Antti Suni

Oral 6: Long form, multimodal & multi-speaker TTS

O21	An analysis on the effects of speaker embedding choice in non auto-regressive TTS	Adriana Stan and Johannah O’Mahony
O22	Audiobook synthesis with long-form neural text-to-speech	Weicheng Zhang, Cheng-Chieh Yeh, Will Beckman, Tuomo Raitio, Ramya Rasipuram, Ladan Golipour and David Winarsky
O23	Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling	Tuomo Raitio, Javier Latorre, Andrea Davis, Tuuli Morrill and Ladan Golipour
O24	Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis	Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Eva Szekely and Gustav Eje Henter

Poster session

P1	Diffusion Transformer for Adaptive Text-to-Speech	Haolin Chen and Philip N. Garner
P2	On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis	Siyang Wang, Gustav Eje Henter, Joakim Gustafson and Eva Szekely
P3	Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus	David Guennec, Lily Wadoux, Aghilas Sini, Nelly Barbot and Damien Lolive
P4	Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping	Ravi Shankar and Archana Venkataraman
P5	Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data	Jarod Duret, Yannick Estève and Titouan Parcollet
P6	Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests	Kishor Kayyar, Christian Dittmar, Nicola Pia and Emanuel Habets
P7	Better Replacement for TTS Naturalness Evaluation	Sajad Shirali-Shahreza and Gerald Penn
P8	The Impact of Pause-Internal Phonetic Particles on Recall in Synthesized Lectures	Mikey Elmers and Eva Szekely
P9	SPTK4: An Open-Source Software Toolkit for Speech Signal Processing	Takenori Yoshimura, Takato Fujimoto, Keiichiro Oura and Keiichi Tokuda
P10	FiPPiE: A Computationally Efficient Differentiable method for Estimating Fundamental Frequency From Spectrograms	Lev Finkelstein, Chun-an Chan, Vincent Wan, Heiga Zen and Rob Clark
P11	Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	Biel Tura Vecino, Adam Gabrys, Daniel Matwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu and Jaime Lorenzo-Trueba
P12	Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis	Ibrahim Ibrahimov, Gabor Gosztolya and Tamas Gabor Csapo

Poster session for late breaking reports (LBR)

LBR1	Universal Approach to Multilingual Multispeaker Child Speech Synthesis	Shaimaa Alwaisi, Mohammed Salah Al-Radhi and Géza Németh
LBR2	Towards Speaker-Independent Voice Conversion for Improving Dysarthric Speech Intelligibility	Seraphina Fong, Marco Matassoni, Gianluca Esposito and Alessio Brutti
LBR3	Exploring the multidimensional representation of individual speech acoustic parameters extracted by deep unsupervised models	Maxime Jacquelin, Maeva Garnier, Laurent Girin, Rémy Vincent and Olivier Perrotin
LBR4	SarcasticSpeech: Speech Synthesis for Sarcasm in Low-Resource Scenarios	Zhu Li, Xiyuan Gao, Shekhar Nayak and Matt Coler
LBR5	Recovering Discrete Prosody Inputs via Invert-Classify	Nicholas Sanders and Korin Richmond
LBR6	Using a Large Language Model to Control Speaking Style for Expressive TTS	Atli Thor Sigurgeirsson and Simon King
LBR7	NaijaTTS: A pitch-controllable TTS model for Nigerian Pidgin	Emmett Strickland, Dana Aubakirova, Dorin Doncenco, Diego Torres and Marc Evrard