Timetable
CEST | Saturday 26/8 | Sunday 27/8 | Monday 28/8 |
8:00 | Registration | ||
8:30 | Opening | ||
9:00 | Keynote: Rob Clark | Keynote: Alejandrina Cristia | Oral session 5 |
10:15 | Coffee Break | Coffee Break | Coffee Break |
10:30 | Oral session 1 | Oral session 3 | Keynote: Chloé Clavel |
12:00 | Lunch Break | Lunch Break | Lunch Break |
13:30 | Poster session for regular papers | Oral session 4 | Poster session for late breaking reports |
15:00 | Coffee Break | Coffee Break | |
15:15 | Oral session 2 | Roundtable on Ethics & Generative AI with Patrick Kuban, Jeannette Gorzala & Ambre Davant | Oral session 6 |
16:45 | General assembly SynSIG | ||
18:00 – 20:30 | Welcome reception (wine & cheese at the venue) | ||
19:30 – 23:30 | Social event (buffet & musical concert at Fort de la Bastille) |
Oral 1: Text encoding for TTS
O1 | Advocating for text input in multi-speaker text-to-speech systems | Gérard Bailly, Martin Lenglet, Olivier Perrotin and Esther Klabbers |
O2 | Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations | Jason Fong, Hao Tang and Simon King |
O3 | A Comparative Analysis of Pretrained Language Models for Text-to-Speech | Marcel Granero Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet and Thomas Drugman |
O4 | Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection | Phat Do, Matt Coler, Jelske Dijkstra and Esther Klabbers |
Oral 2: Evaluation
O5 | Importance of Human Factors in Text-To-Speech Evaluations | Lev Finkelstein, Joshua Camp and Rob Clark |
O6 | Re-examining the quality dimensions of synthetic speech | Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach and Petra Wagner |
O7 | Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation | Ambika Kirkland, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely and Joakim Gustafson |
O8 | MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module | Ondřej Plátek and Ondřej Dušek |
Oral 3: Beyond text-to-speech
O9 | Cross-lingual transfer using phonological features for resource-scarce text-to-speech | Johannes Abraham Louw |
O10 | Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion | Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi and Hiroshi Saruwatari |
O11 | Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS | Harm Lameris, Ambika Kirkland, Joakim Gustafson and Eva Szekely |
O12 | Synthesising turn-taking cues using natural conversational data | Johannah O’Mahony, Catherine Lai and Simon King |
Oral 4: Voice conversion
O13 | StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings | Arnab Das, Suhita Ghosh, Tim Polzehl, Ingo Siegert and Sebastian Stober |
O14 | PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder | Kou Tanaka, Hirokazu Kameoka and Takuhiro Kaneko |
O15 | Federated Learning for Human-in-the-Loop Many-to-Many Voice Conversion | Ryunosuke Hirai, Yuki Saito and Hiroshi Saruwatari |
O16 | HiFi-VC: High Quality ASR-based Voice Conversion | Anton Kashkin, Ivan Karpukhin and Svyatoslav Shishkin |
Oral 5: Expressivity, emotion & styles
O17 | EmoSpeech: guiding FastSpeech2 towards Emotional Text to Speech | Daria Diatlova and Vitalii Shutov |
O18 | Controllable Emphasis with zero data for text-to-speech | Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman and Elena Sokolova |
O19 | Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control | Martin Lenglet, Olivier Perrotin and Gérard Bailly |
O20 | Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody | Sofoklis Kakouros, Juraj Šimko, Martti Vainio and Antti Suni |
Oral 6: Long form, multimodal & multi-speaker TTS
O21 | An analysis on the effects of speaker embedding choice in non auto-regressive TTS | Adriana Stan and Johannah O’Mahony |
O22 | Audiobook synthesis with long-form neural text-to-speech | Weicheng Zhang, Cheng-Chieh Yeh, Will Beckman, Tuomo Raitio, Ramya Rasipuram, Ladan Golipour and David Winarsky |
O23 | Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling | Tuomo Raitio, Javier Latorre, Andrea Davis, Tuuli Morrill and Ladan Golipour |
O24 | Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis | Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Eva Szekely and Gustav Eje Henter |
Poster session for regular papers
P1 | Diffusion Transformer for Adaptive Text-to-Speech | Haolin Chen and Philip N. Garner |
P2 | On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis | Siyang Wang, Gustav Eje Henter, Joakim Gustafson and Eva Szekely |
P3 | Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus | David Guennec, Lily Wadoux, Aghilas Sini, Nelly Barbot and Damien Lolive |
P4 | Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping | Ravi Shankar and Archana Venkataraman |
P5 | Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data | Jarod Duret, Yannick Estève and Titouan Parcollet |
P6 | Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests | Kishor Kayyar, Christian Dittmar, Nicola Pia and Emanuel Habets |
P7 | Better Replacement for TTS Naturalness Evaluation | Sajad Shirali-Shahreza and Gerald Penn |
P8 | The Impact of Pause-Internal Phonetic Particles on Recall in Synthesized Lectures | Mikey Elmers and Eva Szekely |
P9 | SPTK4: An Open-Source Software Toolkit for Speech Signal Processing | Takenori Yoshimura, Takato Fujimoto, Keiichiro Oura and Keiichi Tokuda |
P10 | FiPPiE: A Computationally Efficient Differentiable method for Estimating Fundamental Frequency From Spectrograms | Lev Finkelstein, Chun-an Chan, Vincent Wan, Heiga Zen and Rob Clark |
P11 | Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | Biel Tura Vecino, Adam Gabrys, Daniel Matwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu and Jaime Lorenzo-Trueba |
P12 | Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis | Ibrahim Ibrahimov, Gabor Gosztolya and Tamas Gabor Csapo |
Poster session for late breaking reports (LBR)
LBR1 | Universal Approach to Multilingual Multispeaker Child Speech Synthesis | Shaimaa Alwaisi, Mohammed Salah Al-Radhi and Géza Németh |
LBR2 | Towards Speaker-Independent Voice Conversion for Improving Dysarthric Speech Intelligibility | Seraphina Fong, Marco Matassoni, Gianluca Esposito and Alessio Brutti |
LBR3 | Exploring the multidimensional representation of individual speech acoustic parameters extracted by deep unsupervised models | Maxime Jacquelin, Maeva Garnier, Laurent Girin, Rémy Vincent and Olivier Perrotin |
LBR4 | SarcasticSpeech: Speech Synthesis for Sarcasm in Low-Resource Scenarios | Zhu Li, Xiyuan Gao, Shekhar Nayak and Matt Coler |
LBR5 | Recovering Discrete Prosody Inputs via Invert-Classify | Nicholas Sanders and Korin Richmond |
LBR6 | Using a Large Language Model to Control Speaking Style for Expressive TTS | Atli Thor Sigurgeirsson and Simon King |
LBR7 | NaijaTTS: A pitch-controllable TTS model for Nigerian Pidgin | Emmett Strickland, Dana Aubakirova, Dorin Doncenco, Diego Torres and Marc Evrard |