Invited Talks

Simultaneous Translation: Recent Advances and Remaining Challenges

Liang Huang

Principal Scientist and Head
Institute of Deep Learning USA (IDL-US)
Baidu Research, Sunnyvale, CA, USA

Assistant Professor (on leave)
School of EECS
Oregon State University, Corvallis, OR, USA


Simultaneous interpretation (i.e., translating concurrently with the source-language speech) is widely used in many scenarios, including multilateral organizations (UN/EU), international summits (APEC/G-20), legal proceedings, and press conferences. However, it is well known to be one of the most challenging tasks for humans due to the simultaneous perception and production in two languages. As a result, there are only a few thousand professional simultaneous interpreters worldwide, and each of them can only sustain for 15-30 minutes per turn. On the other hand, simultaneous translation (either speech-to-text or speech-to-speech) is also notoriously difficult for machines and has remained one of the holy grails of AI. A key challenge here is the word-order difference between the source and target languages. For example, when simultaneously translating from German (an SOV language) into English (an SVO language), one often has to wait for the sentence-final German verb. Therefore, most existing "real-time" translation systems resort to conventional full-sentence translation, incurring an undesirable latency of at least one sentence and leaving the audience largely out of sync with the speaker. There have been efforts towards genuine simultaneous translation, but with limited success.

Recently, at Baidu Research, we discovered a much simpler and surprisingly effective approach to simultaneous (speech-to-text) translation by designing a "prefix-to-prefix" framework tailored to simultaneity requirements. This is in contrast with the "sequence-to-sequence" framework, which assumes the availability of the full input sentence. Our approach results in the first simultaneous translation system that achieves reasonable translation quality with controllable latency. Our technique has been successfully deployed to simultaneously translate Chinese speeches into English subtitles at the 2018 Baidu World Conference, and has been demoed live at NeurIPS 2018 Expo Day.
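To make the prefix-to-prefix idea concrete, its simplest instantiation is a fixed "wait-k" policy: read the first k source tokens, then alternate between emitting one target token and reading one more source token. The sketch below is only an illustration of that scheduling logic; the `toy_predict` dictionary-lookup "model" and all names here are hypothetical stand-ins for a real neural translation model, not the deployed system.

```python
def wait_k_decode(source_tokens, k, predict_next):
    """Prefix-to-prefix decoding with a fixed wait-k policy:
    after the first k source tokens, emit one target token
    per newly read source token, conditioning only on the
    source prefix seen so far."""
    target = []
    for t in range(len(source_tokens)):
        if t + 1 < k:
            continue  # still waiting: fewer than k source tokens read
        prefix = source_tokens[:t + 1]  # source prefix available so far
        target.append(predict_next(prefix, target))
    # Source exhausted: finish the tail of the translation.
    while (not target or target[-1] != "</s>") \
            and len(target) < 2 * len(source_tokens) + 1:
        target.append(predict_next(source_tokens, target))
    return target

# Toy "model" for illustration: word-by-word dictionary lookup.
lexicon = {"ich": "I", "bin": "am", "muede": "tired"}

def toy_predict(src_prefix, tgt_prefix):
    i = len(tgt_prefix)
    if i < len(src_prefix):
        return lexicon.get(src_prefix[i], "</s>")
    return "</s>"

print(wait_k_decode(["ich", "bin", "muede"], k=2, predict_next=toy_predict))
# -> ['I', 'am', 'tired', '</s>']
```

Note how the decoder commits to target words before the full source sentence is available, which is exactly the property a sequence-to-sequence model lacks; the latency is controlled directly by the choice of k.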

Inspired by the success of this very simple approach, we have extended it to produce more flexible translation strategies. Our work has also generated renewed interest in this long-standing problem in the CL community; for instance, two recent papers from Google proposed interesting improvements based on our ideas. Time permitting, I will also discuss our efforts towards the ultimate goal of simultaneous speech-to-speech translation, and conclude with a list of remaining challenges.

See demos, media coverage, and more info at:


Liang Huang is Principal Scientist and Head of Institute of Deep Learning USA (IDL-US) at Baidu Research and Assistant Professor (on leave) at Oregon State University.

He received his PhD from the University of Pennsylvania in 2008 and his BS from Shanghai Jiao Tong University in 2003. He was previously a research scientist at Google, a research assistant professor at USC/ISI, an assistant professor at CUNY, and a part-time research scientist at IBM. His research is in the theoretical aspects of computational linguistics.

Many of his efficient algorithms in parsing, translation, and structured prediction have become standards in the field, for which he received a Best Paper Award at ACL 2008, a Best Paper Honorable Mention at EMNLP 2016, and several best paper nominations (ACL 2007, EMNLP 2008, ACL 2010, and SIGMOD 2018). He is also a computational biologist, adapting his parsing algorithms to RNA and protein folding. He is an award-winning teacher and a best-selling author. His work has garnered widespread media attention, including Fortune, CNBC, IEEE Spectrum, and MIT Technology Review.

Loquentes Machinae: Technology, Applications, and Ethics of Conversational Systems

Pascale Fung


From HAL in “2001: A Space Odyssey” to Samantha in “Her”, conversational systems have always captured the public’s imagination as the ultimate intelligent machine. The famous Turing Test was designed to determine whether a machine “thinks” like a human or not, based on natural conversation between a human and a machine. With the advent of smart devices, conversational systems are suddenly everywhere, talking and responding to us from our phones, speakers, cars, and call centers. Meanwhile, the public is also becoming increasingly concerned about the privacy and security issues of these systems.

In the decades since the first DARPA Communicator project, conversational systems have come in many different forms. Whereas research systems today are predominantly based on deep learning approaches, most of the commercial systems from the US and Asia still use template-based and retrieval-based approaches.

Recent advances in such systems include endowing them with the ability to (1) learn to memorize; (2) learn to personalize; and (3) learn to empathize. In all aspects of R&D in this area, we encounter the challenge of a lack of well-balanced and well-labeled data. Hence, multi-task and meta-learning have been proposed as possible solutions.

In this talk, I will give an overview of some of the technical challenges, approaches and applications of conversational systems, and the debates on ethical issues surrounding them. I will also highlight some of the cultural differences in this area and discuss how we can collaborate internationally to build conversational systems that are secure, safe, and fair for all.


Pascale Fung is a Professor in the Department of Electronic & Computer Engineering and the Department of Computer Science & Engineering at the Hong Kong University of Science & Technology (HKUST).

She is the Director of the multidisciplinary Centre for AI Research (CAiRE) at HKUST, which promotes R&D in beneficial and human-centered AI. She is an elected Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and an elected Fellow of the International Speech Communication Association (ISCA) for her contributions to the interdisciplinary area of spoken-language human-machine interactions.

She co-founded the Human Language Technology Center (HLTC) and is an affiliated faculty member with the Robotics Institute and the Big Data Institute, both at HKUST. She is a past president and current board member of ACL SIGDAT, and a past technical program chair of the ACL and EMNLP conferences. She was an Editor for Computer Speech and Language, and an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing and the Transactions of the Association for Computational Linguistics. She was Technical Program Chair for IEEE ICASSP 2018 and will serve in that role again in 2020 and 2024.

Fung's work has always focused on building intelligent systems that can understand and empathize with humans. Her specific areas of research are statistical modelling and deep learning for natural language processing, spoken language systems, and emotion and sentiment recognition. She has applied many of her research group's results in fields including robotics, IoT, and financial analytics. Her efforts led to the launch of the world's first Chinese natural language search engine in 2001, the first Chinese virtual assistant for smartphones in 2010, and the first emotionally intelligent speaker in 2017.

Pascale Fung is a faculty expert for the World Economic Forum and a member of the Partnership on AI for the Benefit of Humanity and Society. She has been invited as an AI expert to various government initiatives in China, Japan, the UAE, India, the European Union, and the United Nations. She has spoken extensively on using AI for public good and on the need for international collaboration in this area.