Conference Keynote Speakers
ABSTRACT: Speaker recognition has been driven by the evaluation paradigm since the end of the nineties. The NIST Speaker Recognition Evaluation (SRE) campaigns have helped build a strong scientific community, where performance evaluation is a central motif. These campaigns have concentrated international research efforts on focused topics using common protocols. Impressive results have been achieved in terms of error rate reductions: although protocols have become progressively stricter, typical error rates have improved to around 2%, from between 10% and 20% only 15 years ago. Although it cannot be contested that the NIST SRE is the gold standard for speaker recognition, the evaluation is driven by a unique and global performance measure that is not focused on a particular practical application. The protocol is designed for this general, global view and thus does not emphasize “local” effects such as the individual speaker or the nature of the speech. Several works have shown the limits of the protocol, as well as of the performance criterion, and have challenged the interpretation of the results. The real question concerns the underlying paradigm that is the basis for performance evaluation. In this talk, an alternative will be discussed: the reliability paradigm. The talk will begin with a survey of the state of the art in speaker recognition technologies. This will be followed by a review of performance evolution over the years and then a deeper analysis of the actual significance of the numbers. The specific case of voice comparison in terms of decision and performance evaluation will also be discussed. The presentation will conclude with an explanation of the paradigm shift and an introduction to the reliability paradigm, and specific proposals for speaker recognition and voice comparison will be made.
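The global performance measure discussed above is typically a detection error trade-off statistic such as the equal error rate (EER). As a hedged illustration only, here is a minimal sketch of computing an approximate EER from score lists; all scores below are invented, not from any NIST SRE:

```python
# Minimal sketch: the equal error rate (EER), one global performance
# measure used in speaker recognition evaluation. Scores are toy values.

def eer(target_scores, impostor_scores):
    """Approximate EER: the operating point where the false-rejection
    rate on targets equals the false-acceptance rate on impostors."""
    thresholds = sorted(set(target_scores + impostor_scores))
    best = (1.0, 0.0, 0.0)  # (gap between rates, frr, far)
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(frr - far) < best[0]:
            best = (abs(frr - far), frr, far)
    return (best[1] + best[2]) / 2

targets = [2.1, 1.7, 0.4, 3.0, 1.2]       # hypothetical true-speaker scores
impostors = [-1.5, 0.6, -0.2, -2.0, 0.1]  # hypothetical impostor scores
print(eer(targets, impostors))            # → 0.2
```

In actual NIST SRE campaigns the headline measure is a detection cost function that weights misses and false alarms by application-dependent priors and costs; the EER above is only the point where the two error rates coincide.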
BIO: Jean-Francois Bonastre is a Professor of Computer Science at the LIA, University of Avignon. He has been vice-president of the University of Avignon since December 2008. He holds BSc and MSc degrees in Computer Science from the University of Marseille and completed his PhD thesis, focusing on automatic speaker identification, at Avignon in 1994. In 2000, he obtained an HDR (authorizing PhD supervision) on the same topic. His main research focus is on biometrics, and more specifically voice authentication. More generally, he is active in several areas such as speaker recognition and diarization, forensic voice comparison, language recognition, pathological voice processing and speech-based voice information retrieval. J-F Bonastre was an invited professor at the Panasonic Speech Technology Laboratory (Santa Barbara) for one year (2002-2003). He joined the Institut Universitaire de France (IUF) as a Junior Member in 2006. J-F Bonastre is a former President of ISCA and of its regional branch, AFCP. He is an IEEE Senior Member, a member of the IEEE Biometrics Council and a former member of the IEEE SLTC. He has been a Guest Editor for several international journals, including Speech Communication and Computer Speech and Language. He is also an Associate Editor for the IEEE Transactions on Audio, Speech, and Language Processing.
ABSTRACT: The last 25 years have seen dramatic progress in statistical methods for recognizing speech signals and for translating spoken and written language. This lecture gives an overview of the underlying statistical methods. In particular, the lecture will focus on the remarkable fact that, for these tasks and similar tasks like handwriting recognition, the statistical approach makes use of the same four principles: 1) the Bayes decision rule for minimum error rate; 2) probabilistic models, e.g. hidden Markov models or conditional random fields, for handling strings of observations (like acoustic vectors for speech recognition and written words for language translation); 3) training criteria and algorithms for estimating the free model parameters from large amounts of data; 4) the generation or search process that generates the recognition or translation result. Most of these methods had originally been designed for speech recognition. However, it has turned out that, with suitable modifications, the same concepts carry over to language translation and other tasks in natural language processing. This lecture will summarize the achievements and the open problems in this field.
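Principle 1 above can be illustrated with a toy sketch: the Bayes decision rule picks the class w maximizing p(w) · p(x | w), which minimizes the expected error rate. The two-class "vocabulary", priors, and likelihood functions below are invented for illustration, not from the lecture:

```python
# Toy sketch of the Bayes decision rule for minimum error rate:
# choose the class w maximizing prior(w) * likelihood(x | w).
# Classes, priors, and likelihoods are invented toy values.

def bayes_decide(x, priors, likelihoods):
    """Return the class with maximal posterior (up to the constant p(x))."""
    return max(priors, key=lambda w: priors[w] * likelihoods[w](x))

priors = {"yes": 0.6, "no": 0.4}   # plays the role of a language model
likelihoods = {                    # plays the role of an acoustic model
    "yes": lambda x: 0.9 if x > 0 else 0.1,
    "no":  lambda x: 0.2 if x > 0 else 0.8,
}
print(bayes_decide(0.5, priors, likelihoods))   # → yes
print(bayes_decide(-0.5, priors, likelihoods))  # → no
```

In speech recognition and machine translation the same rule is applied over whole word strings, which is what makes the search process of principle 4 nontrivial.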
BIO: Hermann Ney is a full professor of computer science at RWTH Aachen University, Germany. His main research interests lie in the area of statistical methods for pattern recognition and human language technology and their specific applications to speech recognition, machine translation and handwriting recognition. In particular, he has worked on dynamic programming and discriminative training for speech recognition, on language modelling and on phrase-based approaches to machine translation. His work has resulted in more than 600 conference and journal papers (h-index 73, estimated using Google Scholar). He is a fellow of both IEEE and ISCA. In 2005, he was the recipient of the Technical Achievement Award of the IEEE Signal Processing Society. In 2010, he was awarded a senior DIGITEO chair at LIMSI/CNRS in Paris, France. In 2013, he received the award of honour of the International Association for Machine Translation.
ABSTRACT: Over the past decade, speech recognition technology has become increasingly commonplace in consumer and enterprise applications. As higher expectations and greater demands are being placed on speech recognition as the technology matures, robustness in recognition is becoming increasingly important. This talk will review and discuss several classical and contemporary approaches that render the performance of automatic speech recognition systems and related technology robust to changes and degradations in the acoustical environment within which they operate. The most tractable types of environmental degradation are produced by quasi-stationary additive noise and quasi-stationary linear filtering. These distortions can be largely ameliorated by "classical" techniques such as cepstral high-pass filtering as well as by techniques that develop statistical models of the distortion (such as vector Taylor series expansion). Nevertheless, these types of approaches fail to provide much useful improvement when speech is degraded by transient or non-stationary noise such as background music or speech, or in environments that include nonlinear distortion. We describe and compare the effectiveness in difficult acoustical environments of techniques based on missing-feature compensation, multi-band analysis, combination of complementary streams of information, physiologically-motivated auditory processing, and specialized techniques directed at compensation for nonlinearities, with a focus on how these techniques are applied to the practical problems facing us today.
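As a hedged illustration of the simplest "classical" compensation mentioned above, cepstral mean normalization (a degenerate cepstral high-pass filter) removes quasi-stationary linear filtering: a fixed channel multiplies the spectrum, so it adds a constant offset to every cepstral frame, and subtracting the per-coefficient mean cancels it. The tiny two-coefficient "utterance" and channel values below are invented:

```python
# Sketch of cepstral mean normalization (CMN), the simplest cepstral
# high-pass compensation for a quasi-stationary linear channel.
# Frames and channel offsets are toy values for illustration.

def cmn(frames):
    """Subtract the utterance-level mean from each cepstral coefficient."""
    n = len(frames)
    means = [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]
    return [[c - m for c, m in zip(f, means)] for f in frames]

clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.0]]   # clean cepstral frames
channel = [0.7, -0.3]                           # stationary channel offset
noisy = [[c + o for c, o in zip(f, channel)] for f in clean]

# After CMN, the channel-distorted frames match the normalized clean ones
# up to floating-point error:
same = all(abs(a - b) < 1e-9
           for fa, fb in zip(cmn(noisy), cmn(clean))
           for a, b in zip(fa, fb))
print(same)  # → True
```

As the abstract notes, this kind of technique works only because the distortion is quasi-stationary; transient noise such as background music or speech defeats it, which motivates the missing-feature and multi-stream methods the talk goes on to compare.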
BIO: Richard M. Stern received the S.B. degree from the Massachusetts Institute of Technology in 1970, the M.S. from the University of California, Berkeley, in 1972, and the Ph.D. from MIT in 1977, all in electrical engineering. He has been on the faculty of Carnegie Mellon University since 1977, where he is currently a Professor in the Department of Electrical and Computer Engineering, the Department of Computer Science, and the Language Technologies Institute, and a Lecturer in the School of Music. Much of Dr. Stern's current research is in spoken language systems, where he is particularly concerned with the development of techniques with which automatic speech recognition can be made more robust with respect to changes in environment and acoustical ambience. In addition to his work in speech recognition, Dr. Stern has worked extensively in psychoacoustics, where he is best known for theoretical work in binaural perception. Dr. Stern is a Fellow of the IEEE, the Acoustical Society of America, and the International Speech Communication Association (ISCA). He was the ISCA 2008-2009 Distinguished Lecturer, a recipient of the Allen Newell Award for Research Excellence in 1992, and he served as the General Chair of Interspeech 2006. He is also a member of the Audio Engineering Society.
ABSTRACT: The telecommunications industry and device manufacturers are moving at breakneck speed to realize the “connected world” – seamless access to information and services any way, anywhere, anytime. Speech and language technologies are critical to making access to such huge amounts of information “effortless” – the key to usability. The ecosystem necessary to develop this complex world is coming together rapidly.
At first glance, speech is an attractive complement to the small form factor of many connected devices, and to how they are used – for example, while driving in a connected car, watching TV in a connected home or searching for information and services in a connected world. Virtual assistants have re-energized the services world, as speech and language technologies are now capable of performing at levels that make them usable for complex applications – well beyond the IVR services that drove the speech market for the past 20 years.
The unprecedented bandwidth available to connected devices makes it possible to run speech and language at scale in the cloud, side-stepping device-dependent issues such as software management, limited processing/memory and battery life. Multimodal services can now be device and OS agnostic. Over-the-top intelligent virtual assistants can be developed where the consumer has consistent and effortless access to services anytime, anywhere and on any device they choose.
The demand for speech and language technologies has never been greater. However, the expectation on the performance of these technologies has also never been greater. Users expect that technologies will ‘just work’ everywhere and for every application. As we all know, there are still huge technical challenges for speech and language scientists and engineers to solve to meet this growing market expectation:
- Robustness to new acoustic environments (the never ending challenge)
- High accuracy across any application domain and complexity
- Better use of semantic information and reasoning to drive intelligence
- Increased personalization and unsupervised adaptation
- Scalable architectures that minimize network latency and costs
- “Invisible deployment” – balancing tradeoffs for cloud vs embedded solutions
- “Intuitive” tools enabling non-speech experts to develop new applications
All of these are still very much critical open questions requiring new thinking and new innovation. From the business side, the union of technology and cloud computing is changing the ecosystem of the speech industry. New business models are emerging, new technology providers are appearing, and novel partnerships are being formed. The roles of vendors, operators and developers are being redefined. And, as expected, cloud services with open APIs are unlocking the talents of the best and brightest innovators, enabling them to create new and exciting multimodal mobile services. Driving toward “effortless” access to content and services in a connected world is the exciting challenge facing all of us – the opportunities to participate in this ecosystem are endless.
BIO: Jay Wilpon is Executive Director of Natural Language Processing Research at AT&T Labs. Having begun his career in 1977, Jay is one of the world’s pioneers and a chief evangelist for speech and language technologies and services. Jay has authored over 100 publications and patents. He has been a leading innovator for a number of advanced voice-enabled services throughout his career, including AT&T’s How May I Help You? (HMIHY) – the first nationwide deployment of a true human-like spoken language understanding based service. His work led to the first nationwide deployments of both speech recognition and spoken language understanding technologies. Jay and his team’s current focus includes several of the key challenges that will promote the ubiquitous use of speech and language technologies to enable a wide spectrum of ‘intelligent’ services, including virtual assistants and customer care. In particular, he is focusing on (1) innovations in natural language processing, language translation and spoken dialog systems enabling ‘effortless’ access to communication services, information and transactions; (2) advancing voice biometric technologies for providing increased security to information and personalized services; and (3) innovating new speech architectures, components, standards and tools necessary to enable rapid development and scalable deployment of advanced voice-enabled services, a.k.a. speech mashups. Jay was awarded the distinguished honor of IEEE Fellow for his leadership in the development of automatic speech recognition algorithms. For pioneering leadership in the creation and deployment of speech recognition-based services in the telephone network, Jay was awarded the honor of AT&T Fellow. Altogether, the service innovations that Jay has been associated with have produced revenue and savings of billions of dollars for AT&T and its clients and have enabled AT&T to become a leader in the voice-enabled services marketplace.
Satellite Event Keynote Speaker
ABSTRACT: At the onset of the 21st century, we are entering an era in which the very nature of what it means to be human will be both enriched and challenged, as our species breaks the shackles of its genetic legacy and achieves inconceivable heights of intelligence, material progress, and longevity. The paradigm shift rate is now doubling every decade, so the twenty-first century will see 20,000 years of progress at today’s rate. Computation, communication, biological technologies (for example, DNA sequencing), brain scanning, knowledge of the human brain, and human knowledge in general are all accelerating at an even faster pace, generally doubling price-performance, capacity, and bandwidth every year. Three-dimensional molecular computing will provide the hardware for human-level "strong" AI well before 2030. The more important software insights will be gained in part from the reverse-engineering of the human brain, a process well under way. While the social and philosophical ramifications of these changes will be profound, and the threats they pose considerable, we will ultimately merge with our machines, live indefinitely, and be a billion times more intelligent...all within the next three to four decades.
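The “20,000 years of progress” figure follows from simple discrete arithmetic under one possible accounting (an assumption on our part, not spelled out in the abstract): if the rate of progress doubles each decade, so that decade k of the century runs at 2**k times today’s rate, the century delivers roughly 20,000 years of progress at today’s rate:

```python
# Hedged arithmetic sketch behind the "20,000 years" figure, under one
# discrete accounting (an assumption): decade k of the century proceeds
# at 2**k times today's rate of progress.
decades = range(1, 11)                          # ten decades in a century
years_equivalent = sum(10 * 2 ** k for k in decades)
print(years_equivalent)                         # → 20460, roughly 20,000
```

A continuous-compounding accounting gives a somewhat smaller number; either way the order of magnitude matches the figure quoted in the abstract.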
BIO: Ray Kurzweil is one of the world’s leading inventors, thinkers, and futurists, with a thirty-year track record of accurate predictions. Called "the restless genius" by The Wall Street Journal and "the ultimate thinking machine" by Forbes magazine, Kurzweil was selected as one of the top entrepreneurs by Inc. magazine, which described him as the "rightful heir to Thomas Edison." PBS selected him as one of the "sixteen revolutionaries who made America."
Kurzweil was the principal inventor of the first CCD flat-bed scanner, the first omni-font optical character recognition, the first print-to-speech reading machine for the blind, the first text-to-speech synthesizer, the first music synthesizer capable of recreating the grand piano and other orchestral instruments, and the first commercially marketed large-vocabulary speech recognition.
Among Kurzweil’s many honors, he is the recipient of the National Medal of Technology, was inducted into the National Inventors Hall of Fame, holds twenty honorary Doctorates, and honors from three U.S. presidents.
Ray has written five national best-selling books, including New York Times best sellers The Singularity Is Near (2005) and How To Create A Mind (2012). He is a Director of Engineering at Google heading up a team developing machine intelligence and natural language understanding.