The aim of this research project is to design a system that generates an intelligible audio speech signal using only information derived from a visual speech signal. Such a system could be used to determine what a speaker is saying when only a visual signal is available.
The work is divided into two main areas:
- The development of a speech model that can generate intelligible speech from a highly reduced set of acoustic features.
- The discovery of robust methods of extracting acoustic speech features from only visual speech features.
Fig 1. Block diagram of proposed lips-to-speech system.
The central problem is that a visual speech signal contains no audio information. The fundamental frequency and the voicing classification can only be inferred, and the spectral information available is likely to be limited to what can be determined from the lips. Phase is not a concern, as the speech model ensures that it is kept continuous.
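To make the role of these parameters concrete, the following is a minimal sketch of how speech-like audio might be generated from per-frame fundamental frequency, voicing and a few spectral amplitudes while keeping phase continuous. It assumes a simple harmonic (sum-of-sinusoids) synthesiser with noise for unvoiced frames, which is an illustrative choice and not necessarily the speech model used in this project. The function name `synthesize`, the frame layout and all numeric values are hypothetical.

```python
import math
import random


def synthesize(frames, sample_rate=8000, frame_len=80, seed=0):
    """Illustrative harmonic synthesiser (an assumption, not the project's model).

    frames: list of (f0_hz, voiced, harmonic_amps) tuples, one per frame.
    Returns a list of float samples.
    """
    rng = random.Random(seed)
    phase = 0.0  # running phase of the fundamental, carried across frames
    samples = []
    for f0, voiced, amps in frames:
        for _ in range(frame_len):
            if voiced:
                # Advance the phase sample by sample, so a change in F0 at a
                # frame boundary never causes a jump in the waveform.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                value = sum(
                    a * math.sin(k * phase)
                    for k, a in enumerate(amps, start=1)
                    if k * f0 < sample_rate / 2  # skip harmonics above Nyquist
                )
            else:
                # Unvoiced frames are approximated by low-level white noise.
                value = 0.1 * rng.uniform(-1.0, 1.0)
            samples.append(value)
    return samples


# Hypothetical input: two voiced frames with different F0, then one unvoiced frame.
example_frames = [
    (100.0, True, [0.5, 0.25]),
    (120.0, True, [0.5, 0.25]),
    (0.0, False, []),
]
audio = synthesize(example_frames)
```

Because the phase is accumulated rather than reset at each frame, it never has to be supplied as an input; only the fundamental frequency, the voicing decision and the harmonic amplitudes need to be estimated.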
Initial testing will aim to identify which of the parameters above affect the intelligibility of speech. The results are intended to inform the development of the main system.