Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream, as performance is still clearly diminished in noisy conditions for large-vocabulary systems. We therefore propose a new fusion architecture: the decision fusion net (DFN).
IEEE/ACM Trans Audio Speech Lang Process
June 2022
Being able to estimate speech intelligibility without the need for listening tests would confer great benefits for a wide range of speech processing applications. Many attempts have therefore been made to introduce an objective, and ideally reference-free, measure for this purpose. Most works analyze speech intelligibility prediction (SIP) methods from a macroscopic point of view, averaging over longer time spans.