Neural Network Based on Convolutional, Recurrent Layers and an Attention Mechanism for Visual Speech Recognition

D. Makar; M. Vashkevich

doi:10.35596/1729-7648-2026-24-1-75-82

Abstract

Visual speech recognition is the task of classifying spoken words or letters from a video stream that captures lip movements. This paper presents the design and study of a neural network architecture for visual speech recognition that combines convolutional and recurrent neural networks with an attention mechanism. The model was trained and evaluated on the AVLetters2 dataset in the most challenging, speaker-independent mode. The architecture includes an encoder based on convolutional layers for extracting spatial features, recurrent layers based on GRU units for modeling temporal dependencies, and an attention mechanism for highlighting informative fragments of the speech sequence. Model accuracy was assessed with five-fold cross-validation. Hyperparameters were selected using Bayesian optimization, which made it possible to determine the optimal configuration of the model parameters and the training process. The experiments yielded an average recognition accuracy of 14.3%. Analysis of the results revealed significant variability in recognition quality depending on the characteristics of the speakers (accuracy ranged from 3.9% to 31.9%), which indicates the need to further improve the invariance of the model to inter-speaker differences.
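
To illustrate how an attention mechanism can highlight informative fragments of a sequence, the following is a minimal sketch of attention pooling over per-frame feature vectors, such as those a recurrent layer might produce. The abstract does not state which form of attention the authors use, so the dot-product scoring, the function name `attention_pool`, the `score_weights` vector, and the toy numbers are all assumptions made for illustration only.

```python
import math


def attention_pool(frames, score_weights):
    """Softmax-weighted sum of per-frame feature vectors.

    frames: list of T feature vectors, each of length D
            (e.g. outputs of a recurrent layer, one per video frame).
    score_weights: hypothetical learned vector of length D used to
            score each frame; not taken from the paper.
    Returns (context, weights).
    """
    # One scalar relevance score per frame: dot product with the scoring vector.
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]

    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: frames averaged with their attention weights.
    dim = len(frames[0])
    context = [
        sum(a * frame[d] for a, frame in zip(weights, frames)) for d in range(dim)
    ]
    return context, weights


# Toy input: three frames with two features each (made-up numbers).
frames = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
context, weights = attention_pool(frames, score_weights=[1.0, 0.0])
```

The weights sum to one, and the frame with the highest score (the second one in this toy input) contributes most to the context vector, which a classifier could then map to a letter label.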

Keywords

visual speech recognition, AVLetters2, convolutional neural network, recurrent neural network, attention mechanism

About the Authors

D. Makar

Belarusian State University of Informatics and Radioelectronics
Belarus

Darya Makar - Postgraduate Student at the Electronic Computing Facilities Department

Minsk

M. Vashkevich

Belarusian State University of Informatics and Radioelectronics
Belarus

Maxim Vashkevich - Dr. Sci. (Tech.), Professor at the Electronic Computing Facilities Department

220013, Minsk, P. Brovki St., 6

Tel.: +375 17 293-84-20

For citations:

Makar D., Vashkevich M. Neural Network Based on Convolutional, Recurrent Layers and an Attention Mechanism for Visual Speech Recognition. Doklady BGUIR. 2026;24(1):75-82. (In Russ.) https://doi.org/10.35596/1729-7648-2026-24-1-75-82

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1729-7648 (Print)
ISSN 2708-0382 (Online)
