<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">bsuir</journal-id><journal-title-group><journal-title xml:lang="ru">Доклады БГУИР</journal-title><trans-title-group xml:lang="en"><trans-title>Doklady BGUIR</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1729-7648</issn><issn pub-type="epub">2708-0382</issn><publisher><publisher-name>БГУИР</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.35596/1729-7648-2026-24-1-75-82</article-id><article-id custom-type="elpub" pub-id-type="custom">bsuir-4301</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Нейронная сеть на основе сверточных, рекуррентных слоев и механизма внимания для визуального распознавания речи</article-title><trans-title-group xml:lang="en"><trans-title>Neural Network Based on Convolutional, Recurrent Layers and an Attention Mechanism for Visual Speech Recognition</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Макар</surname><given-names>Д. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Makar</surname><given-names>D.</given-names></name></name-alternatives><bio xml:lang="ru"><p>асп. каф. 
электронных вычислительных средств</p><p>Минск</p></bio><bio xml:lang="en"><p>Darya Makar - Postgraduate of the Electronic Computing Facilities Department</p><p>Minsk</p></bio><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Вашкевич</surname><given-names>М. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Vashkevich</surname><given-names>M.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Вашкевич Максим Иосифович - д-р техн. наук, проф. каф. электронных вычислительных средств</p><p>220013, Минск, ул. П. Бровки, 6</p><p>Тел.: +375 17 293-84-20</p></bio><bio xml:lang="en"><p>Vashkevich Maxim - Dr. Sci. (Tech.), Professor at the Electronic Computing Facilities Department</p><p>220013, Minsk, P. Brovki St., 6</p><p>Tel.: +375 17 293-84-20</p></bio><email xlink:type="simple">vashkevich@bsuir.by</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Белорусский государственный университет информатики и радиоэлектроники</institution></aff><aff xml:lang="en"><institution>Belarusian State University of Informatics and Radioelectronics</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>04</day><month>03</month><year>2026</year></pub-date><volume>24</volume><issue>1</issue><fpage>75</fpage><lpage>82</lpage><permissions><copyright-statement>Copyright &#x00A9; Макар Д.А., Вашкевич М.И., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Макар Д.А., Вашкевич М.И.</copyright-holder><copyright-holder xml:lang="en">Makar D., Vashkevich M.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа 
распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://doklady.bsuir.by/jour/article/view/4301">https://doklady.bsuir.by/jour/article/view/4301</self-uri><abstract><p>Визуальное распознавание речи представляет собой задачу классификации произносимых слов или букв по видеопотоку, фиксирующему движения губ. В статье представлены синтез и исследование нейросетевой архитектуры для визуального распознавания речи на основе комбинации сверточных и рекуррентных нейронных сетей с механизмом внимания. Обучение и оценка модели проводились на базе данных AVLetters2 в наиболее сложном дикторонезависимом режиме. Архитектура модели включает кодировщик на основе сверточных слоев для извлечения пространственных признаков, рекуррентные слои на основе блоков GRU для моделирования временных зависимостей и механизм внимания для выделения информативных фрагментов речевой последовательности. Для оценки точности модели проведена пятикратная перекрестная проверка. Подбор гиперпараметров модели осуществлялся на основе байесовской оптимизации, позволившей определить оптимальную конфигурацию параметров модели и процесса обучения. В результате проведенных экспериментов достигнута средняя точность распознавания 14,3 %. Анализ результатов выявил значительную вариативность качества распознавания в зависимости от характеристик дикторов (точность составила от 3,9 до 31,9 %), что указывает на необходимость дальнейшего повышения инвариантности модели к междикторским различиям.</p></abstract><trans-abstract xml:lang="en"><p>Visual speech recognition is the task of classifying spoken words or letters from a video stream capturing lip movements. 
This paper presents the synthesis and study of a neural network architecture for visual speech recognition based on a combination of convolutional and recurrent neural networks with an attention mechanism. The model was trained and evaluated on the AVLetters2 dataset in the most challenging speaker-independent mode. The model architecture includes an encoder based on convolutional layers for extracting spatial features, recurrent layers based on GRU units for modeling temporal dependencies, and an attention mechanism for highlighting informative fragments of the speech sequence. To assess the accuracy of the model, five-fold cross-validation was performed. Model hyperparameters were selected using Bayesian optimization, which allowed us to determine the optimal configuration of the model parameters and the training process. As a result of the experiments, an average recognition accuracy of 14.3 % was achieved. Analysis of the results revealed significant variability in recognition quality depending on the characteristics of the speakers (accuracy ranged from 3.9 to 31.9 %), which indicates the need to further improve the invariance of the model to inter-speaker differences.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>визуальное распознавание речи</kwd><kwd>AVLetters2</kwd><kwd>сверточная нейронная сеть</kwd><kwd>рекуррентная нейронная сеть</kwd><kwd>механизм внимания</kwd></kwd-group><kwd-group xml:lang="en"><kwd>visual speech recognition</kwd><kwd>AVLetters2</kwd><kwd>convolutional neural network</kwd><kwd>recurrent neural network</kwd><kwd>attention mechanism</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">The Challenge of Multispeaker Lip-Reading / S. Cox [et al.] // International Conference on Auditory-Visual Speech Processing. 2008. P. 179–184.</mixed-citation><mixed-citation xml:lang="en">Cox S., Harvey R., Lan Y., Newman J. 
L., Theobald B.-J. (2008) The Challenge of Multispeaker Lip-Reading. International Conference on Auditory-Visual Speech Processing. 179–184.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Extraction of Visual Features for Lipreading / I. Matthews [et al.] // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002. Vol. 24, No 2. P. 198–213.</mixed-citation><mixed-citation xml:lang="en">Matthews I., Cootes T. F., Bangham J. A., Cox S., Harvey R. (2002) Extraction of Visual Features for Lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence. 24 (2). 198–213.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Николенко, С. Глубокое обучение. Погружение в мир нейронных сетей / С. Николенко, A. Кадурин, E. Архангельская. СПб.: Питер, 2020.</mixed-citation><mixed-citation xml:lang="en">Nikolenko S., Kadurin A., Arkhangelskaya E. (2020) Deep Learning: A Dive into the World of Neural Networks. St. Petersburg, Piter Publ. (in Russian).</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Lip Reading Sentences in the Wild / S. J. Chung [et al.] // Conference on Computer Vision and Pattern Recognition. 2017. https://doi.org/10.48550/arXiv.1611.05358.</mixed-citation><mixed-citation xml:lang="en">Chung S. J., Senior A., Vinyals O., Zisserman A. (2017) Lip Reading Sentences in the Wild. Computer Vision and Pattern Recognition. https://doi.org/10.48550/arXiv.1611.05358.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Cheng, J. Long Short-Term Memory-Networks for Machine Reading / J. Cheng, L. Dong, M. Lapata // EMNLP 2016 Conference. 
https://doi.org/10.48550/arXiv.1601.06733.</mixed-citation><mixed-citation xml:lang="en">Cheng J., Dong L., Lapata M. (2016) Long Short-Term Memory-Networks for Machine Reading. EMNLP 2016 Conference. https://doi.org/10.48550/arXiv.1601.06733.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Pei, Y. Unsupervised Random Forest Manifold Alignment for Lipreading / Y. Pei, T.-K. Kim, H. Zha // IEEE International Conference on Computer Vision. 2013. P. 129–136.</mixed-citation><mixed-citation xml:lang="en">Pei Y., Kim T.-K., Zha H. (2013) Unsupervised Random Forest Manifold Alignment for Lipreading. IEEE International Conference on Computer Vision. 129–136.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">End-to-End Visual Speech Recognition for Small-Scale Datasets / S. Petridis [et al.] // Pattern Recognition Letters. 2020. Vol. 131. P. 421–427. https://doi.org/10.48550/arXiv.1904.01954.</mixed-citation><mixed-citation xml:lang="en">Petridis S., Wang Y., Ma P., Li Z., Pantic M. (2020) End-to-End Visual Speech Recognition for Small-Scale Datasets. Pattern Recognition Letters. 131, 421–427. https://doi.org/10.48550/arXiv.1904.01954.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
