<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">bsuir</journal-id><journal-title-group><journal-title xml:lang="ru">Доклады БГУИР</journal-title><trans-title-group xml:lang="en"><trans-title>Doklady BGUIR</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1729-7648</issn><issn pub-type="epub">2708-0382</issn><publisher><publisher-name>БГУИР</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.35596/1729-7648-2025-23-3-70-76</article-id><article-id custom-type="elpub" pub-id-type="custom">bsuir-4164</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Оценка сходства между наборами данных с помощью векторных представлений</article-title><trans-title-group xml:lang="en"><trans-title>Assessing Similarity Between Datasets Using Vector Representations</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Усатов</surname><given-names>А. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Usatoff</surname><given-names>A. A.</given-names></name></name-alternatives><bio xml:lang="ru"><sec><title>магистр, асп. каф. 
информационных систем управления</title></sec></bio><bio xml:lang="en"><sec><title>Master’s Degree, Postgraduate at the Department of Information Management Systems</title></sec></bio><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Недзьведь</surname><given-names>А. М.</given-names></name><name name-style="western" xml:lang="en"><surname>Nedzved</surname><given-names>A. M.</given-names></name></name-alternatives><bio xml:lang="ru"><sec><title>д-р техн. наук, доц., зав. каф. информационных систем управления</title></sec></bio><bio xml:lang="en"><sec><title>Dr. Sci. (Tech.), Associate Professor, Head of the Department of Information Management Systems</title></sec></bio><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Цзижань</surname><given-names>Го</given-names></name><name name-style="western" xml:lang="en"><surname>Jiran</surname><given-names>Guo</given-names></name></name-alternatives><bio xml:lang="ru"><sec><title>асп. каф. 
информационных систем управления</title></sec></bio><bio xml:lang="en"><sec><title>Postgraduate at the Department of Information Management Systems</title></sec></bio><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Белорусский государственный университет</institution></aff><aff xml:lang="en"><institution>Belarusian State University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>15</day><month>07</month><year>2025</year></pub-date><volume>23</volume><issue>3</issue><fpage>70</fpage><lpage>76</lpage><permissions><copyright-statement>Copyright &#x00A9; Усатов А.А., Недзьведь А.М., Цзижань Г., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Усатов А.А., Недзьведь А.М., Цзижань Г.</copyright-holder><copyright-holder xml:lang="en">Usatoff A.A., Nedzved A.M., Jiran G.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://doklady.bsuir.by/jour/article/view/4164">https://doklady.bsuir.by/jour/article/view/4164</self-uri><abstract><p>Рассмотрен подход к определению сходства наборов данных (датасетов) для обучения алгоритмов на примере датасетов с лицами людей. Такой подход позволяет находить похожие наборы данных из разных источников, расширяя детектирование признаков и классов и не нанося серьезного вреда балансировке. 
Для каждого объекта датасета получено векторное представление (эмбеддинг), затем выполнено сравнение эмбеддингов в обоих датасетах. Эксперименты проводились на примере датасетов с изображениями лиц людей. Для получения эмбеддингов использовалась предобученная сеть ResNet. В процессе исследований один датасет делился на две части, представляющие собой схожие датасеты, затем каждая из частей сравнивалась с отличающимся набором данных. Предлагается новая метрика сходства, которая обладает рядом преимуществ и позволяет находить наиболее похожие датасеты. </p></abstract><trans-abstract xml:lang="en"><p>The article considers an approach to determining the similarity of datasets used for training algorithms, using datasets of human faces as an example. This approach makes it possible to find similar datasets from different sources, broadening the detection of features and classes without seriously harming dataset balance. A vector representation (embedding) was obtained for each object in a dataset, and the embeddings of the two datasets were then compared. The experiments were conducted on datasets of human face images. A pretrained ResNet network was used to obtain the embeddings. In the experiments, one dataset was split into two parts that served as similar datasets, and each part was then compared with a different dataset. A new similarity metric is proposed that has several advantages and makes it possible to find the most similar datasets.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>набор данных</kwd><kwd>векторное представление</kwd><kwd>ResNet</kwd><kwd>сходство датасетов</kwd><kwd>глубокое обучение</kwd></kwd-group><kwd-group xml:lang="en"><kwd>dataset</kwd><kwd>vector representation</kwd><kwd>ResNet</kwd><kwd>dataset similarity</kwd><kwd>deep learning</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Ивахненко, А. Г. 
Кибернетические предсказывающие устройства / А. Г. Ивахненко, В. Г. Лапа. Киев: Акад. наук Укр. ССР, 1965.</mixed-citation><mixed-citation xml:lang="en">Ivakhnenko A. G., Lapa V. G. (1965) Cybernetic Predictive Devices. Kyiv, Academy of Sciences of the Ukrainian SSR (in Russian).</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Gradient-Based Learning Applied to Document Recognition / Y. Lecun [et al.] // Proceedings of the IEEE. 1998. Vol. 86, Iss. 11. Р. 2278–2324.</mixed-citation><mixed-citation xml:lang="en">Lecun Y., Bottou L., Bengio Y., Haffner P. (1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE. 86 (11), 2278–2324.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Label-Embedding for Image Classification / Z. Akata [et al.] // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015. Vol. 38, No 7. Р. 1425–1438. DOI: 10.1109/TPAMI.2015.2487986.</mixed-citation><mixed-citation xml:lang="en">Akata Z., Perronnin F., Harchaoui Z., Schmid C. (2015) Label-Embedding for Image Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence. 38 (7), 1425–1438. DOI: 10.1109/TPAMI.2015.2487986.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Image Quality Assessment: From Error Visibility to Structural Similarity / Z. Wang [et al.] // IEEE Transactions on Image Processing. 2004. Vol. 13, No 4. Р. 600–612. DOI: 10.1109/TIP.2003.819861.</mixed-citation><mixed-citation xml:lang="en">Wang Z., Bovik A. C., Sheikh H. R., Simoncelli E. P. (2004) Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing. 13 (4), 600–612. 
DOI: 10.1109/TIP.2003.819861.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Rubner, Y. The Earth Mover’s Distance as a Metric for Image Retrieval / Y. Rubner, C. Tomasi, L. J. Guibas // International Journal of Computer Vision. 2000. Vol. 40, No 2. Р. 99–121. DOI: 10.1023/A:1026543900054.</mixed-citation><mixed-citation xml:lang="en">Rubner Y., Tomasi C., Guibas L. J. (2000) The Earth Mover’s Distance as a Metric for Image Retrieval. International Journal of Computer Vision. 40 (2), 99–121. DOI: 10.1023/A:1026543900054.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Lin, J. Divergence Measures Based on the Shannon Entropy / J. Lin // IEEE Transactions on Information Theory. 1991. Vol. 37, Iss. 1. Р. 145–151. DOI: 10.1109/18.61115.</mixed-citation><mixed-citation xml:lang="en">Lin J. (1991) Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory. 37 (1), 145–151. DOI: 10.1109/18.61115.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Swain, M. J. Color Indexing / M. J. Swain, D. H. Ballard // International Journal of Computer Vision. 1991. Vol. 7, No 1. Р. 11–32.</mixed-citation><mixed-citation xml:lang="en">Swain M. J., Ballard D. H. (1991) Color Indexing. International Journal of Computer Vision. 7 (1), 11–32.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Simonyan, K. Very Deep Convolutional Networks for Large-Scale Image Recognition / K. Simonyan, A. Zisserman // arXiv:1409.1556. 2014. Vol. 1.</mixed-citation><mixed-citation xml:lang="en">Simonyan K., Zisserman A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556. 
1.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Self-Similarity Guided Probabilistic Embedding Matching Based on Transformer for Occluded Person Re-Identification / Y. Pang [et al.] // Expert Systems with Applications. 2024. Vol. 237. https://doi.org/10.1016/j.eswa.2023.121504.</mixed-citation><mixed-citation xml:lang="en">Pang Y., Zhang H., Zhu L., Liu D., Liu L. (2024) Self-Similarity Guided Probabilistic Embedding Matching Based on Transformer for Occluded Person Re-Identification. Expert Systems with Applications. 237. https://doi.org/10.1016/j.eswa.2023.121504.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Efficient Estimation of Word Representations in Vector Space / T. Mikolov [et al.] // arXiv:1301.3781. 2013. http://arxiv.org/abs/1301.3781.</mixed-citation><mixed-citation xml:lang="en">Mikolov T., Chen K., Corrado G., Dean J. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. http://arxiv.org/abs/1301.3781.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">The Unreasonable Effectiveness of Deep Features as a Perceptual Metric / R. Zhang [et al.] // arXiv:1801.03924. 2023. https://doi.org/10.48550/arXiv.1801.03924.</mixed-citation><mixed-citation xml:lang="en">Zhang R., Isola P., Efros A. A., Shechtman E., Wang O. (2023) The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv:1801.03924. https://doi.org/10.48550/arXiv.1801.03924.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Deep Residual Learning for Image Recognition / K. He [et al.] // arXiv:1512.03385. 2015. 
https://doi.org/10.48550/arXiv.1512.03385.</mixed-citation><mixed-citation xml:lang="en">He K., Zhang X., Ren S., Sun J. (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385. https://doi.org/10.48550/arXiv.1512.03385.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Learning Transferable Visual Models from Natural Language Supervision / A. Radford [et al.] // arXiv:2103.00020. 2021. https://doi.org/10.48550/arXiv.2103.00020.</mixed-citation><mixed-citation xml:lang="en">Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., et al. (2021) Learning Transferable Visual Models from Natural Language Supervision. arXiv:2103.00020. https://doi.org/10.48550/arXiv.2103.00020.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">ImageNet: A Large-Scale Hierarchical Image Database / Jia Deng [et al.] // 2009 IEEE Conference on Computer Vision and Pattern Recognition. P. 248–255.</mixed-citation><mixed-citation xml:lang="en">Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, et al. (2009) ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Недзьведь, А. М. Анализ изображений для решения задач медицинской диагностики / А. М. Недзьведь, С. В. Абламейко. Минск: Объедин. ин-т проблем информ. Нац. акад. наук Беларуси, 2012.</mixed-citation><mixed-citation xml:lang="en">Nedzved A., Ablameyko S. (2012) Image Analysis for Tasks of Medical Diagnostic. 
Minsk, United Institute of Informatics Problems of the National Academy of Sciences of Belarus (in Russian).</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
