bsuir

Доклады БГУИР

Doklady BGUIR

ISSN 1729-7648, 2708-0382

БГУИР

10.35596/1729-7648-2025-23-3-70-76

bsuir-4164

Research Article

Articles

Assessing Similarity Between Datasets Using Vector Representations

Usatoff A. A.

Master's degree, Postgraduate at the Department of Information Management Systems

Nedzved A. M.

Dr. Sci. (Tech.), Associate Professor, Head of the Department of Information Management Systems

Jiran Guo

Postgraduate at the Department of Information Management Systems

Belarusian State University

2025

15.07.2025

Vol. 23, No. 3, pp. 70–76

2025

Usatoff A. A., Nedzved A. M., Jiran G.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://doklady.bsuir.by/jour/article/view/4164

The article considers an approach to measuring the similarity of datasets used for training algorithms, with datasets of human faces as an example. The approach makes it possible to find similar datasets from different sources, broadening the range of features and classes that can be detected without seriously harming dataset balance. A vector representation (embedding) is obtained for each object in a dataset, and the embeddings of the two datasets are then compared. Experiments were carried out on datasets of face images, with embeddings obtained from a pretrained ResNet network. In the experiments, one dataset was split into two parts, which served as a pair of similar datasets, and each part was then compared with a different dataset. A new similarity metric is proposed that has a number of advantages and makes it possible to find the most similar datasets.
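The comparison protocol described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the proposed metric, so the score below (a symmetric mean of best-match cosine similarities) is only a stand-in chosen for illustration, not the authors' metric. The embeddings are synthetic vectors that take the place of ResNet features, and all names (`dataset_similarity`, `synthetic_embeddings`, the cluster centers, the noise level) are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dataset_similarity(emb_a, emb_b):
    """Stand-in score (an assumption, not the metric from the article).

    For every embedding in one set, take its highest cosine similarity to
    any embedding in the other set, then average over both directions.
    """
    def one_way(src, dst):
        return sum(max(cosine(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_way(emb_a, emb_b) + one_way(emb_b, emb_a))


def synthetic_embeddings(center, n, noise, rng):
    """Hypothetical placeholder for ResNet features: noisy copies of a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)

# Two well-separated cluster centers stand in for two different data sources.
center_faces = [1.0] * 8 + [0.0] * 8
center_other = [0.0] * 8 + [1.0] * 8

faces = synthetic_embeddings(center_faces, 40, 0.1, rng)
other = synthetic_embeddings(center_other, 20, 0.1, rng)

# Split one dataset into two halves that play the role of "similar" datasets.
rng.shuffle(faces)
half_1, half_2 = faces[:20], faces[20:]

sim_halves = dataset_similarity(half_1, half_2)
sim_h1_other = dataset_similarity(half_1, other)
sim_h2_other = dataset_similarity(half_2, other)
print(sim_halves, sim_h1_other, sim_h2_other)
```

In this toy setting the two halves of the same dataset should score noticeably higher against each other than either half scores against the unrelated set, which is the sanity check the abstract describes. With real data, the synthetic vectors would be replaced by embeddings from a pretrained network, and the stand-in score by whichever metric is under evaluation.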

dataset, vector representation, ResNet, dataset similarity, deep learning

The authors declare that there are no conflicts of interest present.