Assessing Similarity Between Datasets Using Vector Representations
https://doi.org/10.35596/1729-7648-2025-23-3-70-76
Abstract
The article considers an approach to determining the similarity of datasets used to train algorithms, with datasets of human face images as an example. The approach makes it possible to find similar datasets from different sources, which broadens the set of features and classes available for detection and can significantly affect dataset balance. A vector representation (embedding) was obtained for each object in a dataset, and the embeddings of the two datasets were then compared. A pretrained ResNet network was used to obtain the embeddings. In the experiments, one dataset was divided into two parts, which served as a pair of similar datasets, and each part was then compared with a different dataset. A new similarity metric is proposed that has several advantages and makes it possible to find the most similar datasets.
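The comparison step can be illustrated with a minimal sketch. The abstract does not define the proposed metric, so the score below (a symmetric average of best-match cosine similarities) is only an assumed stand-in, not the authors' metric. The function names and the synthetic `toy_embeddings` helper are hypothetical; in practice the vectors would come from a pretrained network such as ResNet.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directed_similarity(source, target):
    """Average, over source embeddings, of the best cosine match in target."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def dataset_similarity(emb_a, emb_b):
    """Symmetric score (assumed stand-in metric); higher means more similar."""
    forward = directed_similarity(emb_a, emb_b)
    backward = directed_similarity(emb_b, emb_a)
    return 0.5 * (forward + backward)


def toy_embeddings(center, n, rng, noise=0.1):
    """Hypothetical stand-in for network outputs: noisy points around a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)
faces = toy_embeddings([1.0, 0.0, 0.0, 0.0], 40, rng)
other = toy_embeddings([0.0, 0.0, 1.0, 0.0], 20, rng)

# Mirror the experiment: split one dataset into two parts, then compare
# one part with the other part and with a different dataset.
half_a, half_b = faces[:20], faces[20:]
score_same = dataset_similarity(half_a, half_b)
score_diff = dataset_similarity(half_a, other)
print(f"two parts of one dataset: {score_same:.3f}")
print(f"one part vs. a different dataset: {score_diff:.3f}")
```

Under these assumptions, the two parts of one dataset should score higher than a part compared with an unrelated dataset, which is the behaviour a dataset similarity measure is expected to show.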
About the Authors
A. A. Usatoff
Belarus
Master’s degree, Postgraduate at the Department of Information Management Systems
A. M. Nedzved
Belarus
Dr. Sci. (Tech.), Associate Professor, Head of the Department of Information Management Systems
Guo Jiran
Belarus
Postgraduate at the Department of Information Management Systems
For citations:
Usatoff A.A., Nedzved A.M., Jiran G. Assessing Similarity Between Datasets Using Vector Representations. Doklady BGUIR. 2025;23(3):70-76. (In Russ.) https://doi.org/10.35596/1729-7648-2025-23-3-70-76