Doklady BGUIR

Assessing Similarity Between Datasets Using Vector Representations

https://doi.org/10.35596/1729-7648-2025-23-3-70-76

Abstract

The article considers an approach to determining the similarity of datasets used for training algorithms, with datasets of human face images as an example. The approach makes it possible to find similar datasets from different sources, which broadens the set of features and classes available for detection and can significantly affect dataset balance. For each object in a dataset, a vector representation (embedding) was obtained with a pretrained ResNet network, and the embeddings of the two datasets were then compared. In the experiments, one dataset was divided into two parts, which served as a pair of similar datasets, and each part was then compared with a different dataset. A new similarity metric is proposed that has several advantages and makes it possible to find the most similar datasets.
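The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' metric, which the abstract does not specify. It assumes a simple symmetric score: the mean cosine similarity between each embedding and its best match in the other dataset. The function names and the synthetic vectors, which stand in for ResNet embeddings, are hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def directed_score(source, target):
    """Mean, over source embeddings, of the best cosine match found in target."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def dataset_similarity(emb_a, emb_b):
    """Symmetric score; an illustrative assumption, not the paper's metric."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    return 0.5 * (directed_score(a, b) + directed_score(b, a))


def synthetic_embeddings(center, count, rng, noise=0.1):
    """Hypothetical stand-in for ResNet features: noisy points around a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(count)]


rng = random.Random(0)
dim = 16
center_faces = [rng.gauss(0.0, 1.0) for _ in range(dim)]
center_other = [rng.gauss(0.0, 1.0) for _ in range(dim)]

faces = synthetic_embeddings(center_faces, 40, rng)
other = synthetic_embeddings(center_other, 20, rng)

# Mirror the experiment in the abstract: split one dataset into two halves,
# then compare a half with the other half and with a different dataset.
half_1, half_2 = faces[:20], faces[20:]

sim_halves = dataset_similarity(half_1, half_2)
sim_other = dataset_similarity(half_1, other)
print(f"halves of one dataset: {sim_halves:.3f}")
print(f"different datasets:    {sim_other:.3f}")
```

In a real pipeline the synthetic vectors would be replaced by features taken from a pretrained network. The score for the two halves of one dataset should then come out higher than the score against an unrelated dataset.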

About the Authors

A. A. Usatoff
Belarusian State University
Belarus
Master’s, Postgraduate at the Department of Information Management Systems


A. M. Nedzved
Belarusian State University
Belarus
Dr. Sci. (Tech.), Associate Professor, Head of the Department of Information Management Systems


Guo Jiran
Belarusian State University
Belarus
Postgraduate at the Department of Information Management Systems



For citations:


Usatoff A.A., Nedzved A.M., Jiran G. Assessing Similarity Between Datasets Using Vector Representations. Doklady BGUIR. 2025;23(3):70-76. (In Russ.) https://doi.org/10.35596/1729-7648-2025-23-3-70-76


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1729-7648 (Print)
ISSN 2708-0382 (Online)