Assessing Similarity Between Datasets Using Vector Representations

A. A. Usatoff; A. M. Nedzved; Guo Jiran

doi:10.35596/1729-7648-2025-23-3-70-76



Abstract

The article considers an approach to determining the similarity of datasets used for training algorithms, with datasets of human face images as an example. The approach makes it possible to find similar datasets from different sources, which broadens the range of detectable features and classes and significantly affects dataset balance. A vector representation (embedding) was obtained for each object in a dataset using a pretrained ResNet network, and the embeddings of the two datasets were then compared. In the experiments, one dataset was divided into two parts, which served as a pair of similar datasets, and each part was then compared with a different dataset. A new similarity metric is proposed that has several advantages and makes it possible to find the most similar datasets.
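
The comparison procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not define the proposed metric, so the score below (cosine similarity between the mean embeddings of two datasets) is only an assumed stand-in and not the authors' method. The function names, the embedding dimension, and the synthetic vectors that take the place of ResNet embeddings are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mean_embedding(dataset):
    """Component-wise mean of a list of embeddings."""
    n = len(dataset)
    return [sum(col) / n for col in zip(*dataset)]


def dataset_similarity(a, b):
    """Assumed stand-in score: cosine similarity of the two mean embeddings."""
    return cosine(mean_embedding(a), mean_embedding(b))


def synthetic_embeddings(center, count, noise, rng):
    """Placeholder for real ResNet embeddings: noisy points around a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(count)]


rng = random.Random(0)
dim = 16  # hypothetical size; real ResNet embeddings are much larger
center_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
center_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

dataset_a = synthetic_embeddings(center_a, 200, 0.3, rng)
dataset_b = synthetic_embeddings(center_b, 200, 0.3, rng)

# Split one dataset into two halves, which should score as similar.
half_1, half_2 = dataset_a[:100], dataset_a[100:]

within = dataset_similarity(half_1, half_2)
across_1 = dataset_similarity(half_1, dataset_b)
across_2 = dataset_similarity(half_2, dataset_b)
print(within, across_1, across_2)
```

In this sketch the two halves of one dataset are expected to score higher against each other than either half scores against the other dataset. A mean-embedding score ignores the shape of each distribution, so it is a rough baseline rather than a substitute for the metric proposed in the article.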

Keywords

dataset, vector representation, ResNet, dataset similarity, deep learning.

About the Authors

A. A. Usatoff

Belarusian State University
Belarus
Master’s Degree, Postgraduate at the Department of Information Management Systems

A. M. Nedzved

Belarusian State University
Belarus
Dr. Sci. (Tech.), Associate Professor, Head of the Department of Information Management Systems

Guo Jiran

Belarusian State University
Belarus
Postgraduate at the Department of Information Management Systems



For citations:

Usatoff A.A., Nedzved A.M., Jiran G. Assessing Similarity Between Datasets Using Vector Representations. Doklady BGUIR. 2025;23(3):70-76. (In Russ.) https://doi.org/10.35596/1729-7648-2025-23-3-70-76

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1729-7648 (Print)
ISSN 2708-0382 (Online)
