Doklady BGUIR

Embedding With Preservation of Semantics of the Original Data

https://doi.org/10.35596/1729-7648-2022-20-2-46-52

Abstract

Data describing real-world objects is often represented as sparse vectors with a large number of features. Working with such vectors can be computationally inefficient and often leads to overfitting, so dimensionality reduction algorithms are used, autoencoders among them. In this article, we propose a new approach to evaluating the properties of the resulting lower-dimensional vectors, as well as a loss function based on this approach. The idea of the proposed loss function is to measure how well the semantic structure of the data is preserved in the embedding space and to add that measure to the training loss, so that relations between objects carry over to the embedding space and the embeddings retain more useful information about the objects. The results show that combining the mean squared error loss with the proposed loss improves the quality of the embeddings.
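The abstract does not give the formula of the proposed loss, so the following is only a minimal sketch of the general idea of adding a structure-preservation term to the reconstruction error. It assumes, purely for illustration, that "semantic structure" is approximated by scale-normalised pairwise Euclidean distances; the names `structure_loss`, `combined_loss`, and the weight `alpha` are hypothetical and are not taken from the article.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance matrix between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def structure_loss(inputs, embeddings):
    """Assumed stand-in for the semantic term (not the article's formula):
    mean squared gap between scale-normalised pairwise distances in the
    original space and in the embedding space."""
    d_in = pairwise_distances(inputs)
    d_emb = pairwise_distances(embeddings)
    # Divide by the mean distance so that spaces of different scale
    # and dimension can be compared.
    d_in = d_in / (d_in.mean() + 1e-12)
    d_emb = d_emb / (d_emb.mean() + 1e-12)
    return float(((d_in - d_emb) ** 2).mean())


def combined_loss(inputs, reconstructions, embeddings, alpha=0.5):
    """Mean squared reconstruction error plus the weighted structure term.
    alpha is a hypothetical weighting hyperparameter."""
    mse = float(((inputs - reconstructions) ** 2).mean())
    return mse + alpha * structure_loss(inputs, embeddings)


# Toy data: three objects with two features each.
x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
good_embedding = 2.0 * x             # same relative geometry as x
collapsed_embedding = np.zeros((3, 1))  # all objects mapped to one point

print(structure_loss(x, good_embedding))
print(structure_loss(x, collapsed_embedding))
```

In this sketch an embedding that keeps the relative distances between objects incurs almost no structure penalty, while an embedding that collapses all objects to one point is penalised even if the reconstruction error happened to be low.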

About the Authors

M. E. Vatkin
“Sber Bank”
Belarus

Vatkin Maksim Evgenyevich - Cand. of Sci., Chief Data Scientist

220005, Minsk, Mulyavina blv., 6

tel. +375-29-278-13-78



D. A. Vorobey
“Sber Bank”
Belarus

Data Scientist

220005, Minsk, Mulyavina blv., 6

tel. +375-29-278-13-78



M. V. Yakovlev
“Sber Bank”
Belarus

Data Scientist

220005, Minsk, Mulyavina blv., 6

tel. +375-29-278-13-78



M. G. Krivova
“Sber Bank”
Belarus

Data Scientist

220005, Minsk, Mulyavina blv., 6

tel. +375-29-278-13-78



References

1. Gupta P., Banchs R.E., Rosso P. Squeezing bottlenecks: exploring the limits of autoencoder semantic representation capabilities. Neurocomputing. 2016;175:1001–1008.

2. Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionality. NIPS. 2013:3111–3119.

3. Bourlard H., Kamp Y. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 1988;59(4):291–294. DOI: 10.1007/bf00332918.

4. Al-Shabi M.A. Credit Card Fraud Detection Using Autoencoder Model in Unbalanced Datasets. JAMCS. 2019;33(5):1–16.

5. Saito T., Rehmsmeier M. The Precision-Recall Plot is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS One. 2015;10(3).

6. Husejinović A. Credit card fraud detection using naive Bayesian and C4.5 decision tree classifiers. Periodicals of Engineering and Natural Sciences. 2020;8(1):1–5.


For citations:


Vatkin M.E., Vorobey D.A., Yakovlev M.V., Krivova M.G. Embedding With Preservation of Semantics of the Original Data. Doklady BGUIR. 2022;20(2):46-52. (In Russ.) https://doi.org/10.35596/1729-7648-2022-20-2-46-52


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1729-7648 (Print)
ISSN 2708-0382 (Online)