AN EFFICIENT SPEECH GENERATIVE MODEL BASED ON DETERMINISTIC/STOCHASTIC SEPARATION OF SPECTRAL ENVELOPES

The paper presents a speech generative model that provides an efficient way of generating a speech waveform from its amplitude spectral envelopes. The model is based on a hybrid speech representation that includes deterministic (harmonic) and stochastic (noise) components. The main idea behind the approach originates from the fact that a speech signal has a determined spectral structure that is statistically bound with the deterministic/stochastic energy distribution in the spectrum. The performance of the model is evaluated using an experimental low-bitrate wide-band speech coder. The quality of the reconstructed speech is evaluated using objective and subjective methods. Two objective quality characteristics were calculated: Modified Bark Spectral Distortion (MBSD) and Perceptual Evaluation of Speech Quality (PESQ). Narrow-band and wide-band versions of the proposed solution were compared with the MELP (Mixed Excitation Linear Prediction) speech coder and the AMR (Adaptive Multi-Rate) speech coder, respectively. Speech of two female and two male speakers was used for testing. The tests show that the overall performance of the proposed approach is speaker-dependent and is better for male voices. Presumably, this difference indicates the influence of pitch height on separation accuracy. Used in an experimental speech compression system, the proposed approach provides decent MBSD values and PESQ values comparable with those of the AMR speech coder at 6.6 kbit/s. Additional subjective listening tests demonstrate that the implemented coding system retains phonetic content and the speaker's identity, which confirms the consistency of the proposed approach.


Introduction
Contemporary speech synthesis algorithms have made a great leap forward due to the development of artificial neural networks. It is now possible to synthesize high-quality speech using the WaveNet algorithm [1] or one of a wide range of similar solutions [2,3]. One of the drawbacks, however, is the high computational complexity of these algorithms, which can reach tens of billions of floating-point operations per second (tens of GFLOPS) and requires high-end GPUs. A much more efficient solution, LPCNet, has been reported recently [4]; however, it requires around 2.8 GFLOPS, which is still very high compared to conventional parametric methods. The crucial part of the synthesis is the problem of transforming an amplitude spectrum or amplitude spectral envelope into a waveform. The classical solution to the problem was proposed by Griffin and Lim and is known as the Griffin/Lim algorithm [5]. The algorithm has lower computational requirements; however, it is very sensitive to hop size, requires the amplitude spectrum and does not work with amplitude spectral envelopes.
ДОКЛАДЫ БГУИР / DOKLADY BGUIR, No. 18 (2) (2020)
In the present paper, an efficient algorithm is proposed for speech waveform generation from its amplitude spectral envelopes. The algorithm utilizes the Harmonic plus Noise Model (HNM) and statistical deterministic/stochastic separation of the envelopes. The main idea behind the approach originates from the fact that a speech signal has a determined spectral structure that is statistically bound with the deterministic/stochastic energy distribution in the spectrum. The separation function is estimated through a training procedure that involves fitting data obtained through instantaneous harmonic analysis and short-time spectra.
Flexible HNM synthesis, where the deterministic part accounts for the periodic (harmonic) structure of the signal and the stochastic part models its noise, was presented in [6,7]. The model has been successfully applied to a number of speech applications: speech coding, text-to-speech synthesis, voice conversion and others. The main benefits of the model can be briefly listed as follows: explicit control over prosodic features of the speech, which is a benefit in text-to-speech synthesis and voice conversion; efficiency of the representation; high-quality speech reconstruction. However, the estimation of harmonic parameters is pitch-based, which means that the method is extremely sensitive to pitch estimation errors. Pitch estimation itself is a fundamental problem of speech analysis that does not have an ultimate solution yet. The estimation is prone to errors, especially for transitional (partially voiced) speech sounds. Inaccurate harmonic parameter values cause audible artifacts in the reconstructed speech. The majority of the mentioned speech processing applications require estimation of harmonic spectral envelopes rather than parameters of individual harmonics. This is also true for the stochastic part of the signal, which is usually estimated as the difference between the source and harmonic signals in the time domain and then represented by spectral envelopes (e.g. using an all-pole filter).
The performance of the proposed speech generation algorithm is evaluated using an experimental low-bitrate wide-band speech coder. The quality of the reconstructed speech is rated using objective and subjective methods.

Overview of the method
Deterministic/stochastic spectrum separation of a speech signal is carried out using a separation function that is determined through a training procedure using a speech data corpus. The training process illustrated in Fig. 1 involves the following steps:
1. Speech data is analyzed using an instantaneous harmonic analyzer [8] and separated into deterministic and stochastic parts.
2. Harmonic spectral envelopes are calculated using interpolation from instantaneous harmonic parameters; noise spectral envelopes are calculated from the stochastic part using the short-time Fourier transform (STFT).
3. Short-time spectra are calculated from the source signal and transformed into spectral envelopes.
4. A separation function that minimizes the quadratic error of the separated spectra is estimated.
During harmonic analysis, speech frames are classified as either voiced or unvoiced. Unvoiced frames are modeled as purely stochastic signals.
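Step 2 of the training procedure can be illustrated with a minimal sketch. It assumes a strictly harmonic frame (harmonic k located at k times the fundamental frequency) and a single amplitude per harmonic; the function name `harmonic_envelope` and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np

def harmonic_envelope(f0_hz, harmonic_amps, freq_grid_hz):
    """Linearly interpolate harmonic amplitudes onto a frequency grid.

    f0_hz         -- fundamental frequency of the frame (Hz)
    harmonic_amps -- amplitudes of harmonics 1..K (linear scale)
    freq_grid_hz  -- increasing frequencies at which the envelope is sampled
    """
    amps = np.asarray(harmonic_amps, dtype=float)
    # Assumption: harmonic k sits exactly at k * f0.
    harmonic_freqs = f0_hz * np.arange(1, len(amps) + 1)
    # np.interp holds the first/last amplitude outside [f0, K * f0].
    return np.interp(freq_grid_hz, harmonic_freqs, amps)

grid = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0, 400.0])
env = harmonic_envelope(100.0, [1.0, 0.5, 0.25], grid)
print(env)
```

Holding the end values below the first and above the last harmonic is a choice made for this sketch only; the behaviour of the original analyzer at the spectrum edges is not specified in the text.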
The spectrum separation process is illustrated in Fig. 2. It is quite simple, whereas the training process requires the implementation of complex algorithms: pitch estimation, harmonic analysis and training.

Estimation of instantaneous harmonic parameters
The hybrid deterministic/stochastic model assumes that the signal s(n) can be expressed as the sum of its periodic and noise parts.
The estimation procedure goes from the first harmonic to the last, recalculating the fundamental frequency at every step. The fundamental frequency values become more precise while moving up the frequency range, which allows proper analysis of high-order harmonics with significant frequency modulations.
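The hybrid model stated at the beginning of this section can be sketched as follows. The example is a simplification: it assumes harmonic amplitudes, phases and pitch that stay constant within a frame, whereas the paper works with instantaneous (time-varying) parameters. The function name `synthesize_harmonics`, the sampling rate and all numeric values are illustrative assumptions.

```python
import numpy as np

def synthesize_harmonics(f0_hz, amps, phases, n_samples, fs_hz):
    """Sum of K stationary harmonics: sum_k A_k cos(2*pi*k*f0*n/fs + phi_k)."""
    n = np.arange(n_samples)
    k = np.arange(1, len(amps) + 1)[:, None]        # harmonic numbers as a column
    arg = 2.0 * np.pi * k * f0_hz * n / fs_hz + np.asarray(phases)[:, None]
    return (np.asarray(amps)[:, None] * np.cos(arg)).sum(axis=0)

fs = 16000                                          # assumed wide-band sampling rate
rng = np.random.default_rng(0)

harmonic = synthesize_harmonics(120.0, [1.0, 0.6, 0.3], [0.0, 0.5, 1.0], 320, fs)
noise = 0.05 * rng.standard_normal(320)

s = harmonic + noise      # hybrid signal: periodic part plus noise part
r = s - harmonic          # residual obtained by subtracting the periodic part
```

In this toy setting the residual equals the noise exactly because the harmonic parameters are known; in practice the residual also contains whatever the harmonic analysis fails to capture.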
Harmonic envelopes are calculated from instantaneous harmonic parameters using linear interpolation. The deterministic part of the signal is synthesized using the estimated harmonic parameters and subtracted from the source signal frame in order to obtain the residual. The residual (stochastic) part of the signal r(n) is parameterized as Bark-band noise. The noise envelopes are calculated as energies of the signal in Bark subbands. After applying the parameterization technique, the speech signal is represented as a set of instantaneous harmonic envelopes, short-time noise envelopes and a pitch contour.
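The calculation of energies in Bark subbands can be sketched as below. Several details are assumptions rather than the paper's settings: the Zwicker-Terhardt approximation of the Bark scale, bands that are exactly one Bark wide, a Hann window, and a 20 ms frame. The names `hz_to_bark` and `bark_band_energies` are hypothetical.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_energies(frame, fs_hz, n_fft=1024):
    """Energy of one frame in 1-Bark-wide bands (assumed band layout)."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs_hz)
    band = np.floor(hz_to_bark(freqs)).astype(int)   # band index of every FFT bin
    # Sum the power of all bins that fall into the same band.
    return np.bincount(band, weights=power, minlength=band.max() + 1)

fs = 16000
n = np.arange(320)                                   # one 20 ms frame (assumed length)
tone = np.sin(2.0 * np.pi * 1000.0 * n / fs)
energies = bark_band_energies(tone, fs)
print(len(energies), int(np.argmax(energies)))
```

Because every FFT bin is assigned to exactly one band, the band energies sum to the total power of the windowed frame, which gives a simple sanity check for any alternative band layout.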

Estimation of energy separation function
The separation function operates on spectral envelope vectors estimated through STFT and splits each of them into harmonic and noise envelope vectors. It is represented as a matrix of linear regression coefficients evaluated using the least-squares method. The experiments show that the best result is achieved when the frequency bands of the envelopes are not uniform. A relatively good result was obtained for the Bark scale: the average squared error was 0.07 per harmonic/noise vector value for a multi-speaker speech database. Fig. 3 illustrates an example of the separation of Bark-band spectral envelopes into deterministic and stochastic parts. The possibility of speech reconstruction from its Bark-band energy envelopes and pitch contour can be especially useful in speech coding.
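A least-squares estimate of a linear separation function can be sketched as follows. This is only an illustration under stated assumptions: a single regression matrix without a bias term, envelopes stored row-wise per frame, and synthetic training data generated from a known matrix. The names `fit_separation_matrix` and `separate` are hypothetical, and the exact regression setup used in the paper may differ.

```python
import numpy as np

def fit_separation_matrix(E, H, N):
    """Least-squares fit of [H, N] ~= E @ W.

    E -- (frames, bands) short-time envelopes of the source signal
    H -- (frames, bands) harmonic envelopes from harmonic analysis
    N -- (frames, bands) noise envelopes of the residual
    """
    targets = np.hstack([H, N])                      # (frames, 2 * bands)
    W, *_ = np.linalg.lstsq(E, targets, rcond=None)  # minimizes the squared error
    return W                                         # (bands, 2 * bands)

def separate(envelope, W):
    """Split one envelope (or a batch of envelopes) into harmonic and noise parts."""
    out = np.asarray(envelope) @ W
    n_bands = W.shape[0]
    return out[..., :n_bands], out[..., n_bands:]

# Synthetic check: data generated by a known matrix should be recovered.
rng = np.random.default_rng(0)
E = rng.random((200, 4))
W_true = rng.random((4, 8))
targets = E @ W_true
W = fit_separation_matrix(E, targets[:, :4], targets[:, 4:])
h, n_env = separate(E[0], W)
```

Once the matrix is trained, separation of a new envelope is a single matrix-vector product, which is consistent with the low computational cost of the separation stage compared to training.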

Experimental results
In order to evaluate the applicability of the method to speech compression, experimental speech coding systems for narrow- and wide-band speech were implemented. The encoding/decoding processes are illustrated in Fig. 4. The distinguishing feature of the coding scheme is that harmonic/noise separation is done at the decoding phase. The input of the decoder consists of quantized Bark-band energy values and a pitch contour. The energy values are obtained using a 1024-point short-time Fourier transform (512-point for the narrow-band version) and combined into Bark-band envelope vectors. The envelopes are calculated with a 10 ms time offset. Pitch values are estimated using analysis filters as was reported in [3]. The encoded parameters are quantized using different codebooks. The codebooks were trained on multi-speaker speech material (about 10 minutes long) using the standard k-means algorithm. The sequence of energy envelopes is reconstructed in the decoder, and their harmonic/noise separation is carried out using the separation function. The function was trained using the same training speech material. The coding scheme is very efficient and can be implemented using a non-uniform filter bank [4].
The performance of the proposed speech coder was evaluated using objective measures of speech quality. The following two quality characteristics were calculated: Modified Bark Spectral Distortion (MBSD) and Perceptual Evaluation of Speech Quality (PESQ). The proposed narrow-band solution was compared with the MELP (Mixed Excitation Linear Prediction) speech coder, and the wide-band version was compared with the AMR (Adaptive Multi-Rate) speech codec. The speech base used for testing contained sentences pronounced by four different speakers (two male and two female speakers whose speech was not used during training). The average obtained values are presented in Tables 1, 2 (the proposed speech coding system is labeled as 'joint coding'). Considering that the accuracy of deterministic/stochastic spectrum decomposition might be speaker-dependent, average results are calculated for male and female voices separately.
The objective quality tests show that the overall performance of the proposed approach is speaker-dependent: the quality of the reconstructed speech is better for male voices. Presumably, this difference indicates the influence of pitch height on separation accuracy. The experimental speech compression system provides decent MBSD values and PESQ values comparable with those of the AMR codec at 6.6 kbit/s (the lowest bitrate possible for the free AMR encoder used in the experiments).
Additional subjective listening tests were carried out as well. The quality of signal reconstruction was compared in the following pairs: AMR 6.6 vs. joint coding 6.6 and MELP 2.4 vs. joint coding 2.4. Twenty different sentences were chosen and played back in random order. Five listeners were asked to choose which sentence in each pair sounded more natural. The proposed encoder was chosen in about 40 percent of cases for wide-band speech and in 35 percent of cases for narrow-band speech. All listeners confirmed that the implemented coding system retains phonetic content and the speaker's identity at every bitrate, which confirms the consistency of the proposed approach.
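Codebook training with k-means and the subsequent vector quantization can be sketched as below. The sketch uses plain Lloyd iterations with random initialization on toy data; the codebook sizes, vector dimensions and any splitting of envelope vectors into sub-vectors used in the experimental coder are not given in the text, so all values and the names `quantize` and `train_codebook` are illustrative assumptions.

```python
import numpy as np

def quantize(vectors, codebook):
    """Index of the nearest codebook entry (squared Euclidean distance)."""
    x = np.atleast_2d(np.asarray(vectors, dtype=float))
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def train_codebook(vectors, n_codes, n_iter=20, seed=0):
    """Plain Lloyd (k-means) iterations; returns an (n_codes, dim) codebook."""
    x = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize with distinct training vectors.
    codebook = x[rng.choice(len(x), size=n_codes, replace=False)].copy()
    for _ in range(n_iter):
        idx = quantize(x, codebook)
        for c in range(n_codes):
            members = x[idx == c]
            if len(members):             # keep the old centroid if a cell is empty
                codebook[c] = members.mean(axis=0)
    return codebook

# Toy training set: two well-separated groups of 3-dimensional vectors.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (50, 3)), rng.normal(5.0, 0.1, (50, 3))])

codebook = train_codebook(data, n_codes=2)
idx = quantize(data, codebook)           # what an encoder would transmit
decoded = codebook[idx]                  # what a decoder would reconstruct
```

A codebook with 2^b entries costs b bits per quantized vector. With the 10 ms envelope offset mentioned above there are 100 frames per second, so a total budget of 6.6 kbit/s would correspond to 66 bits per frame for all transmitted parameters.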

Conclusion
A model for speech generation from its spectral amplitude envelopes has been proposed. The model involves deterministic/stochastic decomposition that is carried out using a separation function without conventional harmonic analysis. The separation function is represented as a matrix of linear regression coefficients and evaluated using the least-squares method. The training sequence contains harmonic and noise envelopes estimated via instantaneous harmonic analysis. The method has been experimentally applied to speech coding. The quality of the reconstructed signal has been rated using objective and subjective methods. The obtained results show the high potential of the presented approach.

Authors' contribution
Taha M. realized the speech modeling.