Autoregressive Switching LDS
The code is written in Objective Caml and has requirements similar to those of the code for the SAR-HMM.
Original signal [ wav ]
Corrupted signal at 0dB [ wav ]
Reconstructed signal [ wav ]
The Matlab C Mex code used to produce those plots is available.
An AR-SLDS can be used for automatic speech recognition in noisy environments. Compared to the SAR-HMM, it has an additional level of complexity which allows noise to be modelled.
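To make the extra level concrete, the two models can be sketched as follows. This is an assumed formulation, written in notation chosen here for illustration: the symbols $s_t$ (discrete state), $h_t$ (hidden clean sample), $v_t$ (observed sample), $c_r$ (AR coefficients), $R$ (AR order), $\sigma^2$ and $\sigma_v^2$ are not defined on this page. In the SAR-HMM the switching autoregressive process generates the observed samples directly, whereas in the AR-SLDS it generates a hidden clean signal that is observed through additive Gaussian noise.

```latex
% SAR-HMM (assumed form): the observed sample is produced directly
% by the AR process selected by the discrete state s_t.
v_t = \sum_{r=1}^{R} c_r(s_t)\, v_{t-r} + \eta_t,
\qquad \eta_t \sim \mathcal{N}\big(0, \sigma^2(s_t)\big)

% AR-SLDS (assumed form): the same AR process drives a hidden clean
% signal h_t, and the observation adds white Gaussian noise to it.
h_t = \sum_{r=1}^{R} c_r(s_t)\, h_{t-r} + \eta^h_t,
\qquad \eta^h_t \sim \mathcal{N}\big(0, \sigma^2(s_t)\big)

v_t = h_t + \eta^v_t,
\qquad \eta^v_t \sim \mathcal{N}\big(0, \sigma_v^2\big)
```

Under this reading, $\sigma^2(s_t)$ would correspond to a per-state innovation variance and $\sigma_v^2$ to the noise variance.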
As with the SAR-HMM, it is possible to evaluate the log-likelihood of a (possibly noisy) speech utterance with respect to an AR-SLDS model by using the arslds_eval command:
./arslds_eval <arhmm file> <utt file> <CovV> [<useKim>]
The first argument to the arslds_eval command is a trained (or at least initialised) SAR-HMM. However, the SAR-HMM resulting from the initialisation or training procedure must be slightly modified before it can be used with arslds_eval. The reason is that gain adaptation can be performed directly with the SAR-HMM, whereas in an AR-SLDS it first requires an estimate of the mean and variance of the continuous hidden variable.
To obtain a first estimate, a per-state innovation variance needs to be provided. The value of this parameter can be obtained by running the arhmm_train_cov command, which is part of the source code of the SAR-HMM.
./arhmm_train_cov arhmm-trained.dat train.lst arslds.dat
The first argument is the original model, the second is the list of training files, and the third is the resulting model, i.e., the same as the original but with the state covariances properly set. Note that here we call the model arslds.dat, but the format of the file is exactly the same as that of a SAR-HMM. The file arslds.dat can therefore be used wherever a SAR-HMM is required; the state covariances are then simply ignored.
Once we have a proper AR-SLDS, we can find the log-likelihood of a speech utterance.
./bin/arslds_eval arslds.dat noisy.dat 1
The second argument is the speech utterance: a single-column ASCII file with one sample per line. The third is the initial noise variance. Since the variance is automatically adapted, we simply set it to a reasonably large value. An optional fourth argument, if set to 1, selects Kim's backward pass instead of the Expectation Correction (EC) backward pass.
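The utterance format described above can be produced with a few lines of Python. The sketch below is only an illustration and is not part of the distributed code: the function name write_utterance and the exponent number format are assumptions, and the tool may well accept other numeric notations.

```python
from typing import Iterable


def write_utterance(samples: Iterable[float], path: str) -> None:
    """Write samples as a single-column ASCII file, one sample per line."""
    with open(path, "w", encoding="ascii") as f:
        for x in samples:
            # Exponent notation is an assumed choice; it keeps every
            # line to exactly one plain ASCII number.
            f.write(f"{float(x):.6e}\n")
```

For instance, write_utterance([0.0, 0.5, -0.25], "noisy.dat") would create a three-line file that could then be passed as the second argument.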
A typical output of the arslds_eval command looks like
-9.196236e-01
1.993868e+00
2.015252e+00
...
2.028725e+00
2.028745e+00
2.028763e+00
2.028763e+00
These numbers show the log-likelihood of the utterance after each iteration. All numbers apart from the last one are printed on the standard error output. Only the last one, which is always the same as the one preceding it, is printed on the standard output. If you do not want to see the intermediate results, simply redirect the standard error output to /dev/null.
./bin/arslds_eval arslds.dat noisy.dat 1 2>/dev/null
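When the final log-likelihood is needed inside a script, the same separation of the two output streams can be used programmatically. The following is a minimal sketch, assuming only the command-line form shown above; the helper names parse_final_loglik and eval_loglik are hypothetical and not part of the distributed code.

```python
import subprocess


def parse_final_loglik(stdout_text: str) -> float:
    """Return the last number found in the captured standard output."""
    tokens = stdout_text.split()
    if not tokens:
        raise ValueError("no log-likelihood found on standard output")
    return float(tokens[-1])


def eval_loglik(model: str, utterance: str, cov_v: str = "1",
                binary: str = "./bin/arslds_eval") -> float:
    """Run arslds_eval and return the final log-likelihood.

    The binary path and the default noise variance mirror the example
    command above; adjust them to the local installation.
    """
    result = subprocess.run(
        [binary, model, utterance, cov_v],
        stdout=subprocess.PIPE,       # final value only
        stderr=subprocess.DEVNULL,    # discard per-iteration values
        text=True,
        check=True,                   # raise if the command fails
    )
    return parse_final_loglik(result.stdout)
```

A call such as eval_loglik("arslds.dat", "noisy.dat") would then return the final value as a float, which makes it straightforward to compare the same utterance under several models.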