My blog reporting on quantitative financial analysis, artificial intelligence for stock investment & trading, and the latest progress in signal processing and machine learning

Monday, February 28, 2011

Misunderstandings on Sparse Bayesian Learning (SBL) for Compressed Sensing (1)

I will gradually discuss some misunderstandings about sparse Bayesian learning (SBL) that I have seen in published papers.

One misunderstanding concerns the "noise variance" term (written as $\lambda$ or $\sigma^2$ in the SBL literature). From the Bayesian view, it seems natural to interpret $\lambda$ as the noise variance. Hence, in computer simulations, people assume that setting $\lambda$ to the true noise variance allows SBL algorithms to reach their full strength for compressed sensing, and that this strategy is therefore fair to SBL algorithms when doing performance comparisons.

But I have to say that for SBL algorithms the optimal value of $\lambda$ is not the true noise variance, and thus some published performance comparisons are not correct!

Here I show some simple experiment results. We can see that the optimal value of $\lambda$ is generally larger than the true noise variance.

I use the M-SBL algorithm proposed by David Wipf in 2007. The code can be downloaded from the link. Note that this code is a modified version of David's code, since some parameter settings in the original code are not suitable for performance comparisons in low-SNR cases.

The Gaussian dictionary matrix was of size 40 by 120 (i.e., N=40, M=120). The number of measurement vectors, L, was 3. The true noise variance was 0.01 (the SNR was around 10 dB). The number of nonzero rows in the source matrix (also called the coefficient matrix or the solution matrix), K, was set to 4, 12, and 16. For each K, the experiment was repeated 200 times. In each trial, the M-SBL algorithm was fed with different values of $\lambda$. Its performance was measured in two ways: the failure rate and the mean square error (MSE). The failure rate is defined in David's M-SBL paper. These are the general experiment settings; for a detailed description of the experiments, see my paper.
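To make the setup concrete, here is a minimal Python sketch of this kind of $\lambda$-sweep experiment. The `msbl` function below is a plain EM implementation of M-SBL written only for illustration; it is not the modified code mentioned above, and the normalized MSE and the support-based failure check are only rough stand-ins for the exact metrics used in the paper.

```python
import numpy as np

# Settings from the post: N=40, M=120, L=3, noise variance 0.01, K in {4, 12, 16}
N, M, L = 40, 120, 3
noise_var = 0.01
K = 12
n_trials = 200                                    # 200 trials per setting in the post
lambda_grid = noise_var * np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # candidate lambda values

def msbl(A, Y, lam, n_iter=200):
    """Plain EM version of M-SBL (for illustration only)."""
    N, M = A.shape
    L = Y.shape[1]
    gamma = np.ones(M)
    for _ in range(n_iter):
        G = gamma[:, None] * A.T                            # Gamma A^T
        Sigma_y = lam * np.eye(N) + A @ G                   # lam*I + A Gamma A^T
        Xi = G @ np.linalg.inv(Sigma_y)                     # Gamma A^T Sigma_y^{-1}
        Mu = Xi @ Y                                         # posterior mean of X
        Sigma_diag = gamma - np.sum(Xi * G, axis=1)         # diagonal of the posterior covariance
        gamma = np.sum(Mu ** 2, axis=1) / L + Sigma_diag    # EM update of the hyperparameters
    return Mu

rng = np.random.default_rng(0)
for lam in lambda_grid:
    mse, fails = [], 0
    for _ in range(n_trials):
        A = rng.standard_normal((N, M)) / np.sqrt(N)        # Gaussian dictionary
        support = rng.choice(M, K, replace=False)
        X = np.zeros((M, L))
        X[support] = rng.standard_normal((K, L))            # K nonzero rows
        Y = A @ X + np.sqrt(noise_var) * rng.standard_normal((N, L))
        X_hat = msbl(A, Y, lam)
        mse.append(np.sum((X_hat - X) ** 2) / np.sum(X ** 2))
        est = set(np.argsort(np.linalg.norm(X_hat, axis=1))[-K:])
        fails += int(est != set(support))                   # crude failure check
    print(f"lambda/noise_var = {lam/noise_var:.1f}  MSE = {np.mean(mse):.4f}  "
          f"failure rate = {fails/n_trials:.3f}")
```

Sweeping $\lambda$ over a grid like this and locating the minimum of each curve is how one can read off the optimal value for each performance measure.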

Here is the result (click the figure for full view):
From the figure, we can draw the following conclusions:
(1) The optimal value of $\lambda$ is not equal to the true noise variance; it is larger than the true noise variance.
(2) Different performance measures correspond to different optimal values (or different ranges of optimal values). In other words, for given experiment settings, the optimal value of $\lambda$ in terms of MSE is different from the optimal one in terms of the failure rate. This observation is even more obvious when using my T-SBL/T-MSBL.
(3) Changing any experiment setting (e.g. K, L, N, M) changes the optimal value accordingly (in the above experiment, I varied K). For example, in the figure we can see that as K (the number of nonzero rows in the coefficient matrix) increases, the optimal value decreases, but it remains larger than the true noise variance.

In fact, I have to add a fourth conclusion (although I don't show it here):
(4) For different SBL algorithms, the optimal values of $\lambda$ are different.

Based on these conclusions, we now know that the strategy of setting $\lambda$ to the true noise variance is problematic in algorithm comparisons. Suppose we compare two SBL algorithms, say SBL (I) and SBL (II), and suppose that for the given experiment settings the optimal $\lambda$ of SBL (I) is much closer to the true noise variance than that of SBL (II). Then, when we set $\lambda$ to the true noise variance for both algorithms, SBL (I) will most likely show better performance than SBL (II), even though SBL (II) actually performs better than SBL (I) when both algorithms use their optimal $\lambda$ values.

Another lesson from the experiment is that whenever we change any experiment setting, we should find the optimal $\lambda$ value again; otherwise, we may draw wrong conclusions. For example, in the above experiment, if we fix $\lambda$ at the true noise variance (0.01), we find that the algorithm's MSE at K=4 is even larger than its MSE at K=12! (Note that the smaller K is, the easier the inverse problem is.)

By now I hope you are convinced of how important it is to choose the optimal $\lambda$ in performance comparisons. Especially when two algorithms have very close performance curves, you must be very cautious to avoid the wrong strategies described above.

NOTE: The above conclusions and lessons apply not only to comparisons among SBL algorithms, but also to comparisons of SBL algorithms with non-SBL algorithms, and to comparisons among non-SBL algorithms (the regularization parameters of $\ell_1$ algorithms exhibit similar phenomena!).

You may ask: although the optimal values of $\lambda$ are important, in practice we probably have no way to find them. Yes, you are right. The above conclusions and lessons should be kept in mind when people really care about which algorithm is better. In practice, you can use some widely used methods to choose a sub-optimal value for $\lambda$ and then compare performance, but in this case you need to be very careful about your conclusions. At the very least, you cannot simply say that algorithm (I) is better than algorithm (II). You need to emphasize that, when using your $\lambda$-setting strategy, algorithm (I) shows better performance than algorithm (II), and keep in mind that your readers may draw the opposite conclusion when using their own strategy for finding a (sub)optimal value of $\lambda$.

The fact that the optimal $\lambda$ is not equal to the true noise variance should not be surprising. In fact, interpreting $\lambda$ as the noise variance is only reasonable when the dictionary matrix is not underdetermined.

Similarly, the matrix B in my T-SBL/T-MSBL algorithms cannot simply be interpreted as the covariance matrix of the sources. And one should not be surprised that using a single matrix B can lead to excellent performance even when different sources have different temporal correlations.


Reference:

Zhilin Zhang, Bhaskar D. Rao, "Clarify Some Issues on the Sparse Bayesian Learning for Sparse Signal Recovery," Technical Report, University of California, San Diego, September 2011.

Thursday, February 24, 2011

Losing sleep is a big problem for me when I get excited about my research

Again, I lost sleep tonight. I have many things to do, many thoughts to organize, and many papers to write. I couldn't stop my brain from thinking.

Just now I remembered a saying: if you love a person, ask him/her to do research; if you hate a person, also ask him/her to do research.

Yes, research is such a magical thing that it makes people like me forget to eat and lose sleep.

M-SBL Using a Gaussian Scale Mixture Model?

Recently, there have been some interesting works that use the Gaussian scale mixture (GSM) model in the framework of sparse Bayesian learning (SBL). But I have not seen any papers using this model to modify the original MSBL algorithm proposed by David, or to deal with the multiple measurement vector (MMV) model under the common sparsity assumption, the model proposed by Cotter & Rao (although there are some papers considering MMV models, their models are different from the one here). It seems that the GSM model can capture the spatial dependency in X to some degree. So, to satisfy my curiosity, I modified David's MSBL using the GSM model.

The MMV model under the common sparsity assumption is given by:
Y=AX + V,
where Y is the available data matrix of size N x L, A is the known dictionary matrix of size N x M, and X is the unknown solution matrix of size M x L. V is the noise matrix. Assume:
X_i = sqrt(z) U_i,    i=1,...,L,
where X_i is the i-th column of the solution matrix X, z is a positive scalar, and U_i is a column vector with independent elements, the j-th element following a Gaussian distribution N(0, \gamma_j). Then, following the steps of the MSBL derivation, we can easily obtain the GSM-based MSBL algorithm.

The algorithm has a similar form to MSBL. Essentially, the only difference is that the $\lambda$ in MSBL becomes $\lambda / z$ in GSM-MSBL. So it seems that GSM-MSBL may correct the learning of $\lambda$. It is known that SBL's learning rules for $\lambda$ are not robust in low-SNR cases, so I expected that GSM-MSBL could yield better recovery performance than MSBL by correcting the learning rule for $\lambda$.
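To make the difference concrete: in standard MSBL the posterior mean of X is $\Gamma A^T(\lambda I + A\Gamma A^T)^{-1}Y$ with $\Gamma = \mathrm{diag}(\gamma_1,\dots,\gamma_M)$. Under the GSM model above, and by the statement above that the only change is $\lambda \to \lambda/z$, the same step becomes

$$ \mathrm{E}[X \mid Y] \;=\; \Gamma A^T \Big( \frac{\lambda}{z} I + A \Gamma A^T \Big)^{-1} Y, $$

so setting $z = 1$ recovers MSBL exactly. (This is just the substitution written out for the posterior-mean step, not a full derivation of GSM-MSBL.)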

However, after working out the derived learning rule for z, I found that it is exactly equal to 1, i.e. z = 1! In this case, GSM-MSBL is the same as MSBL. Admittedly, the above learning rule is obtained from the data using the empirical Bayesian method. Maybe there exist other methods to estimate z, but currently I don't know of any.

If anybody has obtained a GSM-based MSBL algorithm showing better performance, please let me know.

Wednesday, February 23, 2011

Answering Bob's question on the paper: Sparse Signal Recovery with Temporally Correlated Source Vectors Using Sparse Bayesian Learning

Today I received an email from a reader, Bob, asking me why T-MSBL is better than T-SBL in high-SNR or noiseless cases. What confused Bob is that in the paper T-MSBL is an approximation to T-SBL (using the approximation (20)). In theory, both of them are identical in the noiseless case or the correlation-free case. So, intuitively, T-MSBL should not be better than T-SBL, and in noiseless or correlation-free cases T-MSBL should have the same performance as T-SBL.

Thanks Bob for the good question.

First, I have to say that there is a slight difference between T-MSBL and T-SBL even in noiseless or correlation-free cases. For T-SBL, the learning rule for B is given by Equation (13). For T-MSBL, the learning rule is given by Equations (28)-(29), which cannot be obtained from (13) using the approximation (20). As I stated in my paper, different B's can result in different performance. But I guess that if T-MSBL adopted rule (27) instead of (28)-(29), it would have the same performance as T-SBL in noiseless or correlation-free cases (because (27) is the simplified version of (13)).

Sparse Signal Recovery with Temporally Correlated Source Vectors Using Sparse Bayesian Learning

Finally, I finished the revision of this paper, which was submitted to the IEEE Journal of Selected Topics in Signal Processing for the second round of review. The paper can be downloaded from my homepage or from http://arxiv.org/abs/1102.3949

This paper addressed a problem in the multiple measurement vector (MMV) model in compressed sensing. For the MMV model, i.e.
Y=AX + V,
most algorithms apply an Lq norm to the rows of the solution matrix X, and then apply an Lp norm to the resulting row-norm values. For example, the MMV extension of the classic Basis Pursuit uses the L2 norm to evaluate each row of X and the L1 norm to enforce the row sparsity of X. In fact, most algorithms use the L2 norm or the L_infinity norm to evaluate the rows of X. Clearly, these norm operations are blind to data structure, i.e. the correlation among the elements within each nonzero row of X. I call this temporal correlation (the term emphasizes the difference from correlations among different rows, which many papers call spatial correlation). In this paper, by deriving two sparse Bayesian learning algorithms, I showed that algorithms can achieve improved recovery performance if the Lq norm imposed on each row is replaced with a Mahalanobis distance measure. Actually, the improvement is very surprising, as the experiments in the paper showed.
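To illustrate in code what replacing the row norm with a Mahalanobis distance means (this is only a schematic rendering of the idea, not the actual T-SBL/T-MSBL implementation, in which B is learned from the data rather than fixed), compare the usual L2/L1 penalty with a correlation-aware row measure using a matrix B:

```python
import numpy as np

def l21_penalty(X):
    """Standard MMV penalty: L1 norm of the per-row L2 norms (blind to temporal correlation)."""
    return np.sum(np.linalg.norm(X, axis=1))

def mahalanobis_row_penalty(X, B):
    """Same outer L1 sum, but each row x_i is measured by sqrt(x_i B^{-1} x_i^T),
    where the L x L matrix B models the temporal correlation within a nonzero row."""
    B_inv = np.linalg.inv(B)
    return np.sum(np.sqrt(np.einsum('il,lk,ik->i', X, B_inv, X)))

# Toy example with L = 3 and a correlation coefficient of 0.9 within each nonzero row.
# Here B is simply fixed to a Toeplitz (AR(1)-type) correlation matrix for illustration.
L, beta = 3, 0.9
B = np.array([[beta ** abs(i - j) for j in range(L)] for i in range(L)])
X = np.zeros((10, L))
X[2] = [1.0, 0.95, 0.9]      # a strongly correlated nonzero row
X[7] = [0.5, -0.4, 0.6]      # a weakly correlated nonzero row
print(l21_penalty(X), mahalanobis_row_penalty(X, B))
```

The point is only that the second measure sees the temporal structure of each row through B, whereas the first treats all rows of equal energy identically.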

However, the development of new algorithms is just part of the story. The computer simulations revealed some interesting phenomena of the algorithms.

First, let me ask you a question:
When temporal correlation exists, say with a correlation coefficient of 0.9, does the recovery performance of algorithms become worse?
It seems that the answer is YES, because one may intuitively think that when temporal correlation exists, each individual solution vector (i.e. X_i, with X = [X_1, ..., X_L]) provides less information about the locations of the nonzero rows. For example, consider an extreme case: suppose the temporal correlation coefficient is 0.999999. In this case, the individual solution vectors X_i (i=1,...,L) are almost identical. Consequently, the algorithms' performance is very close to their performance when only one measurement vector is available, namely L=1 (let's consider a noiseless environment for simplicity).

However, my paper showed that the answer is NO. Here is one experiment result:

This is Fig. 7 in my paper. M/N is the ratio of the number of columns to the number of rows of the dictionary matrix. T-MSBL is one of my derived algorithms, which exploits temporal correlation. MSBL is a benchmark algorithm proposed by David Wipf in 2007. $\beta$ denotes the temporal correlation coefficient. The relation of MSBL's performance to the temporal correlation matches our intuition: the higher the temporal correlation, the worse the performance. However, one can see that T-MSBL's performance becomes better with higher temporal correlation.

Next, let's look at another extreme experiment. In this (noiseless) experiment, the temporal correlation approaches 1 or -1 arbitrarily closely (the temporal correlation is given by $\beta = \mathrm{sign}(C)(1-10^{-|C|})$). Just imagine what the value is when C=10 and what this implies about the similarity among the individual solution vectors! However, from the figure below, we again see the different behaviors of T-MSBL and MSBL. For MSBL, when |C| becomes large, its performance approaches its performance in the case of L=1. In contrast, T-MSBL keeps the same performance, no matter how large |C| is.
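To get a feeling for how extreme this setting is, here is the formula above evaluated for a few values of C (nothing beyond the definition of $\beta$):

```python
import numpy as np

for C in [1.0, 2.0, 5.0, 10.0]:
    beta = np.sign(C) * (1 - 10.0 ** (-abs(C)))
    print(C, beta)   # C=10 gives beta = 1 - 1e-10: the individual solution vectors are essentially identical
```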

As a reminder, high temporal correlation means that the solution matrix X is ill-conditioned.

The two experiments clearly show the importance of exploiting the temporal correlation among different solution vectors, which is largely ignored in the field of compressed sensing. Up to now I have not found any existing algorithm with performance similar to my T-MSBL and T-SBL, nor any theory that can explain the behavior of T-MSBL/T-SBL in the above two figures. If you find an algorithm with similar performance, or a theory that can explain these phenomena, please let me know.

The third point I want to emphasize is that I strongly suggest using the Mahalanobis distance measure instead of Lq norms imposed on the nonzero rows of X. We have seen many works (both theoretical and algorithmic) analyzing the benefits of Lq norms, such as the L2 norm or the L_infinity norm. However, these data-structure-blind norms cannot lead to satisfactory performance in some practical applications, since practical signals or image sequences generally have strong temporal correlations. At least, using the Mahalanobis distance measure can lead to better performance. But I have to admit that this work is just a starting point; further study is needed.