{"title": "MULAN: A Blind and Off-Grid Method for Multichannel Echo Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 2182, "page_last": 2192, "abstract": "This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonar, seismology, ultrasound or room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. All existing methods proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak picking. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessarily on-grid, and since true locations never match the sampling grid, the weight estimation precision is also strongly limited. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach to the problem, building on top of the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. 
It is shown to outperform conventional methods by several orders of magnitude in precision.", "full_text": "MULAN: A Blind and Off-Grid Method for Multichannel Echo Retrieval

Helena Peić Tukuljac
Department of Computer and Communication Sciences
École polytechnique fédérale de Lausanne
helena.peictukuljac@epfl.ch

Antoine Deleforge
Université de Lorraine, CNRS, Inria, LORIA
F-54000 Nancy, France
antoine.deleforge@inria.fr

Rémi Gribonval
Univ Rennes, Inria, CNRS, IRISA
35000 Rennes, France
remi.gribonval@inria.fr

Abstract

This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonar, seismology, ultrasound or room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. Existing methods in the literature proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak-picking on filters. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessarily on-grid, and since true locations never match the sampling grid, the weight estimation precision is impacted. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach to the problem, building on the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. 
It is shown to outperform conventional methods by several orders of magnitude in precision.

1 Introduction

When a wave propagates from a point source through a medium and is reflected on surfaces before reaching sensors, the measured signals consist of mixtures of the direct-path signal with delayed and attenuated copies of itself. This physical phenomenon is commonly referred to as echoes and has a wide range of applications in different areas of science, from sonar [1] to seismology [2], from acoustics [3, 4, 5] to ultrasound [6]. For instance, in acoustics, it has been shown that precise knowledge of early echo timing enables the estimation of the positions of reflective surfaces in a room [3, 4]. In [3], the approximate 3D geometry of Lausanne cathedral could be retrieved in this way. On the other hand, echoes' attenuations capture information about the acoustic impedance of surfaces, which is notoriously hard to measure or estimate in practice [7, 8]. In [9] and [10], it is shown that knowing the attenuation and timing of early echoes may improve beamforming and source separation performance, respectively. Systems using echoes for beamforming are commonly referred to as rake receivers in the wireless literature [11].

Retrieving echo properties when the emitted signal is known is referred to as active echolocation in biology, and is well exemplified by the sensory system of echolocating bats. This principle is for instance at the heart of active sonar technologies. In the signal processing literature, this problem belongs to the category of system or channel identification, i.e., estimating the filters from a known input to the observed output of a linear system.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
In the case of echoes, these linear filters consist of streams of Diracs in the continuous-time domain and are hence sparse in the discrete-time domain. The more challenging problem of estimating echoes/filters when the emitted signal is unknown is referred to as passive echolocation or blind system identification (BSI) [12, 13, 14, 15, 16, 17, 18, 5, 19, 20]. BSI is a long-standing and still active research topic in signal processing, notably due to its fundamental ill-posedness. In the general setting of arbitrary signals and filters, rigorous theoretical ambiguities under which the problem is unsolvable have been identified [12]. A number of methods for multichannel BSI with general signals and filters were developed some time ago [12, 13, 14]. Some well-known limitations of these approaches are their sensitivity to the chosen length of filters, and their intractability when the filters are too long. Following the compressed sensing wave [21], a number of methods extending these BSI methods to the case of sparse [15, 16, 17, 18, 5] or structured [20] filters have been developed. They generally extend classical methods using regularizers such as the $\ell_1$-norm for sparsity or a bilinear constraint as in [20]. Similarly to classical filter estimation methods, they require knowledge of the filters' length and they work in the space of discrete-time filters, which are typically thousands of samples long. Because they work in the discrete-time domain, the accuracy at which these methods can recover echo locations is fundamentally limited by the signal's sampling frequency: the recovered echoes are on-grid. Moreover, the sparsity assumption on filters is invalid in practice due to smoothing and sampling effects at sensors.
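This smoothing effect is easy to reproduce. The sketch below (a toy illustration, not from the paper: it assumes a hypothetical 16 kHz sampling rate and a single echo at a fractional delay of 10.4 samples) shows that the sampled filter is neither sparse nor peaked at the true delay:

```python
import numpy as np

Fs = 16000.0
tau = 10.4 / Fs                   # hypothetical off-grid delay: 10.4 samples
n = np.arange(64)
h = np.sinc(n - Fs * tau)         # sinc-smoothed Dirac, sampled on the grid

print(np.argmax(np.abs(h)))       # -> 10: peak-picking lands on the grid, not at 10.4
print(np.sum(np.abs(h) > 0.01))   # -> 41: far more than one non-negligible tap
```

Even though the continuous-time filter is a single Dirac, 41 of the 64 taps exceed 1% of the peak, and the largest tap sits at the nearest grid point rather than at the true location.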
Interestingly, [22] employs a continuous-time spike model for single-channel blind deconvolution but relies on a strong linear prior on the signal.

In this paper, we propose a drastically different approach to blind echo retrieval based on the framework of finite-rate-of-innovation (FRI) sampling [23, 24, 25]. In stark contrast with existing methods, the approach directly operates in the space of continuous-time echoes, and is hence able to blindly recover their locations off-grid. The proposed method is shown to recover echo delays and attenuations with an accuracy far higher than what the sampling rate would normally allow, using noiseless multichannel discrete-time measurements of an unknown simulated speech emitter in a room. The method does not assume that the filters are finite-length and only requires the number of echoes. The remainder of this paper is organized as follows. In section 2 the signal and measurement models and notations are introduced, and conventional methods are briefly reviewed. In section 3 the proposed approach is presented in the non-blind and blind cases. In section 4 the method is compared and evaluated on both synthetic and room-acoustic data. Conclusions and future directions are outlined in section 5.

2 Background

2.1 The signal and measurement models

We start by defining the signal model in the continuous-time domain. Suppose a source emits a band-limited signal $\tilde{s}(t)$ which is reflected and attenuated K times before reaching M sensors. 
The continuous signal impinging at sensor m is

$\tilde{x}_m(t) = (\tilde{h}_m * \tilde{s})(t)$    (1)

where $\tilde{h}_m$ is a linear filter from the source to sensor m and $*$ denotes the continuous convolution operator defined by

$(x * y)(t) = \int_{-\infty}^{+\infty} x(u)\, y(t-u)\, du$.    (2)

The filter consists of the following stream of Diracs:

$\tilde{h}_m(t) = \sum_{k=1}^{K} c_{m,k}\, \delta(t - \tau_{m,k})$,    (3)

where $\delta$ denotes the Dirac delta function, $\{\tau_{m,k}\}_{k=1}^{K}$ denote the K propagation times from the source to sensor m in seconds, i.e. the echo delays or Dirac locations, and $\{c_{m,k}\}_{k=1}^{K}$ denote the echo attenuations or Dirac weights. In practical applications, continuous time-domain signals are not accessible. They are measured by sensors and discretized to be stored in a computer's memory.

Figure 1: (a) Continuous-time stream of Diracs $\tilde{h}(t)$, (b) sinc kernel $\tilde{\varphi}(t)$, (c) smoothed stream $(\tilde{\varphi} * \tilde{h})(t)$, (d) original stream $\tilde{h}(t)$ (red) and its smoothed, sampled version $\hat{h} \in \mathbb{R}^L$ (blue).

Let $\hat{x}_m \in \mathbb{R}^N$ denote N consecutive discrete samples collected by sensor m. Most measurement models assume that the impinging signal undergoes an ideal low-pass filter with frequency support $[-F_s/2, F_s/2]$ before being regularly sampled at the rate $F_s$ in Hz. This is expressed by

$\hat{x}_m(n) = (\tilde{\varphi} * \tilde{x}_m)(n/F_s), \quad n = 0, \ldots, N-1$    (4)

where $\tilde{\varphi}(t) = \sin(\pi t)/(\pi t)$ is the classical sinc sampling kernel. The continuous-time model (1) can then be approximated in two different ways, described in the next two sub-sections.

2.1.1 Discrete time-domain model

First, model (1) can be approximated in the discrete, finite-time domain. 
Let $\hat{h}_m \in \mathbb{R}^L$ and $\hat{s} \in \mathbb{R}^{N+L-1}$ denote discrete, sampled versions of the filter $\tilde{h}_m$ and signal $\tilde{s}$ respectively. We then have

$\hat{x}_m(n) \approx (\hat{h}_m \star \hat{s})(n)$    (5)

where the discrete finite convolution operator $\star$ between two vectors $u \in \mathbb{R}^L$ and $v \in \mathbb{R}^D$ ($L \le D$) is defined by

$(u \star v)(n) = \sum_{j=0}^{L-1} u(j)\, v(L-1+n-j), \quad n = 0, \ldots, D-L$.    (6)

The following convenient matrix notation will be used in the paper:

$u \star v = \mathrm{Toep}_0(u)\, v = \mathrm{Toep}(v)\, u = \begin{bmatrix} u_L & \cdots & u_1 & 0 & \cdots & 0 \\ 0 & u_L & \cdots & u_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & & \ddots & 0 \\ 0 & \cdots & 0 & u_L & \cdots & u_1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_D \end{bmatrix} = \begin{bmatrix} v_L & v_{L-1} & \cdots & v_1 \\ v_{L+1} & v_L & \cdots & v_2 \\ \vdots & \vdots & \ddots & \vdots \\ v_D & v_{D-1} & \cdots & v_{D-L+1} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_L \end{bmatrix}$,    (7)

where $\mathrm{Toep}_0(u) \in \mathbb{R}^{(D-L+1) \times D}$ and $\mathrm{Toep}(v) \in \mathbb{R}^{(D-L+1) \times L}$. The validity of approximation (5) depends on the way $\tilde{h}_m$ and $\tilde{s}$ are sampled. In [26] (Proposition 2), it is shown that if $\tilde{s}(t)$ is band-limited with maximum frequency lower than $F_s/2$ and if we let the number of samples N and the filter length L grow to infinity, then model (5) is exact for the following sampling schemes:

$\hat{s}(n) = \tilde{s}(n/F_s), \quad n \in \mathbb{Z}$    (8)
$\hat{h}_m(n) = (\tilde{\varphi} * \tilde{h}_m)(n/F_s), \quad n \in \mathbb{Z}$.    (9)

Here, it is important to note that, contrary to intuition, even in the idealized case where an infinite number of samples are available, the discrete-time filters $\{\hat{h}_m\}_{m=1}^{M}$ involved in the measurement model are never streams of Diracs, but non-sparse, infinite-length filters consisting of decimated combinations of sinc functions. This is illustrated in Fig. 1. Recovering the original Dirac positions and coefficients from finitely many samples of such filters is a challenging task in itself.

2.1.2 Discrete frequency-domain model

Alternatively, one may approximate model (1) in the discrete finite-frequency domain. Let $x_m \in \mathbb{C}^F$ denote the discrete Fourier transform (DFT) of $\hat{x}_m$, defined by

$x_m(f) = \mathrm{DFT}(\hat{x}_m) = \sum_{n=0}^{N-1} \hat{x}_m(n)\, e^{-2\pi i f n / F_s}$    (10)

where f belongs to a set of F regularly-spaced frequencies $\mathcal{F} = \{f_1, \ldots, f_F\} \subset\, ]0, F_s/2]$ in Hz. We then have the following approximate model:

$x_m(f) \approx h_m(f)\, s(f) \approx \left( \sum_{k=1}^{K} c_{m,k}\, e^{-2\pi i f \tau_{m,k}} \right) s(f)$    (11)

where $h_m \in \mathbb{C}^F$ and $s \in \mathbb{C}^F$ denote the DFT of $\hat{h}_m$ and $\hat{s}$, respectively. Two approximations are made in (11). First, the time-domain convolution between $\hat{h}_m$ and $\hat{s}$ has been transformed into a multiplication through the DFT. 
This would be exact for a circular convolution, but the actual model is a linear convolution between infinite and non-periodic signals, resulting in an approximation error. Second, the formula used for $h_m$ in the right-hand side of (11) is the one that would result from the discrete-time Fourier transform (DTFT) of $\hat{h}_m$, which would require infinitely many samples N to be calculated exactly, as opposed to the DFT. Note that the smoothing sinc kernel $\tilde{\varphi}(t)$ does not impact this formula, since only frequencies below $F_s/2$ are considered. Importantly, both approximations in (11) become arbitrarily precise as the number of samples N grows to infinity.

While both the discrete-time model (5) and the discrete-frequency model (11) become increasingly accurate when N becomes large, the latter directly incorporates the variables of interest $\{c_{m,k}, \tau_{m,k}\}_{m,k=1}^{M,K}$, as opposed to the former. In the remainder of this paper, it will be assumed that $\tilde{s}(t)$ is band-limited with maximum frequency less than $F_s/2$ and that N is sufficiently large such that both models hold very well. This is a reasonable assumption in audio applications, where sensors typically acquire tens of thousands of samples per second. Moreover, we focus on situations where sensor noise is negligible. Hence, the approximation signs will be dropped for convenience.

2.2 Existing methods in channel identification

To the best of the authors' knowledge, all existing methods in blind channel identification rely on the discrete-time model (5) [12, 13, 14, 15, 16, 17, 18, 5, 19, 20]. The case of general emitted signals and finite filters was studied both methodologically and theoretically in the 90s [12, 13, 14], where two main categories of methods emerged, which we briefly review here, focusing on the two-channel (M = 2) case for simplicity. 
First, the so-called subspace methods rely on the estimation of a time-domain $MP \times MP$ covariance matrix, where P is a time-window length that must be larger than the filters' length L [14]. The filters are estimated by spectral decomposition of this matrix. Second, the more common cross-relation (CR) methods rely on the observation that under noiseless conditions we have $\hat{h}_m \star \hat{x}_l - \hat{h}_l \star \hat{x}_m = 0_{N-L+1}$ for $l \ne m \in \{1, \ldots, M\}$, by associativity of the convolution. A common approach is therefore to solve a minimization problem of the form:

$\hat{h}_1^*, \hat{h}_2^* = \underset{\hat{h}_1(1)=1}{\mathrm{argmin}} \left\| \mathrm{Toep}(\hat{x}_2)\,\hat{h}_1 - \mathrm{Toep}(\hat{x}_1)\,\hat{h}_2 \right\|_2^2$,    (12)

which is a simple least-squares problem. The constraint $\hat{h}_1(1) = 1$ is used to avoid the trivial solution $\hat{h}_1 = \hat{h}_2 = 0_L$. Alternatively, the normalization $\|\hat{h}_1\|_2^2 + \|\hat{h}_2\|_2^2 = 1$ can be used, leading to a minimum eigenvalue problem.

In the case of interest where the goal is to retrieve echo information from the filters, both subspace [17] and, to a larger extent, CR [15, 16, 18, 5] methods have been extended in order to handle sparse filters. This approach requires two independent steps: first estimating sparse filters, second retrieving echo locations and weights from them, typically using a peak-picking technique. Following the compressed sensing idea [21], sparsity is usually promoted using an $\ell_1$-norm penalty term on the filters. 
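The plain cross-relation idea of (12) can be sketched in a few lines. The toy example below (synthetic data with hypothetical sizes, and the unit-norm normalization solved by SVD rather than the $\hat{h}_1(1)=1$ constraint) recovers two random discrete filters up to a global scale and sign:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 8, 200                                   # hypothetical filter and signal lengths
h1, h2 = rng.standard_normal(L), rng.standard_normal(L)
s = rng.standard_normal(N)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2) # noiseless two-channel observations

def conv_mat(x, L):
    """Matrix C(x) such that C(x) @ h == np.convolve(x, h) for h of length L."""
    C = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        C[j:j + len(x), j] = x
    return C

# Cross-relation: conv(x2, h1) - conv(x1, h2) = 0, so [h1; h2] spans the null space
A = np.hstack([conv_mat(x2, L), -conv_mat(x1, L)])
v = np.linalg.svd(A)[2][-1]                     # unit-norm minimizer of ||A v||_2
truth = np.concatenate([h1, h2]); truth /= np.linalg.norm(truth)
err = min(np.linalg.norm(v - truth), np.linalg.norm(v + truth))
print(err)                                      # ~0: filters recovered up to scale/sign
```

For generic (coprime) filters the null space of A is one-dimensional, so the smallest singular vector is the stacked filter pair up to scale, which is exactly the ambiguity discussed in Sec. 3.3.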
For instance, in [16] the following LASSO-type [27] problem is considered:

$\hat{h}_1^*, \hat{h}_2^* = \underset{\hat{h}_1(1)=1}{\mathrm{argmin}} \left\| \mathrm{Toep}(\hat{x}_2)\,\hat{h}_1 - \mathrm{Toep}(\hat{x}_1)\,\hat{h}_2 \right\|_2^2 + \lambda \left( \|\hat{h}_1\|_1 + \|\hat{h}_2\|_1 \right)$    (13)

and a Bayesian-learning method for the automatic inference of $\lambda$ is proposed. Several other approaches relying on similar schemes [15, 18, 5] have been proposed.

Four important bottlenecks of discrete-time methods for echo retrieval can be identified:

• Although they rely on sparsity-enforcing regularizers, the filters are strictly speaking non-sparse in practice, due to the sinc kernel (Fig. 1). This general bottleneck in compressed sensing has been referred to as basis mismatch and was notably studied in [28]. In particular, the true peaks of the filters do not correspond to the true echoes (Fig. 1), even for $N \to \infty$. Still, most existing methods rely on peak-picking [18, 5].
• For the same reason, these methods are fundamentally on-grid, namely, they can only output echo locations which are integer multiples of the sampling period $1/F_s$. This prevents subsample resolution, which may be important in applications such as room shape reconstruction from audio signals [3].
• These methods strongly rely on the knowledge of the length L of the filters. However, due to the sinc kernel (Sec. 2.1.1), the true filters are always infinite.
• The dimension of the search space is $ML - 1$, which is much larger in practice than the actual number $2MK$ of unknown variables. This makes the methods computationally demanding and sometimes intractable for large filter lengths (typically in the tens of thousands for acoustic applications).

3 Off-grid echo retrieval by multichannel annihilation

In this section, we introduce a novel method for echo recovery that makes use of the discrete-frequency model (11) and overcomes a number of shortcomings of existing approaches. Namely, it works directly in the parameter space, it does not rely on the filters' length but on the number of echoes, and it enables exact off-grid recovery of echoes' locations and weights in the noiseless case. The approach relies on the finite-rate-of-innovation (FRI) sampling paradigm introduced in [23]. This is the first time this paradigm is applied to blind channel identification, to the best of the authors' knowledge.

3.1 The non-blind case

We start by considering the non-blind case where the emitted signal $s \in \mathbb{C}^F$ in the discrete frequency domain is known. We further assume throughout the paper that this signal is nonzero on the considered frequency grid $\mathcal{F} = \{f_1, \ldots, f_F\}$. We can then transform the discrete-frequency model (11) by writing:

$h_m(f) = x_m(f)\, z(f) = \sum_{k=1}^{K} c_{m,k}\, e^{-2\pi i f \tau_{m,k}}$    (14)

where the Fourier-inverted signal $z \in \mathbb{C}^F$ is defined by $z(f) = s(f)^{-1}$. Our goal is to estimate $\{c_{m,k}, \tau_{m,k}\}_{k=1}^{K}$ from $h_m = x_m \odot z$, where $\odot$ denotes the Hadamard product. If we take our frequency indexes $\mathcal{F}$ to be in arithmetic progression with step $\Delta f$, then the exponential sequence $\{e^{-2\pi i f_i \tau_{m,k}}\}_{i=1}^{F}$ is a geometric progression with ratio $r_{m,k} = e^{-2\pi i \Delta f \tau_{m,k}}$ for each m, k. Hence, $h_m$ is a weighted sum of geometric progressions. This enables us to use the so-called annihilating filter technique [29]. This technique is based on the observation that

$[1, -w] \star [w^0, w^1, w^2, \ldots, w^{F-1}] = 0_{F-1}$,    (15)

for any $w \in \mathbb{C}$ and $F \in \mathbb{N}$. We deduce that if we define the filter $a_m = [a_{m,0}, \ldots, a_{m,K}] \in \mathbb{C}^{K+1}$ as the following discrete convolution¹ of K filters of size 2:

$a_m = [1, -r_{m,1}] \star [1, -r_{m,2}] \star \cdots \star [1, -r_{m,K-1}] \star [0_{K-1}, 1, -r_{m,K}, 0_{K-1}]$,    (16)

then $a_m$ is an annihilating filter for $h_m$, i.e., $a_m \star h_m = 0_{F-K}$. Importantly, the number of echoes K has to be known upfront in order to define $a_m$. Let us now define the polynomial representation of filter $a_m$ by:

$P_{a_m}[y] = \sum_{k=0}^{K} a_{m,k}\, y^k$.    (17)

Because $a_m$ is an annihilating filter for $h_m$, it follows from the classical interpretation of convolution as polynomial multiplication that $P_{a_m}$ has exactly K roots, which are the ratios $\{r_{m,k}\}_{k=1}^{K}$. Hence, once an annihilating filter $a_m$ for $h_m$ has been found, the Dirac locations $\{\tau_{m,k}\}_{k=1}^{K}$ can be deduced by rooting $P_{a_m}$. Once the roots are known, reconstructing the weights is a simple linear problem involving a Vandermonde matrix $V(r_m) \in \mathbb{C}^{F \times K}$, obtained by writing (14) in matrix form:

$\begin{bmatrix} h_m(f_1) \\ h_m(f_2) \\ \vdots \\ h_m(f_F) \end{bmatrix} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ r_{m,1}^1 & r_{m,2}^1 & \cdots & r_{m,K}^1 \\ \vdots & \vdots & \ddots & \vdots \\ r_{m,1}^{F-1} & r_{m,2}^{F-1} & \cdots & r_{m,K}^{F-1} \end{bmatrix} D_m \begin{bmatrix} c_{m,1} \\ c_{m,2} \\ \vdots \\ c_{m,K} \end{bmatrix} = V(r_m)\, D_m\, c_m$,    (18)

where $D_m = \mathrm{Diag}(e^{-2\pi i f_1 \tau_m}) \in \mathbb{C}^{K \times K}$. The least-squares solution of this system is given by $c_m = D_m^{-1} V(r_m)^{\dagger} h_m$, where $\{\cdot\}^{\dagger}$ denotes the Moore-Penrose pseudo-inverse. In practice, since positive weights are sought, the phases of this complex vector are discarded.

¹ The chained discrete convolutions in (16) have to be taken from right to left to be compatible with (6).
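The whole non-blind pipeline (annihilation, rooting, then the Vandermonde system (18)) fits in a few lines of numpy. The sketch below uses hypothetical off-grid delays and a toy arithmetic frequency grid, and solves the annihilation step as a smallest-singular-vector problem:

```python
import numpy as np

K = 3
tau = np.array([0.0042, 0.0135, 0.0278])        # hypothetical off-grid delays (seconds)
c = np.array([1.0, 0.6, 0.35])                  # positive echo weights
f1, df, F = 200.0, 10.0, 2 * K + 5              # grid f_i = f1 + i*df, with F >= 2K + 1
f = f1 + df * np.arange(F)
h = np.exp(-2j * np.pi * np.outer(f, tau)) @ c  # h(f_i) = sum_k c_k e^{-2 pi i f_i tau_k}

# Annihilation: sum_j a_j h[n-j] = 0 for n = K..F-1; a = smallest right singular vector
M = np.array([[h[K + i - j] for j in range(K + 1)] for i in range(F - K)])
a = np.linalg.svd(M)[2][-1].conj()
tau_hat = np.sort(-np.angle(np.roots(a)) / (2 * np.pi * df))   # roots -> Dirac locations

# Weights via the Vandermonde system (18): c = |D^{-1} V(r)^+ h|
r = np.exp(-2j * np.pi * df * tau_hat)
V = np.vander(r, F, increasing=True).T          # V[i, k] = r_k^i
D = np.diag(np.exp(-2j * np.pi * f1 * tau_hat))
c_hat = np.abs(np.linalg.solve(D, np.linalg.pinv(V) @ h))

print(np.max(np.abs(tau_hat - tau)))            # ~0: delays recovered off-grid
print(np.max(np.abs(c_hat - c)))                # ~0: weights recovered near-exactly
```

Note the delays are recovered to machine precision even though none of them lies on any sampling grid; the only structural assumptions are the number of echoes K and the arithmetic frequency grid.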
General FRI theory [23] tells us that $F \ge 2K + 1$ is enough to uniquely recover the exact K Dirac locations and weights using this method. In other words, the original echo retrieval problem has been reduced to that of finding an annihilating filter for $h_m = x_m \odot z$. In practice, this can be done by solving the following minimization problem for $m = 1, \ldots, M$:

$a_m^* = \underset{\|a_m\|_2^2 = 1}{\mathrm{argmin}} \left\| \mathrm{Toep}(x_m \odot z)\, a_m \right\|_2^2$,    (19)

where the unit-norm constraint is used to avoid the trivial solution $a_m = 0_{K+1}$. The solution of this problem is the eigenvector associated with the lowest eigenvalue of $\mathrm{Toep}(x_m \odot z)$. Assuming that the true z is given, that model (11) holds exactly and that $F \ge 2K + 1$, this eigenvalue will be unique and equal to 0.

3.2 MULAN: an iterative scheme

In the blind echo retrieval problem of interest, the emitted signal s, and hence z, are unknown. To solve for all unknown variables jointly, we introduce the following non-convex optimization problem:

$z^*, a_1^*, \ldots, a_M^* = \underset{\|z\|_2^2 = \|a_1\|_2^2 = \cdots = \|a_M\|_2^2 = 1}{\mathrm{argmin}} \sum_{m=1}^{M} \left\| \mathrm{Toep}(x_m \odot z)\, a_m \right\|_2^2$.    (20)

Our strategy to tackle this problem is alternated minimization with respect to each variable. Minimization with respect to each $a_m$ is already covered by the previous section. Minimization with respect to z is also a minimum eigenvalue problem, since the cost function $C(z, a)$ can be rewritten:

$C(z, a) = \sum_{m=1}^{M} \left\| \mathrm{Toep}(x_m \odot z)\, a_m \right\|_2^2 = \sum_{m=1}^{M} \left\| \mathrm{Toep}_0(a_m)\, \mathrm{Diag}(x_m)\, z \right\|_2^2 = \left\| Q z \right\|_2^2$,    (21)

where

$Q = [\mathrm{Toep}_0(a_1)\,\mathrm{Diag}(x_1);\; \ldots\; ;\, \mathrm{Toep}_0(a_M)\,\mathrm{Diag}(x_M)] \in \mathbb{C}^{M(F-K) \times F}$    (22)

and $[\cdot\,;\cdot]$ denotes vertical concatenation. 
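The rewriting in (21)-(22) is what turns the z-update into an eigenvalue problem. A quick numerical sanity check of this identity, with random complex data and Toep0 built directly from its definition in (7):

```python
import numpy as np

rng = np.random.default_rng(2)
F, K, M = 16, 3, 2                               # toy sizes, chosen for illustration
x = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))
z = rng.standard_normal(F) + 1j * rng.standard_normal(F)
a = rng.standard_normal((M, K + 1)) + 1j * rng.standard_normal((M, K + 1))

def toep0(u, F):
    """Banded matrix with Toep0(u) @ w == np.convolve(u, w, 'valid') for w of length F."""
    T = np.zeros((F - len(u) + 1, F), dtype=complex)
    for n in range(F - len(u) + 1):
        T[n, n:n + len(u)] = u[::-1]
    return T

# Left side of (21): sum_m ||Toep(x_m . z) a_m||^2, i.e. a convolution per channel
lhs = sum(np.linalg.norm(np.convolve(a[m], x[m] * z, 'valid'))**2 for m in range(M))
# Right side of (21)-(22): ||Q z||^2 with Q stacking Toep0(a_m) Diag(x_m)
Q = np.vstack([toep0(a[m], F) @ np.diag(x[m]) for m in range(M)])
gap = abs(lhs - np.linalg.norm(Q @ z)**2)
print(gap)                                       # ~0: both sides match to machine precision
```

Since the cost is a quadratic form $z^H Q^H Q z$ under $\|z\|_2 = 1$, its minimizer is the eigenvector of $Q^H Q$ with smallest eigenvalue, mirroring the $a_m$-update.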
If the algorithm succeeds in bringing the cost function down to zero, it means that appropriate annihilating filters have been found for all channels for a given Fourier-inverted signal z, and the locations and weights of all Diracs can be recovered. We call this method MULAN, for MULtichannel ANnihilation. Pseudo-code for the algorithm is given in Alg. 1. Since (20) is non-convex, the alternate minimization scheme is at best guaranteed to converge to a stationary point of the cost function $C(z, a)$. To alleviate this issue, we propose to initialize the method multiple times with random values of z and only keep the run with lowest final $C(z, a)$.

Algorithm 1 MULAN (MULtichannel ANnihilation)
Input: Frequency-domain multichannel measurements $\{x_{1:M}(f); f \in \mathcal{F}\}$ computed via DFT (10); max_iter; conv_thresh.
Output: Echo locations and weights $\{\tau_{m,k}, c_{m,k}\}_{m,k=1}^{M,K}$.
1: iter := 0; z := random();  // i.i.d. standard complex Gaussian in $\mathbb{C}^F$
2: repeat
3:   iter := iter + 1;
4:   for m = 1 → M do: $a_m$ := min_eig_vec(Toep($x_m \odot z$)); end for  // See eq. (21)
5:   z := min_eig_vec(Q);  // See eq. (22)
6: until iter = max_iter or $C(z, a)$ decreased by less than conv_thresh
7: for m = 1 → M do
8:   $r_m$ := roots($P_{a_m}$); $\tau_m := -\arg(r_m)/(2\pi \Delta f)$; $c_m := \mathrm{abs}(D_m^{-1} V(r_m)^{\dagger} h_m)$;  // See Sec. 3.1
9: end for
10: return shifted and scaled $\{\tau_{m,k}, c_{m,k}\}_{m,k=1}^{M,K}$;  // See Sec. 3.3

3.3 Identifiability and ambiguities

The identifiability of blind channel identification for general discrete filters and signals has been studied some time ago [12]. It is known that the filters $\{\hat{h}_m\}_{m=1}^{M}$ cannot be recovered if their polynomial representations admit at least a common root, or if the polynomial representation of the emitted signal $\hat{s}$ has less than 2L + 1 roots. 
The latter is ruled out if the emitted signal has a rich enough spectral content (enough nonzero frequencies), which is usually the case for natural signals. The former has at least one consequence in our case: the problem is unidentifiable if the observed signals are scaled and delayed versions of each other, which may happen in practice. While other common roots may appear in the general setting, it is important to note that MULAN restricts the search of filters to those which are linear combinations of geometric series in the frequency domain. There is no complete theoretical study of common roots in this case, to the best of the authors' knowledge. The authors of [30] theoretically studied blind deconvolution of sparse signals, but their results do not apply here since our filters are not sparse (see Sec. 2.2). Another well-known ambiguity is that the filters can only be recovered up to a global time-shift and scaling, because a converse shifting and scaling of the emitted signal yields the same observations. We handle this by adopting the convention $\tau_{1,1} = 0$ and $c_{1,1} = 1$. Additionally, we assume that all echoes are located in the first half of the temporal filters to avoid time-wrapping ambiguities. Finally, the proposed MULAN algorithm has an extra specific ambiguity. It can easily be shown that multiplying the roots of all polynomials $\{P_{a_m}\}_{m=1}^{M}$ by a complex scalar $\gamma$ while dividing the Fourier-inverted signal z element-wise by a geometric series of ratio $\gamma$ does not change the cost function $C(z, a)$. This can be handled by rescaling the roots of all annihilating filters to have unit modulus at each iteration. However, since only the complex arguments of the roots are used in the end, this appeared to be unnecessary in our experiments.

4 Experiments

4.1 On-grid vs. 
off-grid echo retrieval

We first emphasize the specific ability of the proposed method to recover echo locations off-grid by comparing it to conventional on-grid methods on a simulated room-acoustic scenario, and on an artificial scenario with truly sparse discrete filters for reference. For the room-acoustic scenario, there is a point source emitting speech from the TIMIT dataset [31], and M = 2 microphones are randomly placed inside 100 random shoe-box rooms whose sizes vary from 4m × 6m × 8m to 5m × 7m × 9m. Simulations were performed using the pyroomacoustics library [32]. The absorption coefficient of each surface of the room is set to 0.2. Only first-order reflections on the 6 surfaces and the direct path are simulated, resulting in K = 7 echoes per channel and filters shorter than 50 ms. For each experiment, it was ensured that the minimum separation of echoes was 1 ms. The filters are simulated in the continuous-time domain using the image-source method [33]. They are then smoothed, sampled and convolved with the source signal at Fs = 16 kHz according to the measurement model described in Sec. 2.1. The ground-truth echo locations and weights are saved in the time domain before smoothing and are hence off-grid. The M-channel input signals used are 0.25 s long, i.e., N = 0.25 Fs = 4000 samples. On the other hand, for the artificial scenario, the speech source was discretely convolved with sparse filters of similar length with K = 7 nonzero elements each, resulting in N = 4000 samples of M-channel observations. The ground-truth echo locations and weights are hence on-grid in this case. All weights take values between 0 and 1.

For MULAN, the DFT (eq. 10) is applied to each input signal using a grid $\mathcal{F}$ of F = 401 regularly spaced frequencies between 200 Hz and 2000 Hz. 
Such a choice of frequency range avoids low-frequency bands, which are often noisy in real scenarios, while focusing on a typical spectral range for speech, but it can easily be adapted depending on the application. An odd number of frequencies was chosen, since it has proven to be good practice [24]. We use 20 random initializations as a good compromise between global convergence and computational complexity, max_iter = 1000 and conv_thresh = 0.1%. The two baseline methods chosen are CR [12] as described in (12) and its LASSO-type extension [16] as described in (13). The filter lengths L were always set to the true lengths (which never exceed 0.05 Fs) and the sparsity parameter $\lambda$ for LASSO was manually set to $\lambda = 10^{-3}$, which empirically showed best performance among the choices $\{10^{-6}, 10^{-5}, \ldots, 10^{2}\}$, although any value below $10^{-2}$ showed similar performance.

We used two distinct metrics to evaluate Dirac location estimation and Dirac weight estimation. For the first one, a test is counted as successful if the root mean squared error (RMSE) of the 7 × 2 = 14 Dirac locations is below 1 sample (1/Fs seconds), and the success rate out of 100 tests is provided. This metric only counts fully successful channel recovery and penalizes tests where some Diracs are missed or completely off. For the second one, we provide the weight RMSE of successful tests only. This is to avoid counting weights estimated at wrong Dirac locations. These metrics for 100 on- and off-grid tests and all three methods are shown in Table 1. We can see that for the on-grid case, both CR and MULAN perform well, CR even achieving more location recoveries than MULAN. This is not too surprising since CR is based on the on-grid artificial model, while MULAN uses an off-grid model. We observed that LASSO struggles with the proximity of Diracs and did not perform as well. 
In terms of weight estimation, MULAN yields errors which are 2 to 3 orders of magnitude smaller than those of the two competing methods, which is very encouraging. In the more realistic off-grid scenario, we observed that the localization errors of CR and LASSO degrade drastically, with almost no successful channel estimation. Meanwhile, MULAN achieves near-exact full recovery of locations and weights in 70 out of 100 tests.

4.2 Influence of K, M, F on recovery rate

We now conduct further experiments to check the influence of the parameters K, M and F on the ability of MULAN to fully recover Dirac locations and weights off-grid. We show results with 20 random initializations, F = 201 or F = 401 in the same frequency range as before, M ∈ {2, ..., 7} and K ∈ {2, ..., 7}. The following RMSE thresholds were defined for successful recovery: 1 sample for locations, as before, and 10^-2 for weights. 100 experiments were performed for every parameter set. Results for F = 201 can be seen in Figures 2 and 3, and for F = 401 in Figures 4 and 5. As can be seen, a higher recovery rate is generally observed when fewer echoes are present and more frequencies are used. On the other hand, the number of sensors does not significantly affect recovery performance. This is expected, since O(KM) parameters are estimated from O(MF) observations. Increasing the number of random initializations was also observed to increase success by alleviating the non-convexity of the problem, at the cost of increased computational requirements.

case       method              full location recovery   weight RMSE
on-grid    CR [12]             92 %                     0.0390
on-grid    LASSO [16]          13 %                     0.155
on-grid    MULAN (proposed)    59 %                     0.00016
off-grid   CR [12]             1 %                      0.0442
off-grid   LASSO [16]          2 %                      0.0346
off-grid   MULAN (proposed)    70 %                     0.00048

Table 1: Ratio of full Dirac location recovery (RMSE < 1 sample = 1/Fs seconds) and weight RMSE (successful cases only) for three methods over 100 on-grid and 100 off-grid tests.
Weights take values between 0 and 1.

Figure 2: Rate of location retrieval for F = 201.
Figure 3: Rate of weight retrieval for F = 201.
Figure 4: Rate of location retrieval for F = 401.
Figure 5: Rate of weight retrieval for F = 401.

5 Conclusion

To the best of the authors' knowledge, this paper introduced the first method enabling blind and off-grid recovery of echo locations and weights from discrete-time multichannel measurements. Future work will include alternative initialization schemes and convex relaxations in the spirit of [22] for the proposed cost function, extensions to sparse-spectrum signals and noisy measurements, and applications to dereverberation and audio-based room-shape reconstruction. A better theoretical understanding of recovery guarantees as a function of M, K, F and N will also be sought. The code for this submission can be found at: https://github.com/epfl-lts2/mulan.

References

[1] Lindsay Kleeman and Roman Kuc. Sonar sensing. In Springer Handbook of Robotics, pages 753–782. Springer, 2016.

[2] Haruo Sato, Michael C. Fehler, and Takuto Maeda. Seismic Wave Propagation and Scattering in the Heterogeneous Earth, volume 496. Springer, 2012.

[3] Ivan Dokmanić, Reza Parhizkar, Andreas Walther, Yue M. Lu, and Martin Vetterli. Acoustic echoes reveal room shape. Proceedings of the National Academy of Sciences, 110(30):12186–12191, 2013.

[4] Marco Crocco, Andrea Trucco, Vittorio Murino, and Alessio Del Bue. Towards fully uncalibrated room reconstruction with sound. In Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), pages 910–914. IEEE, 2014.

[5] Marco Crocco and Alessio Del Bue. Estimation of TDOA for room reflections by iterative weighted L1 constraint. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3201–3205.
IEEE, 2016.

[6] Alin Achim, Benjamin Buxton, George Tzagkarakis, and Panagiotis Tsakalides. Compressive sensing for ultrasound RF echoes using α-stable distributions. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE, pages 4304–4307. IEEE, 2010.

[7] Niccolò Antonello, Toon van Waterschoot, Marc Moonen, and Patrick A. Naylor. Identification of surface acoustic impedances in a reverberant room using the FDTD method. In 14th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 114–118. IEEE, 2014.

[8] Nancy Bertin, Srđan Kitić, and Rémi Gribonval. Joint estimation of sound source location and boundary impedance with physics-driven cosparse regularization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6340–6344. IEEE, 2016.

[9] Ivan Dokmanić, Robin Scheibler, and Martin Vetterli. Raking the cocktail party. IEEE Journal of Selected Topics in Signal Processing, 9(5):825–836, 2015.

[10] Robin Scheibler, Diego Di Carlo, Antoine Deleforge, and Ivan Dokmanić. Separake: Source separation with a little help from echoes. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[11] Robert Price and Paul E. Green. A communication technique for multipath channels. Proceedings of the IRE, 46(3):555–570, 1958.

[12] Guanghan Xu, Hui Liu, Lang Tong, and Thomas Kailath. A least-squares approach to blind channel identification. IEEE Transactions on Signal Processing, 43(12):2982–2993, 1995.

[13] Yingbo Hua. Fast maximum likelihood for blind identification of multiple FIR channels. IEEE Transactions on Signal Processing, 44(3):661–672, 1996.

[14] Karim Abed-Meraim, Philippe Loubaton, and Eric Moulines. A subspace algorithm for certain blind identification problems.
IEEE Transactions on Information Theory, 43(2):499–511, 1997.

[15] Abdeldjalil Aissa-El-Bey and Karim Abed-Meraim. Blind SIMO channel identification using a sparsity criterion. In IEEE 9th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pages 271–275. IEEE, 2008.

[16] Yuanqing Lin, Jingdong Chen, Youngmoo Kim, and Daniel D. Lee. Blind channel identification for speech dereverberation using l1-norm sparse learning. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 921–928. Curran Associates, Inc., 2008.

[17] Abla Kammoun, Abdeljalil Aissa El Bey, Karim Abed-Meraim, and Sofiene Affes. Robustness of blind subspace based techniques using Lp quasi-norms. In IEEE 11th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pages 1–5. IEEE, 2010.

[18] Konrad Kowalczyk, Emanuël A. P. Habets, Walter Kellermann, and Patrick A. Naylor. Blind system identification using sparse learning for TDOA estimation of room reflections. IEEE Signal Processing Letters, 20(7):653–656, 2013.

[19] Xiaofei Li, Sharon Gannot, Laurent Girin, and Radu Horaud. Multichannel identification and nonnegative equalization for dereverberation and noise reduction based on convolutive transfer function. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1755–1768, 2018.

[20] Kiryung Lee, Ning Tian, and Justin Romberg. Fast and guaranteed blind multichannel deconvolution under a bilinear system model. IEEE Transactions on Information Theory, 64(7):4792–4818, 2018.

[21] Emmanuel J. Candès and Michael B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.

[22] Yuejie Chi.
Guaranteed blind sparse spikes deconvolution via lifting and convex optimization. IEEE Journal of Selected Topics in Signal Processing, 10(4):782–794, 2016.

[23] Martin Vetterli, Pina Marziliano, and Thierry Blu. Sampling signals with finite rate of innovation. IEEE Transactions on Signal Processing, 50(6):1417–1428, 2002.

[24] Thierry Blu, Pier L. Dragotti, Martin Vetterli, Pina Marziliano, and Lionel Coulot. Sparse sampling of signal innovations. IEEE Signal Processing Magazine, 25(2):31–40, March 2008.

[25] Jong C. Ye, Jong M. Kim, Kyong H. Jin, and Kiryung Lee. Compressive sampling using annihilating filter-based low-rank interpolation. IEEE Transactions on Information Theory, 63(2):777–801, February 2017.

[26] Rein van den Boomgaard and Rik van der Weij. Gaussian convolutions: numerical approximations based on interpolation. In Scale-Space and Morphology in Computer Vision: Third International Conference, Scale-Space 2001, Vancouver, Canada, July 7–8, 2001, Proceedings, pages 205–214. Springer, 2001.

[27] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[28] Yuejie Chi, Louis L. Scharf, Ali Pezeshki, and A. Robert Calderbank. Sensitivity to basis mismatch in compressed sensing. IEEE Transactions on Signal Processing, 59(5):2182–2195, 2011.

[29] Petre Stoica and Randolph L. Moses. Introduction to Spectral Analysis. Prentice Hall, Upper Saddle River, NJ, 1997.

[30] Sunav Choudhary and Urbashi Mitra. On the properties of the rank-two null space of non-sparse and canonical-sparse blind deconvolution. IEEE Transactions on Signal Processing, 66(14):3696–3709, 2018.

[31] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM.
NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

[32] Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. CoRR, abs/1710.04196, 2017.

[33] Jont B. Allen and David A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.