{"title": "DeepWave: A Recurrent Neural-Network for Real-Time Acoustic Imaging", "book": "Advances in Neural Information Processing Systems", "page_first": 15300, "page_last": 15312, "abstract": "We propose a recurrent neural-network for real-time reconstruction of acoustic camera spherical maps. The network, dubbed DeepWave, is both physically and algorithmically motivated: its recurrent architecture mimics iterative solvers from convex optimisation, and its parsimonious parametrisation is based on the natural structure of acoustic imaging problems.\nEach network layer applies successive filtering, biasing and activation steps to its input, which can be interpreted as generalised deblurring and sparsification steps. To comply with the irregular geometry of spherical maps, filtering operations are implemented efficiently by means of graph signal processing techniques.\nUnlike commonly-used imaging network architectures, DeepWave is moreover capable of directly processing the complex-valued raw microphone correlations, learning how to optimally back-project these into a spherical map. We propose moreover a smart physically-inspired initialisation scheme that attains much faster training and higher performance than random initialisation.\nOur real-data experiments show DeepWave has similar computational speed to the state-of-the-art delay-and-sum imager with vastly superior resolution. 
While developed primarily for acoustic cameras, DeepWave could easily be adapted to neighbouring signal processing fields, such as radio astronomy, radar and sonar.", "full_text": "DeepWave: A Recurrent Neural-Network\n\nfor Real-Time Acoustic Imaging\n\nMatthieu Simeoni \u2217\n\nIBM Zurich Research Laboratory\n\nmeo@zurich.ibm.com\n\nPaul Hurley\n\nWestern Sydney University\n\npaul.hurley@westernsydney.edu.au\n\nSepand Kashani \u2020\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL)\n\nsepand.kashani@epfl.ch\n\nMartin Vetterli\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL)\n\nmartin.vetterli@epfl.ch\n\nAbstract\n\nWe propose a recurrent neural-network for real-time reconstruction of acoustic\ncamera spherical maps. The network, dubbed DeepWave, is both physically and\nalgorithmically motivated: its recurrent architecture mimics iterative solvers from\nconvex optimisation, and its parsimonious parametrisation is based on the natural\nstructure of acoustic imaging problems. Each network layer applies successive\n\ufb01ltering, biasing and activation steps to its input, which can be interpreted as gener-\nalised deblurring and sparsi\ufb01cation steps. To comply with the irregular geometry of\nspherical maps, \ufb01ltering operations are implemented ef\ufb01ciently by means of graph\nsignal processing techniques. Unlike commonly-used imaging network architec-\ntures, DeepWave is moreover capable of directly processing the complex-valued\nraw microphone correlations, learning how to optimally back-project these into\na spherical map. We propose moreover a smart physically-inspired initialisation\nscheme that attains much faster training and higher performance than random ini-\ntialisation. 
Our real-data experiments show DeepWave has similar computational speed to the state-of-the-art delay-and-sum imager with vastly superior resolution. While developed primarily for acoustic cameras, DeepWave could easily be adapted to neighbouring signal processing fields, such as radio astronomy, radar and sonar.

1 Introduction

Motivation  An acoustic camera (AC) [26, 8, 18, 24] is a multi-modal imaging device that allows one to visualise in real-time sound emissions from every direction in space. This is typically achieved by overlaying on the live video from an optical camera a heatmap representing the intensity of the ambient directional sound field, recovered from the simultaneous recordings of a microphone array [3, 42]. Most commercial acoustic cameras recover the sound intensity field by combining linearly the correlated microphone recordings with a Delay-And-Sum (DAS) beamformer [42, Chapter 5]. The beamformer acts as an angular filter [20, 21], steering sequentially the array sensitivity pattern –or beamshape– towards various directions where the sound intensity field is probed. Acoustic images obtained this way are cheap to compute, but are blurred by the beamshape of the microphone array,

∗Corresponding author. Matthieu Simeoni is also affiliated to the École Polytechnique Fédérale de Lausanne (EPFL), with email address matthieu.simeoni@epfl.ch
†Matthieu Simeoni and Sepand Kashani have contributed equally to this work. Sepand Kashani was in part supported by the Swiss National Science Foundation grant number 200021 181978/1, "SESAM - Sensing and Sampling: Theory and Algorithms".

Preprint. Under review.

and hence exhibit poor angular resolution [49, 7, 48]. The severity of this blur can be shown [53] to be proportional to the ratio λ/D, where D is the diameter of the microphone array and λ the sound wavelength.
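The λ/D scaling can be made concrete with a quick back-of-the-envelope computation. This is only an illustrative sketch: the speed of sound (343 m/s) and the 3 cm optical aperture below are assumed values, not taken from the paper.

```python
# Diffraction-limited angular resolution via the lambda/D rule of thumb.
# Assumed illustrative values: speed of sound 343 m/s, 3 cm optical aperture.
import math

def resolution_deg(wavelength_m: float, aperture_m: float) -> float:
    """Angular resolution ~ wavelength / aperture, converted to degrees."""
    return math.degrees(wavelength_m / aperture_m)

# 30 cm microphone array at 5 kHz.
acoustic = resolution_deg(343.0 / 5_000, 0.30)   # on the order of 10 degrees
# Optical camera at 790 THz (violet), assumed 3 cm aperture.
optical = resolution_deg(3e8 / 790e12, 0.03)     # on the order of 1e-4 degrees
print(f"acoustic: {acoustic:.1f} deg, optical: {optical:.1e} deg")
```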
Because of the relatively large wavelengths of acoustic waves in the audible range, this blur can be significant in practice: a 30 cm diameter microphone array has an angular resolution at 5 kHz (an E♭) of approximately 10 degrees, against 7·10⁻⁴ degrees for a standard optical camera at 790 THz (violet). Moreover, acoustic cameras are often deployed in confined environments [34], requiring them to be as compact and portable as possible, which limits3 further the achievable angular resolution.

The advent of compressed sensing techniques [14, 44] –and their wide adoption in imaging sciences [54, 4, 32]– has inspired algorithmic solutions [48, 7, 11, 12] to the acoustic imaging problem, promising vastly improved angular resolutions. Unfortunately, these methods proved ill-suited for real-time purposes. Indeed, they often rely on iterative solvers, such as proximal gradient descent (PGD) [37] or its accelerated variants [2, 31]. While exhibiting a fast convergence rate [2], such methods still require on the order of a few dozen iterations to converge in practice, making them unable to cope with the high refresh-rate4 of acoustic cameras. For this reason, and despite their clear superiority in terms of resolving power, nonlinear imaging methods have not yet replaced the suboptimal DAS imager in the software stack of commercial acoustic cameras.

The recent eruption of deep learning [33, 56, 10] in the field of imaging sciences may however seal the fate of DAS for good. Indeed, this new imaging paradigm leverages neural networks [28] to reduce dramatically the image formation time. Unlike compressed-sensing methods, which proceed iteratively, neural networks encode the image reconstruction process in a cascade of linear and nonlinear transformations trained on a very large number of input/output example pairs.
Once\nproperly trained, a neural-network can be ef\ufb01ciently evaluated for some input data to produce\nimages of high quality, with similar accuracy and resolution as state-of-the-art compressed-sensing\nmethods [33]. Network architectures used for inverse imaging [22, 17, 56, 10, 40] are most often\nconvolutional neural-networks (CNNs), directly adapted from generic architectures developed for\nimage classi\ufb01cation and segmentation [45]. While suitable for image processing tasks such as\ndenoising, super-resolution or deblurring [38, 6], such architectures are ill-suited [33] for more\ncomplex image reconstruction problems where the input data may not consist of an image, as is\nthe case in biomedical imagery [4, 32], interferometry [54] or acoustic imaging. Moreover, and\nparticularly limiting for our current purposes, standard convolutional architectures cannot handle\nimages with non-Euclidean domains [13] such as spherical maps [41] produced by omnidirectional\nacoustic or optical cameras.\nTo overcome these limitations, recurrent architectures [16, 50, 30, 33] have been proposed, by\nunrolling iterative convex optimisation algorithms. Such networks are not only able to handle non-\nimage inputs, but also have greater interpretability than generic CNNs. For example, Gregor and\nLeCun proposed in their pioneering work [16] a recurrent neural-network (RNN) dubbed LISTA5,\ninspired from the popular iterative soft-thresholding algorithm (ISTA)[2].6 Their network can be seen\nas generalising ISTA, allowing for the normally \ufb01xed gradient and proximal steps occurring at each\niteration of the algorithm to be learnt from the data: update steps of ISTA are replaced by a cascade\nof recurrent layers with trainable parameters. The depth of the resulting RNN is typically much\nsmaller than the number of iterations required for ISTA to converge. 
Roughly speaking, the network is learning shortcuts in the reconstruction space, allowing it to achieve a prescribed reconstruction accuracy faster than gradient-based iterative methods.7

While the effectiveness of LISTA was verified on small images from the MNIST dataset (784 pixels) [16], its application to large-scale imaging problems remains challenging. This is mainly due to the huge number of weights parametrising the network which, in the fully-connected case, grows as the square of the number of pixels. Storing8 –let alone learning– all those weights quickly becomes intractable for increasing resolutions. As a potential fix, Gregor and LeCun recommended sparsifying the network by pruning layer connections. While they showed that such a pruning could reduce the number of parameters in the network by 80% without affecting too much the performance of the latter, this is still insufficient for large-scale problems, and additional structure must be considered on network layers.

Figure 1: DeepWave's recurrent architecture (1) for L = 2 layers and random initialisation. Learnable parameters of the network are denoted by dashed boxes. Affine operations are denoted by white boxes and nonlinear activations by grey boxes.

3Remember that the blur spread is inversely proportional to the microphone array diameter.
4An acoustic camera typically updates the acoustic image a dozen times per second.
5LISTA stands for learned iterative soft-thresholding algorithm.
6ISTA is an instance of proximal gradient descent for penalised basis pursuit problems [52].
7Of course, such shortcuts will most likely only be valid for the distribution of inputs and outputs implicitly defined by the training set, which should hence be carefully crafted for the network to generalise well in practice.
8For a 1 megapixel image, the weights parametrising the network would be approximately 8 Gb in size.
Such structure is however often very dependent on the problem at hand.

Contributions  In this work, we propose the first realistic architecture of a LISTA neural-network adapted to acoustic imaging. Our custom architecture, dubbed DeepWave, is capable of rendering high-resolution spherical maps of real-life sound intensity fields in milliseconds. DeepWave is tailored to the acoustic imaging problem, leveraging fully its underlying structure so as to minimise the number of network parameters. The latter is easy to train, with a typical training time of less than an hour on a general-purpose CPU. Unlike most state-of-the-art neural-network architectures, it moreover readily supports complex-valued input vectors, making it capable of directly processing the raw correlated microphone recordings. Assuming a microphone array with M microphones, the instantaneous covariance matrix Σ̂ ∈ C^{M×M} of the microphone recordings is processed by the network as follows (see also fig. 1):

x_l = σ( P_θ(L) x_{l−1} + [B̄ ∘ B]^H vec(Σ̂) − τ ),  l = 1, . . . , L,    (1)

where vec : C^{M×M} → C^{M²} is the vectorisation operator and ∘ denotes the Khatri-Rao product (see appendix A9 for definitions). The neurons {x_1, . . . , x_L} ⊂ R^N_+ at the output of each layer l of the depth-L neural-network correspond to the acoustic image as it is processed by the network, with N the number of pixels. The neuron x_0 ∈ R^N_+ defines the initial state of the network. The nonlinear activation function10 σ : R → R induces sparsity in the acoustic image, and is inspired by the proximal operator of an elastic-net penalty [37].
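Concretely, one pass through recursion (1) can be sketched in numpy as follows. This is a minimal sketch with assumed toy shapes: the graph filter is applied as K matrix-vector products, and a plain ReLU stands in for the paper's activation σ.

```python
import numpy as np

def deepwave_forward(Sigma, x0, theta, B, tau, Lap, n_layers):
    """Sketch of recursion (1): x_l = sigma(P_theta(L) x_{l-1} + [conj(B) ∘ B]^H vec(Sigma) - tau)."""
    # Back-projection: [conj(B) ∘ B]^H vec(Sigma) = diag(B^H Sigma B),
    # real-valued whenever Sigma is Hermitian (np.real drops numerical residue).
    y = np.real(np.diag(B.conj().T @ Sigma @ B))

    def P(v):  # polynomial graph filter: P_theta(L) v = sum_k theta_k L^k v
        out, Lv = theta[0] * v, v
        for k in range(1, len(theta)):
            Lv = Lap @ Lv
            out = out + theta[k] * Lv
        return out

    x = x0
    for _ in range(n_layers):
        x = np.maximum(P(x) + y - tau, 0.0)  # ReLU in place of sigma
    return x

# Toy usage: M = 3 microphones, N = 4 pixels, L = 2 layers (all values assumed).
rng = np.random.default_rng(0)
C = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
Sigma = C @ C.conj().T                                   # Hermitian PSD covariance
B = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
Lap = np.diag([1., 2., 2., 1.]) - np.diag([1., 1., 1.], 1) - np.diag([1., 1., 1.], -1)
x = deepwave_forward(Sigma, np.zeros(4), np.array([0.5, 0.1]), B, np.zeros(4), Lap, 2)
```

Note that the filter is never formed as a dense N×N matrix: only K matrix-vector products with the (sparse) Laplacian are needed.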
The remaining quantities, namely P_θ(L), B and τ, are trainable parameters of the network, with various roles:

• Deblurring: the matrix P_θ(L) := Σ_{k=0}^{K} θ_k L^k ∈ R^{N×N} can be interpreted as a deblurring matrix, cleaning potential artefacts from the array beamshape. Following the approach of [41], it is defined as a polynomial of the graph Laplacian L ∈ R^{N×N} based on the connectivity graph of the spherical tessellation in use, with learnable coefficients θ = [θ_0, . . . , θ_K] ∈ R^{K+1}. Such a parametrisation permits notably the interpretation of P_θ(L) as a finite-support filter defined on the tessellation graph. Moreover, fast graph convolution algorithms are available for such filters [13].

• Back-projection: the operation [B̄ ∘ B]^H vec(Σ̂) = diag(B^H Σ̂ B) (A.8) is a back-projection, mapping the raw microphone correlations to the image domain. Thanks to the convenient Khatri-Rao structure, this linear operation depends only on the matrix B ∈ C^{M×N}.

• Bias: the vector τ ∈ R^N is a non-uniform bias, boosting or shrinking the neurons of the network. Since only positive neurons are activated by the nonlinearity σ, this biasing operation helps sparsify the final acoustic image.

9In all that follows, labels prefixed with roman letters refer to elements of the supplementary material.
10Typified by a rectified linear unit.

The total number of learnable coefficients in DeepWave is linear in the number of pixels. The rationale behind DeepWave's architecture is detailed in section 2, with theoretical justifications for the structures of the deblurring and back-projection linear operators. In section 3, we discuss network training, including initialisation and regularisation. We moreover derive the forward- and
We moreover derive the forward- and\nbackward-propagation recursions11 for our custom architecture, required for forming gradient steps.\nFinally, we test the architecture on synthetic as well as real data acquired with the Pyramic array\n[5, 46]. DeepWave is shown to have similar resolving power as state-of-the-art compressed-sensing\nmethods, with a computational overhead similar to the DAS imager. To our knowledge, this is the \ufb01rst\ntime a nonlinear imager of the kind achieves real-time performance on a standard computing platform.\nWhile developed primarily for acoustic cameras, DeepWave can easily be applied in neighbouring\narray signal processing \ufb01elds [27], including radio astronomy, radar and sonar technologies.\n\n2 Network architecture\n\nIn this section, we proceed similarly to [16, 50, 30] and construct DeepWave by studying the update\nequations of an iterative solver, namely proximal gradient descent applied to acoustic imaging.\n\n2.1 Proximal gradient descent for acoustic imaging\n\nIn all that follows, we model the sound intensity \ufb01eld as a discrete spherical map with resolution\nN, speci\ufb01ed by an intensity vector x \u2208 RN\n+ and a tessellation \u0398 = {r1, . . . , rN} \u2282 S2. Spherical\ntessellations [19, 15] can be viewed as pixelation schemes for spherical geometries (see appendix B.1).\nAs is customary in compressed-sensing, we propose to recover the sound intensity map by solving a\nconvex optimisation problem (see appendix C):\n\n(cid:13)(cid:13)(cid:13) \u02c6\u03a3 \u2212 A diag(x)AH(cid:13)(cid:13)(cid:13)2\n\nF\n\n+ \u03bb(cid:2)\u03b3(cid:107)x(cid:107)1 + (1 \u2212 \u03b3)(cid:107)x(cid:107)2\n\n(cid:3) ,\n\n2\n\n\u02c6x = arg min\nx\u2208RN\n\n+\n\n1\n2\n\n(2)\n\nwhere (cid:107)\u00b7(cid:107)F denotes the Frobenius norm, \u03b3 \u2208]0, 1[ and \u03bb > 0 are hyperparameters, and \u02c6\u03a3 \u2208 CM\u00d7M is\nthe empirical covariance matrix of the microphone recordings. 
In a far-field context, the forward map A ∈ C^{M×N} –linking the intensity vector to the microphone recordings– is commonly modelled by the so-called steering matrix [27]: [A]_{mn} := exp(−2πj⟨p_m, r_n⟩/λ_0), where {p_1, . . . , p_M} ⊂ R³ are the microphone locations and λ_0 > 0 the sound wavelength. Using properties (A.5) and (A.6) of the vectorisation operator and the Frobenius norm [23, 53], problem (2) can be re-written in vectorised form as:

x̂ = arg min_{x ∈ R^N_+} (1/2) ‖vec(Σ̂) − (Ā ∘ A) x‖²₂ + λ[γ‖x‖₁ + (1−γ)‖x‖²₂],    (3)

where ∘ denotes the Khatri-Rao product (see definition A.3). Problem (3) is an elastic-net penalised least-squares problem [57], which seeks an optimal12 trade-off between data-fidelity and group-sparsity. Group-sparsity is in this context better suited than traditional sparsity since acoustic sources are often diffuse. It is worth noting that, since the elastic-net functional is strictly convex for γ ∈ [0, 1[, problem (3) admits a unique solution.
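The far-field data model behind (2)–(3) is easy to simulate. The sketch below uses an assumed random array geometry and speed of sound (not the paper's setup); it builds the steering matrix, forms the modelled covariance for two point sources, and checks that the matched-filter "dirty map" diag(A^H Σ̂ A) peaks at a true source direction.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 32, 200                       # assumed array/map sizes for illustration
wavelength = 343.0 / 2_000           # 2 kHz tone, speed of sound ~343 m/s

p = 0.15 * rng.standard_normal((M, 3))           # microphone positions (metres)
r = rng.standard_normal((N, 3))
r /= np.linalg.norm(r, axis=1, keepdims=True)    # unit directions on S^2

# Steering matrix: [A]_mn = exp(-2*pi*j * <p_m, r_n> / lambda0)
A = np.exp(-2j * np.pi * (p @ r.T) / wavelength)

# Data model of (2): Sigma = A diag(x) A^H for an intensity map x.
x = np.zeros(N)
x[[10, 50]] = 1.0                                # two point sources
Sigma = A @ np.diag(x) @ A.conj().T

# Matched-filter / DAS-style dirty map: diag(A^H Sigma A).
dirty = np.real(np.diag(A.conj().T @ Sigma @ A))
```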
The latter can moreover be approximated by means of proximal gradient descent (PGD) [2], whose update equations are given here by (see appendix D):

x_k = ReLu( [ x_{k−1} − α (Ā ∘ A)^H ( (Ā ∘ A) x_{k−1} − vec(Σ̂) ) − λαγ ] / (2λα(1−γ) + 1) ),  k ≥ 1,    (4)

where x_0 ∈ R^N is arbitrary, α ≤ 1/‖Ā ∘ A‖²₂ is the step size and ReLu(x) := max(x, 0) is the rectified linear unit [29], applied element-wise to a real vector.13 The sequence of iterates {x_k}_{k∈N} defined in (4) reduces the objective function in (3) at a rate O(1/k) [2]. Accelerated variants of proximal gradient descent have been proposed [2], which modify (4) with an extra momentum term:

y_k = ReLu( [ x_{k−1} − α (Ā ∘ A)^H ( (Ā ∘ A) x_{k−1} − vec(Σ̂) ) − λαγ ] / (2λα(1−γ) + 1) ),  k ≥ 1,
x_k = y_k + ω_k ( y_k − y_{k−1} ),    (5)

where the momentum sequence {ω_k}_{k∈N} can be designed in various ways [31, 9]. In our experiments, we will use (5) as a baseline for speed comparisons, where ω_k is updated according to Chambolle and Dossal's strategy [9]: ω_k = (k − 1)/(k + d), k ≥ 0, with d = 50 [31]. The accelerated proximal gradient descent (APGD) method thus obtained is the fastest reported in the literature, with convergence rate o(1/k²) [31]. Finally, we leverage the formulae (Ā ∘ A) x = vec(A diag(x) A^H) (A.5) and (Ā ∘ A)^H vec(R) = diag(A^H R A) (A.8) to compute gradient steps efficiently in (5).

11DeepWave implementation can be found on https://github.com/imagingofthings/DeepWave.
12The notion of optimality is defined here by the penalty parameter λ.
13Note that with x_0 ∈ R^N, every gradient step produces a real vector.

2.2 DeepWave: a PGD-inspired RNN for fast acoustic imaging

In practice PGD is terminated according to some stopping criterion. The intensity map x_L obtained after L iterations of (4) can then be seen as the output of an RNN with depth L and intermediate neurons linked by the recursion formula:

x_l = ReLu( D x_{l−1} + B vec(Σ̂) − τ ),  l = 1, . . . , L.    (6)

We call this RNN the oracle RNN, since its weights D ∈ R^{N×N}, B ∈ C^{N×M²} and τ ∈ R^N are not learnt but simply given to us by identifying (6) with (4):

D = (1/β) [ I − α (Ā ∘ A)^H (Ā ∘ A) ],  B = (α/β) (Ā ∘ A)^H,  τ = (λαγ/β) 1_N,    (7)

where β = 2λα(1−γ) + 1. An analysis of (7) allows us moreover to interpret physically the affine operations performed by the oracle RNN. The matrix B first is a back-projection operator, mapping the vectorised correlation matrix into a spherical map by applying the adjoint of the forward operator used in (3). The resulting spherical map is called a dirty map, and is equivalent to the DAS image [53, Section 5.2][55]. The matrix D then is a deblurring operator, which subtracts at each iteration a fraction of the array beamshape from the spherical map, hence cleaning the latter of blur artefacts. The vector τ finally is an affine shrinkage operator, which biases uniformly the spherical map. The
The\nlatter permits \u2013in conjunction with the recti\ufb01ed linear unit\u2013 the sparsi\ufb01cation of the spherical map\nand hence improve its angular resolution.\nSince the oracle RNN is merely a reinterpretation of PGD, it inherits all its properties. In particular, it\nis capable of solving (3) with high accuracy for arbitrary input correlation matrices. Unfortunately,\nthis great generalisability is typically obtained at the price of a very large number14 of layers L,\nresulting in impractical reconstruction times. If one is however willing to sacri\ufb01ce some of this\ngeneralisability, it is possible to reduce drastically the network depth by unfreezing the weights D,\nB, \u03c4 in (6), and allowing them to be learnt for some speci\ufb01c input distribution. This idea was \ufb01rst\nexplored in the context of sparse coding by Gregor and LeCun [16], resulting in the LISTA network.\nA fully-connected architecture, corresponding to unconstrained D, B and \u03c4 , would however result\nin O(N 2) weights to be learnt, which is unfeasible in large-scale acoustic imaging problems. To\novercome this issue, we propose in the next paragraphs a parsimonious parametrisation of D and B.\nThe resulting RNN architecture, dubbed DeepWave, is given in (1) and depicted in \ufb01g. 1.\nParametrisation of D Our parametrisation of D is motivated by the following result, characteris-\ning the oracle deblurring kernel for spherical microphone arrays[42] (see proof in appendix E).\nProposition 1. Consider a spherical microphone array, with diameter D and microphone directions\n{\u02dcp1, . . . , \u02dcpM} \u2282 S2, forming a near-regular tessellation of the sphere. Then, we have\n\n(cid:104)\nI \u2212 \u03b1(cid:0)A \u25e6 A(cid:1)H(cid:0)A \u25e6 A(cid:1)(cid:105)\n\n(cid:20)\n\n(cid:39)\n\nij\n\n(cid:18) D\n\n\u03bb0\n\n(cid:19)(cid:21)\n\n\u03b4ij \u2212 \u03b1M 2 sinc2\n\n(cid:107)ri \u2212 rj(cid:107)\n\n, \u2200i, j \u2208 {1, . . . 
, N} (8)\n\n14Even with momentum acceleration, PGD typically requires more than 50 iterations to converge. The oracle\n\nRNN obtained by unrolling PGD will consequently be very deep.\n\n5\n\n\fAlgorithm 1 DeepWave forward propagation\n1: Input: \u02c6\u03a3t, x0\n\n2: Output: Lt \u2208 R+,(cid:8)sl\n\nt , \u02c6xt, \u03b8, B, \u03c4 , \u03c3\n\nl=1,...,L \u2282 RN\n\n(cid:9)\n\nt\n\n3:\n4: yt \u2190 diag(BH \u02c6\u03a3tB) \u2212 \u03c4\n5: for l in [1, . . . , L] do\nt \u2190 P\u03b8(L)xl\u22121\nsl\n6:\nt \u2190 \u03c3(sl\nxl\n7:\nt)\n8: Lt \u2190 1\n\n(cid:13)(cid:13)\u02c6xt \u2212 xL\n\nt + yt\n2 /(cid:107)\u02c6xt(cid:107)2\n\n(cid:13)(cid:13)2\n\n2\n\n2\n\nt\n\nt\n\nl=1,...,L\n\n(cid:9)\n(cid:1) /(cid:107)\u02c6xt(cid:107)2\n\nAlgorithm 2 DeepWave backward propagation\n1: Input: \u02c6\u03a3t, x0\n2: Output: \u2202\u03b8 \u2208 RK+1, \u2202B \u2208 CM\u00d7N , \u2202\u03c4 \u2208 RN\n\nt , \u02c6xt, \u03b8, B, \u03c3, (cid:8)sl\n\n3: (\u2202x, \u2202\u03b8, \u2202\u03c4 ) \u2190 ((cid:0)\u03c3(sL\nt)(cid:1) \u2202x\n\u2202s \u2190 diag(cid:0)\u03c3(cid:48)(sl\n\n4: for l in [L, . . . , 1] do\n5:\n\u2202x \u2190 P\u03b8(L)\u2202s\n6:\n\u2202\u03c4 \u2190 \u2202\u03c4 \u2212 \u2202s\n7:\n[\u2202\u03b8]k \u2190 [\u2202\u03b8]k + \u2202sT Tk(L) \u03c3(sl\u22121\n8:\n9: \u2202B \u2190 \u22122 \u02c6\u03a3tB diag (\u2202\u03c4 )\n\nt ) \u2212 \u02c6xt\n\n2 , 0, 0)\n\n)\n\nt\n\nFigure 2: Forward and backward algorithms to compute gradients of Lt with respect to \u03b8, B, \u03c4 . For\nnotational simplicity we use the shorthand \u2202\u03b1 = \u2202Lt/\u2202\u03b1, and assume \u03c3(s0\n\nt .\nt ) = x0\n\nwhere \u03bb0 is the wavelength, \u03b4ij denotes the Kronecker delta and sinc(x) := sin(\u03c0x)/\u03c0x is the\ncardinal sine. 
Moreover, the approximation (8) is extremely good for M ≥ 3⌊2πD/λ_0⌋².

Proposition 1 tells us that, for spherical arrays with a sufficient number of microphones15, the oracle deblurring operator D in (7) corresponds actually to a sampled zonal kernel [35]: [D]_{ij} = κ(‖r_i − r_j‖) for some κ : R_+ → R. Since zonal kernels are used to define spherical convolutions [35], D can hence be seen as a discrete convolution operator over the tessellation in use Θ = {r_1, . . . , r_N}. Its bandwidth is moreover essentially finite, since the coefficients [D]_{ij} decay as 1/‖r_i − r_j‖². As discussed in [41, 13], discrete spherical convolution operators with finite scope can be efficiently represented and implemented by means of graph signal processing [47] techniques. This leads us to consider the following parametrisation (see appendix B.3 for details): D = P_θ(L) := Σ_{k=0}^{K} θ_k L^k, where θ = [θ_0, . . . , θ_K] ∈ R^{K+1}, K controls the scope of the discrete convolution and L ∈ R^{N×N} is the Laplacian [47] associated to the convex-hull graph of Θ. Note that with this parametrisation, the number of parameters characterising D drops from N² to K + 1, with K ≪ N.

Parametrisation of B  The oracle back-projection operator (7) admits a factorisation in terms of the Khatri-Rao product. We decide hence to equip B with a similar structure: B = (B̄ ∘ B)^H for some learnable matrix B ∈ C^{M×N}. With such a parametrisation, the number of parameters characterising B drops from N M² to N M.
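Both parametrisations lean on the vectorisation identities (A.5) and (A.8), which are easy to verify numerically. The sketch below uses random matrices and assumes a column-major vec convention; the conjugations shown (on the first Khatri-Rao factor) are what make the identities hold.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 6
A = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

# Khatri-Rao product conj(A) ∘ A: column n is conj(a_n) ⊗ a_n.
KR = np.stack([np.kron(A[:, n].conj(), A[:, n]) for n in range(N)], axis=1)

# (A.5): (conj(A) ∘ A) x = vec(A diag(x) A^H), with column-major vec.
x = rng.random(N)
assert np.allclose(KR @ x, (A @ np.diag(x) @ A.conj().T).flatten(order="F"))

# (A.8): (conj(A) ∘ A)^H vec(R) = diag(A^H R A).
R = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
assert np.allclose(KR.conj().T @ R.flatten(order="F"), np.diag(A.conj().T @ R @ A))

# For Hermitian R, the back-projection diag(A^H R A) is real-valued.
Rh = R + R.conj().T
backproj = np.diag(A.conj().T @ Rh @ A)
assert np.allclose(backproj.imag, 0)
```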
The Khatri-Rao structure guarantees moreover real-valued –and hence physically-interpretable– dirty maps.

3 Network training

To facilitate the description of the training procedure, we adopt the following shorthand notations.
• DeepWave(Ω, L) denotes a specific instance of the DeepWave network (1) with parameters Ω := {θ, B, τ} and depth L.
• APGD(α, λ, γ) denotes an instance of APGD (5), with tuning parameters (α, λ, γ) ∈ R³_+.

The network parameters are chosen as minimisers of the following optimisation problem:

Ω̂ ∈ arg min_{θ ∈ R^{K+1}, B ∈ C^{M×N}, τ ∈ R^N}  (1/T) Σ_{t=1}^{T} ‖x̂_t − x^L_t(Ω)‖²₂ / (2‖x̂_t‖²₂) + λ_θ ‖θ‖²₂ / (2(K+1)) + λ_B ‖B‖²_F / (2MN) + λ_τ ‖L^{1/2} τ‖²₂ / (2N),    (9)

where the four terms are denoted L_t, L_θ, L_B and L_τ, respectively. The quantities {x^L_t(Ω)}_t and {x̂_t}_t in (9) correspond respectively to the outputs of DeepWave(Ω, L) and APGD(α, λ, γ) with identical example input data {(Σ̂_t, x⁰_t)}_t. The first term (1/T) Σ_{t=1}^{T} L_t is a data-fidelity term, which attempts to bring x̂_t and x^L_t(Ω) as close as possible from one another.16 The additional terms L_θ, L_B, L_τ are smoothing regularisers, fighting against overfitting, a common issue in deep learning.

15For a spherical array with diameter D = 30 cm operating at 1 kHz, M ≥ 90 is sufficient.

Since the shrinkage operator τ is defined over an irregular spherical tessellation, the smoothing term L_τ is defined via the Laplacian L ∈ R^{N×N} associated to the connectivity graph of the tessellation, as is customary in graph signal processing (see appendix B.3).

Optimisation of (9) is carried out by stochastic gradient descent (SGD) with momentum acceleration [51]. Gradients of L_t with respect to θ, B, τ are efficiently evaluated using reverse-mode algorithmic differentiation [1, 25] and are given in algorithms 1 and 2 (see appendix F for a derivation). While random initialisation of neural-networks is a common practice in deep learning [51], this strategy failed for our specific architecture, leading to poor validation loss and considerably increased training times. Instead, we hence use the oracle parameters (7) to initialise SGD:

θ⁰ := arg min_{θ ∈ R^{K+1}} ‖P_θ(L) − D‖²_F,  B⁰ := √(α/β) A,  τ⁰ := (λαγ/β) 1_N.    (10)

For greater numerical stability during training, we follow [41] and reparameterise the deblurring filter as P_θ(L̃) = Σ_{k=0}^{K} θ_k T_k(L̃), where T_k(·) is the Chebychev polynomial of order k and L̃ is the normalised Laplacian with spectrum in [−1, 1] (see appendix B.3 for implementation details). Finally, we substitute the ReLu activation function by a scaled rectified tanh to avoid the exploding gradient problem [39].17

4 Experimental results

In this section, we compare the accuracy, resolution and runtime performance of DeepWave to DAS and APGD on real-world (RW) and simulated (SIM) datasets. More comprehensive dataset descriptions and additional results, including an ablation study, are provided in appendices G to I.

Dataset 1 [36] (RW)  reproduces a conference room setup depicted in figs.
3a and 3b, where 8 people18 are gathered around a table and speak either in turns or simultaneously (with at most 3 concurrent speakers). Recordings of the conversation are collected by the 48-element Pyramic array [46] (fig. 3f) positioned at the centre of the table. Since human speech is wide-band, the recordings are pre-processed every 100 ms and split into 9 uniform bins over the audible range [1500, 4500] Hz, so as to form a suitable training set {(Σ̂_t, x̂_t, x⁰_t)}_t of 2760 data points per frequency band for DeepWave (with N = 2234). (See appendix G.2.) Frequency channels are processed independently by each algorithm. DeepWave is trained by splitting the data points into a training and validation set (respectively 80% and 20% in size). For each frequency band, we chose an architecture with 5 layers.

In fig. 3, figs. G.4 and H.3 respectively, we compare the accuracy and runtime of DeepWave, DAS and APGD. A video showing the evolution in time of DeepWave and DAS azimuthal sound fields (as in figs. 3a and 3b) is also available.19 In terms of resolution, DeepWave and APGD perform similarly, outperforming DAS by approximately 27%. The mean contrast scores for DeepWave and DAS over the test set of Dataset 1 are 0.99 (±0.0081) and 0.89 (±0.07), respectively. Note that since the metrics used for assessing resolution and contrast20 are not perfectly reflective of human-eye perception, the reported image quality improvements appear even more striking through visual inspection of the sound intensity fields (see for example fig. 3).

Dataset 2 [43] (RW)  consists of 2700 template recordings from the Pyramic array taken in an anechoic chamber at an angular resolution of 2 degrees in azimuth and three different elevations (−15, 0, 15 degrees). Recordings contain both male and female speech samples to cover a wide audible range.
The audio samples can be combined to simulate complex multi-source sound fields; we leverage this property to augment the dataset to 5700 distinct recordings with one, two, or three active speakers simultaneously. The raw time-series are then pre-processed as for Dataset 1 to obtain a training set of 151980 data points per frequency band (with N = 1568).

¹⁶ In a mean relative squared-error sense.
¹⁷ An alternative is to use a truncated ReLU. Given initialisation strategy (10), network training will still converge with similar step sizes as those used with tanh non-linearities.
¹⁸ The 8 people are represented in the experiment by loudspeakers playing male and female speech samples.
¹⁹ Available online: https://www.youtube.com/watch?v=PwB3CS2rHdI
²⁰ As is customary, resolution is measured as the width at half-maximum of the impulse response of the algorithms. Contrast is measured as the difference between the maximum and mean of the greyscale image.

Figure 3: Snapshots at time t = 1.7 s of the sound intensity fields produced by DeepWave and DAS for the Pyramic recordings with speakers 2, 6 and 16 active. Sound frequencies range from 1.5 to 4.5 kHz and were mapped to true colours (see fig. 3d; colour shades correspond to lower intensities). Panels: (a) DAS azimuthal sound field; (b) DeepWave azimuthal sound field; (c) DAS spherical sound field (resolution: 25.3°, RMS contrast: 0.78); (d) frequency-colour mapping; (e) DeepWave spherical sound field (resolution: 18.5°, contrast: 0.97); (f) Pyramic array. Figures 3a and 3b are the azimuthal projections of figs. 3c and 3e, respectively.
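The per-band pre-processing described above (100 ms snapshots, uniform frequency bins over the [1500, 4500] Hz range, one sample covariance matrix Σ̂ per bin) can be sketched as follows. This is a minimal illustration on synthetic noise, not the authors' exact pipeline: the function name, the Hann windowing, and the bin edges are assumptions.

```python
import numpy as np

def band_covariances(frames, fs, f_lo=1500.0, f_hi=4500.0, n_bands=9):
    """Sample covariance per frequency band from one multichannel snapshot.

    frames: (n_mic, n_samples) real-valued time series for one 100 ms snapshot.
    Returns a list of n_bands Hermitian (n_mic x n_mic) covariance estimates.
    """
    n_mic, n_samples = frames.shape
    # One windowed FFT per snapshot, purely for illustration; a real pipeline
    # would average the outer products of many consecutive STFT frames.
    spectra = np.fft.rfft(frames * np.hanning(n_samples), axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    edges = np.linspace(f_lo, f_hi, n_bands + 1)  # 9 uniform bins over the band
    covs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (freqs >= lo) & (freqs < hi)
        X = spectra[:, sel]  # (n_mic, n_freqs_in_band)
        covs.append(X @ X.conj().T / max(X.shape[1], 1))  # Hermitian estimate
    return covs

# Toy usage: 48 microphones (as in the Pyramic array), 100 ms at 48 kHz.
rng = np.random.default_rng(0)
frames = rng.standard_normal((48, 4800))
covs = band_covariances(frames, fs=48000.0)
```

By construction each estimate is Hermitian and positive semi-definite, which is what a covariance-based imager such as DeepWave expects as input.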
Network training is identical to that of Dataset 1, except that 10 azimuth directions are also withheld from the training set to assess how well the network generalises to emissions from unseen directions.
Figures 4a and 4b show sample DAS and DeepWave reconstructions with real sources from directions withheld from the training set. Similarly, fig. 4c shows sample reconstructions when the network is trained on real data but tested on synthetic narrow-band covariance matrices induced by sources from directions absent from the training set. In both cases we see that DeepWave outperforms DAS in resolution and contrast (i.e. sharper blobs and darker background).

Dataset 3 (SIM) finally is a simulated dataset with recordings from a spherical microphone array using a narrow-band point-source data model at 2 kHz [53]. The sources are randomly positioned over a 120° field-of-view, with up to 10 concurrent sources per recording. The experimental results in fig. H.1 corroborate the real-data results, showing that DeepWave generalises well to a large number of sources with unconstrained positions. We further investigated the influence of network depth in fig. H.2, and concluded that 5 or 6 layers are generally sufficient for the investigated dataset.

Figure 4: Snapshots of the sound intensity fields produced by DeepWave and DAS when trained on Dataset 2 (with 10 held-out source directions). Each subplot contains a DAS image (top) and a DeepWave image (bottom). Panels: (a), (b) DAS/DeepWave sound fields for Dataset 2; (c) DAS/DeepWave sound fields for synthetic data, with training on Dataset 2. The frequency-colour mapping is identical to fig. 3d. Figures 4a and 4b show azimuthal sound field slices on [−20°, 150°] using real-world covariance matrices with sources from unseen directions during training.
Figure 4c shows a full 360° sound field on a synthetic covariance matrix from unseen directions during training. Elevations span [−15°, +15°].

In terms of runtime, finally, DeepWave and DAS both meet real-time requirements (6.5 ms and 2.0 ms, respectively), largely outperforming APGD (211 ms). (See fig. H.3 for more details.)

5 Conclusion

We introduced DeepWave, the first recurrent neural-network for real-time and high-resolution acoustic imaging. It mimics iterative solvers from convex optimisation, while exploiting the natural structure of acoustic imaging problems for efficient training and operation. Our real- and simulated-data experiments show that DeepWave has similar computational speed to the state-of-the-art DAS imager with vastly superior resolution and contrast.
For future work, one of our goals is to make DeepWave time-aware by training it on sequences of consecutive measurements in time. To this end, we plan to connect multiple DeepWave networks together, one for each time instant, and train them end-to-end. In such an architecture, the output neurons from one network would serve as the initial neural state x⁰ for the next network in line. This can be interpreted as warm-starting the network with the sound field estimated at the previous time instant. Additionally, we would like to propose a frequency-invariant DeepWave architecture, allowing a single network to be trained for all frequency bands. Properties of the oracle weights (7) suggest that this should be possible, and it would considerably facilitate training, since the training set would grow while the number of trainable parameters shrinks.

Acknowledgments We thank Erwan Zerhouni for useful discussions regarding network training and implementation details, and Ivan Dokmanić for insights on related works that inspired our approach.
Finally, we express our gratitude to Robin Scheibler and Hanjie Pan for their openly accessible real-world datasets [36, 43].

References

[1] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.

[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] Jacob Benesty, Jingdong Chen, and Yiteng Huang. Microphone array signal processing, volume 1. Springer Science & Business Media, 2008.

[4] Adrien Besson, Lucien Roquette, Dimitris Perdios, Matthieu Simeoni, Marcel Arditi, Paul Hurley, Yves Wiaux, and Jean-Philippe Thiran. A physical model of non-stationary blur in ultrasound imaging. IEEE Transactions on Computational Imaging, 2019.

[5] Eric Bezzam, Robin Scheibler, Juan Azcarreta, Hanjie Pan, Matthieu Simeoni, René Beuchat, Paul Hurley, Basile Bruneau, Corentin Ferry, and Sepand Kashani. Hardware and software for reproducible research in audio array signal processing. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6591–6592. IEEE, 2017.

[6] Jan Biemond, Reginald L. Lagendijk, and Russell M. Mersereau. Iterative methods for image deblurring. Proceedings of the IEEE, 78(5):856–883, 1990.

[7] Thomas F. Brooks and William M. Humphreys. A deconvolution approach for the mapping of acoustic sources (DAMAS) determined from phased microphone arrays. Journal of Sound and Vibration, 294(4-5):856–879, 2006.

[8] Leon Brusniak, James Underbrink, and Robert Stoker. Acoustic imaging of aircraft noise sources using large aperture phased arrays. In 12th AIAA/CEAS Aeroacoustics Conference (27th AIAA Aeroacoustics Conference), page 2715, 2006.

[9] Antonin Chambolle and Charles Dossal.
On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization Theory and Applications, 166(3):968–982, 2015.

[10] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2017.

[11] Ning Chu, José Picheral, Ali Mohammad-Djafari, and Nicolas Gac. A robust super-resolution approach with sparsity constraint in acoustic imaging. Applied Acoustics, 76:197–208, 2014.

[12] Zhigang Chu and Yang Yang. Comparison of deconvolution methods for the visualization of acoustic sources based on cross-spectral imaging function beamforming. Mechanical Systems and Signal Processing, 48(1-2):404–422, 2014.

[13] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[14] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Bull. Am. Math. Soc., 54:151–165, 2017.

[15] Krzysztof M. Gorski, Eric Hivon, Anthony J. Banday, Benjamin D. Wandelt, Frode K. Hansen, Martin Reinecke, and Matthias Bartelmann. HEALPix: a framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759, 2005.

[16] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406. Omnipress, 2010.

[17] Harshit Gupta, Kyong Hwan Jin, Ha Q. Nguyen, Michael T. McCann, and Michael Unser. CNN-based projected gradient descent for consistent CT image reconstruction.
IEEE Transactions on Medical Imaging, 37(6):1440–1453, 2018.

[18] R. K. Hansen and P. A. Andersen. A 3D underwater acoustic camera—properties and applications. In Acoustical Imaging, pages 607–611. Springer, 1996.

[19] Doug P. Hardin, Timothy Michaels, and Edward B. Saff. A comparison of popular point configurations on S². Dolomites Research Notes on Approximation, 9(1), 2016.

[20] Paul Hurley and Matthieu Simeoni. Flexibeam: analytic spatial filtering by beamforming. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2877–2880. IEEE, 2016.

[21] Paul Hurley and Matthieu Simeoni. FlexArray: Random phased array layouts for analytical spatial filtering. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3380–3384. IEEE, 2017.

[22] Kyong Hwan Jin, Michael T. McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.

[23] K. G. Jinadasa. Applications of the matrix operators vech and vec. Linear Algebra and its Applications, 101:73–79, 1988.

[24] Charles H. Jones and George A. Gilmour. Acoustic camera apparatus, August 5 1975. US Patent 3,898,608.

[25] Sepand Kashani. Optimization notes. page 6, 2019.

[26] K. Kim, N. Neretti, and N. Intrator. Mosaicing of acoustic camera images. IEE Proceedings - Radar, Sonar and Navigation, 152(4):263–270, 2005.

[27] Hamid Krim and Mats Viberg. Two decades of array signal processing research. IEEE Signal Processing Magazine, 1996.

[28] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[29] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[30] Huan Li, Yibo Yang, Dongmin Chen, and Zhouchen Lin. Optimization algorithm inspired deep neural network structure design. arXiv preprint arXiv:1810.01638, 2018.

[31] Jingwei Liang and Carola-Bibiane Schönlieb. Faster FISTA. arXiv preprint arXiv:1807.04005, 2018.

[32] Michael Lustig, David Donoho, and John M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.

[33] Michael T. McCann, Kyong Hwan Jin, and Michael Unser. Convolutional neural networks for inverse problems in imaging: A review. IEEE Signal Processing Magazine, 34(6):85–95, 2017.

[34] Andy Meyer and Dirk Döbler. Noise source localization within a car interior using 3D-microphone arrays. Proceedings of the BeBeC, pages 1–7, 2006.

[35] Volker Michel. Lectures on constructive approximation. AMC, 10:12, 2013.

[36] Hanjie Pan, Robin Scheibler, Eric Bezzam, Ivan Dokmanić, and Martin Vetterli. Audio speech recordings used in the paper FRIDA: FRI-based DOA Estimation for Arbitrary Array Layout, March 2017. This work was supported by the Swiss National Science Foundation grant 20FP-1 151073, LABEX WIFI under references ANR-10-LABX-24 and ANR-10-IDEX-0001-02 PSL* and by Agence Nationale de la Recherche under reference ANR-13-JS09-0001-01.

[37] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.

[38] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang. Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine, 20(3):21–36, 2003.

[39] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks.
In International Conference on Machine Learning, pages 1310–1318, 2013.

[40] Dimitris Perdios, Adrien Besson, Marcel Arditi, and Jean-Philippe Thiran. A deep learning approach to ultrasound image recovery. In 2017 IEEE International Ultrasonics Symposium (IUS), pages 1–4. IEEE, 2017.

[41] Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 2019.

[42] Boaz Rafaely. Fundamentals of spherical array processing, volume 8. Springer, 2015.

[43] Robin Scheibler. Pyramic Dataset: 48-Channel Anechoic Audio Recordings of 3D Sources, March 2018. The author would like to acknowledge Juan Azcarreta Ortiz, Corentin Ferry, and René Beuchat for their help in the design and usage of the Pyramic array; Hanjie Pan, Miranda Kreković, Mihailo Kolundzija, and Dalia El Badawy for lending a hand, or even two, during experiments; and finally Juan Azcarreta Ortiz, Eric Bezzam, Hanjie Pan and Ivan Dokmanić for feedback on the documentation and dataset organization.

[44] Justin Romberg. Imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):14–20, 2008.

[45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[46] Robin Scheibler, Juan Azcarreta, René Beuchat, and Corentin Ferry. Pyramic: Full stack open microphone array architecture and dataset. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 226–230. IEEE, 2018.

[47] David Shuman, Sunil Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst.
The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

[48] Pieter Sijtsma. CLEAN based on spatial source coherence. International Journal of Aeroacoustics, 6(4):357–374, 2007.

[49] Matthieu Martin Jean-Andre Simeoni and Paul Hurley. Graph spectral clustering of convolution artefacts in radio interferometric images. Technical report, 2019.

[50] Jian Sun, Huibin Li, Zongben Xu, et al. Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems, pages 10–18, 2016.

[51] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[52] Michael Unser, Julien Fageot, and Harshit Gupta. Representer theorems for sparsity-promoting ℓ1 regularization. IEEE Transactions on Information Theory, 62(9):5167–5180, 2016.

[53] Alle-Jan van der Veen and Stefan J. Wijnholds. Signal processing tools for radio astronomy. In Handbook of Signal Processing Systems, pages 421–463. Springer, 2013.

[54] Yves Wiaux, Laurent Jacques, Gilles Puy, Anna M. M. Scaife, and Pierre Vandergheynst. Compressed sensing imaging techniques for radio interferometry. Monthly Notices of the Royal Astronomical Society, 395(3):1733–1742, 2009.

[55] Stefan J. Wijnholds and Alle-Jan van der Veen. Fundamental imaging limits of radio telescope arrays. IEEE Journal of Selected Topics in Signal Processing, 2(5):613–623, 2008.

[56] Li Xu, Jimmy S. J. Ren, Ce Liu, and Jiaya Jia. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, pages 1790–1798, 2014.

[57] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.