{"title": "Joint Sub-bands Learning with Clique Structures for Wavelet Domain Super-Resolution", "book": "Advances in Neural Information Processing Systems", "page_first": 165, "page_last": 175, "abstract": "Convolutional neural networks (CNNs) have recently achieved great success in single-image super-resolution (SISR). However, these methods tend to produce over-smoothed outputs and miss some textural details. To solve these problems, we propose the Super-Resolution CliqueNet (SRCliqueNet) to reconstruct the high resolution (HR) image with better textural details in the wavelet domain. The proposed SRCliqueNet firstly extracts a set of feature maps from the low resolution (LR) image by the clique blocks group. Then we send the set of feature maps to the clique up-sampling module to reconstruct the HR image. The clique up-sampling module consists of four sub-nets which predict the high resolution wavelet coefficients of four sub-bands. Since we consider the edge feature properties of four sub-bands, the four sub-nets are connected to the others so that they can learn the coefficients of four sub-bands jointly. Finally we apply inverse discrete wavelet transform (IDWT) to the output of four sub-nets at the end of the clique up-sampling module to increase the resolution and reconstruct the HR image. 
Extensive quantitative and qualitative experiments on benchmark datasets show that our method achieves superior performance over the state-of-the-art methods.", "full_text": "Joint Sub-bands Learning with Clique Structures for Wavelet Domain Super-Resolution

Zhisheng Zhong1 Tiancheng Shen1,2 Yibo Yang1,2 Chao Zhang1,∗ Zhouchen Lin1,3
1Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
2Academy for Advanced Interdisciplinary Studies, Peking University
3Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
{zszhong, tianchengshen, ibo, c.zhang, zlin}@pku.edu.cn

Abstract

Convolutional neural networks (CNNs) have recently achieved great success in single-image super-resolution (SISR). However, these methods tend to produce over-smoothed outputs and miss some textural details. To solve these problems, we propose the Super-Resolution CliqueNet (SRCliqueNet) to reconstruct the high resolution (HR) image with better textural details in the wavelet domain. The proposed SRCliqueNet first extracts a set of feature maps from the low resolution (LR) image with the clique blocks group. Then we send the set of feature maps to the clique up-sampling module to reconstruct the HR image. The clique up-sampling module consists of four sub-nets which predict the high resolution wavelet coefficients of the four sub-bands. Because the four sub-bands have related edge features, the four sub-nets are connected to one another so that they can learn the coefficients of the four sub-bands jointly.
Finally, we apply the inverse discrete wavelet transform (IDWT) to the outputs of the four sub-nets at the end of the clique up-sampling module to increase the resolution and reconstruct the HR image. Extensive quantitative and qualitative experiments on benchmark datasets show that our method achieves superior performance over the state-of-the-art methods.

1 Introduction

Single image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image, which is an ill-posed inverse problem. SISR has gained increasing research interest for decades. Recently, convolutional neural networks (CNNs) [6, 25, 32] have significantly improved the peak signal-to-noise ratio (PSNR) in SISR. These networks commonly use an extraction module to extract a series of feature maps from the LR image, cascaded with an up-sampling module to increase the resolution and reconstruct the HR image.
The quality of the extracted features strongly affects the performance of HR image reconstruction. The main part of the extraction module used in modern SR networks can be primarily divided into three types: conventional convolution layers [23], residual blocks [9] and dense blocks [10].
Conventional convolution has been widely adopted since AlexNet [20] won the first prize of ILSVRC in 2012. The first model using conventional convolution to solve the SR problem is SRCNN [6]. After that, many improved networks such as FSRCNN [7], SCN [36], ESPCN [28] and DRCN [18] also use conventional convolution and achieve great results. The residual block [9] is an improved version of the convolutional layer, which exhibits excellent performance in computer vision problems.
Since it can enhance feature propagation in networks and alleviate the vanishing-gradient problem, many SR networks such as VDSR [17], LapSRN [22], EDSR [25] and SRResNet [24] import residual blocks and exhibit improved performance.

∗Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: The architecture of the proposed Super-Resolution CliqueNet (SRCliqueNet).

To make use of the skip connections used in residual blocks, Huang et al. further proposed the dense block [10]. A dense block builds more connections among layers to enlarge the information flow. Tong et al. [35] proposed SRDenseNet using dense blocks, which boosts performance further.
Recently, Yang et al. [38] proposed a novel block called the clique block, where the layers in a block are constructed as a clique and are updated alternately in a loop manner. Any layer is both the input and the output of another one in the same block, so the information flow is maximized. The propagation of a clique block contains two stages. The first stage does the same thing as a dense block. The second stage distills the feature maps by using the skip connections between any pair of layers, including feedback connections from later layers to earlier ones.
A suitable up-sampling module can further improve image reconstruction performance. The up-sampling modules used in modern SR networks to increase the resolution can also be primarily divided into three types: interpolation up-sampling, deconvolution up-sampling and sub-pixel convolution up-sampling.
Interpolation up-sampling was first used in SRCNN [6]. At that time, there was no effective module implementation that could make the output size larger than the input size, so SRCNN used pre-defined bicubic interpolation on the input images to reach the desired size first.
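This pre-interpolation step can be sketched as follows; nearest-neighbour replication stands in for bicubic interpolation only to keep the sketch dependency-free:

```python
import numpy as np

def pre_upsample(img, r):
    """Enlarge an (H, W, C) image by an integer factor r before feeding it
    to the network. SRCNN uses bicubic interpolation; nearest-neighbour
    replication is used here purely for illustration."""
    return img.repeat(r, axis=0).repeat(r, axis=1)

lr = np.arange(12, dtype=np.float32).reshape(2, 2, 3)
hr = pre_upsample(lr, 2)
print(hr.shape)  # (4, 4, 3)
```

Because every subsequent convolution then runs on the enlarged input, the network's cost grows roughly with the square of the magnification factor, which is the drawback of this approach.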
Following SRCNN's use of pre-interpolation, VDSR [17], IRCNN [42], DRRN [31] and MemNet [32] used different extraction modules. However, this pre-processing step increases the computational complexity, because the feature maps are already enlarged to the HR size.
Deconvolution, proposed in [39, 40], can be seen as the multiplication of each input pixel by a filter, which can increase the input size if the stride s > 1. Many modern SR networks such as FSRCNN [7], LapSRN [22], DBPN [8] and IDN [14] obtained better results by using deconvolution as the up-sampling module. However, the computational complexity of the forward and backward propagation of deconvolution is still a major concern.
Sub-pixel convolution, proposed in [28], aims at accelerating the up-sampling operation. Unlike previous up-sampling methods that change the height and width of the input feature maps, sub-pixel convolution implements up-sampling by increasing the number of channels. A periodic shuffling operation then reshapes the output feature map to the desired height and width. ESPCN [28], EDSR [25] and SRMD [42] used sub-pixel convolution to achieve good performance on benchmark datasets.
The above-mentioned networks tend to produce blurry and overly-smoothed HR images that lack some texture details. The wavelet transform (WT) has been shown to be an efficient and highly intuitive tool to represent and store images in a multi-resolution way [26, 30]. WT can describe the contextual and textural information of an image at different scales, and it has been applied successfully to the multi-frame SR problem [4, 16, 27].
Motivated by the remarkable properties of the clique block and WT, we propose a novel network for SR called SRCliqueNet to address the above-mentioned challenges. We design the res-clique block as the main part of the extraction module to improve the network's performance. We also design a novel up-sampling module called clique up-sampling.
It consists of four sub-nets that predict the high-resolution wavelet coefficients of the four sub-bands. Because the four sub-bands have related edge features, the four sub-nets learn their coefficients jointly. For magnification factors greater than 2, we design a progressive SRCliqueNet built upon image pyramids [1]. Our proposed network achieves superior performance over the state-of-the-art methods on benchmark datasets.

Figure 2: The illustrations of the res-clique block (left) and the clique block group (right).

2 Super-Resolution CliqueNet

In this section, we first give an overview of the proposed SRCliqueNet architecture, then we introduce the feature embedding net (FEN) and the image reconstruction net (IRN), which are the key parts of SRCliqueNet.

2.1 Network architecture

As shown in Figure 1, our SRCliqueNet mainly consists of two sub-networks: FEN and IRN. FEN represents the LR input image as a set of feature maps. Note that FEN does not change the size (h, w) of the input image, where h and w are the height and the width, respectively. IRN up-samples the feature maps obtained by FEN and reconstructs the HR image. Here we denote I_LR ∈ R^{3×h×w} as the input LR image and I_HR ∈ R^{3×rh×rw} as the ground truth HR image, where r is the magnification factor.

2.2 Feature Embedding Net

As shown in the left part of Figure 1, FEN starts with two convolutional layers. The first convolutional layer increases the number of channels of the input so that it can be added to the output of the clique block group via a skip connection. The clique block group will be introduced shortly.
The skip connection after the first convolutional layer has been widely used in SR networks [14, 24, 25]. The output of the first convolutional layer is F_1 ∈ R^{n_B·lg×h×w}, where n_B is the number of clique blocks that follow, l is the number of layers in each clique block and g is the growth rate of each clique block. The second convolutional layer changes the number of channels so that they fit the input of the clique block group. The output of the second convolutional layer is F_2 ∈ R^{lg×h×w}.
The illustrations of the res-clique block and the clique block group are shown in Figure 2. We choose the clique block as our main feature extractor for the following reasons. First, a clique block's forward propagation contains two stages: the first stage does the same thing as a dense block, while the second stage distills the features further. Second, a clique block contains more skip connections than a dense block, so information can propagate among layers more easily. We add a residual connection to the clique block, since the input features contain plenty of useful information for the SR problem. We call this variant the res-clique block.
Suppose a res-clique block has l layers, and denote its input and output by X_0 ∈ R^{lg×h×w} and Y ∈ R^{lg×h×w}, respectively. The weight between layer i and layer j is represented by W_{ij}. The feed-forward pass of the res-clique block can be described by the following equations, where ∗ is the convolution operation and σ is the activation function. For stage one,

X_i^{(1)} = σ( Σ_{k=1}^{i−1} W_{ki} ∗ X_k^{(1)} + W_{0i} ∗ X_0 ).

For stage two,

X_i^{(2)} = σ( Σ_{k=1}^{i−1} W_{ki} ∗ X_k^{(2)} + Σ_{k=i+1}^{l} W_{ki} ∗ X_k^{(1)} ).

For the residual connection, Y = [X_1^{(2)}, X_2^{(2)}, ..., X_l^{(2)}] + X_0, where [·] represents the concatenation operation.
Then we combine n_B res-clique blocks into a clique block group. The output of the clique block group makes use of features from all preceding res-clique blocks and can be represented as B_i = H_{RCB_i}(B_{i−1}), i = 1, 2, ..., n_B, B_i ∈ R^{lg×h×w}, where B_i is the output and H_{RCB_i} is the underlying mapping of the i-th res-clique block. Since F_2 is the input of the first res-clique block, we have B_0 = F_2. F_CBG = [B_1, B_2, ..., B_{n_B}] ∈ R^{n_B·lg×h×w} is the output of the clique block group. Finally, the output of FEN is the summation of F_CBG and F_1, that is, F_FEN = F_CBG + F_1.

2.3 Image Reconstruction Net

Now we present the details of IRN. As shown in the right part of Figure 1, IRN consists of two parts: a clique up-sampling module and a convolutional layer which reduces the number of feature maps to reconstruct the HR image with 3 channels (RGB).
The clique up-sampling module, shown in Figure 3, is the most significant part of IRN. It is motivated by the discrete wavelet transform (DWT) and the clique block. It contains four sub-nets, representing the four sub-bands denoted by LL, HL, LH and HH in the wavelet domain, respectively. Previous CNNs for wavelet domain SR [11, 21] ignore the relationship among the four sub-bands. The LL block represents low-pass filtering of the original image at half the resolution. The output feature maps of FEN encode the essential information in the original LR image, so we use the output feature F_FEN to learn the LL block first.
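For concreteness, one level of the Haar DWT and its inverse (the IDWT that ends the clique up-sampling module) can be sketched in NumPy. The averaging normalisation below is one common convention for the Haar filters, shown for illustration rather than as the paper's exact implementation:

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D Haar transform (averaging convention):
    splits an even-sized single-channel image into the LL, HL, LH and
    HH sub-bands, each at half the resolution."""
    lo = (x[0::2] + x[1::2]) / 2.0   # low-pass over row pairs
    hi = (x[0::2] - x[1::2]) / 2.0   # high-pass over row pairs
    ll = (lo[:, 0::2] + lo[:, 1::2]) / 2.0
    hl = (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    lh = (hi[:, 0::2] + hi[:, 1::2]) / 2.0
    hh = (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, hl, lh, hh

def haar_idwt2(ll, hl, lh, hh):
    """Inverse of haar_dwt2: merges the four sub-bands back into an
    image at twice the resolution, as in the clique up-sampling output."""
    h, w = ll.shape
    lo = np.empty((h, 2 * w)); hi = np.empty((h, 2 * w))
    lo[:, 0::2], lo[:, 1::2] = ll + hl, ll - hl
    hi[:, 0::2], hi[:, 1::2] = lh + hh, lh - hh
    x = np.empty((2 * h, 2 * w))
    x[0::2], x[1::2] = lo + hi, lo - hi
    return x

img = np.arange(64, dtype=np.float64).reshape(8, 8)
ll, hl, lh, hh = haar_dwt2(img)
print(ll.shape)  # (4, 4)
```

Because the transform is exactly invertible, a network that predicts the four HR sub-bands obtains the HR image directly through the inverse step.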
We denote the number of channels of the input feature maps by c; then F_FEN ∈ R^{c×h×w} with c = n_B·lg. This process can be written as

F_LL^{(1)} = H_LL^{(1)}(F_FEN),   (1)

where H_LL^{(1)} denotes the learnable non-linear function of the LL block for the first step. The HL block mostly shows horizontal edges, while the LH block mainly contains vertical edges. As illustrated in the left part of Figure 4, we take an image from Set5 [3] as an example. Both the HL and LH blocks can be learned from the LL block and the feature F_FEN:

F_HL^{(1)} = H_HL^{(1)}([F_FEN, F_LL^{(1)}]),   F_LH^{(1)} = H_LH^{(1)}([F_FEN, F_LL^{(1)}]),   (2)

where H_HL^{(1)} and H_LH^{(1)} denote the learnable functions that construct the HL and LH blocks for the first step. The HH block finds edges of the original image in the diagonal direction. As also shown in the left part of Figure 4, the HH block looks similar to the LH and HL blocks, so we suggest that using the LL, HL and LH blocks together with the output feature map of FEN makes the HH block easier to learn than using the feature map alone. We formulate this as

F_HH^{(1)} = H_HH^{(1)}([F_FEN, F_LL^{(1)}, F_HL^{(1)}, F_LH^{(1)}]).   (3)

We name the above operations the sub-band extraction stage. We also plot four histograms in the right part of Figure 4 to show that the sub-band extraction stage is effective. We apply DWT to 800 images from DIV2K [33], which we use as our training dataset in our experiments, and plot histograms of the four sub-bands' DWT coefficients of these images. From Figure 4, we find that the distributions of the LH, HL and HH blocks are similar to each other, so it is reasonable to use the HL and LH blocks to learn the HH block.
The four sub-bands are followed by a few residual blocks after the sub-band extraction stage.
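The chained sub-band extraction stage of Eqs. (1)–(3) can be sketched per pixel, with seeded random linear maps standing in for the convolutional sub-nets H (the names and sizes here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def subnet(c_in, c_out):
    """Random linear map + ReLU standing in for a learnable sub-net H."""
    M = rng.standard_normal((c_out, c_in)) * 0.1
    return lambda x: np.maximum(M @ x, 0.0)

c, p = 16, 8                       # feature / per-sub-band channels
H_LL = subnet(c, p)                # Eq. (1): LL from F_FEN alone
H_HL = subnet(c + p, p)            # Eq. (2): HL from F_FEN and LL
H_LH = subnet(c + p, p)            # Eq. (2): LH from F_FEN and LL
H_HH = subnet(c + 3 * p, p)        # Eq. (3): HH from F_FEN, LL, HL, LH

F_FEN = rng.standard_normal(c)     # one spatial position, for brevity
F_LL = H_LL(F_FEN)
F_HL = H_HL(np.concatenate([F_FEN, F_LL]))
F_LH = H_LH(np.concatenate([F_FEN, F_LL]))
F_HH = H_HH(np.concatenate([F_FEN, F_LL, F_HL, F_LH]))
print(F_HH.shape)  # (8,)
```

The point of the chaining is visible in the input widths: each higher-frequency sub-net conditions on the sub-bands predicted before it.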
Due to\nthat high frequency coef\ufb01cients may be more dif\ufb01cult to learn than low frequency coef\ufb01cients, we\nuse different numbers of residual blocks for different sub-bands. We denote the numbers of residual\nblocks of each sub-band as nLL, nHL, nLH and nHH, respectively. we update each sub-band by the\nfollowing equation\nLL = H(2)\nF(2)\nwhere H(2)\nLL,H(2)\nrespectively. We name the above-mentioned operations as the self residual learning stage.\nAfter the operations of the self residual learning stage, IRN enters the sub-band re\ufb01nement stage. At\nthis stage, we use the high frequency blocks to re\ufb01ne the low frequency blocks, which is an inverse\nprocess of the sub-band extraction stage. Concretely, we use the HH block to learn the LH and the\nHL blocks, represented as\nLH = H(3)\nF(3)\n\nHL = H(2)\nHH represent the residual learnable function of for four sub-bands,\n\nLL), F(2)\nLH and H(2)\n\nLL(F(1)\nHL,H(2)\n\nHH = H(2)\n\nHL = H(3)\n\nLH([F(3)\n\nHH, F(2)\n\nHL([F(3)\n\nHH, F(2)\n\nHL(F(1)\n\nHL), F(2)\n\nLH(F(1)\n\nLH), F(2)\n\nLH = H(2)\n\nLH]), F(3)\n\nHH(F(1)\n\nHL]),\n\nHH),\n\n(4)\n\n(5)\n\nLH and H(3)\n\nwhere H(3)\nHL blocks, respectively. For the uni\ufb01cation of representations, we de\ufb01ne F(3)\nway, we update FLL by the following equation\n\nHL represent the learnable function of sub-band re\ufb01nement stage for the LH and\nHH. In a similar\n\nHH = F(2)\n\nLL = H(3)\nF(3)\n\nLL([F(3)\n\nHH, F(3)\n\nLH, F(3)\n\nHL, F(2)\n\nLL]).\n\n(6)\n\n4\n\n\fFigure 3: The architecture of clique up-sampling module and the visualization of its feature maps.\n\nThen we apply IDWT to these four blocks, we choose the simplest wavelet, Haar wavelet, for it can\nbe computed by deconvolution operation easily. The dimensions of all blocks are the same. They\nare all p \u00d7 h \u00d7 w, where p represents the number of feature maps produced by each sub-net. 
So\nHH]) \u2208 Rp\u00d72h\u00d72w.\nthe output of clique up-sampling module is FCU = IDWT([F(3)\nAt last, the output of clique up-sampling module is sent to a convolutional layer, which is used\nto reduce the number of channels and get the predicted HR image \u02c6IHR. We call the up-sampling\nmodule as clique up-sampling for the following reasons. First, the connection patterns of these two\nmodules are consistent. Both of clique block and clique up-sampling use dense connections among\nsub-bands/layers. Second, the forward propagation mechanisms of these two modules seem to be\nsimilar, that is, both the two modules update the output of sub-bands/layers stage by stage. Since both\nthe extraction module and the up-sampling module relate to clique, we call our network as Super\nResolution CliqueNet (SRCliqueNet in short).\n\nHL, F(3)\n\nLL, F(3)\n\nLH, F(3)\n\n2.4 Comparison between clique block and clique up-sampling\n\nAlthough we call the block and the up-sampling module as clique block and clique up-sampling,\nrespectively, there are many differences between these two modules. Concretely, the number of\nsub-bands/layers of clique up-sampling is \ufb01xed to four because of the formula of IDWT. In contrast,\nthe layer number of clique block is not constrained. Clique up-sampling has three stages to update the\noutput of each sub-band/layer. The clique block, by contrast, does not have a stage that can update\nthe output by its own layer alone. Since we consider the edge feature properties of all sub-bands,\nthe HL block mostly shows horizontal edges. In contrast, the LH block mainly contains vertical\nedges. The outputs of these two blocks seem to be \u201corthogonal\u201d. So there may be no connection\nbetween the second and the third sub-bands/layers in clique up-sampling module. At last, the outputs\nof these two modules are quite different. 
To be more specific, the output of a clique block is the concatenation of the outputs of all layers, which gives it more channels, whereas the output of clique up-sampling is the output of all layers after IDWT, which increases the resolution.

2.5 Architecture for magnification factor 2^J×

So far we have introduced the network architecture for magnification factor 2×. In this subsection, we propose SRCliqueNet's architecture for magnification factor 2^J×, where J is the total number of levels of the network. The image pyramid [1] has been widely used in computer vision applications; LAPGAN [5] and LapSRN [22] used Laplacian pyramids for SR. Motivated by these works, we import the image pyramid into our proposed network to deal with magnification factors of 2^J×. As shown in the left part of Figure 5, our model generates multiple intermediate SR predictions in one feed-forward pass through progressive reconstruction. Due to our cascaded and progressive architecture, our final loss consists of J parts: L = Σ_{j=1}^{J} L_j. We use bicubic down-sampling to resize the ground truth HR image I_HR to I_j at level j. Following [14, 25], we use the mean absolute error (MAE) to measure the reconstruction at each level: L_j = mean(|I_j − Î_j|), where Î_j is the predicted HR image at level j.

Figure 4: Left: The illustration of the relationships among the edge features of the four sub-bands.
Right: The histograms of the four sub-bands' coefficients over 800 images from DIV2K [33]. Top right: DWT applied to the original images. Bottom right: DWT applied to images preprocessed with mode 4, described in Section 3.1.

Figure 5: Left: The SRCliqueNet architecture with a magnification factor of 4×. Right: The performance of input images transformed with the four modes.

3 Experiments

3.1 Implementation and training details

Model details. In our proposed SRCliqueNet, we set 3 × 3 as the size of most convolutional kernels. We pad zeros on each side of the input to keep the size fixed. We also use a few 1 × 1 convolutional layers for feature pooling and dimension reduction. The details of our SRCliqueNet's settings are presented in Table 1. In Table 1, n_B represents the number of clique blocks, and l and g represent the number of layers and the growth rate in each clique block, respectively. The numbers of input and output channels of a clique up-sampling module are denoted by c and p, respectively. n_LL, n_LH, n_HL and n_HH represent the numbers of residual blocks in the four sub-bands. Unlike most CNNs for computer vision problems, we avoid dropout [29], batch normalization [15] and instance normalization [13], which are not suitable for the SR problem because they reduce the flexibility of features [25].

Datasets and training details. We trained all networks using images from DIV2K [33] and Flickr [25]. For testing, we used four standard benchmark datasets: Set5 [3], Set14 [41], BSDS100 [2] and Urban100 [12]. Following the settings of [25], we used a batch size of 16 with size 32 × 32 for LR images, while the size of the HR images changes according to the magnification factor. We randomly augmented the patches by flipping horizontally or vertically and rotating by 90°. We chose parametric rectified linear units (PReLUs) as the activation function for our networks.
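The random flip/rotation augmentation can be sketched as follows; the key point is that the LR patch and its HR counterpart must receive identical transforms:

```python
import numpy as np

def augment(lr, hr, rng):
    """Randomly flip and/or rotate an (H, W, C) LR patch together with
    its HR counterpart, applying the same transform to both."""
    if rng.random() < 0.5:
        lr, hr = lr[:, ::-1], hr[:, ::-1]    # horizontal flip
    if rng.random() < 0.5:
        lr, hr = lr[::-1, :], hr[::-1, :]    # vertical flip
    if rng.random() < 0.5:
        lr, hr = np.rot90(lr), np.rot90(hr)  # 90-degree rotation
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)

rng = np.random.default_rng(0)
lr, hr = np.zeros((32, 32, 3)), np.zeros((64, 64, 3))
lr2, hr2 = augment(lr, hr, rng)
print(lr2.shape, hr2.shape)  # (32, 32, 3) (64, 64, 3)
```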
The base learning rate was initialized to 10^−5 for all layers and decreased by a factor of 2 every 200 epochs. The total number of training epochs was set to 500. We used Adam [19] as our optimizer and conducted all experiments in PyTorch.

Magnitude of sub-bands. As mentioned above, our clique up-sampling module has four sub-nets, and every sub-net is connected with the other sub-nets. Since the feature maps of one sub-band are learned from those of other sub-bands, the magnitude of each sub-band block should be similar to the others' in order to make full use of each sub-net. As shown in the top right part of Figure 4, in the histograms of the DWT coefficients of the original images, the magnitude of the LL sub-band's coefficients is quite different from the other three's, which may make the training process difficult. So we want to transform the original images to reduce the difference among the magnitudes of the four sub-bands. We propose four modes: (1) Original pixel range from 0 to 255. (2) Each pixel is divided by 255. (3) Each pixel is divided by 255 and the channel-wise mean of the training dataset is subtracted. (4) As in mode 3, and additionally, after DWT, the coefficients of the LL block are divided by a scalar which is around 4 to make the magnitude of the LL sub-band more similar to the other sub-bands'. The resulting histograms are shown in the bottom right part of Figure 4. Under the same experimental setting, we pre-process the input images with the four modes; their performance is shown in the right part of Figure 5. From the figure, we find that mode 4 achieves the best performance in terms of loss value, so in the subsequent experiments we pre-process our input with mode 4.

Table 1: Details of our proposed SRCliqueNet for magnification factors 2× and 4×. CBG represents the clique block group and CU represents clique up-sampling.

SRCliqueNet(2×):  CBG: n_B = 15, l = 4, g = 32.  CU: c = 1920, p = 480; n_LL = 2, n_LH = 3, n_HL = 3, n_HH = 4.
SRCliqueNet(4×):  CBG: n_B = 15, l = 4, g = 32.  CU1: c = 2400, p = 600; n_LL = 2, n_LH = 3, n_HL = 3, n_HH = 4.  CU2: c = 600, p = 300; n_LL = 2, n_LH = 3, n_HL = 3, n_HH = 4.

Table 2: Investigation of FEN (vary FEN and fix IRN; PSNR/SSIM on Set5, 2×).

RB + CU: 37.75 / 0.960    DB + CU: 37.83 / 0.960    CB + CU: 37.99 / 0.962

Table 3: Investigation of IRN (vary IRN and fix FEN; PSNR/SSIM on Set5, 2×).

CB + DC: 37.87 / 0.960    CB + SC: 37.89 / 0.961    CB + CU−: 37.81 / 0.960    CB + CU: 37.99 / 0.962

3.2 Investigation of FEN and IRN

To verify the power of the res-clique block and the clique up-sampling module, we designed two contrast experiments.
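The mode-4 preprocessing selected in Section 3.1 can be sketched as below; the per-channel means are hypothetical placeholders (the real values would be computed over the training images), and the LL divisor follows the paper's "scalar which is around 4":

```python
import numpy as np

CHANNEL_MEAN = np.array([0.45, 0.44, 0.40])  # hypothetical RGB means
LL_SCALE = 4.0                               # paper: "around 4"

def mode4(img_uint8):
    """Mode 4, step 1: scale pixels to [0, 1] and centre each channel."""
    return img_uint8.astype(np.float64) / 255.0 - CHANNEL_MEAN

def balance_subbands(ll, hl, lh, hh):
    """Mode 4, step 2: after DWT, shrink only the LL coefficients so all
    four sub-bands have comparable magnitude."""
    return ll / LL_SCALE, hl, lh, hh

x = mode4(np.full((4, 4, 3), 128, dtype=np.uint8))
print(x.shape)  # (4, 4, 3)
```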
In these two experiments, we used a small version of SRCliqueNet which contains eight blocks, each block having four layers and each layer producing 32 feature maps. In the first experiment, we fixed the clique up-sampling module in IRN and used different blocks, i.e., the residual block (RB), the dense block (DB) and the res-clique block (CB), in FEN. In the second experiment, we fixed the clique blocks in FEN and changed the up-sampling module, i.e., deconvolution (DC), sub-pixel convolution (SC), clique up-sampling without joint learning (CU−) and clique up-sampling (CU). We recorded the best performance in terms of PSNR/SSIM [37] on Set5 with magnification factor 2× over 400 epochs. The performances of all settings are listed in Tables 2 and 3. They show the power of the clique block and the clique up-sampling module: when we combine them, we get the best performance compared with the other settings.
We also visualize the feature maps of the four sub-bands in two stages. Since the number of channels in the two stages is larger than 3, we take the mean of the feature maps over the channel dimension for better visualization: mean(F) = (1/c) Σ_{i=1}^{c} F_{i,:,:}. The channel-wise averaged feature maps are shown at the bottom of Figure 3. From Figure 3, we find that the feature maps of the input and of stage one do not look like coefficients in the wavelet domain. However, the feature maps of stage two are close to DWT coefficients and can reconstruct clear, high-resolution images after IDWT. The visualization results demonstrate that it is necessary to add the sub-band refinement stage to the clique up-sampling module.

3.3 Comparison with other wavelet CNN methods

As mentioned above, some existing methods such as Wavelet-SRNet [11] and CNNWSR [21] also used wavelets and CNNs for image super-resolution.
We first give a detailed comparison between Wavelet-SRNet and SRCliqueNet. There are three main differences between these two models. (1) Wavelet-SRNet learns the wavelet coefficients independently and directly, whereas our SRCliqueNet considers the relationship among the four sub-bands in the frequency domain. Moreover, our net applies three stages to learn the coefficients of all sub-bands jointly, i.e., the sub-band extraction stage, the self residual learning stage and the sub-band refinement stage. (2) Wavelet-SRNet uses full wavelet packet decomposition to reconstruct SR images with magnification factors 4× and larger, while SRCliqueNet reconstructs SR images with large magnification factors progressively via an image pyramid. We use bicubic down-sampling to resize the ground truth HR image at each level to assist learning, so our net can take full advantage of the supervisory information in HR images. (3) SRCliqueNet is based on clique blocks, which propagate information among layers more easily than residual blocks. We also conduct an experiment comparing these two models on the Helen test dataset with magnification factor 4×. Our network is trained with images from the Helen training dataset, while Wavelet-SRNet is trained with images from both the Helen and CelebA datasets. The results are listed in Table 4 below, showing that our SRCliqueNet outperforms Wavelet-SRNet.
In the following, we give a detailed comparison between CNNWSR and SRCliqueNet. In addition to the differences listed above for Wavelet-SRNet, CNNWSR is a simpler network with only three layers. CNNWSR assumes that the input LR image is an approximation of the LL sub-band, so it just tries to learn the other three sub-bands from the LR image, which is inaccurate.
Hence, it is no surprise that our model clearly outstrips CNNWSR in the following quantitative experiment. In [21], the authors show four reconstructed images (monarch, zebra, baby and bird) chosen from the Set5 and Set14 datasets. The PSNR comparison on these images is shown in Table 5 below.

Table 4: Results on the Helen test set (4×).

Wavelet-SRNet [11]: PSNR 27.94, SSIM 0.8827
SRCliqueNet: PSNR 28.23, SSIM 0.8844

Table 5: PSNR comparisons between CNNWSR and SRCliqueNet.

               monarch 2×   zebra 2×   baby 4×   bird 4×
CNNWSR [21]    35.74        31.58      31.84     29.01
SRCliqueNet    40.53        33.90      34.71     35.84

3.4 Comparison with the state-of-the-art

To validate the effectiveness of the proposed network, we performed several experiments and visualizations. We compared our proposed network with 8 state-of-the-art SR algorithms: DRCN [18], LapSRN [22], DRRN [31], MemNet [32], SRMDNF [42], IDN [14], D-DBPN [8] and EDSR [25]. We carried out extensive experiments using the four benchmark datasets mentioned above and evaluated the reconstructed images with PSNR and SSIM. Table 6 shows quantitative comparisons on 2× and 4× SR. Our SRCliqueNet performs better than existing methods on almost all datasets. To maximize the potential performance of our SRCliqueNet, we adopt a self-ensemble strategy similar to [34]; we mark the self-ensemble version of our model as SRCliqueNet+ in Table 6.
In Figure 6, we show visual comparisons on Set14, BSDS100 and Urban100 with a magnification factor of 4×. Due to limited space, we show only four image results here; for more SR results, please refer to our supplementary materials. As shown in Figure 6, our method accurately reconstructs clearer textural details of English letters and more of the textural stripes on zebras. For structured architectural images, our method tends to produce more legible reconstructed HR images.
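The PSNR metric used in these tables can be reproduced in a few lines (peak value 255 for 8-bit images; SSIM [37] is omitted here for brevity):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a ground-truth image and
    a reconstruction of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 16.0)
print(round(psnr(a, b), 2))  # 24.05
```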
These comparisons suggest that inferring the high-frequency details directly in the wavelet domain is effective. Our method also achieves better quantitative results in terms of PSNR and SSIM than the other state-of-the-art methods.

4 Conclusion

In this paper, we propose a novel CNN called SRCliqueNet for SISR. We design a new up-sampling module, clique up-sampling, which uses the IDWT to increase the size of the feature maps and jointly learns all sub-band coefficients based on their edge feature properties. We also design a res-clique block to extract features for SR, and we verify the necessity of both modules on benchmark datasets. In addition, we extend SRCliqueNet with a progressive up-sampling module to handle larger magnification factors. Extensive evaluations on benchmark datasets demonstrate that the proposed network outperforms state-of-the-art SR algorithms in terms of quantitative metrics. In visual quality, our algorithm also reconstructs clearer and richer textural details than the other state-of-the-art methods.

Table 6: Quantitative evaluation of state-of-the-art SR algorithms: average PSNR/SSIM for magnification factors 2× and 4×.
Red indicates the best and Blue indicates the second best performance. ('-' indicates that the method failed to reconstruct the whole images due to computation limitation.)

Models         Mag.   Set5          Set14         BSDS100       Urban100
                      PSNR/SSIM     PSNR/SSIM     PSNR/SSIM     PSNR/SSIM
Bicubic        2×     33.65/0.930   30.34/0.870   29.56/0.844   26.88/0.841
VDSR [17]      2×     37.53/0.958   32.97/0.913   31.90/0.896   30.77/0.914
DRCN [18]      2×     37.63/0.959   32.98/0.913   31.85/0.894   30.76/0.913
LapSRN [22]    2×     37.52/0.959   33.08/0.913   31.80/0.895   30.41/0.910
DRRN [31]      2×     37.74/0.959   33.23/0.913   32.05/0.897   31.23/0.919
MemNet [32]    2×     37.78/0.960   33.28/0.914   32.08/0.898   31.31/0.920
SRMDNF [42]    2×     37.79/0.960   33.32/0.915   32.05/0.898   31.31/0.920
IDN [14]       2×     37.83/0.960   33.30/0.915   32.08/0.898   31.27/0.920
D-DBPN [8]     2×     38.09/0.960   33.85/0.919   32.27/0.900   -/-
EDSR [25]      2×     38.11/0.960   33.92/0.919   32.32/0.901   32.93/0.935
SRCliqueNet    2×     38.23/0.963   33.96/0.923   32.36/0.905   32.86/0.936
SRCliqueNet+   2×     38.28/0.963   34.03/0.924   32.40/0.906   32.95/0.937
Bicubic        4×     28.42/0.810   26.10/0.704   25.96/0.669   23.15/0.659
VDSR [17]      4×     31.35/0.882   28.03/0.770   27.29/0.726   25.18/0.753
DRCN [18]      4×     31.53/0.884   28.04/0.770   27.24/0.724   25.14/0.752
LapSRN [22]    4×     31.54/0.885   28.19/0.772   27.32/0.728   25.21/0.756
DRRN [31]      4×     31.68/0.888   28.21/0.772   27.38/0.728   25.44/0.764
MemNet [32]    4×     31.74/0.890   28.26/0.772   27.40/0.728   25.50/0.763
SRMDNF [42]    4×     31.96/0.893   28.35/0.777   27.49/0.734   25.68/0.773
IDN [14]       4×     31.82/0.890   28.25/0.773   27.41/0.730   25.41/0.763
D-DBPN [8]     4×     32.47/0.898   28.82/0.786   27.72/0.740   -/-
EDSR [25]      4×     32.46/0.897   28.80/0.788   27.71/0.742   26.64/0.803
SRCliqueNet    4×     32.61/0.903   28.88/0.796   27.77/0.752   26.69/0.808
SRCliqueNet+   4×     32.67/0.903   28.95/0.797   27.81/0.752   26.80/0.810

Figure 6: Visual comparisons on images sampled from Set14, BSDS100 and Urban100, with a magnification factor 4×. [Panels: Set14 ppt3, BSDS100 253027, Urban100 img005, Urban100 img025; each compared across HR, Bicubic, IDN, SRMDNF, EDSR and SRCliqueNet (ours), with per-image PSNR/SSIM.]

Acknowledgments

This research is partially supported by National Basic Research Program of China (973 Program) (grant nos. 2015CB352502 and 2015CB352303), National Natural Science Foundation (NSF) of China (grant nos. 61625301, 61731018 and 61671027), Qualcomm and Microsoft Research Asia.

References

[1] E. H. Adelson. Pyramid methods in image processing. RCA Engineer, 29, 1984.

[2] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, (5):898-916, 2011.

[3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.

[4] Raymond H Chan, Tony F Chan, Lixin Shen, and Zuowei Shen. Wavelet algorithms for high-resolution image reconstruction. SIAM Journal on Scientific Computing, (4):1408-1432, 2003.

[5] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486-1494, 2015.

[6] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang.
Learning a deep convolutional network for image super-resolution. In ECCV, pages 184-199, 2014.

[7] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In ECCV, pages 391-407, 2016.

[8] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In CVPR, 2018.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.

[10] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.

[11] Huaibo Huang, Ran He, Zhenan Sun, and Tieniu Tan. Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution. In ICCV, pages 1689-1697, 2017.

[12] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197-5206, 2015.

[13] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1501-1510, 2017.

[14] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In CVPR, 2018.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448-456, 2015.

[16] Hui Ji and Cornelia Fermuller. Robust wavelet-based super-resolution reconstruction: theory and algorithm. IEEE TPAMI, (4):649-660, 2009.

[17] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646-1654, 2016.

[18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution.
In CVPR, pages 1637-1645, 2016.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.

[21] Neeraj Kumar, Ruchika Verma, and Amit Sethi. Convolutional neural networks for wavelet domain super resolution. Pattern Recognition Letters, pages 65-71, 2017.

[22] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pages 624-632, 2017.

[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, (11):2278-2324, 1998.

[24] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681-4690, 2017.

[25] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, 2017.

[26] Stephane Mallat. Wavelets for a vision. Proceedings of the IEEE, (4):604-614, 1996.

[27] M Dirk Robinson, Cynthia A Toth, Joseph Y Lo, and Sina Farsiu. Efficient Fourier-wavelet super-resolution. IEEE TIP, (10):2669-2681, 2010.

[28] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874-1883, 2016.

[29] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: A simple way to prevent neural networks from overfitting. JMLR, (1):1929-1958, 2014.

[30] Radomir S. Stanković and Bogdan J. Falkowski. The Haar wavelet transform: its status and achievements. Computers & Electrical Engineering, (1):25-44, 2003.

[31] Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In CVPR, 2017.

[32] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. MemNet: A persistent memory network for image restoration. In CVPR, pages 4539-4547, 2017.

[33] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, Lei Zhang, Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In CVPR Workshops, pages 1110-1121. IEEE, 2017.

[34] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven ways to improve example-based single image super resolution. In CVPR, pages 1865-1873, 2016.

[35] Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. Image super-resolution using dense skip connections. In ICCV, pages 4809-4817, 2017.

[36] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In ICCV, pages 370-378, 2015.

[37] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, (4):600-612, 2004.

[38] Yibo Yang, Zhisheng Zhong, Tiancheng Shen, and Zhouchen Lin. Convolutional neural networks with alternately updated clique. In CVPR, 2018.

[39] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818-833, 2014.

[40] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning.
In ICCV, pages 2018-2025, 2011.

[41] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711-730. Springer, 2010.

[42] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning a single convolutional super-resolution network for multiple degradations. In CVPR, 2018.