{"title": "SHE: A Fast and Accurate Deep Neural Network for Encrypted Data", "book": "Advances in Neural Information Processing Systems", "page_first": 10035, "page_last": 10043, "abstract": "Homomorphic Encryption (HE) is one of the most promising security solutions to emerging Machine Learning as a Service (MLaaS). Several Leveled-HE (LHE)-enabled Convolutional Neural Networks (LHECNNs) are proposed to implement MLaaS to avoid the large bootstrapping overhead. However, prior LHECNNs have to pay significant computational overhead but achieve only low inference accuracy, due to their polynomial approximation activations and poolings. Stacking many polynomial approximation activation layers in a network greatly reduces the inference accuracy, since the polynomial approximation activation errors lead to a low distortion of the output distribution of the next batch normalization layer. So the polynomial approximation activations and poolings have become the obstacle to a fast and accurate LHECNN model.\n\nIn this paper, we propose a Shift-accumulation-based LHE-enabled deep neural network (SHE) for fast and accurate inferences on encrypted data. We use the binary-operation-friendly leveled-TFHE (LTFHE) encryption scheme to implement ReLU activations and max poolings. We also adopt the logarithmic quantization to accelerate inferences by replacing expensive LTFHE multiplications with cheap LTFHE shifts. We propose a mixed bitwidth accumulator to expedite accumulations. Since the LTFHE ReLU activations, max poolings, shifts and accumulations have small multiplicative depth, SHE can implement much deeper network architectures with more convolutional and activation layers. 
Our experimental results show that SHE achieves state-of-the-art inference accuracy and reduces the inference latency by 76.21% ~ 94.23% over prior LHECNNs on MNIST and CIFAR-10.", "full_text": "SHE: A Fast and Accurate Deep Neural Network for Encrypted Data*\n\nQian Lou (Indiana University Bloomington, louqian@iu.edu)\n\nLei Jiang (Indiana University Bloomington, jiang60@iu.edu)\n\nAbstract\n\nHomomorphic Encryption (HE) is one of the most promising security solutions for emerging Machine Learning as a Service (MLaaS). Leveled-HE (LHE)-enabled Convolutional Neural Networks (LHECNNs) have been proposed to implement MLaaS while avoiding the large bootstrapping overhead. However, prior LHECNNs incur significant computing overhead yet achieve only low inference accuracy, because of their polynomial approximation activations and poolings. Stacking many polynomial approximation activation layers in a network greatly reduces the inference accuracy, since the polynomial approximation activation errors lead to a large distortion of the output distribution of the next batch normalization layer. The polynomial approximation activations and poolings have therefore become the obstacle to a fast and accurate LHECNN model.\n\nIn this paper, we propose a Shift-accumulation-based LHE-enabled deep neural network (SHE) for fast and accurate inferences on encrypted data. We use the binary-operation-friendly Leveled Fast Homomorphic Encryption over the Torus (LTFHE) encryption scheme to implement ReLU activations and max poolings. We also adopt logarithmic quantization to accelerate inferences by replacing expensive LTFHE multiplications with cheap LTFHE shifts. We propose a mixed bitwidth accumulator to accelerate accumulations. Since the LTFHE ReLU activations, max poolings, shifts and accumulations have small multiplicative depth overhead, SHE can implement much deeper network architectures with more convolutional and activation layers. 
Our experimental results show that SHE achieves state-of-the-art inference accuracy and reduces the inference latency by 76.21% ~ 94.23% over prior LHECNNs on MNIST and CIFAR-10. The source code of SHE is available at https://github.com/qianlou/SHE.\n\n* This work was supported in part by NSF CCF-1908992 and CCF-1909509.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nMachine Learning as a Service (MLaaS) is an effective way for clients to run their computationally expensive Convolutional Neural Network (CNN) inferences [1] on powerful cloud servers. During MLaaS, cloud servers can access clients' raw data, and hence potentially introduce privacy risks. So there is a strong and urgent need to ensure the confidentiality of healthcare records, financial data and other sensitive information that clients upload to cloud servers. Recent works [1, 2, 3, 4, 5] employ Leveled Homomorphic Encryption (LHE) to perform CNN inferences over encrypted data. During LHE-enabled MLaaS, a client encrypts the sensitive data and sends the encrypted data to a server. Since only the client has the private key, the server can decrypt neither the input nor the output. The server produces an encrypted inference output and returns it to the client. Client privacy is preserved in this pipeline for both inputs and outputs.\n\nHowever, prior LHE-enabled CNNs (LHECNNs) [1, 2, 3, 4, 5] suffer from low inference accuracy and long inference latency. TAPAS [5] and DiNN [1] adopt only 1-bit weights, inputs and sign activations. They degrade inference accuracy by 3% ~ 6.2% even on the tiny hand-written digit dataset MNIST. Since HE supports only polynomial computations, CryptoNet [2], NED [3], n-GraphHE [6], Lola [7] and Faster Cryptonet [4] have to use low-degree polynomial activations rather than ReLU, and thus fail to obtain state-of-the-art inference accuracy. 
For instance, Faster Cryptonet achieves only 76.72% inference accuracy on the CIFAR-10 dataset, while the inference accuracy of an unencrypted CNN model with ReLU activations is 93.72%. Although it is possible to improve the encrypted inference accuracy by enlarging the degree of the polynomial approximation activations, the computing overhead increases exponentially with the degree. Even with a degree-2 square activation, prior LHECNNs spend hundreds of seconds on a single inference over an encrypted MNIST image. Moreover, the polynomial approximation activation (PAA) is not compatible with a deep network consisting of many convolutional and activation layers. Stacking many convolutional and PAA layers in a CNN actually decreases inference accuracy [8], since the PAA approximation errors lead to a large distortion of the output distribution of the next batch normalization layer. As a result, no prior LHECNN fully supports deep CNNs that can infer the large ImageNet. For instance, Faster Cryptonet [4] can compute only a part (i.e., 3 convolutional layers with square activations) of a CNN inferring ImageNet on a server, but leaves the other part of the CNN to the client.\n\n2 Background and Motivation\n\nThreat model. In the MLaaS paradigm, a risk inherent to data transmission exists when clients send sensitive input data or servers return sensitive output results back to clients. Although a strong encryption scheme can be used to encrypt the data sent to the cloud, untrusted servers [2, 3, 4] can still leak data. HE is one of the most promising encryption techniques to enable a server to do CNN inference over encrypted data. A client sends encrypted data to a server, which performs encrypted inference without decrypting the data or accessing the client's private key. Only the client can decrypt the inference result using the private key.\n\nHomomorphic Encryption. 
An encryption scheme defines an encryption function ε() encoding data (plaintexts) into ciphertexts (encrypted data), and a decryption function δ() mapping ciphertexts back to plaintexts (original data). In a public-key encryption scheme, the ciphertext of x can be computed as ε(x, kpub), where kpub is the public key. The decryption can be done through δ(ε(x, kpub), kpri) = x, where kpri is the private key. An encryption scheme is homomorphic in an operation ◇ if there is another operation ⊕ such that ε(x, kpub) ⊕ ε(y, kpub) = ε(x ◇ y, kpub). To prevent threats on untrusted servers, fully HE [9] enables an unlimited number of computations over ciphertexts. However, each computation introduces noise into the ciphertexts. After a certain number of computations, the noise grows so large that the ciphertexts cannot be decrypted successfully. A bootstrapping [9] is required to keep the noise in check without decrypting. Unfortunately, the bootstrapping is extremely slow, due to its high computational complexity. Leveled HE (LHE) [2] is proposed to accelerate encrypted computations without bootstrapping, but LHE can compute polynomial functions of only a maximal degree on the ciphertexts. Before applying LHE, the complexity of the target arithmetic circuit processing the ciphertexts must be known in advance. Compared to Multi-Party Computation (MPC) [10, 11], LHE has much lower communication overhead. For example, the MPC-based MLaaS DeepSecure [11] has to exchange 722GB of data between the client and the server for only a 5-layer CNN inference on one MNIST image.\n\nTFHE. TFHE [10] is an HE cryptosystem that expresses ciphertexts over the torus modulo 1. It supports both fully and leveled HE schemes. Like the other HE cryptosystems, including BFV/BGV [7], FV-RNS [4] and HEAAN [6], TFHE is based on Ring-LWE, but it can perform very fast binary operations over encrypted binary bits. 
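The homomorphic property defined above, ε(x) ⊕ ε(y) = ε(x ◇ y), can be illustrated with textbook RSA, which is homomorphic in multiplication. The sketch below uses tiny, insecure demo parameters of our own choosing; it only illustrates the abstract property, not the TFHE scheme used by SHE:

```python
# Toy illustration of the homomorphic property: E(x) (*) E(y) = E(x * y).
# Textbook RSA is multiplicatively homomorphic. The tiny primes below are
# insecure and for demonstration only -- this is NOT the TFHE scheme of SHE.

def rsa_keygen():
    p, q = 61, 53                      # toy primes (insecure)
    n, phi = p * q, (p - 1) * (q - 1)
    e = 17                             # public exponent, coprime to phi
    d = pow(e, -1, phi)                # private exponent (Python 3.8+)
    return (e, n), (d, n)

def encrypt(m, pub):
    e, n = pub
    return pow(m, e, n)

def decrypt(c, priv):
    d, n = priv
    return pow(c, d, n)

pub, priv = rsa_keygen()
x, y = 7, 6
# Multiplying two ciphertexts yields a ciphertext of the plaintext product,
# so the server computes on data it cannot read.
c = (encrypt(x, pub) * encrypt(y, pub)) % pub[1]
assert decrypt(c, priv) == x * y
```

TFHE provides the analogous property for Boolean gates over encrypted bits rather than for modular multiplication.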
Therefore, unlike the other HE cryptosystems, which approximate the activation by expensive polynomial operations, TFHE can naturally implement ReLU activations and max poolings by Boolean operations. In this paper, we use TFHE for SHE without the batching technique [10]. Although ciphertext batching may greatly improve the LHECNN inference throughput by packing multiple (e.g., 8K) inputs into one homomorphic operation, the batching has to select more numerous and restricted NTT points, force specific computations away from the NTT, and add large computing overhead [4]. Moreover, it is difficult for a client to batch 8K requests together sharing the same secret key. In fact, a client often needs to do inferences on only a few images [7].\n\nMotivation. As Figure 1 describes, prior LHECNNs suffer from low inference accuracy and long inference latency because of their polynomial approximation activations (PAAs) and poolings. CryptoNet (CNT) [2] and Faster CryptoNet (FCN) [4] add a batch normalization layer before each\n\n(a) The inference latency and accuracy. (b) PAA with various degrees (D) on FCN. (c) FCN with various PAA layer (L) #s. (d) The output of the 7th BN layer.\n\nFigure 1: The inference latency and accuracy of prior LHECNNs inferring MNIST (Conv: convolutional layers; FC: fully-connected layers; Pooling: pooling layers; and Activation: batch normalization (BN) and activation layers).\n\n
Name | Cryptosys. | ReLU | MaxPool | NoConv | DeepNet\nCNT [2] | YASHE | No | No | No | No\nNED [3] | BGV | No | No | No | No\nFCN [4] | FV-RNS | No | No | No | No\nDiNN [1] | TFHE | No | No | Yes | No\nTAPAS [5] | TFHE | No | No | Yes | No\nGHE [6] | HEAAN | No | No | No | No\nLola [7] | BFV | No | No | No | No\nSHE | TFHE | Yes | Yes | Yes | Yes\n\nTable 1: The comparison of LHECNNs.\n\nFigure 2: A 3-bit ReLU unit (X: input; R: output) and a 3-bit max pooling unit (X and Y: inputs; Z: output).\n\nactivation layer. They use square (degree-2 polynomial) activations to replace ReLUs, and mean poolings to replace max poolings. As Figure 1(a) shows, the square activation reduces inference accuracy by 0.82% compared to unencrypted models on MNIST. Although increasing the degree of the PAAs improves their inference accuracy, the computing overhead of the PAA layers increases exponentially, as shown in Figure 1(b). NED [3] uses high-degree PAAs to obtain nearly no accuracy loss, but its inference latency is much longer than CNT's. For CNT, NED and FCN, PAA layers occupy 65.7% ~ 81.2% of the total inference latency. Integrating more convolutional and PAA layers into a LHECNN cannot improve inference accuracy either. As Figure 1(c) shows, an unencrypted CNN achieves higher inference accuracy with more layers, but the inference accuracy of the LHE-enabled FCN decreases when it has more convolutional and PAA layers. This is because the square approximation errors result in a large distortion of the output distribution of the next batch normalization layer. 
Figure 1(d) illustrates such an output distribution distortion at the 7th batch normalization layer in a 10-layer LHE-enabled FCN.\n\nComparison against prior works. Table 1 compares our SHE against prior LHECNN models including CryptoNet (CNT) [2], NED [3], n-GraphHE (GHE) [6], Lola [7], Faster CryptoNet (FCN) [4], DiNN [1] and TAPAS [5]. Due to the limitations of non-TFHE schemes, CNT, NED, GHE, Lola and FCN cannot support ReLU and max pooling operations. They are also slowed down by computationally expensive homomorphic multiplications. Particularly, Lola encodes each weight of a LHECNN as a separate message for encryption batches to reduce inference latency, but its inference latency is still significant due to the PAA computations and homomorphic multiplications. Although DiNN [1] and TAPAS [5] binarize weights, inputs and activations to avoid homomorphic multiplications, their inference accuracy decreases significantly.\n\n3 SHE\n\n3.1 ReLU Activation and Max Pooling\n\nThe rectifier, ReLU(x) = max(x, 0), is one of the most well-known activation functions, while max pooling is a sample-based discretization process heavily used in state-of-the-art CNN models. Prior BFV [2]- and FV-RNS [4]-based LHECNNs approximate both ReLU(x) and max pooling by low-degree polynomials, leading to significant inference accuracy degradation and huge computing overhead. Prior TFHE-based CNN models [1, 5] use the 1-bit sign() function to implement binarized activations, which introduce large inference accuracy loss.\n\nIn this paper, we propose an accurate homomorphic ReLU function and a homomorphic max pooling operation. TFHE can implement any 2-input binary homomorphic operation [9, 10], e.g., AND, OR, NOT, and MUX, over encrypted binary data by a deterministic automaton, a Boolean gate or a look-up table. In this way, as Figure 2 exhibits, we can connect the TFHE homomorphic Boolean gates to construct a ReLU unit and a max pooling unit. 
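As a concrete illustration of such gate-level units, the sketch below simulates the Boolean circuits of Figure 2 on plaintext bits. Each lambda stands in for one homomorphic TFHE gate; the 4-bit two's-complement width and the helper names are our own choices for illustration, not the paper's implementation:

```python
# Plaintext simulation of the gate-level ReLU and max units sketched in
# Figure 2. Each lambda models one 2-input homomorphic TFHE gate; here the
# "ciphertexts" are plain bits, so only the circuit structure is shown.

AND = lambda a, b: a & b
OR  = lambda a, b: a | b
XOR = lambda a, b: a ^ b
NOT = lambda a: 1 - a
MUX = lambda s, a, b: OR(AND(s, a), AND(NOT(s), b))  # select a if s else b

def to_bits(v, n):
    """Two's-complement encoding, LSB first."""
    return [(v >> i) & 1 for i in range(n)]

def from_bits(bits):
    v = sum(b << i for i, b in enumerate(bits))
    return v - (1 << len(bits)) if bits[-1] else v

def relu_unit(x):
    """Zero every bit when the sign bit is set: one MUX gate per bit."""
    sign = x[-1]
    return [MUX(sign, 0, b) for b in x]

def ge_unit(x, y):
    """x >= y for two's complement: unsigned compare with inverted sign bits."""
    xs, ys = x[:-1] + [NOT(x[-1])], y[:-1] + [NOT(y[-1])]
    eq, gt = 1, 0
    for xb, yb in zip(reversed(xs), reversed(ys)):   # scan MSB -> LSB
        gt = OR(gt, AND(eq, AND(xb, NOT(yb))))
        eq = AND(eq, NOT(XOR(xb, yb)))
    return OR(gt, eq)

def max_unit(x, y):
    """Max pooling of two values: the comparator output drives per-bit MUXes."""
    s = ge_unit(x, y)
    return [MUX(s, xb, yb) for xb, yb in zip(x, y)]

n = 4
assert from_bits(relu_unit(to_bits(-3, n))) == 0
assert from_bits(relu_unit(to_bits(5, n))) == 5
assert from_bits(max_unit(to_bits(-3, n), to_bits(2, n))) == 2
```

In SHE, every call to one of these lambdas would instead be a leveled-TFHE homomorphic gate over encrypted bits.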
A > 2-input TFHE gate can be divided into multiple 2-input TFHE gates.\n\n3.2 Logarithmic Quantization\n\nWhen we implemented Faster CryptoNet (FCN) with TFHE Boolean gates, as Figure 1(a) shows, we observed that although the TFHE-version FCN model (TCN) has the same network topology as FCN, its inference latency is much larger. This is because TFHE is not designed or optimized for homomorphic matrix multiplications or other repetitive tasks. So the convolutional and fully-connected layers of TCN have become the new latency bottleneck.\n\nTo reduce the computing overhead of homomorphic matrix multiplications, we logarithmically quantize the floating-point weights into their power-of-2 representations [12], so that we can replace all homomorphic multiplications in a LHECNN inference by homomorphic shifts and homomorphic accumulations. Prior works [13, 12] suggest that logarithmic quantization, even with weight pruning, still achieves the same inference accuracy as full-precision models. In a logarithmically quantized CNN model, weight^T · input is approximately equivalent to Σ_{i=1}^{n} input_i × 2^{weightQ_i}, and hence can be converted to Σ_{i=1}^{n} binaryshift(input_i, weightQ_i), where weightQ_i = Quantize(log2(weight_i)), Quantize(x) quantizes x to the closest integer, and binaryshift(a, b) shifts a by b bits in fixed-point arithmetic.\n\n(a) A TFHE shift. (b) A shift in the other HE schemes. (c) A mixed-bitwidth accumulator.\n\nFigure 3: The logarithmic quantization in different HE cryptosystems.\n\nAs Figure 3(a) highlights, a TFHE arithmetic shift operation is computationally cheap. TFHE encrypts the plaintext bit by bit. 
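In plaintext fixed-point arithmetic, the shift-accumulate substitution above can be sketched as follows. The helper names (`quantize_log2`, `shift_acc_dot`) are ours, and a real SHE inference would perform each shift and addition homomorphically:

```python
import math

# Sketch of the logarithmic quantization above: each weight becomes a signed
# power of two, so every multiply in a dot product is replaced by a binary
# shift plus an accumulation. Plaintext illustration only.

def quantize_log2(w):
    """weightQ = Quantize(log2(|w|)); the sign is kept separately."""
    assert w != 0
    return (1 if w > 0 else -1), int(round(math.log2(abs(w))))

def shift_acc_dot(inputs, weights):
    """Approximate dot(weights, inputs) using only shifts and adds."""
    acc = 0
    for x, w in zip(inputs, weights):
        sign, q = quantize_log2(w)
        shifted = x << q if q >= 0 else x >> -q   # binaryshift(x, weightQ)
        acc += sign * shifted                     # accumulation, no multiply
    return acc

inputs = [3, 5, 2]
weights = [4, 2, -8]          # already powers of two -> exact result
assert shift_acc_dot(inputs, weights) == 3*4 + 5*2 + 2*(-8)
```

Weights that are not exact powers of two are rounded to the nearest one, which is the source of the small quantization error the paper cites from [12].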
Moreover, it keeps the bit order of the ciphertext the same as that of the plaintext. A TFHE arithmetic shift just copies the encrypted sign bit and the encrypted data bits to the shifted positions. It takes only ~ 100ns for a CPU core to complete a TFHE arithmetic shift operation. On the contrary, in the other HE schemes, e.g., BFV/BGV [7], FV-RNS [4] and HEAAN [6], a homomorphic arithmetic shift is equivalent to a homomorphic multiplication, as they encrypt each floating-point number of the plaintext as a whole, as shown in Figure 3(b). Compared to TFHE, a homomorphic arithmetic shift in the other cryptosystems is much more computationally expensive.\n\n3.3 Mixed Bitwidth Accumulator\n\nBesides homomorphic shift operations, the logarithmically quantized CNN models also require accumulations to do inferences. The computing overhead of a TFHE adder is proportional to its bitwidth. For instance, compared to its 4-bit counterpart, an 8-bit TFHE adder doubles the computing overhead. We quantize the weights into their 5-bit power-of-2 representations, since recent works [12] show that a 5-bit logarithmically quantized AlexNet model degrades the inference accuracy by only < 0.6%. However, the accumulation intermediate results have to be represented by 16 bits; otherwise, the inference accuracy may dramatically decrease. Accumulating 5-bit shifted results through 16-bit TFHE adders is computationally expensive.\n\nName | Topology | MD overhead (K) | Total | Acc(%)\nFCN | C-B-A-P-C-P-F-B-A-F | 1.4-0.3-3-2.4-1.4-2.4-3.1-0.3-4.6-2 | 21K | 98.71\nTCN | C-B-A-P-C-P-F-B-A-F | 1.4-0.3-0.1-0.2-0.2-0.2-3.1-0.3-0.1-2 | 9.0K | 99.54\nSHE | C-B-A-P-C-P-F-B-A-F | 0.2-0.3-0.1-0.2-0.2-0.2-0.4-0.3-0.1-0.3 | 2.0K | 99.54\nDSHE | [C-B-A-C-B-A-P]x4-F-F | [0.2-0.3-0.1-0.2-0.3-0.1-0.1]x4-0.7-0.5 | 6.8K | 99.77\n\nTable 2: The multiplicative depth (MD) overhead of prior LHECNNs. 
In the column of topology, C means a convolutional layer; B is a batch normalization layer; A indicates an activation layer; P denotes a pooling layer; and F is a fully-connected layer. Acc is the inference accuracy.\n\nTherefore, we propose a mixed bitwidth accumulator, shown in Figure 3(c), to avoid the unnecessary computing overhead. A mixed bitwidth accumulator is an adder tree, where each node is a TFHE adder, and the TFHE adders at different levels of the tree have distinct bitwidths. At the bottom level (layer0) of the tree, we use b-bit TFHE adders, where b is 5. Instead of 16-bit adders, layer1 of the tree has (b + 1)-bit TFHE adders, since the sum of two 5-bit integers can have at most 6 bits. The nth level of the mixed bitwidth accumulator consists of (b + n)-bit TFHE adders.\n\n3.4 Deeper Neural Network\n\nA LHE scheme with a set of predefined parameters allows only a fixed multiplicative depth (MD) budget, which is the maximal number of consecutively executed homomorphic AND gates in the LHE Boolean circuit [14]. The total MD overhead of an n-layer LHECNN is the sum of the MD overhead of each layer, i.e., Σ_{i=0}^{n} LMD_i, where LMD_i is the MD overhead of the ith layer. LMD_i can be defined as\n\nLMD_i = IN · log2(KC^2) · (DA[a] + DM[b]) + DR[c] + log2(KP^2) · DM[d]   (1)\n\nwhere IN is the number of input channels; KC means the weight kernel size; DA[a] is the MD overhead of an a-bit adder; DM[b] indicates the MD overhead of a b-bit multiplier; DR[c] is the MD overhead of a c-bit ReLU unit; KP denotes the pooling kernel size; and DM[d] is the MD overhead of a d-bit max pooling unit. The LHECNN model must guarantee that its MD budget is no smaller than its total MD overhead; otherwise, its inference output cannot be successfully decrypted. Although it is possible to enlarge the LHECNN MD budget by re-designing the LHE parameters, the new LHE parameters increase the ciphertext size and prolong the overall inference latency of the LHECNN. 
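The accounting of Eq. (1) and the budget check can be made concrete with a short sketch; the per-unit depth costs below are hypothetical placeholders of our own, not measured values from the paper:

```python
import math

# Sketch of the per-layer multiplicative-depth (MD) accounting of Eq. (1).
# The depth costs d_add, d_mul, d_relu, d_pool are hypothetical placeholders,
# not measurements from the paper.

def layer_md(in_ch, kc, kp, d_add, d_mul, d_relu, d_pool):
    """LMD_i = IN * log2(KC^2) * (DA[a] + DM[b]) + DR[c] + log2(KP^2) * DM[d]."""
    return (in_ch * math.log2(kc ** 2) * (d_add + d_mul)
            + d_relu + math.log2(kp ** 2) * d_pool)

def total_md(layers):
    """The network's total MD overhead is the sum over its layers."""
    return sum(layer_md(**layer) for layer in layers)

# Two hypothetical conv+ReLU+pooling layers; shift-based convolutions
# contribute d_mul = 0, which is what keeps SHE's total depth small.
layers = [
    dict(in_ch=1,  kc=3, kp=2, d_add=6, d_mul=0, d_relu=7, d_pool=8),
    dict(in_ch=16, kc=3, kp=2, d_add=6, d_mul=0, d_relu=7, d_pool=8),
]
MD_BUDGET = 32_000   # the LTFHE budget the paper quotes later
assert total_md(layers) <= MD_BUDGET   # otherwise decryption would fail
```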
Considering that the inference latency of prior LHECNNs has already reached hundreds of seconds, enlarging the LHECNN MD budget is not an attractive way to achieve a deeper network architecture. In fact, prior LHECNNs have huge MD overhead in each layer, since they perform matrix multiplications and square activations, i.e., DM[b] and DR[c] are large. As Table 2 shows, in a FCN inferring MNIST, each convolutional or fully-connected layer has 1K ~ 2K MD overhead, while each activation or pooling layer has 2K ~ 5K MD overhead. As a result, FCN can support only shallow architectures with fewer layers. Moreover, FCN has to reduce the number of activation layers by using larger weight kernel sizes and adding more fully-connected layers. As a result, FCN achieves only 98.71% accuracy when inferring MNIST.\n\nTo obtain a deep network architecture with more layers under a fixed MD budget, we need to reduce the MD overhead of each layer (LMD_i). We created a network (TCN) with the same architecture as FCN using the LTFHE cryptosystem, i.e., we used our proposed ReLU units, max pooling units, TFHE adders and multipliers to implement the network architecture of FCN in TCN. The total MD overhead of TCN inferring on MNIST is reduced to 9.0K in Table 2, since the MD overhead of each activation or pooling layer is greatly reduced by TFHE. When we further applied the logarithmic quantization, TFHE shifters and mixed bitwidth accumulators, the total MD overhead of SHE decreased to only 2032. This is because the TFHE shifter has zero MD overhead, while our mixed bitwidth accumulator also has small MD overhead along the carry path. We further propose a deeper SHE (DSHE) architecture by adding more convolutional and activation layers. Compared to the 10-layer FCN, our 30-layer DSHE increases the MNIST inference accuracy to 99.77% with a MD overhead of 6.8K.\n\n4 Experimental Methodology\n\nTFHE setting and security analysis. Our techniques are implemented with the TFHE library [9]. 
We used all 3 levels of TFHE in the LHE mode. The level-0 TLWE has a minimal noise standard deviation α = 6.10 · 10^-5, a coefficient count n = 500, and a security degree λ = 194 [9]. The level-1 TRLWE configures a minimal noise standard deviation α = 3.29 · 10^-10, a coefficient count n = 1024, and a security degree λ = 152. The level-2 TRGSW sets a minimal noise standard deviation α = 1.42 · 10^-10, a coefficient count n = 2048, and a security degree λ = 289. We adopted the same key-switching parameters as [9]. Therefore, SHE allows a 32K depth of the homomorphic AND primitive in the LHE mode [9]. The security degree of SHE is λ = 152.\n\nSimulation, benchmark and dataset. We ran all experiments on an Intel Xeon E7-4850 CPU with 1TB DRAM. Our datasets include MNIST, CIFAR-10, ImageNet and the diabetic retinopathy dataset [4] (denoted by medicare). Medicare comprises retinal fundus images labeled by condition severity: 'none', 'mild', 'moderate', 'severe', or 'proliferative'. We scaled the medicare images to the same size as ImageNet images, 224x224. We also adopted the Penn Treebank dataset [15] to evaluate a LTFHE-enabled LSTM model.\n\nDataset | Network Topology\nMNIST | SHE: C-B-A-P-C-P-F-B-A-F [2]; DSHE: [C-B-A-C-B-A-P]x4-F-F [8]\nCIFAR-10 | SHE: [C-B-A-C-B-A-P]x3-F-F [8]; DSHE: ResNet-18\nPenn Treebank | LSTM: time-step 25, one 300-unit layer; ReLU [15]\nImageNet & Medical | AlexNet, ResNet-18 and ShuffleNet [16]\n\nTable 3: Network topology (C means a convolutional layer; A is an activation layer; B is a batch normalization layer; P indicates a pooling layer; and F is a fully-connected layer).\n\nNetwork architecture. The network architectures we evaluated for the various datasets are summarized in Table 3. For MNIST, we set SHE the same as CNT [2] and DSHE the same as [8]. 
For CIFAR-10, we adopted the architecture of SHE from [8] and ResNet-18 as DSHE. To evaluate Penn Treebank, we used the LSTM architecture from [15], where the activations of all LSTM gates are converted to ReLU. Compared to the original LSTM with different activation functions, e.g., ReLU, sigmoid and tanh, the LSTM with all-ReLU activations has only < 1% accuracy degradation [15]. For the ImageNet and Medical datasets, we adopted AlexNet, ResNet-18 and ShuffleNet [16]. For all models, we quantized weights into 5-bit power-of-2 representations, and converted inputs and activations to 5-bit fixed-point numbers with little accuracy loss [12].\n\n5 Results and Analysis\n\nWe report the inference latency and accuracy of the LHECNN models, the numbers of homomorphic operations (HOPs), and the numbers of homomorphic gate operations (HGOPs) in Tables 4 ~ 7. Each homomorphic operation can be divided into multiple homomorphic gate operations in the TFHE cryptosystem. HOPs and HGOPs are independent of the hardware setting and software parallelization. 
Specifically, the HOPs include additions, multiplications, shifts and comparisons between ciphertexts and plaintexts. The multiplication between two ciphertexts (CCMult) is the most computationally expensive operation among all HOPs, since it requires the largest number of homomorphic gate operations (HGOPs). We compared SHE against the state-of-the-art LHECNN schemes including CryptoNets (CNT) [2], NED [3], Faster CryptoNets (FCN) [4], DiNN [1], nGraph-HE1 (GHE) [6] and Lola [7]. More details can be viewed in Table 1.\n\nScheme | HOPs | CCAdd | PCMult | CCMult | PCShift | CCCom | HGOPs | Depth | MSize | EDL(s) | FIL(s) | LIL(s) | Acc(%)\nCNT | 612K | 312K | 296K | 945 | 0 | 0 | - | 172K | 337MB | 47.5 | - | 250 | 98.95\nFCN | 67K | 38K | 24K | 945 | 0 | 0 | - | - | - | 6.7 | - | 39.1 | 98.71\nGHE | 612K | 312K | 296K | 945 | 0 | 0 | - | 134K | 368MB | 47.5 | - | 135 | 98.95\nLola | 24.6K | 12.5K | 10.5K | 1.6K | 0 | 0 | - | 21K | 411MB | 4.4 | - | 2.2 | 98.95\nDiNN | 16K | 8K | 8K | 40 | 0 | 0 | - | 80 | 66MB | 0.0002 | 40K | 0.49 | 93.71\nNED | 4.7M | 2.3M | 2.3M | 1.6K | 0 | 0 | - | - | - | 16.7 | - | 320 | 99.52\nTCN | 42K | 19K | 19K | 0 | 0 | 3K | 8.3M | 8.8K | 123MB | 0.0014 | 108K† | 88.6 | 99.54\nSHE | 23K | 19K | 945 | 0 | 19K | 3K | 856K | 2.0K | 123MB | 0.0014 | 11K† | 9.3 | 99.54\nDSHE | 613K | 304K | 5K | 0 | 304K | 5K | 11.6M | 6.2K | 123MB | 0.0014 | 149K† | 124.9 | 99.77\n\nTable 4: The MNIST results (HOPs: homomorphic operations; HGOPs: homomorphic gate operations; Depth: multiplicative gate depth; MSize: ciphertext message size; EDL: encryption and decryption latency; Acc: inference accuracy; FIL: fully-TFHE inference latency; LIL: leveled-TFHE inference latency; CCAdd: # of additions between two ciphertexts; PCMult: # of multiplications between a plaintext and a ciphertext; CCMult: # of multiplications between two ciphertexts; PCShift: # of shifts between a plaintext and a ciphertext; CCCom: # of comparisons (including ReLU and max pooling) between two ciphertexts. 
† denotes that we ran only its first 3 layers and projected the FIL values).\n\n5.1 MNIST\n\nAs Table 4 shows, CNT obtains 98.95% accuracy with degree-2 polynomial approximation activations (PAAs). The PAA errors prevent CNT from approaching the unencrypted state-of-the-art inference accuracy on MNIST. FCN slightly degrades the inference accuracy by 0.24% but shortens the inference latency by 84.3% by quantizing CNT and pruning its redundant weights. The weight pruning of FCN significantly decreases the numbers of CCAdd and PCMult. In contrast, FCN keeps the same number of activations, so it has the same number of CCMult, which occur only in activations. Both GHE and Lola reduce the inference latency by batching optimizations, but they still have to use PAAs and achieve the same inference accuracy as CNT. DiNN uses the TFHE cryptosystem and quantizes all network parameters (i.e., weights, inputs, and activations) to 1-bit, so it reduces the inference accuracy by 5.29%. By increasing the degree of its polynomial approximation activations, NED improves the inference accuracy by 0.58% over CNT. However, compared to CNT, it prolongs the inference latency by 28.2%, because it has many more CCAdd, PCMult and CCMult operations to compute during an inference.\n\nWe used the TFHE cryptosystem to implement the network architecture of FCN (denoted by TCN) with LTFHE-based ReLU activations, max poolings and matrix multiplications. Due to the ReLU activations and max poolings, TCN enhances the inference accuracy by 0.6% over FCN. But compared to FCN, it slows down the inference by 126.6%, since TFHE is not suitable for implementing massive repetitive matrix multiplications. To create the SHE scheme, we further use power-of-2 quantized weights and replace the matrix multiplications in TCN with LTFHE-based shift-accumulation operations. As a result, SHE maintains the same inference accuracy but greatly reduces the inference latency to 9.3s. 
We noticed that SHE requires only a 2.0K multiplicative depth (MD), which is much smaller than our LTFHE MD budget, so we can use a deeper network architecture (DSHE) with more convolutional and activation layers. DSHE obtains 99.77% accuracy and spends 124.9s on an inference. We also report the inference latency values of the fully-TFHE-based TCN, SHE and DSHE. Compared to their leveled-TFHE-based counterparts, they have much longer inference latency because of the computationally expensive bootstrapping operations.\n\nThe size of the encrypted message that a client sends to the cloud can be calculated by MSize = PN × SN × PS, where PN is the pixel number of the input image; SN means the number of polynomials in a ciphertext; and PS indicates the size of a polynomial. PN depends on the dataset, while SN and PS are related to the cryptosystem parameters. For a MNIST image, PN = 28 × 28 = 784. We quantized the inputs to 5 bits, so SHE encrypts each pixel in one MNIST image by 5 polynomials (SN = 5). In LTFHE, PS = 32KB. In total, one MNIST image is encrypted into 784 × 5 × 32KB = 122.5MB. Compared to the other non-TFHE cryptosystems, SHE transfers much less message data between the client and the server. 
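The message-size arithmetic above can be checked in a few lines; this is a sketch that simply plugs in the 32KB polynomial size and the 5-bit input width quoted above:

```python
# Sketch of the encrypted-message-size accounting: MSize = PN x SN x PS
# (pixels x polynomials per pixel x bytes per LTFHE polynomial).

def msize_bytes(pn, sn, ps=32 * 1024):
    """One LTFHE polynomial (PS = 32KB here) per encrypted input bit."""
    return pn * sn * ps

# A 28x28 MNIST image with 5-bit quantized pixels (SN = 5).
mnist_bytes = msize_bytes(pn=28 * 28, sn=5)
assert mnist_bytes / 2**20 == 122.5    # reproduces the 122.5MB figure above
```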
Typically, the encryption and decryption latency is proportional to the encrypted message size [4].\n\nScheme | HOPs | CCAdd | PCMult | CCMult | PCShift | CCCom | HGOPs | Depth | MSize | EDL(s) | FIL(s) | LIL(s) | Acc(%)\nGHE | 6.4M | 3.2M | 3.2M | 15K | 0 | 0 | - | - | 1.8GB | 61.6 | - | 1628 | 62.1\nLola | 137K | 61K | 61K | 15K | 0 | 0 | - | 69.8K | 1.6GB | 5.7 | - | 730 | 74.1\nNED | 2.4G | 1.2G | 1.2G | 212K | 0 | 0 | - | - | - | 21.8 | - | 127K† | 91.50\nFCN | 610M | 350M | 350M | 64K | 0 | 0 | - | - | - | 8.8 | - | 39K† | 76.72\nTCN | 8.7M | 4.4M | 4.4M | 0 | 0 | 16K | 2.8G | 25.1K | 160MB | 0.0018 | 37M† | 31K† | 92.54\nSHE | 4.4M | 4.4M | 13K | 0 | 4.4M | 16K | 211M | 5.2K | 160MB | 0.0018 | 2.7M† | 2258 | 92.54\nDSHE | 68M | 68M | 98K | 0 | 68M | 131K | 3.3G | 13.7K | 160MB | 0.0018 | 42.5M† | 12041 | 94.62\n\nTable 5: The CIFAR-10 results (abbreviations are the same as in Table 4; † denotes that we ran only its first 3 layers, while the FIL and LIL values are projections).\n\n5.2 CIFAR-10\n\nOnly NED, GHE, Lola and FCN can support CIFAR-10. As Table 5 shows, although GHE and Lola obtain shorter inference latency through shallow CNN architectures on CIFAR-10, they degrade the inference accuracy by > 20% compared to a full-precision unencrypted model, due to their PAA layers. Compared to NED with its high-degree polynomial approximation activations, FCN reduces the inference accuracy by 16.2% but shortens the inference latency by 69.1%. With the same architecture, TCN reduces the activation computing overhead, but it requires longer convolution latency; overall, it reduces the inference latency by 20.5%. Moreover, its ReLU activations and max poolings increase the inference accuracy to 92.54%. Compared to TCN, SHE reduces the number of PCMult by 99.7%. As a result, it reduces the inference latency by 92.7% over TCN by performing only LTFHE shift-accumulation operations. Because SHE requires only a 5.2K MD, which is much smaller than our 
Because SHE requires only 5.2K MD, which is much smaller than our LTFHE MD budget (32K), we can use a deeper network, DSHE, to increase the inference accuracy to 94.62%, at the cost of increasing the inference latency to 12041s.

Network | Scheme | HOPs  | CCAdd | PCMult | PCShift | CCCom | HGOPs | Depth | MSize | EDL(s) | FIL(s) | LIL(s) | AccI(%) | AccM(%)
AlexNet | TCN    | 0.3G  | 0.14G | 0.14G  | 0       | 0.66M | 38G   | 42.3K | 7.7GB | 0.07   | -      | -      | -       | -
AlexNet | SHE    | 0.14G | 0.14G | 0.4M   | 0.14G   | 0.34M | 5.5G  | 6.8K  | 7.7GB | 0.07   | 72M†   | 89K†   | 54.17   | 63.24
ResNet  | TCN    | 0.7G  | 0.36G | 0.36G  | 0       | 2.47M | 100G  | 96.4K | 7.7GB | 0.07   | -      | -      | -       | -
ResNet  | SHE    | 0.36G | 0.36G | 1.1M   | 0.36G   | 0.49M | 15G   | 13.7K | 7.7GB | 0.07   | 195M†  | 0.23M† | 66.8    | 67.29
Shuffle | TCN    | 56M   | 28M   | 28M    | 0       | 1.37M | 7.9G  | 27.1K | 7.7GB | 0.07   | 102M†  | 126K†  | 69.4    | 71.32
Shuffle | SHE    | 28M   | 28M   | 83K    | 28M     | 275K  | 1.1G  | 3.9K  | 7.7GB | 0.07   | 14M†   | 18K†   | 69.4    | 71.32

Table 6: The ImageNet and medical dataset results (AccI denotes the inference accuracy on ImageNet, while AccM denotes the inference accuracy on the medical dataset; the other abbreviations are the same as in Table 4; † denotes that we ran only the first 3 layers, while the FIL and LIL values are projections).

5.3 ImageNet and Medical Datasets
No prior LHECNN model can infer an entire ImageNet image, because of the prohibitive computing overhead; FCN [4] can compute only the last 3 layers of the model when inferring ImageNet. In Table 6, SHE uses the AlexNet, ResNet-18 and ShuffleNet architectures for inference on ImageNet. For AlexNet and ResNet-18, TCN needs >32K MD, which is larger than our LTFHE MD budget, so TCN cannot run them. In contrast, SHE needs 1 day and 2.5 days to test an ImageNet image with AlexNet and ResNet-18, respectively. Notably, SHE requires only 5 hours to infer an ImageNet image and achieves 69.4% top-1 accuracy with the ShuffleNet topology. On the medical dataset, it obtains 71.32% inference accuracy.
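The feasibility argument here is simply a comparison of each network's total multiplicative depth against the 32K LTFHE MD budget; a tiny helper (our sketch, with depth values as read from Table 6) makes the check explicit:

```python
# A network can be evaluated under LTFHE without bootstrapping only if
# its total multiplicative depth (MD) fits within the budget (32K in
# the paper's setting). Depth values below are taken from Table 6.

MD_BUDGET = 32_000

depths = {
    ("AlexNet", "TCN"): 42_300,   # > 32K -> infeasible under LTFHE
    ("AlexNet", "SHE"): 6_800,
    ("ResNet-18", "TCN"): 96_400, # > 32K -> infeasible under LTFHE
    ("ResNet-18", "SHE"): 13_700,
}

def feasible(depth: int, budget: int = MD_BUDGET) -> bool:
    """True when the network's MD fits within the LTFHE MD budget."""
    return depth <= budget

for (net, scheme), d in depths.items():
    status = "feasible" if feasible(d) else "infeasible"
    print(f"{net}/{scheme}: MD={d} -> {status}")
```

This is why TCN cannot run AlexNet or ResNet-18 at all, while SHE's shift-based layers leave ample slack in the budget.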
Besides shorter latency and higher accuracy, SHE enables a much deeper architecture under a fixed LTFHE MD budget, since the LTFHE shifts add little MD.

Scheme | HOPs | CCAdd | PCMult | PCShift | CCCom | HGOPs | Depth | MSize | EDL(s) | FIL(s) | LIL(s) | PPW
TCN    | 576K | 270K  | 270K   | 0       | 36K   | 75.7M | 143K  | 9.4MB | 0.014  | -      | -      | -
SHE    | 336K | 270K  | 30.4K  | 243K    | 36K   | 24.4M | 30K   | 9.4MB | 0.014  | 318K†  | 576    | 89.8

Table 7: The Penn Treebank results (abbreviations are the same as in Table 4; † denotes that we ran only the first 3 layers, while the FIL values are projections).

5.4 Penn Treebank
Figure 4: The MD overhead breakdown of the ReLU-based LSTM [15] under (a) TCN and (b) SHE (* denotes matrix multiplication, + addition, and << shift); each LSTM layer repeats for 25 timesteps.

No prior LHECNN model supports LSTMs, since they have a huge MD overhead. As Figure 4(a) shows, the MD overhead of the matrix multiplications between Ht-1 and Xt across all time steps is 4.7K × 25 = 117.5K, which is already larger than the 32K LTFHE MD budget, so TCN cannot implement the LSTM architecture. As Figure 4(b) shows, we use SHE to build an LSTM network that predicts the next word on Penn Treebank. SHE uses LTFHE shifts and accumulations with small MD overhead to replace the computationally expensive matrix multiplications, so it costs only 0.5K MD to multiply Ht-1 with Xt. In Table 7, SHE costs 30K MD in total, which is smaller than our LTFHE MD budget. The LSTM achieves 89.8 Perplexity Per Word (PPW) on Penn Treebank; compared to the full-precision LSTM on plaintext data, it degrades the inference accuracy by only 2.1%. SHE takes 576s to conduct an inference on Penn Treebank.
6 Conclusion
In this paper, we propose SHE, a fast and accurate LTFHE-enabled deep neural network, which consists of a ReLU unit, a max pooling unit and a mixed-bitwidth accumulator. Our experimental results show that SHE achieves state-of-the-art inference accuracy and reduces the inference latency by 76.21% ~ 94.23% over prior LHECNNs on various datasets.
SHE is the first LHE-enabled model that can support deep CNN architectures on ImageNet and LSTM architectures on Penn Treebank.
References
[1] Florian Bourse et al. Fast Homomorphic Evaluation of Deep Discretized Neural Networks. In Advances in Cryptology, 2018.
[2] Nathan Dowlin et al. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. In International Conference on Machine Learning, 2016.
[3] Ehsan Hesamifard et al. Deep Neural Networks Classification over Encrypted Data. In ACM Conference on Data and Application Security and Privacy, 2019.
[4] Edward Chou et al. Faster CryptoNets: Leveraging Sparsity for Real-World Encrypted Inference. arXiv, abs/1811.09953, 2018.
[5] Amartya Sanyal et al. TAPAS: Tricks to Accelerate (encrypted) Prediction As a Service. arXiv, abs/1806.03461, 2018.
[6] Fabian Boemer et al. nGraph-HE: A Graph Compiler for Deep Learning on Homomorphically Encrypted Data. In ACM International Conference on Computing Frontiers, 2019.
[7] Alon Brutzkus et al. Low Latency Privacy Preserving Inference. In International Conference on Machine Learning, 2019.
[8] Hervé Chabanne et al. Privacy-Preserving Classification on Deep Neural Network. IACR Cryptology, 2017.
[9] Ilaria Chillotti et al. TFHE: Fast Fully Homomorphic Encryption over the Torus. IACR Cryptology, 2018.
[10] Christina Boura et al. Chimera: A Unified Framework for B/FV, TFHE and HEAAN Fully Homomorphic Encryption and Predictions for Deep Learning. IACR Cryptology, 2018.
[11] Bita Darvish Rouhani et al. DeepSecure: Scalable Provably-Secure Deep Learning. In ACM/IEEE Design Automation Conference, 2018.
[12] E. H. Lee et al. LogNet: Energy-efficient Neural Networks Using Logarithmic Computation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
[13] Song Han et al.
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations, 2016.
[14] M. Sadegh Riazi et al. Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications. In ACM Asia Conference on Computer and Communications Security, 2018.
[15] Quoc V. Le et al. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv, abs/1504.00941, 2015.
[16] X. Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.