{"title": "Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation", "book": "Advances in Neural Information Processing Systems", "page_first": 3757, "page_last": 3769, "abstract": "Classification of personal text messages has many useful applications in surveillance, e-commerce, and mental health care, to name a few. Giving applications access to personal texts can easily lead to (un)intentional privacy violations. We propose the first privacy-preserving solution for text classification that is provably secure. Our method, which is based on Secure Multiparty Computation (SMC), encompasses both feature extraction from texts, and subsequent classification with logistic regression and tree ensembles. We prove that when using our secure text classification method, the application does not learn anything about the text, and the author of the text does not learn anything about the text classification model used by the application beyond what is given by the classification result itself. We perform end-to-end experiments with an application for detecting hate speech against women and immigrants, demonstrating excellent runtime results without loss of accuracy.", "full_text": "Privacy-Preserving Classi\ufb01cation of Personal Text\nMessages with Secure Multi-Party Computation: An\n\nApplication to Hate-Speech Detection\n\nDevin Reich1, Ariel Todoki1, Rafael Dowsley2, Martine De Cock1\u21e4, Anderson Nascimento1\n\n1 School of Engineering and Technology\n\nUniversity of Washington Tacoma\n\nTacoma, WA 98402\n\nrafael@dowsley.net\n\n{dreich,atodoki,mdecock,andclay}@uw.edu\n\n2Department of Computer Science\n\nBar-Ilan University, 5290002, Ramat-Gan, Israel\n\nAbstract\n\nClassi\ufb01cation of personal text messages has many useful applications in surveil-\nlance, e-commerce, and mental health care, to name a few. Giving applications\naccess to personal texts can easily lead to (un)intentional privacy violations. 
We propose the first privacy-preserving solution for text classification that is provably secure. Our method, which is based on Secure Multiparty Computation (SMC), encompasses both feature extraction from texts, and subsequent classification with logistic regression and tree ensembles. We prove that when using our secure text classification method, the application does not learn anything about the text, and the author of the text does not learn anything about the text classification model used by the application beyond what is given by the classification result itself. We perform end-to-end experiments with an application for detecting hate speech against women and immigrants, demonstrating excellent runtime results without loss of accuracy.

1 Introduction

The ability to elicit information through automated scanning of personal texts has significant economic and societal value. Machine learning (ML) models for classification of text such as e-mails and SMS messages can be used to infer whether the author is depressed [46], suicidal [42], a terrorist threat [1], or whether the e-mail is a spam message [2, 49]. Other valuable applications of text message classification include user profiling for tailored advertising [32], detection of hate speech [6], and detection of cyberbullying [51]. Some of the above are integrated in parental control applications^2 that monitor text messages on the phones of children and alert their parents when content related to drug use, sexting, suicide etc. is detected. Regardless of the clear benefits, giving applications access to one's personal text messages and e-mails can easily lead to (un)intentional privacy violations.
In this paper, we propose the first privacy-preserving (PP) solution for text classification that is provably secure.
To the best of our knowledge, there are no existing Differential Privacy (DP) or Secure Multiparty Computation (SMC) based solutions for PP feature extraction and classification of unstructured texts; the only existing method is based on Homomorphic Encryption (HE) and takes 19 minutes to classify a tweet [15], while leaking information about the text being classified. In our SMC based solution, there are two parties, nicknamed Alice and Bob (see Fig. 1). Bob has a trained ML model that can automatically classify texts. Our secure text classification protocol allows a personal text written by Alice to be classified with Bob's ML model in such a way that Bob does not learn anything about Alice's text and Alice does not learn anything about Bob's model. Our solution relies on PP protocols for feature extraction from text and for PP machine learning model scoring, which we propose in this paper.
We perform end-to-end experiments with an application for PP detection of hate speech against women and immigrants in text messages. In this use case, Bob has a trained logistic regression (LR) or AdaBoost model that flags hateful texts based on the occurrence of particular words. LR models on word n-grams have been observed to perform comparably to more complex CNN and LSTM model architectures for hate speech detection [35].

^* Guest Professor at Dept. of Applied Mathematics, Computer Science, and Statistics, Ghent University
^2 https://www.bark.us/, https://kidbridge.com/, https://www.webwatcher.com/

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Roles of Alice and Bob in SMC based text classification
Using our protocols, Bob can label Alice's texts as hateful or not without learning which words occur in Alice's texts, and Alice does not learn which words are in Bob's hate speech lexicon, nor how these words are used in the classification process. Moreover, classification is done in seconds, which is two orders of magnitude better than the existing HE solution, despite the fact that we use over 20 times more features and do not leak any information about Alice's text to the model owner (Bob). The solution based on HE leaks which words in the text are present in Bob's lexicon [15].
We build our protocols using a privacy-preserving machine learning (PPML) framework based on SMC developed by us^3. All the existing building blocks can be composed with each other or with new protocols added to the framework. On top of the existing building blocks, we also propose a novel protocol for binary classification over binary input features with an ensemble of decision stumps. While some of our building blocks have been previously proposed, the main contribution of this work consists of the careful choice of ML techniques, feature engineering, and algorithmic and implementation optimizations to enable end-to-end practical PP text classification. Additionally, we provide security definitions and proofs for our proposed protocols.

2 Preliminaries

We consider honest-but-curious adversaries, as is common in SMC based PPML (see e.g. [19, 21]). An honest-but-curious adversary follows the instructions of the protocol, but tries to gather additional information. Secure protocols prevent the latter.
We perform SMC using additive secret shares to do computations modulo an integer q. A value x is secret shared over Z_q = {0, 1, ..., q − 1} between parties Alice and Bob by picking x_A, x_B ∈ Z_q uniformly at random subject to the constraint that x = x_A + x_B mod q, and then revealing x_A to Alice and x_B to Bob.
We denote this secret sharing by [[x]]_q, which can be thought of as a shorthand for (x_A, x_B). Secret-sharing based SMC works by first having the parties split their respective inputs into secret shares and send some of these shares to each other. Naturally, these inputs have to be mapped appropriately to Z_q. Next, Alice and Bob represent the function they want to compute securely as a circuit consisting of addition and multiplication gates. Alice and Bob will perform secure additions and multiplications, gate by gate, over the shares until the desired outcome is obtained. The final result can be recovered by combining the final shares, and disclosed as intended, i.e. to one of the parties or to both. It is also possible to keep the final result distributed over shares.
In SMC based text classification, as illustrated in Fig. 1, Alice's input is a personal text x and Bob's input is an ML model M for text classification. The function that they want to compute securely is f(x, M) = M(x), i.e. the class label of x when classified by M. To this end, Alice splits the text into secret shares while Bob splits the ML model into secret shares. Both parties engage in a protocol in which they send some of the input shares to each other, do local computations on the shares, and repeat this process in an iterative fashion over shares of intermediate results (Step 1). At the end of the joint computations, Alice sends her share of the computed class label to Bob (Step 2), who combines it with his share to learn the classification result (Step 3).

^3 https://bitbucket.org/uwtppml
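The sharing, local gates, and triple-based multiplication used throughout the paper can be simulated in a few lines of Python. This is a single-process sketch, not the paper's implementation: the modulus q and the in-process stand-ins for the two parties and the trusted initializer are illustrative assumptions.

```python
import random

q = 2 ** 62  # illustrative modulus; q is a parameter of the protocol

def share(x):
    """Additively secret share x over Z_q between Alice and Bob."""
    xa = random.randrange(q)
    return xa, (x - xa) % q

def reconstruct(xa, xb):
    return (xa + xb) % q

def add(sx, sy):        # [[z]]_q <- [[x]]_q + [[y]]_q (local, no communication)
    return (sx[0] + sy[0]) % q, (sx[1] + sy[1]) % q

def add_const(sx, c):   # [[z]]_q <- [[x]]_q + c (only Alice adds c)
    return (sx[0] + c) % q, sx[1]

def mul_const(sx, c):   # [[z]]_q <- c [[x]]_q
    return (c * sx[0]) % q, (c * sx[1]) % q

def beaver_mul(sx, sy):
    """[[z]]_q <- [[x]]_q [[y]]_q via a Beaver multiplication triple (a, b, ab).
    Generating the triple here stands in for the trusted initializer."""
    a, b = random.randrange(q), random.randrange(q)
    sa, sb, sc = share(a), share(b), share(a * b % q)
    # The parties open d = x - a and e = y - b; since a and b are uniformly
    # random, d and e reveal nothing about x and y.
    d = (reconstruct(*sx) - a) % q
    e = (reconstruct(*sy) - b) % q
    # [[z]] = [[ab]] + d[[b]] + e[[a]] + d*e
    sz = add(sc, add(mul_const(sb, d), mul_const(sa, e)))
    return add_const(sz, (d * e) % q)
```

In a real run the triple is pre-distributed in shared form before the protocol starts, and d and e are opened share-wise between the parties.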
As mentioned above, the protocol for Step 1 involves representing the function f as a circuit of addition and multiplication gates. Given two secret sharings [[x]]_q and [[y]]_q, Alice and Bob can locally compute in a straightforward way a secret sharing [[z]]_q corresponding to z = x + y or z = x − y by simply adding/subtracting their local shares of x and y modulo q. Given a constant c, they can also easily locally compute a secret sharing [[z]]_q corresponding to z = cx or z = x + c: in the former case Alice and Bob just multiply their local shares of x by c; in the latter case Alice adds c to her share of x while Bob keeps his original share. These local operations will be denoted by [[z]]_q ← [[x]]_q + [[y]]_q, [[z]]_q ← [[x]]_q − [[y]]_q, [[z]]_q ← c[[x]]_q and [[z]]_q ← [[x]]_q + c, respectively. To allow for efficient secure multiplication of values via operations on their secret shares (denoted by [[z]]_q ← [[x]]_q [[y]]_q), we use a trusted initializer that pre-distributes correlated randomness to the parties participating in the protocol before the start of Step 1 in Fig. 1.^4 The initializer is not involved in any other part of the execution and does not learn any data from the parties. This can be straightforwardly extended to efficiently perform secure multiplication of secret shared matrices. The protocol for secure multiplication of secret shared matrices is denoted by π_DMM and the special case of inner-product computation by π_IP. Details about the (matrix) multiplication protocol can be found in [19]. We note that if a trusted initializer is not available or desired, Alice and Bob can engage in pre-computations to securely emulate the role of the trusted initializer, at the cost of introducing computational assumptions in the protocol [19].

3 Secure text classification

Our general protocol for PP text classification relies on several building blocks that are used together to accomplish Step 1 in Fig.
1: a secure equality test, a secure comparison test, private feature extraction, secure protocols for converting between secret sharing modulo 2 and modulo q > 2, and private classification protocols. Several of these building blocks have been proposed in the past. However, to the best of our knowledge, this is the very first time they are combined in order to achieve efficient text classification with provable security.
We assume that Alice has a personal text message, and that Bob has an LR or AdaBoost classifier that is trained on unigrams and bigrams as features. Alice constructs the set A = {a_1, a_2, ..., a_m} of unigrams and bigrams occurring in her message, and Bob constructs the set B = {b_1, b_2, ..., b_n} of unigrams and bigrams that occur as features in his ML model. We assume that all a_j and b_i are in the form of bit strings. To achieve this, Alice and Bob convert each unigram and bigram on their end to a number N using SHA-224 [44], strictly for its ability to map the same inputs to the same outputs in a pseudo-random manner. Next, Alice and Bob map each N on their end to a number between 0 and 2^ℓ − 1, i.e. a bit string of length ℓ, using a random function from the universal hash family proposed by Carter and Wegman [12].^5 In the remainder we use the term "word" to refer to a unigram or bigram, and we refer to the set B = {b_1, b_2, ..., b_n} as Bob's lexicon.
Below we outline the protocols for PP text classification. A correctness and security analysis of the protocols is provided as an appendix. In the description of the protocols in this paper, we assume that Bob needs to learn the result of the classification, i.e. the class label, at the end of the computations.
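The word-to-bit-string conversion described above can be sketched as follows. The constants p, a, and b are the ones reported in footnote 5; reducing the result into the range [0, 2^ℓ) so that it forms an ℓ-bit string is our reading of the footnote's final step.

```python
import hashlib

# Constants from footnote 5: p is a prime, a and b are random numbers < p.
P, A, B = 1_301_081, 972, 52_097

def word_to_bitstring(word, ell=13):
    """Map a unigram or bigram to an ell-bit integer: SHA-224 first
    (same input -> same pseudo-random output), then a Carter-Wegman
    style universal hash ((a*N + b) mod p) reduced into [0, 2^ell)."""
    n = int.from_bytes(hashlib.sha224(word.encode("utf-8")).digest(), "big")
    return ((A * n + B) % P) % (2 ** ell)
```

The paper uses ℓ = 13 bits when the features are unigrams and ℓ = 17 bits when bigrams are included (Section 4.2).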
It is important to note that the protocols described below can be straightforwardly adjusted to a scenario where Alice instead of Bob has to learn the class label, or even to a scenario where neither Alice nor Bob should learn what the class label is and instead it should be revealed to a third party or kept in secret shared form. All these scenarios might be relevant use cases of PP text classification, depending on the specific application at hand.

^4 This technique for secure multiplication was originally proposed by Beaver [7] and is regularly used to enable very efficient solutions both in the context of PPML [20, 17, 33, 19] as well as in other applications, e.g., [48, 28, 27, 38, 50, 18].
^5 The hash function is defined as ((a · N + b) mod p) mod 2^ℓ, where p is a prime and a and b are random numbers less than p. In our experiments, p = 1,301,081, a = 972, and b = 52,097.

3.1 Cryptographic building blocks

Secure Equality Test: At the start of the secure equality test protocol, Alice and Bob have secret shares of two bit strings x = x_ℓ ... x_1 and y = y_ℓ ... y_1 of length ℓ. x corresponds to a word from Alice's message and y corresponds to a feature from Bob's model. The bit strings x and y are secret shared over Z_2. Alice and Bob follow the protocol to determine whether x = y. The protocol π_EQ outputs a secret sharing of 1 if x = y and of 0 otherwise.
Protocol π_EQ:
• For i = 1, ..., ℓ, Alice and Bob locally compute [[r_i]]_2 ← [[x_i]]_2 + [[y_i]]_2 + 1.
• Alice and Bob use secure multiplication to compute a secret sharing of z = r_1 · r_2 · ... · r_ℓ. If x = y, then r_i = 1 for all bit positions i, hence z = 1; otherwise some r_i = 0 and therefore z = 0. The result is the secret sharing [[z]]_2, which is the desired output of the protocol.

This protocol for equality test is folklore in the field of SMC.
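The protocol can be simulated in a few lines of Python. The in-process Beaver-style AND below stands in for the secure multiplication whose correlated randomness the trusted initializer would supply; it is a sketch, not the paper's implementation.

```python
import random
from functools import reduce

def share2(bit):   # additive secret sharing over Z_2
    a = random.randrange(2)
    return a, bit ^ a

def open2(s):
    return s[0] ^ s[1]

def xor2(s, t):    # local: [[z]]_2 <- [[x]]_2 + [[y]]_2
    return s[0] ^ t[0], s[1] ^ t[1]

def flip2(s):      # local: [[z]]_2 <- [[x]]_2 + 1 (Alice flips her share)
    return s[0] ^ 1, s[1]

def and2(s, t):
    """Secure multiplication mod 2 via a Beaver triple (a, b, a&b);
    generating the triple here stands in for the trusted initializer."""
    a, b = random.randrange(2), random.randrange(2)
    sa, sb, sc = share2(a), share2(b), share2(a & b)
    d = open2(xor2(s, sa))  # opening d and e reveals nothing about x, y
    e = open2(xor2(t, sb))
    return (sc[0] ^ (d & sb[0]) ^ (e & sa[0]) ^ (d & e),
            sc[1] ^ (d & sb[1]) ^ (e & sa[1]))

def pi_eq(sx_bits, sy_bits):
    """Shares of 1 if the two shared bit strings are equal, else of 0."""
    r = [flip2(xor2(sx, sy)) for sx, sy in zip(sx_bits, sy_bits)]
    # ell-1 secure ANDs; arranging them as a binary tree gives log(ell) rounds
    return reduce(and2, r)
```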
The ℓ − 1 multiplications can be organized as a binary tree, with the result of the multiplication at the root of the tree. In this way, the presented protocol has log(ℓ) rounds. While there are equality test protocols that have a constant number of rounds, the constant is prohibitively large for the parameters used in our implementation.

Secure Feature Vector Extraction: At the start of the feature extraction protocol, Alice has a set A = {a_1, a_2, ..., a_m} and Bob has a set B = {b_1, b_2, ..., b_n}. A is a set of bit strings that represent Alice's text, and B is a set of bit strings that represent Bob's lexicon. Bob would like to extract the words from Alice's text that appear in his lexicon. At the end of the protocol, Alice and Bob have secret shares of a binary feature vector x which represents which words in Bob's lexicon appear in Alice's text. The binary feature vector x of length n is defined as

x_i = 1 if b_i ∈ A, and x_i = 0 otherwise.   (1)

Protocol π_FE:
• Alice and Bob secret share each a_j (j = 1, ..., m) and each b_i (i = 1, ..., n) with each other.
• For i = 1, ..., n: // Computation of secret shares of x_i as defined in Equation (1).
  For j = 1, ..., m:
    Alice and Bob run the secure equality test protocol π_EQ to compute secret shares

    x_ij = 1 if a_j = b_i; x_ij = 0 otherwise.   (2)

  Alice and Bob locally compute the secret share [[x_i]]_2 ← Σ_{j=1}^{m} [[x_ij]]_2.

The secure feature vector extraction can be seen as a private set intersection where the intersection is not revealed but shared [13, 31]. Our solution π_FE is tailored to be used within our PPML framework (it uses only binary operations, it is secret sharing based, and it builds on pre-distributed binary multiplications). In principle, other protocols could be used here.
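As a plaintext reference for what π_FE computes (in the protocol itself every comparison and the final sum are performed on secret shares, so neither party sees any of these values):

```python
def extract_features(alice_words, bob_lexicon):
    """Feature vector of Equation (1): x_i = 1 iff Bob's i-th lexicon
    entry occurs in Alice's text. In pi_FE each x_i is the sum mod 2 of
    the m secure equality bits x_ij of Equation (2); the words in A are
    distinct, so at most one x_ij is 1 and the sum mod 2 equals the
    membership indicator."""
    a_set = set(alice_words)
    return [int(b in a_set) for b in bob_lexicon]
```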
The efficiency of our protocol can be improved by using hashing techniques [45] at the cost of introducing a small probability of error. The improvements due to hashing are asymptotic, and for the parameters used in our fastest running protocol these improvements were not noticeable. Thus, we restricted ourselves to the original protocol without hashing and without any probability of failure.

Secure Comparison Test: In our privacy-preserving AdaBoost classifier we will use a secure comparison protocol as a building block. At the start of the secure comparison test protocol, Alice and Bob have secret shares over Z_2 of two bit strings x = x_ℓ ... x_1 and y = y_ℓ ... y_1 of length ℓ. They run the secure comparison protocol π_DC of Garay et al. [34] with secret sharings over Z_2 and obtain a secret sharing of 1 if x ≥ y and of 0 otherwise.

Secure Conversion between Z_q and Z_2: Some of our building blocks perform computations using secret shares over Z_2 (secure equality test, comparison and feature extraction), while the secure inner product works over Z_q for q > 2. In order to be able to integrate these building blocks we need:
• A secure bit-decomposition protocol for secure conversion from Z_q to Z_2: Alice and Bob have as input a secret sharing [[x]]_q and, without learning any information about x, they should obtain as output secret sharings [[x_i]]_2, where x_ℓ ··· x_1 is the binary representation of x. We use the secure bit-decomposition protocol π_decomp from De Cock et al. [19].

Figure 2: Ensemble of decision stumps. Each root corresponds to a feature x_i.
The leaves contain weights y_{i,k} for the votes for class label 0 and weights z_{i,k} for the votes for class label 1.

• A protocol for secure conversion from Z_2 to Z_q: Alice and Bob have as input a secret sharing [[x]]_2 of a bit x and need to obtain a secret sharing [[x]]_q of the binary value over a larger field Z_q without learning any information about x. To this end, we use protocol π_2toQ:
  – For the input [[x]]_2, let x_A ∈ {0, 1} denote Alice's share and x_B ∈ {0, 1} denote Bob's share.
  – Alice creates a secret sharing [[x_A]]_q by picking uniformly random shares that sum to x_A and delivers Bob's share to him; Bob proceeds similarly to create [[x_B]]_q.
  – Alice and Bob compute [[y]]_q ← [[x_A]]_q [[x_B]]_q.
  – The output is computed as [[z]]_q ← [[x_A]]_q + [[x_B]]_q − 2[[y]]_q.

Secure Logistic Regression (LR) Classification: At the start of the secure LR classification protocol, Bob has a trained LR model M that requires a feature vector x of length n as its input, and produces a label M(x) as its output. Alice and Bob have secret shares of the feature vector x, which represents which words in Bob's lexicon appear in Alice's text. At the end of the protocol, Bob gets the result of the classification M(x). We use an existing protocol π_LR for secure classification with LR models [19].^6

Secure AdaBoost Classification: The setting is the same as above, but the model M is an AdaBoost ensemble of decision stumps instead of an LR model. While efficient solutions for secure classification with tree ensembles were previously known [33], we can take advantage of specific facts about our use case to obtain a more efficient solution.
In more detail, in our use case: (1) all the decision trees have depth 1 (i.e., they are decision stumps); (2) each feature x_i is binary, and therefore when it is used in a decision node, the left and right children correspond exactly to x_i = 0 and x_i = 1; (3) the output class is binary; (4) the feature values were extracted in a PP way and are secret shared, so that no party alone knows their values. We can use the above facts in order to perform the AdaBoost classification by computing two inner products and then comparing their values.
Protocol π_AB:
• Alice and Bob hold secret sharings [[x_i]]_q of each of the n binary features x_i. Bob holds the trained AdaBoost model, which consists of two weighted probability vectors y = (y_{1,0}, y_{1,1}, ..., y_{n,0}, y_{n,1}) and z = (z_{1,0}, z_{1,1}, ..., z_{n,0}, z_{n,1}). For the i-th decision stump: y_{i,k} is the weighted probability (i.e., a probability multiplied by the weight of the i-th decision stump) that the model assigns to the output class being 0 if x_i = k, and z_{i,k} is defined similarly for the output class 1 (see Fig. 2).
• Bob secret shares the elements of y and z, and Alice and Bob locally compute secret sharings [[w]]_q of the vector w = (1 − x_1, x_1, 1 − x_2, x_2, ..., 1 − x_n, x_n).
• Using the secure inner product protocol π_IP, Alice and Bob compute secret sharings of the inner product p_0 between y and w, and of the inner product p_1 between z and w.
p_0 and p_1 are the aggregated votes for class labels 0 and 1, respectively.
• Alice and Bob use π_decomp to compute bitwise secret sharings of p_0 and p_1 over Z_2.
• Alice and Bob use π_DC to compare p_1 and p_0, getting as output a secret sharing of the output class c, which is then opened towards Bob.

To the best of our knowledge, this is the most efficient provably secure protocol for binary classification over binary input features with an ensemble of decision stumps.

^6 In our case the result of the classification is disclosed to Bob (the party that owns the model) instead of Alice (who has the original input to be classified) as in [19]; however, it is trivial to modify their protocol so that the final secret share is opened towards Bob instead of Alice. Note also that in our case, the feature vector that is used for the classification is already secret shared between Alice and Bob, while in their protocol Alice holds the feature vector, which is then secret shared in the first step of the protocol. This modification is also trivial and does not affect the security of the protocol.

Table 1: Accuracy (Acc) results using 5-fold cross-validation over the corpus of 10,000 tweets.
Total time (Tot) needed to securely classify a text with our framework, broken down into the time needed for feature vector extraction (Extr) and the time for feature vector classification (Class). Times are in seconds.

                                    Unigrams                          Unigrams+Bigrams
                          Acc     Extr    Class   Tot       Acc     Extr     Class   Tot
Ada; 50 trees; depth 1    71.6%   0.8     6.4     7.2       73.3%   1.5      6.6     8.1
Ada; 200 trees; depth 1   73.0%   2.8     6.4     9.2       74.2%   9.4      6.6     16.0
Ada; 500 trees; depth 1   73.9%   6.6     6.7     13.3      74.4%   21.6     6.7     28.3
Logistic regression (50 feat.)   72.4%   0.8     3.7     4.5       73.8%   1.5      3.8     5.3
Logistic regression (200 feat.)  73.3%   2.8     3.7     6.5       73.7%   9.4      3.8     13.2
Logistic regression (500 feat.)  73.4%   6.6     3.8     10.4      74.2%   21.6     4.1     25.7
Logistic regression (all feat.)  73.1%   318.0   6.1     324.1     73.8%   5,371.9  24.9    5,396.8

3.2 Privacy-preserving classification of personal text messages

We now present our novel protocols for PP text classification. They result from combining the cryptographic building blocks we introduced previously. The PP protocol π_TCLR for classifying the text using a logistic regression model works as follows:
Protocol π_TCLR:
• Alice and Bob execute the secure feature extraction protocol π_FE with input sets A and B in order to obtain the secret shares [[x_i]]_2 of the feature vector x.
• They run the protocol π_2toQ to obtain shares [[x_i]]_q over Z_q.
• Alice and Bob run the secure logistic regression classification protocol π_LR in order to get the result of the classification. The LR model M is given as input to π_LR by Bob, and the secret shared feature vector x by both of them.
Bob gets the result of the classification M(x).
The privacy-preserving protocol π_TCAB for classifying the text using AdaBoost works as follows:
Protocol π_TCAB:
• Alice and Bob execute the secure feature extraction protocol π_FE with input sets A and B in order to obtain the secret shares [[x_i]]_2 of the feature vector x.
• They run the protocol π_2toQ to obtain shares [[x_i]]_q over Z_q.
• Alice and Bob run the secure AdaBoost classification protocol π_AB to obtain the result of the classification. The secret shared feature vector x is given as input to π_AB by both of them, and the two weighted probability vectors y = (y_{1,0}, y_{1,1}, ..., y_{n,0}, y_{n,1}) and z = (z_{1,0}, z_{1,1}, ..., z_{n,0}, z_{n,1}) that constitute the model are specified by Bob. Bob gets the output class c.

Detailed proofs of security are presented in the appendix.

4 Experimental results

We evaluate the proposed protocols in a use case for the detection of hate speech in short text messages, using data from [6]. The corpus consists of 10,000 tweets, 60% of which are annotated as hate speech against women or immigrants. We convert all characters to lowercase, and turn each tweet into a set of word unigrams and bigrams. There are 29,853 distinct unigrams and 93,629 distinct bigrams in the dataset, making for a total of 123,482 features.
Accuracy results for a variety of models trained to classify a tweet as hate speech vs. non-hate speech are presented in Table 1. The models are evaluated using 5-fold cross-validation over the entire corpus of 10,000 tweets. The top rows in Table 1 correspond to tree ensemble models consisting of 50, 200, and 500 decision stumps, respectively; the root of each stump corresponds to a feature. The bottom rows contain results for an LR model trained on 50, 200, and 500 features (preselected based on information gain), and an LR model trained on all features.
We ran experiments for feature sets consisting of unigrams and bigrams, as well as for feature sets consisting of unigrams only, observing that the inclusion of bigrams leads to a small improvement in accuracy. Note that designing a model to obtain the highest possible accuracy is not the focus of this paper. Instead, our goal is to demonstrate that PP text classification based on SMC is feasible in practice.
We implemented the protocols from Section 3 in Java and ran experiments on AWS c5.9xlarge machines with 36 vCPUs and 72.0 GiB of memory.^7 Each of the parties ran on a separate machine (connected through a Gigabit Ethernet network), which means that the results in Table 1 cover communication time in addition to computation time. Each runtime experiment was repeated 3 times and average results are reported. In Table 1 we report the time (in sec) needed for converting a tweet into a feature vector (Extr), for classification of the feature vector (Class), and for the overall process (Tot).

4.1 Analysis

The best running time was obtained using unigrams, 50 features, and logistic regression (4.5 s), with an accuracy of 72.4%. The highest accuracy (74.4%) was obtained using unigrams and bigrams, 500 features, and AdaBoost, with a running time of 28.3 s. From these results, it is clear that feature engineering plays a major role in optimizing privacy-preserving machine learning solutions based on SMC. We managed to reduce the running time from 5,396.8 s (logistic regression, unigrams and bigrams, all 123,482 features being used) to 5.3 s (logistic regression, unigrams and bigrams, 50 features) without any loss in accuracy, and to 4.5 s (logistic regression, unigrams only, 50 features) with a small loss.

4.2 Optimizing the computational and communication complexities

The feature extraction protocol requires n · m secure equality tests of bit strings.
The equality test relies on secure multiplication, which is the more expensive operation. To reduce the number of required equality tests, Alice and Bob can each first map their bit strings to p buckets A_1, A_2, ..., A_p and B_1, B_2, ..., B_p respectively, so that bit strings from each A_i only need to be compared with bit strings from B_i. Each bit string a_j and b_i is hashed, and the first t bits of the hash output are used to define the bucket number corresponding to that bit string, using a total of p = 2^t buckets. In order not to leak how many elements are mapped to each bucket (which could leak some information about the probability distribution of the elements, as the hash function is known by everyone), each bucket has a fixed number of elements (s_1 for Bob's buckets and s_2 for Alice's buckets) and the empty spots in the buckets are filled up with dummy elements. The feature extraction protocol now requires p · s_1 · s_2 equality tests, which can be substantially smaller than n · m. When using bucketization, the feature vector of length n from Equation (1) is expanded to a feature vector of length p · s_1, containing the original n features as well as the p · s_1 − n dummy features that Bob created to fill up his buckets. These dummy features do not have any effect on the accuracy of the classification because Bob's model does not take them into account: the trees with dummy features in an AdaBoost model have 0 weight for both class labels, and the dummy features' coefficients in an LR model are always 0.
The size of the buckets has to be chosen sufficiently large to avoid overflow. The choice depends directly on the number p = 2^t of buckets (which is kept constant for Alice and Bob) and the number of elements to be placed in the buckets, i.e.
n elements on Bob's side and m elements on Alice's side. While for hash functions coming from a 2-universal family the computation of these probabilities is relatively straightforward, the same is not true for more complicated hash functions [45]. In that case, numerical simulations are needed in order to bound the required probability.
The effect of using buckets is more significant for large values of n and m. In our case, after performing feature engineering to reduce the number of elements in each set, we end up, in the best case, with inputs for which there is no significant difference between the original protocol (without buckets) and the protocol that uses buckets. If the performance of these two cases is comparable, one is better off using the version without buckets, since there is then no probability of information being leaked due to bucket overflow.
Another way we could possibly improve the communication and computation complexities of the protocol is by reducing the number of bits used to represent each feature, albeit at the cost of increasing the probability of collisions (different features being mapped to the same bit strings). We used 13 bits for representing unigrams and 17 bits for representing unigrams and bigrams. We did not observe any collisions.
Finally, we note that if the protocol is to be deployed over a wide area network, rather than a local area network, Yao garbled circuits would become a preferable choice for the round-intensive parts of our solution (such as the private feature extraction part).

^7 https://bitbucket.org/uwtppml

5 Related work

The interest in privacy-preserving machine learning (PPML) has grown substantially over the last decade.
The best-known results in PPML are based on differential privacy (DP), a technique that relies on adding noise to answers to prevent an adversary from learning information about any particular individual in the dataset from revealed aggregate statistics [30]. While DP in an ML setting aims at protecting the privacy of individuals in the training dataset, our focus is on protecting the privacy of new user data that is classified with proprietary ML models. To this end, we use Secure Multiparty Computation (SMC) [16], a technique from cryptography that has successfully been applied to various ML tasks with structured data (see e.g. [14, 19, 21, 40] and references therein).
To the best of our knowledge, there are no existing DP- or SMC-based solutions for PP feature extraction and classification of unstructured texts. Defenses against authorship attribution attacks that fulfill DP in text classification have been proposed [53]. These methods rely on distortion of term frequency vectors and result in loss of accuracy. In this paper we address a different challenge: we assume that Bob knows Alice, so no authorship obfuscation is needed. Instead, we want to process Alice's text with Bob's classifier, without Bob learning what Alice wrote, and without accuracy loss.
To the best of our knowledge, Costantino et al. [15] were the first to propose PP feature extraction from text. In their solution, which is based on homomorphic encryption (HE), Bob learns which of his lexicon's words are present in Alice's tweets, and classification of a single tweet with a model with fewer than 20 features takes 19 minutes.
Our solution does not leak any information about Alice's words to Bob, and classification is done in seconds, even for a model with 500 features.
Below we present existing work related to some of the building blocks we use in our PP text classification protocol (see Section 3.1).
Private equality tests of several different flavors have been proposed in the literature [3]. They can be based on Yao garbled circuits, homomorphic encryption, and generic SMC [52]. In our case, we have chosen a simple protocol that depends solely on additions and multiplications over a binary field. While different (and possibly more efficient) comparison protocols could be used instead, they would either require additional computational assumptions or offer only a marginal improvement in performance for the parameters used here.
Our private feature extraction can be seen as a particular case of private set intersection (PSI). PSI is the problem of securely computing the intersection of two sets without leaking any information except (possibly) the result, such as identifying the intersection of the set of words in a user's text message with the hate speech lexicon used by the classifier. Several paradigms have been proposed to realize PSI functionality, including naive hashing, server-aided PSI, and PSI based on oblivious transfer extension. For a complete survey, we refer to Pinkas et al. [45]. In our protocol for PP text classification, we implement private feature extraction as a straightforward application of our equality test protocol. While more efficient protocols could be obtained by using sophisticated hashing techniques, we have decided to stick with our direct solution since it has no probability of failure and works well for the input sizes needed in our problem.
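As intuition for the arithmetic behind such an equality test, the following clear-text sketch (our own illustration, not the paper's implementation; function names are hypothetical) shows why only additions and multiplications over a binary field are needed: two bit strings are equal exactly when the GF(2) product of the terms 1 ⊕ x_i ⊕ y_i equals 1, so XORs are local on additive shares and only the multiplications must be done securely.

```python
# Clear-text sketch of the GF(2) arithmetic evaluated by the equality test.
# Illustrative only; names are ours, not from the paper's implementation.
import random


def xor_share(bit, rng):
    """Split one bit into two additive (XOR) shares over GF(2)."""
    r = rng.randrange(2)
    return r, bit ^ r  # one share per party; XOR of shares recovers the bit


def equality_bit(x_bits, y_bits):
    """1 iff the bit strings are equal: prod_i (1 XOR x_i XOR y_i) over GF(2).
    In a secure realization each factor is computed locally on XOR-shares,
    and the product costs len(x_bits) - 1 secure multiplications."""
    eq = 1
    for xi, yi in zip(x_bits, y_bits):
        eq &= 1 ^ xi ^ yi  # multiplication in GF(2) is logical AND
    return eq
```

With bucketization, Alice and Bob would run this test only between elements that hash to the same bucket, which is where the reduction from n · m to p · s1 · s2 tests comes from.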
For larger input sizes, a more sophisticated protocol would be a better choice [45].
We use two protocols for the secure classification of feature vectors: an existing protocol πLR for secure classification with LR models [19], and a novel secure AdaBoost classification protocol. The logistic regression protocol uses solely additions and multiplications over a finite field. The secure AdaBoost classification protocol is a novel optimized protocol that uses solely decision trees of depth one, binary features, and a binary output. All these characteristics were exploited to speed up the resulting protocol. The final secure AdaBoost classification protocol uses only two secure inner products and one secure comparison.
Generic protocols for private scoring of machine learning models have been proposed in [8]. The solutions proposed in [8] cannot be used in our setting, since they assume that the feature descriptions are publicly known and can thus be computed locally by Alice and Bob. In our case, however, the features themselves are part of the model and cannot be made public.
Finally, we note that while we implemented our protocols using our own framework for privacy-preserving machine learning8, any other generic framework for SMC could in principle be used as well [47, 22, 41].

6 Conclusion

In this paper we have presented the first provably secure method for privacy-preserving (PP) classification of unstructured text. We have provided an analysis of the correctness and security of our solution. As a side result, we also present a novel protocol for binary classification over binary input features with an ensemble of decision stumps. An implementation of the protocols in Java, run on AWS machines, allowed us to classify text messages securely within seconds.
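As intuition for the decision-stump protocol mentioned above, the following clear-text sketch (our own, with hypothetical variable names, not the authors' code) shows why scoring an ensemble of depth-one trees over binary features reduces to two inner products and one comparison: since each feature bit x_j selects one of two stump weights per class, each class score collapses into a single inner product plus a public constant.

```python
# Clear-text sketch (ours, not the paper's code) of stump-ensemble scoring.
# w_c[j] = (weight stump j adds to class c if x_j == 0,
#           weight stump j adds to class c if x_j == 1).

def class_score(x, w_c):
    """Score of one class: sum of the stump weights selected by x."""
    return sum(w_j[x_j] for x_j, w_j in zip(x, w_c))


def class_score_as_inner_product(x, w_c):
    """Same score written as one inner product plus a public constant:
    since x_j is 0 or 1, w_j[x_j] = x_j * (w_j[1] - w_j[0]) + w_j[0]."""
    return (sum(x_j * (w_j[1] - w_j[0]) for x_j, w_j in zip(x, w_c))
            + sum(w_j[0] for w_j in w_c))


def predict(x, w0, w1):
    """Two inner products and one comparison pick the winning class."""
    s0 = class_score_as_inner_product(x, w0)
    s1 = class_score_as_inner_product(x, w1)
    return 1 if s1 >= s0 else 0
```

In a secure evaluation of this form, the feature vector x and the weight pairs stay private to their owners; only the two inner products and the final comparison need to be computed with SMC.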
It is important to note that this run time (1) includes both secure feature extraction and secure classification of the extracted feature vector; (2) includes both computation and communication costs, as the parties involved in the protocol were run on separate machines; and (3) is two orders of magnitude better than the only other existing solution, which is based on HE. Our results show that in order to make PP text classification practical, one needs to pay close attention not only to the underlying cryptographic protocols but also to the underlying ML algorithms. ML algorithms that would be a clear choice when used in the clear might not be useful at all when transferred to the SMC domain; one has to optimize these ML algorithms with their use within SMC protocols in mind. Our results provide the first evidence that provably secure PP text classification is feasible in practice.

8 https://bitbucket.org/uwtppml

References

[1] Peter Ray Allison. Tracking terrorists online might invade your privacy. BBC, http://www.bbc.com/future/story/20170808-tracking-terrorists-online-might-invade-your-privacy, 2017.

[2] Tiago A. Almeida, José María G. Hidalgo, and Akebo Yamakami. Contributions to the study of SMS spam filtering: new collection and results. In Proc. of the 11th ACM Symposium on Document Engineering, pages 259–262, 2011.

[3] Nuttapong Attrapadung, Goichiro Hanaoka, Shinsaku Kiyomoto, Tomoaki Mimoto, and Jacob C. N. Schuldt. A taxonomy of secure two-party comparison protocols and efficient constructions. In 15th Annual Conference on Privacy, Security and Trust (PST), 2017.

[4] Boaz Barak, Ran Canetti, Jesper Buus Nielsen, and Rafael Pass. Universally composable protocols with relaxed set-up assumptions. In FOCS 2004, pages 186–195, 2004.

[5] Paulo S. L. M. Barreto, Bernardo David, Rafael Dowsley, Kirill Morozov, and Anderson C. A. Nascimento.
A framework for efficient adaptively secure composable oblivious transfer in the ROM. Cryptology ePrint Archive, Report 2017/993, 2017. http://eprint.iacr.org/2017/993.

[6] Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Rangel, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proc. of the 13th International Workshop on Semantic Evaluation (SemEval-2019). ACL, 2019.

[7] Donald Beaver. Commodity-based cryptography (extended abstract). In STOC 1997, pages 446–455, 1997.

[8] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. Machine learning classification over encrypted data. In NDSS, 2015.

[9] Ran Canetti. Universally composable security: A new paradigm for cryptographic protocols. In FOCS 2001, pages 136–145, 2001.

[10] Ran Canetti and Marc Fischlin. Universally composable commitments. In Crypto 2001, pages 19–40, 2001.

[11] Ran Canetti, Yehuda Lindell, Rafail Ostrovsky, and Amit Sahai. Universally composable two-party and multi-party secure computation. In STOC 2002, pages 494–503, 2002.

[12] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.

[13] Michele Ciampi and Claudio Orlandi. Combining private set-intersection with secure two-party computation. In SCN 2018, pages 464–482, 2018.

[14] Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2):28–34, 2002.

[15] Gianpiero Costantino, Antonio La Marra, Fabio Martinelli, Andrea Saracino, and Mina Sheikhalishahi. Privacy-preserving text mining as a service.
In 2017 IEEE Symposium on Computers and Communications (ISCC), pages 890–897, 2017.

[16] Ronald Cramer, Ivan Damgård, and Jesper Buus Nielsen. Secure Multiparty Computation and Secret Sharing. Cambridge University Press, 2015.

[17] Bernardo David, Rafael Dowsley, Raj Katti, and Anderson C. A. Nascimento. Efficient unconditionally secure comparison and privacy preserving machine learning classification protocols. In International Conference on Provable Security, pages 354–367. Springer, 2015.

[18] Bernardo David, Rafael Dowsley, Jeroen van de Graaf, Davidson Marques, Anderson C. A. Nascimento, and Adriana C. B. Pinto. Unconditionally secure, universally composable privacy preserving linear algebra. IEEE Transactions on Information Forensics and Security, 11(1):59–73, 2016.

[19] Martine De Cock, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento, Wing-Sea Poon, and Stacey Truex. Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation. IEEE Transactions on Dependable and Secure Computing, 16(2):217–230, 2019.

[20] Martine De Cock, Rafael Dowsley, Anderson C. A. Nascimento, and Stacey C. Newman. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data. In 8th ACM Workshop on Artificial Intelligence and Security (AISec), pages 3–14, 2015.

[21] Sebastiaan de Hoogh, Berry Schoenmakers, Ping Chen, and Harm op den Akker. Practical secure decision tree learning in a teletreatment application. In International Conference on Financial Cryptography and Data Security, pages 179–194. Springer, 2014.

[22] Daniel Demmler, Thomas Schneider, and Michael Zohner. ABY: A framework for efficient mixed-protocol secure two-party computation. In NDSS, 2015.

[23] Nico Döttling, Daniel Kraschewski, and Jörn Müller-Quade.
Unconditional and composable security using a single stateful tamper-proof hardware token. pages 164–181.

[24] Rafael Dowsley. Cryptography Based on Correlated Data: Foundations and Practice. PhD thesis, Karlsruhe Institute of Technology, Germany, 2016.

[25] Rafael Dowsley, Jörn Müller-Quade, and Anderson C. A. Nascimento. On the possibility of universally composable commitments based on noisy channels. In SBSEG 2008, pages 103–114, Gramado, Brazil, September 1–5, 2008.

[26] Rafael Dowsley, Jörn Müller-Quade, and Tobias Nilges. Weakening the isolation assumption of tamper-proof hardware tokens. In ICITS 2015, pages 197–213, 2015.

[27] Rafael Dowsley, Jörn Müller-Quade, Akira Otsuka, Goichiro Hanaoka, Hideki Imai, and Anderson C. A. Nascimento. Universally composable and statistically secure verifiable secret sharing scheme based on pre-distributed data. IEICE Transactions, 94-A(2):725–734, 2011.

[28] Rafael Dowsley, Jeroen van de Graaf, Davidson Marques, and Anderson C. A. Nascimento. A two-party protocol with trusted initializer for computing the inner product. In International Workshop on Information Security Applications, pages 337–350. Springer, 2010.

[29] Rafael Dowsley, Jeroen van de Graaf, Jörn Müller-Quade, and Anderson C. A. Nascimento. On the composability of statistically secure bit commitments. Journal of Internet Technology, 14(3):509–516, 2013.

[30] Cynthia Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.

[31] Brett Hemenway Falk, Daniel Noble, and Rafail Ostrovsky. Private set intersection with linear communication from general assumptions.
Cryptology ePrint Archive, Report 2018/238, 2018. https://eprint.iacr.org/2018/238.

[32] Golnoosh Farnadi, Geetha Sitaraman, Shanu Sushmita, Fabio Celli, Michal Kosinski, David Stillwell, Sergio Davalos, Marie-Francine Moens, and Martine De Cock. Computational personality recognition in social media. User Modeling and User-Adapted Interaction, 26(2-3):109–142, 2016.

[33] Kyle Fritchman, Keerthanaa Saminathan, Rafael Dowsley, Tyler Hughes, Martine De Cock, Anderson Nascimento, and Ankur Teredesai. Privacy-preserving scoring of tree ensembles: A novel framework for AI in healthcare. In Proc. of 2018 IEEE International Conference on Big Data, pages 2412–2421, 2018.

[34] Juan A. Garay, Berry Schoenmakers, and José Villegas. Practical and secure solutions for integer comparison. In PKC 2007, pages 330–342, 2007.

[35] Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. All you need is “love”: Evading hate-speech detection. In Proc. of the 11th ACM Workshop on Artificial Intelligence and Security (AISec), 2018.

[36] Dennis Hofheinz and Jörn Müller-Quade. Universally composable commitments using random oracles. In TCC 2004, pages 58–76, 2004.

[37] Dennis Hofheinz, Jörn Müller-Quade, and Dominique Unruh. Universally composable zero-knowledge arguments and commitments from signature cards. In MoraviaCrypt 2005, 2005.

[38] Yuval Ishai, Eyal Kushilevitz, Sigurd Meldgaard, Claudio Orlandi, and Anat Paskin-Cherniavsky. On the power of correlated randomness in secure computation. In Theory of Cryptography, pages 600–620. Springer, 2013.

[39] Jonathan Katz. Universally composable multi-party computation using tamper-proof hardware. In Eurocrypt 2007, pages 115–128, 2007.

[40] Selim V. Kaya, Thomas B. Pedersen, Erkay Savaş, and Yücel Saygın.
Efficient privacy preserving distributed clustering based on secret sharing. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 280–291. Springer, 2007.

[41] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38. IEEE, 2017.

[42] Bridianne O’Dea, Stephen Wan, Philip J. Batterham, Alison L. Calear, Cecile Paris, and Helen Christensen. Detecting suicidality on Twitter. Internet Interventions, 2(2):183–188, 2015.

[43] Chris Peikert, Vinod Vaikuntanathan, and Brent Waters. A framework for efficient and composable oblivious transfer. In Crypto 2008, pages 554–571, 2008.

[44] Wouter Penard and Tim van Werkhoven. On the secure hash algorithm family. In Cryptography in Context, pages 1–18. 2008.

[45] Benny Pinkas, Thomas Schneider, and Michael Zohner. Scalable private set intersection based on OT extension. ACM Transactions on Privacy and Security (TOPS), 21(2):7, 2018.

[46] Andrew G. Reece, Andrew J. Reagan, Katharina L. M. Lix, Peter Sheridan Dodds, Christopher M. Danforth, and Ellen J. Langer. Forecasting the onset and course of mental illness with Twitter data. Scientific Reports, 7(1):13006, 2017.

[47] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Proceedings of the 2018 Asia Conference on Computer and Communications Security, pages 707–721. ACM, 2018.

[48] Ronald L. Rivest. Unconditionally secure commitment and oblivious transfer schemes using private channels and a trusted initializer. Preprint available at http://people.csail.mit.edu/rivest/Rivest-commitment.pdf, 1999.

[49] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz.
A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, volume 62, pages 98–105, 1998.

[50] Rafael Tonicelli, Anderson C. A. Nascimento, Rafael Dowsley, Jörn Müller-Quade, Hideki Imai, Goichiro Hanaoka, and Akira Otsuka. Information-theoretically secure oblivious polynomial evaluation in the commodity-based model. International Journal of Information Security, 14(1):73–84, 2015.

[51] Cynthia Van Hee, Gilles Jacobs, Chris Emmery, Bart Desmet, Els Lefever, Ben Verhoeven, Guy De Pauw, Walter Daelemans, and Véronique Hoste. Automatic detection of cyberbullying in social media text. PLoS ONE, 13(10):e0203794, 2018.

[52] Thijs Veugen, Frank Blom, Sebastiaan J. A. de Hoogh, and Zekeriya Erkin. Secure comparison protocols in the semi-honest model. IEEE Journal of Selected Topics in Signal Processing, 9(7):1217–1228, 2015.

[53] Benjamin Weggenmann and Florian Kerschbaum. SynTF: Synthetic and differentially private term frequency vectors for privacy-preserving text mining. In 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 305–314, 2018.