{"title": "Clustering Billions of Reads for DNA Data Storage", "book": "Advances in Neural Information Processing Systems", "page_first": 3360, "page_last": 3371, "abstract": "Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. 
Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.", "full_text": "Clustering Billions of Reads for DNA Data Storage

Cyrus Rashtchian^{a,b}, Konstantin Makarychev^{a,c}, Miklós Rácz^{a,d}, Siena Dumas Ang^a, Djordje Jevdjic^a, Sergey Yekhanin^a, Luis Ceze^{a,b}, Karin Strauss^a

^a Microsoft Research, ^b CSE at University of Washington, ^c EECS at Northwestern University, ^d ORFE at Princeton University

Abstract

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. 
Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.

1 Introduction

Existing storage technologies cannot keep up with the modern data explosion. Thus, researchers have turned to fundamentally different physical media for alternatives. Synthetic DNA has emerged as a promising option, with a theoretical information density multiple orders of magnitude greater than that of magnetic tape [12, 24, 26, 52]. However, significant biochemical and computational improvements are necessary to scale DNA storage systems to read/write exabytes of data within hours or even days.

Encoding a file in DNA requires several preprocessing steps, such as randomizing it using a pseudo-random sequence, partitioning it into hundred-character substrings, adding address and error correction information to these substrings, and finally encoding everything to the {A, C, G, T} alphabet. The resulting collection of short strings is synthesized into DNA and stored until needed.

To retrieve the data, the DNA is accessed using next-generation sequencing, which results in several noisy copies, called reads, of each originally synthesized short string, called a reference. With current technologies, these references and reads contain hundreds of characters, and in the near future, they will likely contain thousands [52]. After sequencing, the goal is to recover the unknown references from the observed reads. The first step, which is the focus of this paper, is to cluster the reads into groups, each of which is the set of noisy copies of a single reference. The output of clustering is fed into a consensus-finding algorithm, which predicts the most likely reference to have produced each cluster of reads. 
As Figure 1 shows, datasets typically contain only a handful of reads for each reference, and each of these reads differs from the reference by insertions, deletions, and/or substitutions. The challenge of clustering is to achieve high precision and recall of many small underlying clusters, in the presence of such errors.

Figure 1: DNA storage datasets have many small clusters that are well-separated in edit distance.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Datasets arising from DNA storage have two striking properties. First, the number of clusters grows linearly with the input size. Each cluster typically consists of five to fifteen noisy copies of the same reference. Second, the clusters are separated in edit distance, by design (via randomization). We investigate approximate clustering algorithms for large collections of reads with these properties.

Suitable algorithms must satisfy several criteria. First, they must be distributed, to handle the billions of reads coming from modern sequencing machines. Second, their running time must scale favorably with the number of clusters. In DNA storage datasets, the size of the clusters is fixed and determined by the number of reads needed to recover the data. Thus, the number of clusters k grows linearly with the input size n (i.e., k = Ω(n)). Any methods requiring Ω(k · n) = Ω(n²) time or communication would be too slow for billion-scale datasets. Finally, algorithms must be robust to noise and outliers, and they must find clusters with relatively large diameters (e.g., linear in the dimensionality).

These criteria rule out many clustering methods. 
Algorithms for k-medians and related objectives are unsuitable because they have running time or communication scaling with k · n [19, 29, 33, 42]. Graph clustering methods, such as correlation clustering [4, 9, 18, 47], require a similarity graph.¹ Constructing this graph is costly, and it is essentially equivalent to our clustering problem, since in DNA storage datasets, the similarity graph has connected components that are precisely the clusters of noisy reads. Linkage-based methods are inherently sequential, and iteratively merging the closest pair of clusters takes quadratic time. Agglomerative methods that are robust to outliers do not extend to versions that are distributed and efficient in terms of time, space, and communication [2, 8].

Turning to approximation algorithms, tools such as metric embeddings [43] and locality sensitive hashing (LSH) [31] trade a small loss in accuracy for a large reduction in running time. However, such tools are not well understood for edit distance [16, 17, 30, 38, 46], even though many methods have been proposed [15, 27, 39, 48, 54]. In particular, no published system has demonstrated the potential to handle billions of reads, and no efficient algorithms have experimental or theoretical results supporting that they would achieve high enough accuracy on DNA storage datasets. This is in stark contrast to set similarity and Hamming distance, which have many positive results [13, 36, 40, 49, 55].

Given the challenges associated with existing solutions, we ask two questions: (1) Can we design a distributed algorithm that converges in sub-quadratic time for DNA storage datasets? (2) Is it possible to adapt techniques from metric embeddings and LSH to cluster billions of strings in under an hour?

Our Contributions We present a distributed algorithm that clusters billions of reads arising from DNA storage systems. 
Our agglomerative algorithm utilizes a series of filters to avoid unnecessary distance computations. At a high level, our algorithm iteratively merges clusters based on random representatives. Using a hashing scheme for edit distance, we only compare a small subset of representatives. We also use a light-weight check based on a binary embedding to further filter pairs. If a pair of representatives passes these two tests, edit distance determines whether the clusters are merged. Theoretically and experimentally, our algorithm satisfies four desirable properties.

Scalability: Our algorithm scales well in time and space, in shared-memory and shared-nothing environments. For n input reads, each of P processors needs to hold only O(n/P) reads in memory.

Accuracy: We measure accuracy as the fraction of clusters with a majority of found members and no false positives. Theoretically, we show that the separation of the underlying clusters implies our algorithm converges quickly to a correct clustering. Experimentally, a small number of communication rounds achieve 98% accuracy on multiple real datasets, which suffices to retrieve the stored data.

Robustness: For separated clusters, our algorithm is optimally robust to adversarial outliers.

Performance: Our algorithm outperforms the state-of-the-art clustering method for sequencing data, Starcode [57], achieving higher accuracy with a 1000x speedup. Our algorithm quickly recovers clusters with large diameter (e.g., 25), whereas known string similarity search methods perform poorly with a distance threshold larger than four [35, 53]. Our algorithm is simple to implement in any distributed framework, and it clusters 5B reads with 99% accuracy in 46 minutes on 24 processors.

¹The similarity graph connects all pairs of elements with distance below a given threshold.

1.1 Outline

The rest of the paper is organized as follows. 
We begin, in Section 2, by defining the problem statement, including clustering accuracy and our data model. Then, in Section 3, we describe our algorithm, hash function, and binary signatures. In Section 4, we provide an overview of the theoretical analysis, with most details in the appendix. In Section 5, we empirically evaluate our algorithm. We discuss related work in Section 6 and conclude in Section 7.

2 DNA Data Storage Model and Problem Statement

For an alphabet Σ, the edit distance between two strings x, y ∈ Σ* is denoted dE(x, y) and equals the minimum number of insertions, deletions, or substitutions needed to transform x to y. It is well known that dE defines a metric. We fix Σ = {A, C, G, T}, representing the four DNA nucleotides. We define the distance between two nonempty sets C1, C2 ⊆ Σ* as dE(C1, C2) = min_{x ∈ C1, y ∈ C2} dE(x, y). A clustering C of a finite set S ⊆ Σ* is any partition of S into nonempty subsets.

We work with the following definition of accuracy, motivated by DNA storage data retrieval.

Definition 2.1 (Accuracy). Let C, C̃ be clusterings. For 1/2 < γ ≤ 1, the accuracy of C̃ with respect to C is

A_γ(C, C̃) = max_π (1/|C|) · Σ_{i=1}^{|C|} 1{ C̃_{π(i)} ⊆ C_i and |C̃_{π(i)} ∩ C_i| ≥ γ|C_i| },

where the max is over all injective maps π : {1, 2, . . . , |C̃|} → {1, 2, . . . , max(|C|, |C̃|)}.

We think of C as the underlying clustering and C̃ as the output of an algorithm. The accuracy A_γ(C, C̃) measures the fraction of clusters in C that overlap with some cluster in C̃ in at least a γ-fraction of elements while containing no false positives.² This is a stricter notion than the standard classification error [8, 44]. 
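As a concrete illustration of Definition 2.1, the following sketch (assuming Python; `accuracy`, `true_clusters`, and `found_clusters` are illustrative names, not from the paper) computes A_γ when both clusterings are partitions. Because the true clusters are disjoint, a found cluster can be a subset of at most one true cluster, so a greedy assignment attains the maximum over injective maps π.

```python
def accuracy(true_clusters, found_clusters, gamma):
    """A_gamma: fraction of true clusters C_i matched by some found
    cluster that is a subset of C_i (no false positives) and covers
    at least a gamma fraction of C_i. Since the true clusters are
    disjoint, each found cluster qualifies for at most one true
    cluster, so greedy matching attains the maximum over pi."""
    assert 0.5 < gamma <= 1.0
    used = set()
    hits = 0
    for C in true_clusters:
        for j, D in enumerate(found_clusters):
            if j not in used and D <= C and len(D) >= gamma * len(C):
                used.add(j)
                hits += 1
                break
    return hits / len(true_clusters)
```

For example, with true clusters {1,2,3,4} and {5,6,7,8}, the found clustering [{1,2,3}, {5,6}, {4}] scores 0.5 at γ = 0.7: the first true cluster is recovered by {1,2,3}, while {5,6} covers too little of the second and {4} too little of the first.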
Notice that our accuracy definition does not require that the clusterings be of the same set. We will use this to compare clusterings of S and S ∪ O for a set of outliers O ⊆ Σ*.

For DNA storage datasets, the underlying clusters have a natural interpretation. During data retrieval, several molecular copies of each original DNA strand (reference) are sent to a DNA sequencer. The output of sequencing is a small number of noisy reads of each reference. Thus, the reads that correspond to the same reference form a cluster. This interpretation justifies the need for high accuracy: each underlying cluster represents one stored unit of information.

Data Model To aid in the design and analysis of clustering algorithms for DNA data storage, we introduce the following natural generative model. First, pick many random centers (representing original references), then perturb each center by insertions, deletions, and substitutions to acquire the elements of the cluster (representing the noisy reads). We model the original references as random strings because during the encoding process, the original file has been randomized using a fixed pseudo-random sequence [45]. We make this model precise, starting with the perturbation.

Definition 2.2 (p-noisy copy). For p ∈ [0, 1] and z ∈ Σ*, define a p-noisy copy of z by the following process. For each character in z, independently, do one of the following four operations: (i) keep the character unchanged with probability (1 − p), (ii) delete it with probability p/3, (iii) with probability p/3, replace it with a character chosen uniformly at random from Σ, or (iv) with probability p/3, keep the character and insert an additional one after it, chosen uniformly at random from Σ.

We remark that our model and analysis can be generalized to incorporate separate deletion, insertion, and substitution probabilities p = pD + pI + pS, but we use balanced probabilities p/3 to simplify the exposition. Now, we define a noisy cluster. For simplicity, we assume uniform cluster sizes.

Definition 2.3 (Noisy cluster of size s). We define the distribution D_{s,p,m} with cluster size s, noise rate p ∈ [0, 1], and dimension m. Sample a cluster C ∼ D_{s,p,m} as follows: pick a center z ∈ Σ^m uniformly at random; then, each of the s elements of C will be an independent p-noisy copy of z.

²The requirement γ ∈ (1/2, 1] implies A_γ(C, C̃) ∈ [0, 1].

With our definition of accuracy and our data model in hand, we define the main clustering problem.

Problem Statement Fix p, m, s, n. Let C = {C1, . . . , Ck} be a set of k = n/s independent clusters Ci ∼ D_{s,p,m}. Given an accuracy parameter γ ∈ (1/2, 1] and an error tolerance ε ∈ [0, 1], on input set S = C1 ∪ · · · ∪ Ck, the goal is to quickly find a clustering C̃ of S with A_γ(C, C̃) ≥ 1 − ε.

3 Approximately Clustering DNA Storage Datasets

Our distributed clustering method iteratively merges clusters with similar representatives, alternating between local clustering and global reshuffling. At the core of our algorithm is a hash family that determines (i) which pairs of representatives to compare, and (ii) how to repartition the data among the processors. 
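Looking back at Definitions 2.2 and 2.3, the generative model can be simulated directly; a minimal sketch, assuming Python (`noisy_copy` and `noisy_cluster` are illustrative names):

```python
import random

SIGMA = "ACGT"

def noisy_copy(z, p, rng=random):
    """p-noisy copy per Definition 2.2: each character is kept with
    probability 1-p, deleted with probability p/3, substituted with
    probability p/3, or kept with a uniformly random character
    inserted after it with probability p/3."""
    out = []
    for c in z:
        u = rng.random()
        if u < 1 - p:                    # (i) keep unchanged
            out.append(c)
        elif u < 1 - p + p / 3:          # (ii) delete
            continue
        elif u < 1 - p + 2 * p / 3:      # (iii) substitute
            out.append(rng.choice(SIGMA))
        else:                            # (iv) keep, then insert
            out.append(c)
            out.append(rng.choice(SIGMA))
    return "".join(out)

def noisy_cluster(s, p, m, rng=random):
    """Sample from D_{s,p,m} (Definition 2.3): a uniformly random
    center in Sigma^m plus s independent p-noisy copies of it."""
    center = "".join(rng.choice(SIGMA) for _ in range(m))
    return center, [noisy_copy(center, p, rng) for _ in range(s)]
```

With p = 0 every copy equals the center; with the paper's typical p ≈ 4% and m = 110, each read differs from its center by a handful of edits.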
On top of this simple framework, we use a cheap pre-check, based on the Hamming distance between binary signatures, to avoid many edit distance comparisons. Our algorithm achieves high accuracy by leveraging the fact that DNA storage datasets contain clusters that are well-separated in edit distance. In this section, we will define separated clusterings, explain the hash function and the binary signature, and describe the overall algorithm.

3.1 Separated Clusters

The most important consequence of our data model D_{s,p,m} is that the clusters will be well-separated in the edit distance metric space. Moreover, this reflects the actual separation of clusters in real datasets. To make this precise, we introduce the following definition.

Definition 3.1. A clustering {C1, . . . , Ck} is (r1, r2)-separated if Ci has diameter³ at most r1 for every i ∈ {1, 2, . . . , k}, while any two different clusters Ci and Cj satisfy dE(Ci, Cj) > r2.

DNA storage datasets will be separated with r2 ≫ r1. Thus, recovering the clusters corresponds to finding pairs of strings with distance at most r1. Whenever r2 ≥ 2 · r1, our algorithm will be robust to outliers. In Section 4, we provide more details about separability under our DNA storage data model. We remark that our clustering separability definition differs slightly from known notions [2, 3, 8] in that we explicitly bound both the diameter of clusters and the distance between clusters.

3.2 Hashing for Edit Distance

Algorithms for string similarity search revolve around the simple fact that when two strings x, y ∈ Σ^m have edit distance at most r, then they share a substring of length at least m/(r + 1). However, insertions and deletions imply that the matching substrings may appear in different locations. 
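This shared-substring fact is a pigeonhole argument: r edits can touch at most r of the r + 1 contiguous blocks of x, so at least one block of length ⌊m/(r + 1)⌋ survives intact as a substring of y. A small empirical check, assuming Python (the helper names are illustrative):

```python
import random

def longest_common_substring(x, y):
    """Classic O(|x||y|) dynamic program for the longest common substring."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def random_edits(x, r, rng):
    """Apply exactly r random single-character edits to x."""
    s = list(x)
    for _ in range(r):
        op = rng.choice("DIS")
        i = rng.randrange(len(s))
        if op == "D":
            del s[i]
        elif op == "I":
            s.insert(i, rng.choice("ACGT"))
        else:
            s[i] = rng.choice("ACGT")
    return "".join(s)

rng = random.Random(0)
m, r = 100, 4
for _ in range(10):
    x = "".join(rng.choice("ACGT") for _ in range(m))
    y = random_edits(x, r, rng)
    # at least one of the r+1 blocks of length m//(r+1) is untouched
    assert longest_common_substring(x, y) >= m // (r + 1)
```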
Exact algorithms build inverted indices to find matching substrings, and many optimizations have been proposed to exactly find all close pairs [34, 51, 57]. Since we need only an approximate solution, we design a hash family based on finding matching substrings quickly, without being exhaustive. Informally, for parameters w, ℓ, our hash picks a random “anchor” a of length w, and the hash value for x is the substring of length w + ℓ starting at the first occurrence of a in x.

We formally define the family of hash functions H_{w,ℓ} = {h_{π,ℓ} : Σ* → Σ^{w+ℓ}} parametrized by w, ℓ, where π is a permutation of Σ^w. For x = x1 x2 · · · xm, the value of h_{π,ℓ}(x) is defined as follows. Find the earliest, with respect to π, occurring w-gram a in x, and let i be the index of the first occurrence of a in x. Then, h_{π,ℓ}(x) = xi · · · xm′ where m′ = min(m, i + w + ℓ). To sample h_{π,ℓ} from H_{w,ℓ}, simply pick a uniformly random permutation π : Σ^w → Σ^w.

Note that H_{w,ℓ} resembles MinHash [13, 14] with the natural mapping from strings to sets of substrings of length w + ℓ. Our hash family has the benefit of finding long substrings (such as w + ℓ = 16), while only having the overhead of finding anchors of length w. This reduces computation time, while still leading to effective hashes. We now describe the signatures.

3.3 Binary Signature Distance

The q-gram distance is an approximation for edit distance [50]. By now, it is a standard tool in bioinformatics and string similarity search [27, 28, 48, 54]. A q-gram is simply a substring of length q, and the q-gram distance measures the number of different q-grams between two strings. 
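Anticipating the formal definition of the binary signature just below, a minimal sketch of the q-gram signature and distance, assuming Python (the function names are illustrative):

```python
from itertools import product

def qgram_signature(x, q=2, alphabet="ACGT"):
    """sigma_q(x): indicator vector over all 4**q possible q-grams,
    marking which q-grams occur in x."""
    index = {"".join(g): i for i, g in enumerate(product(alphabet, repeat=q))}
    sig = [0] * len(index)
    for i in range(len(x) - q + 1):
        sig[index[x[i:i + q]]] = 1
    return sig

def qgram_distance(x, y, q=2):
    """Hamming distance between signatures, i.e. the number of q-grams
    occurring in exactly one of x and y."""
    return sum(a != b for a, b in
               zip(qgram_signature(x, q), qgram_signature(y, q)))
```

For instance, "ACGT" and "ACGA" share the 2-grams AC and CG but differ on GT versus GA, giving q-gram distance 2.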
For a string x ∈ Σ^m, let the binary signature σq(x) ∈ {0, 1}^{4^q} be the indicator vector for the set of q-grams in x. Then, the q-gram distance between x and y equals the Hamming distance dH(σq(x), σq(y)).

The utility of the q-gram distance is that the Hamming distance dH(σq(x), σq(y)) approximates the edit distance dE(x, y), yet it is much faster to check dH(σq(x), σq(y)) ≤ θ than to verify dE(x, y) ≤ r. The only drawback of the q-gram distance is that it may not faithfully preserve the separation of clusters, in the worst case. This implies that the q-gram distance by itself is not sufficient for clustering. Therefore, we use binary signatures as a coarse filtering step, but reserve edit distance for ambiguous merging decisions. We provide theoretical bounds on the q-gram distance in Section 4.1 and Appendix B. We now explain our algorithm.

³A cluster C has diameter at most r if dE(x, y) ≤ r for all pairs x, y ∈ C.

Algorithm 1 Clustering DNA Strands
1: function CLUSTER(S, r, q, w, ℓ, θlow, θhigh, comm_steps, local_steps)
2:   C̃ = {{x} : x ∈ S} (singletons).
3:   For i = 1, 2, . . . , comm_steps:
4:     Sample h_{π,ℓ} ∼ H_{w,ℓ} and hash-partition clusters, applying h_{π,ℓ} to representatives.
5:     For j = 1, 2, . . . , local_steps:
6:       Sample h_{π,ℓ} ∼ H_{w,ℓ}.
7:       For C ∈ C̃, sample a representative xC ∼ C, and then compute the hash h_{π,ℓ}(xC).
8:       For each pair x, y with h_{π,ℓ}(x) = h_{π,ℓ}(y):
9:         If (dH(σ(x), σ(y)) ≤ θlow) or (dH(σ(x), σ(y)) ≤ θhigh and dE(x, y) ≤ r):
10:          Update C̃ = (C̃ \ {Cx, Cy}) ∪ {Cx ∪ Cy}.
11:  return C̃.
12: end function

3.4 Algorithm Description

We describe our distributed, agglomerative clustering algorithm (displayed in Algorithm 1). The algorithm ingests the input set S ⊂ Σ* in parallel, so each core begins with roughly the same number of reads. Signatures σq(x) are pre-computed and stored for each x ∈ S. The clustering C̃ is initialized as singletons. It will be convenient to use the notation xC for an element x ∈ C, and the notation Cx for the cluster that x belongs to. We abuse notation and use C̃ to denote the current global clustering. The algorithm alternates between global communication and local computation.

Communication One representative xC is sampled uniformly from each cluster C in the current clustering C̃, in parallel. Then, using shared randomness among all cores, a hash function h_{π,ℓ} is sampled from H_{w,ℓ}. Using this same hash function for each core, a hash value is computed for each representative xC for cluster C in the current clustering C̃. The communication round ends by redistributing the clusters randomly using these hash values. In particular, the value h_{π,ℓ}(xC) determines which core receives C. The current clustering C̃ is thus repartitioned among cores.

Local Computation The local computation proceeds independently on each core. One local round revolves around one hash function h_{π,ℓ} ∼ H_{w,ℓ}. Let C̃j be the set of clusters that have been distributed to the jth core. During each local clustering step, one uniform representative xC is sampled for each cluster C ∈ C̃j. The representatives are bucketed based on h_{π,ℓ}(xC). Now, the local clustering requires three parameters, r, θlow, θhigh, set ahead of time, and known to all the cores. For each pair y, z in a bucket, first the algorithm checks whether dH(σq(y), σq(z)) ≤ θlow. If so, the clusters Cy and Cz are merged. Otherwise, the algorithm checks if both dH(σq(y), σq(z)) ≤ θhigh and dE(y, z) ≤ r, and merges the clusters Cy and Cz if these two conditions hold. Immediately after a merge, C̃j is updated, and Cx corresponds to the present cluster containing x. Note that distributing the clusters among cores during communication implies that no coordination is needed after merges. The local clustering repeats for local_steps rounds before moving to the next communication round.

Termination After the local computation finishes, after the last of comm_steps communication rounds, the algorithm outputs the current clustering C̃ = ∪j C̃j and terminates.

4 Theoretical Algorithm Analysis

4.1 Cluster Separation and Binary Signatures

When storing data in DNA, the encoding process leads to clusters with nearly-random centers. Recall that we need the clusters to be far apart for our algorithm to perform well. Fortunately, random cluster centers will have edit distance Ω(m) with high probability. Indeed, two independent random strings have expected edit distance c_ind · m, for a constant c_ind > 0. Surprisingly, the exact value of c_ind remains unknown. Simulations suggest that c_ind ≈ 0.51, and it is known that c_ind > 0.338 [25].

When recovering the data, DNA storage systems receive clusters that consist of p-noisy copies of the centers. In particular, two reads inside of a cluster will have edit distance O(pm), since they are p-noisy copies of the same center. Therefore, any two reads in different clusters will be far apart in edit distance whenever p ≪ c_ind is a small enough constant. We formalize these bounds and provide more details, such as high-probability results, in Appendix A.

Another feature of our algorithm is the use of binary signatures. To avoid incorrectly merging distinct clusters, we need the clusters to be separated according to q-gram distance. 
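Putting Sections 3.2 and 3.3 together, one local round of Algorithm 1 can be sketched as follows. This is a simplified single-machine sketch, assuming Python; `sample_hash` and `local_step` are illustrative names, and the toy parameters used below are not the tuned values from Section 5.

```python
import itertools
import random
from collections import defaultdict

SIGMA = "ACGT"

def edit_distance(x, y):
    """Standard O(|x||y|) Levenshtein dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def qgram_distance(x, y, q=3):
    """Hamming distance between binary q-gram signatures, computed
    here as the size of the symmetric difference of q-gram sets."""
    gx = {x[i:i + q] for i in range(len(x) - q + 1)}
    gy = {y[i:i + q] for i in range(len(y) - q + 1)}
    return len(gx ^ gy)

def sample_hash(w, ell, rng):
    """h_{pi,ell} from H_{w,ell}: pi is a random permutation of
    Sigma^w, represented as a random rank for each w-gram."""
    grams = ["".join(g) for g in itertools.product(SIGMA, repeat=w)]
    rng.shuffle(grams)
    rank = {g: i for i, g in enumerate(grams)}

    def h(x):
        # min returns the first position attaining the minimum rank,
        # i.e. the first occurrence of the pi-earliest anchor in x.
        i = min(range(len(x) - w + 1), key=lambda i: rank[x[i:i + w]])
        return x[i:i + w + ell]
    return h

def local_step(clusters, h, r, q, theta_low, theta_high, rng):
    """One local round: bucket one random representative per cluster
    by its hash, then merge within buckets using the two-threshold
    signature test with edit distance for ambiguous pairs."""
    buckets = defaultdict(list)
    for cid, members in clusters.items():
        buckets[h(rng.choice(members))].append(cid)
    for bucket in buckets.values():
        for a, b in itertools.combinations(bucket, 2):
            if a not in clusters or b not in clusters:
                continue  # already merged away earlier in this round
            x, y = rng.choice(clusters[a]), rng.choice(clusters[b])
            d = qgram_distance(x, y, q)
            if d <= theta_low or (d <= theta_high and edit_distance(x, y) <= r):
                clusters[a].extend(clusters.pop(b))
    return clusters
```

Starting from singleton clusters and iterating `local_step` with a fresh hash per round mimics the local phase; the distributed version additionally repartitions clusters across cores between rounds.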
We show that random cluster centers will have q-gram distance Ω(m) when q = 2 log4 m. Additionally, for any two reads x, y, we show that dH(σq(x), σq(y)) ≤ 2q · dE(x, y), implying that if x and y are in the same cluster, then their q-gram distance will be at most O(qpm). Therefore, whenever p ≪ 1/q ≈ 1/log m, signatures will already separate clusters. For larger p, we use the pair of thresholds θlow < θhigh to mitigate false merges. We provide more details in Appendix B.

In Section 5, we mention an optimization for the binary signatures, based on blocking, which empirically improves the approximation quality, while reducing memory and computational overhead.

4.2 Convergence and Hash Analysis

The running time of our algorithm depends primarily on the number of iterations and the total number of comparisons performed. The two types of comparisons are edit distance computations, which take time O(rm) to check distance at most r, and q-gram distance computations, which take time linear in the signature length. To avoid unnecessary comparisons, we partition cluster representatives using our hash function and only compare reads with the same hash value. Therefore, we bound the total number of comparisons by bounding the total number of hash collisions. In particular, we prove the following convergence theorem (details appear in Appendix C).

Theorem 4.1 (Informal). For sufficiently large n and m and small p, there exist parameters for our algorithm such that it outputs a clustering with accuracy (1 − ε) and the expected number of comparisons is

O( max{ n^{1+O(p)}, n² / m^{Ω(1/p)} } · (1 + log(s/ε)/s) ).

Note that n^{1+O(p)} ≥ n² / m^{Ω(1/p)} in the expression above whenever the reads are long enough, that is, when m ≥ n^{cp} (where c is some small constant). 
Thus, for a large range of n, m, p, and ε, our algorithm converges in time proportional to n^{1+O(p)}, which is sub-quadratic in n, the number of input reads. Since we expect the number of clusters k to be k = Ω(n), our algorithm outperforms any methods that require time Ω(kn) = Ω(n²) in this regime.

The running time analysis of our algorithm revolves around estimating both the collision probability of our hash function and the overall convergence time to identify the underlying clusters. The main overhead comes from unnecessarily comparing reads that belong to different clusters. Indeed, for pairs of reads inside the same cluster, the total number of comparisons is O(n), since after a comparison, the reads will merge into the same cluster. For reads in different clusters, we show that they collide with probability that is exponentially small in the hash length (since they are nearly-random strings). For the convergence analysis, we prove that reads in the same cluster will collide with significant probability, implying that after roughly

O( max{ n^{O(p)}, n / m^{Ω(1/p)} } · (1 + log(s/ε)/s) )

iterations, the found clustering will be (1 − ε) accurate.

In Section 5, we experimentally validate our algorithm's running time, convergence, and correctness properties on real and synthetic data.

4.3 Outlier Robustness

Our final theoretical result involves bounding the number of incorrect merges caused by potential outliers in the dataset. In real datasets, we expect some number of highly-noisy reads, due to experimental error. Fortunately, such outliers lead to only a minor loss in accuracy for our algorithm, when the clusters are separated. We prove the following theorem in Appendix D.

Theorem 4.2. Let C = {C1, . . . , Ck} be an (r, 2r)-separated clustering of S. Let O be any set of size ε′k. Fixing the randomness and parameters in the algorithm with distance threshold r, let C̃ be the output on S and C̃′ be the output on S ∪ O. Then, A_γ(C, C̃′) ≥ A_γ(C, C̃) − ε′.

Notice that this is optimal since ε′k outliers can clearly modify ε′k clusters. For DNA storage data recovery, if we desire 1 − ε accuracy overall, and we expect at most ε′k outliers, then we simply need to aim for a clustering with accuracy at least 1 − ε + ε′.

5 Experiments

We experimentally evaluate our algorithm on real and synthetic data, measuring accuracy and wall clock time. Table 1 describes our datasets. We evaluate accuracy on the real data by comparing the found clusterings to a gold standard clustering. We construct the gold standard by using the original reference strands, and we group the reads by their most likely reference using an established alignment tool (see Appendix E for full details). The synthetically generated data resembles real data distributions and properties [45]. We implement our algorithm in C++ using MPI. We run tests on Microsoft Azure virtual machines (size H16mr: 16 cores, 224 GB RAM, RDMA network).

Table 1: Datasets. Real data from Organick et al. [45]. Synthetic data from Defn. 2.3. Appendix E has details.

Dataset | # Reads | Avg. Length | Description
3.1M real | 3,103,511 | 150 | Movie file stored in DNA
13.2M real | 13,256,431 | 150 | Music file stored in DNA
58M real | 58,292,299 | 150 | Collection of files (40MB stored in DNA; includes above)
12M real | 11,973,538 | 110 | Text file stored in DNA
5.3B synthetic | 5,368,709,120 | 110 | Noise p = 4%; cluster size s = 10

5.1 Implementation and Parameter Details

For the edit distance threshold, we desire r to be just larger than the cluster diameter. 
With p noise, we expect the diameter to be at most 4pm with high probability. We conservatively estimate p ≈ 4% for real data, and thus we set r = 25, since 4pm = 24 for p = 0.04 and m = 150.

For the binary signatures, we observe that choosing larger q separates clusters better, but it also increases overhead, since σq(x) ∈ {0, 1}^{4^q} is very high-dimensional. To remedy this, we used a blocking approach. We partitioned x into blocks of 22 characters and computed σ3 of each block, concatenating these 64-bit strings for the final signature. On synthetic data, we found that setting θlow = 40 and θhigh = 60 greatly reduces the running time while sacrificing negligible accuracy.

For the hashing, we set w, ℓ to encourage collisions of close pairs and discourage collisions of far pairs. Following Theorem C.1, we set w = ⌈log4(m)⌉ = 4 and ℓ = 12, so that w + ℓ = 16 = log4 n with n = 2^32. Since our clusters are very small, we find that we can further filter far pairs by concatenating two independent hashes to define a bucket based on this 64-bit value. Moreover, since we expect very few reads to have the same hash, instead of comparing all pairs in a hash bucket, we sort the reads based on hash value and only compare adjacent elements. For communication, we use only the first 20 bits of the hash value, and we uniformly distribute clusters based on this.

Finally, we conservatively set the number of iterations to 780 total (26 communication rounds, each with 30 local iterations) because this led to 99.9% accuracy on synthetic data (even with γ = 1.0).

Figure 2: Comparison to Starcode. (a) Time Comparison (log scale); (b) Accuracy Comparison. Figure 2a plots running times on three real datasets of our algorithm versus four Starcode executions using four distance thresholds d ∈ {2, 4, 6, 8}. 
For the first dataset, with 3.1M real reads, Figure 2b plots Aγ for varying γ ∈ {0.6, 0.7, 0.8, 0.9, 1.0} of our algorithm versus Starcode. We stopped Starcode if it did not finish within 28 hours. We ran tests on one processor with 16 threads.

Figure 3: Empirical results for our algorithm. (a) Distributed Convergence; (b) Binary Signature Improvement; (c) Strong Scaling. Figure 3a plots accuracy A0.9 of intermediate clusterings (5.3B synthetic reads, 24 processors). Figure 3b shows single-threaded running times for four variants of our algorithm, depending on whether it uses signatures for merging and/or filtering (3.1M real reads; single thread). Figure 3c plots times as the number of processors varies from 1 to 8, with 16 cores per processor (58M real reads).

Starcode Parameters  Starcode [57] takes a distance threshold d ∈ {1, 2, . . . , 8} as an input parameter and finds all clusters with radius not exceeding this threshold. We run Starcode for various settings of d, with the intention of understanding how Starcode's accuracy and running time change with this parameter. We use Starcode's sphere clustering "-s" option, since this performed most accurately on sample data, and we use the "-t" parameter to run Starcode with 16 threads.

5.2 Discussion

Figure 2 shows that our algorithm outperforms Starcode, the state-of-the-art clustering algorithm for DNA sequences [57], in both accuracy and time. As explained above, we set our algorithm's parameters based on theoretical estimates. On the other hand, we vary Starcode's distance threshold parameter d ∈ {2, 4, 6, 8}. We demonstrate in Figures 2a and 2b that increasing this distance parameter significantly improves accuracy on real data, but it also greatly increases Starcode's running time.
Both algorithms achieve high accuracy for γ = 0.6, and the gap between the algorithms widens as γ increases. In Figure 2a, we show that our algorithm achieves more than a 1000x speedup over the most accurate setting of Starcode on three real datasets of varying sizes and read lengths. For d ∈ {2, 4, 6}, our algorithm has a smaller speedup and a larger improvement in accuracy.

Figure 3a shows how our algorithm's clustering accuracy increases with the number of communication rounds, where we evaluate Aγ with γ = 0.9. Clearly, using 26 rounds is quite conservative. Nonetheless, our algorithm took only 46 minutes of wall clock time to cluster 5.3B synthetic reads on 24 processors (384 cores). We remark that distributed MapReduce-based algorithms for string similarity joins have been reported to need tens of minutes for only tens of millions of reads [21, 51].

Figure 3b demonstrates the effect of binary signatures on runtime. Recall that our algorithm uses signatures in two places: merging clusters when dH(σ(x), σ(y)) ≤ θlow, and filtering pairs when dH(σ(x), σ(y)) > θhigh. This leads to four natural variants: (i) omitting signatures, (ii) using them for merging, (iii) using them for filtering, or (iv) both. The biggest improvement (20x speedup) comes from using signatures for filtering (comparing (i) vs. (iii)). This occurs because the cheap Hamming distance filter avoids a large number of expensive edit distance computations. Using signatures for merging provides a modest 30% improvement (comparing (iii) vs. (iv)); this gain does not appear between (i) and (ii) because of the time it takes to compute the signatures.
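As an illustration, the blocked signature construction of Section 5.1 and the two-threshold merge/filter test can be sketched in a few lines of Python. This is a simplified, single-machine sketch under our stated parameters (q = 3, 22-character blocks, θlow = 40, θhigh = 60); the function names and bit-vector representation are illustrative assumptions, not our C++/MPI implementation.

```python
# Illustrative sketch of blocked q-gram signatures and the two-threshold test.
# Not the paper's C++/MPI code; names and data layout are assumptions.
from itertools import product

Q = 3        # q-gram length per block (Section 5.1)
BLOCK = 22   # characters per block
QGRAMS = ["".join(p) for p in product("ACGT", repeat=Q)]  # all 4^3 = 64 q-grams

def signature(read):
    """Blocked binary signature: one 64-bit q-gram indicator vector per block."""
    bits = []
    for start in range(0, len(read), BLOCK):
        block = read[start:start + BLOCK]
        bits.extend(1 if g in block else 0 for g in QGRAMS)
    return bits

def hamming(a, b):
    """Hamming distance between two equal-length bit vectors."""
    return sum(x != y for x, y in zip(a, b))

def compare(x, y, theta_low=40, theta_high=60):
    """Three-way decision: cheap merge, cheap filter, or exact edit distance."""
    d = hamming(signature(x), signature(y))
    if d <= theta_low:
        return "merge"               # signatures nearly agree: same cluster
    if d > theta_high:
        return "filter"              # far apart: skip the edit distance call
    return "edit-distance check"     # ambiguous zone: pay for the exact computation
```

Identical reads yield Hamming distance 0 and merge immediately; only pairs whose signature distance falls between the two thresholds incur an exact edit distance computation, which is the source of the filtering speedup discussed above.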
Overall, the effectiveness of signatures justifies their incorporation into an algorithm that already filters based on hashing.

Figure 3c evaluates the scalability of our algorithm on 58M real reads as the number of processors varies from 1 to 8. At first, more processors lead to almost optimal speedups. Then, the communication overhead outweighs the parallelization gain. Achieving perfect scalability requires greater understanding and control of the underlying hardware and is left as future work.

6 Related Work

Recent work identifies the difficulty of clustering datasets containing large numbers of small clusters. Betancourt et al. [11] call this "microclustering" and propose a Bayesian non-parametric model for entity resolution datasets. Kobren et al. [37] call this "extreme clustering" and study hierarchical clustering methods. DNA data storage provides a new domain for micro/extreme clustering, with interesting datasets and important consequences [12, 24, 26, 45, 52].

Large-scale, extreme datasets – with billions of elements and hundreds of millions of clusters – are an obstacle for many clustering techniques [19, 29, 33, 42]. We demonstrate that DNA datasets are well-separated, which implies that our algorithm converges quickly to a highly accurate solution. It would be interesting to determine the minimum requirements for robustness in extreme clustering.

One challenge of clustering for DNA storage comes from the fact that reads are strings with edit errors and a four-character alphabet. Edit distance is regarded as a difficult metric, with known lower bounds in various models [1, 5, 7].
Similarity search algorithms based on MinHash [13, 14] originally aimed to find duplicate webpages or search results, which have much larger natural-language alphabets. However, known MinHash optimizations [40, 41] may improve our clustering algorithm. Chakraborty, Goldenberg, and Koucký explore the question of preserving small edit distances with a binary embedding [16]. This embedding was adapted by Zhang and Zhang [56] for approximate string similarity joins. We leave a thorough comparison to these papers as future work, along with obtaining better theoretical bounds for hashing or embeddings [17, 46] under our data distribution.

7 Conclusion

We highlighted a clustering task motivated by DNA data storage. We proposed a new distributed algorithm and hashing scheme for edit distance. Experimentally and theoretically, we demonstrated our algorithm's effectiveness in terms of accuracy, performance, scalability, and robustness.

We plan to release one of our real datasets. We hope our dataset and data model will lead to further research on clustering and similarity search for computational biology and other domains with strings. For future work, our techniques may also apply to other metrics and to other applications with large numbers of small, well-separated clusters, such as entity resolution or deduplication [20, 23, 32]. Finally, our work motivates a variety of new theoretical questions, such as studying the distortion of embeddings for random strings under our generative model (we elaborate on this in Appendix B).

8 Acknowledgments

We thank Yair Bartal, Phil Bernstein, Nova Fandina, Abe Friesen, Sariel Har-Peled, Christian König, Paris Koutris, Marina Meila, and Mark Yatskar for useful discussions. We also thank Alyshia Olsen for help designing the graphs.
Finally, we thank Jacob Nelson for sharing his MPI wisdom and Taylor Newill and Christian Smith from the Microsoft Azure HPC Team for help using MPI on Azure.

References

[1] A. Abboud, T. D. Hansen, V. V. Williams, and R. Williams. Simulating Branching Programs with Edit Distance and Friends: Or: A Polylog Shaved is a Lower Bound Made. In STOC, 2016.
[2] M. Ackerman, S. Ben-David, D. Loker, and S. Sabato. Clustering Oligarchies. In AISTATS, 2013.
[3] M. Ackerman and S. Dasgupta. Incremental Clustering: The Case for Extra Clusters. In NIPS, pages 307–315, 2014.
[4] N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. Journal of the ACM, 55(5):23, 2008.
[5] A. Andoni and R. Krauthgamer. The Computational Hardness of Estimating Edit Distance. SIAM J. Comput., 39(6).
[6] A. Andoni and R. Krauthgamer. The Smoothed Complexity of Edit Distance. ACM Transactions on Algorithms, 8(4):44, 2012.
[7] A. Backurs and P. Indyk. Edit Distance Cannot be Computed in Strongly Subquadratic Time (unless SETH is false). In STOC, 2015.
[8] M.-F. Balcan, Y. Liang, and P. Gupta. Robust Hierarchical Clustering. Journal of Machine Learning Research, 15(1):3831–3871, 2014.
[9] N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1-3):89–113, 2004.
[10] T. Batu, F. Ergün, J. Kilian, A. Magen, S. Raskhodnikova, R. Rubinfeld, and R. Sami. A Sublinear Algorithm for Weakly Approximating Edit Distance. In STOC, 2003.
[11] B. Betancourt, G. Zanella, J. W. Miller, H. Wallach, A. Zaidi, and B. Steorts. Flexible Models for Microclustering with Application to Entity Resolution. In NIPS, 2016.
[12] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and K. Strauss. A DNA-Based Archival Storage System. In ASPLOS, 2016.
[13] A. Z. Broder.
On the Resemblance and Containment of Documents. In Compression and Complexity of Sequences, pages 21–29. IEEE, 1997.
[14] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic Clustering of the Web. Computer Networks and ISDN Systems, 29(8-13):1157–1166, 1997.
[15] J. Buhler. Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics, 17(5):419–428, 2001.
[16] D. Chakraborty, E. Goldenberg, and M. Koucký. Streaming Algorithms for Embedding and Computing Edit Distance in the Low Distance Regime. In STOC, 2016.
[17] M. Charikar and R. Krauthgamer. Embedding the Ulam Metric into L1. Theory of Computing, 2(11):207–224, 2006.
[18] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs. In STOC, 2015.
[19] J. Chen, H. Sun, D. Woodruff, and Q. Zhang. Communication-Optimal Distributed Clustering. In NIPS, pages 3720–3728, 2016.
[20] P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, 2012.
[21] D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. MassJoin: A MapReduce-Based Method for Scalable String Similarity Joins. In ICDE, pages 340–351. IEEE, 2014.
[22] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
[23] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[24] Y. Erlich and D. Zielinski. DNA Fountain Enables a Robust and Efficient Storage Architecture. Science, 355(6328):950–954, 2017.
[25] S. Ganguly, E. Mossel, and M. Z. Rácz.
Sequence Assembly from Corrupted Shotgun Reads. In ISIT, pages 265–269, 2016. http://arxiv.org/abs/1601.07086.
[26] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney. Towards Practical, High-Capacity, Low-Maintenance Information Storage in Synthesized DNA. Nature, 494(7435), 2013.
[27] S. Gollapudi and R. Panigrahy. A Dictionary for Approximate String Search and Longest Prefix Search. In CIKM, 2006.
[28] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, et al. Approximate String Joins in a Database (Almost) for Free. In VLDB, volume 1, pages 491–500, 2001.
[29] S. Guha, Y. Li, and Q. Zhang. Distributed Partial Clustering. arXiv preprint arXiv:1703.01539, 2017.
[30] H. Hanada, M. Kudo, and A. Nakamura. On Practical Accuracy of Edit Distance Approximation Algorithms. arXiv preprint arXiv:1701.06134, 2017.
[31] S. Har-Peled, P. Indyk, and R. Motwani. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of Computing, 8(1):321–350, 2012.
[32] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller. Framework for Evaluating Clustering Algorithms in Duplicate Detection. PVLDB, 2(1):1282–1293, 2009.
[33] C. Hennig, M. Meila, F. Murtagh, and R. Rocci. Handbook of Cluster Analysis. CRC Press, 2015.
[34] Y. Jiang, D. Deng, J. Wang, G. Li, and J. Feng. Efficient Parallel Partition-Based Algorithms for Similarity Search and Join with Edit Distance Constraints. In Joint EDBT/ICDT Workshops, 2013.
[35] Y. Jiang, G. Li, J. Feng, and W.-S. Li. String Similarity Joins: An Experimental Evaluation. PVLDB, 7(8):625–636, 2014.
[36] J. Johnson, M. Douze, and H. Jégou. Billion-Scale Similarity Search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
[37] A. Kobren, N. Monath, A. Krishnamurthy, and A. McCallum. A Hierarchical Algorithm for Extreme Clustering.
In KDD, 2017.
[38] R. Krauthgamer and Y. Rabani. Improved Lower Bounds for Embeddings into L1. SIAM J. on Computing, 38(6):2487–2498, 2009.
[39] H. Li and R. Durbin. Fast and Accurate Short Read Alignment with Burrows–Wheeler Transform. Bioinformatics, 25(14):1754–1760, 2009.
[40] P. Li and C. König. b-Bit Minwise Hashing. In WWW, pages 671–680. ACM, 2010.
[41] P. Li, A. Owen, and C.-H. Zhang. One Permutation Hashing. In NIPS, 2012.
[42] G. Malkomes, M. J. Kusner, W. Chen, K. Q. Weinberger, and B. Moseley. Fast Distributed k-Center Clustering with Outliers on Massive Data. In NIPS, 2015.
[43] J. Matoušek. Lectures on Discrete Geometry, volume 212. Springer New York, 2002.
[44] M. Meilă and D. Heckerman. An Experimental Comparison of Model-Based Clustering Methods. Machine Learning, 42(1-2):9–29, 2001.
[45] L. Organick, S. D. Ang, Y.-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman, H.-Y. Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss. Scaling up DNA Data Storage and Random Access Retrieval. bioRxiv, 2017.
[46] R. Ostrovsky and Y. Rabani. Low Distortion Embeddings for Edit Distance. J. ACM, 2007.
[47] X. Pan, D. Papailiopoulos, S. Oymak, B. Recht, K. Ramchandran, and M. I. Jordan. Parallel Correlation Clustering on Big Graphs. In NIPS, pages 82–90, 2015.
[48] Z. Rasheed, H. Rangwala, and D. Barbara. Efficient Clustering of Metagenomic Sequences using Locality Sensitive Hashing. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 1023–1034. SIAM, 2012.
[49] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming Similarity Search Over One Billion Tweets Using Parallel Locality-Sensitive Hashing.
PVLDB, 6(14):1930–1941, 2013.
[50] E. Ukkonen. Approximate String-Matching with q-grams and Maximal Matches. Theoretical Computer Science, 92(1):191–211, 1992.
[51] C. Yan, X. Zhao, Q. Zhang, and Y. Huang. Efficient String Similarity Join in Multi-Core and Distributed Systems. PLoS ONE, 12(3):e0172526, 2017.
[52] S. H. T. Yazdi, R. Gabrys, and O. Milenkovic. Portable and Error-Free DNA-Based Data Storage. bioRxiv, page 079442, 2016.
[53] M. Yu, G. Li, D. Deng, and J. Feng. String Similarity Search and Join: A Survey. Frontiers of Computer Science, 10(3):399–417, 2016.
[54] P. Yuan, C. Sha, and Y. Sun. Hash^{ed}-Join: Approximate String Similarity Join with Hashing. In International Conference on Database Systems for Advanced Applications, pages 217–229. Springer, 2014.
[55] R. B. Zadeh and A. Goel. Dimension Independent Similarity Computation. The Journal of Machine Learning Research, 14(1):1605–1626, 2013.
[56] H. Zhang and Q. Zhang. EmbedJoin: Efficient Edit Similarity Joins via Embeddings. In KDD, 2017.
[57] E. V. Zorita, P. Cuscó, and G. Filion. Starcode: Sequence Clustering Based on All-Pairs Search. Bioinformatics, 2015.