{"title": "Approximation algorithms for stochastic clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 6038, "page_last": 6047, "abstract": "We consider stochastic settings for clustering, and develop provably-good (approximation) algorithms for a number of these notions. These algorithms allow one to obtain better approximation ratios compared to the usual deterministic clustering setting. Additionally, they offer a number of advantages including providing fairer clustering and clustering which has better long-term behavior for each user. In particular, they ensure that *every user* is guaranteed to get good service (on average). We also complement some of these with impossibility results.", "full_text": "Approximation algorithms for stochastic clustering∗

David G. Harris
Department of Computer Science
University of Maryland, College Park, MD 20742
davidgharris29@gmail.com

Shi Li
University at Buffalo
Buffalo, NY
shil@buffalo.edu

Thomas Pensyl
Bandwidth, Inc.
Raleigh, NC
tpensyl@bandwidth.com

Aravind Srinivasan
Department of Computer Science and Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742
srin@cs.umd.edu

Khoa Trinh
Google
Mountain View, CA 94043
khoatrinh@google.com

Abstract

We consider stochastic settings for clustering, and develop provably-good (approximation) algorithms for a number of these notions. These algorithms allow one to obtain better approximation ratios compared to the usual deterministic clustering setting. Additionally, they offer a number of advantages, including providing fairer clustering and clustering which has better long-term behavior for each user. In particular, they ensure that every user is guaranteed to get good service (on average). 
We also complement some of these with impossibility results.
KEYWORDS: clustering, k-center, k-median, lottery, approximation algorithms

1 Introduction

Clustering is a fundamental problem in machine learning and data science. A general clustering task is to partition the given data points into clusters such that the points inside the same cluster are "similar" to each other. More formally, consider a set of datapoints C and a set of "potential cluster centers" F, with a metric d on C ∪ F. We define n := |C ∪ F|. Given any set S ⊆ F, each j ∈ C is associated with the key statistic d(j, S) = min_{i∈S} d(i, j). The typical task in a clustering problem is to select a set S ⊆ F of small size in order to minimize the values d(j, S). The size of the set S is often fixed to a value k, and we typically "boil down" the large collection of values d(j, S) into a single overall objective function. Different clustering problems arise depending on the choice of objective function and the assumptions on the sets C and F. The most popular problems include

• the k-center problem: minimize the maximum value max_{j∈C} d(j, S), given that F = C;
• the k-supplier problem: minimize the maximum value max_{j∈C} d(j, S) (where F and C may be unrelated);
• the k-median problem: minimize the summed value ∑_{j∈C} d(j, S); and
• the k-means problem: minimize the summed squared value ∑_{j∈C} d(j, S)².

∗Research supported in part by NSF Awards CNS-1010789, CCF-1422569, CCF-1749864, CCF-1566356, CCF-1717134, and by research awards from Adobe, Inc.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

An important special case is when C = F (e.g. 
the k-center problem); since this often occurs in the context of data clustering, we refer to this case as the self-contained clustering (SCC) setting. In the general case, C and F may be unrelated, or may intersect arbitrarily.
These classic NP-hard problems have been studied intensively for the past few decades. There is an alternative interpretation of these clustering problems from the viewpoint of operations research: the sets F and C can be thought of as "facilities" and "clients", respectively. We say that i ∈ F is open if i is placed into the solution set S. For a set S ⊆ F of open facilities, d(j, S) can then be interpreted as the connection cost of client j. The above-mentioned problems try to optimize the cost of serving all clients in different ways. This terminology has historically been used for k-center type clustering problems, and so we adopt it throughout for consistency. However, our focus is on the case in which C and F are arbitrary given sets, as in the data-clustering setting.
Since these problems are NP-hard, much effort has been devoted to algorithms with "small" provable approximation ratios/guarantees: i.e., polynomial-time algorithms that produce solutions of cost at most α times the optimal. The current best-known approximation ratio for k-median is 2.675 by Byrka et al. [6], and it is NP-hard to approximate this problem to within a factor of 1 + 2/e ≈ 1.736 [15]. The recent breakthrough by Ahmadian et al. 
[1] gives a 6.357-approximation algorithm for k-means, improving on the previous approximation guarantee of (9 + ε) based on local search [16]. Finally, the k-supplier problem is "easier" than both k-median and k-means in the sense that a simple 3-approximation algorithm [12] is known, as is a 2-approximation for the k-center problem; we cannot do better than these approximation ratios unless P = NP [12].
While optimal approximation algorithms for the center-type problems are well-known, all of the current algorithms give deterministic solutions. One can easily demonstrate instances where such algorithms return a worst-possible solution: (i) all clusters have the same worst-possible radius (2T for k-center and 3T for k-supplier, where T is the optimal radius) and (ii) almost all data points are on the circumference of the resulting clusters. Although it is NP-hard to improve the approximation ratios here, our new randomized algorithms provide significantly-improved "per-point" guarantees. For example, we achieve a new "per-point" guarantee E[d(j, S)] ≤ (1 + 2/e)T ≈ 1.736T, while respecting the usual guarantee d(j, S) ≤ 3T with probability one. Thus, while maintaining good global quality with probability one, we also provide superior stochastic guarantees for each user.
In this paper, we study generalized variants of the center-type problems where S is drawn from a probability distribution Ω over the set of k-element subsets of F; we refer to such distributions as k-lotteries. We aim to construct a k-lottery Ω which achieves certain guarantees on the distributional properties or expected value of d(j, S). The k-center problem can be viewed as the special case in which the distribution Ω is supported on a single point (we refer to this situation by saying that Ω is deterministic). 
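To make the k-lottery viewpoint concrete, the following is a minimal Python sketch (the function and variable names are ours, not the paper's) of a finitely-supported k-lottery together with the two per-client statistics used throughout, Pr[d(j, S) ≤ r] and E[d(j, S)]:

```python
def d(j, S, dist):
    """The key statistic d(j, S) = min over i in S of d(i, j)."""
    return min(dist[i][j] for i in S)

def lottery_stats(lottery, j, r, dist):
    """lottery: list of (probability, set-of-k-facilities) pairs, with
    probabilities summing to 1. Returns (Pr[d(j,S) <= r], E[d(j,S)])
    for client j."""
    hit = sum(p for p, S in lottery if d(j, S, dist) <= r)
    exp = sum(p * d(j, S, dist) for p, S in lottery)
    return hit, exp

# Toy instance: facilities {0, 1}, a single client j = 0, and a
# 1-lottery opening each facility with probability 1/2.
dist = {0: {0: 0.0}, 1: {0: 2.0}}            # dist[facility][client]
lottery = [(0.5, {0}), (0.5, {1})]
print(lottery_stats(lottery, 0, 1.0, dist))  # (0.5, 1.0)
```

A deterministic distribution is simply the special case where the list contains a single pair with probability 1.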
Our goal is to find an approximating distribution Ω̃ which matches the target distribution Ω as closely as possible for each client j.
We have seen that stochastic solutions allow one to go beyond the approximation hardness of a number of classical facility location problems. In addition to allowing higher-quality solutions, there are a number of applications where clustering based on expected distances can be beneficial. We summarize three here: smoothing the integrality constraints of clustering, solving repeated problem instances, and achieving fair solutions.

Stochasticity as interpolation. For many problems in practice, robustness of the solution is often more important than achieving the absolute optimal value for the objective function. Our stochastic perspective is very useful here. One potential problem with the center measure is that it can be highly non-robust; adding a single new point may drastically change the overall objective function. As an extreme example, consider k-center with k points, each at distance 1 from each other. This clearly has value 0 (choosing S = C). However, if a single new point at distance 1 to all other points is added, then the solution value jumps to 1. By choosing k facilities uniformly at random among the full set of k + 1, we can ensure that E[d(j, S)] = 1/(k + 1) for every point j, a much smoother transition.
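The smoothing effect of this uniform lottery can be checked by direct enumeration; here is a minimal Python sketch (the function name is ours) of the k-points-plus-one example:

```python
from itertools import combinations

def expected_distances(k):
    """k + 1 points at pairwise distance 1; open k of them uniformly at
    random. Then d(j, S) is 0 if j is opened and 1 otherwise, so this
    returns E[d(j, S)] for every point j."""
    points = range(k + 1)
    subsets = list(combinations(points, k))   # all k-element subsets
    return [sum(1 for s in subsets if j not in s) / len(subsets)
            for j in points]

# Every point pays 1/(k + 1) in expectation, instead of some point
# always paying 1 under any fixed (deterministic) choice of S.
print(expected_distances(4))   # every entry is 1/5 = 0.2
```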
If we aim for a fair or minmax service allocation, then our k-center\nstochastic approximation results ensure that for every client j, the long-term average service-time\n\u2013 where the average is taken over the periodic re-provisioning of S \u2013 is at most around 1.736T\nwith high probability. (Furthermore, we have the risk-avoidance guarantee that in no individual\nprovisioning of S will any client have service-time greater than 3T .) We emphasize that this type of\nguarantee pertains to multi-round clustering problems, and is not by itself stochastic. This is dif\ufb01cult\nto achieve for the usual type of approximation algorithms and impossible for \u201cstateless\u201d deterministic\nalgorithms.\nFairness in clustering. The classical clustering problems combine the needs of many different points\n(elements of C) into one metric. However, clustering (and indeed many other ML problems) are\nincreasingly driven by inputs from parties with diverse interests. Fairness in these contexts is a\nchallenging issue to address; this concern has taken on greater importance in our current world, where\ndecisions will increasingly be taken by algorithms and machine learning. Representative examples of\nrecent concerns include the accusations of, and \ufb01xes for, possible racial bias in Airbnb rentals [4] and\nthe \ufb01nding that setting the gender to \u201cfemale\" in Google\u2019s Ad Settings resulted in getting fewer ads\nfor high-paying jobs [8]. Starting with older work such as [21], there have been highly-publicized\nworks on bias in allocating scarce resources \u2013 e.g., racial discrimination in hiring applicants who have\nvery similar resum\u00e9s [5]. Additional work discusses the possibility of bias in electronic marketplaces,\nwhether human-mediated or not [3, 4].\nIn such settings, a fair allocation should provide good service guarantees to each user individually. 
In\ndata clustering settings where a user corresponds to a datapoint, this means that every point j \u2208 C\nshould be guaranteed a good value of d(j,S). This is essentially what k-center type problems are\naiming for, but the stochastic setting broadens the meaning of good per-user service.\nConsider the following scenarios. Each user submits their data (corresponding to a point in C) \u2013 as\nis increasingly common, explicitly or implicitly \u2013 to an aggregator such as an e-commerce or other\nsite. The cluster centers are \u201cin\ufb02uencer\u201d nodes that the aggregator tries to connect users with in some\nway. Two examples that motivate the aggregator\u2019s budget on k are: (i) the aggregator can give a free\nproduct sample to each in\ufb02uencer to in\ufb02uence the whole population in aggregate, as in [17], and (ii)\nin a resource-constrained setting, the aggregator forms a sparse \u201csketch\u201d with k nodes (the cluster\ncenters), with each cluster center hopefully being similar to the nodes in its cluster so that the nodes\nget relevant recommendations (in a recommender-like system). Each point j would like to be in a\ncluster that is \u201chigh quality\u201d from its perspective, with d(j,S) being a good proxy for such quality.\nIndeed, there is increasing emphasis on the fact that organizations monetize their user data, and that\nusers need to be compensated for this (see, e.g., the well-known works of Lanier and others [19, 14]).\nThis is a transition from viewing data as capital to viewing data as labor. 
A concrete way for users (i.e., the data points j ∈ C) to be compensated in our context is for each user to get a guarantee on their solution quality: i.e., bounds on d(j, S).

1.1 Our contributions and overview

In Section 2, we encounter the first type of approximation algorithm, which we refer to as chance k-coverage: namely, every client j has a distance demand rj and a probability demand pj, and we wish to find a distribution satisfying Pr[d(j, S) ≤ rj] ≥ pj. We show how to obtain an approximation algorithm to find an approximating distribution Ω̃ with²

Pr_{S∼Ω̃}[d(j, S) ≤ 9rj] ≥ pj.

²Notation such as "S ∼ Ω̃" indicates that the random set S is drawn from the distribution Ω̃.

In a number of special cases, such as when all the values of pj or rj are the same, the distance factor 9 can be improved to 3, which is optimal; it is an interesting question whether we can improve the coefficient "9" to its best-possible value in the general case.
In Section 3, we consider a special case of chance k-coverage, in which pj = 1 for all clients j. This is equivalent to the classical (deterministic) k-supplier problem. By allowing the approximating distribution Ω̃ to be stochastic, we are able to achieve significantly better distance guarantees than are possible for k-supplier. For instance, we are able to approximate the k-center problem by finding an approximating distribution Ω̃ with

E_{S∼Ω̃}[d(j, S)] ≤ 1.592T and Pr[d(j, S) ≤ 3T] = 1,

where T is the optimal solution to the (deterministic) k-center problem. By contrast, deterministic polynomial-time algorithms cannot guarantee d(j, S) < 2T for all j, unless P = NP [12].
In Section 4, we show a variety of lower bounds on the approximation factors achievable by efficient algorithms (assuming that P ≠ NP). 
For instance, we show that our approximation algorithm for homogeneous chance k-coverage has the optimal distance approximation factor 3, that our approximation algorithm for k-supplier has the optimal approximation factor 1 + 2/e, and that the approximation factor 1.592 for k-center cannot be improved below 1 + 1/e.

1.2 Related work

While most analysis for facility location problems has focused on the static case, there have been other works analyzing a similar lottery model for center-type problems. In [11, 10], Harris et al. analyzed models similar to chance k-coverage and minimization of E[d(j, S)], but applied to knapsack center and matroid center problems; they also considered robust versions (in which a small subset of clients may be denied service). While the overall model was similar to the ones we explore here, the techniques are somewhat different. In particular, the works [11, 10] focus on approximately satisfying the knapsack constraints; this is very different from the problem of opening exactly k cluster centers, which we mostly cover here.
Similar types of stochastic approximation guarantees have appeared in the context of developing approximation algorithms for static problems, particularly k-median problems. In [7], Charikar & Li discussed a randomized procedure for converting a linear-programming relaxation, in which a client j has fractional distance tj, into a distribution Ω satisfying E_{S∼Ω}[d(j, S)] ≤ 3.25tj. This property can be used, among other things, to achieve a 3.25-approximation for k-median. However, many other randomized rounding algorithms for k-median only seek to preserve the aggregate value ∑_j E[d(j, S)], without our type of per-point guarantee.
We also contrast our approach with a different stochastic k-center problem considered in works such as [13, 2]. 
These consider a model with a fixed, deterministic set S of open facilities, while the client set is determined stochastically; this model is almost precisely opposite to ours.

1.3 Notation

We let [t] denote the set {1, 2, . . . , t}. We use the Iverson notation throughout, so that for any Boolean predicate P we let [[P]] be equal to one if P is true and zero otherwise. For any vector a = (a1, . . . , an) and a subset X ⊆ [n], we often write a(X) as shorthand for ∑_{i∈X} ai. Given any j ∈ C and any real number r ≥ 0, we define the ball B(j, r) = {i ∈ F | d(i, j) ≤ r}. We let Vj denote the facility closest to j. Note that Vj = j in the SCC setting.

1.4 Some useful subroutines

There are two basic subroutines that will be used repeatedly in our algorithms: dependent rounding and greedy clustering.
Proposition 1.1 ([22]). There exists a randomized polynomial-time algorithm DEPROUND(y) which takes as input a vector y ∈ [0, 1]^n, and in polynomial time outputs a random set Y ⊆ [n] with the following properties:

(P1) Pr[i ∈ Y] = yi, for all i ∈ [n],
(P2) ⌊∑_{i=1}^n yi⌋ ≤ |Y| ≤ ⌈∑_{i=1}^n yi⌉ with probability one,
(P3) Pr[Y ∩ S = ∅] ≤ ∏_{i∈S}(1 − yi) for all S ⊆ [n].

We adopt the following additional convention: suppose (y1, . . . , yn) ∈ [0, 1]^n and S ⊆ [n]; we then define DEPROUND(y, S) ⊆ S to be DEPROUND(x), for the vector x defined by xi = yi[[i ∈ S]].
The greedy clustering procedure takes as input a weight wj and a set Fj ⊆ F for every client j ∈ C, and executes the following procedure:

Algorithm 1 GREEDYCLUSTER(F, w)
1: Sort C as C = {j1, j2, . . . , jℓ} where wj1 ≤ wj2 ≤ · · · ≤ wjℓ.
2: Initialize C′ ← ∅
3: for t = 1, . . . , ℓ do
4:   if Fjt ∩ Fj′ = ∅ for all j′ ∈ C′ then update C′ ← C′ ∪ {jt}
5: Return C′

Observation 1.2. If C′ = GREEDYCLUSTER(F, w) then for any j ∈ C there is z ∈ C′ with wz ≤ wj and Fz ∩ Fj ≠ ∅.

2 The chance k-coverage problem

In this section, we consider a scenario we refer to as the chance k-coverage problem, wherein every point j ∈ C has demand parameters pj, rj, and we wish to find a k-lottery Ω such that

Pr_{S∼Ω}[d(j, S) ≤ rj] ≥ pj.   (1)

If a set of demand parameters has a k-lottery satisfying (1), we say that they are feasible. We refer to the special cases wherein every client j has a common value pj = p, or every client j has a common value rj = r (or both), as homogeneous. This often arises in the context of fair allocations; for example, k-supplier is a special case of the homogeneous chance k-coverage problem, in which pj = 1 and rj is equal to the optimal k-supplier radius.
As we discuss in Section 4, any approximation algorithm must either significantly give up the guarantee on the distance or on the probability (or both). Our approximation algorithms for chance k-coverage will be based on the following linear programming (LP) relaxation, which we refer to as the chance LP. We consider the following polytope P over variables bi, where i ranges over F (bi represents the probability of opening facility i):

(B1) ∑_{i∈B(j,rj)} bi ≥ pj for all j ∈ C,
(B2) b(F) = k,
(B3) bi ∈ [0, 1] for all i ∈ F.

For the remainder of this section, we assume we have found a vector b ∈ P. By a standard center-splitting step, we also generate, for every point j ∈ C, a center set Fj ⊆ B(j, rj) with b(Fj) = pj.
Theorem 2.1. 
If p, r is feasible then one may efficiently construct a k-lottery Ω satisfying Pr_{S∼Ω}[d(j, S) ≤ rj] ≥ (1 − 1/e)pj.

2.1 Distance approximation for chance k-coverage

We next show how to satisfy the probability constraint exactly, with a constant-factor loss in the distance guarantee. This algorithm will be based on iterated randomized rounding of the vector b. It is similar to a method of [18], which also uses iterative rounding for (deterministic) k-median and k-means approximations.
We will maintain a vector b ∈ [0, 1]^F and two sets of points Ctight and Cslack, with the following properties:

(C1) Ctight ∩ Cslack = ∅.
(C2) For all j, j′ ∈ Ctight, we have Fj ∩ Fj′ = ∅.
(C3) Every j ∈ Ctight has b(Fj) = 1.
(C4) Every j ∈ Cslack has b(Fj) ≤ 1.
(C5) We have b(∪_{j∈Ctight∪Cslack} Fj) ≤ k.

Given our initial LP solution b, setting Ctight = ∅, Cslack = C will satisfy criteria (C1)–(C5); note that (C4) holds as b(Fj) = pj ≤ 1 for all j ∈ C.
Proposition 2.2. 
Given any vector b ∈ [0, 1]^F satisfying constraints (C1)–(C5) with Cslack ≠ ∅, it is possible to generate a random vector b′ ∈ [0, 1]^F such that E[b′] = b, and with probability one b′ satisfies constraints (C1)–(C5) as well as having some j ∈ Cslack with b′(Fj) ∈ {0, 1}.
We can now describe our iterative rounding algorithm, Algorithm 2.

Algorithm 2 Iterative rounding algorithm for chance k-coverage
1: Let b be a fractional LP solution and form the corresponding sets Fj.
2: Initialize Ctight = ∅, Cslack = C
3: while Cslack ≠ ∅ do
4:   Draw a fractional solution b′ with E[b′] = b according to Proposition 2.2.
5:   Select some w ∈ Cslack with b′(Fw) ∈ {0, 1}.
6:   Update Cslack ← Cslack − {w}
7:   if b′(Fw) = 1 then
8:     Update Ctight ← Ctight ∪ {w}
9:     if there is any z ∈ Ctight ∪ Cslack such that rz ≥ rw/2 and Fz ∩ Fw ≠ ∅ then
10:      Update Ctight ← Ctight − {z}, Cslack ← Cslack − {z}
11:   Update b ← b′.
12: For each j ∈ Ctight, open an arbitrary center in Fj.

Theorem 2.3. Every j ∈ C has Pr[d(j, S) ≤ 9rj] ≥ pj.

2.2 Approximation algorithm for uniform p or r

The distance guarantee can be significantly improved in two natural cases: when all the values of pj are the same, or when all the values of rj are the same.
We use a similar algorithm for both these cases. The main idea is to first select a set C′ according to some greedy order, such that the cluster sets {Fj′ | j′ ∈ C′} intersect all the clusters Fj. We then open a single item from each cluster. The only difference between the two cases is the choice of weighting function wj for the greedy cluster selection. 
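The greedy cluster-selection step shared by both cases (GREEDYCLUSTER, Algorithm 1) is easy to state in code; the following is a minimal Python sketch under our own naming, with each set Fj represented as a Python set:

```python
def greedy_cluster(F, w):
    """GREEDYCLUSTER: scan clients in nondecreasing order of weight w[j],
    keeping client j only if its set F[j] is disjoint from the sets of
    all previously kept clients.

    F: dict mapping client -> set of facilities (the set F_j)
    w: dict mapping client -> weight w_j
    Returns the list C' of selected cluster representatives."""
    selected = []
    for j in sorted(F, key=lambda c: w[c]):
        if all(F[j].isdisjoint(F[z]) for z in selected):
            selected.append(j)
    return selected

# Client 2 is dropped because F[2] meets F[1]; client 3 survives.
F = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
w = {1: 1.0, 2: 2.0, 3: 3.0}
print(greedy_cluster(F, w))   # [1, 3]
```

Consistent with Observation 1.2, the dropped client 2 has a kept representative (client 1) with no larger weight and an intersecting facility set.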
We can summarize these algorithms as follows:

Algorithm 3 Rounding algorithm for k-coverage for uniform p or uniform r
1: Set C′ = GREEDYCLUSTER(Fj, wj)
2: Set Y = DEPROUND(p, C′)
3: Form the solution set S = {Vj | j ∈ Y}.

Proposition 2.4. Algorithm 3 opens at most k facilities.

Proof. The dependent rounding step ensures that |Y| ≤ ⌈∑_{j∈C′} pj⌉ = ⌈∑_{j∈C′} b(Fj)⌉ ≤ ⌈∑_{i∈F} bi⌉ ≤ k.

Proposition 2.5. Suppose that pj is the same for every j ∈ C. Then using the weighting function wj = rj ensures that every j ∈ C satisfies Pr[d(j, S) ≤ 3rj] ≥ pj. Furthermore, in the SCC setting, it satisfies Pr[d(j, S) ≤ 2rj] ≥ pj.
Proposition 2.6. Suppose that rj is the same for every j ∈ C. Then using the weighting function wj = 1 − pj ensures that every j ∈ C satisfies Pr[d(j, S) ≤ 3rj] ≥ pj. Furthermore, in the SCC setting, it satisfies Pr[d(j, S) ≤ 2rj] ≥ pj.

3 Chance k-coverage: the deterministic case

An important special case of the chance k-coverage problem is the one where pj = 1 for all j ∈ C. In this case, the target distribution Ω is just a single set S satisfying d(j, S) ≤ rj. In the homogeneous setting (where all the rj are equal to the same value), this is specifically the k-center or k-supplier problem. The typical approach to approximating this is to have the approximating distribution Ω̃ also be a single set S′; in such a case the best guarantee available is d(j, S) ≤ 3rj.
We improve the distance guarantee by allowing Ω̃ to be a distribution. Specifically, we construct a k-lottery Ω̃ such that d(j, S) ≤ 3rj with probability one, and E_{S∼Ω̃}[d(j, S)] ≤ crj, where the constant c satisfies the following bounds:

1. 
In the general case, c = 1 + 2/e ≈ 1.73576;
2. In the SCC setting, c = 1.60793;
3. In the homogeneous SCC setting, c = 1.592.

By a reduction from set cover, we will show that the constant value 1 + 2/e is optimal for the general case (even when all the rj are equal), and that for the third case the constant c cannot be made lower than 1 + 1/e ≈ 1.368.
We also remark that this type of stochastic guarantee allows us to efficiently construct publicly-verifiable lotteries. This means that the server locations are not only fair, but they are also transparent and seen to be fair.
Proposition 3.1. Let ε > 0. Suppose that there is an efficiently-samplable k-lottery Ω with Pr_{S∼Ω}[d(j, S) ≤ c1rj] = 1 and E_{S∼Ω}[d(j, S)] ≤ c2rj, for constants c2 ≤ c1. Then there is an expected polynomial-time procedure to construct an explicitly-enumerated k-lottery Ω′, with support size |Ω′| = O(log n / ε²), such that Pr_{S∼Ω′}[d(j, S) ≤ c1rj] = 1 and E_{S∼Ω′}[d(j, S)] ≤ (c2 + ε)rj.

3.1 Randomized rounding via clusters

We use a similar type of algorithm to that considered in Section 2.2: we choose a covering set of clusters C′, and we open exactly one item from each cluster. The main difference is that instead of opening the nearest item Vj for each j ∈ C′, we open an item of each cluster according to the LP solution b.

Algorithm 4 Rounding algorithm for k-supplier
1: Set C′ = GREEDYCLUSTER(Fj, rj).
2: Set F0 = F − ∪_{j∈C′} Fj
3: for j ∈ C′ do
4:   Randomly select Wj ∈ Fj according to the distribution Pr[Wj = i] = bi
5: Let S0 ← DEPROUND(b, F0)
6: Return S = S0 ∪ {Wj | j ∈ C′}

Theorem 3.2. 
For any w ∈ C, the solution set S of Algorithm 4 satisfies d(w, S) ≤ 3rw with probability one and E[d(w, S)] ≤ (1 + 2/e)rw.
In the SCC setting, we may improve the approximation ratio using the following Algorithm 5; it is the same as Algorithm 4, except that we have moved some of the values of b to the cluster centers.

Algorithm 5 Rounding algorithm for k-center
1: Set C′ = GREEDYCLUSTER(Fj, rj).
2: Set F0 = F − ∪_{j∈C′} Fj
3: for j ∈ C′ do
4:   Randomly select Wj ∈ Fj according to the distribution Pr[Wj = i] = (1 − q)bi + q[[i = j]]
5: Let S0 ← DEPROUND(b, F0)
6: Return S = S0 ∪ {Wj | j ∈ C′}

Theorem 3.3. Let w ∈ C. After running Algorithm 5 with q = 0.464587, we have d(w, S) ≤ 3rw with probability one and E[d(w, S)] ≤ 1.60793rw.

3.2 The homogeneous SCC setting

The homogeneous SCC setting, in which all the values of rj are equal, is equivalent to the classical k-center problem. We will improve on Theorem 3.3 through a more complicated rounding process based on greedily-chosen partial clusters. Specifically, we select cluster centers π(1), . . . , π(n), wherein π(i) is chosen to maximize b(F_{π(i)} − F_{π(1)} − · · · − F_{π(i−1)}). By renumbering C, we may assume without loss of generality that the resulting permutation π is the identity; therefore, we assume throughout this section that C = F = [n] and for all i, j ∈ [n] we have

b(Fi − F1 − · · · − Fi−1) ≥ b(Fj − F1 − · · · − Fi−1).

For each j ∈ [n], we let Gj = Fj − F1 − · · · − Fj−1; we refer to Gj as a cluster and we let zj = b(Gj). We say that Gj is a full cluster if zj = 1 and a partial cluster otherwise.
We use the following randomized rounding strategy to select the centers. 
We begin by choosing two real numbers Qf, Qp (short for full and partial); these are drawn according to a joint probability distribution which we discuss later. Recall our notational convention that q̄ = 1 − q; this notation will be used extensively in this section to simplify the formulas.
We then use Algorithm 7:

Algorithm 7 Partial-cluster based algorithm for k-center
1: Z ← DEPROUND(z)
2: for j ∈ Z do
3:   if zj = 1 then
4:     Randomly select a point Wj ∈ Gj according to the distribution Pr[Wj = i] = (1 − Qf)yi + Qf[[i = j]]
5:   else
6:     Randomly select a point Wj ∈ Gj according to the distribution Pr[Wj = i] = ((1 − Qp)yi + Qp[[i = j]])/zj
7: Return S = {Wj | j ∈ Z}

Let us give some intuitive motivation for Algorithm 7. Consider some j ∈ C. It may be beneficial to open the center of some cluster near j, as this ensures d(j, S) ≤ 2. However, there is no benefit to opening the centers of multiple clusters. So, we would like to ensure that there is a significant negative correlation between opening the centers of distinct clusters near j. Unfortunately, there does not seem to be any way to achieve this with respect to full clusters: as all full clusters "look alike," we cannot enforce any significant negative correlation among the indicator random variables for opening their centers. By taking advantage of partial clusters, we are able to break this symmetry.
Theorem 3.4. Suppose that (Qf, Qp) has the following distribution:

(Qf, Qp) = (0.4525, 0) with probability p = 0.773, and (Qf, Qp) = (0.0480, 0.3950) with probability 1 − p.

Then for all i ∈ F we have d(i, S) ≤ 3ri with probability one, and E[d(i, S)] ≤ 1.592ri.

4 Lower bounds on approximation ratios

We next show lower bounds corresponding to the optimization problems for chance k-coverage analyzed in Sections 2 and 3. 
These constructions are adapted from similar lower bounds for the approximability of k-median [9], which in turn are based on the hardness of set cover. In a set cover instance, we have a ground set [n], as well as a collection of sets B = {S1, . . . , Sm} ⊆ 2^[n]. The goal is to select a collection X ⊆ [m] such that ∪_{i∈X} Si = [n], and such that |X| is minimized. The minimum value of |X| thus obtained is denoted by OPT. For X ⊆ [m] we write S_X = ∪_{i∈X} Si.
We quote a result of Moshkovitz [20] on the inapproximability of set cover.
Theorem 4.1 ([20]). Assuming P ≠ NP, there is no algorithm running in polynomial time which guarantees a set-cover solution X with S_X = [n] and |X| ≤ (1 − ε) ln n × OPT, where ε > 0 is any constant.
For the remainder of this section, we assume that P ≠ NP. We will need a simple corollary of Theorem 4.1, which is a (slight) reformulation of the hardness of approximating max-coverage.
Corollary 4.2. Assuming P ≠ NP, there is no polynomial-time algorithm which guarantees a set-cover solution X with |X| ≤ OPT and |S_X| ≥ cn for any constant c > 1 − 1/e.
We can encode facility location problems as special cases of max-coverage. Let us sketch how this works in the non-SCC setting.
Consider a set cover instance B = {S1, . . . , Sm} on ground set [n]. We begin by guessing the value OPT (there are at most m possibilities, so this can be done in polynomial time). We define a k-center instance with k = OPT. We define disjoint client and facility sets, where F is identified with [m] and C is identified with [n]. The distance is defined by d(i, j) = 1 if j ∈ Si and d(i, j) = 3 otherwise.
Now note that if X is an optimal solution to B then d(j, X) ≤ 1 for all points j ∈ C. So there exists a (deterministic) distribution achieving rj = 1. 
On the other hand, suppose that A generates a solution X ∈ \binom{F}{k} (a k-element subset of F) where E[d(j, X)] ≤ crj; the set X can also be regarded as a solution to the set cover instance.
Thus, if a polynomial-time algorithm A generates a distribution Ω which ensures that E[d(j, X)] ≤ crj, it also generates a solution covering an expected fraction (3 − c)/2 of the ground set [n]. By Corollary 4.2, this means that we must have (3 − c)/2 ≤ 1 − 1/e, i.e. c ≥ 1 + 2/e.
The construction for SCC instances and for chance k-coverage problems is similar, with a few more technical details.

Theorem 4.3. Assuming P ≠ NP, no polynomial-time algorithm can guarantee E[d(j,S)] ≤ crj for c < 1 + 2/e, even if all rj are equal to each other. Thus, the approximation constant in Theorem 3.2 cannot be improved.

Theorem 4.4. Assuming P ≠ NP, no polynomial-time algorithm can guarantee E[d(j,S)] ≤ crj for c < 1 + 1/e, even if all rj are equal to each other and C = F. Thus, the approximation factor 1.592 in Theorem 3.4 cannot be improved below 1 + 1/e.

Proposition 4.5. Assuming P ≠ NP, no polynomial-time algorithm can take as input a feasible vector r, p for the chance k-coverage problem and generate a k-lottery Ω guaranteeing either Pr_{S∼Ω}[d(j,S) < rj] ≥ (1 − 1/e − ε)pj, or Pr_{S∼Ω}[d(j,S) < 3rj] ≥ pj, for any constant ε > 0. This holds even when restricted to homogeneous problem instances. Likewise, in the homogeneous SCC setting, we cannot ensure that Pr_{S∼Ω}[d(j,S) < 2rj] ≥ pj.

5 Acknowledgments

Thanks to the anonymous conference referees for many useful suggestions and for helping to tighten the focus of the paper.

References
[1] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. CoRR, abs/1612.07925, 2016.

[2] S. Alipour and A.
Jafari. Improvements on the k-center problem for uncertain data. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), pages 425–433, 2018.

[3] I. Ayres, M. Banaji, and C. Jolls. Race effects on eBay. The RAND Journal of Economics, 46(4):891–917, 2015.

[4] E. Badger. How Airbnb plans to fix its racial-bias problem. Washington Post, September 8, 2016.

[5] M. Bertrand and S. Mullainathan. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991–1013, 2004.

[6] J. Byrka, T. Pensyl, B. Rybicki, A. Srinivasan, and K. Trinh. An improved approximation for k-median, and positive correlation in budgeted optimization. ACM Transactions on Algorithms, 13(2):23:1–23:31, 2017.

[7] M. Charikar and S. Li. A dependent LP-rounding approach for the k-median problem. In Automata, Languages, and Programming, pages 194–205, 2012.

[8] A. Datta, M. C. Tschantz, and A. Datta. Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. In Proceedings on Privacy Enhancing Technologies, pages 92–112, 2015.

[9] S. Guha and S. Khuller. Greedy strikes back: improved facility location algorithms. Journal of Algorithms, 31(1):228–248, 1999.

[10] D. G. Harris, T. Pensyl, A. Srinivasan, and K. Trinh. Additive pseudo-solutions for knapsack constraints via block-selection rounding. CoRR, abs/1709.06995, 2017.

[11] D. G. Harris, T. Pensyl, A. Srinivasan, and K. Trinh. A lottery model for center-type problems with outliers. In LIPIcs-Leibniz International Proceedings in Informatics, volume 81. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[12] D. S. Hochbaum and D. B. Shmoys. A unified approach to approximation algorithms for bottleneck problems. J. ACM, 33(3):533–550, 1986.

[13] L. Huang and J. Li.
Stochastic k-center and j-flat-center problems. In Proceedings of the 28th annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 110–129, 2017.

[14] I. A. Ibarra, L. Goff, D. J. Hernández, J. Lanier, and E. G. Weyl. Should we treat data as labor? Moving beyond 'free'. American Economic Association Papers & Proceedings, 1(1), 2018.

[15] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In Proceedings of the 34th annual ACM Symposium on Theory of Computing (STOC), pages 731–740, 2002.

[16] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.

[17] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. Theory of Computing, 11:105–147, 2015.

[18] R. Krishnaswamy, S. Li, and S. Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 646–659, 2018.

[19] J. Lanier. Who Owns the Future? Simon & Schuster, 2014.

[20] D. Moshkovitz. The projection games conjecture and the NP-hardness of ln n-approximating set-cover. Theory of Computing, 11(7):221–235, 2015.

[21] K. A. Schulman, J. A. Berlin, W. Harless, J. F. Kerner, S. Sistrunk, B. J. Gersh, R. Dubé, C. K. Taleghani, J. E. Burke, S. Williams, J. M. Eisenberg, W. Ayers, and J. J. Escarce. The effect of race and sex on physicians' recommendations for cardiac catheterization. New England Journal of Medicine, 340(8):618–626, 1999.

[22] A. Srinivasan. Distributions on level-sets with applications to approximation algorithms.
In Proceedings of the 42nd annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 588–597, 2001.