{"title": "G2SAT: Learning to Generate SAT Formulas", "book": "Advances in Neural Information Processing Systems", "page_first": 10553, "page_last": 10564, "abstract": "The Boolean Satisfiability (SAT) problem is the canonical NP-complete problem and is fundamental to computer science, with a wide array of applications in planning, verification, and theorem proving. Developing and evaluating practical SAT solvers relies on extensive empirical testing on a set of real-world benchmark formulas. However, the availability of such real-world SAT formulas is limited. While these benchmark formulas can be augmented with synthetically generated ones, existing approaches for doing so are heavily hand-crafted and fail to simultaneously capture a wide range of characteristics exhibited by real-world SAT instances. In this work, we present G2SAT, the first deep generative framework that learns to generate SAT formulas from a given set of input formulas. Our key insight is that SAT formulas can be transformed into latent bipartite graph representations which we model using a specialized deep generative neural network. We show that G2SAT can generate SAT formulas that closely resemble given real-world SAT instances, as measured by both graph metrics and SAT solver behavior. 
Further, we show that our synthetic SAT formulas could be used to improve SAT solver performance on real-world benchmarks, which opens up new opportunities for the continued development of SAT solvers and a deeper understanding of their performance.", "full_text": "G2SAT: Learning to Generate SAT Formulas\n\nJiaxuan You1\u2217\n\njiaxuan@cs.stanford.edu\n\nHaoze Wu1\u2217\n\nhaozewu@stanford.edu\n\nClark Barrett1\n\nbarrett@cs.stanford.edu\n\nRaghuram Ramanujan2\n\nraramanujan@davidson.edu\n\nJure Leskovec1\n\njure@cs.stanford.edu\n\n1Department of Computer Science, Stanford University\n\n2Department of Mathematics and Computer Science, Davidson College\n\nAbstract\n\nThe Boolean Satis\ufb01ability (SAT) problem is the canonical NP-complete problem\nand is fundamental to computer science, with a wide array of applications in plan-\nning, veri\ufb01cation, and theorem proving. Developing and evaluating practical SAT\nsolvers relies on extensive empirical testing on a set of real-world benchmark for-\nmulas. However, the availability of such real-world SAT formulas is limited. While\nthese benchmark formulas can be augmented with synthetically generated ones,\nexisting approaches for doing so are heavily hand-crafted and fail to simultaneously\ncapture a wide range of characteristics exhibited by real-world SAT instances. In\nthis work, we present G2SAT, the \ufb01rst deep generative framework that learns to\ngenerate SAT formulas from a given set of input formulas. Our key insight is\nthat SAT formulas can be transformed into latent bipartite graph representations\nwhich we model using a specialized deep generative neural network. We show that\nG2SAT can generate SAT formulas that closely resemble given real-world SAT\ninstances, as measured by both graph metrics and SAT solver behavior. 
Further,\nwe show that our synthetic SAT formulas could be used to improve SAT solver\nperformance on real-world benchmarks, which opens up new opportunities for\nthe continued development of SAT solvers and a deeper understanding of their\nperformance.\n\n1\n\nIntroduction\n\nThe Boolean Satis\ufb01ability (SAT) problem is central to computer science, and \ufb01nds many applications\nacross Arti\ufb01cial Intelligence, including planning [24], veri\ufb01cation [7], and theorem proving [14].\nSAT was the \ufb01rst problem to be shown to be NP-complete [9], and there is believed to be no general\nprocedure for solving arbitrary SAT instances ef\ufb01ciently. Nevertheless, modern solvers are able\nto routinely decide large SAT instances in practice, with different algorithms proving to be more\nsuccessful than others on particular problem instances. For example, incomplete search methods such\nas WalkSAT [35] and survey propagation [6] are more effective at solving large, randomly generated\nformulas, while complete solvers leveraging con\ufb02ict-driven clause learning (CDCL) [30] fare better\non large structured SAT formulas that commonly arise in industrial settings.\nUnderstanding, developing and evaluating modern SAT solvers relies heavily on extensive empirical\ntesting on a suite of benchmark SAT formulas. Unfortunately, in many domains, availability of\n\n\u2217The two \ufb01rst authors made equal contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: An overview of the proposed G2SAT model. Top left: A given bipartite graph can be\ndecomposed into a set of disjoint trees by applying a sequence of node splitting operations. Orange\nnode c in graph Gi is split into two blue c nodes in graph Gi\u22121. Every time a node is split, one more\nnode appears in the right partition. 
Right: We use node pairs gathered from such a sequence of node\nsplitting operations to train a GCN-based classi\ufb01er that predicts whether a pair of c nodes should be\nmerged. Bottom left: Given such a classi\ufb01er, G2SAT generates a bipartite graph by starting with a\nset of trees G0 and applying a sequence of node merging operations, where two blue nodes in graph\nGi\u22121 get merged in graph Gi. G2SAT uses the GCN-based classi\ufb01er that captures the bipartite graph\nstructure to sequentially decide which nodes to merge from a set of candidates. Best viewed in color.\n\nbenchmarks is still limited. While this situation has improved over the years, new and interesting\nbenchmarks \u2014 both real and synthetic \u2014 are still in demand and highly welcomed by the SAT\ncommunity. Developing expressive generators of structured SAT formulas is important, as it would\nprovide for a richer set of evaluation benchmarks, which would in turn allow for the development of\nbetter and faster SAT solvers. Indeed, the problem of pseudo-industrial SAT formula generation has\nbeen identi\ufb01ed as one of the ten key challenges in propositional reasoning and search [36].\nOne promising approach for tackling this challenge is to represent SAT formulas as graphs, thus\nrecasting the original problem as a graph generation task. Speci\ufb01cally, every SAT formula can\nbe converted into a corresponding bipartite graph, known as its literal-clause graph (LCG), via a\nbijective mapping. Prior work in pseudo-industrial SAT instance generation has relied on hand-crafted\nalgorithms [15, 16], focusing on capturing one or two of the graph statistics exhibited by real-world\nSAT formulas [1, 34]. As researchers continue to uncover new and interesting characteristics of\nreal-world SAT instances [1, 23, 34, 31], previous SAT generators might become invalid, and hand-\ncrafting new models that simultaneously capture all the pertinent properties becomes increasingly\ndif\ufb01cult. 
On the other hand, recent work on deep generative models for graphs [4, 29, 43, 45] has\ndemonstrated their ability to capture many of the essential features of real-world graphs such as\nsocial networks and citation networks, as well as graphs arising in biology and chemistry. However,\nthese models do not enforce bipartiteness and therefore cannot be directly employed in our setting.\nWhile it might be possible to post-process these generated graphs, such a solution would be ad hoc,\ncomputationally expensive and would fail to exploit the unique structure of bipartite graphs.\nIn this paper, we present G2SAT, the \ufb01rst deep generative model that learns to generate SAT formulas\nbased on a given set of input formulas. We use LCGs to represent SAT formulas, and formulate the\ntask of SAT formula generation as a bipartite graph generation problem. Our key insight is that any\nbipartite graph can be generated by starting with a set of trees, and then applying a sequence of node\nmerging operations over the nodes from one of the two partitions. As we merge nodes, trees are also\nmerged, and complex bipartite structures begin to appear (Figure 1, left). In this manner, a set of\ninput bipartite graphs (SAT formulas) can be characterized by a distribution over the sequence of\nnode merging operations. Assuming we can capture/learn the distribution over the pairs of nodes\nto merge, we can start with a set of trees and then keep merging nodes in order to generate realistic\nbipartite graphs (i.e., realistic SAT formulas). 
G2SAT models this iterative node merging process in an auto-regressive manner, where a node merging operation is viewed as a sample from the underlying conditional distribution that is parameterized by a Graph Convolutional Neural Network (GCN) [18, 19, 27, 44], and the same GCN is shared across all the steps of the generation process.

[Figure 1 diagram: a node splitting sequence (bipartite graph to set of trees) and a node merging sequence (set of trees to bipartite graph), with the GCN predicting positive and negative node pairs; legend: clause, edge, positive/negative literal, extra message passing path, intermediate graph.]

This formulation raises the following question: how do we devise a sequential generative process when we are only given a static input SAT formula? In other words, how do we generate training data for our generator? We resolve this challenge as follows (Figure 1). We define node splitting as the inverse operation to node merging. We apply this node splitting operation to a given input bipartite graph (a real-world SAT formula) and decompose it into a set of trees. We then reverse the splitting sequence, so that we start with a set of trees and learn from the sequence of node merging operations that recovers a realistic SAT formula. We train a GCN-based classifier that decides which two nodes to merge next, based on the structure of the graph generated so far.
At graph generation time, we initialize G2SAT with a set of trees and iteratively merge node pairs based on the conditional distribution parameterized by the trained GCN model, until a user-specified stopping criterion is met.
We utilize an ef\ufb01cient two-phase generation procedure: in the node proposal\nphase, candidate node pairs are randomly drawn, and in the node merging phase, the learned GCN\nmodel is applied to select the most likely node pair to merge.\nExperiments demonstrate that G2SAT is able to generate formulas that closely resemble the input\nreal-world SAT instances in many graph-theoretic properties such as modularity and the presence\nof scale-free structures, with 24% lower relative error compared to state-of-the-art methods. More\nimportantly, G2SAT generates formulas that exhibit the same hardness characteristics as real-world\nSAT formulas in the following sense: when the generated instances are solved using various SAT\nsolvers, those solvers that are known to be effective on structured, real-world instances consistently\noutperform those solvers that are specialized in solving random SAT instances. Moreover, our\nresults suggest that we can use our synthetically generated formulas to more effectively tune the\nhyperparameters of SAT solvers, achieving an 18% speedup in run time on unseen formulas, compared\nto tuning the solvers only on the formulas used for training.1\n\n2 Preliminaries\n\nGoal of generating SAT formulas. Our goal is to design a SAT generator that, given a suite of SAT\nformulas, generates new SAT formulas with similar properties. Our aim is to capture not only graph\ntheoretic properties, but also realistic SAT solver behavior. For example, if we train our G2SAT model\non formulas from application domain X, then solvers that traditionally excel in solving problems\nin domain X, should also excel in solving synthetic G2SAT formulas (rather than, say, solvers that\nspecialize in solving random SAT formulas).\nSAT formulas and their graph representations. A SAT formula \u03c6 is composed of Boolean vari-\nables xi connected by the logical operators and (\u2227), or (\u2228), and not (\u00ac). 
A formula is satisfiable if there exists an assignment of Boolean values to the variables such that the overall formula evaluates to true. In this paper, we are concerned with formulas in Conjunctive Normal Form (CNF)2, i.e., formulas expressed as conjunctions of disjunctions. Each disjunction is called a clause, while a Boolean variable xi or its negation ¬xi is called a literal. For example, (x1 ∨ x2 ∨ ¬x3) ∧ (¬x1 ∨ ¬x2) is a CNF formula with two clauses that can be satisfied by assigning true to x1 and false to x2.
Traditionally, researchers have studied four different graph representations for SAT formulas [3]: (1) Literal-Clause Graph (LCG): there is a node for each literal and each clause, with an edge denoting the occurrence of a literal in a clause. An LCG is bipartite and there exists a bijection between CNF formulas and LCGs. (2) Literal-Incidence Graph (LIG): there is a node for each literal and two literals have an edge if they co-occur in a clause. (3) Variable-Clause Graph (VCG): obtained by merging the positive and negative literals of the same variables in an LCG. (4) Variable-Incidence Graph (VIG): obtained by performing the same literal merging operation on the LIG. In this paper, we use LCGs to represent SAT formulas.
LCGs as bipartite graphs. We represent a bipartite graph G = (V^G, E^G) by its node set V^G = {v^G_1, ..., v^G_n} and edge set E^G ⊆ {(v^G_i, v^G_j) | v^G_i, v^G_j ∈ V^G}. In the rest of the paper, we omit the superscript G whenever it is possible to do so. Nodes in a bipartite graph can be split into two disjoint partitions V1 and V2 such that V = V1 ∪ V2. Edges only exist between nodes in different partitions, i.e., E ⊆ {(vi, vj) | vi ∈ V1, vj ∈ V2}.
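As a concrete illustration of the LCG construction, here is a minimal pure-Python sketch (our own, not the released G2SAT code) that converts a small CNF formula into its literal-clause graph and back, showing the bijection; literals are encoded as signed integers in the DIMACS convention.

```python
def cnf_to_lcg(clauses):
    """Build the literal-clause graph (LCG) of a CNF formula.

    `clauses` is a list of clauses, each a tuple of nonzero ints in
    DIMACS style: k means literal x_k, -k means its negation ¬x_k.
    Returns the literal partition V1, clause partition V2, and edges.
    """
    # Every variable contributes both polarities to the literal partition.
    lits = {l for c in clauses for l in c} | {-l for c in clauses for l in c}
    V1 = [("lit", l) for l in sorted(lits)]
    V2 = [("cls", i) for i in range(len(clauses))]
    E = {(("lit", l), ("cls", i)) for i, c in enumerate(clauses) for l in c}
    return V1, V2, E

def lcg_to_cnf(clause_nodes, edges):
    """Invert the mapping: recover the clause list from an LCG."""
    recovered = []
    for node in clause_nodes:
        recovered.append(tuple(sorted(l for (_, l), c in edges if c == node)))
    return recovered

# (x1 ∨ x2 ∨ ¬x3) ∧ (¬x1 ∨ ¬x2), the example from the text
formula = [(1, 2, -3), (-1, -2)]
V1, V2, E = cnf_to_lcg(formula)
```

With n = 3 variables the literal partition has 2n = 6 nodes, and every edge runs between the two partitions, so the graph is bipartite by construction.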
An LCG with n literals and m clauses has V1 = {l1, ..., ln} and V2 = {c1, ..., cm}, where V1 and V2 are referred to as the literal partition and the clause partition, respectively. We may also write out li as xi or ¬xi when specifying the literal sign is necessary.
Benefits of using LCGs to generate SAT formulas. While we choose to work with LCGs because they are bijective to SAT formulas, the LIG is also a viable alternative. Unlike LCGs, there are no explicit constraints over LIGs, and thus, previously developed general deep graph generators could in principle be used. However, the ease of generating LIGs is offset by the fact that key information is lost during the translation from the corresponding SAT formula. In particular, given a pair of literals, the LIG only encodes whether they co-occur in a clause but fails to capture how many times and in which clauses they co-occur. It can further be shown that an LIG corresponds to a number of SAT formulas that is at least exponential in the number of 3-cliques in the LIG. This ambiguity severely limits the usefulness of LIGs for SAT benchmark generation.

1Link to code and datasets: http://snap.stanford.edu/g2sat/
2Any SAT formula can be converted to an equisatisfiable CNF formula in linear time [39].

3 Related Work

SAT Generators. Existing synthetic SAT generators are hand-crafted models that are typically designed to generate formulas that fit a particular graph statistic. The mainstream generators for pseudo-industrial SAT instances include the Community Attachment (CA) model [15], which generates formulas with a given VIG modularity, and the Popularity-Similarity (PS) model [16], which generates formulas with a specific VCG degree distribution.
In addition, there are also generators\nfor random k-SAT instances [5] and crafted instances that come from translations of structured\ncombinatorial problems, such as graph coloring, graph isomorphism, and Ramsey numbers [28].\nCurrently, all SAT generators are hand-crafted and machine learning provides an exciting alternative.\nDeep Graph Generators. Existing deep generative models of graphs fall into two categories. In the\n\ufb01rst class are models that focus on generating perturbations of a given graph, by direct decoding from\ncomputed node embeddings [26] or latent variables [17], or learning the random walk distribution of\na graph [4]. The second class comprises models that can learn to generate a graph by sequentially\nadding nodes and edges [29, 45, 43]. Domain speci\ufb01c generators for molecular graphs [10, 22] and\n3D point cloud graphs [40] have also been developed. However, current deep generative models of\ngraphs do not readily apply to SAT instance generation. Thus, we develop a novel bipartite graph\ngenerator that respects all the constraints imposed by graphical representations of SAT formulas and\ngenerates the formula graph via a sequence of node merging operations.\nDeep learning for SAT. NeuroSAT also represents SAT formulas as graphs and computes node\nembeddings using GCNs [38]. However, NeuroSAT focuses on using the embeddings to solve SAT\nformulas, while we aim to generate SAT formulas. A preliminary version of the work presented in\nthis paper appeared in [41], where existing graph generative models were used to learn the LIG of a\nSAT formula. However, extensive post-processing was required to extract a formula from a generated\nLIG, since an LIG is an ambiguous representation of a SAT formula. 
In this work, we develop a new deep graph generative model that, unlike existing techniques, is able to directly learn the bijective graph representation of a SAT formula, and therefore better capture its characteristics.

4 The G2SAT Framework

4.1 G2SAT: Generating Bipartite Graphs by Iterative Node Merging Operations

As discussed in Section 2, a SAT formula is uniquely represented by its LCG, which is a bipartite graph. From the perspective of generative models, our primary objective is to learn a distribution over bipartite graphs, based on a set of observed bipartite graphs G sampled from the data distribution p(G). Each bipartite graph G ∈ G may have a different number of nodes and edges. Due to the complex dependency between nodes and edges, directly learning p(G) is challenging. Therefore, we generate a graph via an n-step iterative process, p(G) = ∏_{i=1}^{n} p(Gi | G1, ..., Gi−1), where Gi refers to an intermediate graph at step i in the iterative generation process. Since we focus on generating static graphs, we assume that the order of the generation trajectory does not matter, as long as the same graph is generated. This assumption implies the following Markov property over the conditional distribution, p(Gi | G1, ..., Gi−1) = p(Gi | Gi−1).
The key to a successful iterative graph generative model is a proper instantiation of the conditional distribution p(Gi | Gi−1). Existing approaches [29, 43, 45] often model p(Gi | Gi−1) as the distribution over the random addition of nodes or edges to Gi−1. While in theory this formulation allows the generation of any kind of graph, it cannot satisfy the hard partition constraint for bipartite graphs. In contrast, our proposed G2SAT has a simple generation process that is guaranteed to preserve the bipartite partition constraint, without the need for hand-crafted generation rules or post-processing procedures.
The G2SAT framework relies on node splitting and merging operations, which are defined as follows.
Definition 1. The node splitting operation, when applied to node v, removes some edges between v and its neighboring nodes, and then connects those edges to a new node u. The node merging operation, when performed over two nodes u and v, removes all the edges between v and its neighboring nodes, and then connects those edges to u. Formally, NodeSplit(u, G) returns a tuple (u, v, G′), and NodeMerge(u, v, G) returns a tuple (u, G′).

Note that according to this definition, a node merging operation can always be reversed by a node splitting operation. The core idea underlying G2SAT is then motivated by the following observation.
Observation 1. A bipartite graph can always be transformed into a set of trees by a sequence of node splitting operations over the nodes in one of the partitions.

The proof of this claim follows from the fact that the node splitting operation strictly reduces a node's degree. Therefore, repeatedly applying node splitting to all the nodes in a partition will ultimately reduce the degree of all those nodes to 1, producing a set of trees (Figure 1, Left). This observation implies that a bipartite graph can always be generated via a sequence of node merging operations. In G2SAT, we always merge clause nodes in the clause partition V_2^{Gi−1} for a given graph Gi−1. We then design the following instantiation of p(Gi | Gi−1):

p(Gi | Gi−1) = p(NodeMerge(u, v, Gi−1) | Gi−1) = Multinomial(h_u^T h_v / Z | ∀u, v ∈ V_2^{Gi−1})   (1)

where h_u and h_v are the embeddings for nodes u and v, and Z is the normalizing constant that ensures that the distribution Multinomial(·) is valid. We aim for embeddings h_u that capture the multi-hop neighborhood structure of a node u and that can be computed from a single trainable model.
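Definition 1 can be illustrated with a small sketch (ours, not the paper's implementation), representing the clause partition as a map from clause nodes to literal sets; it also shows that a merge is undone by the corresponding split.

```python
def node_merge(u, v, adj):
    """Merge clause node v into u: v's edges are re-attached to u.

    `adj` maps each clause node to the set of literal nodes it touches.
    Returns a new adjacency map in which u absorbs v and v disappears.
    (Sketch assumes u and v share no neighbors, as with valid proposals.)
    """
    adj = {k: set(vs) for k, vs in adj.items()}   # work on a copy
    adj[u] |= adj.pop(v)
    return adj

def node_split(u, v, moved, adj):
    """Split clause node u: the literals in `moved` are re-attached to a
    fresh clause node v. Inverse of node_merge when `moved` is exactly
    the neighbor set that v contributed."""
    adj = {k: set(vs) for k, vs in adj.items()}
    adj[u] -= moved
    adj[v] = set(moved)
    return adj

# Two clauses: {x1, x2} and {¬x3}
G = {"c1": {1, 2}, "c2": {-3}}
merged = node_merge("c1", "c2", G)              # c1 now touches 1, 2, -3
restored = node_split("c1", "c2", {-3}, merged)  # back to the original G
```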
Further, this model needs to be capable of generalizing across different generation stages and different graphs. Therefore, we use the GraphSAGE framework [18] to compute node embeddings, which is a variant of GCNs that has been shown to have strong inductive learning capabilities across different graphs. Specifically, the l-th layer of GraphSAGE can be written as

n_u^(l) = AGG(RELU(Q^(l) h_v^(l) + q^(l)) | v ∈ N(u))
h_u^(l+1) = RELU(W^(l) CONCAT(h_u^(l), n_u^(l)))

where h_u^(l) is the l-th layer node embedding for node u, N(u) is the local neighborhood of u, AGG(·) is an aggregation function such as mean pooling, and Q^(l), q^(l), W^(l) are trainable parameters. The input node features are length-3 one-hot vectors, which are used to represent the three node types in LCGs, i.e., positive literals, negative literals and clauses. In addition, since each literal and its negation are closely related, we add an additional message passing path between them.

4.2 Scalable G2SAT with Two-phase Generation Scheme

LCGs can easily have tens of thousands of nodes; thus, there are millions of candidate node pairs that could be merged. This makes the computation of the normalizing constant Z in Equation 1 infeasible. To avoid this issue, we design a two-phase scheme to instantiate Equation 1, which includes a node proposal phase and a node merging phase (Figure 1, right). Intuitively, the idea is to begin with a fixed oracle that proposes random candidate node pairs. Then, a model only needs to decide if the proposed node pair should be merged or not, which is an easier learning task compared to selecting from among millions of candidate options.
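The GraphSAGE layer above can be sketched in a few lines of pure Python; the toy dimensions and hand-picked weights here are purely illustrative (the trained model uses learned parameters, hidden size 32, and three such layers).

```python
def relu(x):
    return [max(0.0, t) for t in x]

def matvec(M, x):
    return [sum(m * t for m, t in zip(row, x)) for row in M]

def sage_layer(h, neigh, Q, q, W):
    """One GraphSAGE layer with mean aggregation (pure-Python sketch):
    n_u = mean_{v in N(u)} RELU(Q h_v + q);  h_u' = RELU(W [h_u ; n_u])."""
    out = {}
    for u, nbrs in neigh.items():
        msgs = [relu([a + b for a, b in zip(matvec(Q, h[v]), q)])
                for v in nbrs]
        n_u = [sum(col) / len(msgs) for col in zip(*msgs)]  # mean pooling
        out[u] = relu(matvec(W, h[u] + n_u))                # concat then W
    return out

# Toy 2-node graph with identity-like weights, embedding size 2.
Q, q = [[1, 0], [0, 1]], [0, 0]
W = [[1, 0, 1, 0], [0, 1, 0, 1]]   # sums the self and neighbor halves
h = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
neigh = {"a": ["b"], "b": ["a"]}
out = sage_layer(h, neigh, Q, q, W)
```

With these weights each node's new embedding is its own features plus the mean of its neighbors' rectified features, so both nodes end up at [1.0, 1.0].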
Instead of directly learning and sampling from p(Gi | Gi−1), we introduce additional random variables u and v to represent random nodes, and then learn the joint distribution p(Gi, u, v | Gi−1) = p(u, v | Gi−1) p(Gi | Gi−1, u, v). Here, p(u, v | Gi−1) corresponds to the node proposal phase and p(Gi | Gi−1, u, v) models the node merging phase.
In theory, p(u, v | Gi−1) can be any distribution as long as it has non-empty support. Since LCGs are inherently static graphs, there is little prior knowledge or additional information on how this iterative generation process should proceed. Therefore, we implement the node proposal phase such that a random node pair is sampled from all candidate clause nodes uniformly at random. Then, in the node merging phase, instead of computing the dot product between all possible node pairs, the model only needs to compute the dot product between the sampled node pairs. Specifically, we have

p(Gi, u, v | Gi−1) = p(u, v | Gi−1) p(NodeMerge(u, v, Gi−1) | Gi−1, u, v)
                  = Uniform({(u, v) | ∀u, v ∈ V_2^{Gi−1}}) Bernoulli(σ(h_u^T h_v) | u, v)   (2)

where Uniform is the discrete uniform distribution and σ(·) is the sigmoid function.

4.3 G2SAT at Training Time

The two-phase generation scheme described in Section 4.2 transforms the bipartite graph generation task into a binary classification task. We train the classifier to minimize the following binary cross entropy loss:

L = −E_{u,v∼p_pos}[log(σ(h_u^T h_v))] − E_{u,v∼p_neg}[log(1 − σ(h_u^T h_v))]   (3)

where p_pos and p_neg are the distributions over positive and negative training examples (i.e., node pairs). We say a node pair is a positive example if the node pair should be merged according to the training set. To acquire the necessary training data from input bipartite graphs, we develop a procedure that is described in Algorithm 1.
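The decomposition loop behind this procedure can be condensed into a sketch (an illustrative re-implementation, not the released code): clause nodes of degree greater than one are split until the graph is a forest, and each split yields one positive pair (u+, v+) and one random negative clause v−.

```python
import random

def make_training_pairs(adj, rng):
    """Decompose an LCG into trees by repeated node splitting, collecting
    (u+, v+, v-) training triples along the way.

    `adj` maps clause nodes to sets of literal nodes. Exactly
    |E| - |V2| splits are performed, after which every clause node
    touches a single literal (a forest of trees).
    """
    adj = {k: set(vs) for k, vs in adj.items()}   # work on a copy
    data, fresh = [], 0
    while any(len(vs) > 1 for vs in adj.values()):
        u = rng.choice([k for k, vs in adj.items() if len(vs) > 1])
        k = rng.randrange(1, len(adj[u]))          # how many edges move
        moved = set(rng.sample(sorted(adj[u]), k))
        v = ("new", fresh); fresh += 1             # the split-off clause
        adj[u] -= moved
        adj[v] = moved
        others = [w for w in adj if w not in (u, v)]
        neg = rng.choice(others) if others else None
        data.append((u, v, neg))                   # (u+, v+, v-)
    return adj, data

rng = random.Random(7)
forest, pairs = make_training_pairs({"c1": {1, 2, -3}, "c2": {-1, -2}}, rng)
```

Here |E| = 5 and |V2| = 2, so exactly three splits occur regardless of the random choices, mirroring the step count n = |E^G| − |V_2^G| used in the text.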
Given an input bipartite graph G, we apply the node splitting operation to the graph for n = |E^G| − |V_2^G| steps, which guarantees that the input graph will be decomposed into a set of trees. Within each step, a random node s in partition V_2^{Gi} with degree greater than 1 is chosen for splitting, and a random subset of edges that connect to s is chosen to connect to a new node. After the split operation, we obtain an updated graph Gi−1, as well as the split nodes u+ and v+, which are treated as a positive training example. Then, another node v−, that is distinct from u+ and v+, is randomly chosen from the nodes in V_2^{Gi−1}, and (u+, v−) are viewed as a negative training example. The data tuple (u+, v+, v−, Gi−1) is saved in the dataset D. We also save the step count n and the graph G0 as "graph templates", which are later used to initialize G2SAT at inference time. The procedure is repeated r times until the desired number of data points are gathered. Finally, G2SAT is trained with the dataset D to minimize the objective listed in Equation 3.

Algorithm 1 G2SAT at training time
  Input: Bipartite graphs G, number of repetitions r
  Output: Graph templates T
  D ← ∅, T ← ∅
  for k = 1, ..., r do
    G ∼ G, n ← |E^G| − |V_2^G|, Gn ← G
    for i = n, ..., 1 do
      s ∼ {u | u ∈ V_2^{Gi}, Degree(u) > 1}
      (u+, v+, Gi−1) ← NodeSplit(s, Gi)
      v− ∼ V_2^{Gi−1} \ {u+, v+}
      D ← D ∪ {(u+, v+, v−, Gi−1)}
    T ← T ∪ {(G0, n)}
  Train G2SAT with D to minimize Eq. 3

4.4 G2SAT at Inference Time

Algorithm 2 G2SAT at inference time
  Input: Graph templates T, number of output graphs r, number of proposed node pairs o
  Output: Generated bipartite graphs G
  for k = 1, ..., r do
    (G0, n) ∼ T
    for i = 0, ..., n − 1 do
      P ← ∅
      for j = 1, ..., o do
        u ∼ V_2^{Gi}, v ∼ {s | s ∈ V_2^{Gi}, (s, x) ∉ E^{Gi}, (s, ¬x) ∉ E^{Gi}, ∀x ∈ N(u)}
        P ← P ∪ {(u, v)}
      (u+, v+) ← argmax{h_u^T h_v | (u, v) ∈ P, h_u = GCN(u), h_v = GCN(v)}
      Gi+1 ← NodeMerge(u+, v+, Gi)
    G ← G ∪ {Gn}

A trained G2SAT model can be used to generate graphs. We summarize the procedure in Algorithm 2. At graph generation time, we first initialize G2SAT with a graph template sampled from T gathered at training time, which specifies the initial graph G0 and the number of generation steps n. Note that G2SAT can take bipartite graphs with arbitrary size as input and iterate for a variable number of steps. The reason we specify the initial state and the number of steps is to control the behavior of G2SAT and simplify the experiment setting.
At each generation step, we use the two-phase generation scheme described in Section 4.2. In the node proposal phase, we additionally make sure that the sampled node pair does not correspond to a vacuous clause, i.e., if u, v is a valid node pair, then ∀x ∈ N(u), we ensure that (v, x) ∉ E^{Gi} and (v, ¬x) ∉ E^{Gi}. We parallelize the algorithm by sampling o random node pair proposals at once and feeding them to the node merging phase. In the node merging phase, although following Equation 2 would allow us to sample from the true distribution, we find in practice that it usually requires sampling a large number of candidate node pairs until a positive node pair is predicted by the GCN model. Therefore, we use a greedy algorithm that selects the most likely node pair to be merged among the o proposed node pairs and merges those nodes. Admittedly, this biases the generator away from the true data distribution. However, our experiments reveal that the synthesized graphs are nonetheless reasonable.
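The greedy proposal-and-merge step can be sketched as follows (our own minimal version; `emb` stands in for embeddings a trained GCN would produce, and the vacuous-clause filter is omitted for brevity).

```python
import random

def greedy_merge_step(adj, emb, o, rng):
    """One inference step: propose `o` random clause pairs, score each by
    the dot product of their embeddings, and merge the best-scoring pair.

    `adj` maps clause nodes to literal sets; `emb` maps nodes to vectors.
    """
    clauses = sorted(adj)
    proposals = [tuple(rng.sample(clauses, 2)) for _ in range(o)]
    score = lambda u, v: sum(a * b for a, b in zip(emb[u], emb[v]))
    u, v = max(proposals, key=lambda p: score(*p))   # greedy argmax
    adj = {k: set(vs) for k, vs in adj.items()}
    adj[u] |= adj.pop(v)                             # NodeMerge(u, v, G)
    return adj

rng = random.Random(1)
adj = {"c1": {1}, "c2": {2}, "c3": {-3}}
emb = {"c1": [1.0], "c2": [1.0], "c3": [-1.0]}
out = greedy_merge_step(adj, emb, o=4, rng=rng)
```

Each step removes exactly one clause node and preserves all edges, so after n steps a template with |V2| + n clauses yields |V2| clauses, matching the target formula size.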
After n steps, the generated graph Gn is saved as an output.\n\n5 Experiments\n\n5.1 Dataset and Evaluation\n\nDataset. We use 10 small real-world SAT formulas from the SATLIB benchmark library [21] and\npast SAT competitions.1 The two data sources contain SAT formulas generated from a variety\nof application domains, such as bounded model checking, planning, and cryptography. We use\nthe standard SatElite preprocessor [11] to remove duplicate clauses and perform polynomial-time\nsimpli\ufb01cations (for example, unit propagation). The preprocessed formulas contain 82 to 1122\nvariables and 327 to 4555 clauses.\nWe evaluate if the generated SAT formulas preserve the properties of the input training SAT formulas,\nas measured by graph statistics and SAT solver performance. We then investigate whether the\ngenerated SAT formulas can indeed help in designing better domain-speci\ufb01c SAT solvers.\nGraph statistics. We focus on the graph statistics studied previously in the SAT literature [1, 34]. In\nparticular, we consider the VIG, VCG and LCG representations of SAT formulas. We measure the\nmodularity [33] (in VIG, VCG, LCG), average clustering coef\ufb01cient [32] (in VIG) and the scale-free\nstructure parameters as measured by variable \u03b1v and clause \u03b1c [1, 8] (in VCG).\nSAT solver performance. We report the relative SAT solver performance, i.e., given k SAT solvers,\nwe rank them based on their performance over the SAT formulas used for training and the generated\nSAT formulas, and evaluate how well the two rankings align. Previous research has shown that\nSAT instances can be made hard using various post-processing approaches [13, 37, 42]. Therefore,\nthe absolute hardness of the formulas is not a good indicator of how realistic the formulas are. On\nthe other hand, it is not trivial for a post-processing procedure to precisely manipulate the relative\nperformance of a set of SAT solvers. 
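One of the statistics above, the average clustering coefficient of the VIG, can be computed directly from a clause list; this is a pure-Python sketch on a toy formula, not the benchmark evaluation pipeline.

```python
def vig(clauses):
    """Variable-incidence graph: variables are nodes; two variables are
    adjacent if they co-occur in some clause (DIMACS-style literals)."""
    nodes = {abs(l) for c in clauses for l in c}
    adj = {v: set() for v in nodes}
    for c in clauses:
        vs = {abs(l) for l in c}
        for a in vs:
            adj[a] |= vs - {a}
    return adj

def avg_clustering(adj):
    """Average local clustering coefficient: for each node, the fraction
    of its neighbor pairs that are themselves connected."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue   # nodes with < 2 neighbors contribute 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Variables x1..x3 form a triangle of co-occurrences; x4 is pendant.
cc = avg_clustering(vig([(1, 2), (2, 3), (1, -3), (3, 4)]))
```

Here x1, x2 each have clustering 1, x3 has 1/3, and x4 has 0, giving an average of 7/12.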
Therefore, we report the relative solver performance for a fairer comparison. We took the three best performing solvers from both the application track and the random track of the 2018 SAT competition [20], which are denoted as I1, I2, I3, and R1, R2, R3 respectively.2 Our experiments confirm that solvers that are tailored to real-world SAT formulas (I1, I2, I3) indeed outperform solvers that focus on random SAT formulas (R1, R2, R3) over the training formulas. Therefore, we measure if, on the generated formulas, the solvers I similarly outperform the solvers R, as measured by ranking accuracy. All the run time performances are measured by wall clock time under carefully controlled experimental settings.
Application: Developing better SAT solvers. Finally, we consider the scenario where people wish to use the synthetic formulas for developing better SAT solvers. Specifically, we use either the 10 training SAT formulas or the generated SAT formulas to guide the hyperparameter selection of a popular SAT solver called Glucose [2]. We conduct a grid search over two of its hyperparameters: the variable decay vd, which influences the ordering of the variables in the search tree, and the clause decay cd, which influences which learned clauses are to be removed [12]. We sweep over the set {0.75, 0.85, 0.95} for vd, and the set {0.7, 0.8, 0.9, 0.99, 0.999} for cd. We measure the run time of the SAT solvers using the optimal hyperparameters found by grid search, over 22 real-world SAT formulas unobserved by any of the models. Since the number of training SAT formulas is limited, we expect that using the abundant generated SAT formulas will lead to better hyperparameter choices.

1http://www.satcompetition.org/
2The solvers are, in order, MapleLCMDistChronoBT, Maple_LCM_Scavel_fix2, Maple_CM, Sparrow2Riss-2018, gluHack, glucose-3.0_PADC_10_NoDRUP [20].

Table 1: Graph statistics of generated formulas (mean ± std. 
(relative error to training formulas)).

Method      VIG Clustering     VIG Modularity     VCG Variable αv    VCG Clause αc      VCG Modularity     LCG Modularity
Training    0.50±0.07          0.58±0.09          3.57±1.08          4.53±1.09          0.74±0.06          0.63±0.05
CA          0.33±0.08 (34%)    0.48±0.10 (17%)    6.30±1.53 (76%)    N/A                0.65±0.08 (12%)    0.53±0.05 (16%)
PS(T=0)     0.82±0.04 (64%)    0.72±0.13 (24%)    3.25±0.89 (9%)     4.70±1.59 (4%)     0.86±0.05 (16%)    0.64±0.05 (2%)
PS(T=1.5)   0.30±0.10 (40%)    0.14±0.03 (76%)    4.19±1.10 (17%)    6.86±1.65 (51%)    0.40±0.05 (46%)    0.41±0.05 (35%)
G2SAT       0.41±0.09 (18%)    0.54±0.11 (7%)     3.57±1.08 (0%)     4.79±2.80 (6%)     0.68±0.07 (8%)     0.67±0.03 (6%)

Figure 2: Scatter plots of distributions of selected properties of the generated formulas.

5.2 Models

We compare G2SAT with two state-of-the-art generators for real-world SAT formulas. Both generators are prescribed models designed to match a specific graph property. To properly generate formulas using these baselines, we set their arguments to match the corresponding statistics in the training set. We generate 200 formulas each using G2SAT and the baseline models.

G2SAT. We implement G2SAT with a 3-layer GraphSAGE model using mean pooling and ReLU activation [18], with hidden and output embedding sizes of 32. We use the Adam optimizer [25] with a learning rate of 0.001 and train the model until the validation accuracy plateaus.

Community Attachment (CA). The CA model generates formulas to fit a desired VIG modularity Q [15]. The output of the algorithm is a SAT formula with n variables and m clauses, each of length k, such that the optimal modularity for any c-partition of the VIG of the formula is approximately Q.

Popularity-Similarity (PS). The PS model generates formulas to fit desired αv and αc [16].
The model accepts a temperature parameter T that trades off the modularity and the (αv, αc) measures of the generated formulas. We run PS with two temperature settings, T = 0 and T = 1.5.

5.3 Results

Graph statistics. The graph statistics of the generated SAT formulas are shown in Table 1. We observe that G2SAT is the only model that closely fits all the graph properties that we measure, whereas each baseline model fits only some of the statistics and fails on the others. Surprisingly, G2SAT fits the modularity even better than CA, which is tailored for fitting that statistic. We compute the relative error of the generated graph statistics with respect to the ground-truth statistics and find that G2SAT reduces the relative error by 24% on average compared with the baseline methods. To further illustrate this performance gain, we plot the distribution of the selected properties over the generated formulas in Figure 2, where each dot corresponds to a graph. We see that G2SAT nicely interpolates and extrapolates on all the statistics of the input graphs, while the baselines only do well on some of the statistics.

SAT solver performance. As seen in Table 2, the rankings of solver performance over the formulas generated by G2SAT and CA align perfectly with their ranking over the training graphs. Both models correctly generate formulas on which application-focused solvers (I1, I2, I3) outperform random-focused solvers (R1, R2, R3). By contrast, the PS models do poorly at this task.

Application: Developing better SAT solvers. The run time gain from tuning solvers on synthetic formulas, compared to tuning on a small set of real-world formulas, is shown in Table 3.
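The grid search described in Section 5.1 is straightforward to script. The sketch below is a minimal illustration: `total_runtime` is a hypothetical stand-in for invoking the solver (e.g., Glucose) on each benchmark formula and summing wall-clock times, not the paper's actual tuning harness; only the two grids are taken from the paper.

```python
import itertools

# Hypothetical surrogate for "total solver run time over the benchmark
# formulas with variable decay vd and clause decay cd". In practice this
# would run the SAT solver on each formula and sum wall-clock times.
# This toy surface has its minimum at (0.95, 0.99).
def total_runtime(vd, cd):
    return (vd - 0.95) ** 2 + (cd - 0.99) ** 2

# The grids swept in the paper's experiment.
vd_grid = [0.75, 0.85, 0.95]
cd_grid = [0.7, 0.8, 0.9, 0.99, 0.999]

# Exhaustively evaluate all 15 (vd, cd) pairs and keep the fastest one.
best_vd, best_cd = min(itertools.product(vd_grid, cd_grid),
                       key=lambda pair: total_runtime(*pair))
print(best_vd, best_cd)  # 0.95 0.99 for this toy surrogate
```

With real measurements, tuning on the training formulas versus on the generated formulas simply swaps the benchmark set behind `total_runtime`; the selected pair is then evaluated on held-out formulas, as in Table 3.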
Table 2: Relative SAT solver performance on training as well as synthetic SAT formulas.

Method      Solver ranking             Accuracy
Training    I2, I3, I1, R2, R3, R1     100%
CA          I2, I3, I1, R2, R3, R1     100%
PS(T=0)     R3, I3, R2, I2, I1, R1     33%
PS(T=1.5)   R3, R2, I3, I1, I2, R1     33%
G2SAT       I1, I2, I3, R2, R3, R1     100%

Table 3: Performance gain when using generated SAT formulas to tune SAT solvers.

Method      Best parameters (vd, cd)   Runtime (s)   Gain
Training    (0.95, 0.9)                2679          N/A
CA          (0.75, 0.99)               2617          2.31%
PS(T=0)     (0.75, 0.999)              2668          0.41%
PS(T=1.5)   (0.95, 0.9)                2677          0.07%
G2SAT       (0.95, 0.99)               2190          18.25%

While all the generators are able to improve the SAT solver's performance by suggesting different hyperparameter configurations, G2SAT is the only method that finds one that results in a large performance gain (18% faster run time) on unobserved SAT formulas. Although this experiment is limited in scale, the promising results indicate that G2SAT could open up opportunities for developing better SAT solvers, even in application domains where benchmarks are scarce.

5.4 Analysis of Results

Scalability of G2SAT. While existing deep graph generative models can only generate graphs with up to about 1,000 nodes [4, 17, 45], the novel design of the G2SAT framework enables the generation of graphs that are an order of magnitude larger. The largest graph we have generated has 39,578 nodes and 102,927 edges, which only took 489 seconds (data-processing time excluded) on a single GPU. Figure 3 shows the time-scaling behavior for both training (from 100k batches of node pairs) and formula generation. We found that G2SAT scales roughly linearly for both tasks with respect to the number of clauses.

Figure 3: G2SAT run time.

Extrapolation ability of G2SAT.
To determine whether a trained model can learn to generate SAT instances different from those in the training set, we design an extrapolation experiment as follows. We train on 10 small formulas with 327 to 4,555 clauses, while forcing G2SAT to generate large formulas with 13,028 to 27,360 clauses. We found that G2SAT can generate large graphs whose characteristics are similar to those of the small training graphs, which shows that G2SAT has learned non-trivial properties of real-world SAT problems and can thus extrapolate beyond the training set. Specifically, the VCG modularity of the large formulas generated by G2SAT is 0.81 ± 0.03, while the modularity of the small formulas used to train G2SAT is 0.74 ± 0.06.

Ablation study. Here we demonstrate that the expressive power of the GCN model significantly affects the generated formulas. Figure 4 shows the effect of the number of layers in the GCN model on the modularity of the generated formulas. As the number of layers increases, the average modularity of the generated formulas becomes closer to that of the training formulas, which indicates that machine learning contributes significantly to the efficacy of G2SAT. The other graph properties that we measured generally follow the same pattern as well.

Figure 4: Ablation study.

6 Conclusions

In this paper, we introduced G2SAT, the first deep generative model for SAT formulas. In contrast to existing SAT generators, G2SAT does not rely on hand-crafted algorithms and is able to generate diverse SAT formulas similar to input formulas, as measured by many graph statistics and SAT solver performance. While future work is called for to generate larger and harder formulas, we believe our framework shows great potential for understanding and improving SAT solvers.

Acknowledgements

Jure Leskovec is a Chan Zuckerberg Biohub investigator.
We gratefully acknowledge the support of DARPA under No. FA865018C7880 (ASED) and MSC; NIH under No. U54EB020405 (Mobilize); ARO under No. 38796-Z8424103 (MURI); IARPA under No. 2017-17071900005 (HFC); NSF under No. OAC-1835598 (CINES) and HDR; Stanford Data Science Initiative, Chan Zuckerberg Biohub, Enlight Foundation, JD.com, Amazon, Boeing, Docomo, Huawei, Hitachi, Observe, Siemens, UST Global. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

References

[1] C. Ansótegui, M. L. Bonet, and J. Levy. On the structure of industrial SAT instances. In Principles and Practice of Constraint Programming, pages 127-141. Springer Berlin Heidelberg, 2009.

[2] G. Audemard and L. Simon. Predicting learnt clauses quality in modern SAT solvers. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 399-404, 2009.

[3] A. Biere, M. Heule, H. van Maaren, and T. Walsh. Handbook of Satisfiability: Volume 185 Frontiers in Artificial Intelligence and Applications. IOS Press, 2009.

[4] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann. NetGAN: Generating graphs via random walks. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[5] Y. Boufkhad, O. Dubois, Y. Interian, and B. Selman. Regular random k-SAT: Properties of balanced formulas. J. Autom. Reason., 35:181-200, 2005.

[6] A. Braunstein, M. Mézard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. Random Struct. Algorithms, 27:201-226, 2005.

[7] E. Clarke, A. Biere, R. Raimi, and Y.
Zhu. Bounded model checking using satisfiability solving. Form. Methods Syst. Des., 19:7-34, 2001.

[8] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51:661-703, 2009.

[9] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, pages 151-158. ACM, 1971.

[10] N. De Cao and T. Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

[11] N. Eén and A. Biere. Effective preprocessing in SAT through variable and clause elimination. In International Conference on Theory and Applications of Satisfiability Testing, pages 61-75. Springer, 2005.

[12] N. Eén and N. Sörensson. An extensible SAT-solver. In Theory and Applications of Satisfiability Testing, pages 502-518. Springer Berlin Heidelberg, 2004.

[13] G. Escamocher, B. O'Sullivan, and S. D. Prestwich. Generating difficult SAT instances by preventing triangles. CoRR, abs/1903.03592, 2019.

[14] H. Ganzinger, G. Hagen, R. Nieuwenhuis, A. Oliveras, and C. Tinelli. DPLL(T): Fast decision procedures. In R. Alur and D. A. Peled, editors, Computer Aided Verification, pages 175-188. Springer Berlin Heidelberg, 2004.

[15] J. Giráldez-Cru and J. Levy. A modularity-based random SAT instances generator. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 1952-1958. AAAI Press, 2015.

[16] J. Giráldez-Cru and J. Levy. Locality in random SAT instances. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 638-644, 2017.

[17] A. Grover, A. Zweig, and S. Ermon. Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459, 2018.

[18] W. Hamilton, Z. Ying, and J.
Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.

[19] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.

[20] M. J. Heule, M. J. Järvisalo, M. Suda, et al. Proceedings of SAT Competition 2018. 2018.

[21] H. H. Hoos and T. Stützle. SATLIB: An online resource for research on SAT. SAT, 2000:283-292, 2000.

[22] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. International Conference on Machine Learning, 2018.

[23] G. Katsirelos and L. Simon. Eigenvector centrality in industrial SAT instances. In Principles and Practice of Constraint Programming, pages 348-356. Springer Berlin Heidelberg, 2012.

[24] H. Kautz and B. Selman. Planning as satisfiability. In Proceedings of the 10th European Conference on Artificial Intelligence, pages 359-363. John Wiley & Sons, Inc., 1992.

[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

[27] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.

[28] M. Lauria, J. Elffers, J. Nordström, and M. Vinyals. CNFgen: A generator of crafted benchmarks. In S. Gaspers and T. Walsh, editors, Theory and Applications of Satisfiability Testing, pages 464-473. Springer International Publishing, 2017.

[29] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

[30] J. Marques-Silva, I. Lynce, and S. Malik.
Conflict-driven clause learning SAT solvers. In Handbook of Satisfiability, 2009.

[31] N. Mull, D. J. Fremont, and S. A. Seshia. On the hardness of SAT with community structure. In N. Creignou and D. Le Berre, editors, Theory and Applications of Satisfiability Testing, pages 141-159. Springer International Publishing, 2016.

[32] M. E. Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2), 2001.

[33] M. E. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577-8582, 2006.

[34] Z. Newsham, V. Ganesh, S. Fischmeister, G. Audemard, and L. Simon. Impact of community structure on SAT solver performance. In Theory and Applications of Satisfiability Testing, pages 252-268. Springer International Publishing, 2014.

[35] B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, 26, 1999.

[36] B. Selman, H. Kautz, and D. McAllester. Ten challenges in propositional reasoning and search. In Proceedings of the 15th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'97, pages 50-54, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[37] B. Selman, D. G. Mitchell, and H. J. Levesque. Generating hard satisfiability problems. Artif. Intell., 81:17-29, 1996.

[38] D. Selsam, M. Lamm, B. Bünz, P. Liang, L. de Moura, and D. L. Dill. Learning a SAT solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.

[39] G. S. Tseitin. On the complexity of derivation in propositional calculus. Structures in Constructive Mathematics and Mathematical Logic, pages 115-125, 1968.

[40] D. Valsesia, G. Fracastoro, and E. Magli.
Learning localized generative models for 3D point clouds via graph convolution. International Conference on Learning Representations, 2019.

[41] H. Wu and R. Ramanujan. Learning to generate industrial SAT instances. In Proceedings of the 12th International Symposium on Combinatorial Search, pages 17-29. AAAI Press, 2019.

[42] K. Xu, F. Boussemart, F. Hemery, and C. Lecoutre. A simple model to generate hard satisfiable instances. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 2005.

[43] J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems, 2018.

[44] J. You, R. Ying, and J. Leskovec. Position-aware graph neural networks. International Conference on Machine Learning, 2019.

[45] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, 2018.