{"title": "Synchronization of neural networks by mutual learning and its application to cryptography", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 696, "abstract": null, "full_text": " Synchronization of neural networks by mutual\n learning and its application to cryptography\n\n\n Einat Klein Rachel Mislovaty Ido Kanter\n Department of Physics Department of Physics Department of Physics\n Bar-Ilan University Bar-Ilan University Bar-Ilan University\n Ramat-Gan, 52900 Israel Ramat-Gan, 52900 Israel Ramat-Gan, 52900 Israel\n\n Andreas Ruttor Wolfgang Kinzel\n Institut fur Theoretische Physik, Institut fur Theoretische Physik,\n Universitat Wurzbur Universitat Wurzbur\n Am Hubland 97074 Wurzburg, Germany Am Hubland 97074 Wurzburg, Germany\n\n\n Abstract\n\n Two neural networks that are trained on their mutual output synchronize\n to an identical time dependant weight vector. This novel phenomenon\n can be used for creation of a secure cryptographic secret-key using a\n public channel. Several models for this cryptographic system have been\n suggested, and have been tested for their security under different sophis-\n ticated attack strategies. The most promising models are networks that\n involve chaos synchronization. The synchronization process of mutual\n learning is described analytically using statistical physics methods.\n\n1 Introduction\n\nNeural networks learn from examples. This concept has extensively been investigated using\nmodels and methods of statistical mechanics [1, 2]. A \"teacher\" network is presenting\ninput/output pairs of high dimensional data, and a \"student\" network is being trained on\nthese data. Training means, that synaptic weights adapt by simple rules to the i/o pairs.\nWhen the networks -- teacher as well as student -- have N weights, the training process\nneeds of the order of N examples to obtain generalization abilities. 
This means that after
the training phase the student has achieved some overlap with the teacher: their weight vectors
are correlated. As a consequence, the student can classify an input pattern which does not
belong to the training set. The average classification error decreases with the number of
training examples.
Training can be performed in two different modes: batch and on-line training. In the first
case all examples are stored and used to minimize the total training error. In the second
case only one new example is used per time step and then discarded. Therefore on-line
training may be considered as a dynamic process: at each time step the teacher creates
a new example which the student uses to change its weights by a tiny amount. In fact,
for random input vectors and in the limit N → ∞, learning and generalization can be
described by ordinary differential equations for a few order parameters [3].

Figure 1: Two perceptrons receive an identical input x and learn their mutual output bits σ.

On-line training is a dynamic process where the examples are generated by a static network
-- the teacher. The student tries to move towards the teacher. However, the student network
itself can generate examples on which it is trained. What happens if two neural networks
learn from each other? In the following section an analytic solution is presented [6], which
shows a novel phenomenon: synchronization by mutual learning. The biological conse-
quences of this phenomenon have not been explored yet, but we found an interesting application
in cryptography: the secure generation of a secret key over a public channel.
In the field of cryptography, one is interested in methods to transmit secret messages be-
tween two partners A and B. 
An attacker E who is able to listen to the communication
should not be able to recover the secret message.
In 1976, Diffie and Hellman found a method based on number theory for creating a secret
key over a public channel accessible to any attacker [7]. Here we show how neural networks
can produce a common secret key by exchanging bits over a public channel and by learning
from each other.

2 Mutual Learning

We start by presenting the process of mutual learning for a simple network: two percep-
trons receive a common random input vector x and change their weights w according to
their mutual bit σ, as sketched in Fig. 1. The output bit σ of a single perceptron is given by
the equation

 σ = sign(w · x)   (1)

x is an N-dimensional input vector with components which are drawn from a Gaussian with
mean 0 and variance 1. w is an N-dimensional weight vector with continuous components
which are normalized,

 w · w = 1   (2)

The initial state is a random choice of the components w_i^{A/B}, i = 1, ..., N for the two weight
vectors w^A and w^B. At each training step a common random input vector is presented to
the two networks, which generate two output bits σ^A and σ^B according to (1). Now the
weight vectors are updated by the perceptron learning rule [3]:

 w^A(t+1) = w^A(t) + (η/N) x σ^B Θ(−σ^A σ^B)
 w^B(t+1) = w^B(t) + (η/N) x σ^A Θ(−σ^A σ^B)   (3)

Θ(x) is the step function. Hence, a training step with learning rate η is performed only if
the two perceptrons disagree. After each step (3), the two weight vectors have to be
normalized. In the limit N → ∞, the overlap

 R(t) = w^A(t) · w^B(t)   (4)

has been calculated analytically [6].

Figure 2: Final overlap R between two perceptrons as a function of the learning rate η. Above
a critical rate η_c the time-dependent networks are synchronized. From Ref. [6].
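The update rule (3) is also straightforward to simulate directly. The following sketch (assuming numpy; the sizes are illustrative and not taken from the paper: N = 100 weights and a learning rate η = 3, which lies above the critical rate discussed below) iterates the rule and prints the final overlap:

```python
import numpy as np

# Direct simulation of the mutual-learning rule (3) for two continuous perceptrons.
# Illustrative parameters (not from the paper): N = 100, eta = 3.0.
rng = np.random.default_rng(1)
N, eta, steps = 100, 3.0, 30000

def normalize(w):
    return w / np.linalg.norm(w)          # enforce w.w = 1, Eq. (2)

wA = normalize(rng.normal(size=N))
wB = normalize(rng.normal(size=N))

for _ in range(steps):
    x = rng.normal(size=N)                # Gaussian input, mean 0, variance 1
    sA = 1.0 if wA @ x >= 0 else -1.0     # output bits, Eq. (1)
    sB = 1.0 if wB @ x >= 0 else -1.0
    if sA != sB:                          # Theta(-sA*sB): update only on disagreement
        wA = normalize(wA + (eta / N) * x * sB)
        wB = normalize(wB + (eta / N) * x * sA)

R = float(wA @ wB)                        # overlap, Eq. (4)
print(f"final overlap R = {R:.3f}")
```

For a learning rate this large the run should end close to R = −1, the state of complete disagreement described in the analysis that follows.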
The number of training steps t is scaled as α = t/N,
and R(α) follows the equation

 dR/dα = (R + 1) [ √(2/π) η (1 − R) − η² θ/π ]   (5)

where θ is the angle between the two weight vectors w^A and w^B, i.e. R = cos θ. This
equation has the fixed points R = 1, R = −1, and

 η = √(2π) (1 − cos θ)/θ   (6)

Fig. 2 shows the attractive fixed point of (5) as a function of the learning rate η. For small
values of η the two networks relax to a state of mutual agreement, R → 1 for η → 0.
With increasing learning rate η the angle between the two weight vectors increases up to

 θ = 133°  for  η_c = 1.816   (7)

Above the critical rate η_c the networks relax to a state of complete disagreement, θ =
180°, R = −1. The two weight vectors are antiparallel to each other, w^A = −w^B.
As a consequence, the analytic solution shows, well supported by numerical simulations
for N = 100, that two neural networks can synchronize with each other by mutual learning.
Both networks are trained on the examples generated by their partner and finally obtain an
antiparallel alignment. Even after synchronization the networks keep moving; the motion
is a kind of random walk on an N-dimensional hypersphere, producing a rather complex
sequence of output bits σ^A = −σ^B [8].

3 Random walk in weight space

We want to apply synchronization of neural networks to cryptography. In the previous sec-
tion we have seen that the weight vectors of two perceptrons learning from each other can
synchronize. The new idea is to use the common weights w^A = −w^B as a key for en-
cryption [11]. But two issues still have to be solved: (i) Can an external observer, recording
the exchange of bits, calculate the final w^A(t)? The essence of using mutual learning as
an encryption tool is the fact that while the parties perform a mutual process in which they
react to one another, the attacker performs a learning process in which the 'teacher'
does not react to him. (ii) Does this phenomenon exist for discrete weights? 
Since
communication is usually based on bit sequences, this is an important practical issue. Both
issues are discussed below.
Synchronization occurs for normalized weights; unnormalized ones do not synchronize [6].
Therefore, for discrete weights, we introduce a restriction in the space of possible vectors
and limit the components w_i^{A/B} to 2L + 1 different values,

 w_i^{A/B} ∈ {−L, −L+1, ..., L−1, L}   (8)

In order to obtain synchronization to a parallel instead of an antiparallel state, w^A = w^B,
we modify the learning rule (3) to:

 w^A(t+1) = w^A(t) − x σ^A Θ(σ^A σ^B),  w^B(t+1) = w^B(t) − x σ^B Θ(σ^A σ^B)   (9)

Now the components of the random input vector x are binary, x_i ∈ {+1, −1}. If the two
networks produce an identical output bit, σ^A = σ^B, then their weights move one step in
the direction of −x_i σ^A. But the weights have to remain in the interval (8); therefore if any
component moves out of this interval, |w_i| = L + 1, it is set back to the boundary, w_i = ±L.
Each component of the weight vectors performs a kind of random walk with reflecting
boundaries. Two corresponding components w_i^A and w_i^B receive the same random number
±1. After each hit at the boundary the distance |w_i^A − w_i^B| is reduced until it has reached
zero. For two perceptrons with an N-dimensional weight space we have two ensembles of
N random walks on the interval {−L, ..., L}. We expect that after some characteristic time
scale τ = O(L²) the probability of two random walks being in different states decreases as
P(t) ∝ P(0) e^(−t/τ). Hence the total synchronization time should be given by N · P(t) ≃
1, which gives t_sync ∝ τ ln N. In fact, our simulations show that the synchronization time
increases logarithmically with N.

4 Mutual Learning in the Tree Parity Machine

A single perceptron transmits too much information. An attacker, who knows the set of
input/output pairs, can derive the weights of the two partners. 
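This vulnerability can be made concrete for the discrete rule (9): every update depends only on publicly exchanged quantities (x, σ^A, σ^B), so an eavesdropper C who simply replays the moves receives exactly the same weight increments as the partners and is driven by the same reflecting random walks. A minimal sketch (assuming numpy; the sizes N = 100, L = 3 and the step budget are illustrative, not the paper's):

```python
import numpy as np

# An eavesdropper C replays the public updates of the discrete rule (9) and
# synchronizes together with the two partners A and B.
# Illustrative sizes (not from the paper): N = 100, L = 3.
rng = np.random.default_rng(2)
N, L, max_steps = 100, 3, 200_000

wA, wB, wC = (rng.integers(-L, L + 1, size=N) for _ in range(3))

def out(w, x):
    return 1 if w @ x >= 0 else -1        # output bit, sign convention sign(0) = +1

steps = 0
while not (np.array_equal(wA, wB) and np.array_equal(wC, wA)):
    steps += 1
    assert steps <= max_steps, "no synchronization within the step budget"
    x = rng.choice([-1, 1], size=N)       # public binary input
    sA, sB = out(wA, x), out(wB, x)
    if sA == sB:                          # Theta(sA*sB): move only on agreement
        wA = np.clip(wA - x * sA, -L, L)  # reflecting boundary at +-L
        wB = np.clip(wB - x * sB, -L, L)  # identical increment, since sA == sB
        wC = np.clip(wC - x * sA, -L, L)  # attacker replays the public move

print(f"all three weight vectors equal after {steps} steps")
```

Because all three vectors receive identical increments and the reflecting boundary can only shrink component distances, the attacker's synchronization is guaranteed here; hiding which weights are updated is therefore essential.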
On the one hand, the information
should be hidden so that the attacker cannot calculate the weights; but on the other hand,
enough information should be transmitted so that the two partners can synchronize. We
found that multilayer networks with hidden units may be candidates for such a task [11].
More precisely, we consider a Tree Parity Machine (TPM) with three hidden units, as shown
in Fig. 3.

 Figure 3: A tree parity machine with K = 3

Each hidden unit is a perceptron (1) with discrete weights (8). The output bit τ of the total
network is the product of the three bits of the hidden units,

 τ^A = σ_1^A σ_2^A σ_3^A,  τ^B = σ_1^B σ_2^B σ_3^B   (10)

At each training step the two machines A and B receive identical input vectors x_1, x_2, x_3.
The training algorithm is the following: only if the two output bits are identical, τ^A = τ^B,
can the weights be changed. In this case, only a hidden unit i whose output σ_i is identical to τ
changes its weights, using the Hebbian rule

 w_i^A(t+1) = w_i^A(t) − x_i τ^A   (11)

Neither the partner nor any attacker knows which one of the K weight vectors was
updated. The partners A and B react to their mutual output signals τ^A and τ^B,
whereas an attacker can only receive these signals but cannot influence the partners with its
own output bit. This is the essential mechanism which allows synchronization but pro-
hibits learning. Nevertheless, advanced attackers use different heuristics to accelerate their
synchronization, as described in the next section.

5 Attackers

The following are possible attack strategies, which were suggested by Shamir et al. [12]:
The Genetic Attack, in which a large population of attackers is trained; at every
time step each attacker is multiplied so as to cover the 2^(K−1) possible internal representations
{σ_i} of the current output τ. As the dynamics proceeds, successful attackers stay while the
unsuccessful ones are removed. 
The Probabilistic Attack, in which the attacker tries to follow
the probability of every weight element by calculating the distribution of the local field of
every input and using the output, which is publicly known. The Naive Attack, in which
the attacker imitates one of the parties.
More successful is the Flipping Attack strategy, in which the attacker imitates one of the
parties, but in steps in which his output disagrees with the imitated party's output, he
negates ("flips") the sign of one of his hidden units. The unit most likely to be wrong
is the one with the minimal absolute value of the local field; therefore that is the unit which
is flipped.
While the synchronization time increases with L² [15], the probability of finding a success-
ful flipping attacker decreases exponentially with L,

 P ∝ exp(−yL)

as seen in Figure 4. Therefore, for large L values the system is secure [15]. In every time step, the
parties either approach each other ("attractive step") or drift apart ("repulsive step").
Close to synchronization the probability for a repulsive step in the mutual learning between
A and B scales like ε², while in the dynamic learning between the naive attacker C and
A it scales like ε, where we define ε = Prob(σ_i^C ≠ σ_i^A) [18].
It has been shown that among a group of Ising-vector students which perform learning, and
have an overlap R with the teacher, the best student is the center-of-mass vector (which was
shown to be an Ising vector as well), which has an overlap R_cm ≥ R for R ∈ [0, 1] [19].
Therefore letting a group of attackers cooperate throughout the process may be to their
advantage. The most successful attack strategy, the "Majority Flipping Attack", uses a
group of attackers as a cooperating group rather than as individuals. When updating the
weights, instead of each attacker being updated according to its own result, all are updated
according to the majority's result. 
This "team-work" approach improves the attacker's
performance. When using the majority scheme, the probability for a successful attacker
seems to approach a constant value ≈ 0.5, independent of L.

Figure 4: The attacker's success probability P as a function of L, for the flipping attack and
the majority-flipping attack, with N = 1000, M = 100, averaged over 1000 samples; the
flipping-attack data follow the fit P = 1.55 exp(−0.4335 L). To avoid fluctuations, we
define the attacker as successful if he has found 98% of the weights.

6 Analytical description

The semi-analytical description of this process gives us further insight into the synchroniza-
tion process of mutual and dynamic learning. The study of discrete networks requires dif-
ferent methods of analysis than those used for the continuous case. We found that instead
of examining the evolution of R and Q, we must examine (2L+1) × (2L+1) parameters,
which describe the mutual learning process. By writing a Markovian process that describes
the development of these parameters, one gains an insight into the learning procedure. Thus
we define a (2L+1) × (2L+1) matrix F, in which the state of the machines at each time
step is represented. The elements of F are f_qr, where q, r = −L, ..., −1, 0, 1, ..., L.
The element f_qr represents the fraction of components for which A's
component is equal to q and the matching component in unit B is equal to r. Hence,
the overlap between the two units as well as their norms are defined through this matrix,

 R = Σ_{q,r=−L}^{L} q r f_qr,  Q^A = Σ_{q,r=−L}^{L} q² f_qr,  Q^B = Σ_{q,r=−L}^{L} r² f_qr   (12)

The updating of the matrix elements is described as follows: for the elements with q and r
which are not on the boundary (q ≠ ±L and r ≠ ±L), the update can be written in a
simple manner,

 f⁺_{q,r} = p f_{q,r} + (1−p)/2 · f_{q+1,r−1} + (1−p)/2 · f_{q−1,r+1}   (13)

Our results indicate that the order parameters are not self-averaging quantities [16]. Several
runs with the same N result in different curves for the order parameters as a function of
the number of steps, see Figure 5. This explains the non-zero variance of the overlap as a result of
the fluctuations in the local fields induced by the input, even in the thermodynamic limit.

7 Combining neural networks and chaos synchronization

Two chaotic systems starting from different initial conditions can be synchronized by differ-
ent kinds of couplings between them. This chaotic synchronization can be used in neural
cryptography to enhance the cryptographic systems and to improve their security. A model
which combines a TPM and logistic maps, presented below, was shown to be more
secure than the TPM discussed above. Other models, which use mutual synchronization of
networks whose dynamics are those of the Lorenz system, are currently under research and seem
very promising.

Figure 5: The averaged overlap and its standard deviation as a function of the number
of steps, as found from the analytical results (solid line) and simulation results (circles)
of mutual learning in TPMs. Inset: analytical results (solid line) and simulation results
(circles) for the perceptron, with L = 1 and N = 10^4.

In the following system we combine neural networks with logistic maps: both partners A
and B use their neural networks as input for the logistic maps which generate the output
bits to be learned. By mutually learning these bits, the two neural networks approach each
other and produce an identical signal to the chaotic maps, which in turn synchronize as
well, thereby accelerating the synchronization of the neural nets.
Previously, the output bit of each hidden unit was the sign of the local field [11]. 
Now we
combine the TPM with chaotic synchronization by feeding the local fields into logistic maps:

 s_k(t+1) = (1 − ε) α s_k(t) (1 − s_k(t)) + (ε/2) h̃_k(t)   (14)

Here h̃_k denotes a transformed local field which is shifted and normalized to fit into the
interval [0, 2]. For ε = 0 one has the usual quadratic iteration, which produces K chaotic
series s_k(t) when the parameter α is chosen correspondingly; here we use α = 3.95. For
0 < ε < 1 the logistic maps are coupled to the fields of the hidden units. It has been
shown that such a coupling leads to chaotic synchronization [17]: if two identical maps
with different initial conditions are coupled to a common external signal, they synchronize
when the coupling strength is large enough, ε > ε_c.
The security of key generation increases as the system approaches the critical point of
chaotic synchronization. The probability of a successful attack decreases like exp(−yL),
and it is possible that the exponent y diverges as the coupling constant between the neural
nets and the chaotic maps is tuned to be critical.

8 Conclusions

A new phenomenon has been observed: synchronization by mutual learning. If the learning
rate is large enough, and if the weight vectors are kept normalized, then the two networks
relax to a parallel or antiparallel orientation, depending on the learning rule. Their weight
vectors still move like a random walk on a hypersphere, but each network has complete
knowledge about its partner.
It has been shown how this phenomenon can be used for cryptography. The two partners
can create a common secret key over a public channel. The fact that the parties learn
mutually gives them an advantage over the attacker, who learns one-way. 
In contrast
to number-theoretic methods the networks are very fast; essentially they are linear filters,
and the complexity of generating a key of length N scales with N (for sequential update of the
weights).
Yet sophisticated attackers which use ensembles of cooperating attackers have a good
chance to synchronize. However, advanced algorithms for synchronization, which involve
different types of chaotic synchronization, seem to be more secure. Such models are sub-
jects of active research, and only the future will tell whether the security of neural network
cryptography can compete with number-theoretic methods.

References
 [1] J. Hertz, A. Krogh, and R. G. Palmer: Introduction to the Theory of Neural Compu-
 tation (Addison Wesley, Redwood City, 1991).
 [2] A. Engel and C. Van den Broeck: Statistical Mechanics of Learning (Cambridge
 University Press, 2001).
 [3] M. Biehl and N. Caticha: Statistical Mechanics of On-line Learning and Generaliza-
 tion, in The Handbook of Brain Theory and Neural Networks, ed. by M. A. Arbib (MIT
 Press, 2001).
 [4] E. Eisenstein, I. Kanter, D. A. Kessler and W. Kinzel, Phys. Rev. Lett. 74, 6-9 (1995).
 [5] I. Kanter, D. A. Kessler, A. Priel and E. Eisenstein, Phys. Rev. Lett. 75, 2614-2617
 (1995); L. Ein-Dor and I. Kanter, Phys. Rev. E 57, 6564 (1998); M. Schroder and W.
 Kinzel, J. Phys. A 31, 9131-9147 (1998); A. Priel and I. Kanter, Europhys. Lett. (2000).
 [6] R. Metzler, W. Kinzel and I. Kanter, Phys. Rev. E 62, 2555 (2000).
 [7] D. R. Stinson, Cryptography: Theory and Practice (CRC Press, 1995).
 [8] R. Metzler, W. Kinzel, L. Ein-Dor and I. Kanter, Phys. Rev. E 63, 056126 (2001).
 [9] M. Rosen-Zvi, I. Kanter and W. Kinzel, cond-mat/0202350 (2002).
[10] R. Urbanczik, private communication.
[11] I. Kanter, W. Kinzel and E. Kanter, Europhys. Lett. 57, 141 (2002).
[12] A. Klimov, A. Mityagin and A. Shamir, ASIACRYPT 2002, 288-298.
[13] W. Kinzel, R. Metzler and I. Kanter, J. Phys. A 33, L141 (2000).
[14] W. 
Kinzel, Contribution to Networks, ed. by H. G. Schuster and S. Bornholdt, to be
 published by Wiley VCH (2002).
[15] R. Mislovaty, Y. Perchenok, I. Kanter and W. Kinzel, Phys. Rev. E 66, 066102 (2002).
[16] G. Reents and R. Urbanczik, Phys. Rev. Lett. 80, 5445 (1998).
[17] R. Mislovaty, E. Klein, I. Kanter and W. Kinzel, Phys. Rev. Lett. 91, 118701 (2003).
[18] M. Rosen-Zvi, E. Klein, I. Kanter and W. Kinzel, Phys. Rev. E 66, 066135 (2002).
[19] M. Copelli, M. Boutin, C. Van den Broeck and B. Van Rompaey, Europhys. Lett. 46,
 139 (1999).
", "award": [], "sourceid": 2744, "authors": [{"given_name": "Einat", "family_name": "Klein", "institution": null}, {"given_name": "Rachel", "family_name": "Mislovaty", "institution": null}, {"given_name": "Ido", "family_name": "Kanter", "institution": null}, {"given_name": "Andreas", "family_name": "Ruttor", "institution": null}, {"given_name": "Wolfgang", "family_name": "Kinzel", "institution": null}]}