{"title": "Automatic Discovery of Cognitive Skills to Improve the Prediction of Student Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1386, "page_last": 1394, "abstract": "To master a discipline such as algebra or physics, students must acquire a set of cognitive skills. Traditionally, educators and domain experts manually determine what these skills are and then select practice exercises to hone a particular skill. We propose a technique that uses student performance data to automatically discover the skills needed in a discipline. The technique assigns a latent skill to each exercise such that a student's expected accuracy on a sequence of same-skill exercises improves monotonically with practice. Rather than discarding the skills identified by experts, our technique incorporates a nonparametric prior over the exercise-skill assignments that is based on the expert-provided skills and a weighted Chinese restaurant process. We test our technique on datasets from five different intelligent tutoring systems designed for students ranging in age from middle school through college. We obtain two surprising results. First, in three of the five datasets, the skills inferred by our technique support significantly improved predictions of student performance over the expert-provided skills. Second, the expert-provided skills have little value: our technique predicts student performance nearly as well when it ignores the domain expertise as when it attempts to leverage it. We discuss explanations for these surprising results and also the relationship of our skill-discovery technique to alternative approaches.", "full_text": "Automatic Discovery of Cognitive Skills\n\nto Improve the Prediction of Student Learning\n\nRobert V. Lindsey, Mohammad Khajah, Michael C. 
Mozer\nDepartment of Computer Science and Institute of Cognitive Science\n\nUniversity of Colorado, Boulder\n\nAbstract\n\nTo master a discipline such as algebra or physics, students must acquire\na set of cognitive skills. Traditionally, educators and domain experts use\nintuition to determine what these skills are and then select practice exer-\ncises to hone a particular skill. We propose a technique that uses student\nperformance data to automatically discover the skills needed in a disci-\npline. The technique assigns a latent skill to each exercise such that a\nstudent\u2019s expected accuracy on a sequence of same-skill exercises improves\nmonotonically with practice. Rather than discarding the skills identi\ufb01ed by\nexperts, our technique incorporates a nonparametric prior over the exercise-\nskill assignments that is based on the expert-provided skills and a weighted\nChinese restaurant process. We test our technique on datasets from \ufb01ve\ndi\ufb00erent intelligent tutoring systems designed for students ranging in age\nfrom middle school through college. We obtain two surprising results. First,\nin three of the \ufb01ve datasets, the skills inferred by our technique support\nsigni\ufb01cantly improved predictions of student performance over the expert-\nprovided skills. Second, the expert-provided skills have little value: our\ntechnique predicts student performance nearly as well when it ignores the\ndomain expertise as when it attempts to leverage it. We discuss expla-\nnations for these surprising results and also the relationship of our skill-\ndiscovery technique to alternative approaches.\n\n1 Introduction\n\nWith the advent of massively open online courses (MOOCs) and online learning platforms\nsuch as Khan Academy and Reasoning Mind, large volumes of data are collected from\nstudents as they solve exercises, acquire cognitive skills, and achieve a conceptual under-\nstanding. 
A student's data provides clues as to his or her knowledge state—the specific facts, concepts, and operations that the student has mastered, as well as the depth and robustness of the mastery. Knowledge state is dynamic and evolves as the student learns and forgets.

Tracking a student's time-varying knowledge state is essential to an intelligent tutoring system. Knowledge state pinpoints the student's strengths and deficiencies and helps determine what material the student would most benefit from studying or practicing. In short, efficient and effective personalized instruction requires inference of knowledge state [20, 25].

Knowledge state can be decomposed into atomic elements, often referred to as knowledge components [7, 13], though we prefer the term skills. Skills include retrieval of specific facts, e.g., the translation of 'dog' into Spanish is perro, as well as operators and rules in a domain, e.g., dividing each side of an algebraic equation by a constant to transform 3(x + 2) = 15 into x + 2 = 5, or calculating the area of a circle with radius r by applying the formula \pi r^2. When an exercise or question is posed, students must apply one or more skills, and the probability of correctly applying a skill is dependent on their knowledge state.

To predict a student's performance on an exercise, we thus must: (1) determine which skill or skills are required to solve the exercise, and (2) infer the student's knowledge state for those skills. With regard to (1), the correspondence between exercises and skills, which we will refer to as an expert labeling, has historically been provided by human experts. Automated techniques have been proposed, although they either rely on an expert labeling which they then refine [5] or treat the student knowledge state as static [3].
With regard\nto (2), various dynamical latent state models have been suggested to infer time-varying\nknowledge state given an expert labeling. A popular model, Bayesian knowledge tracing\nassumes that knowledge state is binary\u2014the skill is either known or not known [6]. Other\nmodels posit that knowledge state is continuous and evolves according to a linear dynamical\nsystem [21].\n\nOnly recently have methods been suggested that simultaneously address (1) and (2), and\nwhich therefore perform skill discovery. Nearly all of this work has involved matrix factor-\nization [24, 22, 14]. Consider a student \u00d7 exercise matrix whose cells indicate whether a\nstudent has answered an exercise correctly. Factorization leads to a vector for each student\ncharacterizing the degree to which the student has learned each of Nskill skills, and a vec-\ntor for each exercise characterizing the degree to which that exercise requires each of Nskill\nskills. Modeling student learning presents a particular challenge because of the temporal\ndimension: students\u2019 skills improve as they practice. Time has been addressed either via\ndynamical models of knowledge state or by extending the matrix into a tensor whose third\ndimension represents time.\n\nWe present an approach to skill discovery that di\ufb00ers from matrix factorization approaches in\nthree respects. First, rather than ignoring expert labeling, we adopt a Bayesian formulation\nin which the expert labels are incorporated into the prior. Second, we explore a nonparamet-\nric approach in which the number of skills is determined from the data. Third, rather than\nallowing an exercise to depend on multiple skills and to varying degrees, we make a stronger\nassumption that each exercise depends on exactly one skill in an all-or-none fashion. With\nthis assumption, skill discovery is equivalent to the partitioning of exercises into disjoint\nsets. 
Although this strong assumption is likely to be a simplification of reality, it serves to restrict the model's degrees of freedom compared to factorization approaches in which each student and exercise is assigned an N_skill-dimensional vector. Despite the application of sparsity and nonnegativity constraints, the best models produced by matrix factorization have had low-dimensional skill spaces, specifically N_skill \le 5 [22, 14]. We conjecture that the low dimensionality is not due to the domains being modeled requiring at most 5 skills, but rather to overfitting for N_skill > 5. With our approach of partitioning exercises into disjoint skill sets, we can afford N_skill \gg 5 without giving the model undue flexibility. We are aware of one recent approach to skill discovery [8, 9] which shares our assumption that each exercise depends on a single skill. However, it differs from our approach in that it does not try to exploit expert labels and presumes a fixed number of skills. We contrast our work to various alternative approaches toward the end of this paper.

2 A nonparametric model for automatic skill discovery

We now introduce a generative probabilistic model of student problem-solving in terms of two components: (1) a prior over the assignment of exercises to skills, and (2) the likelihood of a sequence of responses produced by a student on exercises requiring a common skill.

2.1 Weighted CRP: A prior on skill assignments

Any instructional domain (e.g., algebra, geometry, physics) has an associated set of exercises which students must practice to attain domain proficiency. We are interested in the common situation where an expert has identified, for each exercise, a specific skill which is required for its solution (the expert labeling).
It may seem unrealistic to suppose that each exercise requires no more than one skill, but in intelligent tutoring systems [7, 13], complex exercises (e.g., algebra word problems) are often broken down into a series of steps which are small enough that they could plausibly require only one skill (e.g., adding a constant to both sides of an algebraic equation). Thus, when we use the term 'exercise', in some domains we are actually referring to a step of a compound exercise. In other domains (e.g., elementary mathematics instruction), the exercises are designed specifically to tap what is being taught in a lesson and are thus narrowly focused.

We wish to exploit the expert labeling to design a nonparametric prior over assignments of exercises to skills—hereafter, skill assignments—and we wish to vary the strength of the bias imposed by the expert labeling. With a strong bias, the prior would assign nonzero probability to only the expert labeling. With no bias, the expert labeling would be no more likely than any other. With an intermediate bias, which provides soft constraints on the skill assignment, a suitable model might improve on the expert labeling.

We considered various methods, including fragmentation-coagulation processes [23] and the distance-dependent Chinese restaurant process [4]. In this article, we describe a straightforward approach based on the Chinese restaurant process (CRP) [1], which induces a distribution over partitions. The CRP is cast metaphorically in terms of a Chinese restaurant in which each entering customer chooses a table at which to sit.
Denoting the table at which\ncustomer i sits as Yi, customer i can take a seat at an occupied table y with P (Yi = y) \u221d ny\nor at an empty table with P (Yi = Ntable + 1) \u221d \u03b1, where Ntable is the number of occupied\ntables and ny is the number of customers currently seated at table y.\n\nThe weighted Chinese restaurant process (WCRP) [10] extends this metaphor by suppos-\ning that customers each have a \ufb01xed a\ufb03liation and are biased to sit at tables with other\ncustomers having similar a\ufb03liations. The WCRP is nothing more than the posterior over\ntable assignments given a CRP prior and a likelihood function based on a\ufb03liations. In the\nmapping of the WCRP to our domain, customers correspond to exercises, tables to distinct\nskills, and a\ufb03liations to expert labels. The WCRP thus partitions the exercises into groups\nsharing a common skill, with a bias to assign the same skill to exercises having the same\nexpert label.\nThe WCRP is speci\ufb01ed in terms of a set of parameters \u03b8 \u2261 {\u03b81, . . . , \u03b8Ntable}, where \u03b8y\nrepresents the a\ufb03liation associated with table y. In our domain, the a\ufb03liation corresponds\nto one of the expert labels: \u03b8y \u2208 {1, . . . , Nskill}. From a generative modeling perspective,\nthe a\ufb03liation of a table in\ufb02uences the a\ufb03liations of each customer seated at the table. Using\nXi to denote the a\ufb03liation of customer i\u2014or equivalently, the expert label associated with\nexercise i\u2014we make the generative assumption:\n\nP (Xi = x|Yi = y, \u03b8) \u221d \u03b2\u03b4x,\u03b8y + 1 \u2212 \u03b2 ,\n\nwhere \u03b4 is the Kronecker delta and \u03b2 is the previously mentioned bias. With \u03b2 = 0, a\ncustomer is equally likely to have any a\ufb03liation; with \u03b2 = 1, all customers at a table will\nhave the table\u2019s a\ufb03liation. 
With uniform priors on \theta_y, the conditional distribution on \theta_y is

P(\theta_y \mid X^{(y)}) \propto (1 - \beta)^{-n_y^{\theta_y}},

where X^{(y)} is the set of affiliations of customers seated at table y and n_y^a \equiv \sum_{X_i \in X^{(y)}} \delta_{X_i, a} is the number of customers at table y with affiliation a.

Marginalizing over \theta, the WCRP specifies a distribution over table assignments for a new customer: an occupied table y \in \{1, ..., N_{table}\} is chosen with probability

P(Y_i = y \mid X_i, X^{(y)}) \propto n_y \frac{1 + \beta(\kappa_y^{X_i} - 1)}{1 + \beta(N_{skill}^{-1} - 1)}, \quad \text{with} \quad \kappa_y^a \equiv \frac{(1 - \beta)^{-n_y^a}}{\sum_{\tilde{a}=1}^{N_{skill}} (1 - \beta)^{-n_y^{\tilde{a}}}}.    (1)

\kappa_y^a is a softmax function that tends toward 1 if a is the most common affiliation among customers at table y, and tends toward 0 otherwise. In the WCRP, an empty table N_{table} + 1 is selected with probability

P(Y_i = N_{table} + 1) \propto \alpha.    (2)

We choose to treat \alpha not as a constant but rather define \alpha \equiv \alpha'(1 - \beta), where \alpha' becomes the free parameter of the model that modulates the expected number of occupied tables, and the term 1 - \beta serves to give the model less freedom to assign new tables when the affiliation bias is high. (We leave the constant in the denominator of Equation 1 so that \alpha has the same interpretation regardless of \beta.)

For \beta = 0, the WCRP reduces to the CRP and the expert labels are ignored. Although the WCRP is undefined for \beta = 1, it is defined in the limit \beta \to 1, where it produces a seating arrangement equivalent to the expert labels with probability 1. For intermediate \beta, the expert labels serve as a soft constraint.
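As a concrete illustration, the seating rule of Equations 1 and 2 can be implemented in a few lines. This is our own sketch, not code from the paper: the function and variable names are ours, and expert labels are assumed to be coded as integers 0..N_skill-1.

```python
from collections import Counter

def wcrp_seat_probs(tables, new_label, n_skill, beta, alpha_prime):
    """Seating probabilities for a new customer (exercise) under the WCRP.

    tables: list of occupied tables; each table is the list of expert labels
            (affiliations) of the exercises already seated there.
    new_label: expert label of the entering exercise, coded in 0..n_skill-1.
    Implements Equations 1-2, with alpha = alpha_prime * (1 - beta).
    """
    probs = []
    denom = 1.0 + beta * (1.0 / n_skill - 1.0)  # constant denominator of Eq. 1
    for members in tables:
        counts = Counter(members)                # n_y^a for each affiliation a
        weights = [(1.0 - beta) ** (-counts[a]) for a in range(n_skill)]
        kappa = weights[new_label] / sum(weights)  # softmax term kappa_y^a
        probs.append(len(members) * (1.0 + beta * (kappa - 1.0)) / denom)
    probs.append(alpha_prime * (1.0 - beta))     # empty table, Eq. 2
    z = sum(probs)
    return [p / z for p in probs]
```

With beta = 0 this reduces to the ordinary CRP seating rule; as beta approaches 1, an entering exercise is almost surely seated with exercises whose dominant expert label matches its own.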
For any \u03b2, the WCRP seating arrangement\nspeci\ufb01es a skill assignment over exercises.\n\n2.2 BKT: A theory of human skill acquisition\n\nIn the previous section, we described a prior over skill assignments. Given an assignment,\nwe turn to a theory of the temporal dynamics of human skill acquisition. Suppose that a\nparticular student practices a series of exercises, {e1, e2, . . . , et, . . . , eT}, where the subscript\nindicates order and each exercise et depends on a corresponding skill, st.1 We assume that\nwhether or not a student responds correctly to exercise et depends solely on the student\u2019s\nmastery of st. We further assume that when a student works on et, it has no e\ufb00ect on\nthe student\u2019s mastery of other skills \u02dcs, \u02dcs (cid:54)= st. These assumptions\u2014adopted by nearly\nall past models of student learning\u2014allow us to consider each skill independently of the\nothers. Thus, for skill \u02dcs, we can select its subset of exercises from the sequence, e\u02dcs = {et |\nst = \u02dcs}, preserving order in the sequence, and predict whether the student will answer\neach exercise correctly or incorrectly. Given the uncertainty in such predictions, models\ntypically predict the joint likelihood over the sequence of responses, P (R1, . . . , R|e\u02dcs|), where\nthe binary random variable Rt indicates the correctness of the response to et.\n\nThe focus of our research is not on developing novel models of skill acquisition. Instead,\nwe incorporate a simple model that is a mainstay of the \ufb01eld, Bayesian knowledge tracing\n(BKT) [6]. BKT is based on a theory of all-or-none human learning [2] which postulates\nthat a student\u2019s knowledge state following trial t, Kt, is binary: 1 if the skill has been\nmastered, 0 otherwise. 
BKT is a hidden Markov model (HMM) with internal state K_t and emissions R_t.

Because BKT is typically used to model practice over brief intervals, the model assumes no forgetting, i.e., K cannot transition from 1 to 0. This assumption constrains the time-varying knowledge state: it can make at most one transition from 0 to 1 over the sequence of trials. Consequently, the \{K_t\} can be replaced by a single latent variable, T, that denotes the trial following which a transition is made, leading to the BKT generative model:

P(T = t \mid \lambda_L, \lambda_M) = \begin{cases} \lambda_L & \text{if } t = 0 \\ (1 - \lambda_L) \lambda_M (1 - \lambda_M)^{t-1} & \text{if } t > 0 \end{cases}    (3)

P(R_t = 1 \mid \lambda_G, \lambda_S, T) = \begin{cases} \lambda_G & \text{if } t \le T \\ 1 - \lambda_S & \text{otherwise,} \end{cases}    (4)

where \lambda_L is the probability that a student has mastered the skill prior to performing the first exercise, \lambda_M is the transition probability from the not-mastered to the mastered state, \lambda_G is the probability of correctly guessing the answer prior to skill mastery, and \lambda_S is the probability of answering incorrectly due to a slip following skill mastery.

Although we have chosen to model student learning with BKT, any other probabilistic model of student learning could be used in conjunction with our approach to skill discovery, including more sophisticated variants of BKT [11] or models of knowledge state with continuous dynamics [21]. Further, our approach does not require BKT's assumption that learning a skill is conditionally independent of the practice history of other skills. However, the simplicity of BKT allows one to conduct modeling on a relatively large scale.

1 To tie this notation to the notation of the previous section, s_t \equiv y_{e_t}, i.e., the table assignments of the WCRP correspond to skills, and exercise e_t is seated at table y_{e_t}.
Note that i in the previous section was used as an index over distinct exercises, whereas t in this section is used as an index over trials. The same exercise may be presented multiple times.

3 Implementation

We perform posterior inference through Markov chain Monte Carlo (MCMC) sampling. The conditional probability for Y_i given the other variables is proportional to the product of the WCRP prior term and the likelihood of each student's response sequence. The prior term is given by Equations 1 and 2, where by exchangeability we can take Y_i to be the last customer to enter the restaurant and where we analytically marginalize \theta. For an existing table, the likelihood is given by the BKT HMM emission sequence probability. For a new table, we must add an extra step to calculating the emission sequence probability because the BKT parameters do not have conjugate priors. We used Algorithm 8 from [16], which effectively produces a Monte Carlo approximation to the intractable marginal data likelihood, integrating out the BKT parameters that could be drawn for the new table.

For lack of conjugacy and any strong prior knowledge, we give each table's \lambda_L, \lambda_M, and \lambda_S independent uniform priors on [0, 1]. Because we wish to interpret BKT's K = 1 state as a "learned" state, we parameterize \lambda_G as a fraction of 1 - \lambda_S, where the fraction has a uniform prior on [0, 1]. We give log(1 - \beta) a uniform prior on [-5, 0] based on the simulations described in Section 4.1, and \alpha' is given an improper uniform prior with support on \alpha' > 0. Because of the lack of conjugacy, we explicitly represent each table's BKT parameters during sampling.
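For concreteness, the BKT emission-sequence probability that enters the sampler can be computed in closed form by summing out the latent transition trial T of Equations 3 and 4. The following is our own sketch (function and parameter names are ours), assuming at least one observed response; it is O(n^2) in the sequence length, which suffices for illustration.

```python
def bkt_sequence_prob(responses, lam_L, lam_M, lam_G, lam_S):
    """Marginal probability of a response sequence under BKT (Eqs. 3-4).

    responses: list of 0/1 correctness indicators for one student on one skill
               (assumed non-empty).  The latent transition trial T, after which
               the skill is mastered, is summed out exactly; T = 0 means the
               skill was mastered before the first trial.
    """
    n = len(responses)

    def pre(r):   # emission probability before mastery (guess regime, t <= T)
        return lam_G if r else 1.0 - lam_G

    def post(r):  # emission probability after mastery (slip regime, t > T)
        return 1.0 - lam_S if r else lam_S

    total = 0.0
    for T in range(n):
        p_T = lam_L if T == 0 else (1.0 - lam_L) * lam_M * (1.0 - lam_M) ** (T - 1)
        lik = 1.0
        for t, r in enumerate(responses, start=1):
            lik *= pre(r) if t <= T else post(r)
        total += p_T * lik
    # Mastery never occurs within the observed sequence: P(T >= n).
    p_tail = (1.0 - lam_L) * (1.0 - lam_M) ** (n - 1)
    lik = 1.0
    for r in responses:
        lik *= pre(r)
    return total + p_tail * lik
```

Because the sum over T partitions the event space and the emissions are conditionally normalized, these sequence probabilities sum to one over all 2^n possible response sequences, which is a useful sanity check on any implementation.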
In each iteration of the sampler, we update the table assignments of each exercise and then apply five axis-aligned slice sampling updates to each table's BKT parameters and to the hyperparameters \beta and \alpha' [17].

For all simulations, we run the sampler for 200 iterations and discard the first 100 as the burn-in period. The seating arrangement is initialized to the expert-provided skills; all other parameters are initialized by sampling from the generative model. We use the post-burn-in samples to estimate the expected posterior probability of a student correctly responding in a trial, integrating over uncertainty in all skill assignments, BKT parameterizations, and hyperparameters. We explored using more iterations and a longer burn-in period but found that doing so did not yield appreciable increases in training or test data likelihoods.

4 Simulations

4.1 Sampling from the WCRP

We generated synthetic exercise-skill assignments via a draw from a CRP prior with \alpha = 3 and N_exercise = 100. Using these assignments as both the ground truth and the expert labels, we then simulated draws from the WCRP to determine the effect of \beta (the expert-labeling bias) and \alpha' (the concentration scaling parameter; see Equation 2) on the model's behavior. Figure 1a shows the reconstruction score, a measure of similarity between the induced assignment and the true labels. This score is the difference between (1) the proportion of pairs of exercises belonging to the same true skill that are assigned to the same recovered skill, and (2) the proportion of pairs of exercises belonging to different true skills that are assigned to the same recovered skill. The score lies in [0, 1], with 0 indicating no better than a chance relationship to the true labels and 1 indicating that the true labels are recovered exactly. The reported score is the mean over replications of the simulation and MCMC samples.
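The pair-counting computation behind the reconstruction score is simple enough to sketch directly. This is our own illustration (the function name is ours): the score is the within-skill pair agreement rate minus the rate at which pairs from different true skills are merged, so exact recovery scores 1 and a chance-level partition scores about 0.

```python
from itertools import combinations

def reconstruction_score(true_skills, recovered_skills):
    """Pairwise agreement between a recovered partition and the true partition.

    Score = P(same recovered skill | same true skill)
          - P(same recovered skill | different true skills).
    Both arguments are equal-length sequences mapping exercise -> skill label.
    """
    same_true = diff_true = same_hits = diff_hits = 0
    for i, j in combinations(range(len(true_skills)), 2):
        merged = recovered_skills[i] == recovered_skills[j]
        if true_skills[i] == true_skills[j]:
            same_true += 1
            same_hits += merged
        else:
            diff_true += 1
            diff_hits += merged
    return same_hits / same_true - diff_hits / diff_true
```

Note that the score is invariant to relabeling of the recovered skills, since it depends only on which pairs of exercises are grouped together.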
As \beta increases, the recovered skills better approximate the expert (true) skills, independent of \alpha'. Figure 1b shows the expected interaction between \alpha' and \beta on the number of occupied tables (induced skills): only when the bias is weak does \alpha' have an effect.

Figure 1: (a, b) Effect of varying the expert-labeling bias (\beta) and \alpha' on sampled skill assignments from a WCRP; (c) effect of expert labels and \beta on the full model's reconstruction of the true skills from synthetic data.

Figure 2: Effect of expert labels, N_student, N_exercise, and N_skill on the model's reconstruction of the true skills from synthetic data.

4.2 Skill recovery from synthetic student data

We generated data for N_student synthetic students responding to N_exercise exercises presented in a random order for each student. Using a draw from the CRP prior with \alpha = 3, we generated exercise-skill assignments. For each skill, we generated sequences of student correct/incorrect responses via BKT, with parameters sampled from plausible distributions: \lambda_L ~ Uniform(0, 1), \lambda_M ~ Beta(10, 30), \lambda_G ~ Beta(1, 9), and \lambda_S ~ Beta(1, 9).

Figure 1c shows the model's reconstruction of true skills for 24 replications of the simulation with N_student = 100 and N_exercise = 200, varying \beta and providing a set of expert skill labels that were either the true labels or a permutation of the true labels. The latter conveys no information about the true labels. The most striking feature of the result is that the model does an outstanding job of reconstructing the true labeling whether the expert labels are correct or not. Only when the bias \beta is strong and the expert labels are erroneous does the model's reconstruction performance falter.
The bottom line is that a good expert labeling\ncan help, whereas a bad expert labeling should be no worse than no expert-provided labels.\nIn a larger simulation, we systematically varied Nstudent \u2208 {50, 100, 150, 200}, Nexercise \u2208\n{100, 200, 300}, and assigned the exercises to one of Nskill \u2208 {10, 20, 30} skills via uniform\nmultinomial sampling. Figure 2 shows the result from 30 replications of the simulation\nusing expert labels that were either true or permuted (left and right panels, respectively).\nWith a good expert labeling, skill reconstruction is near perfect with Nstudent \u2265 100 and an\nNexercise : Nskill ratio of at least 10. With a bad expert labeling, more data is required to\nobtain accurate reconstructions, say, Nstudent \u2265 200. As one would expect, a helpful expert\nlabeling can overcome noisy or inadequate data.\n\n4.3 Evaluation of student performance data\n\nWe ran simulations on \ufb01ve student performance datasets (Table 1). The datasets varied\nin the number of students, exercises, and expert skill labels; the students in the datasets\nranged in age from middle school to college. Each dataset consists of student identi\ufb01ers,\nexercise identi\ufb01ers, trial numbers, and binary indicators of response correctness from stu-\ndents undergoing variable-length sequences of exercises over time.2 Exercises may appear\nin di\ufb00erent orders for each student and may occur multiple times for a given student.\n\n2For the DataShop datasets, exercises were identi\ufb01ed by concatenating what they call the prob-\nlem hierarchy, problem name, and the step name columns. Expert-provided skill labels were iden-\nti\ufb01ed by concatenating the problem hierarchy column with the skill column following the same\npractice as in [19, 18]. 
The expert skill labels infrequently associate an exercise with multiple skills. For such exercises, we treat the combination of skills as one unique skill.

source              dataset              # students  # exercises  # trials  # skills (expert)  # skills (WCRP)  \beta (WCRP)
PSLC DataShop [12]  fractions game       51          179          4,349     45                 7.9              0.886
PSLC DataShop [12]  physics tutor        66          4,816        110,041   652                49.4             0.947
PSLC DataShop [12]  engineering statics  333         1,223        189,297   156                99.2             0.981
[15]                Spanish vocabulary   182         409          578,726   221                183              0.996
PSLC DataShop [12]  geometry tutor       59          139          5,104     18                 19.7             0.997

Table 1: Five student performance datasets used in simulations

We compared a set of models which we will describe shortly. For each model, we ran ten replications of five-fold cross-validation on each dataset. In each replication, we randomly partitioned the set of all students into five equally sized disjoint subsets. In each replication-fold, we collected posterior samples using our MCMC algorithm given the data recorded for students in four of the five subsets. We then used the samples to predict the response sequences (correct vs. incorrect) of the remaining students.
On occasion, students in the\ntest set were given exercises that had not appeared in the training set. In those cases, the\nmodel used samples from Equations 1-2 to predict the new exercises\u2019 skill assignments.\n\nThe models we compare di\ufb00er in how skills are assigned to exercises. However, every model\nuses BKT to predict student performance given the skill assignments. Before presenting\nresults from the models, we \ufb01rst need to verify the BKT assumption that students improve\non a skill over time. We compared BKT to a baseline model which assumes a stationary\nprobability of a correct response for each skill. Using the expert-provided skills, BKT\nachieves a mean 11% relative improvement over the baseline model across the \ufb01ve datasets.\nThus, BKT with expert-provided skills is sensitive to the temporal dynamics of learning.\n\nTo evaluate models, we use BKT to predict the test students\u2019 data given the model-speci\ufb01ed\nskill assignment. We calculated several prediction-accuracy metrics, including RMSE and\nmean log loss. We report area under the ROC curve (AUC), though all metrics yield the\nsame pattern of results. Figure 3 shows the mean AUC, where larger AUC values indicate\nbetter performance. Each graph is a di\ufb00erent dataset. The \ufb01ve colored bars represent\nalternative approaches to determining the exercise-skill assignments. LFA uses skills from\nLearning Factors Analysis, a semi-automated technique that re\ufb01nes expert-provided skills\n[5]; LFA skills are available for only the Fractions and Geometry datasets. Single assigns\nthe same skill to all exercises. Exercise speci\ufb01c assigns a di\ufb00erent skill to each exercise.\nExpert uses the expert-provided skills. WCRP(0) uses the WCRP with no bias toward\nthe expert-provided skills, i.e., \u03b2 = 0, which is equivalent to a CRP. WCRP(\u03b2) is our\ntechnique with the level of bias inferred from the data.\n\nThe performance of expert is unimpressive. 
On Fractions, expert is worse than the single baseline. On Physics and Statics, expert is worse than the exercise-specific baseline. WCRP(\beta) is consistently better than both the single and exercise-specific baselines across all five datasets. WCRP(\beta) also outperforms expert, doing significantly better on three datasets and equivalently on the other two. Finally, WCRP(\beta) is about the same as LFA on Geometry, but substantially better on Fractions. (A comparison between these models is somewhat inappropriate: LFA has an advantage because it was developed on Geometry and is provided entire datasets for training, but it has a disadvantage because it was not designed to improve the performance of BKT.) Surprisingly, WCRP(0), which ignores the expert-provided skills, performs nearly as well as WCRP(\beta). Only for Geometry was WCRP(\beta) reliably better (two-tailed t-test with t(49) = 5.32, p < .00001). The last column of Table 1, which shows the mean inferred \beta value for WCRP(\beta), helps explain the pattern of results. The datasets are arranged in order of smallest to largest inferred \beta, both in Table 1 and in Figure 3. The inferred \beta values do a good job of indicating where WCRP(\beta) outperforms expert: the model infers that the expert skill assignments are useful for Geometry and Spanish, but less so for the other datasets. Where the expert skill assignments are most useful, WCRP(0) suffers. On the datasets where WCRP(\beta) is highly biased, the mean number of inferred skills (Table 1, column 7) closely corresponds to the number of expert-provided skills.

Figure 3: Mean AUC on test students' data for six different methods of determining skill assignments in BKT. Error bars show \pm 1 standard error of the mean.

5 Discussion

We presented a technique that discovers a set of cognitive skills which students use for problem solving in an instructional domain.
The technique assumes that when a student\nworks on a sequence of exercises requiring the same skill, the student\u2019s expected performance\nshould monotonically improve. Our technique addresses two challenges simultaneously: (1)\ndetermining which skill is required to correctly answer each exercise, and (2) modeling a\nstudent\u2019s dynamical knowledge state for each skill. We conjectured that a technique which\njointly addresses these two challenges might lead to more accurate predictions of student\nperformance than a technique which was based on expert skill labels. We found strong\nevidence for this conjecture: On 3 of 5 datasets, skill discovery yields signi\ufb01cantly improved\npredictions over \ufb01xed expert-labeled skills; on the other two datasets, the two approaches\nobtain comparable results.\n\nCounterintuitively, incorporating expert labels into the prior provided little or no bene\ufb01t.\nAlthough one expects prior knowledge to play a smaller role as datasets become larger, we\nobserved that even medium-sized datasets (relative to the scale of today\u2019s big data) are\nsu\ufb03cient to support a pure data-driven approach. In simulation studies with both synthetic\ndata and actual student datasets, 50-100 students and roughly 10 exercises/skill provides\nstrong enough constraints on inference that expert labels are not essential.\n\nWhy should the expert skill labeling ever be worse than an inferred labeling? After all, edu-\ncators design exercises to help students develop particular cognitive skills. One explanation\nis that educators understand the knowledge structure of a domain, but have not parsed the\ndomain at the right level of granularity needed to predict student performance. For exam-\nple, a set of exercises may all tap the same skill, but some require a deep understanding\nof the skill whereas others require only a super\ufb01cial or partial understanding. 
In such a case, splitting the skill into two subskills may be beneficial. In other cases, combining two skills that are learned jointly may subserve prediction, because the combination results in longer exercise histories which provide more context for prediction. These arguments suggest that fragmentation-coagulation processes [23] may be an interesting approach to leveraging expert labelings as a prior.

One limitation of the results we report is that we have yet to perform extensive comparisons of our technique to others that jointly model the mapping of exercises to skills and the prediction of student knowledge state. Three matrix factorization approaches have been proposed, two of which are as yet unpublished [24, 22, 14]. The work most similar to ours, which also assumes each exercise is mapped to a single skill, is the topical HMM [8, 9]. The topical HMM differs from our technique in that its underlying generative model supposes that the exercise-skill mapping is inherently stochastic and thus can change from trial to trial and student to student. (Also, it does not attempt to infer the number of skills or to leverage expert-provided skills.) We have initiated collaborations with several authors of these alternative approaches, with the goal of testing the various approaches on exactly the same datasets with the same evaluation metrics.

Acknowledgments This research was supported by NSF grants BCS-0339103 and BCS-720375 and by an NSF Graduate Research Fellowship to R. L.

References

[1] D. Aldous. Exchangeability and related topics.
In École d'été de probabilités de Saint-Flour, pages 1–198. Springer, Berlin, 1985.

[2] R. Atkinson. Optimizing the learning of a second-language vocabulary. Journal of Experimental Psychology, 96:124–129, 1972.

[3] T. Barnes. The Q-matrix method: Mining student response data for knowledge. In J. Beck, editor, Proceedings of the 2005 AAAI Educational Data Mining Workshop, 2005.

[4] D. Blei and P. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2383–2410, 2011.

[5] H. Cen, K. Koedinger, and B. Junker. Learning factors analysis—A general method for cognitive model evaluation and improvement. In M. Ikeda, K. Ashley, and T. Chan, editors, Intell. Tutoring Systems, volume 4053 of Lec. Notes in Comp. Sci., pages 164–175. Springer, 2006.

[6] A. Corbett and J. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling & User-Adapted Interaction, 4:253–278, 1995.

[7] A. Corbett, K. Koedinger, and J. Anderson. Intelligent tutoring systems. In M. Helander, T. Landauer, and P. Prabhu, editors, Handbook of Human Computer Interaction, pages 849–874. Elsevier Science, Amsterdam, 1997.

[8] J. González-Brenes and J. Mostow. Dynamic cognitive tracing: Towards unified discovery of student and cognitive models. In Proc. of the 5th Intl. Conf. on Educ. Data Mining, 2012.

[9] J. González-Brenes and J. Mostow. What and when do students learn? Fully data-driven joint estimation of cognitive and student models. In Proc. 6th Intl. Conf. Educ. Data Mining, 2013.

[10] H. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211–1235, 2003.

[11] M. Khajah, R. Wing, R. Lindsey, and M. Mozer.
Integrating latent-factor and knowledge-tracing models to predict individual differences in learning. EDM 2014, 2014.

[12] K. Koedinger, R. Baker, K. Cunningham, A. Skogsholm, B. Leber, and J. Stamper. A data repository for the EDM community: The PSLC DataShop. In C. Romero, S. Ventura, M. Pechenizkiy, and R. Baker, editors, Handbook of Educ. Data Mining, http://pslcdatashop.org, 2010.

[13] K. Koedinger, A. Corbett, and C. Perfetti. The knowledge-learning-instruction framework: Bridging the science-practice chasm to enhance robust student learning. Cognitive Science, 36(5):757–798, 2012.

[14] A. S. Lan, C. Studer, and R. G. Baraniuk. Time-varying learning and content analytics via sparse factor analysis. In ACM SIGKDD Conf. on Knowledge Disc. and Data Mining, 2014.

[15] R. Lindsey, J. Shroyer, H. Pashler, and M. Mozer. Improving students' long-term knowledge retention with personalized review. Psychological Science, 25:639–647, 2014.

[16] R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[17] R. Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.

[18] Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaption and Pers., pages 243–254. Springer, 2011.

[19] Z. A. Pardos, S. Trivedi, N. T. Heffernan, and G. N. Sárközy. Clustered knowledge tracing. In S. A. Cerri, W. J. Clancey, G. Papadourakis, and K. Panourgia, editors, ITS, volume 7315 of Lecture Notes in Computer Science, pages 405–410. Springer, 2012.

[20] A. Rafferty, E. Brunskill, T. Griffiths, and P. Shafto. Faster teaching by POMDP planning. In Proc. of the 15th Intl. Conf. on AI in Education, 2011.

[21] A. Smith, L. Frank, S. Wirth, M. Yanike, D. Hu, Y. Kubota, A. Graybiel, W. Suzuki, and E. Brown.
Dynamic analysis of learning in behav. experiments. J. Neuro., 24:447–461, 2004.

[22] J. Sohl-Dickstein. Personalized learning and temporal modeling at Khan Academy. Invited talk at NIPS Workshop on Data Driven Education, 2013.

[23] Y. Teh, C. Blundell, and L. Elliott. Modelling genetic variations with fragmentation-coagulation processes. In Advances in Neural Information Processing Systems, 2011.

[24] N. Thai-Nghe, L. Drumond, T. Horváth, A. Krohn-Grimberghe, A. Nanopoulos, and L. Schmidt-Thieme. Factorization techniques for predicting student performance. In O. Santos and J. Boticario, editors, Educ. Rec. Systems and Technologies, pages 129–153. 2011.

[25] J. Whitehill. A stochastic optimal control perspective on affect-sensitive teaching. PhD thesis, Department of Computer Science, UCSD, 2012.