{"title": "Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM", "book": "Advances in Neural Information Processing Systems", "page_first": 321, "page_last": 328, "abstract": null, "full_text": " Trait selection for assessing beef meat quality
 using non-linear SVM


 J. J. del Coz, G. F. Bayón, J. Díez, O. Luaces, A. Bahamonde
 Artificial Intelligence Center
 University of Oviedo at Gijón
 juanjo@aic.uniovi.es

 Carlos Sañudo
 Facultad de Veterinaria
 University of Zaragoza
 csanudo@posta.unizar.es


 Abstract

 In this paper we show that it is possible to model the sensory impressions
 of consumers about beef meat. This is not a straightforward task: when we
 aim to induce a function that maps object descriptions into ratings, we
 must consider that consumers' ratings are just a way of expressing their
 preferences about the products presented in the same testing session.
 Therefore, we had to use a special-purpose SVM polynomial kernel. The
 training data set collects the ratings of panels of experts and consumers;
 the meat was provided by 103 bovines of 7 Spanish breeds with different
 carcass weights and aging periods. Additionally, to gain insight into
 consumer preferences, we used feature subset selection tools. The result
 is that aging is the most important trait for improving consumers'
 appreciation of beef meat.


1 Introduction

The quality of beef meat is appreciated through sensory impressions, and therefore its
assessment is very subjective. However, it is known that there are objective traits that are
very important for the final properties of beef meat; these include the breed and feeding of
the animals, the weight of the carcasses, and the aging of the meat after slaughter. To
discover the influence of these and other attributes, we have applied Machine Learning
tools to the results of an experiment reported in [8]. 
In that experiment, 103 bovines of 7 Spanish breeds were slaughtered
to obtain two kinds of carcasses, light and standard [5]; the meat was prepared with 3 aging
periods: 1, 7, and 21 days. Finally, the meat was consumed by a group, called a panel, of 11
experts, and assessed by a panel of untrained consumers.

The conceptual framework used for the study reported in this paper was the analysis of
sensory data. In general, this kind of analysis is used by food industries to adapt their
production processes and improve the acceptability of their specialties. They need to
discover the relationship between descriptions of their products and consumers' sensory
degree of satisfaction. An excellent survey of the use of sensory data analysis in the food
industry can be found in [15, 2]; for a Machine Learning perspective, see [3, 9, 6].

The role played by each panel, experts and consumers, is very clear. The experts' panel
is made up of a usually small group of trained people who rate several traits of products,
such as fibrosis, flavor, odor, etc. The most essential property of expert panelists, in
addition to their discriminatory capacity, is their own coherence, but not necessarily the
uniformity of the group. The experts' panel can be viewed as a bundle of sophisticated
sensors whose ratings are used to describe each product, in addition to other objective
traits. On the other hand, the untrained consumers (C) are asked to rate their degree of
acceptance or satisfaction with the tested products on a given scale. Usually, this panel
is organized in a set of testing sessions, where a group of potential consumers assess some
instances from a sample E of the tested product. 
Frequently, each consumer participates in only a small
number (sometimes only one) of testing sessions, usually on the same day.

In general, the success of sensory analysis relies on the capability to identify, with a
precise description, a kind of product that can be reproduced as many times as needed and
tested by as many consumers as possible. Therefore, the study of beef meat sensory quality
is very difficult. The main reason is that there are important individual differences in
each piece of meat, so the repeatability of tests can be only partially ensured. Notice
that each animal yields only a limited number of similar pieces of meat, and thus we can
provide only so many pieces of a given breed, weight, and aging period. Additionally, it
is worth noting that the cost of acquiring this kind of sensory data is very high.

The paper is organized as follows: in the next section we present an approach that deals
with testing sessions explicitly. The overall idea is to look for a preference or ranking
function able to reproduce the implicit ordering of products given by consumers, instead
of trying to predict the exact value of consumer ratings; such a function must return
higher values for products with higher ratings. In Section 3 we show how some
state-of-the-art feature subset selection (FSS) methods designed for SVM (Support Vector
Machines) with non-linear kernels can be adapted to preference learning. Finally, at the
end of the paper, we return to the beef meat data set to show how it is possible to explain
consumer behavior, and to interpret the relevance of meat traits in this context.


2 Learning from sensory data


A straightforward approach to handling sensory data can be based on regression, where the
sensory description of each object x ∈ E is endowed with the degree of satisfaction r(x)
of each consumer (or the average of a group of consumers). 
However, this approach does
not faithfully capture people's preferences [7, 6]: consumers' ratings actually express a
relative ordering, so there is a kind of batch effect that often biases their ratings. Thus,
a product can obtain a higher (lower) rating depending on whether it is assessed together
with worse (better) products. Therefore, information about the batches tested by consumers
in each rating session is a very important issue. On the other hand, more traditional
approaches, such as testing statistical hypotheses [16, 15, 2], require all available food
products in the sample E to be assessed by the whole set of consumers C, a requisite that
is very difficult to fulfill.

In this paper we use an approach to sensory data analysis based on learning consumers'
preferences, see [11, 14, 1], where training examples are represented by preference
judgments, i.e. pairs of vectors (v, u) indicating that, for someone, object v is preferable
to object u. We will show that this approach can induce more useful knowledge than other
approaches, such as regression-based methods. The main reason is that sets of preference
judgments can represent more of the information that is relevant for discovering consumers'
preferences.


2.1 A formal framework to learn consumer preferences

In order to solve our preference problems, we will try to find a real-valued ranking
function f that maximizes the probability of having f(v) > f(u) whenever v is preferable
to u [11, 14, 1].

Our input data is made up of a set of ratings (r_i(x) : x ∈ E_i) for i ∈ C. To avoid the
batch effect, we will create a preference judgment set PJ = {v_j > u_j : j = 1, . . . , n}
suitable for our needs by considering all pairs (v, u) such that objects v and u were
presented in the same session to a given consumer i, and r_i(v) > r_i(u).

Thus, following the approach introduced in [11], we look for a function
F : R^d × R^d → R such that

 ∀ x, y ∈ R^d, F(x, y) > 0 ⟺ F(x, 0) > F(y, 0). 
                                                                             (1)

Then, the ranking function f : R^d → R can simply be defined by f(x) = F(x, 0).

As we have already constructed a set of preference judgments PJ, we can specify F by
means of the restrictions

 F(v_j, u_j) > 0 and F(u_j, v_j) < 0, j = 1, . . . , n. (2)

Therefore, we have a binary classification problem that can be solved using SVM. We
follow the same steps as Herbrich et al. in [11], and define a kernel K as follows:

 K(x1, x2, x3, x4) = k(x1, x3) - k(x1, x4) - k(x2, x3) + k(x2, x4) (3)

where k(x, y) = ⟨φ(x), φ(y)⟩ is a kernel function defined as the inner product of the
images of the two objects in the feature space. In the experiments reported in Section 4,
we will employ a polynomial kernel, defining k(x, y) = (⟨x, y⟩ + c)^g, with c = 1 and
g = 2. Notice that, finally, we can express the ranking function f in a non-linear form:

 f(x) = Σ_{i=1}^{n} α_i z_i (k(x_i^{(1)}, x) - k(x_i^{(2)}, x)) (4)


3 Feature subset selection methods in a non-linear environment

When dealing with sensory data, it is important to know not only which classifier is the
best and how accurate it is, but also which features are relevant to the tastes of
consumers. Producers can focus on these features to improve the quality of the final
product. Additionally, reducing the number of features often makes data acquisition
cheaper, making these systems suitable for industrial operation [9].

There are many feature subset selection methods applied to SVM classification. If our
goal is to find a linear separator, RFE (Recursive Feature Elimination) [10] is a good
choice. It is a ranking method that returns an ordering of the features: RFE iteratively
removes the least useful feature, and this process is repeated until no features remain.
Thus, we obtain an ordered sequence of features.

Following the main idea of RFE, we have used two methods capable of ordering features in
non-linear scenarios. 
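The elimination scheme shared by RFE and the two non-linear methods can be sketched as
follows. This is a minimal sketch, not the original implementation: `ranking_value` is an
assumed callback that scores the relevance of a feature within the current subset (in
practice it would retrain the SVM and evaluate a criterion such as R1, or the negated R2,
since for R2 higher values mean lower relevance).

```python
# Sketch of the recursive elimination loop: repeatedly drop the feature with
# the lowest ranking value, producing an ordering of all features.
# `ranking_value(i, feats)` is an assumed relevance criterion (higher = more
# relevant); for a criterion like R2 it would be negated before use.

def order_features(n_features, ranking_value):
    remaining = list(range(n_features))
    ordering = []                      # filled from least to most useful
    while remaining:
        worst = min(remaining, key=lambda i: ranking_value(i, remaining))
        remaining.remove(worst)
        ordering.append(worst)
    return ordering[::-1]              # most relevant feature first

# Toy criterion: a feature's usefulness grows with its index
ordering = order_features(4, lambda i, feats: i)
```

With the toy criterion, features are eliminated in index order, so the returned ranking
lists the highest index first.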
We must also point out that, in this case, preference
learning data sets are formed by pairs of objects (v, u), and each object in the pair has
the same set of features. Thus, we must modify the ranking methods so they can deal with
the duplicated features.


3.1 Ranking features for non-linear preference learning

Method 1. This method orders the list of features according to their influence on the
variations of the weights. It is a gradient-like method, introduced in [17], which can be
seen as a generalization of RFE to the non-linear case. In each iteration it removes the
feature that minimizes the ranking value

 R1(i) = |∂‖w‖²/∂s_i| = |Σ_{k,j} α_k α_j z_k z_j ∂K(s∘x_k, s∘x_j)/∂s_i|, i = 1, . . . , d (5)

where s is a scaling factor used to simplify the computation of partial derivatives. Since
we are working on a preference learning problem, we need 4 copies of the scaling factor.
In this formula, for a polynomial kernel k(x, y) = (⟨x, y⟩ + c)^g and a vector s such that
∀i, s_i = 1, we have that

 ∂k(s∘x, s∘y)/∂s_i = 2g x_i y_i (c + ⟨x, y⟩)^{g-1}. (6)


Method 2. This method, introduced in [4], works in an iterative way, removing each time
the feature whose removal minimizes the loss of predictive performance. When using this
method for preference learning with the kernel of equation (3), the ranking criterion can
be expressed as

 R2(i) = Σ_k z_k Σ_j α_j z_j K(x_j^{(1),i}, x_j^{(2),i}, x_k^{(1),i}, x_k^{(2),i}) (7)

where x^i denotes a vector describing an object in which the value of the i-th feature has
been replaced by its mean value. 
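As an illustration, the R2 criterion of equation (7) could be computed along these lines.
This is a minimal sketch under stated assumptions: `alphas`, `zs`, and `pairs` are
hypothetical names for the coefficients, labels, and support pairs of a trained preference
SVM, and `mean_i` is the precomputed mean of the i-th feature.

```python
# Sketch of the R2 ranking value of equation (7), using the pairwise kernel K
# of equation (3) built on the polynomial kernel k(x, y) = (<x, y> + c)^g.

def poly_k(x, y, c=1.0, g=2):
    return (sum(a * b for a, b in zip(x, y)) + c) ** g

def K(x1, x2, x3, x4):
    # K((x1,x2),(x3,x4)) = k(x1,x3) - k(x1,x4) - k(x2,x3) + k(x2,x4)
    return poly_k(x1, x3) - poly_k(x1, x4) - poly_k(x2, x3) + poly_k(x2, x4)

def mean_replace(x, i, mean_i):
    # Build x^i: the i-th feature value is replaced by its mean
    return tuple(mean_i if j == i else v for j, v in enumerate(x))

def r2(i, alphas, zs, pairs, mean_i):
    reps = [(mean_replace(v, i, mean_i), mean_replace(u, i, mean_i))
            for v, u in pairs]
    return sum(zs[k] * sum(alphas[j] * zs[j] * K(*reps[j], *reps[k])
                           for j in range(len(pairs)))
               for k in range(len(pairs)))
```

Each feature is scored by re-evaluating the trained classifier with that feature neutralized
in every support pair, which avoids retraining the SVM once per candidate feature.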
Notice that a higher value of R2(i), that is, a higher
accuracy on the training set when the i-th feature is replaced, means a lower relevance of
that feature. Therefore, we will remove the feature yielding the highest ranking value, as
opposed to the ranking method described previously.


3.2 Model selection on an ordered sequence of feature subsets

Once we have an ordering of the features, we must select the subset F_i that maximizes the
generalization performance of the system. The most common choice for a model selection
method is cross-validation (CV), but its inefficiency and high variance [1] led us to try
another kind of method. We have used ADJ (ADJusted distance estimate) [19]. This is a
metric-based method that selects one model from a nested sequence of models of increasing
complexity. We construct a sequence of subsets F_1 ⊂ F_2 ⊂ . . . ⊂ F_d, where F_i
represents the subset containing only the i most relevant features. Then we can create a
nested sequence of models f_i, each of them induced by SVM from the corresponding F_i.

The key idea is the definition of a metric on the space of hypotheses. Thus, given two
different hypotheses f and g, their distance is calculated as the expected disagreement of
their predictions. Given that these distances can only be approximated, ADJ establishes a
method to compute d̂(f, t), an adjusted distance estimate between any hypothesis f and the
true target classification function t. Therefore, the selected hypothesis is

 f_k = arg min_{f_l} d̂(f_l, t). (8)

The distance estimate d̂ is computed by means of the expected disagreement of the
predictions on a couple of sets: the training set T, and a set U of unlabeled examples,
that is, a set of cases sampled from the same distribution as T but for which the correct
output is not given. The ADJ estimate is given by

 ADJ(f_l, t) ≝ d_T(f_l, t) · max_{k<l} d_U(f_k, f_l) / d_T(f_k, f_l) (9)
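Model selection with the ADJ estimate can be sketched as follows. This is a minimal
illustration under stated assumptions: `models` is a hypothetical list of classifiers
ordered by increasing complexity, distances are measured as 0/1 disagreement rates, and
ratios with zero training disagreement are simply skipped.

```python
# Sketch of ADJ model selection: pick the model minimizing
# d_T(f_l, t) * max_{k<l} d_U(f_k, f_l) / d_T(f_k, f_l), as in equation (9).

def disagreement(f, g, xs):
    # Expected disagreement of two hypotheses on a sample
    return sum(1 for x in xs if f(x) != g(x)) / len(xs)

def adj_select(models, T_x, T_y, U_x):
    def d_T_to_target(f):
        return sum(1 for x, y in zip(T_x, T_y) if f(x) != y) / len(T_x)
    best, best_score = None, float("inf")
    for l, fl in enumerate(models):
        ratio = 1.0
        for k in range(l):
            dt = disagreement(models[k], fl, T_x)
            du = disagreement(models[k], fl, U_x)
            if dt > 0:                 # skip degenerate ratios
                ratio = max(ratio, du / dt)
        score = d_T_to_target(fl) * ratio
        if score < best_score:
            best, best_score = fl, score
    return best

# Toy check against the target t(x) = x mod 2
models = [lambda x: 0, lambda x: x % 2]
chosen = adj_select(models, [0, 1, 2, 3], [0, 1, 0, 1], [4, 5, 6, 7])
```

Note that only the unlabeled inputs of U are needed, which is what makes ADJ cheaper to
apply than cross-validation when labeled sensory data is scarce and expensive.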