{"title": "Hybrid-MST: A Hybrid Active Sampling Strategy for Pairwise Preference Aggregation", "book": "Advances in Neural Information Processing Systems", "page_first": 3475, "page_last": 3485, "abstract": "In this paper we present a hybrid active sampling strategy for pairwise preference aggregation, which aims at recovering the underlying rating of the test candidates from sparse and noisy pairwise labeling. Our method employs Bayesian optimization framework and Bradley-Terry model to construct the utility function, then to obtain the Expected Information Gain (EIG) of each pair. For computational efficiency, Gaussian-Hermite quadrature is used for estimation of EIG. In this work, a hybrid active sampling strategy is proposed, either using Global Maximum (GM) EIG sampling or Minimum Spanning Tree (MST) sampling in each trial, which is determined by the test budget. The proposed method has been validated on both simulated and real-world datasets, where it shows higher preference aggregation ability than the state-of-the-art methods.", "full_text": "Hybrid-MST: A Hybrid Active Sampling Strategy for\n\nPairwise Preference Aggregation\n\nJing Li\n\nLS2N/IPI Lab\n\nUniversity of Nantes\n\njingli.univ@gmail.com\n\nRafal K. Mantiuk\nComputer Laboratory\n\nUniversity of Cambridge\n\nrkm38@cam.ac.uk\n\nJunle Wang\nTuring Lab\n\nTencent Games\n\nwangjunle@gmail.com\n\nSuiyi Ling, Patrick Le Callet\n\nLS2N/IPI Lab\n\nUniversity of Nantes\n\nsuiyi.ling, patrick.lecallet@univ-nantes.fr\n\nAbstract\n\nIn this paper we present a hybrid active sampling strategy for pairwise preference\naggregation, which aims at recovering the underlying rating of the test candidates\nfrom sparse and noisy pairwise labelling. Our method employs Bayesian optimiza-\ntion framework and Bradley-Terry model to construct the utility function, then\nto obtain the Expected Information Gain (EIG) of each pair. For computational\nef\ufb01ciency, Gaussian-Hermite quadrature is used for estimation of EIG. 
In this work,\na hybrid active sampling strategy is proposed, either using Global Maximum (GM)\nEIG sampling or Minimum Spanning Tree (MST) sampling in each trial, which is\ndetermined by the test budget. The proposed method has been validated on both\nsimulated and real-world datasets, where it shows higher preference aggregation\nability than the state-of-the-art methods.\n\n1\n\nIntroduction\n\nPreference aggregation from annotators\u2019 pairwise labeling on the test candidates is a traditional but\nstill active research topic. As the name implies, the objective of preference aggregation is to infer\nthe underlying rating or ranking of the test candidates according to annotator\u2019s (users or players)\nbinary label, e.g. which one is better? In particular, recently, with the access of big data, preference\naggregation from pairwise labeling has been widely applied in recommendation systems such as on\nmovie, music, news, books, research articles, restaurant, products according to user\u2019s preference\nselection; or in social networks for aggregating social opinions; or in sports race, chess and online\ngames to infer the global ranking of the players, etc.\nIn some applications, such as game players matching systems (e.g. MSR\u2019s TrueSkill system[1]),\nfriends-making website and subjective image/video quality assessment (IQA/VQA) [2], discovering\nthe underlying scores of the test candidates is more important than the rank order so the system could\nknow the intensity of the preference from users, eventually to assign matching players to the on-line\ngame players, or recommend the possible friends who have the same interests to the users, or to\nquantitatively evaluate the performance of different coding/rendering/display techniques in IQA/VQA\ndomain. However, as the size of the test candidates n gets bigger, which is happening nowadays,\nthe number of required pairwise labeling grows exponentially O(n2) leading to the unfeasible\nimplementation. 
Thus, there is an urgent need to reduce the number of pairwise comparisons, that is,\nselecting a subset of the pairs without losing aggregation accuracy.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIn this paper, we present a hybrid active sampling strategy for pairwise labeling based on the Bradley-Terry\n(BT) model [3], which can convert pairwise preference data to scale values. This work considers\nnot only inferring the ranking but also recovering the underlying rating. The term Hybrid\nindicates that different sampling strategies are used in this method, determined by the test budget.\nAn active learning recipe is adopted in our strategy by maximizing the information gain according to\nLindley\u2019s Bayesian optimal framework [4]. To capture the latent rating information, the minimum\nspanning tree (MST) is employed, where the pairwise comparison is considered as an undirected graph.\nThe MST guarantees strong connectivity and eventually leads to higher prediction precision by the BT\nmodel. In addition, the MST allows for a parallel implementation of pairwise comparison through\ncrowdsourcing platforms (such as Amazon MTurk), i.e. multiple annotators can work at the same\ntime. Source code is publicly available on GitHub 1.\nThe main contributions of our work are highlighted as follows: 1) Batch mode facility: When the\nnumber of test candidates is n, the proposed Hybrid-MST active sampling strategy allows for n \u2212 1\nparallel pairwise comparisons each time. 2) Error tolerance: We do not model the annotator\u2019s\nbehavior in this work; however, the use of the MST to some extent tolerates malicious\nlabeling from spammers (who give wrong/random answers). 3) Low computational complexity:\nCompared to state-of-the-art methods that consider numerous parameters and deal with both\nactive sampling and noise removal (e.g. 
Crowd-BT [5]), Hybrid-MST has much less time complexity.\n4) Application \ufb02exibility: Hybrid-MST is applicable in all conditions where aggregation on ranking\nor rating or both is required. It is also conductible in both small-scale lab test environment or\nlarge-scale crowdsourcing platform.\nThe remainder of this paper is organized as follows. State-of-the-art work is introduced in Section 2.\nThe proposed Hybrid-MST strategy is presented in Section 3 containing both theoretical analysis\nand Monte Carlo simulation analysis. Extensive experimental validation on simulated dataset and\nreal-world datasets are shown in Section 4. Finally, Section 5 concludes this work.\n\n2 Related Work\n\nIn real applications of preference aggregation, annotator\u2019s label could be explicit, for instance, a\nLikert scale score from \u201cexcellent\u201d to \u201cbad\u201d, or implicit, e.g. pairwise comparison voting on two\ntest candidates. The explicit label is more likely to be inconsistent [6][7] and noisy due to diverse\nin\ufb02uence factors [8]. According to a well known phenomenon in psychological study of human choice\nthat \u201chuman response to comparison questions is more stable in the sense that it is not easily affected\nby irrelevant alternatives\u201d[9], obtaining label from pairwise comparison is thus a more appealing\nway for human participated labeling application, such as image quality assessment. Nevertheless, in\nwhatever types of pairwise comparison, pairwise labeling still suffers from noises from a variety of\nsources, such as the human annotator\u2019s expertise, the emotional states of players in a match, or the\nenvironment (external factors) of competition venue. 
In such case, the challenge changes to how to\ninvert this implicit and in most cases noisy pairwise data back to the true global ranking or rating.\nSeveral models have been proposed to explain the relation between pairwise-comparison responses\nand ranking/rating scale, including the earlier heuristic methods Borda Count[10], and the currently\nwidely used probabilistic permutation model such as the Plackett-Luce (PL) model[11][12], the\nMallows model [13], the Bradley-Terry (BT) model[3], and the Thurstone-Mosteller (TM) model[14].\nWhen facing the large-scale data but with sparse labels, these models might have computational\ncomplexity issues or parameter estimating issues. Thus, in machine learning community, numerous\nstudies have been focusing on optimizing the parameters of these models[15][16], designing ef\ufb01-\ncient algorithms [17][18], providing sharp minimax bounds [19] and proposing novel aggregation\nmodels[9][20][21]. Meanwhile, some researches are aiming at develop novel models to infer the\nlatent scores of the test candidates from pairwise data and eventually obtain the rank ordering[6]\n[22][23][24].\nIt is well known that pairwise comparison needs large number of pairwise data to infer the ranking,\nwhich is in most applications very time consuming. A straightforward way to boost the pairwise\nlabeling procedure is through data sampling. A simple and straightforward pair sampling strategy\nis random sampling such as the \u201cbalanced sub-set\u201d method proposed by Dykstra [25] by putting\nthe test candidates in a form (triangle, or rectangular matrix) only subsets of the test candidates are\n\n1Source code: https://github.com/jingnantes/hybrid-mst\n\n2\n\n\fcompared, and the HRRG (HodgeRank on Random Graph) method proposed by Xu et al. [26] where\nrandom graph is utilized and only connected vertices are compared, meanwhile a Hodge theory based\nrank model (HodgeRank) is proposed to convert the sparse pairwise data to scale ratings. 
Another\nway to sample pairs is based on empirical observations that comparing closer/similar pairs would be\nmore important than the distant pairs. In [27], the authors proposed to apply the sorting algorithms to\nsample pairs. In [28][29], Li et al. proposed an Adaptive Rectangular Design (ARD) to adaptively\nand iteratively selecting pairs based on the estimated rank ordering of test candidates.\nTo further improve the aggregation performance, the recent studies focused on active learning for\ninformation retrieval. In [30], the authors exploit the underlying low-dimensional Euclidean space\nof the data to discover the ranking using a small number of pairwise comparisons. Some other\nresearches focus on selecting the pairs which could generate the maximum information gain de\ufb01ned\nby a utility function. In [31], the sampling strategy is based on TM model by employing the Bayesian\noptimization framework, while Chen et.al. [5] (Crowd-BT) utilizes the BT model but also considers\nthe annotator\u2019s in\ufb02uence. Xu et al. [32] (Hodge-active) employs the HodgeRank model as well as the\nBayesian information maximization to actively select the pair.\nActive learning based sampling methods have demonstrated their outstanding performance in different\ndatasets. However, they still have at least one of the following drawbacks: 1) The sampling procedure\nis a sequential decision process, which means the generation of next pair is determined only when\nthe previous observation is \ufb01nished. Such sequential mode is not suitable for large-scale (e.g.\ncrowdsourcing) experiments, in which many conditions are tested in parallel. 2) Most of the proposed\nmethods focus on ranking aggregation, which might not be accurate enough for the applications\nthat require ratings scores. 
3) Annotator\u2019s unreliability on labeling the pairwise data should be\nconsidered in the active learning process, in other words, the active sampling strategy should be\nrobust to observation errors. A straightforward way is to model annotator\u2019s behavior, as done for the\nCrowd-BT method [5]. However, it is computationally expensive.\nTo resolve the challenges mentioned above, in this paper, we proposed a hybrid active sampling\nstrategy which allows for batch mode labeling and be robust to annotator\u2019s random/inverse labeling\nbehavior to infer the scale ratings. Details are introduced in the following sections.\n\n3 Proposed Methodology\n\nLet us assume that we have n objects A1, A2, ...An to test in a pairwise comparison experiment. The\nunderlying quality scores of these objects are s = (s1, s2, ...sn). In an experiment, the annotator\u2019s\nobserved score for object Ai is ri. ri is a random variable ri = si + \u0001i, where the noise term is a\nGaussian random variable \u0001i \u223c N (0, \u03c32\ni ). In a single trial, if ri > rj, then the annotator selects Ai\nover Aj, and the outcome is registered as yij = 1. If ri < rj, then yij = 0. For the case that ri = rj,\nyij is randomly assigned with 0 or 1 (In real test, the annotators in such condition could randomly\nmake a selection). The probability of selecting Ai over Aj is denoted as P r(Ai (cid:31) Aj).\n\n3.1 Preference aggregation model\n\nThere are already some well-known models to convert the pairwise probability data to cardinal scale\nratings as we mentioned before. In this study, we choose BT model as an example. But this work\ncould be easily extended to generalized linear model (GLM), in which BT model is the logit condition,\nand TM model is the probit condition.\nAccording to BT model, for any two objects Ai and Aj, the probability that Ai is preferred over Aj,\ni.e. 
P r(Ai (cid:31) Aj) could be represented as:\nP r(Ai (cid:31) Aj) (cid:44) \u03c0ij =\n\n(cid:80)t\n\n\u03c0i \u2265 0,\n\n(1)\n\n\u03c0i\n\ni=1 \u03c0i = 1\n\n,\n\n\u03c0i + \u03c0j\n\nwhere \u03c0i is the merit of the object Ai. The relationship between underlying score si and \u03c0i is\nsi = log(\u03c0i), thus, we obtain:\n\n(2)\nSince we measured is a distance value between two objects, there are in total n \u2212 1 free parameters\nthat need to be estimated. To infer the n \u2212 1 parameters in BT model, the Maximum Likelihood\n\n1 + e\u2212(si\u2212sj )\n\nesi + esj\n\n\u03c0ij =\n\n=\n\nesi\n\n1\n\n3\n\n\fEstimation (MLE) method is adopted in this study. Given the pairwise comparison results arranged\nin a matrix M = (mij)n\u00d7n, where mij represents the total number of trial outcomes Ai (cid:31) Aj, the\nlikelihood function takes the shape:\n\nL(s|M) =\n\n\u03c0mij\nij\n\n(1 \u2212 \u03c0ij)mji\n\n(3)\n\n(cid:89)\n\ni 0}. w(E) are the weights on the edges, in our\nstudy, they are the inverse of the EIG of candidate pairs, i.e. w(E) = 1\nUij\n\n.\n\nA MST Gmst is a subset of the edges of a connected, edge-weighted (un)directed graph that connects\nall the vertices together, without any cycles and with the minimum possible total edge weight. The\ncharacteristics of MST include:\n\n\u2022 If there are n vertices in the graph, then each spanning tree has n \u2212 1 edges.\n\u2022 If each edge has a distinct weight, then there will be only one, unique MST.\n\u2022 If the weights are positive, then a MST is a minimum-cost subgraph connecting all vertices.\n\nThus, MST facilitates the batch mode in real application, the strong connection over all test candidates\nand the maximum sum of information gains of all possible pairs. The pair selection criterion based\non MST method is:\n\n{Ai, Aj} = {Emst | Gmst = (A, Emst)}\n\n(9)\n\nIn this study, we use Prim\u2019s algorithm [36] to \ufb01nd the MST as it is optimal for dense graphs. 
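For concreteness, the MST selection on the EIG-weighted graph can be sketched as below. This is an illustrative re-implementation of Prim's algorithm, not the authors' released code; the `eig` matrix (pairwise EIG values) and the function name are our own assumptions.

```python
import heapq

def mst_pairs(eig):
    """Select the n - 1 pairs forming the MST of the complete graph whose
    edge weights are the inverse EIG, w(i, j) = 1 / U_ij (Prim's algorithm,
    which suits dense graphs). `eig` is a symmetric n x n matrix of
    positive EIG values; returns the MST edges as (i, j) tuples."""
    n = len(eig)
    in_tree = [False] * n
    in_tree[0] = True
    # candidate edges leaving the current tree: (weight, i, j)
    heap = [(1.0 / eig[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while len(edges) < n - 1:
        w, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue  # stale entry: j was already reached by a cheaper edge
        in_tree[j] = True
        edges.append((i, j))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (1.0 / eig[j][k], j, k))
    return edges
```

Because the edge weights are the inverses of the EIGs, the minimum-weight spanning tree maximizes the total information gain of the selected batch while keeping all test candidates connected.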
An\nexample of an undirected weighted graph and its MST is shown in Figure 2.\n\n3.3.3 Threshold setting\n\nIn this section we analyze the performance of the GM and MST methods. Firstly, in GM method, we\ninitialize the pair comparison matrix M by mij = mji = 1, i (cid:54)= j to \ufb01x the resolving issue of BT\nmodel [5]. Then, we design a Monte Carlo simulation experiment, assuming 10, 16, 20 and 40 test\nobjects. The underlying scores are uniformly distributed from 1 to 5, with noise \u0001i \u223c N (0, \u03c32\ni ), \u03c3i is\nuniformly distributed between 0 and 0.7. In a simulated test, if the sampled score ri is higher than rj,\nthen Ai is selected over Aj. We also model the observation errors that might happen in the real test,\ni.e. the subject makes a mistake (inverting the vote) during the test. The probabilities of observation\nerrors are designed as 10%, 20%, 30% and 40%. Therefore, there are in total 16 simulated tests, each\ntest repeats 100 times.\n\n5\n\n\fTo evaluate the aggregation performance of GM and MST, the Pearson Linear Correlation Coef\ufb01cient\n(PLCC) and Kendall\u2019s tau coef\ufb01cient (Kendall) between the designed ground truth scores and the\nMLE scores obtained by BT model are calculated. For easier illustration, in the following section,\nwe de\ufb01ne 1 standard trial number as the total number of comparisons that one observer needs to\ncompare in Full Pair Comparison (FPC), that is, for n objects, 1 standard trial number equals to\nn(n \u2212 1)/2 comparisons.\nBy running Student\u2019s t-test on the performances of GM and MST methods and checking their\nsigni\ufb01cant difference (which one is better), we \ufb01nd that generally, the GM method performs better\nthan the MST method when the standard trial number is less than 1. 
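The simulated trial of Section 3.3.3 can be sketched as follows (our illustrative code; the function and parameter names are assumptions):

```python
import random

def simulated_vote(s_i, s_j, sigma_i, sigma_j, p_err, rng=random):
    """One simulated pairwise trial: the observed scores are r = s plus
    Gaussian noise; the annotator picks the larger observation; with
    probability p_err the vote is inverted (an observation error).
    Returns y_ij = 1 if A_i is chosen over A_j, else 0."""
    r_i = s_i + rng.gauss(0.0, sigma_i)
    r_j = s_j + rng.gauss(0.0, sigma_j)
    y = 1 if r_i > r_j else 0   # ties are a zero-probability event here
    if rng.random() < p_err:    # inverted vote
        y = 1 - y
    return y
```

Each outcome increments m_ij or m_ji in the comparison matrix M before the next BT fit.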
With the increase of the\ncomparison numbers, the MST method performs better than GM method, especially when the\nobservation errors are large.\nTo bene\ufb01t from both GM and MST methods, we decide to develop a hybrid active sampling strategy\nwith 1 standard trial number as the switching threshold, i.e.:\n\n(cid:26)argmaxi(cid:54)=jUij\n\nif(cid:80)\n\n{Ai, Aj} =\n\nEmst\n\ni,j mij \u2264 n(n\u22121)\notherwise\n\n2\n\n(10)\n\nThe whole Hybrid-MST sampling strategy is summed up in Algorithm 1.\n\nAlgorithm 1 Hybrid-MST sampling algorithm\nInput: Current pairwise observation matrix M, Number of test objects n\nOutput: Pairs for next round {Ai, Aj}\n\nfor all possible pairs {Ai, Aj}, i < j do\n\nComputing EIG Uij according to Equation 6\n\ni,j mij \u2264 n(n\u22121)\nSelect the pair {Ai, Aj} which has the maximum Uij\n\nthen\n\n2\n\nFind MST according to Uij, for all i < j;\nSelect the pairs which are the edges of MST, i.e. {Ai, Aj} = Emst.\n\nif(cid:80)\n\nelse\n\nend if\n\nend for\n\n4 Experiments\n\n4.1 Simulated dataset\n\nIn this experiment, the proposed method is compared with the state-of-the-art methods including\nFPC [37], ARD [28], HRRG [38], Crowd-BT [5], and Hodge-active [32]. A Monte Carlo simulation is\nconducted on 60 conditions (stimuli) whose scores are randomly selected from a uniform distribution\non the interval of [1 5]. The assumptions are exactly the same with the experiment that we did in\nSection 3.3.3 and the observation error is set as 10%.\nTo obtain statistically reliable results, the simulation experiment is conducted 100 times. The\nrelationship between the ground truth and the obtained estimated scores are evaluated by Kendall,\nPLCC, and the Root Mean Square Error (RMSE). Results are shown in Figure 3. It should be noted\nthat as the PLCC, Kendall and RMSE values increase/decrease fast and look saturate when the trial\nnumber is large, it is dif\ufb01cult to visually distinguish the performances of different methods. 
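For reference, the switching rule of Equation (10) (Algorithm 1) amounts to only a few lines; here `gm_pair` and `mst_pairs_fn` stand in for the argmax-EIG and MST selections, and the naming is ours, not the authors':

```python
def next_pairs(M, eig, gm_pair, mst_pairs_fn):
    """Hybrid selection: while the number of collected comparisons is at
    most one standard trial number n(n-1)/2, return the single pair with
    maximum EIG (GM mode); afterwards return the n - 1 MST edges for
    parallel labelling (MST mode)."""
    n = len(M)
    total = sum(sum(row) for row in M)
    if total <= n * (n - 1) // 2:
        return [gm_pair(eig)]      # GM: one globally best pair
    return mst_pairs_fn(eig)       # MST: a batch of n - 1 pairs
```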
Thus, in\nthis paper, we rescale the Kendall and PLCC values by Fisher transformation, i.e. y(cid:48) = arctanh(y),\nand the RMSE value by function y(cid:48) = \u2212 1\ny .\n\nQualitative analysis Under the condition that each annotator has a 10% probability that inverses\nthe vote, according to Figure 3, Hodge-active shows the strongest performance than others in ranking\naggregation (Kendall) when the test budget (i.e. the number of comparisons) is small. With the\nincrease of the trial number, the proposed Hybrid-MST method as well as the Crowd-BT shows\ncomparable performance with Hodge-active. Regarding rating aggregation (PLCC and RMSE), the\nproposed Hybrid-MST method performs signi\ufb01cantly better than the others except for that when the\n\n6\n\n\fFigure 3: Monte Carlo simulation results. The color area represents 95% con\ufb01dence intervals of\nthe corresponding evaluated methods over 100 repetitions. For better visualization, the Kendall and\nPLCC are rescaled using Fisher transformation. RMSE is rescaled using y(cid:48) = \u2212 1\ny .\n\ntrial number is small, i.e. less than 2 or 3, the Hodge-active performs slightly better than Hybrid-\nMST. Crowd-BT shows similar performance with ARD in rating aggregation, which is lower than\nHybrid-MST and Hodge-active but higher than HRRG.\n\nSaving budget compared to FPC Following ITU-R BT.500 [39] and ITU-T P.910 [37], 15 stan-\ndard trial number (i.e. 15 annotators to compare all n(n \u2212 1)/2 pairs) is the minimum requirement\nfor FPC to generate reliable results. In this part, we compare how much budget can be saved by active\nsampling methods, i.e. Hybrid-MST, Hodge-active, and Crowd-BT. 
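The visualization rescaling described above is simply (helper name ours):

```python
import math

def rescale(kendall, plcc, rmse):
    """Rescale for plotting: Kendall and PLCC via the Fisher
    transformation y' = arctanh(y); RMSE via y' = -1/y."""
    return math.atanh(kendall), math.atanh(plcc), -1.0 / rmse
```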
The mean of Kendall, PLCC and\nRMSE are used in a way that if D pairwise comparisons in Hybrid-MST/Hodge-active/Crowd-BT\ncould achieve the same precision as the FPC with 15 standard trial numbers, the saving budget Bs is:\n\n(cid:32)\n\n(cid:33)\n\nBs =\n\n1 \u2212\n\nD\nn(n\u22121)\n\n2 \u00d7 15\n\n\u00d7 100%\n\n(11)\n\nThe obtained Bs for Kendall, PLCC and RMSE are 77.11%, 74.89% and 74.89% for Hybrid-MST,\nand 84.57%, 68.61%, 71.65% for Hodge-active, respectively. Crowd-BT only has Bs value for\nKendall, which is 78.43%, as it needs more trial number to achieve the same FPC precision in PLCC\nand RMSE, which does not save budget.\n\nComputational cost To evaluate the computational cost of each sampling method, the same Monte\nCarlo simulation test is conducted for n = 10, 20 and 100. The averaged time cost (milliseconds/pair)\nover 100 repetitions for each method is shown in Table 1. All computations are done using MATLAB\nR2014b on a MacBook Pro laptop, with 2.5GHz Intel Core i5, 8GB memory.\nFPC is the simplest method without any learning process and therefore it is with the highest computa-\ntionally ef\ufb01ciency. Besides, ARD, HRRG and Hodge-active also show their advantages in runtime.\nCrowd-BT shows similar runtime with our Hybrid-MST in GM mode. When Hybrid-MST is in MST\nmode, the runtime is approximately n times more ef\ufb01cient than Crowd-BT and GM method. It should\nbe noted that our proposed Hybrid-MST method only uses the GM method in the \ufb01rst standard trial\n(which can be easily reached in large-scale crowdsourcing labeling experiment) and then switches\nto the MST method, thus, in real application, our sampling strategy in most cases is in MST mode,\nwhich is much faster than Crowd-BT. 
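The saving-budget computation of Equation (11) is straightforward (function name ours):

```python
def saving_budget(D, n, standard_trials=15):
    """Equation (11): percentage of comparisons saved when D comparisons
    reach the precision of Full Pair Comparison with `standard_trials`
    standard trial numbers, i.e. standard_trials * n(n-1)/2 comparisons."""
    full = n * (n - 1) / 2 * standard_trials
    return (1.0 - D / full) * 100.0
```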
Nevertheless, all runtimes are in a feasible range, even for large\nnumber of conditions and our unoptimized code (where the calculation of EIG for all pairs can be\nexecuted in parallel).\n\nTable 1: Runtime comparison on simulated data (ms/pair)\n\nn\n10\n20\n100\n\nFPC ARD HRRG Crowd-BT Hodge-active\n0.11\n0.10\n0.10\n\n85.69\n188.56\n3033.02\n\n0.34\n0.22\n0.65\n\n1.24\n0.62\n0.16\n\n0.38\n0.34\n0.65\n\nHybrid-MST\nGM\n48.72\n153.61\n3007.08\n\nMST\n6.16\n8.97\n30.04\n\nTo demonstrate the superiority of batch-mode sampling in real applications, we take a typical VQA\nexperiment as an example (which also holds for player matching system, recommendation system,\netc.). The typical presentation structure of sequential sampling methods (HRRG, Crowd-BT, Hodge-\nactive, GM) for one pair comparison procedure is: pair presentation time (T 1) + annotator\u2019s voting\ntime (T 2) + runtime of pairwise sampling algorithm (T 3), where T 1 and T 2 are generally in total\n15 seconds, T 3 is determined by the used algorithm. Sequential sampling methods cannot generate\n\n7\n\n\fFigure 4: Performances of different sampling methods on VQA dataset. Color area represents\n95% con\ufb01dence intervals over 100 times iterations. For better visualization, Kendall and PLCC are\nrescaled using Fisher transformation. RMSE is rescaled using y(cid:48) = \u2212 1\ny .\na new optimal pair of objects to compare until the annotator is done with the previous pair. This\nintroduces unacceptable delay in the system if multiple annotators work at the same time.\nIn contrast, the batch-based Hybrid-MST (in MST mode) can generate multiple pairs, which can be\nworked on in parallel by multiple annotators. Ideally (annotators work synchronously), the whole\nprocedure for n\u2212 1 pairs needs T 1 + T 2 + T 3 seconds. While in the worst case, the annotators work\none after the other (just like in sequential method), which needs T 1 + T 2 + T 3 seconds for only\none pair. 
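Under this timing model, the wall-clock cost of collecting n - 1 comparisons can be sketched as follows (T1 + T2 is about 15 s per the text; the default T3 value here is an illustrative assumption, not a measured figure):

```python
def collection_time(n, t_vote=15.0, t_algo=0.03, batch=False):
    """Time (seconds) to collect n - 1 pairwise votes. Sequential
    samplers pay (T1 + T2 + T3) per pair; the batch MST mode emits
    n - 1 pairs at once, so ideally (synchronous annotators) the whole
    batch costs a single (T1 + T2 + T3)."""
    per_pair = t_vote + t_algo        # T1 + T2 + T3
    return per_pair if batch else (n - 1) * per_pair
```

This is why the batch mode keeps the total time nearly flat in n, while sequential sampling grows linearly.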
To make a comparison, the time cost of a whole pairwise comparison procedure including\nstimuli presentation time and voting time in a typical VQA experiment is shown in Table 2, which\ndemonstrates that our method Hybrid-MST is particularly applicable in large-scale crowdsourcing\nexperiment.\nTable 2: Time cost (seconds) of comparing n \u2212 1 pairs in a typical VQA pair comparison experiment\n(T 1 + T 2 + T 3)\n\nn\n10\n20\n100\n\nCrowd-BT Hodge-active\n\n135.8\n288.6\n1782.0\n\n135.0\n285.0\n1485.1\n\n4.2 Real-world datasets\n\nGM MST(ideal case) MST (the worst case)\n135.4\n287.8\n1782.0\n\n135.1\n285.2\n1487.9\n\nHybrid-MST\n\n15.1\n15.2\n17.9\n\nIn this session, we compare our proposed Hybrid-MST with the state-of-the-art active learning\nmethods, Crowd-BT [5] and Hodge-active [32]. For statistical reliability, each method is conducted\n100 times. Two real-world datasets are used. Details are shown below.\n\nVideo Quality Assessment(VQA) dataset This VQA dataset is a complete and balanced pairwise\ndataset from [38]. It contains 38400 pairwise comparisons for video quality assessment of 10\nreferences from LIVE database [40]. Each reference contains 16 different types of distortions. 209\nannotators attend this test.\nImage Quality Assessment (IQA) dataset This IQA dataset is a complete but imbalanced dataset\nfrom [26]. It contains 43266 pairwise comparison data for quality assessment of 15 references from\n\n8\n\n\fFigure 5: Performances of different sampling methods on IQA dataset. Color area represents 95%\ncon\ufb01dence intervals over 100 times iterations. For better visualization, Kendall and PLCC are\nrescaled using Fisher transformation. RMSE is rescaled using y(cid:48) = \u2212 1\ny .\nLIVE 2008 [41] and IVC 2005 [42] database. 
Each reference contains 16 different types of distortions.\n328 annotators from Internet attend the test.\nAs there is no ground truth for the real-world dataset, we consider the results obtained by all observers\nas ground truth. Again, Kendall, PLCC and RMSE are used as the evaluation methods. Due to the\nlimitation of spaces, part of the results are shown in Figure 4 and 5.\nIn the real-world datasets where the annotator\u2019s labelings are much more noisy and diverse than our\nsimulated condition, the proposed Hybrid-MST shows higher robustness to these noisy labelling than\nothers. Regarding the ranking aggregation ability (Kendall), though Hodge-active still shows a bit\nstronger performance in ranking aggregation than Hybrid-MST when the trial number is small, it is\nnot as much as in the simulated data. With the increase of the test budget, Hybrid-MST performs\ncomparable or even better than Hodge-active. They both outperform Crowd-BT. Regarding the\nrating aggregation (PLCC and RMSE), Hybrid-MST always outperforms the others signi\ufb01cantly.\nHodge-active performs similar with Crowd-BT in VQA dataset, but much better than Crowd-BT in\nIQA dataset.\nBoth simulated and real-world experiments demonstrate that when the test budget is limited (2-3\nstandard trial numbers) and the objective is ranking aggregation, i.e. we care more about the rank\norder of the test candidates rather than their underlying scores, using Hodge-active is safer than\nHybrid-MST. In all other conditions, Hybrid-MST is de\ufb01nitely more applicable considering both the\naggregation accuracy and batch-mode execution.\n\n5 Conclusions\n\nIn this paper, we present an active sampling strategy called Hybrid-MST for pairwise preference\naggregation. 
We de\ufb01ne the EIG based on local KLD where Bayes\u2019 theorem is adopted for \ufb01nding\nthe tractable computation form and Gaussian-Hermite quadrature is utilized for ef\ufb01cient estimation.\nPair sampling is a hybrid strategy which takes advantages of both GM method and MST method,\nallowing for better ranking and rating aggregation in small and large trial number conditions. In both\nsimulated experiment and the real-world VQA and IQA datasets, Hybrid-MST shows its outstanding\naggregation ability. In addition, in crowdsourcing platform, the proposed batch-mode MST method\ncould boost the pairwise comparison procedure signi\ufb01cantly by parallel labeling.\n\n9\n\n\fReferences\n[1] R. Herbrich, T. Minka, and T. Graepel, \u201cTrueskillTM: a bayesian skill rating system,\u201d in\n\nAdvances in neural information processing systems, 2007, pp. 569\u2013576.\n\n[2] J. Li, M. Barkowsky, and P. Le Callet, \u201cAnalysis and improvement of a paired comparison\nmethod in the application of 3DTV subjective experiment,\u201d International Conference on Image\nProcessing, pp. 629\u2013632, Sep. 2012.\n\n[3] R. Bradley and M. Terry, \u201cRank analysis of incomplete block designs: I. The method of paired\n\ncomparisons,\u201d Biometrika, vol. 39, no. 3/4, pp. 324\u2013345, Dec. 1952.\n\n[4] D. V. Lindley, \u201cOn a measure of the information provided by an experiment,\u201d The Annals of\n\nMathematical Statistics, pp. 986\u20131005, 1956.\n\n[5] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz, \u201cPairwise ranking aggregation\nin a crowdsourced setting,\u201d in Proceedings of the sixth ACM international conference on Web\nsearch and data mining. ACM, 2013, pp. 193\u2013202.\n\n[6] S. Negahban, S. Oh, and D. Shah, \u201cIterative ranking from pair-wise comparisons,\u201d in Advances\n\nin neural information processing systems, 2012, pp. 2474\u20132482.\n\n[7] J. Li, M. Barkowsky, J. Wang, and P. 
Le Callet, "Exploring the effects of subjective methodology on assessing visual discomfort in immersive multimedia," IS&T Electronic Imaging, Human Vision and Electronic Imaging, Jan. 2018.

[8] P. Le Callet, S. Möller, and A. Perkis, "Qualinet white paper on definitions of quality of experience v.1.1," European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Jun. 2012.

[9] N. Ailon, "Reconciling real scores with binary comparisons: A new logistic based model for ranking," in Advances in Neural Information Processing Systems, 2009, pp. 25–32.

[10] P. Emerson, "The original Borda count and partial voting," Social Choice and Welfare, vol. 40, no. 2, pp. 353–358, 2013.

[11] R. L. Plackett, "The analysis of permutations," Applied Statistics, pp. 193–202, 1975.

[12] R. D. Luce, Individual Choice Behavior: A Theoretical Analysis. Courier Corporation, 2005.

[13] C. L. Mallows, "Non-null ranking models. I," Biometrika, vol. 44, no. 1/2, pp. 114–130, 1957.

[14] L. Thurstone, "A law of comparative judgment," Psychological Review, vol. 34, no. 4, pp. 273–286, 1927.

[15] H. Azari, D. Parks, and L. Xia, "Random utility theory for social choice," in Advances in Neural Information Processing Systems, 2012, pp. 126–134.

[16] T. Lu and C. Boutilier, "Learning Mallows models with pairwise preferences," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 145–152.

[17] H. A. Soufiani, W. Chen, D. C. Parkes, and L. Xia, "Generalized method-of-moments for rank aggregation," in Advances in Neural Information Processing Systems, 2013, pp. 2706–2714.

[18] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences," Journal of Machine Learning Research, vol. 4, no. Nov, pp. 933–969, 2003.

[19] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. J. Wainwright, "Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2049–2095, 2016.

[20] K. Crammer and Y. Singer, "Pranking with ranking," in Advances in Neural Information Processing Systems, 2002, pp. 641–647.

[21] T. Qin, X. Geng, and T.-Y. Liu, "A new probabilistic model for rank aggregation," in Advances in Neural Information Processing Systems, 2010, pp. 1948–1956.

[22] P. Dangauthier, R. Herbrich, T. Minka, and T. Graepel, "TrueSkill through time: Revisiting the history of chess," in Advances in Neural Information Processing Systems, 2008, pp. 337–344.

[23] C. Cortes, M. Mohri, and A. Rastogi, "Magnitude-preserving ranking algorithms," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 169–176.

[24] F. Wauthier, M. Jordan, and N. Jojic, "Efficient ranking from pairwise comparisons," in International Conference on Machine Learning, 2013, pp. 109–117.

[25] O. Dykstra, "Rank analysis of incomplete block designs: A method of paired comparisons employing unequal repetitions on pairs," Biometrics, vol. 16, no. 2, pp. 176–188, Jun. 1960.

[26] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, "HodgeRank on random graphs for subjective video quality assessment," IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 844–857, 2012.

[27] D. A. Silverstein and J. E. Farrell, "Quantifying perceptual image quality," Proc. IS&T Image Processing, Image Quality, Image Capture, Systems Conference, vol. 1, pp. 242–246, May 1998.

[28] J. Li, M. Barkowsky, and P. Le Callet, "Boosting paired comparison methodology in measuring visual discomfort of 3DTV: performances of three different designs," IS&T/SPIE Electronic Imaging, Feb. 2013.

[29] ——, "Subjective assessment methodology for preference of experience in 3DTV," in IVMSP Workshop, 2013 IEEE 11th. IEEE, 2013, pp. 1–4.

[30] K. G. Jamieson and R. Nowak, "Active ranking using pairwise comparisons," in Advances in Neural Information Processing Systems, 2011, pp. 2240–2248.

[31] T. Pfeiffer, X. A. Gao, Y. Chen, A. Mao, and D. G. Rand, "Adaptive polling for information aggregation," in AAAI, 2012.

[32] Q. Xu, J. Xiong, X. Chen, Q. Huang, and Y. Yao, "HodgeRank with information maximization for crowdsourced pairwise ranking aggregation," in AAAI, 2018.

[33] R. A. Bradley, "Rank analysis of incomplete block designs: III. Some large-sample results on estimation and power for a method of paired comparisons," Biometrika, vol. 42, no. 3/4, pp. 450–470, 1955.

[34] P. J. Davis and P. Rabinowitz, Methods of Numerical Integration. Courier Corporation, 2007.

[35] P. Ye and D. Doermann, "Active sampling for subjective image quality assessment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4249–4256.

[36] R. C. Prim, "Shortest connection networks and some generalizations," Bell Labs Technical Journal, vol. 36, no. 6, pp. 1389–1401, 1957.

[37] ITU-T P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, Apr. 2008.

[38] Q. Xu, T. Jiang, Y. Yao, Q. Huang, B. Yan, and W. Lin, "Random partial paired comparison for subjective video quality assessment via HodgeRank," in Proceedings of the 19th ACM International Conference on Multimedia. ACM, 2011, pp. 393–402.

[39] ITU-R BT.500-13, "Methodology for the subjective assessment of the quality of television pictures," International Telecommunication Union, Geneva, Switzerland, Jan. 2012.

[40] "LIVE video quality assessment database," http://live.ece.utexas.edu/research/quality/live_video.html.

[41] H. Sheikh, Z. Wang, L. Cormack, and A. Bovik, "LIVE image quality assessment database release 2," http://live.ece.utexas.edu/research/quality.

[42] P. Le Callet and F. Autrusseau, "Subjective quality assessment IRCCyN/IVC database," 2005, http://www.irccyn.ec-nantes.fr/ivcdb/.