{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Shapley kernel is the sample weight given to each binary vector $z'\\in \\{0,1\\}^M$:\n",
    "\n",
    "$$\n",
    "k(z') = k(M, s) = \\frac{M-1}{(M~choose~s)s(M - s)}\n",
    "$$\n",
    "where $s = |z'|$, the number of ones in $z'$.\n",
    "\n",
    "Let $X$ be the matrix of all possible binary vectors of length $M$ with $2^M$ rows and $M$ columns. We use the Shapley kernel to compute the Shapley values using weighted linear regression:\n",
    "\n",
    "$$\n",
    "\\phi = (X^T W X)^{-1}X^T W y\n",
    "$$\n",
    "\n",
    "where $W$ is a diagonal matrix with the Shapley kernel weights for each row of $X$, and the $y_i = f_x(S_i)$ values are the function outputs for each row of $X$ (where $S_i$ is the set of ones in $X_{i,*}$). Note that $k(M, 0) = k(M, M) = \\infty$, so $W$ is infinity for the all zero row of $X$ and the row of all ones. However, if we set these infinite weights to a large constant, then $X^T W X = \\frac{1}{M-1}I + cJ$ for some positive constant $c$ (where $I$ is the identity matrix and $J$ is the matrix of all ones). As $c \\to \\infty$ the inverted form becomes $(X^T W X)^{-1} = I + \\frac{1}{M-1}(I - J)$\n",
    "\n",
    "The term $X^T W$ is a matrix where all the ones in $X$ have been replaced with $k(M, s)$, where $s$ is the number of ones in that row of $X$. Multiplying $X^T W$ by $(X^T W X)^{-1}$ creates a matrix of weights to apply to the function outputs in $y$. If we only consider the Shapley value of a single feature $\\phi_j$, then we only need to consider a single row of this $2^M \\times M$ matrix, which is equivelent to only using the $j$'th row of $(X^T W X)^{-1}$. When we do this we see that the value of the weight for row $i$ is\n",
    "\n",
    "\\begin{align}\n",
    "k(M, s_i)[\\mathbf{1}_{X_{i,j} = 1}-\\frac{(s_i-\\mathbf{1}_{X_{i,j} = 1})}{M-1}] &= \n",
    "\\frac{M-1}{(M~choose~s_i)s_i(M - s_i)}\\mathbf{1}_{X_{i,j} = 1}-\\frac{(s_i-\\mathbf{1}_{X_{i,j} = 1})}{(M~choose~s_i)s_i(M - s_i)} \\\\\n",
    "&= \\frac{(M-1)(M-s_i)!s_i!}{M!s_i(M - s_i)}\\mathbf{1}_{X_{i,j} = 1}-\\frac{(s_i-\\mathbf{1}_{X_{i,j} = 1})(M-s_i)!s_i!}{M!s_i(M - s_i)} \\\\\n",
    "&= \\frac{(M-1)(M-s_i-1)!(s_i-1)!}{M!}\\mathbf{1}_{X_{i,j} = 1}-\\frac{(s_i-\\mathbf{1}_{X_{i,j} = 1})(M-s_i-1)!(s_i-1)!}{M!} \\\\\n",
    "&= \\frac{(M-s_i-1)!(s_i-1)!}{M!}[(M-1)\\mathbf{1}_{X_{i,j} = 1}-(s_i-\\mathbf{1}_{X_{i,j} = 1})]\n",
    "\\end{align}\n",
    "\n",
    "\n",
    "where $s_i$ is the number of ones in the $i$'th row of $X$, and $\\mathbf{1}_{X_{i,j} = 1}$ is one if $X_{i,j} = 1$ and zero otherwise. When $\\mathbf{1}_{X_{i,j} = 1} = 0$ we get\n",
    "\n",
    "$$\n",
    "-\\frac{(M-s_i-1)!s_i!}{M!}\n",
    "$$\n",
    "\n",
    "When $\\mathbf{1}_{X_{i,j} = 1} = 1$ we get\n",
    "\n",
    "$$\n",
    "\\frac{(M-s_i-1)!(s_i-1)!}{M!}[(M-1)-(s_i-1)] = \\frac{(M-s_i-2)!(s_i-1)!}{M!}\n",
    "$$\n",
    "\n",
    "Taking the dot product of these values with $y$ leads to the following equation\n",
    "\n",
    "$$\n",
    "\\phi_j = \\sum_{S \\subseteq N \\setminus j} \\frac{(M-s_i-1)!s_i!}{M!} [f_x(S \\cup \\{i\\}) - f_x(S)]\n",
    "$$\n",
    "\n",
    "which is a classic form of estimating the Shapley value $\\phi_j$ ($N$ is the set of all features)."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.5.0",
   "language": "julia",
   "name": "julia-0.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
