\section{Tracking}
\label{sec:tracking}

In this section, we outline the dynamic programming tracker of \cite{vondrick2010}. We will extend it in Section \ref{sec:active} to construct an efficient active learning algorithm. We begin by describing a method for tracking a single object, given a sparse set of key frame bounding-box annotations. As in \cite{vondrick2010}, we use a visual tracker to interpolate the annotations for the unlabeled in-between frames.
\ignore{
As manually annotating every video frame is infeasible, modern video annotation
protocols instead rely on workers to label key frames and use some form of
interpolation to recover the in-between labels. Since we wish our approach to
be general for all types of videos, we cannot make any assumptions on the
characteristics of the motion. Therefore, given a sparse set of annotations for
a particular object, we will use a visual tracker to recover the annotations
for the unlabeled frames. We use the dynamic-programming tracker of \cite{vondrick2010}. We outline it in this section since we extend it in Section \ref{sec:active} to construct an efficient active learning algorithm.
}
We define $b_t^i$ to be a bounding box at frame $t$ at pixel position $i$. Let
$\zeta$ be the non-empty set of worker annotations, represented as a set of bounding boxes.  Without
loss of generality, assume that all paths are on the interval $0 \le t \le T$.

\subsection{Discriminative Object Templates}

We build a discriminative visual model of the object in order to predict its
location. For every bounding box annotation in $\zeta$, we extract its associated image patch and resize it to the average size in the set. We then extract both histogram of oriented
gradients (HOG) \cite{hog} and color features: $ \phi_n(b_n) = \begin{bmatrix}
HOG & RGB \end{bmatrix}^T$ where RGB are the means and covariances of the color
channels. When trained with a linear classifier, these color features
are able to learn a quadratic decision boundary in RGB-space. In our
experiments, we used a HOG bin size of either 4 or 8 depending on the size of
the object.

We then learn a model trained to discriminate the object against the the background. 
For every annotated frame, we extract an \emph{extremely} large set of negative
bounding boxes that do not significantly overlap with the positive instances.
Given a set of features $b_n$ with labels $y_n \in \{-1, 1\}$ classifying them as
positive or negative, we train a linear SVM by minimizing the loss
function:
\begin{align} w^* = \argmin \frac{1}{2} w \cdot w + C \sum_n^N \max(0,1 - y_n w
\cdot \phi_n(b_n))  \label{eqn:svm} \end{align}
We use liblinear \cite{fan2008liblinear} in our experiments. Training typically took only a
few seconds. 

\subsection{Motion Model}

In order to score a putative interpolated path $b_{0:T} = \{b_0 \dots b_T\}$, we define the energy function $E(b_{0:T})$ comprised of both unary and pairwise terms:
\begin{align}
E(b_{0:T}) &= \sum_{t=0}^T U_t(b_t) + S(b_t,b_{t-1}) \label{eqn:tracking}\\
U_t(b_t) &= \min\left(-w \cdot \phi_t(b_t), \alpha_1\right), \quad
S(b_t,b_{t-1}) = \alpha_2 ||b_t - b_{t-1}||^2 \label{eqn:spring}
\end{align}
where $U_t(b_t)$ is the local match cost and $S_t(b_t, b_{t-1})$ is the
pairwise spring. $U_t(b_t)$ scores how well a particular $b_t$ matches
against the learned appearance model $w$, but truncated by $\alpha_1$ so as to
reduce the penalty when the object undergoes an occlusion. We are able to efficiently compute the dot product $w \cdot \phi_t(b_t)$ using integral images on the RGB weights \cite{crow1984summed}. $S_t(b_t, b_{t-1})$ favors smooth motion and prevents the tracked object
from teleporting across the scene.

\subsection{Efficient Optimization}

We can recover the missing annotations by computing the optimal path as given
by the energy function.  We find the least cost path $b_{0:T}^*$ over the exponential set of all possible paths:
\begin{align}
b_{0:T}^* = \argmin_{b_{0:T}} E(b_{0:T}) \quad
s.t. \quad b_t = b_t^i \quad \forall b_t^i \in \zeta \label{eqn:leastcost}
\end{align}
subject to the constraint that the path crosses through the annotations labeled
by the worker in $\zeta$. We note that these constraints can be
removed by simply redefining $U_t(b_t) = \infty \quad \forall b_t \neq b_t^i$.

A naive approach to minimizing (\ref{eqn:leastcost}) would take $O(K^T)$ for
$K$ locations per frame. However, we can efficiently solve the above
problem in $O(TK^2)$ by using dynamic programming through a forward pass recursion \cite{bellman1954some}:
\begin{align}
C_0^\rightarrow(b_0) &= U_0(b_0) \nonumber\\
C_t^\rightarrow(b_t) &= U_t(b_t) + \min_{b_{t-1}}
C_{t-1}^\rightarrow(b_{t-1}) +
S(b_t,b_{t-1}) \label{eqn:costrecursion}\\
\pi_t^\rightarrow(b_t) &= \argmin_{b_{t-1}} C_{t-1}^\rightarrow(b_{t-1}) +
S(b_t,b_{t-1}) \label{eqn:costptr}
\end{align}
By storing the pointers in (\ref{eqn:costptr}), we are able to
reconstruct the least cost path by backtracking from the last frame $T$. We note that we can further reduce this computation to
$O(TK)$ by applying distance transform speed ups to the pairwise term in (\ref{eqn:spring})
\cite{felzenszwalb2004distance}.

