%\begin{SCfigure}[tb]
%\caption{foobar}
%\label{fig:straightslide}
%\end{SCfigure}
%

\begin{figure}[tb]
\subfloat[One click: Initial frame only]{
\includegraphics[width=0.5\textwidth]{figs/bounce-toy-1click.jpg}
}
\subfloat[Two clicks: Initial and requested frame]{
\includegraphics[width=0.5\textwidth]{figs/bounce-toy-2click.jpg}
}

\begin{center}
\subfloat[Identical objects.]{
\includegraphics[width=0.24\textwidth]{figs/synthetic.jpg}
}
\subfloat[About to intersect.]{
\includegraphics[width=0.24\textwidth]{figs/synthetic-meeting.jpg}
}
\subfloat[Intersection point.]{
\includegraphics[width=0.24\textwidth]{figs/synthetic-intersection.jpg}
}
\subfloat[After intersection.]{
\includegraphics[width=0.24\textwidth]{figs/synthetic-confusion.jpg}
}
\end{center}


\caption{We consider a synthetic video of two nearly identical rectangles
rotating around a point---one clockwise and the other counterclockwise. The
rectangles intersect every 20 frames, at which point the tracker does not know
which direction the true rectangle is following. Did they bounce or pass
through?  (\textbf{a}) Our framework realizes the ambiguity can be resolved by
requesting annotations when they do not intersect. Due to the periodic motion,
a fixed rate tracker may request annotations at the intersection points,
resulting in wasted clicks. The expected label change plateaus because every
point along the maximas provide the same amount of disambiguating information.
(\textbf{b}) Once the requested frame is annotated, that corresponding segment
is resolved, but the others remain ambiguous. In this example, our framework
can determine the true path for a particular rectangle in only 7 clicks, while
a fixed rate tracker may require 13 clicks.}

\label{fig:bounce}
\end{figure}

\begin{figure}
\begin{minipage}[t]{0.49\textwidth}

\centering
%\subfloat[Probability map]{
\includegraphics[width=\textwidth]{figs/straightprob.jpg}
%}
%\subfloat[Translation]{
%\includegraphics[width=0.19\textwidth]{figs/straightexample.jpg}
%}

\caption{Consider two identical rectangles that translate, but never intersect.
Although both objects have the same appearance, our framework does not query
for new annotations because the pairwise cost has made it unlikely that the two
objects switch identities, indicated by a single mode in the probability map.
A probability exclusively using unary terms would be bimodal.}

\label{fig:straightslide}

\end{minipage}
\hfill
\begin{minipage}[t]{0.49\textwidth}

\centering
\includegraphics[width=\textwidth]{figs/clutteredbackground.jpg}

\caption{Consider a white rectangle moving on a white background. Since it is
impossible to distuingish the foreground from the background, our framework
will query for the midpoint and gracefully degrade to a fixed rate labeling.
If the object is extremely difficult to localize, the active learner will
automatically decide the optimal annotation strategy is to use fixed rate key
frames.}

\label{fig:cluttered}

\end{minipage}
\end{figure}

\begin{figure}[tb]
\subfloat[One click: Initial frame only]{
\includegraphics[width=0.5\textwidth]{figs/yi-1click.jpg}
}
\subfloat[Two clicks: Initial and requested frame]{
\includegraphics[width=0.5\textwidth]{figs/yi-2clicks.jpg}
}

\subfloat[Training]{
\includegraphics[width=0.23\textwidth]{figs/yi-training.jpg}
}
\subfloat[Walking, Yes Jacket]{
\includegraphics[width=0.23\textwidth]{figs/yi-walking.jpg}
}
\subfloat[Taking Off Jacket]{
\includegraphics[width=0.23\textwidth]{figs/yi-takeoff.jpg}
}
\subfloat[Walking, No Jacket]{
\includegraphics[width=0.23\textwidth]{figs/yi-blue.jpg}
}
%\subfloat[Crouching, No Jacket]{
%\includegraphics[width=0.20\textwidth]{figs/yi-crouch.jpg}
%}
%\subfloat[Putting On Jacket]{
%\includegraphics[width=0.166\textwidth]{figs/yi-backon.jpg}
%}

\caption{We analyze a video of a man who takes off a jacket and changes his
pose. A tracker trained only on the initial frame will lose the object when his
appearance changes. Our framework is able to determine which additional frame
the user should annotate in order to resolve the track. (\textbf{a}) Our
framework does not expect any significant label change when the person is
wearing the same jacket as in the training frame (black curve).  But, when the
jacket is removed and the person changes his pose (colorful curves), the
tracker cannot localize the object and our framework queries for an additional
annotation. (\textbf{b}) After annotating the requested frame, the tracker
learns the color of the person's shirt and gains confidence in its track
estimate.  A fixed rate tracker may pick a frame where the person is still
wearing the jacket, resulting in a wasted click.  (\textbf{c-f}) The green box
is the predicted path with one click and red box is with two clicks. If there
is no green box, it is the same as the red.}

\label{fig:yi}
\end{figure}

\begin{figure}[tb]
\subfloat[One click: Initial frame only]{
\includegraphics[width=0.5\textwidth]{figs/occlusion-1click.jpg}
}
\subfloat[Two clicks: Initial and requested frame]{
\includegraphics[width=0.5\textwidth]{figs/occlusion-2click.jpg}
}

\subfloat[Training image]{
\includegraphics[width=0.24\textwidth]{figs/occlusion1.jpg}
}
\subfloat[Entering occlusion]{
\includegraphics[width=0.24\textwidth]{figs/occlusion2.jpg}
}
\subfloat[Total occlusion]{
\includegraphics[width=0.24\textwidth]{figs/occlusion3.jpg}
}
%\subfloat[Reappearing]{
%\includegraphics[width=0.19\textwidth]{figs/occlusion4.jpg}
%}
\subfloat[After occlusion]{
\includegraphics[width=0.24\textwidth]{figs/occlusion5.jpg}
}

\caption{We investigate a car from \cite{SangminCVPR2011} that undergoes a
total occlusion and later reappears. The tracker is able to localize the car
until it enters the occlusion, but it cannot recover when the car reappears.
(\textbf{a}) Our framework expects a large label change during the occlusion
and when the object is lost. The largest label change occurs when the object
begins to reappear because this frame would lock the tracker back onto the
correct path. (\textbf{b}) When the tracker receives the requested annotation,
it is able to recover from the occlusion, but it is still confused when the
object is not visible.}

\label{fig:occlusion}
\end{figure}

\begin{figure}[tb]
\subfloat[Initial frame]{
\includegraphics[width=0.24\textwidth]{figs/stationary1.jpg}
}
\subfloat[Rotation]{
\includegraphics[width=0.24\textwidth]{figs/stationary2.jpg}
}
\subfloat[Scale]{
\includegraphics[width=0.24\textwidth]{figs/stationary3.jpg}
}
\subfloat[Estimated]{
\includegraphics[width=0.24\textwidth]{figs/stationary4.jpg}
}

\caption{We examine situations where there are many easy-to-localize objects
(e.g., stationary objects) and only a few difficult instances.  In this
example, red boxes were manually annotated and black boxes are automatically
estimated.  Our framework realizes that the stationary objects are not likely
to change their label, so it focuses annotations on moving objects.}

\label{fig:stationary}
\end{figure}

\section{Active Learning}
\label{sec:active}
Let $\text{curr}_{0:T}$ be the current best estimate for the path
given a set of user annotations $\zeta$. We wish to compute which frame the user should
annotate next $t^*$. In the ideal case, if we had knowledge of the
ground-truth path $b^{gt}_{0:T}$, we should select the frame $t$, that
when annotated with $b^{gt}_t$, would produce a new estimated path
closest to the ground-truth. Let us write $\text{next}_{0:T}(b^{gt}_t)$
for the estimated track given the augmented constraint set $\zeta' = \zeta
\cup b_t^{gt}$. The optimal next frame is:
\begin{align}
%t^* &= \argmin_{0 \le t \le T} \; \argmin_{b_{0:T}} E^*(b_{0:T}) \quad s.t. \quad O(t) \in b_{0:T}
%t^* = \argmin_{{0 \le t \le T}} \min_{b_{0:T}} E(b_{0:T}) \quad
%s.t. \quad b_t = b_t^{gt}
t^{opt} = \argmin_{{0 \le t \le T}} \sum_{j=0}^T \text{err}(b^{gt}_j,\text{next}_j(b^{gt}_t))
\label{eqn:optimalactive}
\end{align}
where $err$ could be squared error or a thresholded overlap
(in which $err$ evaluates to 0 or 1 depending upon if the two
locations sufficiently overlap or not). Unfortunately, we cannot
directly compute  (\ref{eqn:optimalactive}) since we
do not know the true labels ahead of time.
%where $b_t^{gt}$ is the ground truth location of an object at frame
%$t$. In other words, we wish to determine the frame that, once
%annotated with the true object location, would
%result in the new lowest cost path. We believe this is a reasonable
%strategy because we expect the ground-truth path $b_{0:T}^{gt}$ to
%(approximately) minimize the tracking energy function \eqref{eqn:tracking}.
%Indeed, solving this equation would result
%in an optimal active learning strategy for video annotation. 
%But, we cannot directly compute Eqn.\ref{eqn:optimalactive} since we%
%do not know the true labels ahead of time. Instead, we will
%approximate it by determining which frame we expect will change the
%current labeling the most. \deva{ We could start with a selection
%  strategy that selects a frame, that when annotated, reduced the
%  error between the estimated track and the ground-truth track
%  $b_{0:T}^{gt}$. I think this is what you originally had in mind
%  here. Given space, that would be a nice flow of ideas.}

\subsection{Maximum Expected Label Change (ELC)}
%Let $\text{curr}_{0:T}$ be the current estimated track given the current set
%of user annotations $\zeta$ (Eqn \ref{eqn:leastcost}), and let $\text{next}_{0:T}(b_t^i)$ be the
%estimated track given the additional constraint that $\zeta' = \zeta
%\cup \{b_t = b_t^i\}$. 
We make two simplifying assumptions to implement the previous ideal
selection strategy, inspired by the popular maximum expected gradient length
(EGL) algorithm for active learning~\cite{settlestr09} (which selects
an example so as to maximize the expected change in a learned model).
First, we change the minimization to a maximization and replace the ground-truth error with the change in track
label: $err(b_j^{gt},\text{next}_j(b_t^{gt})) \Rightarrow err(\text{curr}_j,\text{next}_j(b_t^{gt}))$. Intuitively, if we make a large change in the estimated track, we are likely to be taking a large step toward the ground-truth solution. However, this requires knowing the ground-truth location $b_t^{gt}$. We
make the second assumption that we have access to an accurate
estimate of $P(b_t^i)$, which is the probability that, if we show the user frame $t$, then they will annotate a particular location $i$.  We can use this distribution to compute an expected change
in track label:
\begin{align}
t^* = \argmax_{0 \le t \le T} \sum_{i = 0}^K P(b_t^i) \cdot \Delta
I(b_t^i) \quad \text{where} \quad \Delta I(b_t^i) = \sum_{j=0}^T\text{err}(\text{curr}_j,\text{next}_j(b_t^i))
\label{eqn:expectedchange}
\end{align}
The above selects the frame, that when annotated, produces the largest
expected track label change.
We now show how to compute $P(b_t^i)$ and $\Delta I(b_t^i)$ using costs and constrained paths,
respectively, from the dynamic-programming based visual tracker described in Section \ref{sec:tracking}. By considering every possible space-time location that a worker
could annotate, we are able to determine which frame we expect could change the
current path the most. Even though this calculation searches over an
exponential number of paths, we
are able to compute it in polynomial time using
dynamic programming. Moreover, (\ref{eqn:expectedchange}) can be parallelized across frames in order to guarantee a rapid response time, often necessary due to the interactive nature of active learning.

\subsection{Annotation Likelihood and Estimated Tracks}
A user has access to global knowledge and video history when
annotating a frame. To capture such global information, we define the annotation likelihood of location $b^i_t$ to be the score of the best track given that additional annotation:
%we define the annotation likelihood of a particular spacetime location $P(b_t^i)$ by computing the score of the best track given that additional constraint:
%scoring the total energy of the putative track $\text{next}_{0:T}(b_t^i)$:
\begin{align}
P(b_t^i) \propto \exp\left({\frac{-\Psi(b_t^i)}{\sigma^2}}\right)
%\quad \text{where} \quad \Psi(b_t^i) &= \min_{b_{0:T}} E\left(b_{0:T}\right) \quad s.t. \quad b_t = b_t^i
\quad \text{where} \quad \Psi(b_t^i) &= E\left(\text{next}_{0:T}(b_t^i)\right)
\end{align}
The above formulation only assigns high probabilities to locations that lie on paths that agree with the global constraints in $\zeta$, as explained
in Fig.\ref{fig:bounce} and Fig.\ref{fig:straightslide}. To compute energies $\Psi(b_t^i)$ for all spacetime locations $b_t^i$, we use a standard two-pass dynamic programming algorithm for computing {\em min-marginals}: 
\ignore{Let $\Psi(b_t^i)$ to be the cost of the least cost path that is
forced to pass through $b_t^i$, and $\text{next}_{0:T}(b_t^i)$ be the
associated track:
\begin{align}
\Psi(b_t^i) &= \min_{b_{0:T}} E\left(b_{0:T}\right) \quad s.t. \quad
b_t = b_t^i\\
\text{next}_{0:T}(b_t^i) &= \argmin _{b_{0:T}} E\left(b_{0:T}\right) \quad s.t. \quad
b_t = b_t^i
\end{align}
These are equivalent to min-marginal energies. A naive approach to
computing $\Psi(b_t^i)$ is to reevaluate (\ref{eqn:costrecursion})
for every $i$ with the constraint that $b_t = b_t^i$, but this is
clearly computationally infeasible due to the enormous number of
possible locations. Instead, $\Psi(b_t^i)$ can be efficiently computed through a two-pass dynamic
programming algorithm for computing min-marginals:
}
\begin{align}
\Psi(b_t^i) = C_t^\rightarrow(b_t^i) + C_t^\leftarrow(b_t^i) - U(b_t^i)
\end{align}
\noindent where $C_t^\leftarrow(b_t^i)$ corresponds to intermediate
costs computed by running the recursive algorithm from
(\ref{eqn:costrecursion}) backward in time. By caching forward and backward pointers $\pi_t^\rightarrow(b_t^i)$ and
$\pi_t^\leftarrow(b_t^i)$, the associated
tracks $\text{next}_{0:T}(b^i_t)$ can be found by backtracking both
forward and backward from any spacetime location $b^i_t$.

%\deva{We should intuitively describe why the above does better than simply using
%  the local scores to approximate the probability that a user would
%  select a location. Some of the toy examples would illustrate this,
%  correct? Maybe that would be an interesting experiment to show in
%the results section.}

\subsection{Label Change}
%Let $\lambda(b_t^i)$ be a function that computes the difference between a
%candidate annotation $b_t^i$ and the predicted location $b_t^*$:
%\begin{align}
%\lambda(b_t^i) = \begin{cases}
%0 & \text{if } \frac{b_t^i \cap b_t^*}{b_t^i \cup b_t^*} \ge 0.5 \\
%1 & \text{otherwise}
%\end{cases}
%\end{align}

%$\lambda(b_t^i)$ is the same criteria as used for evaluation in the PASCAL VOC
%challenge: a bounding box must overlap the current prediction by at least 50\%.
We now describe a dynamic programming algorithm for computing the label change $\Delta I(b_t^i)$ for all possible spacetime
locations $b_t^i$. To do so, we define intermediate quantities
$\Theta_t^\rightarrow(b_t^i)$ which represent the label
change up to time $t$ given the user annotates location $b_t^i$:
\begin{align}
%&\Theta_0^\rightarrow(b_0) = \lambda(b_0) \\
%&\Theta_t^\rightarrow(b_t^i) = \lambda(b_t^i) + \Theta_{t-1}^\rightarrow(b_{t-1}^j) \quad
%s.t. \quad b_{t-1}^j = \argmin_{b_{t-1}^j}
%C_{t-1}^\rightarrow(b_{t-1}^j) + S(b_t^i, b_{t-1}^j)
&\Theta_0^\rightarrow(b_0) = \text{err}(\text{curr}_0,\text{next}_0(b_0)) \\
&\Theta_t^\rightarrow(b_t) = \text{err}(\text{curr}_t,\text{next}_t(b_t)) + \Theta_{t-1}^\rightarrow(\pi^\rightarrow_t(b_t)) \label{eqn:cachedpointer}
\end{align}
%Note that the pointers $pi$ are computed and cached in
%Eqn.\ref{eqn:costptr}, meaning that we can efficiently compute $\Theta_t^\rightarrow(b_t^i)$ in $O(T)$.
We can compute $\Theta_t^\leftarrow(b_t^i)$, the expected label
change due to frames $t$ to $T$ given a user annotation at $b_t^i$, by running the above recursion
backward in time. The total label change is their sum, minus the
double-counted error from frame $t$:
\begin{align}
\Delta I(b_t^i) = \Theta_t^\rightarrow(b_t^i) + \Theta_t^\leftarrow(b_t^i) - \text{err}(\text{curr}_t,\text{next}_t(b_t^i))
\label{eqn:totallabelchange}
\end{align}

(\ref{eqn:totallabelchange}) is sensitive to small spatial shifts; i.e.
$\Delta I(b_t^i) \not\approx \Delta I(b_t^{i+\epsilon})$. %Consider two 
%objects that are passing by and partially occluding each other; a user may
%intend to annotate one object at $b_t^i$ but inadvertently annotate
%another object $b_t^{i+\epsilon}$. To reduce the effect of such human
To reduce the effect of imprecise human labeling (which we encounter
in practice), we replace the label change with a worst-case label
change computed over a neighboring window $N(b_t^i)$:
\begin{align}
\tilde{\Delta} I(b_t^i) = \min_{b_t^j \in N(b_t^i)} \Delta I(b_t^j)
\end{align}
By selecting frames that have a large expected ``worse-case'' label
change, we avoid querying frames that
require precise labeling and instead query for frames that are easy to label (e.g., the user may annotate any location within a small
neighborhood and still produce a large label change).
%If there are $X^2$ locations in $N(b_t^i)$, we are able to minimize in $O(X)$ by using a linearly separable min-filter. Typically, $X = 10$.
%Explain min here. We use linear separability trick here as well.
%\deva{If we are running tight on space, I would drop the last equation
%  and ignore the neighborhood issue - we'll talk about it a journal
%  version!}

\subsection{Stopping Criteria}
Our final active learning algorithm is as follows: we query a frame $t^*$
according to (\ref{eqn:expectedchange}), add the user annotation to the constraint set $\zeta$, retrain the object template
with additional training examples extracted from frame $t^*$ (according
to (\ref{eqn:svm})), and repeat. We stop requesting annotations once we are confident that
additional annotations will not significantly change the predicted path:
\begin{align}
\max_{0 \le t \le T} \sum_{i = 0}^K P(b_t^i) \cdot \Delta I(b_t^i) < \text{tolerance}
\end{align}
We then report $b_{0:T}^*$ as the final annotated track as found in (\ref{eqn:leastcost}). We note, however, that in practice external factors, such as budget, will often
trigger the stopping condition before we have obtained a perfect track. As long
as the budget is sufficiently high, the reported annotations will closely match
the actual location of the tracked object. 

We also note that one can
apply our active learning algorithm {\em in parallel} for multiple
objects in a video.  We maintain separate object models $w$ and constraint sets
$\zeta$ for each object. We select the object and frame with the
maximum expected label change according to (\ref{eqn:expectedchange}) . We
demonstrate that this strategy naturally focuses labeling effort on
the more difficult objects in a video.

%\subsection{Efficient Implementation}
%
%Due to the interactive nature of active learning, we require an approach that
%can quickly request new frames. The usefulness of our framework would be
%diminished if our system took too long to make requests. While our approach is
%computationally expensive, we were able to successfully parallelize it across
%24 cores at 2.67 GHz each during our experiments. With the introduction of
%cloud computing, our system is able to respond swiftly and without unreasonable
%delay.
