\begin{figure}[tb]
\includegraphics[width=\textwidth]{figs/tough-montage-3.jpg}

\caption{A difficult scene from a basketball game~\cite{vondrick2010}. Players
are frequently partially or totally occluded, change pose, and are
difficult to localize against a cluttered background.}

\label{fig:basketball}
\end{figure}

\begin{figure}[tb]

\subfloat[VIRAT Cars \cite{SangminCVPR2011}]{
\includegraphics[width=0.5\textwidth]{figs/performance-virat-4videos.jpg}
}
\subfloat[Basketball Players \cite{vondrick2010}]{
\includegraphics[width=0.5\textwidth]{figs/basketball-30percent.jpg}
}

\caption{We compare active key frames (green curve) with fixed-rate key frames
(red curve) on a subset (a few thousand frames) of the VIRAT videos and on part
of a basketball game. Increasing the annotation frequency improves performance,
but it also increases the cost. By decreasing the annotation frequency in the
easy sections and transferring those clicks to the difficult frames, we achieve
better performance than the fixed-rate baseline on the same budget.
(\textbf{a}) Because VIRAT contains a large number of stationary objects, our
framework reallocates many clicks to moving objects, allowing us to achieve
nearly zero error. (\textbf{b}) By focusing annotation effort on ambiguous
frames, we show nearly a 5\% improvement on basketball players.}

\label{fig:performance}
\end{figure}

\section{Benchmark Results}
\label{sec:results}

We validate our approach on both the VIRAT challenge video surveillance data
set~\cite{SangminCVPR2011} and the basketball game studied
in~\cite{vondrick2010}. VIRAT is notable for its size: over three million
frames, with up to hundreds of annotated objects in each frame. The basketball
game is extremely difficult due to cluttered backgrounds, motion blur, frequent
occlusions, and drastic pose changes.
%We chose
%to evaluate our approach on these videos because building these data sets
%presented challenges that have pushed the limits of annotation technology. 

We evaluate the performance of our tracker using active key frames versus
fixed-rate key frames. A fixed-rate schedule simply requests annotations every
$T$ frames, regardless of the video content. For active key frames, we use the
annotation schedule presented in Section~\ref{sec:active}.
%In both
%cases, non key-frame object locations are nonlinearly
%interpolated/extrapolated using constrained dynamic programming
%(\ref{eqn:leastcost}).
Our fixed-rate baseline is the
state-of-the-art labeling protocol originally used to annotate both data
sets~\cite{vondrick2010,SangminCVPR2011}. In a given video, we
allow our active learning protocol to iteratively pick a frame and an object to
annotate until the budget is exhausted. We then run the tracker described in
Section~\ref{sec:tracking} constrained by these key frames and compare its
performance with that of the tracker constrained by fixed-rate key frames.
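
For concreteness, the fixed-rate schedule can be sketched as follows. We assume,
purely for illustration, that the frames of an object's track are indexed
$1,\dots,N$ and that the first frame is always annotated; the symbols $N$ and
$\mathcal{K}_T$ are notation introduced only for this sketch.
\[
\mathcal{K}_T = \{1,\, 1+T,\, 1+2T,\, \dots\} \cap \{1,\dots,N\},
\qquad
|\mathcal{K}_T| = \left\lceil N/T \right\rceil .
\]
Under these assumptions, the baseline spends $\lceil N/T \rceil$ annotations on
a track of $N$ frames, spaced uniformly, whereas the active schedule may
distribute the same total number of annotations unevenly across frames and
objects.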
%In our analysis, we use the same tracking parameters for both
%active and fixed key frames. 

We score the two key frame schedules by how well the tracker estimates the
ground truth annotations. In every frame, we consider a prediction correct if
it overlaps the ground truth by at least 30\%, a threshold that agrees with our
qualitative rating of performance. We compare our active approach to the
fixed-rate baseline for a fixed amount of user effort: is it better to spend
$X$ user clicks on active or fixed-rate key frames?
Fig.~\ref{fig:performance} shows that active key frames are the better choice.
Indeed, we can annotate the VIRAT data set for \emph{one tenth} of its original
cost.
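
The scoring rule above can be sketched as follows. As an assumption (the exact
overlap measure is not specified here), suppose that overlap is the
intersection-over-union ratio of the predicted box $\hat{b}_t$ and the ground
truth box $b_t$ in frame $t$; the symbols $\hat{b}_t$, $b_t$, $o$, and $N$ are
illustrative notation for this sketch only.
\[
o(\hat{b}_t, b_t) =
\frac{\mathrm{area}(\hat{b}_t \cap b_t)}{\mathrm{area}(\hat{b}_t \cup b_t)},
\qquad
\mathrm{error} =
\frac{1}{N} \sum_{t=1}^{N} \mathbf{1}\left[\, o(\hat{b}_t, b_t) < 0.3 \,\right].
\]
Under this reading, the error of a track of $N$ frames is the fraction of
frames in which the prediction fails the 30\% overlap test, so a schedule is
better when it yields a lower error for the same number of clicks $X$.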

