\section{Introduction}
\label{sec:introduction}


\begin{figure}[b]
\includegraphics[width=0.125\textwidth]{figs/viratsamp1.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp2.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp3.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp4.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp5.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp6.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp7.jpg}\includegraphics[width=0.125\textwidth]{figs/viratsamp8.jpg}
%\includegraphics[width=0.195\textwidth]{figs/viratsamp9.jpg}
%\includegraphics[width=0.195\textwidth]{figs/viratsamp10.jpg}
%
%\includegraphics[width=0.195\textwidth]{figs/viratsamp11.jpg}
%\includegraphics[width=0.195\textwidth]{figs/viratsamp12.jpg}
%\includegraphics[width=0.195\textwidth]{figs/viratsamp13.jpg}
%\includegraphics[width=0.195\textwidth]{figs/viratsamp14.jpg}
%\includegraphics[width=0.195\textwidth]{figs/viratsamp15.jpg}

\caption{Videos from the VIRAT data set \cite{SangminCVPR2011} can have
hundreds of objects per frame. Most of those objects are easy to track; only
a few cases are difficult. Our active learning framework automatically
focuses the worker's effort on these difficult instances (such as occlusion or
deformation).}

\label{fig:virat}
\end{figure}


%\begin{figure}[b]
%\subfloat[Easy]{
%\includegraphics[width=0.19\textwidth]{figs/basketball1.jpg}
%}
%\subfloat[Easy]{
%\includegraphics[width=0.19\textwidth]{figs/basketball2.jpg}
%}
%\subfloat[Easy]{
%\includegraphics[width=0.19\textwidth]{figs/basketball4.jpg}
%}
%\subfloat[Hard]{
%\includegraphics[width=0.19\textwidth]{figs/basketball5.jpg}
%}
%\subfloat[Hard]{
%\includegraphics[width=0.19\textwidth]{figs/basketball6.jpg}
%}

%\caption{We focus on efficient video annotation. Our active learning framework
%will prompt a worker to annotate the difficult examples (\textbf{d}-\textbf{e})
%while skipping the trivial cases (\textbf{a}-\textbf{c}) because a visual
%tracker was already able to automatically estimate those annotations. By
%automatically focusing annotation effort on video frames where computer vision
%is inadequate, we are able to annotate videos with less effort than
%state-of-the-art video annotation protocols}
%
%\label{fig:motivation}
%\end{figure}

With the decreasing cost of personal portable cameras and the rise of
online video sharing services such as YouTube, there is an abundance
of \emph{unlabeled} video readily available. To train and
evaluate computer vision models for video analysis, this data must be labeled. Indeed, many approaches have demonstrated
the power of data-driven analysis given \emph{labeled} video
footage \cite{liu2008sift,yuen2010data}.
%The common intuition among
%these data-driven approaches relies
%on the assumption that the database
%contains a labeled result that is
%visually very similar to the
%query. Consequently, the larger the database, the better the performance.

But annotating massive video collections is prohibitively expensive. The twenty-six-hour VIRAT
video data set, consisting of surveillance footage of cars and people, cost
\emph{tens of thousands of dollars} to annotate despite deploying
state-of-the-art annotation protocols \cite{SangminCVPR2011}. Existing
video annotation protocols typically have users (possibly on
Amazon Mechanical Turk) label a sparse set of key frames, and then fill in the
remaining frames by either linear interpolation \cite{yuen-labelme} or nonlinear tracking \cite{agarwala2004keyframe,vondrick2010}.

We propose an adaptive key-frame strategy that uses active learning
to {\em intelligently} query a worker to label only certain objects at only certain frames that are likely to improve performance.  This approach exploits the fact that, for real footage, not all objects/frames are ``created equal.'' Some objects during some frames are ``easy'' to annotate automatically because they are stationary (such as parked cars in VIRAT \cite{SangminCVPR2011}) or moving in isolation (such as a single
basketball player running down the court during a fast break
\cite{vondrick2010}). In these cases, a few user clicks are enough to
constrain a visual tracker to produce accurate tracks. User
clicks should instead be spent on the ``hard'' objects/frames that are visually
ambiguous, such as those with occlusions or cluttered backgrounds.
%For example, in the benchmark VIRAT
%dataset, many cars are stationairy for long periods of time when
%parked. An intelligent keyframe qeury would not focus annotation
%effort on these easy frames, where a simple visual tracker
%This cost is not
%acceptable nor sustainable. The computer vision community has invested
%significant effort to develop methods that can efficiently label videos.
%Sorokin and Forsyth have demonstrated that Amazon's Mechanical Turk provides an
%on-demand, scalable workforce capable of completing vision annotation tasks for
%relatively low wages \cite{sorokin51utility}.  Although the ``Turk
%philosophy'' suggests that all tasks should be replaced with human effort,
%manually annotating every frame is inefficient for video due to its dynamic yet
%redundant nature. In LabelMe video, Yuen et.\ al propose letting workers label
%a sparse set of frames and adopt a homography-preserving linear interpolation
%scheme to recover the missing annotations \cite{yuen-labelme}.  Observing that
%linear interpolation requires frequent annotations when motion is nonlinear,
%Vondrick et.\ al argues that human labor is more expensive than computation and
%demonstrates that employing a tracker can benefit the video annotation protocol
%given a fixed budget \cite{vondrick2010}. Yet, despite these advances, building
%the labeled video databases necessasry for high-performance, data-driven
%techniques remains unreasonably expensive.

{\bf Related work (Active learning):}  We refer the reader to the excellent
survey in \cite{settlestr09} for a contemporary review of active learning. Our
approach is an instance of active structured prediction
\cite{culotta2005reducing,culotta2006corrective}, since we train object models
that predict a complex, structured label (an object track) rather than a binary
class output.  However, rather than training a single car model over several
videos (which must be invariant to instance-specific properties such as color
and shape), we train a separate car model for each car instance to be tracked.
From this perspective, our training examples are individual frames rather than
videos. But notably, these examples are \emph{non-i.i.d.}; indeed, temporal
dependencies are crucial for obtaining tracks from sparse labels. We believe
this property makes video a prime candidate for active learning, possibly
simplifying its theoretical analysis \cite{settlestr09,balcan2006agnostic}
because one does not face an adversarial ordering of data.  Our approach is
similar to recent work in active labeling \cite{bransonstrong}, except we
determine which part of the label the user should annotate in order to improve
performance the most.  Finally, we use a novel query strategy appropriate for
video: rather than use expected information gain (expensive to compute for
structured predictors) or label entropy (too coarse an approximation), we
use the expected label change to select a frame.
%Through efficient dynamic
%programming algorithms, 
We select the frame that, when labeled, will produce
the largest change in the estimated track of an object.
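One way to make this criterion concrete, as a sketch under notation assumed
here purely for illustration: let $\hat{y}$ denote the currently estimated
track, let $\hat{y}^{(t,b)}$ denote the track re-estimated after frame $t$ is
additionally constrained to location $b$, and let $p(b \mid t)$ be the
tracker's current belief over the object's location in frame $t$. A plausible
form of the query is then
\[
t^{\star} = \arg\max_{t} \; \sum_{b} p(b \mid t)\,
\Delta\bigl(\hat{y},\, \hat{y}^{(t,b)}\bigr),
\]
where $\Delta$ is any distance between two tracks, for instance the number of
frames in which they disagree. Under this reading, frames where every likely
answer leaves the track unchanged score near zero and are never queried.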

\ignore{ Discussion of related work in active learning and how ours is non-trivially
different. \cite{settles2008multiple} proposes the maximum expected gradient
length (greatest model change) that we use. \cite{settles2008active} has
varying costs for labeling each instance. \cite{settlestr09} is the survey
paper. \cite{vijayanarasimhanlarge} is Grauman's CVPR paper.
\cite{vijayanarasimhan2010far} is far sighted active learning, which is
non-swapping might be similar to.  To our knowledge, we are the first to
consider an active learning strategy for video annotation.
Our active learning approach is novel in relation to contemporary active
learning strategies. The closest method to ours is probably the maximum
expected gradient length (EGL) first described by Settles et al.\
\cite{settles2008multiple}. EGL queries for training instances that would
induce the largest change to the \emph{model}.  Our approach queries for frames
that would cause the largest change to the current \emph{label}. The
distinction is subtle. If EGL were applied to video annotation, it would
request that a worker label an entire path for a certain category (e.g., label
every frame for a car) and then find another car to label in perhaps an
entirely different video in order to build a robust car tracker. In our
approach, we are only interested in most efficiently annotating a single car
where each sample is not independent because we try to build an appearance
model that discriminates for the identity of this particular car. Once we have
finished annotating the car, we must restart the active learning process for
another car. Indeed, EGL could use our approach as a subroutine in learning a
car tracker.}
{\bf Related work (Interactive video annotation):} There has also been
work on interactive tracking from the computer vision
community. \cite{buchanan2006interactive} describe efficient data
structures that enable interactive tracking, but do not focus on frame
query strategies as we do.  \cite{yuen-labelme} and
\cite{agarwala2004keyframe} describe systems that allow users to
manually correct drifting trackers, but this requires annotators to
watch an entire video in order to find the erroneous frames, which is a significant burden in our experience. %We demonstrate that active learning can
%significantly alleviate this burden.
%In the remainder of this paper, we present our active learning framework and
%show its capabilities over the current state-of-the-art. In section
%\ref{sec:tracking}, we describe a general tracking algorithm that estimates
%nonlinear paths between constrained key points. In section \ref{sec:active}, we
%introduce a novel active learning formulation that can determine which frame a
%worker should label in order to reduce the number of annotations that our
%tracker requires to accurately recover a path. Section \ref{sec:experiments}
%demonstrates the power of our framework by offering intuitive examples of our
%active learning protocol.  Finally, we validate our approach on both the VIRAT
%surveillance video data set \cite{SangminCVPR2011} and a difficult basketball
%game \cite{vondrick2010} in section \ref{sec:results}. We hope the findings in
%this paper will enable the construction of massive, labeled video databases in
%the years to come.

