\section{Human Cost}
\label{sec:humancost}

Because video annotation is tedious and time-consuming, we design our
annotation protocol to request only annotations that humans can provide
easily. The purpose of active learning is, after all, to minimize annotation
effort. Consequently, we integrate a human cost term into our active learning
objective that favors frames that are not only informative but also easy to
annotate.

Let $H(b_t^i)$ be the human cost, a proxy for human effort, of
annotating bounding box $b_t^i$. We then modify Eqn.~\ref{eqn:expectedchange}
to incorporate the human cost:
\begin{align}
t^* &= \argmax_{t} \sum_{b_t^i} P(b_t^i) \cdot (\Delta I(b_t^i) - H(b_t^i)) \\
H(b_t^i) &= H_{jumping}(b_t^i) + H_{swapping}(b_t^i) + H_{uncluttered}(b_t^i)
\end{align}
In the rest of this section, we discuss these three human cost functions,
which we found useful for video annotation based on our experience annotating
massive videos.
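
As a purely illustrative example with hypothetical numbers (not
measurements), consider two candidate frames $t_A$ and $t_B$ that each contain
a single candidate box. Suppose the box in $t_A$ has $P = 0.9$,
$\Delta I = 2.0$, and $H = 1.5$, while the box in $t_B$ has $P = 0.8$,
$\Delta I = 1.2$, and $H = 0.2$. The objective then evaluates to
\begin{align*}
t_A &: \quad 0.9 \cdot (2.0 - 1.5) = 0.45 \\
t_B &: \quad 0.8 \cdot (1.2 - 0.2) = 0.80
\end{align*}
so $t_B$ is requested. Without the human cost term the scores would be
$0.9 \cdot 2.0 = 1.80$ and $0.8 \cdot 1.2 = 0.96$, and $t_A$ would be
requested instead. The human cost therefore changes the selection only when
the difference in annotation effort outweighs the difference in information
gain.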

\subsection{Minimize Large Jumps}

We discovered that users find it difficult to annotate when the video playback
undergoes significant temporal jumps. We believe this is because users rely on
the motion of the objects to visually decode complex scenes and maintain track
identities. We therefore penalize large temporal jumps:
\begin{align}
H_{jumping}(b_t^i) = \begin{cases}
    0 & \text{if } |t - t_c| < \alpha_3 \\
    \min\left(\alpha_4 \cdot \exp\left(|t-t_c|\right), \alpha_5\right) & \text{otherwise}
\end{cases}
\label{eqn:humanjump}
\end{align}
where $t_c$ is the last frame annotated.

We wish only to discourage large jumps in videos, not to disable them
completely. It is conceivable that our local model is able to track the object
for a long duration until the path becomes ambiguous (e.g., an object that is
initially stationary and later has chaotic motion), in which case a distant
frame may still be the most useful one to request. Furthermore, we truncate
the large jump penalty at $\alpha_5$ because, once a large enough jump is
made, the amount of cognitive effort required to decode the scene remains the
same regardless of the jump length (e.g., jumping forward a week vs. a month).
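
To see where the truncation takes effect, one can solve for the jump length
$d = |t - t_c|$ at which the exponential term in Eqn.~\ref{eqn:humanjump}
reaches the cap, assuming $\alpha_4 > 0$ and $\alpha_5 > 0$:
\begin{align*}
\alpha_4 \cdot \exp(d^*) = \alpha_5
\quad \Longrightarrow \quad
d^* = \ln\left(\frac{\alpha_5}{\alpha_4}\right)
\end{align*}
As an illustration only, take a hypothetical ratio $\alpha_5 / \alpha_4 = 10^3$
and assume $t$ is measured in frames. Then $d^* = \ln 10^3 \approx 6.9$, so any
penalized jump of seven or more frames incurs the same cost $\alpha_5$. Because
the exponential grows quickly, the choice of $\alpha_4$ relative to $\alpha_5$
determines how soon the penalty saturates.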

\subsection{Penalize Swapping}

Although an optimal active learning strategy may be to alternate between
requesting annotations towards the beginning and end of the video, annotators
have difficulty when the video rapidly switches between two points in time.
Ideally, if we knew in advance all the frames we were going to request, we
would show them to the user in temporal order, but determining this ordering
is computationally infeasible in our framework. As an approximation, we
instead favor frames that are in the future by slightly penalizing frames in
the past:
\begin{align}
H_{swapping}(b_t^i) = \begin{cases}
    0           & \text{if } t \ge t_c \\
    \alpha_4    & \text{otherwise}
\end{cases}
\end{align}
The annotator may be required to rewind the video at some point, but with a
sufficiently large $\alpha_4$ the amount of swapping will be reasonable.
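
To make the role of $\alpha_4$ concrete, consider a simplified illustration
with two single-box candidate frames, a future frame $t_f > t_c$ and a past
frame $t_p < t_c$. Assume that both boxes have the same probability $P > 0$ and
the same jumping and clutter costs, so that they differ only in their
information gain and in the swapping penalty. Writing $\Delta I_f$ and
$\Delta I_p$ as shorthand (introduced here) for the two information gains, the
objective value of the future frame minus that of the past frame is
\begin{align*}
P \cdot \left(\Delta I_f - \Delta I_p + \alpha_4\right)
\end{align*}
Under these assumptions the past frame is requested only when
$\Delta I_p - \Delta I_f > \alpha_4$, that is, when rewinding gains more
information than the penalty costs.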

\subsection{Favor Uncluttered Examples}

Users often have difficulty annotating when the space around the tracked object
is cluttered. In low-resolution video, for example, distinguishing the limbs of
two people separating from occlusion can be difficult. We argue that humans
annotate best when the tracked object is not near any other object:
\begin{align}
H_{uncluttered}(b_t^i) = -\alpha_5 \cdot \sum_{j} P(b_t^j) \cdot \|i-j\|^2
\end{align}
Calculating $H_{uncluttered}(b_t^i)$ is computationally expensive, especially
for high-resolution video. However, because the calculation is easily
parallelizable, it can be performed efficiently on a GPU.
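
As a small worked example, assume that $\|i-j\|$ denotes the distance in
pixels between candidate boxes $b_t^i$ and $b_t^j$ (an interpretation adopted
here for illustration), and that two other candidates each have probability
$0.5$. With hypothetical distances of $80$ and $100$ pixels for an isolated
box, and $10$ and $20$ pixels for a cluttered one, the cost evaluates to
\begin{align*}
\text{isolated:} \quad H_{uncluttered} &= -\alpha_5 \cdot (0.5 \cdot 80^2 + 0.5 \cdot 100^2) = -8200\,\alpha_5 \\
\text{cluttered:} \quad H_{uncluttered} &= -\alpha_5 \cdot (0.5 \cdot 10^2 + 0.5 \cdot 20^2) = -250\,\alpha_5
\end{align*}
For any $\alpha_5 > 0$ the isolated box receives the lower (more negative)
cost, so the objective favors it, and the quadratic term makes distant
candidates dominate the sum.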
