This is data for the paper:

Reading Tea Leaves: How Humans Interpret Topic Models (NIPS 2009)
http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf

Please address questions about the data to jbg@umiacs.umd.edu

The data are organized first by corpus (at the moment, we're only
releasing Wikipedia), then by model, and then by number of topics.

Under the corpus directory are two files, which are shared across all
models and numbers of topics:

ledes
        The ledes for the documents that we showed to users on
        Mechanical Turk in the topic intrusion task (the line number
        corresponds to the zero-indexed id of the document)

vocab
        The vocabulary for the models (the line number corresponds to
        the zero-indexed id of the term)
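
Because both files map a line number to a zero-indexed id, reading them
into a list is enough to look up a document or term by id.  The sketch
below assumes UTF-8 text with one entry per line; the helper name
load_indexed_lines and the path "wikipedia/vocab" are illustrative
assumptions, not part of the release.

```python
def load_indexed_lines(path):
    """Return a list in which position i holds the text of line i.

    Assumes UTF-8 text with one entry per line, so the list index is
    the zero-indexed id described in this README.
    """
    with open(path, encoding="utf-8") as handle:
        # Strip only the trailing newline so blank entries keep their slot.
        return [line.rstrip("\n") for line in handle]
```

For example, vocab = load_indexed_lines("wikipedia/vocab") followed by
vocab[0] would give the term whose id is 0, if the corpus directory is
named that way.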

For each corpus, there are also files specific to each model and number
of topics.  These are stored in sub-directories within the corpus folder.

word_intrusion_responses

        This is a whitespace-delimited table with ten columns, in this
        order: model; corpus; number of topics; topic number for the
        normal words; position in the word list where the intruder was
        placed; position in the word list where the Mechanical Turk user
        clicked; whether the user asked for the "right" answer afterward;
        the intruding word; the word the user clicked on; and the words
        displayed (hyphen-delimited).
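
The column layout above can be read with a short parser.  This is a
minimal sketch, not the code used for the paper: the field names are
labels chosen here, every value is kept as a string because the exact
encodings are not documented, and no header row is assumed.  Splitting
the last column on "-" would also break any displayed word that itself
contains a hyphen.

```python
from collections import namedtuple

# Field names are illustrative labels for the ten documented columns.
WordIntrusionRow = namedtuple(
    "WordIntrusionRow",
    "model corpus num_topics topic intruder_position clicked_position "
    "asked_for_answer intruder_word clicked_word displayed_words",
)


def parse_word_intrusion_line(line):
    """Split one whitespace-delimited row into named fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    # The last column is the hyphen-delimited list of displayed words.
    return WordIntrusionRow(*fields[:9], displayed_words=fields[9].split("-"))


def picked_intruder(row):
    """True when the clicked word is the intruding word."""
    return row.clicked_word == row.intruder_word
```

Comparing the two word columns avoids having to interpret how the two
position columns are indexed.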

topic_intrusion_responses

        These are the human responses for the topic intrusion task,
        formatted as a CSV file.  The first field is the document id.
        The second field is the document title.  The third field is a
        colon-delimited list of all of the topics displayed.  The fourth
        field is the "intruding" topic.  The fifth field is the topic
        selected by the user on Mechanical Turk.
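
The five fields can be read with Python's csv module, which also copes
with quoted titles that contain commas.  This is a sketch under a few
assumptions: there is no header row, the dictionary keys are names
chosen here, and values are left as strings since the README does not
say how topic ids are written.

```python
import csv


def read_topic_intrusion(handle):
    """Read topic intrusion responses from an open CSV file object."""
    records = []
    for row in csv.reader(handle):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        doc_id, title, displayed, intruder, selected = row[:5]
        records.append({
            "doc_id": doc_id,
            "title": title,
            # Third field: colon-delimited list of displayed topics.
            "displayed_topics": displayed.split(":"),
            "intruder": intruder,
            "selected": selected,
            "correct": selected == intruder,
        })
    return records
```

Open the file with newline="" before passing it in, as the csv module
documentation recommends.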

doc_topic_proportion_for_topic_intrusion

        For the topic intrusion experiments shown to human subjects,
        these are the fitted topic proportions discovered by the model
        (the line number corresponds to the zero-indexed id of the
        document)

topic_intrusion_displayed_topics

        These are the discovered topics as displayed to users on
        Mechanical Turk.  The line number corresponds to the zero-indexed topic
        id.

doc_lhood

        This is the computed likelihood for held-out documents (these do not
        correspond to the document titles used for evaluation on Mechanical
        Turk)

fits/word-prob-in-topic

        This is the log probability of each word in each topic as fit by
        the model.  The file is in GSL matrix format.
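
GSL's text output routine, gsl_matrix_fprintf, writes one value per line
in row-major order and stores no dimensions, so a reader has to be told
the shape.  The sketch below assumes the file is that text form (not the
binary form written by gsl_matrix_fwrite) and leaves the shape to the
caller, since this README does not say whether rows are topics or words.
The function name read_gsl_matrix is hypothetical.

```python
def read_gsl_matrix(handle, n_rows, n_cols):
    """Read a text-format GSL matrix into a list of rows.

    The caller supplies the shape because the file itself carries no
    dimensions; values are assumed to be in row-major order.
    """
    values = [float(token) for token in handle.read().split()]
    if len(values) != n_rows * n_cols:
        raise ValueError(
            f"expected {n_rows * n_cols} values, found {len(values)}"
        )
    # Slice the flat list into n_rows consecutive runs of n_cols values.
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
```

Since the entries are log probabilities, math.exp recovers a
probability, and a value of -inf corresponds to probability zero.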
