{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.8"
    },
    "colab": {
      "name": "adult_income.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DSUKVEXWOGqm"
      },
      "source": [
        "### Adult Income Dataset\n",
        "This dataset is provided by the UC Irvine ML Repository and can be found [here](https://archive.ics.uci.edu/ml/datasets/adult).  The inputs are demographic data about an individual in the form of 14 attributes, both categorical and numerical, and the task is to predict whether the individual makes more or less than $50,000/year.  This notebook can run without a GPU and will take a few minutes per trial, depending on how many training iterations you want.  "
      ],
      "id": "DSUKVEXWOGqm"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IqNVORQ1Ivn0"
      },
      "source": [
        "### Data loading and pre-processing\n",
        "This data processing script is from Jason Brownlee's blog [here](https://machinelearningmastery.com/imbalanced-classification-with-the-adult-income-dataset/), it uses one hot encoding for all categorical features and [0,1] scaling for numerical features. \n",
        "\n"
      ],
      "id": "IqNVORQ1Ivn0"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "casual-avatar"
      },
      "source": [
        "from pandas import read_csv\n",
        "import numpy as np\n",
        "from sklearn.preprocessing import LabelEncoder\n",
        "from sklearn.preprocessing import OneHotEncoder\n",
        "from sklearn.preprocessing import MinMaxScaler\n",
        "from sklearn.compose import ColumnTransformer\n",
        "from sklearn.linear_model import LogisticRegression as LogReg\n",
        "\n",
        "from time import time\n",
        "import sys\n",
        "\n",
        "# load the dataset\n",
        "def load_dataset(full_path):\n",
        "\t# load the dataset as a numpy array\n",
        "\tdataframe = read_csv(full_path, header=None, na_values='?')\n",
        "\t# drop rows with missing\n",
        "\tdataframe = dataframe.dropna()\n",
        "\t# split into inputs and outputs\n",
        "\tlast_ix = len(dataframe.columns) - 1\n",
        "\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\n",
        "\t# select categorical and numerical features\n",
        "\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\n",
        "\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\n",
        "\t# label encode the target variable to have the classes 0 and 1\n",
        "\ty = LabelEncoder().fit_transform(y)\n",
        "\treturn X.values, y, cat_ix, num_ix\n",
        " \n",
        "# define the location of the datasets\n",
        "train_path = 'adult.data'\n",
        "test_path = 'adult.test'\n",
        "# load the train and test datasets\n",
        "X_train, y_train, cat_ix, num_ix = load_dataset(train_path)\n",
        "X_test, y_test, cat_ix, num_ix = load_dataset(test_path)\n",
        "# set transformer to 1 hot encode categorical variables and scale numerical variables\n",
        "ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)], sparse_threshold=0)\n",
        "# get encoded data inputs\n",
        "data_train = ct.fit_transform(X_train)\n",
        "data_test = ct.transform(X_test)"
      ],
      "id": "casual-avatar",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mIS01g7VJeRd"
      },
      "source": [
        "### Dataset\n",
        "\n",
        "Define a custom pytorch dataset which separates the dataset by a chosen set of attributes, as detailed in the comments.  We reported the cross sectional attributes in the paper.  "
      ],
      "id": "mIS01g7VJeRd"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sunset-afternoon"
      },
      "source": [
        "import torch\n",
        "from torch.utils.data import Dataset, DataLoader, Subset, RandomSampler\n",
        "\n",
        "class AdultDataset(Dataset):\n",
        "    def __init__(self, raw_data, transformed_data, labels, attributes='gender'):\n",
        "        #attributes = 'gender' corresponds to gender: 0=female, 1=male\n",
        "        #attributes = 'both' corresponds to race+gender: \n",
        "        # 0=white, female; 1=other,female; 2=white,male; 3=other,male\n",
        "        self.attributes = attributes\n",
        "        self.data = transformed_data\n",
        "        self.labels = labels\n",
        "        if self.attributes == 'gender':\n",
        "            self.indices = [[]]*2\n",
        "            self.indices[0] = np.argwhere(raw_data[:,9] == ' Female').squeeze()\n",
        "            self.indices[1] = np.argwhere(raw_data[:,9] == ' Male').squeeze()\n",
        "        elif self.attributes == 'both':\n",
        "            self.indices = [[]]*4\n",
        "            self.indices[0] = np.argwhere((raw_data[:,9] == ' Female') * (raw_data[:,8] == ' White')).squeeze()\n",
        "            self.indices[1] = np.argwhere((raw_data[:,9] == ' Female') * (raw_data[:,8] != ' White')).squeeze()\n",
        "            self.indices[2] = np.argwhere((raw_data[:,9] == ' Male') * (raw_data[:,8] == ' White')).squeeze()\n",
        "            self.indices[3] = np.argwhere((raw_data[:,9] == ' Male') * (raw_data[:,8] != ' White')).squeeze()\n",
        "        else:\n",
        "            raise Exception(\"Set attributes to either 'gender' or 'both'\")\n",
        "            \n",
        "    def __len__(self):\n",
        "        return len(self.data)\n",
        "    \n",
        "    def __getitem__(self, idx):\n",
        "        return self.data[idx], self.labels[idx]\n",
        "\n",
        "trainset = AdultDataset(X_train, data_train, y_train, attributes='both')\n",
        "testset = AdultDataset(X_test, data_test, y_test, attributes='both')"
      ],
      "id": "sunset-afternoon",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A9MQxPUOJtqK"
      },
      "source": [
        "### Oracle\n",
        "\n",
        "Define our sampling oracle, which can provide samples of (feature,label) pairs of a specified attribute, sampled uniform at random with replacement over all training examples of that attribute.  "
      ],
      "id": "A9MQxPUOJtqK"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xSnKd0BmMppD"
      },
      "source": [
        "class Oracle_adult():\n",
        "    def __init__(self, dset, device=None, batch_size=50, naive=False):\n",
        "        \n",
        "        if device is None:\n",
        "            device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
        "        self.device = device\n",
        "        self.batch_size = batch_size \n",
        "        self.naive = naive\n",
        "        self.att_indices = dset.indices\n",
        "        m = len(self.att_indices)\n",
        "        self.attribute_iterators = []\n",
        "        if self.naive:\n",
        "            # for the naive scheme, disregard attributes and just sample uniformly\n",
        "            # over the whole training set\n",
        "            sampler = RandomSampler(dset, replacement=True, \n",
        "                                    num_samples=sys.maxsize)\n",
        "            loader = DataLoader(dset, batch_size=self.batch_size, \n",
        "                                sampler=sampler, shuffle=False, num_workers=0)\n",
        "            self.attribute_iterators.append(iter(loader))\n",
        "        else:\n",
        "            for i in range(m):\n",
        "                #select all points in the dataset with attribute i\n",
        "                data_subset = Subset(dset, self.att_indices[i])\n",
        "                #create a random with replacement sampler over those points\n",
        "                #num_samples is how many samples before the iterator ends\n",
        "                sampler = RandomSampler(data_subset, replacement=True, \n",
        "                                        num_samples=sys.maxsize)\n",
        "                loader = DataLoader(data_subset, batch_size=self.batch_size, \n",
        "                                    sampler=sampler, shuffle=False, num_workers=0)\n",
        "                #create a dataloader with the subset of points+random sampler\n",
        "                #and store an iterable of it                    \n",
        "                self.attribute_iterators.append(iter(loader))\n",
        "    def __call__(self, att=0, batch_size=None, return_idx=False):\n",
        "        # these are the indices of the original, complete dataset\n",
        "        # have to be careful to maintain its order if we are going to use them\n",
        "        if return_idx:\n",
        "            return self.att_indices[att]\n",
        "        if self.naive:\n",
        "            att = 0\n",
        "        X, Y = next(self.attribute_iterators[att])\n",
        "        #much faster if batch_size <= self.batch_size, the loop is slow\n",
        "        if batch_size is not None:\n",
        "            for i in range(batch_size // self.batch_size):\n",
        "                x, y = next(self.attribute_iterators[att])\n",
        "                X = torch.cat((X,x))\n",
        "                Y = torch.cat((Y,y))\n",
        "            X = X[:batch_size]\n",
        "            Y = Y[:batch_size]\n",
        "\n",
        "        return X.to(self.device), Y.to(self.device)"
      ],
      "id": "xSnKd0BmMppD",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "020lQA_8MKDD"
      },
      "source": [
        "Runs the algorithm: on each round, fit a logistic regression model to the \n",
        "accumulated training set and then evaluate its performance on the \"validation\" set `features2, use that performance and the specified scheme to select the next attribute to sample from and add to the training set.  Repeat for `TT` iterations, periodically recording test accuracy.  "
      ],
      "id": "020lQA_8MKDD"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "seven-privilege"
      },
      "source": [
        "def Experiment0(model, train_oracle, test_data, test_labels, features1, labels1, \n",
        "                features2, labels2, attributes, \n",
        "                num_test, n0, TT=500, rule='ucb0', param=None, \n",
        "                forced='True', verbose=True):\n",
        "    '''\n",
        "    rule  :string   possible options are\n",
        "                  ucb0 -- standard UCB\n",
        "                  emp  -- empirical, i.e., greedy \n",
        "                  eps   -- epsilon greedy (with epsilon=param)\n",
        "    attributes :list of ints, 0,1,... corresponding to the attribute classes\n",
        "    '''\n",
        "    # 2 for 'gender', 4 for 'both'\n",
        "    m = len(attributes)\n",
        "    # counters for testing frequency\n",
        "    n = (TT-m*n0) // (TT//num_test)\n",
        "    k=0\n",
        "    # track accuracy and mixture distribution for each attribute over time\n",
        "    total_acc = np.zeros((m,n))\n",
        "    Pi = n0 * np.ones(m)\n",
        "    # start the main loop\n",
        "    for t in range(m*n0, TT):\n",
        "        # fit the model to the training set\n",
        "        model.fit(features1, labels1)\n",
        "\n",
        "        #compute accuracy per attribute on the second set\n",
        "        acc = np.zeros(m)\n",
        "        for att in range(m):\n",
        "            x, y = features2[att], labels2[att]\n",
        "            yhat = torch.tensor(model.predict(x))\n",
        "            correct = (yhat==y).sum()\n",
        "            acc[att] = correct / len(y)\n",
        "\n",
        "        # Point Selection Rule\n",
        "        U = torch.zeros(m)\n",
        "        if rule=='ucb0':# Optimistic Rule\n",
        "            if param is None:\n",
        "                param=0.1\n",
        "            U = acc - param/np.sqrt(Pi)\n",
        "        elif rule=='emp':# Empirical Rule\n",
        "            U = acc\n",
        "        elif rule=='eps':\n",
        "            if param is None:\n",
        "                param=0.1 # epsilon value for epsilon-greedy sampling\n",
        "            if np.random.random()<param: # randomly choose one of the attributes\n",
        "                idx = np.random.randint(0,m)\n",
        "                U[idx] = -1\n",
        "            else: # empirical sampling otherwise\n",
        "                U = acc\n",
        "        # choose the attribute with lower value of the index function U \n",
        "        # unless an attribute has fallen under the forcing threshhold\n",
        "        if any(Pi < np.sqrt(t)) and forced:\n",
        "            for att in attributes:\n",
        "                if Pi[att] < np.sqrt(t):\n",
        "                    at = att\n",
        "                    break\n",
        "        else:\n",
        "            val = sys.maxsize\n",
        "            for i,u in enumerate(U):\n",
        "                if u < val:\n",
        "                    at = i\n",
        "                    val = u\n",
        "        #if \n",
        "        if rule=='unif':\n",
        "            at = t % 4\n",
        "        Pi[at] += 1 \n",
        "\n",
        "        #record accuracy of current model on a test set\n",
        "        if t%(TT//num_test)==(TT//num_test - 1):\n",
        "            for j, att in enumerate(attributes):\n",
        "                total_acc[j, k] = (model.predict(test_data[att])==test_labels[att]).mean()\n",
        "            k += 1\n",
        "\n",
        "        \n",
        "        # draw a sample from the chosen attribute \n",
        "        xt, yt = train_oracle(att=at, batch_size=2)\n",
        "        features1 = torch.cat((features1, xt[0:1]), dim=0)\n",
        "        labels1 = torch.cat((labels1, yt[0:1]), dim=0)\n",
        "        features2[at] = torch.cat((features2[at], xt[1:2]), dim=0)\n",
        "        labels2[at] = torch.cat((labels2[at], yt[1:2]), dim=0)\n",
        "    \n",
        "    return total_acc, Pi/Pi.sum()"
      ],
      "id": "seven-privilege",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "adTc2ln6Mg9x"
      },
      "source": [
        "Specify parameters of the experiment here, options for scheme are `ucb0`, `emp`, for empirical/greedy, `eps` for epsilon-greedy, `unif`, and `naive`, which is what we referred to as `Uncurated` in the paper.  `TT` specifies the number of training iterations to run and `param` specifies the exploration parameter for both UCB and epsilon greedy.  This returns test accuracy over time for each trial, in `acc`, and the final mixture distribution selected by the sampling scheme for each trial in `pi`.  "
      ],
      "id": "adTc2ln6Mg9x"
    },
    {
      "cell_type": "code",
      "metadata": {
        "tags": [],
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "great-figure",
        "outputId": "e5371edc-1893-4a0a-afe3-0a3244e4cba4"
      },
      "source": [
        "# Run the experiment \n",
        "def runExperiment0(train_oracle, \n",
        "    test_data, test_labels, num_test,\n",
        "    attributes,\n",
        "    n0=10,\n",
        "    rule='ucb0',\n",
        "    forced=True,\n",
        "    param=0.2,\n",
        "    n_trials=7,\n",
        "    TT=500,\n",
        "    verbose=False\n",
        "):\n",
        "\n",
        "    start_time = time()\n",
        "    m = len(attributes)\n",
        "    n = (TT-m*n0) // (TT//num_test)\n",
        "    #accuracy and mixture dist for each attribute, over all trials, each TT//num_test epochs\n",
        "    acc = np.zeros((1,n_trials,m,n))\n",
        "    pi = np.zeros((1,n_trials,m))\n",
        "    for i in range(n_trials):\n",
        "        print('Starting trial {}/{} with rule = {}, forced = {}'.format(\n",
        "           i+1, n_trials, rule, forced))\n",
        "        \n",
        "        # Initialize the model\n",
        "        model = LogReg(max_iter=5000)\n",
        "        # Get the initial training datasets (2*n0)\n",
        "        x,y = train_oracle(att=0, batch_size=n0)\n",
        "        features1 = x\n",
        "        labels1 = y\n",
        "        x,y = train_oracle(att=0, batch_size=n0)\n",
        "        features2 = [x]\n",
        "        labels2 = [y]\n",
        "        for att in range(1,m):\n",
        "            x,y = train_oracle(att=att, batch_size=n0)\n",
        "            features1 = torch.cat((features1, x), dim=0)\n",
        "            labels1 = torch.cat((labels1, y), dim=0)\n",
        "            x,y = train_oracle(att=att, batch_size=n0)\n",
        "            features2.append(x)\n",
        "            labels2.append(y) \n",
        "\n",
        "        # Run the experiment to get one set of accuracies and pi vector\n",
        "        temp_acc, pi_temp = Experiment0(model=model, train_oracle=train_oracle, test_data=test_data, test_labels=test_labels,\n",
        "            features1=features1, labels1=labels1, features2=features2, labels2=labels2,\n",
        "            num_test=num_test, attributes=attributes, n0=n0,  TT=TT,\n",
        "            rule=rule, param=param, forced=forced, verbose=verbose\n",
        "        )\n",
        "        acc[0,i] = temp_acc\n",
        "        pi[0,i] = pi_temp\n",
        "        end_time=time()\n",
        "        print('Completed {} trials in {:.2f} seconds \\n \\n'.format(i+1, -start_time+end_time)) \n",
        "\n",
        "\n",
        "    return acc, pi\n",
        "  \n",
        "\n",
        "\n",
        "TT = 5000\n",
        "num_test=50\n",
        "rule = 'ucb0'\n",
        "naive = rule == 'naive'\n",
        "forced = True\n",
        "trials = 10\n",
        "param = 0.1\n",
        "#initialize the oracles\n",
        "train_oracle = Oracle_adult(trainset, naive=naive)\n",
        "test_oracle = Oracle_adult(testset)\n",
        "test_data = []\n",
        "test_labels = []\n",
        "#split the test set by attribute, this is only for tracking purposes, it doesn't interact with training at all\n",
        "for i in range(4):\n",
        "    idx = test_oracle.att_indices[i]\n",
        "    temp_data = data_test[idx]\n",
        "    temp_labels = y_test[idx]\n",
        "    test_data.append(temp_data)\n",
        "    test_labels.append(temp_labels)\n",
        "\n",
        "acc, pi = runExperiment0(train_oracle, test_data, test_labels, num_test, attributes=range(4), rule=rule, TT=TT, n_trials=trials, param=param, forced = forced)\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "id": "great-figure",
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Starting trial 1/10 with rule = ucb0, forced = True\n",
            "Completed 1 trials in 6.90 seconds \n",
            " \n",
            "\n",
            "Starting trial 2/10 with rule = ucb0, forced = True\n",
            "Completed 2 trials in 14.31 seconds \n",
            " \n",
            "\n",
            "Starting trial 3/10 with rule = ucb0, forced = True\n",
            "Completed 3 trials in 21.60 seconds \n",
            " \n",
            "\n",
            "Starting trial 4/10 with rule = ucb0, forced = True\n",
            "Completed 4 trials in 28.95 seconds \n",
            " \n",
            "\n",
            "Starting trial 5/10 with rule = ucb0, forced = True\n",
            "Completed 5 trials in 36.08 seconds \n",
            " \n",
            "\n",
            "Starting trial 6/10 with rule = ucb0, forced = True\n",
            "Completed 6 trials in 43.28 seconds \n",
            " \n",
            "\n",
            "Starting trial 7/10 with rule = ucb0, forced = True\n",
            "Completed 7 trials in 50.38 seconds \n",
            " \n",
            "\n",
            "Starting trial 8/10 with rule = ucb0, forced = True\n",
            "Completed 8 trials in 57.68 seconds \n",
            " \n",
            "\n",
            "Starting trial 9/10 with rule = ucb0, forced = True\n",
            "Completed 9 trials in 65.03 seconds \n",
            " \n",
            "\n",
            "Starting trial 10/10 with rule = ucb0, forced = True\n",
            "Completed 10 trials in 72.26 seconds \n",
            " \n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    }
  ]
}