Model configurations and packages:
  - pip install --upgrade pip
  - pip install tensorflow-gpu==2.0
  - pip install py-params==0.8.3
  - pip install params-flow==0.7.4
  - pip install bert-for-tf2==0.13.5
  - pip install sentencepiece
  - pip install scikit-learn

Run program as:

- PYTHONHASHSEED=1234 python run_uast.py --path $$PT_DATA_DIR --task SST-2 --model_dir $$PT_OUTPUT_DIR --seq_len 32 --sample_scheme easy_bald_class_weight --sup_labels 60 --N_base 10
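
To repeat the command above with several seeds, a small launcher can set PYTHONHASHSEED per run. This is a minimal sketch, not part of the repository: the helper names `build_run` and `run_all`, the example seeds, and the per-seed output sub-directory are assumptions for illustration. The flags are copied from the command above.

```python
import os
import subprocess
from typing import Dict, List, Sequence, Tuple


def build_run(seed: int, data_dir: str, model_dir: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one run_uast.py invocation with the given seed."""
    argv = [
        "python", "run_uast.py",
        "--path", data_dir,
        "--task", "SST-2",
        "--model_dir", model_dir,
        "--seq_len", "32",
        "--sample_scheme", "easy_bald_class_weight",
        "--sup_labels", "60",
        "--N_base", "10",
    ]
    # Copy the current environment and override only the seed variable.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    return argv, env


def run_all(seeds: Sequence[int], data_dir: str, model_dir: str) -> None:
    """Run the program once per seed, stopping at the first failure."""
    for seed in seeds:
        # Assumption: each seed writes to its own sub-directory of model_dir.
        out_dir = os.path.join(model_dir, f"seed_{seed}")
        argv, env = build_run(seed, data_dir, out_dir)
        subprocess.run(argv, env=env, check=True)


# Example call (hypothetical seeds and paths):
# run_all([1234, 2345, 3456], "/path/to/data_directory", "/path/to/output")
```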

Parameters:
  - PYTHONHASHSEED is used to seed the random number generator for selecting training samples. Use different seeds for different runs.
  - path is the path to data_directory
  - task is the sub-directory of data_directory for a specific dataset, containing train.tsv, test.tsv and the unlabeled data
  - sup_labels is the total number of labeled samples per class used for training and validation
  - valid_split is the fraction of sup_labels used for validation for each class
  - sample_scheme is the sampling scheme used for self-training
    -- "uniform" for random sampling
    -- "easy_bald_class_weight" for selecting easy samples with the BALD measure, class selection and confident learning (the 'class' and 'weight' flags are used for class_selection and confident_learning)
    -- "difficult_bald_class_weight" for the same as above but with hard samples
    -- "easy_bald_class_weight_soft" for using aggregate predictions. This works better than "easy_bald_class_weight" when there are more than 2 classes
    -- refer to code in uast.py for other combinations
  - N_base is the number of times the base model is fine-tuned with different seeds; the run with the best validation loss is kept. This is fast and effective when fine-tuning with few labels, since individual runs exhibit high variance.
  - other hyper-parameters are declared in run_easy.py
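
The parameters above can be illustrated with a short sketch. It is not taken from uast.py: the function names are hypothetical, the rounding rule for valid_split and the example value 0.5 are assumptions, and `bald_score` uses the standard BALD definition (entropy of the mean prediction minus the mean entropy over stochastic forward passes), which may differ in detail from what the code does.

```python
import math
from typing import Dict, List, Sequence, Tuple


def split_counts(sup_labels: int, valid_split: float) -> Tuple[int, int]:
    """Per-class (train, validation) counts. The rounding rule is an assumption."""
    n_valid = int(round(sup_labels * valid_split))
    return sup_labels - n_valid, n_valid


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bald_score(passes: Sequence[Sequence[float]]) -> float:
    """BALD for one sample: H[mean_t p_t] - mean_t H[p_t].

    `passes` holds the class probabilities from T stochastic forward passes
    (for example with dropout active). The score is near zero when the passes
    agree and grows when they disagree.
    """
    n_passes = len(passes)
    n_classes = len(passes[0])
    mean_probs: List[float] = [
        sum(p[c] for p in passes) / n_passes for c in range(n_classes)
    ]
    expected_entropy = sum(entropy(p) for p in passes) / n_passes
    return entropy(mean_probs) - expected_entropy


def select_best_run(valid_losses: Dict[int, float]) -> int:
    """Return the seed whose fine-tuned base model has the lowest validation loss."""
    return min(valid_losses, key=valid_losses.get)


# Illustrative values only:
train_n, valid_n = split_counts(60, 0.5)             # 30 train, 30 validation per class
agree = bald_score([[0.9, 0.1], [0.9, 0.1]])         # passes agree, score is 0
disagree = bald_score([[1.0, 0.0], [0.0, 1.0]])      # passes disagree, score is ln(2)
best_seed = select_best_run({1: 0.7, 2: 0.4, 3: 0.9})  # seed 2
```

How such scores are turned into "easy" or "difficult" selections, and how class selection and sample weights are applied, is defined in uast.py.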


Data:
  - data_directory contains the data files for each task: train, test, labels and unlabeled data
  - bert_output contains the pre-trained language model checkpoints for each task. We start from the pre-trained BERT checkpoint (uncased_L-12_H-768_A-12) and continue pre-training on the unlabeled data for each task.
  - vocab is the WordPiece vocabulary file for the above BERT checkpoint, included in the directory of each task
  - The data is uploaded to https://drive.google.com/drive/folders/1KzUdbRzBh3gzPx-HoIGalTBC20Sz9x6J?usp=sharing
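
Before launching a run, it can help to confirm that a task directory has the expected files. The sketch below checks only train.tsv and test.tsv, the two file names given in the parameter notes. The helper name `missing_task_files` is hypothetical, and the names of the unlabeled, label, vocab and checkpoint files are not assumed here.

```python
import os
from typing import List

# File names taken from the description of the task parameter.
REQUIRED_FILES = ("train.tsv", "test.tsv")


def missing_task_files(data_dir: str, task: str) -> List[str]:
    """Return the required files that are absent from data_dir/task."""
    task_dir = os.path.join(data_dir, task)
    return [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(task_dir, name))
    ]


# Example call (hypothetical path):
# missing = missing_task_files("/path/to/data_directory", "SST-2")
# if missing:
#     raise FileNotFoundError(f"SST-2 is missing: {missing}")
```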