# Tasks Assessing Protein Embeddings (TAPE)

Code for the TAPE paper.

## Organization

The repository is organized into several modules:

- `tasks` contains all the tasks in the TAPE paper. Tasks are built using the `TaskBuilder` class, located in `tasks/TaskBuilder.py`. The loss function and data loading for each task are defined within that task's class.
- `models` contains all the models in the TAPE paper. Models are built using the `ModelBuilder` class, located in `models/ModelBuilder.py`.
- `data_utils` contains serialization and deserialization functions for all datasets in TAPE.
- `experiments` contains the `rinokeras` experiment class, which holds the training hyperparameters (learning rate, gradient clipping, etc.).
- `losses` has helper functions for computing loss and accuracy simultaneously.
- `analysis` has helper functions for loading and cleaning the results of saved experiments.
- `__main__` runs the actual experiments.

## Data

We make all data available on AWS at [http://proteindata.s3.amazonaws.com/tape_data.tar.gz](http://proteindata.s3.amazonaws.com/tape_data.tar.gz). The compressed archive is around 7GB and expands to around 50GB. The data can be downloaded and extracted with

    wget -c http://proteindata.s3.amazonaws.com/tape_data.tar.gz
    mkdir -p tape  # tar -C requires the target directory to exist
    tar -xzf tape_data.tar.gz -C tape
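For scripted setups, the same download-and-extract step can be sketched in Python using only the standard library. The function names `download` and `extract` and the `tape` destination directory are illustrative assumptions, not part of the repository. Unlike `wget -c`, this sketch does not resume a partial download; it only skips the download when the archive file is already present.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "http://proteindata.s3.amazonaws.com/tape_data.tar.gz"


def download(url: str, archive: Path) -> Path:
    """Download `url` to `archive` unless that file already exists."""
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    return archive


def extract(archive: Path, dest: Path) -> None:
    """Extract a gzipped tarball into `dest`, creating the directory first."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


# Example usage (hypothetical paths; downloads roughly 7GB):
# extract(download(DATA_URL, Path("tape_data.tar.gz")), Path("tape"))
```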

## Code Setup

`tape` can be installed via

    pip install -e .

If you have any issues installing the requirements, they are listed in the `setup.py` file.

## Usage

We use [Sacred](https://sacred.readthedocs.io/en/latest/index.html) to configure experiments and store logging information.

Sacred options are specified by running `python -m tape with <args>`. A number of common configuration options are implemented as named configs. For example, to run the `transformer` model on the `masked_language_modeling` task, simply run

    python -m tape with model=transformer tasks=masked_language_modeling

Additional arguments can be specified by adding e.g. `transformer.n_layers=6`, `training.learning_rate=1e-4`, `gpu.device=0,1,2`, etc.
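When launching several runs from a script, it can help to assemble the `with` arguments programmatically. The following is a minimal sketch under that assumption; `build_command` and `launch` are hypothetical helpers, not part of `tape`, and the override keys are the ones mentioned above.

```python
import subprocess
import sys
from typing import Dict, List


def build_command(overrides: Dict[str, object]) -> List[str]:
    """Return the argument list for `python -m tape with key=value ...`."""
    args = [sys.executable, "-m", "tape", "with"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


def launch(overrides: Dict[str, object]) -> subprocess.CompletedProcess:
    """Run one experiment and raise if the process exits with an error."""
    return subprocess.run(build_command(overrides), check=True)


command = build_command({
    "model": "transformer",
    "tasks": "masked_language_modeling",
    "transformer.n_layers": 6,
    "training.learning_rate": "1e-4",
})
# Show the command without the interpreter path.
print(" ".join(command[1:]))
```

Passing the arguments as a list avoids shell quoting issues, which matters for values that contain commas or spaces.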

The tasks available to be run are specified in `tape/tasks/TaskBuilder.py`. Available models are specified in `tape/models/ModelBuilder.py`.

## Results

Results will be stored in `tape/results`. Each run will be placed in a timestamped directory. `tape` sources will automatically be saved, along with the config and per-epoch metrics.
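To inspect the most recent run, one option is to pick the newest subdirectory of `tape/results`. The sketch below assumes each run is a direct subdirectory and orders runs by modification time, since the exact timestamp format of the directory names is not specified here. `list_runs` and `latest_run` are hypothetical helpers, and the files inside a run directory are simply listed rather than parsed.

```python
from pathlib import Path
from typing import List, Optional


def list_runs(results_dir: Path) -> List[Path]:
    """Return run directories under `results_dir`, oldest first by mtime."""
    if not results_dir.is_dir():
        return []
    runs = [p for p in results_dir.iterdir() if p.is_dir()]
    return sorted(runs, key=lambda p: p.stat().st_mtime)


def latest_run(results_dir: Path) -> Optional[Path]:
    """Return the most recently modified run directory, or None if absent."""
    runs = list_runs(results_dir)
    return runs[-1] if runs else None


latest = latest_run(Path("tape/results"))
if latest is not None:
    # List the saved sources, config, and metrics files for that run.
    for item in sorted(latest.iterdir()):
        print(item.name)
```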

## Other notes

There are some discrepancies between the task and dataset names used in this code and those used in our paper. We plan to fix these discrepancies before publication. Until then, note that the 'Stability' task is referred to as 'denovo_engineering', and that the 'Fluorescence' task is referred to as 'gfp3'.
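Until the names are aligned, a small lookup table can translate between the two. This is an illustrative sketch: `PAPER_TO_CODE` and `code_task_name` are hypothetical names, and only the two mappings stated above are included.

```python
# Paper task name (lowercased) -> name used in this repository.
PAPER_TO_CODE = {
    "stability": "denovo_engineering",
    "fluorescence": "gfp3",
}


def code_task_name(paper_name: str) -> str:
    """Map a paper task name to its repository name; pass others through."""
    return PAPER_TO_CODE.get(paper_name.lower(), paper_name)


print(code_task_name("Stability"))
print(code_task_name("Fluorescence"))
```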
