{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quantizing Neural Machine Translation Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We continue our quest to quantize every neural network!  \n",
    "In this chapter: __Google's Neural Machine Translation model__ (GNMT).  \n",
    "A brief summary: using stacked LSTMs and an attention mechanism, this model encodes a sentence into a list of vectors and then decodes it into tokens of the target language, one at a time, until an end token is reached.  \n",
    "To read more, refer to <a id=\"ref-1\" href=\"#cite-wu2016google\">Google's paper</a>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Quantizing Neural Machine Translation Models](#Quantizing-Neural-Machine-Translation-Models)\n",
    "\t* [Getting the resources](#Getting-the-resources)\n",
    "\t* [Preparing the model for quantization](#Preparing-The-Model-For-Quantization)\n",
    "\t* [Loading the model](#Loading-the-model)\n",
    "\t* [Evaluation of the model](#Evaulation-of-the-model)\n",
    "\t* [Quantizing the model](#Quantizing-the-model)\n",
    "\t\t* [Collecting the statistics](#Collecting-the-statistics)\n",
    "\t\t* [Defining the Quantizer](#Defining-the-Quantizer)\n",
    "\t\t* [Quantizing the model](#Quantizing-the-model)\n",
    "\t\t* [Evaluating the quantized model](#Evaluating-the-quantized-model)\n",
    "\t\t* [Finding the right quantization](#Finding-the-right-quantization)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this project, we modified the [`mlperf/training/rnn_translator`](https://github.com/mlperf/training/tree/master/rnn_translator) project to enable quantization of the GNMT model.  \n",
    "The instructions to download and set up the required environment for this task are in `README.md` (located in the current directory).  \n",
    "Download the pretrained model using the command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment the line below to download the pretrained model:\n",
    "#! wget https://zenodo.org/record/2581623/files/model_best.pth"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, you should have everything ready to start quantizing!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing The Model For Quantization\n",
    "\n",
    "To fully quantize the model, we modify it according to the instructions laid out in the Distiller [documentation](https://nervanasystems.github.io/distiller/prepare_model_quant.html). This mostly amounts to making sure that every quantizable operation is invoked via a dedicated PyTorch `Module`. You can compare the code under `seq2seq/models` in this example with the [original](https://github.com/mlperf/training/tree/master/rnn_translator/pytorch/seq2seq/models).\n",
    "\n",
    "For example, in `seq2seq/models/attention.py`, we added the following code to the `__init__` function of the `BahdanauAttention` class:\n",
    "\n",
    "```python\n",
    "# Adding submodules for basic ops to allow quantization:\n",
    "self.eltwiseadd_qk = EltwiseAdd()\n",
    "self.eltwiseadd_norm_bias = EltwiseAdd()\n",
    "self.eltwisemul_norm_scaler = EltwiseMult()\n",
    "self.matmul_score = Matmul()\n",
    "self.context_matmul = BatchMatmul()\n",
    "```\n",
    "\n",
    "We're creating modules for operations that were invoked directly in the `forward` function in the original code. This enables Distiller to detect these operations and replace them with quantized counterparts.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standard library\n",
    "import logging\n",
    "import os\n",
    "import subprocess\n",
    "import warnings\n",
    "from ast import literal_eval\n",
    "from copy import deepcopy\n",
    "from itertools import takewhile, zip_longest\n",
    "\n",
    "# Third-party\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from tqdm import tqdm\n",
    "\n",
    "import distiller\n",
    "from distiller.modules import DistillerLSTM\n",
    "from distiller.quantization import PostTrainLinearQuantizer\n",
    "\n",
    "# GNMT example code (utilities imported from the example itself)\n",
    "import seq2seq.data.config as config\n",
    "from seq2seq import models\n",
    "from seq2seq.data.dataset import ParallelDataset\n",
    "from seq2seq.inference.inference import Translator\n",
    "from seq2seq.utils import AverageMeter\n",
    "from translate import grouper, write_output, checkpoint_from_distributed, unwrap_distributed\n",
    "\n",
    "logging.disable(logging.INFO)  # Disables mlperf output\n",
    "\n",
    "# Show warnings raised by Distiller's quantization modules\n",
    "warnings.filterwarnings(action='default', module='distiller.quantization')\n",
    "warnings.filterwarnings(action='default', module='distiller.quantization.range_linear')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define some constants\n",
    "batch_first=True\n",
    "batch_size=128\n",
    "beam_size=10\n",
    "cov_penalty_factor=0.1\n",
    "dataset_dir='./data'\n",
    "input='./data/newstest2014.tok.clean.bpe.32000.en'\n",
    "len_norm_const=5.0\n",
    "len_norm_factor=0.6\n",
    "max_seq_len=80\n",
    "model='model_best.pth'\n",
    "output='output_file'\n",
    "print_freq=1\n",
    "reference='./data/newstest2014.de'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loading the model\n",
    "checkpoint = torch.load('./model_best.pth', map_location={'cuda:0': 'cpu'})\n",
    "vocab_size = checkpoint['tokenizer'].vocab_size\n",
    "model_config = dict(vocab_size=vocab_size, math=checkpoint['config'].math,\n",
    "                    **literal_eval(checkpoint['config'].model_config))\n",
    "model_config['batch_first'] = batch_first\n",
    "model = models.GNMT(**model_config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GNMT(\n",
       "  (encoder): ResidualRecurrentEncoder(\n",
       "    (rnn_layers): ModuleList(\n",
       "      (0): LSTM(1024, 1024, batch_first=True, bidirectional=True)\n",
       "      (1): LSTM(2048, 1024, batch_first=True)\n",
       "      (2): LSTM(1024, 1024, batch_first=True)\n",
       "      (3): LSTM(1024, 1024, batch_first=True)\n",
       "    )\n",
       "    (dropout): Dropout(p=0.2)\n",
       "    (embedder): Embedding(32317, 1024, padding_idx=0)\n",
       "    (eltwiseadd_residuals): ModuleList(\n",
       "      (0): EltwiseAdd()\n",
       "      (1): EltwiseAdd()\n",
       "    )\n",
       "  )\n",
       "  (decoder): ResidualRecurrentDecoder(\n",
       "    (att_rnn): RecurrentAttention(\n",
       "      (rnn): LSTM(1024, 1024, batch_first=True)\n",
       "      (attn): BahdanauAttention(\n",
       "        (linear_q): Linear(in_features=1024, out_features=1024, bias=False)\n",
       "        (linear_k): Linear(in_features=1024, out_features=1024, bias=False)\n",
       "        (dropout): Dropout(p=0)\n",
       "        (eltwiseadd_qk): EltwiseAdd()\n",
       "        (eltwiseadd_norm_bias): EltwiseAdd()\n",
       "        (eltwisemul_norm_scaler): EltwiseMult()\n",
       "        (tanh): Tanh()\n",
       "        (matmul_score): Matmul()\n",
       "        (softmax_att): Softmax()\n",
       "        (context_matmul): BatchMatmul()\n",
       "      )\n",
       "      (dropout): Dropout(p=0)\n",
       "    )\n",
       "    (rnn_layers): ModuleList(\n",
       "      (0): LSTM(2048, 1024, batch_first=True)\n",
       "      (1): LSTM(2048, 1024, batch_first=True)\n",
       "      (2): LSTM(2048, 1024, batch_first=True)\n",
       "    )\n",
       "    (embedder): Embedding(32317, 1024, padding_idx=0)\n",
       "    (classifier): Classifier(\n",
       "      (classifier): Linear(in_features=1024, out_features=32317, bias=True)\n",
       "    )\n",
       "    (dropout): Dropout(p=0.2)\n",
       "    (eltwiseadd_residuals): ModuleList(\n",
       "      (0): EltwiseAdd()\n",
       "      (1): EltwiseAdd()\n",
       "    )\n",
       "    (attention_concats): ModuleList(\n",
       "      (0): Concat()\n",
       "      (1): Concat()\n",
       "      (2): Concat()\n",
       "    )\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "state_dict = checkpoint['state_dict']\n",
    "if checkpoint_from_distributed(state_dict):\n",
    "    state_dict = unwrap_distributed(state_dict)\n",
    "\n",
    "model.load_state_dict(state_dict)\n",
    "torch.cuda.set_device(0)\n",
    "model = model.cuda()\n",
    "model.eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation of the model<a id=\"Evaulation-of-the-model\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = checkpoint['tokenizer']\n",
    "\n",
    "\n",
    "test_data = ParallelDataset(\n",
    "    src_fname=os.path.join(dataset_dir, config.SRC_TEST_FNAME),\n",
    "    tgt_fname=os.path.join(dataset_dir, config.TGT_TEST_FNAME),\n",
    "    tokenizer=tokenizer,\n",
    "    min_len=0,\n",
    "    max_len=150,\n",
    "    sort=False)\n",
    "\n",
    "def get_loader():\n",
    "    return test_data.get_loader(batch_size=batch_size,\n",
    "                                   batch_first=True,\n",
    "                                   shuffle=False,\n",
    "                                   num_workers=0,\n",
    "                                   drop_last=False,\n",
    "                                   distributed=False)\n",
    "def get_translator(model):\n",
    "    return Translator(model,\n",
    "                       tokenizer,\n",
    "                       beam_size=beam_size,\n",
    "                       max_seq_len=max_seq_len,\n",
    "                       len_norm_factor=len_norm_factor,\n",
    "                       len_norm_const=len_norm_const,\n",
    "                       cov_penalty_factor=cov_penalty_factor,\n",
    "                       cuda=True)\n",
    "torch.cuda.empty_cache()"
   ]
  },
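  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate(model, test_path, num_batches=None):\n",
    "    \"\"\"Translate the test set, write the translations to `test_path` and print BLEU.\n",
    "\n",
    "    If `num_batches` is smaller than the number of batches in the test set,\n",
    "    only that many batches are translated and the BLEU computation is skipped.\n",
    "    \"\"\"\n",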
    "    test_file = open(test_path, 'w', encoding='UTF-8')\n",
    "    model.eval()\n",
    "    translator = get_translator(model)\n",
    "    stats = {}\n",
    "    loader = get_loader()\n",
    "    total_batches = len(loader)\n",
    "    if num_batches is None:\n",
    "        num_batches = total_batches\n",
    "    num_batches = min(num_batches, total_batches)\n",
    "    loader = iter(loader)\n",
    "    for i in tqdm(range(num_batches)):\n",
    "        src, tgt, indices = next(loader)\n",
    "        src, src_length = src\n",
    "        if translator.batch_first:\n",
    "            batch_size = src.size(0)\n",
    "        else:\n",
    "            batch_size = src.size(1)\n",
    "        bos = [translator.insert_target_start] * (batch_size * beam_size)\n",
    "        bos = torch.LongTensor(bos)\n",
    "        if translator.batch_first:\n",
    "            bos = bos.view(-1, 1)\n",
    "        else:\n",
    "            bos = bos.view(1, -1)\n",
    "        src_length = torch.LongTensor(src_length)\n",
    "        stats['total_enc_len'] = int(src_length.sum())\n",
    "        src = src.cuda()\n",
    "        src_length = src_length.cuda()\n",
    "        bos = bos.cuda()\n",
    "        with torch.no_grad():\n",
    "            context = translator.model.encode(src, src_length)\n",
    "            context = [context, src_length, None]\n",
    "            if beam_size == 1:\n",
    "                generator = translator.generator.greedy_search\n",
    "            else:\n",
    "                generator = translator.generator.beam_search\n",
    "            preds, lengths, counter = generator(batch_size, bos, context)\n",
    "        stats['total_dec_len'] = lengths.sum().item()\n",
    "        stats['iters'] = counter\n",
    "        preds = preds.cpu()\n",
    "        lengths = lengths.cpu()\n",
    "        output = []\n",
    "        for idx, pred in enumerate(preds):\n",
    "            end = lengths[idx] - 1\n",
    "            pred = pred[1: end]\n",
    "            pred = pred.tolist()\n",
    "            out = translator.tok.detokenize(pred)\n",
    "            output.append(out)\n",
    "        output = [output[indices.index(i)] for i in range(len(output))]\n",
    "        for line in output:\n",
    "            test_file.write(line)\n",
    "            test_file.write('\\n')\n",
    "        total_tokens = stats['total_dec_len'] + stats['total_enc_len']\n",
    "    test_file.close()\n",
    "    if num_batches < total_batches:\n",
    "        print(\"Can't calculate BLEU when evaluating partial dataset\")\n",
    "        return\n",
    "    # run moses detokenizer\n",
    "    print(\"Calculating BLEU score...\")\n",
    "    detok_path = os.path.join(dataset_dir, config.DETOKENIZER)\n",
    "    detok_test_path = test_path + '.detok'\n",
    "\n",
    "    with open(detok_test_path, 'w') as detok_test_file, \\\n",
    "            open(test_path, 'r') as test_file:\n",
    "        subprocess.run(['perl', detok_path], stdin=test_file,\n",
    "                       stdout=detok_test_file, stderr=subprocess.DEVNULL)\n",
    "    # run sacrebleu\n",
    "    reference_path = os.path.join(dataset_dir,\n",
    "                                  config.TGT_TEST_TARGET_FNAME)\n",
    "    sacrebleu = subprocess.run(['sacrebleu --input {} {} --score-only -lc --tokenize intl'.\n",
    "                                format(detok_test_path, reference_path)],\n",
    "                               stdout=subprocess.PIPE, shell=True)\n",
    "    bleu = float(sacrebleu.stdout.strip())\n",
    "    print('BLEU on test dataset: {}'.format(bleu))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [00:48<00:00,  1.77s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 22.16\n"
     ]
    }
   ],
   "source": [
    "evaluate(model, output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quantizing the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted above, we modified the `mlperf` model into a modular implementation so that every operation in the graph can be quantized.  \n",
    "However, the default `nn.LSTM` is implemented in C++/CUDA, so its internal operations are not exposed as PyTorch modules and cannot be quantized individually. We therefore convert each `nn.LSTM` to a `DistillerLSTM`, an entirely modular implementation of the LSTM that is functionally identical to the original `nn.LSTM`.  \n",
    "To convert a single `nn.LSTM`, call `DistillerLSTM.from_pytorch_impl`; to convert every LSTM in a model, call `convert_model_to_distiller_lstm`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [01:54<00:00,  4.02s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 22.16\n"
     ]
    }
   ],
   "source": [
    "from distiller.modules import convert_model_to_distiller_lstm\n",
    "model = convert_model_to_distiller_lstm(model)\n",
    "evaluate(model, output)"
   ]
  },
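  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate what a modular LSTM means, here is a minimal scalar sketch of a single LSTM time step, assuming the standard LSTM equations. It is not the `DistillerLSTM` code: the function `lstm_cell_step` and the weights are hypothetical and chosen only for illustration. The point is that every multiply, add and activation is a separate step that a quantizer could wrap.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def sigmoid(x):\n",
    "    return 1.0 / (1.0 + math.exp(-x))\n",
    "\n",
    "\n",
    "def lstm_cell_step(x, h, c, w_x, w_h, b):\n",
    "    # Pre-activations of the input, forget, cell and output gates:\n",
    "    # two multiplies and two adds per gate.\n",
    "    pre = {k: w_x[k] * x + w_h[k] * h + b[k] for k in ('i', 'f', 'g', 'o')}\n",
    "    i, f, o = sigmoid(pre['i']), sigmoid(pre['f']), sigmoid(pre['o'])\n",
    "    g = math.tanh(pre['g'])\n",
    "    c_next = f * c + i * g          # element-wise multiplies and an add\n",
    "    h_next = o * math.tanh(c_next)  # activation, then element-wise multiply\n",
    "    return h_next, c_next\n",
    "\n",
    "\n",
    "# Hypothetical scalar weights, chosen only for illustration.\n",
    "w_x = {'i': 0.5, 'f': 0.5, 'g': 0.5, 'o': 0.5}\n",
    "w_h = {'i': 0.1, 'f': 0.1, 'g': 0.1, 'o': 0.1}\n",
    "b = {'i': 0.0, 'f': 0.0, 'g': 0.0, 'o': 0.0}\n",
    "h, c = lstm_cell_step(1.0, 0.0, 0.0, w_x, w_h, b)\n",
    "print(h, c)\n",
    "```\n",
    "\n",
    "In the converted model the same steps operate on tensors. As the cell above suggests, the BLEU score is unchanged after conversion, while evaluation is slower than with the fused `nn.LSTM` kernel."
   ]
  },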
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Collecting the statistics"
   ]
  },
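  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The quantizer uses activation statistics to set the quantization range of each layer's activations. We collect these statistics by running the evaluation loop through `collect_quant_stats`, which uses a `QuantCalibrationStatsCollector` internally, and cache them in a YAML file so that the collection runs only once:"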
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [46:47<00:00, 102.09s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "  0%|          | 0/24 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BLEU on test dataset: 22.16\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [12:23<00:00, 26.94s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 22.16\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from distiller.data_loggers import collect_quant_stats\n",
    "\n",
    "stats_file = './acts_quantization_stats.yaml'\n",
    "\n",
    "if not os.path.isfile(stats_file): # Collect stats.\n",
    "    model_copy = deepcopy(model)\n",
    "    distiller.utils.assign_layer_fq_names(model_copy)\n",
    "    \n",
    "    def eval_for_stats(model):\n",
    "        evaluate(model, output + '.temp', num_batches=None)\n",
    "    collect_quant_stats(model_copy, eval_for_stats, save_dir='.')\n",
    "    del model_copy\n",
    "    torch.cuda.empty_cache()"
   ]
  },
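  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The role of the collected statistics can be sketched with plain Python. This is a minimal sketch, assuming the common symmetric scheme in which the scale is `2**(num_bits - 1) - 1` divided by the largest absolute value observed. The helper names are hypothetical, and Distiller's implementation may differ in details such as clipping and rounding.\n",
    "\n",
    "```python\n",
    "def symmetric_scale(min_val, max_val, num_bits=8):\n",
    "    # The largest magnitude seen during calibration sets the saturation value.\n",
    "    sat_val = max(abs(min_val), abs(max_val))\n",
    "    # Signed n-bit integers reach 2**(n-1) - 1 on the positive side.\n",
    "    return (2 ** (num_bits - 1) - 1) / sat_val\n",
    "\n",
    "\n",
    "def quantize(x, scale, num_bits=8):\n",
    "    # Scale, round to the nearest integer, then clamp to the signed range.\n",
    "    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1\n",
    "    return max(q_min, min(q_max, round(x * scale)))\n",
    "\n",
    "\n",
    "def dequantize(q, scale):\n",
    "    # Map the integer back to an approximate float value.\n",
    "    return q / scale\n",
    "\n",
    "\n",
    "# Assumed calibration range of [-1, 1], as for a Tanh output.\n",
    "scale = symmetric_scale(-1.0, 1.0)\n",
    "q = quantize(0.3, scale)\n",
    "print(scale, q, dequantize(q, scale))\n",
    "```\n",
    "\n",
    "With a range of `[-1, 1]` the 8-bit scale is 127. A wider observed range gives a smaller scale and therefore a coarser resolution, which is why the statistics matter."
   ]
  },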
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining the Quantizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A Distiller `Quantizer` object replaces each submodule in a model with its quantized counterpart, using a `replacement_factory`.  \n",
    "`Quantizer.replacement_factory` is a dictionary that maps a module type (e.g. `nn.Linear` or `nn.Conv2d`) to a function. This function takes a module and a quantization configuration, and returns a quantized version of that module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Replacing 'Conv2d' modules using 'replace_param_layer' function\n",
      "Replacing 'Conv3d' modules using 'replace_param_layer' function\n",
      "Replacing 'Linear' modules using 'replace_param_layer' function\n",
      "Replacing 'Concat' modules using 'replace_non_param_layer' function\n",
      "Replacing 'EltwiseAdd' modules using 'replace_non_param_layer' function\n",
      "Replacing 'EltwiseMult' modules using 'replace_non_param_layer' function\n",
      "Replacing 'Matmul' modules using 'replace_non_param_layer' function\n",
      "Replacing 'BatchMatmul' modules using 'replace_non_param_layer' function\n",
      "Replacing 'Embedding' modules using 'replace_embedding' function\n"
     ]
    }
   ],
   "source": [
    "# Basic quantizer definition\n",
    "quantizer = PostTrainLinearQuantizer(deepcopy(model),\n",
    "                                     mode=\"SYMMETRIC\",  # As suggested in the GNMT paper\n",
    "                                     model_activation_stats=stats_file)\n",
    "# We take a look at the replacement factory:\n",
    "for t, rf in quantizer.replacement_factory.items():\n",
    "    if rf is not None:\n",
    "        print(\"Replacing '{}' modules using '{}' function\".format(t.__name__, rf.__name__))"
   ]
  },
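  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionary lookup described above can be sketched as follows. This is a toy illustration, not Distiller's code: the classes `Linear`, `Dropout` and `QuantWrapper` and the function `quantize_modules` are hypothetical stand-ins, and the model is a flat dictionary rather than a module tree.\n",
    "\n",
    "```python\n",
    "class Linear:\n",
    "    # Hypothetical stand-in for a float layer such as nn.Linear.\n",
    "    def __init__(self, name):\n",
    "        self.name = name\n",
    "\n",
    "\n",
    "class Dropout:\n",
    "    # Hypothetical layer type that has no quantized counterpart.\n",
    "    def __init__(self, name):\n",
    "        self.name = name\n",
    "\n",
    "\n",
    "class QuantWrapper:\n",
    "    # Hypothetical quantized wrapper that keeps a reference to the original module.\n",
    "    def __init__(self, wrapped, num_bits):\n",
    "        self.wrapped = wrapped\n",
    "        self.num_bits = num_bits\n",
    "\n",
    "\n",
    "def replace_linear(module, num_bits):\n",
    "    return QuantWrapper(module, num_bits)\n",
    "\n",
    "\n",
    "# Maps a module type to the function that builds its replacement.\n",
    "replacement_factory = {Linear: replace_linear, Dropout: None}\n",
    "\n",
    "\n",
    "def quantize_modules(modules, num_bits=8):\n",
    "    result = {}\n",
    "    for name, module in modules.items():\n",
    "        replace_fn = replacement_factory.get(type(module))\n",
    "        # Types without a replacement function are left untouched.\n",
    "        result[name] = replace_fn(module, num_bits) if replace_fn else module\n",
    "    return result\n",
    "\n",
    "\n",
    "model_dict = {'fc': Linear('fc'), 'drop': Dropout('drop')}\n",
    "quantized = quantize_modules(model_dict)\n",
    "print({name: type(m).__name__ for name, m in quantized.items()})\n",
    "```\n",
    "\n",
    "In this sketch, module types that map to `None` or are missing from the dictionary stay as they are, in the same spirit as the `if rf is not None` check in the cell above."
   ]
  },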
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quantizing the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quantize the model by calling `quantizer.prepare_model()`, which replaces each supported module in `quantizer.model` with its quantized wrapper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: Logging before flag parsing goes to stderr.\n",
      "W1028 15:38:47.074526 140475898566400 range_linear.py:1275] /data2/users/gjacob/work/distiller/distiller/quantization/range_linear.py:1275: UserWarning: Model contains a bidirectional DistillerLSTM module. Automatic BN folding and statistics optimization based on tracing is not yet supported for models containing such modules.\n",
      "Will perform specific optimization for the DistillerLSTM modules, but any other potential opportunities for optimization in the model will be ignored.\n",
      "  'opportunities for optimization in the model will be ignored.', UserWarning)\n",
      "\n",
      "W1028 15:38:47.385626 140475898566400 quantizer.py:287] /data2/users/gjacob/work/distiller/distiller/quantization/quantizer.py:287: UserWarning: Module 'decoder.embedder' references to same module as 'encoder.embedder'. Replacing with reference the same wrapper.\n",
      "  UserWarning)\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GNMT(\n",
       "  (encoder): ResidualRecurrentEncoder(\n",
       "    (rnn_layers): ModuleList(\n",
       "      (0): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=True)\n",
       "      (1): DistillerLSTM(2048, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "      (2): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "      (3): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "    )\n",
       "    (dropout): Dropout(p=0.2)\n",
       "    (embedder): RangeLinearEmbeddingWrapper(\n",
       "      (wrapped_module): Embedding(32317, 1024, padding_idx=0)\n",
       "    )\n",
       "    (eltwiseadd_residuals): ModuleList(\n",
       "      (0): RangeLinearQuantEltwiseAddWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=63.530453, output_zero_point=0.000000\n",
       "        (wrapped_module): EltwiseAdd()\n",
       "      )\n",
       "      (1): RangeLinearQuantEltwiseAddWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=42.430592, output_zero_point=0.000000\n",
       "        (wrapped_module): EltwiseAdd()\n",
       "      )\n",
       "    )\n",
       "  )\n",
       "  (decoder): ResidualRecurrentDecoder(\n",
       "    (att_rnn): RecurrentAttention(\n",
       "      (rnn): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "      (attn): BahdanauAttention(\n",
       "        (linear_q): RangeLinearQuantParamLayerWrapper(\n",
       "          weights_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=4.219996, output_zero_point=0.000000\n",
       "          weights_scale=53.064903, weights_zero_point=0.000000\n",
       "          (wrapped_module): Linear(in_features=1024, out_features=1024, bias=False)\n",
       "        )\n",
       "        (linear_k): RangeLinearQuantParamLayerWrapper(\n",
       "          weights_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=4.546572, output_zero_point=0.000000\n",
       "          weights_scale=40.979481, weights_zero_point=0.000000\n",
       "          (wrapped_module): Linear(in_features=1024, out_features=1024, bias=False)\n",
       "        )\n",
       "        (dropout): Dropout(p=0)\n",
       "        (eltwiseadd_qk): RangeLinearQuantEltwiseAddWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=2.974982, output_zero_point=0.000000\n",
       "          (wrapped_module): EltwiseAdd()\n",
       "        )\n",
       "        (eltwiseadd_norm_bias): RangeLinearQuantEltwiseAddWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=2.961179, output_zero_point=0.000000\n",
       "          (wrapped_module): EltwiseAdd()\n",
       "        )\n",
       "        (eltwisemul_norm_scaler): RangeLinearQuantEltwiseMultWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=611.199402, output_zero_point=0.000000\n",
       "          (wrapped_module): EltwiseMult()\n",
       "        )\n",
       "        (tanh): RangeLinearFakeQuantWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=False\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=127.000000, output_zero_point=0.000000\n",
       "          (wrapped_module): Tanh()\n",
       "        )\n",
       "        (matmul_score): RangeLinearQuantMatmulWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=6.327397, output_zero_point=0.000000\n",
       "          (wrapped_module): Matmul()\n",
       "        )\n",
       "        (softmax_att): RangeLinearFakeQuantWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=False\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=128.141968, output_zero_point=0.000000\n",
       "          (wrapped_module): Softmax()\n",
       "        )\n",
       "        (context_matmul): RangeLinearQuantMatmulWrapper(\n",
       "          output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "          requires_quantized_inputs=True\n",
       "            inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "          scale_approx_mult_bits=None\n",
       "          preset_activation_stats=True\n",
       "            output_scale=46.205608, output_zero_point=0.000000\n",
       "          (wrapped_module): BatchMatmul()\n",
       "        )\n",
       "      )\n",
       "      (dropout): Dropout(p=0)\n",
       "    )\n",
       "    (rnn_layers): ModuleList(\n",
       "      (0): DistillerLSTM(2048, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "      (1): DistillerLSTM(2048, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "      (2): DistillerLSTM(2048, 1024, num_layers=1, dropout=0.00, bidirectional=False)\n",
       "    )\n",
       "    (embedder): RangeLinearEmbeddingWrapper(\n",
       "      (wrapped_module): Embedding(32317, 1024, padding_idx=0)\n",
       "    )\n",
       "    (classifier): Classifier(\n",
       "      (classifier): RangeLinearQuantParamLayerWrapper(\n",
       "        weights_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=6.324166, output_zero_point=0.000000\n",
       "        weights_scale=40.831116, weights_zero_point=0.000000\n",
       "        (wrapped_module): Linear(in_features=1024, out_features=32317, bias=True)\n",
       "      )\n",
       "    )\n",
       "    (dropout): Dropout(p=0.2)\n",
       "    (eltwiseadd_residuals): ModuleList(\n",
       "      (0): RangeLinearQuantEltwiseAddWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=63.689533, output_zero_point=0.000000\n",
       "        (wrapped_module): EltwiseAdd()\n",
       "      )\n",
       "      (1): RangeLinearQuantEltwiseAddWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=42.814404, output_zero_point=0.000000\n",
       "        (wrapped_module): EltwiseAdd()\n",
       "      )\n",
       "    )\n",
       "    (attention_concats): ModuleList(\n",
       "      (0): RangeLinearQuantConcatWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=46.205608, output_zero_point=0.000000\n",
       "        (wrapped_module): Concat()\n",
       "      )\n",
       "      (1): RangeLinearQuantConcatWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=46.205608, output_zero_point=0.000000\n",
       "        (wrapped_module): Concat()\n",
       "      )\n",
       "      (2): RangeLinearQuantConcatWrapper(\n",
       "        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)\n",
       "        requires_quantized_inputs=True\n",
       "          inputs_quant_auto_fallback=True, forced_quant_settings_for_inputs=None\n",
       "        scale_approx_mult_bits=None\n",
       "        preset_activation_stats=True\n",
       "          output_scale=46.205608, output_zero_point=0.000000\n",
       "        (wrapped_module): Concat()\n",
       "      )\n",
       "    )\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy_input = (torch.ones(1, 2).to(dtype=torch.long),\n",
    "               torch.ones(1).to(dtype=torch.long),\n",
    "               torch.ones(1, 2).to(dtype=torch.long))\n",
    "quantizer.prepare_model(dummy_input)\n",
    "quantizer.model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you'd like to know how these functions replace the modules - I recommend reading the source code for them in  \n",
    "`{DISTILLER_ROOT}/distiller/quantization/range_linear.py:PostTrainLinearQuantizer`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating the quantized model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [15:15<00:00, 33.37s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 18.04\n"
     ]
    }
   ],
   "source": [
    "#torch.cuda.empty_cache()\n",
    "evaluate(quantizer.model, output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding the right quantization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see here, we quantized our model entirely and it lost some accuracy, so we want to apply more strategies to quantize better.  \n",
    "Symmetric quantization means our range is the biggest we can hold our activations in:\n",
    "$$\n",
    "    M = \\max \\{ |\\text{acts}|\\},\\, \\text{range}_{symmetric} = [-M, M]\n",
    "$$\n",
    "This way we waste resolution. However, if we use assymetric quantization - we may get better results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [17:19<00:00, 33.57s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 18.4\n"
     ]
    }
   ],
   "source": [
    "# Basic quantizer defintion\n",
    "quantizer = PostTrainLinearQuantizer(deepcopy(model), \n",
    "                                    mode=\"ASYMMETRIC_SIGNED\",  \n",
    "                                    model_activation_stats=stats_file)\n",
    "quantizer.prepare_model()\n",
    "evaluate(quantizer.model, output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here - we quantized asymmetrically, meaning our range still holds all the activations, but it's smaller than in the symmetrical case.  \n",
    "The formula is:\n",
    "$$\n",
    "    \\text{range}_{asymmetric} = \\left[\\min\\{ \\text{acts}\\}, \\max \\{ \\text{acts}\\}\\right] \n",
    "    \\subset \\text{range}_{symmetric}\n",
    "$$\n",
    "And we indeed got a slightly better result.  \n",
    "However - some part of the activations during the evaluations are outliers, meaning they are way outside the range of most of their buddies. We're going to intercept this in two ways -\n",
    "1. Quantize each channel separately, that way we achieve more accuracy. We'll add the argument `per_channel_wts=True`.\n",
    "2. Limit the quantization range to a smaller one, thus clamping these outliers."
   ]
  },
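  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the resolution argument above concrete, here is a rough sketch. It assumes an idealized $n$-bit quantizer that spreads $2^n - 1$ equal steps over its range (the exact number of levels depends on the implementation), and $\\Delta$ is our own notation for the step size:\n",
    "$$\n",
    "    \\Delta_{symmetric} \\approx \\frac{2M}{2^n - 1},\\quad\n",
    "    \\Delta_{asymmetric} \\approx \\frac{\\max\\{\\text{acts}\\} - \\min\\{\\text{acts}\\}}{2^n - 1}\n",
    "$$\n",
    "For instance, if a layer's activations happen to lie in $[0, M]$ (say, after a ReLU), then with $n = 8$ the symmetric step is about $2M/255$ while the asymmetric step is about $M/255$, i.e. twice the resolution. A single large outlier inflates both steps, which is what the two approaches listed above try to mitigate."
   ]
  },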
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll try using the same technique as in `quantize_lstm.ipynb` - clipping the activations according to the average range recorded for them:  \n",
    "\n",
    "\n",
    "$$\n",
    "    m = \\underset{b\\in\\text{batches}}{\\text{avg}}\\left\\{\\min_{b}\\{\\text{acts}\\}\\right\\},\\,\n",
    "    M = \\underset{b\\in\\text{batches}}{\\text{avg}}\\left\\{\\max_{b}\\{\\text{acts}\\}\\right\\}\n",
    "$$\n",
    "\n",
    "\n",
    "$$\n",
    "    \\text{range}_{clipped} = [m,M] \\subset \\text{range}_{asymmetric} \\subset \\text{range}_{symmetric}\n",
    "$$\n",
    "\n",
    "This is done by specifying `clip_acts=\"AVG\"` in the quantizer. "
   ]
  },
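  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In an idealized setting (an $n$-bit quantizer with $2^n - 1$ equal steps; $x_{clipped}$ and $\\Delta_{clipped}$ are our own notation, not Distiller names), clipping can be sketched as saturating every activation $x$ to $[m, M]$ before quantizing:\n",
    "$$\n",
    "    x_{clipped} = \\min\\{\\max\\{x, m\\}, M\\},\\quad\n",
    "    \\Delta_{clipped} \\approx \\frac{M - m}{2^n - 1}\n",
    "$$\n",
    "The step gets finer for the values inside $[m, M]$, but anything outside the range is saturated to its edge, so the outcome depends on how much information the outliers carry."
   ]
  },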
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [15:55<00:00, 34.44s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 9.44\n"
     ]
    }
   ],
   "source": [
    "# Basic quantizer defintion\n",
    "quantizer = PostTrainLinearQuantizer(deepcopy(model), \n",
    "                                    mode=\"ASYMMETRIC_SIGNED\",  \n",
    "                                    model_activation_stats=stats_file,\n",
    "                                    per_channel_wts=True,\n",
    "                                    clip_acts=\"AVG\")\n",
    "quantizer.prepare_model()\n",
    "evaluate(quantizer.model, output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Oh no! This is bad... turns out that by clamping the outliers we actually \"removed\" useful features from important layers like the attention layer. In the attention layer we have a softmax which relies on high values to pass a correct score of importance of features. Let's try clipping all the other values, except in the attention layer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [15:22<00:00, 33.23s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 16.71\n"
     ]
    }
   ],
   "source": [
    "# No clipping in the attention layer\n",
    "overrides_yaml = \"\"\"\n",
    ".*att_rnn.attn.*:\n",
    "    clip_acts: NONE # Quantize without clipping\n",
    "\"\"\"\n",
    "overrides = distiller.utils.yaml_ordered_load(overrides_yaml)\n",
    "# Basic quantizer defintion\n",
    "quantizer = PostTrainLinearQuantizer(deepcopy(model), \n",
    "                                    mode=\"ASYMMETRIC_SIGNED\",  \n",
    "                                    model_activation_stats=stats_file,\n",
    "                                    overrides=overrides,\n",
    "                                    per_channel_wts=True,\n",
    "                                    clip_acts=\"AVG\")\n",
    "quantizer.prepare_model()\n",
    "evaluate(quantizer.model, output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The accuracy is somewhat \"restored\", by still we would like to get a score as close to the original model as possible. How about leaving the `classifier` asymmetric, without clipping it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 24/24 [16:43<00:00, 39.45s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating BLEU score...\n",
      "BLEU on test dataset: 21.48\n"
     ]
    }
   ],
   "source": [
    "# No clipping in the attention layer and in the final classifier\n",
    "overrides_yaml = \"\"\"\n",
    ".*att_rnn.attn.*:\n",
    "    clip_acts: NONE # Quantize without clipping\n",
    "decoder.classifier.classifier:\n",
    "    clip_acts: NONE # Quantize without clipping\n",
    "\"\"\"\n",
    "overrides = distiller.utils.yaml_ordered_load(overrides_yaml)\n",
    "# Basic quantizer defintion\n",
    "quantizer = PostTrainLinearQuantizer(deepcopy(model), \n",
    "                                    mode=\"ASYMMETRIC_SIGNED\",  \n",
    "                                    model_activation_stats=stats_file,\n",
    "                                    overrides=overrides,\n",
    "                                    per_channel_wts=True,\n",
    "                                    clip_acts=\"AVG\")\n",
    "quantizer.prepare_model()\n",
    "evaluate(quantizer.model, output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, some good results! So now we know better which layers are sensitive to clipping and which are complimented by it.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References\n",
    "\n",
    "<a id=\"cite-wu2016google\"/><sup><a href=#ref-1>[^]</a></sup>Wu, Yonghui and Schuster, Mike and Chen, Zhifeng and Le, Quoc V and Norouzi, Mohammad and Macherey, Wolfgang and Krikun, Maxim and Cao, Yuan and Gao, Qin and Macherey, Klaus and others. 2016. _Google's neural machine translation system: Bridging the gap between human and machine translation_.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--bibtex\n",
    "\n",
    "@article{wu2016google,\n",
    "  title={Google's neural machine translation system: Bridging the gap between human and machine translation},\n",
    "  author={Wu, Yonghui and Schuster, Mike and Chen, Zhifeng and Le, Quoc V and Norouzi, Mohammad and Macherey, Wolfgang and Krikun, Maxim and Cao, Yuan and Gao, Qin and Macherey, Klaus and others},\n",
    "  journal={arXiv preprint arXiv:1609.08144},\n",
    "  year={2016}\n",
    "}\n",
    "\n",
    "-->"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
