The code accompanies the paper **How hard is to distinguish graphs with graph neural networks**, submitted to NeurIPS 2020.


## overview

The file `appendix.pdf` contains technical arguments and additional experimental results.  

Further, the folder contains three sub-directories:

* `code` - python code for building datasets, replicating experiments, and visualizing results 
* `datasets` - saved datasets
* `results` - trained models and performance statistics for all experiments 


## code 

The following jupyter notebooks are provided: 

* `gnn_gluedconnected_visualize.ipynb` and `gnn_gluedtree_visualize.ipynb` allow easy visualization of all experimental results. 

* `gnn_gluedconnected.ipynb` and `gnn_gluedtree.ipynb` perform training for the two datasets, respectively. Each notebook starts with a short demo of how a GNN is trained (intended for quick experimentation). Next comes code that exhaustively trains GNNs and measures their performance for different n, depth, and width. This part takes *a long* time (approx. 2 weeks), so the results have been precomputed (experiment name: main). Finally, there is code for training large-capacity GNNs (this corresponds to Table 2 in the supplementary material). As with the second experiment, the results have already been precomputed and stored for easy examination. Re-running the last experiment should take a few days of GPU time. 

* `glueconnected_generator.ipynb` and `gluedtree_generator.ipynb` build the datasets used in the paper. There is also code for computing basic dataset statistics.
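The exhaustive experiment in the training notebooks varies n, depth, and width. The following is only a minimal sketch of how such a grid could be enumerated while skipping configurations whose results already exist. The function name `sweep_configs`, its parameters, and the example values are hypothetical and are not taken from the notebooks.

```python
from itertools import product


def sweep_configs(ns, depths, widths, done=()):
    """List every (n, depth, width) combination that is not already in `done`.

    Hypothetical helper for illustration; the notebooks may organise the sweep differently.
    """
    done = set(done)
    return [cfg for cfg in product(ns, depths, widths) if cfg not in done]


# Illustrative values only (not the grid used in the paper).
remaining = sweep_configs(ns=[8, 10], depths=[2, 4], widths=[16], done=[(8, 2, 16)])
for n, depth, width in remaining:
    print(f"would train a GNN with n={n}, depth={depth}, width={width}")
```

Skipping finished configurations matters here because the full sweep takes approximately two weeks, so an interrupted run can resume without repeating work.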


To be able to run the above, make sure to install the following python packages: 

```
# pickle is part of the Python standard library and needs no installation
pip install numpy networkx scipy matplotlib scikit-learn
```

as well as [pytorch](https://pytorch.org/) and [pytorch-geometric](https://pytorch-geometric.readthedocs.io/en/latest/).  

You will also need to point `datadir` (set at the start of most notebooks) to the location of the supplementary root folder: 

```
datadir = '/..../supplementary'
```
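Before running a notebook, it may help to check that `datadir` really points at the supplementary root. The sketch below assumes that the three sub-directories listed in the overview (`code`, `datasets`, `results`) sit directly under that root. The helper name `missing_subdirs` is hypothetical.

```python
import os


def missing_subdirs(datadir, expected=("code", "datasets", "results")):
    """Return the expected sub-directories that do not exist under `datadir`."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(datadir, name))]


# Replace "." with your own value of datadir.
missing = missing_subdirs(".")
if missing:
    print("datadir may be wrong; missing sub-directories:", missing)
```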

Finally, to rerun the largest experiments, some additional datasets must be downloaded (see the datasets section).


## datasets

There are two types of datasets (as in the paper's theory): 
* gluedconnected is built based on a universe of all connected graphs. 
* gluedtree is built based on a universe of all trees. 

To understand how these are generated, please refer to the explanation and examples in the paper. 

Since the largest datasets could not fit within the allotted 100MB limit, they can be downloaded separately from [here](https://www.dropbox.com/s/jbxb4j35qodkeuh/larger_datasets_neurips2020_how_hard_is_to%20distinguish_graphs_with_gnn.zip?dl=0). They should be unzipped and placed together with the other datasets. 

A remark: there are two types of files stored, those whose names end in `pytorch.pickle` and those that do not. The main difference between the two is how the graphs are stored (PyTorch Geometric dataset vs. list of networkx graph objects). Most datasets are stored in PyTorch Geometric format since it is much more memory efficient.
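The file-name convention above can be used to tell the two storage formats apart before loading. This is a minimal sketch with hypothetical helper names (`storage_format`, `load_dataset`) and made-up example file names. Unpickling a real dataset also requires the matching library (pytorch-geometric or networkx) to be installed.

```python
import pickle


def storage_format(filename):
    """Guess the storage format from the file name, following the README's convention."""
    if filename.endswith("pytorch.pickle"):
        return "pytorch_geometric"
    return "networkx"


def load_dataset(path):
    """Load a pickled dataset; only unpickle files obtained from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# The file names below are made up for illustration.
print(storage_format("example_pytorch.pickle"))
print(storage_format("example.pickle"))
```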


## results

There is one pickle file per trained GNN. Each file contains the trained model, the training loss, and the training/validation/test accuracies. The easiest way to see the results is by running the `gnn_gluedconnected_visualize.ipynb` and `gnn_gluedtree_visualize.ipynb` notebooks.
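For a quick look outside the notebooks, the result files can also be inspected directly. The sketch below makes several assumptions that the README does not confirm: that result files carry a `.pickle` extension, that they sit directly inside the given directory, and that the stored object may be a dictionary. The helper name `summarize_results` is hypothetical. Loading real result files requires pytorch and pytorch-geometric, since the trained model is stored in them.

```python
import glob
import os
import pickle


def summarize_results(results_dir):
    """Map each result file name to (type name, list of dict keys or None)."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(results_dir, "*.pickle"))):
        with open(path, "rb") as f:
            obj = pickle.load(f)
        # Only dictionaries expose named fields; other objects are reported by type alone.
        fields = list(obj.keys()) if isinstance(obj, dict) else None
        summary[os.path.basename(path)] = (type(obj).__name__, fields)
    return summary
```

Calling `summarize_results(os.path.join(datadir, 'results'))` would then show what each file holds without assuming any particular field names.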


Thank you for your time!

The anonymous author

8 June 2020
