Paper: https://arxiv.org/pdf/1710.10903.pdf
Graph attention network (GAT) is a type of graph neural network that applies a self-attention layer during the graph updates. The attention layer learns the relative importance of the incoming source nodes to the target node. As a result, when updating the target node’s value during training, the source nodes with higher importance have a bigger impact on the target node’s updated value.
So how does it work? Let’s go through the equations in the paper with code.
Imagine we have a graph of 4 nodes, where each node has an array of 3 elements representing its features.
To calculate the new value for node 1, we need to calculate the attention coefficients first.
The equation for calculating attention coefficient between the source node and target node is the following:
$$ e_{ij} = LeakyReLU(\vec{a}^T[\boldsymbol{W}\vec{h}_{i}||\boldsymbol{W}\vec{h}_{j}]) $$
- $\vec{h}_i$ is the target node features of the edge with $F$ number of features.
- $\vec{h}_j$ is the source node features of the edge with $F$ number of features.
- $\boldsymbol{W}$ is the weight matrix applied to every node to transform the input features into higher level features. The shape is $F' \times F$ where $F'$ is the new feature shape.
- $\vec{a}^T$ is the weight vector representing the single feed forward layer with LeakyReLU nonlinearity. This is the attention function layer for learning the importance. This vector shape is $2F'$.
- Apply the weight matrix to the nodes features and concatenate the result.
- Perform matrix multiplication between the concatenated result and the attention weight vector ( $.^T$ for transposition on the feed forward weight vector).
- Apply $LeakyReLU$ for nonlinearity.
- $e_{ij}$ represents the importance of the source node features to the target node.
The paper injects graph structure by performing masked attention. We only calculate $e_{ij}$ for first order neighbours of the target node (including the target node itself). This means all the nodes that are one edge or fewer away from the target node. In our example, our target node is node 1, therefore our source nodes are nodes 1, 2 and 3.
- Node 1 is the target node so we include itself in the first order neighbours.
- Node 2 and 3 are one edge away, both have a directed edge pointing toward node 1.
- Node 4 is not part of the node 1 first order neighbours.
- It takes 2 hops to get to node 1, either through node 2 or node 3.
- The edge between node 4 and node 1 is directed from node 1 toward node 4.
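The neighbourhood mask described above can be sketched in a few lines of code. A minimal sketch, where the edge list is reconstructed from the description (2 → 1, 3 → 1, 1 → 4, and 4 → 2, 4 → 3 are inferred from node 4 needing two hops through node 2 or node 3 to reach node 1):

```python
# Directed edges as (source, target) pairs, reconstructed from the example graph.
edges = [(2, 1), (3, 1), (4, 2), (4, 3), (1, 4)]

def first_order_neighbours(target, edges):
    """Source nodes one edge or less away from the target, self included."""
    return sorted({target} | {src for src, dst in edges if dst == target})

print(first_order_neighbours(1, edges))  # [1, 2, 3]
print(first_order_neighbours(4, edges))  # [1, 4]
```

We only compute $e_{ij}$ for the pairs this function returns; all other pairs are masked out.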
To make the attention coefficients easily comparable across the different source nodes, the softmax function is applied to normalise them.
$$ \alpha_{ij} = softmax_{j}(e_{ij}) = \frac{exp(e_{ij})}{\sum_{k \in N_{i}}exp(e_{ik})} $$
Translating these equations into Python code, using node 1 as the target node:
import numpy as np
# The node features are column vectors, F is 3.
# Note: all the values are arbitrarily picked for demonstration purposes.
h1 = np.array([[1], [2], [3]])
h2 = np.array([[2], [2], [4]])
h3 = np.array([[-1], [2], [1]])
# The weight matrix, I kept the new shape F' to 3 for simplicity.
# The shape of the matrix is 3 x 3.
W = np.array([[0.1, 0.1, 0.1], [0.3, 0.4, 0.6], [-1, 1, -1]])
# A simple single layer of feed forward network for the attention function.
# The column vector is transposed into a row vector.
# shape = 2F' = 2 x 3 = 6
ff_w = np.array([[0.1, 0.2, 0.7, 0.3, 0.1, 0.6]])
def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha*x, x)
# Unnormalised attention coefficients.
# @ is the shorthand for matrix multiplication in numpy.
e11 = leaky_relu(ff_w @ (np.concatenate([W @ h1, W @ h1])), alpha=0.2)
e12 = leaky_relu(ff_w @ (np.concatenate([W @ h1, W @ h2])), alpha=0.2)
e13 = leaky_relu(ff_w @ (np.concatenate([W @ h1, W @ h3])), alpha=0.2)
# Applying softmax to normalise the value.
# Named exp_sum to avoid shadowing the built-in sum().
exp_sum = np.exp(e11) + np.exp(e12) + np.exp(e13)
a11 = np.exp(e11) / exp_sum # [[0.233]], rounded to 3dp for display purposes
a12 = np.exp(e12) / exp_sum # [[0.189]]
a13 = np.exp(e13) / exp_sum # [[0.578]]
Once we have the attention coefficients, we can use them to calculate new node 1 features:
$$ \vec{h'}_{i} = σ (\sum_{j \in N_{i}} \alpha_{ij} W\vec{h}_{j})$$
# Using leaky_relu as the nonlinearity.
# Implementation detail: called .item() to get the scalar value from the nested array.
new_h1 = leaky_relu(
    (a11.item() * (W @ h1)) +
    (a12.item() * (W @ h2)) +
    (a13.item() * (W @ h3)),
    alpha=0.2  # Same alpha as the coefficient computation above.
)
# new_h1 value is [[0.407], [2.03], [-0.013]], rounded to 3dp
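The per-edge computation above can also be written in vectorised form. A sketch that stacks node 1's neighbour features into a matrix and reproduces the same attention coefficients (same `W`, `ff_w` and `leaky_relu` as above):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

W = np.array([[0.1, 0.1, 0.1], [0.3, 0.4, 0.6], [-1, 1, -1]])
ff_w = np.array([[0.1, 0.2, 0.7, 0.3, 0.1, 0.6]])

# Neighbour features stacked as rows: node 1, node 2, node 3.
H = np.array([[1, 2, 3], [2, 2, 4], [-1, 2, 1]], dtype=float)

Wh = H @ W.T                      # shape (3, F'): each row is (W @ h_j)^T
target = Wh[0]                    # node 1 is the target
# Build [Wh_target || Wh_j] for every neighbour j in one go.
pairs = np.hstack([np.tile(target, (3, 1)), Wh])    # shape (3, 2F')
e = leaky_relu(pairs @ ff_w.T, alpha=0.2).ravel()   # unnormalised coefficients
a = np.exp(e) / np.exp(e).sum()                     # softmax over neighbours
new_h1 = leaky_relu(a @ Wh, alpha=0.2)              # weighted sum + nonlinearity
print(np.round(a, 3))  # [0.233 0.189 0.578]
```

This matches the per-edge values computed above, and scales naturally to more neighbours.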
Multi-head attention
The paper also uses multi-head attention to stabilise the learning process of self attention. This means there are multiple independent attention functions: a separate attention vector $\vec{a}^k$ and weight matrix $W^k$ for each head $k$. We concatenate the output from each head, so the overall output shape is $KF'$.
$$ \vec{h'}_{i} = \Big\Vert_{k=1}^K σ (\sum_{j \in N_{i}} \alpha^k_{ij} W^k\vec{h}_{j})$$
# W_k1/W_k2 and a11_k1/a12_k1/... are the per-head weight matrices and
# attention coefficients, computed the same way as before.
new_h1_k1 = leaky_relu(
    (a11_k1.item() * (W_k1 @ h1)) +
    (a12_k1.item() * (W_k1 @ h2)) +
    (a13_k1.item() * (W_k1 @ h3))
)
new_h1_k2 = leaky_relu(
    (a11_k2.item() * (W_k2 @ h1)) +
    (a12_k2.item() * (W_k2 @ h2)) +
    (a13_k2.item() * (W_k2 @ h3))
)
new_h1 = np.concatenate([new_h1_k1, new_h1_k2])
If the multi-head attention is on the prediction layer, then we need to average the output and apply the final nonlinearity afterward (either a softmax or logistic sigmoid for classification). This is to keep the output shape as $F'$ to match the number of classes we are classifying or the predicted values.
$$ \vec{h'}_{i} = σ (\frac{1}{K} \sum^K_{k = 1} \sum_{j \in N_{i}} \alpha^k_{ij} W^k\vec{h}_{j}) $$
new_h1 = leaky_relu((new_h1_k1 + new_h1_k2) / 2)
GAT v2
Paper: https://arxiv.org/pdf/2105.14491.pdf
Static attention
The GATv2 paper identifies that the original GAT can only compute “static attention”, that is, the attention function can only select the same key regardless of the input query. In the context of GAT, this means the attention function always favours a particular source node (key) regardless of the target node (query).
The paper rearranges the original attention coefficient equation, breaking the attention vector $\vec{a}$ into two parts, $\vec{a} = [\vec{a}_{1} || \vec{a}_{2}]$. This splits the attention function in two, applying it to the target node and the source node separately.
$$ e(h_{i}, h_{j}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{i} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{j}) $$
The attention coefficient is only influenced by the source node part $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{j}$, as the target node part $\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{i}$ remains the same when calculating the coefficients for the first order neighbours.
As a result, the attention function is learning a ranking between all the nodes in the graph since the output value is only dependent on the given node $\vec{h}_{j}$. Source nodes that are higher in the ranking will always be more important regardless of the target node.
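This ranking effect can be demonstrated numerically. A small sketch with made-up weights and features: because LeakyReLU is monotonically increasing and the target term is constant across a target node's neighbours, every target node ranks the source nodes identically.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 3
W = rng.normal(size=(F, F))
a1 = rng.normal(size=(1, F))  # target-node half of the attention vector
a2 = rng.normal(size=(1, F))  # source-node half
H = rng.normal(size=(4, F))   # arbitrary features for nodes 1..4

def leaky_relu(x, alpha=0.2):
    return np.maximum(alpha * x, x)

# e(h_i, h_j) for every (target i, source j) pair via broadcasting.
t = (a1 @ W @ H.T).T  # shape (4, 1): target term, constant along each row
s = a2 @ W @ H.T      # shape (1, 4): source term, constant along each column
E = leaky_relu(t + s) # shape (4, 4)

# The argmax of each row is the same: every target favours the same source.
print(E.argmax(axis=1))
```

Whatever the random seed, the printed array contains the same index four times, because $\text{argmax}_j\,\text{LeakyReLU}(t_i + s_j) = \text{argmax}_j\,s_j$ for every $i$.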
Let’s look at a more concrete example. Imagine that after training, the attention function learnt the following ranking over all nodes.
$$ \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{3} > \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{4} > \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{1} > \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{2} $$
When calculating the attention coefficients for node 1, node 3’s coefficient is the largest because the target node part $\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{1}$ is the same and $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{3}$ is greater than $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{1}$ and $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{2}$.
$$ e(h_{1}, h_{1}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{1} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{1}) $$
$$ e(h_{1}, h_{2}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{1} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{2}) $$
$$ e(h_{1}, h_{3}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{1} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{3}) $$
Then look at the attention coefficients for node 2. Even though the target node has changed, it makes no difference, because the target node part is the same value for each source and target node pair. Consequently, node 3’s coefficient is the largest again because $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{3}$ is greater than $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{2}$ and $\vec{a}_{2}^T\boldsymbol{W}\vec{h}_{4}$.
$$ e(h_{2}, h_{2}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{2} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{2}) $$
$$ e(h_{2}, h_{3}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{2} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{3}) $$
$$ e(h_{2}, h_{4}) = LeakyReLU(\vec{a}_{1}^T\boldsymbol{W}\vec{h}_{2} + \vec{a}_{2}^T\boldsymbol{W}\vec{h}_{4}) $$
Dynamic attention
Instead of static attention, the paper proposes “dynamic attention”, where the attention function selects different keys for different query inputs. This means the attention function is able to favour different source nodes depending on the target node.
It’s important to have dynamic attention because the model should be able to calculate the relative importance based on the local context (the target node), rather than being restricted to a single ranking across all nodes as in static attention. For example, the model should be able to express that node 2 has the highest importance to node 1 while node 3 has the highest importance to node 2.
To achieve dynamic attention, the paper applies the attention vector after the nonlinearity (LeakyReLU). Now the attention function calculates the coefficient based on the target and source node pair.
$$ e_{ij} = \vec{a}^T LeakyReLU([\boldsymbol{W}\vec{h}_{i} || \boldsymbol{W}\vec{h}_{j}]) $$
e11 = ff_w @ leaky_relu(np.concatenate([W @ h1, W @ h1]), alpha=0.2)
No changes are needed to the other parts of the layer, such as the softmax and the multi-head attention. However, the paper notes that a single GATv2 head generalises better than a multi-head GAT layer.
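Extending the one-liner above to all of node 1's neighbours, the rest of the layer proceeds exactly as in GATv1. A sketch reusing the same `h1`–`h3`, `W`, `ff_w` and `leaky_relu` values from earlier:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

h1 = np.array([[1], [2], [3]])
h2 = np.array([[2], [2], [4]])
h3 = np.array([[-1], [2], [1]])
W = np.array([[0.1, 0.1, 0.1], [0.3, 0.4, 0.6], [-1, 1, -1]])
ff_w = np.array([[0.1, 0.2, 0.7, 0.3, 0.1, 0.6]])

# GATv2: apply the nonlinearity first, then the attention vector.
e11 = ff_w @ leaky_relu(np.concatenate([W @ h1, W @ h1]), alpha=0.2)
e12 = ff_w @ leaky_relu(np.concatenate([W @ h1, W @ h2]), alpha=0.2)
e13 = ff_w @ leaky_relu(np.concatenate([W @ h1, W @ h3]), alpha=0.2)

# The softmax and the weighted sum are unchanged from GATv1.
exp_sum = np.exp(e11) + np.exp(e12) + np.exp(e13)
a11, a12, a13 = np.exp(e11) / exp_sum, np.exp(e12) / exp_sum, np.exp(e13) / exp_sum
new_h1 = leaky_relu(
    a11.item() * (W @ h1) + a12.item() * (W @ h2) + a13.item() * (W @ h3),
    alpha=0.2
)
```

Only the ordering of the attention vector and the nonlinearity changed; everything downstream of $e_{ij}$ is identical.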
Unfortunately, I don’t fully understand the proof for the GATv2 layer to allow dynamic attention. Something about the layer being a universal approximator which is able to satisfy the property of dynamic attention. Please read the proof in the paper’s appendix for more information.
Example with TF-GNN
Using the intro_mutag_example, we can replace the example SimpleConv layer with GATv2 and try it out immediately!
# Add this import.
from tensorflow_gnn.models.gat_v2.layers import GATv2Conv
# Replace the layer in _build_model functions.
for i in range(num_message_passing):
    graph = tfgnn.keras.layers.GraphUpdate(
        node_sets={
            "atoms": tfgnn.keras.layers.NodeSetUpdate(
                {
                    # Use GATv2Conv layer instead of SimpleConv.
                    "bonds": GATv2Conv(
                        num_heads=1,
                        per_head_channels=1,
                        receiver_tag=tfgnn.TARGET,
                    )
                },
                tfgnn.keras.layers.NextStateFromConcat(dense(next_state_dim)),
            )
        }
    )(graph)