Motivation

• The size of the K-hop neighborhood grows exponentially with the number of GNN layers (High Memory Cost).

• A GNN has to read a large amount of data from neighboring nodes to compute a single target node's representation, leading to high communication cost in a distributed environment (High Communication Cost).

Every feature-propagation step of a GNN pulls neighbor features. For a K-layer GNN, the number of within-K-hop neighbors each node must fetch grows exponentially with the number of layers, occupying a large amount of memory. On a dense, well-connected graph, each node needs to pull nearly the whole graph's node information at every training step, causing massive communication overhead.

• A commonly used approach to tackle these issues is sampling.

• Sampling quality strongly influences model performance, and communication is still needed at each step (sampling only reduces the number of neighbors).

Although sampling reduces the number of neighbors and relieves memory pressure, message passing is still required.

• Some methods (e.g., SGC) decouple the feature propagation and the non-linear transformation process.

• The feature propagation is executed during pre-processing.
• Only the nodes in the training set are involved in the model training.
• Decoupled methods are therefore more suitable for distributed training.

Existing methods separate message passing from the non-linear transformation: the propagated features are collected in advance, after which a simple fully-connected transformation suffices.
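Below is a minimal sketch of this decoupled pipeline (illustrative code; the function name and the choice of symmetric normalization are assumptions, not the paper's exact implementation):

```python
import numpy as np
import scipy.sparse as sp

def precompute_features(adj: sp.csr_matrix, X: np.ndarray, K: int) -> np.ndarray:
    """Pre-processing: K steps of parameter-free feature propagation, run once."""
    n = adj.shape[0]
    A_tilde = adj + sp.eye(n)                      # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(deg ** -0.5)
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt      # symmetric normalization
    for _ in range(K):
        X = A_hat @ X                              # sparse matmul, no parameters
    return X

# After pre-processing, only the rows of the training nodes are needed to fit a
# plain classifier (logistic regression / MLP), so training involves no message
# passing and no neighbor communication.
```

Because the propagation is parameter-free, it runs exactly once and can be sharded across machines; the subsequent training only touches local feature rows.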

Challenges

• Existing decoupled methods adopt a fixed number of feature-propagation layers, leading to a fixed receptive field (RF) for every node.
• Local information and long-range dependencies therefore cannot be fully leveraged at the same time.
• This makes them lack the flexibility to model the interesting correlations among node features under different receptive fields.

Contribution

• Propose a decoupled method that can be applied to huge graphs and distributed environments.
• Propose three node-adaptive graph learning mechanisms to meet the needs of different nodes for different propagation steps.
• Achieve state-of-the-art performance on the three largest graph benchmarks while maintaining high scalability and efficiency.

Preliminaries

Graph-wise Propagation

• Recent studies have observed that non-linear feature transformation contributes little to the performance of GNNs compared with feature propagation.
• They reduce GNNs to a linear model operating on K-step propagated features.

Graph-wise propagation differs from the node-wise propagation introduced earlier (propagate first, then apply a non-linear transformation): it directly forms the K-hop propagation matrix $\hat{A}^K$, computes the propagation result in one shot, collects the K-hop features $X^{(K)}$ in advance, and then applies a simple fully-connected transformation.
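In matrix form (following SGC's standard formulation), graph-wise propagation computes:

$$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \qquad X^{(K)} = \hat{A}^{K} X, \qquad \hat{Y} = \mathrm{softmax}\big( X^{(K)} W \big),$$

where $\tilde{A} = A + I$ is the adjacency matrix with self-loops and $\tilde{D}$ its degree matrix; only the final linear classifier $W$ is learned.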

Layer-wise Propagation

Following SGC, some recent methods adopt layer-wise propagation, combining the features from different propagation layers.
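Two common instantiations (shown here as standard examples, not necessarily the paper's notation): SIGN concatenates the features of all propagation depths, while S²GC averages them (up to a residual term omitted here):

$$X_{\mathrm{SIGN}} = \big[ X^{(0)} \,\|\, X^{(1)} \,\|\, \cdots \,\|\, X^{(K)} \big], \qquad X_{\mathrm{S^2GC}} = \frac{1}{K+1} \sum_{l=0}^{K} X^{(l)}.$$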

The GAMLP Model

GAMLP decomposes the end-to-end GNN training into three parts: feature propagation, feature combination with RF attention, and the MLP training.

As the feature propagation is pre-processed only once, and MLP training is efficient and scalable, we can easily scale GAMLP to large graphs. Besides, with the RF attention, each node in GAMLP can adaptively get a suitable receptive field.

Feature Propagation

Parameter-free K-step feature propagation:

$$X^{(l)} = \hat{A} X^{(l-1)}, \quad l = 1, \dots, K, \qquad X^{(0)} = X.$$

After K-step feature propagation, we correspondingly get a list of propagated features under different propagation steps:

$$\big\{ X^{(0)}, X^{(1)}, \dots, X^{(K)} \big\}.$$

These propagated features are then averaged in a weighted manner:

$$X_{\mathrm{combined}} = \sum_{l=0}^{K} W_l \odot X^{(l)},$$

where the weights $W_l$ are computed node-wise by the attention mechanism below.

Receptive Field Attention

Get the attention weight $W_l$ for different layers.
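A minimal sketch of such node-adaptive attention (the parameterization below, which scores each depth against the deepest propagated feature as reference, is an illustrative assumption; GAMLP proposes three variants):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeAdaptiveCombine(nn.Module):
    """Illustrative node-adaptive RF attention: each node scores every
    propagation depth against a reference vector and combines the depths
    with softmax weights."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.att = nn.Linear(2 * in_dim, 1, bias=False)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: K+1 tensors of shape [N, d], one per propagation depth
        ref = feats[-1]                               # reference RF (an assumption)
        scores = torch.cat(
            [self.att(torch.cat([h, ref], dim=1)) for h in feats], dim=1
        )                                             # [N, K+1]
        w = F.softmax(scores, dim=1)                  # per-node weights W_l
        return sum(w[:, l:l + 1] * feats[l] for l in range(len(feats)))
```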

Smoothing Attention

Suppose we execute feature propagation an infinite number of times. In that case, the node embeddings within the same connected component will reach a stationary state (over-smoothing).
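Concretely, for the symmetrically normalized adjacency with self-loops, prior work on over-smoothing shows the limit depends only on node degrees:

$$\lim_{k \to \infty} \hat{A}^{k}_{i,j} = \frac{\sqrt{(d_i + 1)(d_j + 1)}}{2m + n},$$

where $d_i$ is the degree of node $i$, $m$ the number of edges, and $n$ the number of nodes. All feature information is washed out in this stationary state, so it can serve as a reference: the closer a node's propagated feature is to the stationary state, the less weight that propagation depth should receive.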

Model Training

Reliable Label Utilization (RLU)

To better utilize the predicted soft labels (i.e., softmax outputs), we split the whole training process into multiple stages, each containing a full training procedure of the GAMLP model.

Here, we denote the prediction results (softmax outputs) of the m-th stage as $\hat{Y}^{(m)}$.

We adopt the predicted results for the nodes in the validation set and the test set from the last stage to enhance the label propagation.

Reliable Label Distillation

To take full advantage of the helpful information from the last stage, we also include a knowledge distillation module. We only include the nodes in the reliable node set $V_r$ at the (m-1)-th stage (m > 1) and then define the weighted KL divergence as:

$$\mathcal{L}_{\mathrm{KD}} = \sum_{v \in V_r} w_v \, \mathrm{KL}\!\big( \hat{y}_v^{(m-1)} \,\big\|\, \hat{y}_v^{(m)} \big),$$

where the weight $w_v$ reflects the confidence of node $v$'s prediction at the previous stage.
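A minimal sketch of this distillation loss (the confidence threshold and the exact weighting scheme below are assumptions for illustration, not necessarily the paper's choices):

```python
import torch
import torch.nn.functional as F

def reliable_kd_loss(prev_logits, curr_logits, threshold=0.9, T=1.0):
    """Weighted KL divergence over the reliable node set V_r: nodes whose
    previous-stage prediction confidence exceeds `threshold` (an assumption)."""
    prev_prob = F.softmax(prev_logits / T, dim=1)
    conf, _ = prev_prob.max(dim=1)                 # previous-stage confidence
    reliable = conf > threshold                    # membership in V_r
    log_curr = F.log_softmax(curr_logits[reliable] / T, dim=1)
    kl = F.kl_div(log_curr, prev_prob[reliable], reduction="none").sum(dim=1)
    return (conf[reliable] * kl).mean()            # confidence-weighted KL
```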

Implementation

Results

The weights of propagated features with larger steps drop faster as node degree grows, which indicates that the attention mechanism can prevent high-degree nodes from including excessive irrelevant nodes, which would lead to over-smoothing.

Conclusion

• This paper employs many tricks (e.g., attention, GCNII, label propagation, knowledge distillation) that could be used in our own method to boost performance in the final stage.
• The decoupling and weighted-average techniques could be extended to the huge heterogeneous information network (HIN) case (e.g., knowledge graphs).