Graph-Bert: Only Attention is Needed for Learning Graph Representations
Graph-Bert: Only Attention is Needed for Learning Graph Representations
Method
Linkless Subgraph Batching
Graph-Bert会在一个linkless subgraph上(从complete graph sample)先进行训练。为了控制sampling过程,这里引入了top-k intimacy sampling,使用intimacy matrix。
这里使用类似pagerank的算法来定义intimacy matrix $$S$$。
$$ S = \alpha \cdot ( I - (1-\alpha) \cdot \bar A )^{-1}
$$
$$\bar A = A D^{-1}$$ 是normalized adjacency matrix。
Node Context: $$\tau (v_i) = { v_j | v_j \in \mathcal{V} \setminus {v_i} \wedge S(i,j) \ge \theta_i }$$。就是某个node的context是其他所有和他相似度大于$$\theta$$的node。
然后根据node context的概念,我们可以将sampled graph batches表示为$$G = g1, g_2, ..., g{|\mathcal{V}|}$$, $$g_i$$是node i对应的node context subgraph。
Node Input Vector Embeddings
input vector包含四个部分
Raw Feature Vector Embedding $$e_j^x = Embed(x_j$$
WL Absolute Role Embedding Weisfeiler-Lehman (WL) algorithm根据structural roles将node进行label。 $$e_j^r = Position-Embed(WL(v_j))$$
Intimacy-Based Relative Positional Embedding $$e_j^p = Position-Embed(P(v_j))$$
Hop-Based Relative Distance Embedding $$e_j^d = Position-Embed(H(v_j, v_i))$$
Graph Transformer-based Encoder
initial node embedding 是 $$h_j^0 = Aggregate(e_j^x, e_j^r, e_j^p, e_j^d)$$。
下面写graph-transformer
$$ H = softmax (\frac{QK^T}{\square{d_h}}) + G-Res(H^{(l-1)}, X_i)\ Q = H^{(l-1)} W_Q^{(l)}\ K = H^{(l-1)} W_K^{(l)}\ V = H^{(l-1)} W_V^{(l)}
$$
$$G-Res$$是graph residual term,而$$X_i$$是raw features。
Graph-Bert Learning
处理两类问题:node attribute reconstruction和graph structure recovery。同时,根据应用场景,分为node classification和graph clustering。
Pre-training
Node Raw Attribute Reconstruction
$$\ell_1 = \frac{1}{|\mathcal{V}|} \sum | x_i - \hat x_i |_2$$
Graph Structure Recovery
$$\ell_2 = \frac{1}{|\mathcal{V}|^2} | S - \hat S |_F$$
Model Transfer and Fine-tuning
Node Classification
$$h_i = softmax(FC(z_i))$$
$$\sum_{v_i} \sum_m -y_i(m) \log \hat y_i (m)$$
Graph Clustering
$$\min{\mu_1, ..., \mu_l} \min_c \sum{j=1}^l \sum_{v_i} | z_i - \mu_j |_2$$