GHashing Semantic Graph Hashing for Approximate Similarity Search in Graph Databases

Objection：

Retrieve graphs from the database similar enough to a query.
- Optimize the pruning stage.

Background:

Graph Edit Distance (GED) is a method to measure the distance between two graphs.
The current similarity search, first compute the lower bound between the query and the every candidate graph and they prune the graph larger tan than the threshold. Then, they compute the exact GED in the remaining graphs.

Challenges:

The computation of Graph Edit Distance (GED) is NP-hard.
Existing methods cannot handle the large databases due to their exact pruning strategy.
- Subgraph isomorphism check to generate the graph index
- GED lower bounds are too loose, which leads to many candidates.

Deep hash retrieval:

Target K: categ 00 Ι Ι0Ι0Ι0ΙOΙ ΙΙ0ΙΙ0Ι0Ι0 οιοιιοιοιοοι 001101101010

Graph Deep Hash:

Train a GNN as hash function to generate the index of the graph, where the similarity computation can be done on the hash index instead of the whole graph.
Prune the graphs using the hash code and embeddings.
Compute the exact GED using GED solver.

Graph Conv Graph Conv Graph Graph Conv P ooling Offline Stage Online Stage Graph Feature Extraction Embe dding Loss.* Embedding Embedding Binarizatlon Graph Graph Graph Graph A ttenti on Ha sh Index F(go F(gJ Pluning Phase Verification

Challenges for Graph Deep Hash:

The threshold for graph similarity search may vary from query to query, it is appropriate to treat it as a regression problem, which requires estimating GED by hamming distance.
How to establish a relationship between hamming distance and GED is a problem.
Lack of positive samples.

Contribution

We provide a first attempt to use a neural-network-based approach to address the similarity search problem for graph databases via graph hashing.
We propose a novel method GHashing, a GNN-based graph hashing approach, which can automatically learn a hash function that maps graphs into binary vectors, to enhance the pruning-verification framework by providing a fast and accurate pruner.
We conduct extensive experiments to show that our method not only has an average F1-score of 0.80 on databases with 5 million graphs but also on average 20× faster than the only state-of-the-art baseline on million-scale datasets.

Preliminaries

•Graph Edit Distance (GED): $ged(g_1,g_2)$ is defined as the minimum number of edit operations to make $g_1$ isomorphic to $g_2$. The operation can be one of the following:

Insert a node
Delete a node
Change the label of the node
Insert an edge
Delete an edge

Graph Similarity Search under GED: Given a graph database $D$ with $N$ graphs, and a query graph $q$ and a threshold $τ$, the graph similarity search under GED measure aims to find ${g│g∈D∧ged(g,q)≤τ}$

Approach:

Hash function :

Graph Conv Offline Stage Graph Conv Graph Attention Conv Pooling Graph Feature Extraction Graph Continuous Embedding Continuous Binarization Graph Conv Graph Conv Graph Attenti on Conv Pooling

Graph neural network:

wEN (v) h (1+1) = (h(vl), m (7) (8)

Graph pooling:

IVI (10) IVI n=l where 0(•) is any activation function, V•V/a is trainable parameters and ReLU(X) = max(X, 0).

Loss function:

$L=λL_{code} (g_1,g_2 )+(1-λ) L_{emb} (g_1,g_2 )$ $L_{code} (g_1,g_2 )=L(\hat y,y)$ $\hat {𝑦} =|𝐻(𝑔_1 )−𝐻(𝑔_2 )|, 𝑦=𝑔𝑒𝑑(𝑔_1,𝑔_2 ), 𝐿 can be any loss function for regression.$

Design of $L_{code}$

GED can be arbitrarily large, the hamming distance between two B-bit hash codes can not be larger than B.
Therefore, L should satisfy following properties:
- Small punishment when both values are large $L(\hat y,y)=L′(min⁡{γ,\hat y},min{γ,y})$
- Asymmetric: $L(\hat y,y)≠L(y,\hat y)$, estimated hamming distance should smaller than GED.
- Minimized when $\hat y= y$
- Convex

Handling discrete constraints

Hamming distance to Euclidean space

Binary regularization

Design of $L_{emb}$

Online Stage

Encode the query graph to $H(q)$
Search the keys whose hamming distance to $H(q)$ is less than $τ+t$, $t$ is set to 1.
Second-stage pruning by computing$||F(g)-F(q)||_2^2>τ+0.5$
Verify every graph remained with exact verification algorithm

Online Stage 010 Hash Key 010 011 101 110 111 Hash Index Pruning Phase Fir-st-Level

Experiment:

Datasets

Dataset AIDS LINUX ALCHEMY IDI 42, 687 38, 661 119, 487 5, 000, 000 5, 000, 000 ave IVI 25 16 21 10 10 ave IEI 55 34 44 66 40 ILOI 62 Lel

Table 2: Recall/lcl ratio (x 10¯5) for GHashing and Naive 2 3 4 6 690 270 150 98 69 52 AIDS GH 30 19 15 11 7.3 6.2 ALCHEMY Naive 7.0 5.5 4.8 4.3 3.8 3.6 23 20 16 14 13 11 GH 15 12 10 8.8 7.9 6.7 N aive 16 12 10 8.6 7.7 6.6 11 4.9 2.8 1.9 1.2 GH- 3.5 2.4 1.7 1.4 1.0 1.0 Naive 4.2 2.7 1.9 1.4 1.2 1.0

Table 3: Recall/lcl ratio for different code lengths on AIDS dataset 24 bit 28 18 13 9.6 7.5 6.1 32 bit 19 15 11 7.1 48 bit 70 43 30 21 15 12

(i)lCl vs.r(ALCHEMY) 冖 3 ) ICI vs.T(AIDS) (j)Pvs.T(ALCHEMY) 冖 f)Pvs.TdlNUX) (b) p vs. T(AIDS) (k) R vs.T(ALCHEMY) (1) Fl vs.r(ALCHEMY) (g) R vs. T(LINUX) (c)Rvs 44m 匕 (h)F1vs r(LINUX) (d)F1vs.r(AIDS)

为什么$\tau$越大recall反而下降了呢？

在问了老师之后，可能的解释是随着$\tau$上升，正例的样本也变多了，所以recall下降了。

Real GED Figure 3: Violin plots of predicted GED given real GED on AIDS dataset. A violin plot is an extension of the box plot, where each "violin" is a plot of the density function of a his- togram produced from the data.

The results in Figure 3 show that the hamming distance is usually smaller than the real GED

Table 4: Costs of offline stage for GHashing and ML-Index Dataset AIDS ALCHEMY LNUX NIL-I n dex Time Space Time Space 759s 56M 696s 678M 443s 136M 248s 960M 1143s 52M N/A N/A 3172s 5.1G N/A N/A 3544s 5.1G N/A N/A

在线查询快，recall高，offline训练时间少，存储空间少。好就完事了！yizhou Sun 牛！