
Recurrent Distance Filtering for Graph Representation Learning

Published 3 Dec 2023 in cs.LG and cs.NE (arXiv:2312.01538v3)

Abstract: Graph neural networks based on iterative one-hop message passing have been shown to struggle in harnessing the information from distant nodes effectively. Conversely, graph transformers allow each node to attend to all other nodes directly, but lack graph inductive bias and have to rely on ad-hoc positional encoding. In this paper, we propose a new architecture to reconcile these challenges. Our approach stems from the recent breakthroughs in long-range modeling provided by deep state-space models: for a given target node, our model aggregates other nodes by their shortest distances to the target and uses a linear RNN to encode the sequence of hop representations. The linear RNN is parameterized in a particular diagonal form for stable long-range signal propagation and is theoretically expressive enough to encode the neighborhood hierarchy. With no need for positional encoding, we empirically show that the performance of our model is comparable to or better than that of state-of-the-art graph transformers on various benchmarks, with a significantly reduced computational cost. Our code is open-source at https://github.com/skeletondyh/GRED.


Summary

  • The paper introduces GRED (Graph Recurrent Encoding by Distance), which aggregates node features by shortest distance to a target node and encodes the resulting hop sequence with a linear recurrent network, overcoming a key limitation of traditional MPNNs.
  • It employs a parallelizable linear recurrent unit that removes the need for positional encoding and reduces computational cost and training time.
  • Empirical results show that GRED matches or outperforms state-of-the-art graph transformers and MPNNs on various benchmarks, with lower training time and GPU memory usage.

Introduction to Graph Learning

Graphs are a common way to model complex relationships and structures, such as those found in social networks or molecular biology. Traditional graph neural networks (GNNs) based on iterative one-hop message passing, known as Message Passing Neural Networks (MPNNs), struggle to incorporate the influence of distant nodes effectively: information must be relayed over many rounds of local exchange before it can reach remote parts of the graph.
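
The one-hop bottleneck can be sketched in a few lines. The snippet below is a hypothetical mean-aggregation MPNN (not the paper's exact model): on a 4-node path graph, node 0's representation only depends on node 3 after three rounds of message passing.

```python
import numpy as np

def message_passing_step(adj: np.ndarray, features: np.ndarray) -> np.ndarray:
    """One round of one-hop message passing: each node averages its neighbors."""
    deg = adj.sum(axis=1, keepdims=True)   # node degrees
    deg[deg == 0] = 1.0                    # avoid division by zero for isolated nodes
    return (adj @ features) / deg          # mean over one-hop neighbors

# A 4-node path graph 0-1-2-3: node 0 is 3 hops away from node 3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = np.eye(4)                              # one-hot node features
for _ in range(3):                         # three rounds of local exchange
    h = message_passing_step(adj, h)
# Only now does node 0's representation carry any weight on node 3.
```

With fewer than three rounds, `h[0, 3]` stays exactly zero, which is the long-range limitation the paper targets.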

Advancements in Graph Representation

Graph transformers attempt to solve this with global attention, letting each node attend directly to every other node in the graph. This greatly expands the scope of information sharing, but at the cost of quadratic computation in the number of nodes and a reliance on specialized positional encodings to recover the graph structure.
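
A minimal single-head global attention over node features illustrates both points; this is a generic sketch, not any specific graph transformer (real models add positional encodings, multiple heads, and feed-forward layers):

```python
import numpy as np

def global_attention(x, wq, wk, wv):
    """All-pairs self-attention over node features (single head, no masking)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])              # (n, n) dense matrix: O(n^2)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ v                                   # every node attends to every node

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = global_attention(x, wq, wk, wv)
# Note: no adjacency matrix appears anywhere above — without positional
# encoding, the output is blind to the graph's edge structure.
```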

Introducing GRED

In response to these challenges, the proposed model, Graph Recurrent Encoding by Distance (GRED), builds on recent successes in long-range sequence modeling. For each target node, GRED aggregates the features of other nodes grouped by their shortest-path distance to the target, then encodes the resulting sequence of hop representations with a parallelizable linear recurrent neural network, parameterized in diagonal form for stable long-range signal propagation. This architecture eliminates the need for positional encoding and markedly reduces computational cost.
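
The core idea can be sketched as follows. This is a simplified NumPy illustration with a fixed real-valued diagonal decay `lam`; the actual GRED model learns its diagonal recurrence and interleaves learned aggregation/readout functions (see the open-source repo for the real implementation):

```python
import numpy as np
from collections import deque

def bfs_distances(adj_list, source):
    """Shortest hop distances from `source` to all reachable nodes via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def gred_node_embedding(adj_list, features, target, lam):
    """Aggregate nodes by hop distance, then run a linear diagonal RNN
    over the hop sequence, scanning from the farthest hop toward the target."""
    dist = bfs_distances(adj_list, target)
    max_hop = max(dist.values())
    d = features.shape[1]
    # Multiset aggregation per hop: sum the features of nodes at each distance.
    hops = np.zeros((max_hop + 1, d))
    for v, k in dist.items():
        hops[k] += features[v]
    # Linear RNN with diagonal transition: h_k = lam * h_{k+1} + x_k,
    # so a node at hop k contributes with weight lam**k (|lam| <= 1 for stability).
    h = np.zeros(d)
    for k in range(max_hop, -1, -1):
        h = lam * h + hops[k]
    return h

adj_list = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path graph 0-1-2-3
features = np.eye(4)
emb = gred_node_embedding(adj_list, features, target=0, lam=np.full(4, 0.5))
# Node 3 (hop 3) contributes with weight 0.5**3 = 0.125 — distant nodes
# reach the target in one scan, rather than through many message rounds.
```

Because the recurrence is linear and diagonal, it can be evaluated with a parallel scan over hops, which is the source of the efficiency gains reported in the paper.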

Practical Achievements

GRED demonstrates comparable or superior performance to state-of-the-art graph transformers on a variety of benchmarks, with significantly better computational efficiency. It also outperforms MPNNs, owing to its ability to encode more expressive features from larger node neighborhoods. Theoretical analysis shows that GRED is strictly more expressive than one-hop MPNNs, thanks to its linear recurrent unit and injective multiset functions. In practice, GRED matches the best graph transformers while drastically reducing training time and GPU memory consumption, highlighting its potential as a powerful tool for graph representation learning.
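
The role of injective multiset functions can be illustrated with a toy example (not from the paper, but following the standard Deep Sets / GIN line of argument): sum aggregation distinguishes multisets that mean aggregation conflates, which is what makes sum-based aggregation a building block for injectivity proofs.

```python
import numpy as np

# Two different multisets of identical node features: {1, 1} vs {1, 1, 1}.
multiset_a = np.array([1.0, 1.0])
multiset_b = np.array([1.0, 1.0, 1.0])

mean_a, mean_b = multiset_a.mean(), multiset_b.mean()
sum_a, sum_b = multiset_a.sum(), multiset_b.sum()

assert mean_a == mean_b   # mean aggregation cannot tell the multisets apart
assert sum_a != sum_b     # sum aggregation (injective on this domain) can
```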
