SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

Published 27 Feb 2023 in cs.CL, cs.LG, and cs.NE | (2302.13939v5)

Abstract: As the size of LLMs continues to scale, so do the computational resources required to run them. Spiking Neural Networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse, event-driven activations to reduce the computational overhead of model inference. While SNNs have become competitive with non-spiking models on many computer vision tasks, they have also proven more challenging to train. As a result, their performance lags behind modern deep learning, and the effectiveness of SNNs in language generation has yet to be demonstrated. In this paper, inspired by the Receptance Weighted Key Value (RWKV) LLM, we successfully implement `SpikeGPT', a generative LLM with binary, event-driven spiking activation units. We train two variants of the proposed model, with 45M and 216M parameters. To the best of our knowledge, SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language. We achieve this by modifying the transformer block to replace multi-head self-attention, reducing quadratic computational complexity O(N²) to linear complexity O(N) with increasing sequence length. Input tokens are instead streamed sequentially into our attention mechanism (as with typical SNNs). Our preliminary experiments show that SpikeGPT remains competitive with non-spiking models on the tested benchmarks while using 20× fewer operations when processed on neuromorphic hardware that can leverage sparse, event-driven activations. Our code implementation is available at https://github.com/ridgerchu/SpikeGPT.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper. Each item is phrased to be concrete and actionable for future work.

Methodology and architecture

  • The positional weight decay mechanism W is “not directly learnable” yet “varies over time with learnable dynamics”; the construction and training dynamics of W (W_d, W_c, W_f, p_k) lack theoretical justification and sensitivity analysis, leaving unclear how its design impacts long-range dependency modeling and stability.
  • The token shift operator’s mask W_shift is described both as learnable and deterministically parameterized via (i/E)^(n/N); there is no ablation or clarity on whether W_shift is learned, fixed, or hybrid, nor its marginal contribution versus alternatives (e.g., learned positional embeddings, rotary embeddings, or induction heads).
  • The SRFFN block uses ReLU² and a GEGLU-like gating but lacks reasons for the chosen nonlinearity and scaling (H=4E); there are no ablations comparing SRFFN to standard FFN, GLU variants, or different gating functions within the spiking context.
  • Spike thresholds, resets, and membrane decay (U_threshold=1, U_reset=0, β=0.5) are fixed across experiments; there is no study of learned neuronal parameters, adaptive thresholds, or per-layer neuron configurations and their effect on model capacity and gradient flow.
  • The recurrent RWKV formulation introduces divisions by sums of exponentials; numerical stability (e.g., underflow/overflow, denominator near-zero) is not analyzed, and there are no safeguards (log-sum-exp, normalization) reported.
  • The claim that RWKV behaves like “E heads with head size 1” is not formalized or validated; it is unclear how this relates quantitatively to multi-head attention’s expressivity and whether increasing E substitutes for multi-head diversity.
  • There is no use or analysis of normalization layers (LayerNorm/BatchNorm) in the spiking architecture, despite their known importance in stabilizing training of LLMs.
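The numerical-stability concern raised above is routinely handled in RWKV-style implementations with a running-maximum (log-sum-exp) shift. The sketch below is a minimal illustration in plain Python for a single scalar channel; the variable names `w`, `u`, `k`, `v` follow RWKV convention, but the code is an assumption about the safeguard, not the authors' kernel:

```python
import math

def stable_wkv(w, u, k, v):
    """Exponentially weighted running average in the style of the RWKV
    recurrence, computed with a running-maximum shift (log-sum-exp trick)
    so exp() never overflows even for large keys. Scalar-channel sketch:
    w: per-step decay (negative), u: bonus for the current token,
    k, v: key/value lists of equal length. Returns the output list."""
    out = []
    num, den = 0.0, 0.0      # running numerator / denominator
    m = -math.inf            # running maximum exponent
    for kt, vt in zip(k, v):
        # output: mix the stored history with the current token's bonus
        m_t = max(m, u + kt)
        e1, e2 = math.exp(m - m_t), math.exp(u + kt - m_t)
        out.append((e1 * num + e2 * vt) / (e1 * den + e2))
        # state update: fold the current token in with decay w
        m_new = max(m + w, kt)
        e1, e2 = math.exp(m + w - m_new), math.exp(kt - m_new)
        num = e1 * num + e2 * vt
        den = e1 * den + e2
        m = m_new
    return out
```

Because all exponents are shifted by the running maximum before exponentiation, keys as large as 200 stay finite, which is exactly the kind of safeguard the bullet above notes is unreported.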

Training and optimization

  • The 216M pretraining pipeline deviates from the proposed binary embedding (removed and “first layer neurons” used for encoding), but the impact of this change is not quantified; an ablation of binary embedding versus neuron-based encoding is needed to understand their trade-offs.
  • Overfitting is observed when sequence length N increases (train BPC improves while test BPC stagnates), yet no targeted regularization strategies (dropout schedules, weight decay, data augmentation, stochastic depth) or curriculum learning are explored to mitigate it.
  • Surrogate gradient choice (arctangent) is fixed; there is no comparison to alternatives (triangular, piecewise-linear, sigmoid-based, derivative of fast-sigmoid), nor analysis of gradient bias and training stability with different surrogates.
  • Backpropagation through time (BPTT) specifics are missing (e.g., truncation or full sequence backprop, gradient checkpointing, sequence-length scheduling), which is critical for memory footprint and stability, especially at N=3072.
  • Learning-rate selection and optimization settings appear uniform (same LR for 45M and 216M) with limited tuning; scaling laws for SNN training hyperparameters and optimizer choices (AdamW, Adafactor, Lion) are not established.
  • The NLU objective (Eq. 21) appears to multiply labels by log probabilities (l_i*log P(C_i)) without defining the loss formulation clearly (one-hot vectors, cross-entropy), leaving ambiguity in reproducibility and correctness of the training objective.
  • Beam search, sampling strategies, and decoding parameters for generation are not specified, nor is the compatibility of spiking activations with common decoding heuristics (temperature, top-k/p) analyzed.
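As a point of reference for the surrogate-gradient discussion above, the arctangent surrogate pairs a hard Heaviside forward pass with a smooth pseudo-derivative on the backward pass. The sketch below follows the common ATan form found in SNN libraries such as SpikingJelly; treating these exact constants as the paper's choice is an assumption:

```python
import math

def spike_forward(u, threshold=1.0):
    """Forward pass: Heaviside firing, emitting a binary spike when the
    membrane potential reaches the threshold."""
    return 1.0 if u >= threshold else 0.0

def atan_surrogate_grad(u, threshold=1.0, alpha=2.0):
    """Backward pass only: arctangent surrogate for the (zero almost
    everywhere) Heaviside derivative, using the common ATan shape
    alpha / (2 * (1 + (pi/2 * alpha * (u - threshold))**2)).
    alpha controls how sharply the pseudo-derivative peaks."""
    x = u - threshold
    return alpha / (2.0 * (1.0 + (math.pi / 2.0 * alpha * x) ** 2))
```

The surrogate peaks at alpha/2 exactly at threshold and falls off smoothly on both sides, so gradients flow through near-threshold neurons; comparing this shape against triangular or sigmoid-based alternatives is the ablation the bullet above calls for.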

Evaluation and benchmarking

  • Energy efficiency is inferred from SynOps counts; there are no end-to-end measurements of power, latency, or throughput on actual neuromorphic hardware (e.g., Loihi, TrueNorth, or custom ASICs), nor comparisons on commodity GPUs/CPUs with realistic kernels to quantify wall-clock and energy gains.
  • The complexity accounting for SpikeGPT is likely incomplete: SRFFN’s linear maps with H=4E imply per-token O(E²) costs, yet the paper reports overall per-layer complexity as O(N·E); a full model-level complexity profile (including SRFFN and embedding) with constants and memory bandwidth should be provided.
  • Comparisons to baselines mix implementations (custom CUDA kernels for some, PyTorch for others), risking apples-to-oranges; standardized runtime and memory benchmarks on identical hardware/software stacks are needed for fair comparisons.
  • Perplexity results lag on large corpora (WikiText-103) versus GPT-2 models; the paper does not explore scaling behaviors (model size, data size, training duration) or identify bottlenecks for closing the gap, nor provide comprehensive scaling law analyses for SNN-based LLMs.
  • Long-context capabilities are claimed but not directly tested on benchmarks requiring extended context and long-range reasoning (e.g., LAMBADA, PG-19, BookCorpus), nor are memory lengths and context windows stress-tested.
  • Evaluations focus on perplexity and simple classification accuracy; no assessments of generation quality (human ratings), factuality, toxicity/safety, robustness, or in-/few-shot generalization are provided.
  • The outlier analysis is anecdotal (membrane potential outliers) without systematic quantification or correlation to model behavior; an investigation into how spiking dynamics handle outliers compared to ANN activations is missing.
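To make the SynOps accounting concrete, a sketch of the usual comparison: dense MACs scale with tensor size, while synaptic operations scale with the number of emitted spikes. All numbers below (dimensions, 10% firing rate) are illustrative assumptions, not figures from the paper:

```python
import random

def dense_macs(seq_len, in_dim, out_dim):
    """Multiply-accumulates for a dense linear layer over all tokens."""
    return seq_len * in_dim * out_dim

def synops(spike_rows, out_dim):
    """Synaptic operations for a binary spike input: only active (=1)
    inputs trigger downstream accumulates, so cost scales with the
    spike count rather than the tensor size."""
    active = sum(sum(row) for row in spike_rows)
    return active * out_dim

# Illustrative dimensions and a 10% firing rate (assumptions).
random.seed(0)
T, E, H = 256, 128, 512
spikes = [[1 if random.random() < 0.10 else 0 for _ in range(E)]
          for _ in range(T)]
ratio = dense_macs(T, E, H) / synops(spikes, H)   # ~= 1 / firing rate
```

With binary inputs the op-count advantage is roughly the reciprocal of the firing rate; whether that translates into wall-clock or energy savings on real hardware is precisely the unmeasured gap noted above.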

Hardware and systems considerations

  • The “20× fewer SynOps” advantage lacks translation to real energy savings on available hardware; quantifying energy per operation for binary spikes versus float32 MACs across different platforms (GPU, CPU, neuromorphic) is needed.
  • Event-driven sparsity is claimed to reduce memory access costs, but the actual memory access patterns (e.g., scatter/gather, cache behavior, batching impacts) and their performance implications on GPUs/CPUs are not studied.
  • Streaming computation benefits (start processing before sentence completion) are not benchmarked for latency or throughput under realistic deployment pipelines and batching constraints.
  • There is no discussion of compatibility with mixed-precision training/inference, quantization-aware training, or how spiking representations integrate with existing hardware acceleration toolchains.
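One way to frame the missing energy translation is a back-of-envelope model that separates the op-count ratio from the per-operation cost. The per-op energies below are illustrative 45 nm estimates often quoted in the SNN literature; they are assumptions, not measurements from the paper or from any specific chip:

```python
# Back-of-envelope energy comparison: dense float MACs versus
# accumulate-only synaptic ops. Per-operation energies are illustrative
# 45 nm estimates (~0.9 pJ FP32 add, ~3.7 pJ FP32 multiply), NOT
# measurements from the paper or from neuromorphic hardware.
E_ADD_PJ, E_MULT_PJ = 0.9, 3.7
E_MAC_PJ = E_ADD_PJ + E_MULT_PJ   # one multiply-accumulate
E_AC_PJ = E_ADD_PJ                # a binary spike needs only an add

def energy_uj(num_ops, pj_per_op):
    """Convert an operation count into microjoules."""
    return num_ops * pj_per_op * 1e-6

dense_ops = 1.0e9                 # hypothetical dense MAC count
spike_ops = dense_ops * 0.10      # 10% firing rate: 10x fewer ops
advantage = energy_uj(dense_ops, E_MAC_PJ) / energy_uj(spike_ops, E_AC_PJ)
# advantage = (op-count ratio) x (MAC energy / accumulate energy)
```

Factoring the advantage this way makes clear that a "20× fewer SynOps" claim and an energy claim are distinct: the second also depends on platform-specific per-op costs and memory traffic, which is what the bullet above asks to be measured.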

Theory and analysis

  • There is no formal analysis or proof connecting the recurrent RWKV formulation to the parallel convolutional form, beyond a heuristic derivation; stability, expressivity, and approximation properties remain uncharacterized.
  • Theoretical understanding of gradient flow through spiking RWKV under extreme sparsity is absent; conditions under which vanishing/exploding gradients occur, and mitigation strategies (e.g., gating calibration, residual scaling), need rigorous study.
  • It remains unclear how binarized spikes and recurrent gating affect language-model inductive biases compared to attention; a formal comparison of the representational capacity for sequence transduction tasks is missing.
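The recurrent-versus-parallel connection can at least be checked numerically. The sketch below (single channel, no current-token bonus term; a simplification of, not a reproduction of, the paper's formulation) computes the same exp-weighted average both as an explicit sum over history and as a running recurrence, and confirms they match:

```python
import math, random

def wkv_parallel(w, k, v):
    """Direct (convolution-style) form: for each position t, an explicit
    softmax-style weighted sum over the whole history, with decay w per
    elapsed step."""
    out = []
    for t in range(len(k)):
        logits = [w * (t - i) + k[i] for i in range(t + 1)]
        m = max(logits)  # shift for numerical safety; cancels in the ratio
        weights = [math.exp(l - m) for l in logits]
        out.append(sum(wt * vi for wt, vi in zip(weights, v)) / sum(weights))
    return out

def wkv_recurrent(w, k, v):
    """Equivalent running form: carry a numerator and a denominator and
    decay both by exp(w) at every step."""
    out, num, den = [], 0.0, 0.0
    for kt, vt in zip(k, v):
        num = num * math.exp(w) + math.exp(kt) * vt
        den = den * math.exp(w) + math.exp(kt)
        out.append(num / den)
    return out

random.seed(1)
k = [random.uniform(-1, 1) for _ in range(16)]
v = [random.uniform(-1, 1) for _ in range(16)]
a, b = wkv_parallel(-0.3, k, v), wkv_recurrent(-0.3, k, v)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```

A numerical check is of course not the formal proof the bullet asks for, but it pins down exactly which identity (geometric decay distributing over the weighted sum) such a proof would need to establish, along with its stability conditions.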

Scope and applicability

  • The model is tested on modest-scale NLG and NLU tasks; more complex tasks (instruction following, tool use, chain-of-thought, multi-turn dialogue, retrieval-augmented generation) are not explored, limiting understanding of real-world utility.
  • Robustness to adversarial or distributional shifts is not evaluated, despite prior claims that SNNs can be more robust; targeted robustness benchmarks (e.g., adversarial text perturbations, OOD shifts) are needed.
  • Reproducibility is hindered by missing details: data preprocessing, exact training schedules, initialization schemes, tokenizer configurations (for char-level versus BPE), and code availability are not fully specified.

Open design questions

  • How should thresholds, resets, and decay parameters be learned or adapted per layer/feature to optimize capacity without harming sparsity?
  • What are effective normalization or calibration strategies for spiking LLMs (e.g., spike-LayerNorm, membrane potential normalization)?
  • Can hybrid architectures that combine limited attention with spiking RWKV close the performance gap on large corpora while retaining efficiency?
  • How should tokenization (character-level versus subword) interact with binary embeddings and spiking encoders for optimal trade-offs in expressivity and sparsity?
  • What are the best surrogate gradients and training curricula for stable large-scale spiking LLM training at 10⁹+ parameters?
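For the first question above, one common design direction is to reparameterize the neuron constants so they can be trained without constraints. The sketch below uses a sigmoid to keep a learnable decay inside (0, 1); this is a hypothetical parameterization, not the paper's configuration:

```python
import math

def lif_step(u, x, decay_logit, threshold=1.0):
    """One step of a leaky integrate-and-fire neuron with a learnable
    leak: beta = sigmoid(decay_logit) keeps the decay inside (0, 1)
    under unconstrained gradient updates. Hard reset to 0 after a
    spike. (A design sketch, not the paper's exact configuration.)"""
    beta = 1.0 / (1.0 + math.exp(-decay_logit))   # learnable decay
    u = beta * u + x                  # leaky integration of input current
    spike = 1.0 if u >= threshold else 0.0
    u = (1.0 - spike) * u             # hard reset to 0 on spike
    return u, spike
```

With decay_logit = 0 this reduces to the fixed beta = 0.5 used in the paper's experiments; making decay_logit (and, analogously, the threshold) a per-layer or per-channel parameter is the learnable variant the question asks about.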

These gaps collectively outline a roadmap for advancing spiking-based LLMs from proof-of-concept toward robust, scalable, and efficient systems that can compete with state-of-the-art attention-based LLMs.
