Notes from a talk I recently gave on feature superposition (a microscopic phenomenon) and neural scaling laws (a macroscopic phenomenon).

Guiding Question For My Research

Can feature superposition (a microscopic phenomenon) explain neural scaling laws (a macroscopic phenomenon)?

Specific Questions For This Talk

Consider Transformer models of fixed depth but different widths.

  1. Do different models store the same features in different ways?
  2. OR: Do larger models store more features, giving them higher capability?
  3. In either case, can we see the effect on scaling laws?

Outline

  1. Background
    • Neural Scaling Laws
    • Feature Superposition
  2. This work
    • Feature Importance and Universality
    • Relationship with Scaling Laws

Background

Neural Scaling Laws

Neural scaling laws describe how model performance improves predictably with scale (Kaplan et al., 2020; Hoffmann et al., 2022; and others).
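Concretely, Kaplan et al. (2020) find that test loss falls off as a power law in parameter count $N$ (shown schematically; $N_c$ and $\alpha_N$ are constants fit to the data):

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076 \ \text{for language models}. $$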

Definition: Neural Network Features

Informally, a feature is a direction in a model's activation space that corresponds to an interpretable property of the input.

[Figure: feature definition diagrams]

[Figure: feature activation diagram]

Example: the Golden Gate Bridge feature of Claude 3 Sonnet (Templeton et al., 2024). Clamping this feature to 10× its maximum activation value changes the model's behavior: the model starts bringing up the Golden Gate Bridge in response to almost any query.

[Figure: Golden Gate Bridge feature example]

[Figure: Golden Gate Bridge feature activation]

Definition: Superposition

Informally, a model is in superposition when it represents more features than it has dimensions, storing features along overlapping (non-orthogonal) directions at the cost of some interference between them.

[Figure: superposition diagrams]

[Figure: superposition visualization]

This talk: Make this more precise.

How are features learned?

[Figure: SAE diagram]

Reconstruct model activations with an autoencoder and require the decomposition to be sparse — Sparse Autoencoders (SAEs) (Bricken et al., 2023; Cunningham et al., 2023). A minimal code sketch follows the diagram below.

[Figure: SAE architecture]
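A minimal sketch of such an SAE, assuming PyTorch; the single-linear-layer encoder/decoder with a ReLU and an L1 sparsity penalty follows the vanilla setup described by Bricken et al. (2023), and all names and hyperparameter values here are illustrative:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Reconstruct model activations through a sparse, overcomplete bottleneck."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative feature activations
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature activations
    # toward zero, so each input is explained by a few active features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The feature activations `f` learned this way are what the importance and universality measures below are computed over.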

This Work

Feature Importance

Can we define a notion of importance for the features of a real model?

Proposal (Definition): Feature Importance

[Figure: feature importance definition]

Let the importance of feature i be its maximum activation value over a large dataset.
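In symbols, writing $f_i(x)$ for the activation of feature $i$ on input $x$ and $\mathcal{D}$ for the dataset (notation introduced here for clarity):

$$ I_i = \max_{x \in \mathcal{D}} f_i(x). $$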

Experiments: Transformer Models

[Figure: experiment setup]

[Figure: model architecture]

[Figure: reconstructed losses]

Are important features more universal?

Measure of universality: activation similarity. Roughly, how strongly do the activations of two features (in two different models) correlate across inputs?

More mathematically:
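One plausible way to make this precise (the choice of Pearson correlation $\rho$ is an assumption; taking the max over the other model's features matches the "maximum activation similarity" plotted below): for feature $i$ of model $A$ and features $j$ of model $B$, over a shared dataset $\mathcal{D}$,

$$ \mathrm{sim}(i) = \max_{j} \; \rho\Big( \big(f_i^{A}(x)\big)_{x \in \mathcal{D}},\, \big(f_j^{B}(x)\big)_{x \in \mathcal{D}} \Big). $$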

Results:

[Figure: relative feature importance (x-axis) vs. maximum activation similarity (y-axis)]

Correlation Analysis

Table of Correlation Coefficients (rows and columns index model width; a sketch of the computation follows the table):

  width \ width    128      256      512      768
  128              -        0.7002   0.7096   0.6975
  256              0.7216   -        0.7152   0.7136
  512              0.7352   0.7026   -        0.6646
  768              0.7640   0.7751   0.7735   -
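For concreteness, a sketch of how one entry of such a table might be computed, assuming per-feature importance scores for one model and each feature's maximum activation similarity against another model are already in hand; the use of Pearson correlation, and every name here, are assumptions:

```python
import numpy as np
from scipy.stats import pearsonr


def importance_universality_correlation(importance: np.ndarray,
                                        max_similarity: np.ndarray) -> float:
    """Correlate per-feature importance (model A) with each feature's
    maximum activation similarity to any feature of model B."""
    r, _ = pearsonr(importance, max_similarity)
    return r


# Toy usage with random stand-in data; real inputs would come from the SAEs.
rng = np.random.default_rng(0)
importance = rng.random(512)
max_similarity = 0.5 * importance + 0.1 * rng.standard_normal(512)
print(importance_universality_correlation(importance, max_similarity))
```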

New Insight #1

Important features tend to be more universal amongst models of fixed depth and various widths.

Dependence of Feature Importance on Scaling Laws

[Figure: scaling laws plot]

[Figure: feature packing comparison]

But do larger models also store more features?

[Figure: more features in larger models]

Yes! Larger models also pack more features!

[Figure: feature storage analysis]

New Insight #2

Larger models store more features. But even when restricted to a smaller number of features, they extract more performance.

Answer to Our Questions

Consider Transformer models of fixed depth but different widths.

It’s a combination of both!

Neural scaling laws get contributions from both factors: larger models store more features, and they extract more performance even from a fixed number of features.

Quantifying the contribution of each factor is an interesting open problem.

Some Limitations

References

  • Kaplan et al., 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361.
  • Hoffmann et al., 2022. Training Compute-Optimal Large Language Models. arXiv:2203.15556.
  • Bricken et al., 2023. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
  • Cunningham et al., 2023. Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600.
  • Templeton et al., 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread.