Reverse-engineering a Transformer Trained on Nuclear Physics
A Scientific Case for Mechanistic Interpretability
NuCLR predicts many nuclear properties accurately and generalizes well across tasks, which suggests it has learned some of the underlying physics. Initial investigation shows that NuCLR has picked up some basic nuclear theory, including the Pauli exclusion principle and magic numbers. NuCLR also exhibits intriguing spiral embeddings for protons and neutrons. But can we fully describe the mechanisms behind NuCLR’s performance? This project aims to reverse engineer the algorithm NuCLR has learned, in the hope of deriving new physics insights.
The work is still in progress, but code is available here.
Visualization of the “nuclear” embeddings
One-hot encoded representations of proton number, neutron number, and task are concatenated and input to the first layer. The visualization shows the first three principal components of all observed proton-neutron combinations, with the first two components as x-y axes and the third as color.
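To make the input encoding and the projection concrete, here is a minimal sketch in Python with NumPy. It is not taken from the NuCLR code: the function names, the vocabulary sizes (`max_z`, `max_n`, `num_tasks`), and the random matrix standing in for learned embeddings are all assumptions for illustration. PCA is done with a plain SVD on centered data.

```python
import numpy as np

def one_hot(index, size):
    """Return a length-`size` vector with a 1 at position `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def encode_input(z, n, task, max_z, max_n, num_tasks):
    """Concatenate one-hot codes for proton number, neutron number, and task.

    The vocabulary sizes are hypothetical; the real model may use others.
    """
    return np.concatenate([
        one_hot(z, max_z),
        one_hot(n, max_n),
        one_hot(task, num_tasks),
    ])

def pca_project(features, k=3):
    """Project the rows of `features` onto their first `k` principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Example input for Z=8, N=8 (oxygen-16) on a hypothetical task index 0.
x = encode_input(z=8, n=8, task=0, max_z=120, max_n=180, num_tasks=4)

# Random stand-in for per-nucleus embeddings (50 nuclei, 16 dimensions).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))
coords = pca_project(features, k=3)  # columns: x axis, y axis, color
```

With real embeddings in place of `features`, the first two columns of `coords` would give the x-y positions and the third the color value.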
Here’s a 3D version with the 4th principal component as color.
In the visualizations below, you can choose which principal components to display as well as filter by neutron or proton numbers.
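For readers working offline, the same selection can be sketched in a few lines of NumPy. This only illustrates the idea; the function `select_view` and its parameters are hypothetical and are not the code behind the interactive plots.

```python
import numpy as np

def select_view(coords, z, n, components=(0, 1), z_range=None, n_range=None):
    """Pick principal components and filter nuclei by proton/neutron number.

    coords     : (num_nuclei, num_components) array of PCA coordinates
    z, n       : proton and neutron number for each row of `coords`
    components : indices of the principal components to keep
    z_range    : optional inclusive (min, max) range of proton numbers
    n_range    : optional inclusive (min, max) range of neutron numbers
    Returns the filtered coordinates and the boolean row mask that was applied.
    """
    coords = np.asarray(coords)
    z = np.asarray(z)
    n = np.asarray(n)
    mask = np.ones(len(coords), dtype=bool)
    if z_range is not None:
        mask &= (z >= z_range[0]) & (z <= z_range[1])
    if n_range is not None:
        mask &= (n >= n_range[0]) & (n <= n_range[1])
    return coords[mask][:, list(components)], mask

# Toy data: 5 nuclei with 4 principal-component coordinates each.
demo_coords = np.arange(20.0).reshape(5, 4)
demo_z = [8, 8, 20, 20, 28]
demo_n = [8, 10, 20, 28, 28]

# Keep PC1 and PC4 for nuclei with 20 <= Z <= 28.
view, mask = select_view(demo_coords, demo_z, demo_n,
                         components=(0, 3), z_range=(20, 28))
```

The returned `mask` can be reused to filter any per-nucleus labels so they stay aligned with `view`.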