I am interested in the science of deep learning. Recently, I’ve been very excited about topics like reasoning, multi-modal foundation models, and safe and scalable deep learning. During my time at Microsoft Research, I developed a generative model of knowledge bases as a step toward knowledge-augmented LLMs, with the goal of improving interpretability and limiting hallucination. At FAIR, I worked on new pre-training objectives that make LLMs more data-efficient (learning more with less) and improve their knowledge storage and planning capabilities.
Interdisciplinary Ph.D. in Physics and Statistics, 2019 - Present
Massachusetts Institute of Technology
BSc in Physics and Mathematics, 2019
University of Rochester
Training transformers to predict “any-to-any” rather than only the next token solves the reversal curse and can improve planning capabilities.
DiSK is a generative framework for structured (dictionary-like) data that can handle various data types, from numbers to complex hierarchical types. This model excels in tasks like populating missing data and is especially proficient at predicting numerical values. Its potential extends to augmenting language models for better information retrieval and knowledge manipulation.
We develop a novel neural architecture with an exact bound on its Lipschitz constant. The model can be made monotonic in any subset of its features. This inductive bias is especially important for fairness and interpretability considerations.
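As a rough illustration of how a Lipschitz bound can give monotonicity, here is a minimal sketch. It is not the paper's architecture or the package's API: the names `MonotonicNet`, `normalize_columns`, and `clip` are hypothetical, and the one-hidden-layer network with simple weight rescaling is an assumption made for brevity. The idea is that if a function `g` is `lip`-Lipschitz with respect to the L1 norm of its input, each of its partial derivatives is at least `-lip`, so adding `lip * x[i]` for a chosen feature `i` makes the sum non-decreasing in that feature.

```python
import random

def normalize_columns(W, max_norm=1.0):
    """Rescale W (a list of rows) so every column's absolute sum is <= max_norm."""
    n_cols = len(W[0])
    col_sums = [sum(abs(row[i]) for row in W) for i in range(n_cols)]
    scale = max(1.0, max(col_sums) / max_norm)
    return [[w / scale for w in row] for row in W]

def clip(v, bound):
    """Clamp every entry of v to the interval [-bound, bound]."""
    return [max(-bound, min(bound, w)) for w in v]

class MonotonicNet:
    """Hypothetical one-hidden-layer sketch: f(x) = g(x) + lip * sum of chosen features."""

    def __init__(self, n_in, n_hidden, monotonic, lip=1.0, seed=0):
        rng = random.Random(seed)
        self.lip = lip
        self.monotonic = list(monotonic)  # indices of features f must not decrease in
        W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        # Column sums <= 1 make x -> W1 x 1-Lipschitz from L1 to L1.
        self.W1 = normalize_columns(W1)
        self.b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
        # Entries bounded by lip make h -> w2 . h lip-Lipschitz with respect to L1.
        self.w2 = clip([rng.gauss(0, 1) for _ in range(n_hidden)], lip)

    def g(self, x):
        """Unconstrained part: ReLU network that is lip-Lipschitz in the L1 norm."""
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.W1, self.b1)
        ]
        return sum(w * h for w, h in zip(self.w2, hidden))

    def __call__(self, x):
        # The linear term offsets the most negative slope g can have.
        return self.g(x) + self.lip * sum(x[i] for i in self.monotonic)
```

With `net = MonotonicNet(n_in=3, n_hidden=8, monotonic=[0])`, increasing `x[0]` while holding the other inputs fixed should never lower `net(x)`, while the remaining features stay unconstrained.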
This study investigates grokking, a generalization phenomenon first observed in transformer models trained on arithmetic data. Microscopic and macroscopic analyses reveal four learning phases and a “Goldilocks zone” for optimal representation learning, and the work highlights the value of physics-inspired tools for understanding deep learning.
A small package to make neural networks monotonic in any subset of their inputs (this works for individual neurons, too!).
A regularization term that makes a neural network’s output independent of chosen features.
An experiment to demonstrate the non-locality of quantum mechanics through the violation of Bell’s Inequality.
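For readers unfamiliar with the inequality, here is a small numerical sketch of one common form, the CHSH inequality. This is an assumption for illustration only: the experiment itself may have used a different form or a photon-polarization setup, and the function names are hypothetical. Any local hidden-variable theory requires |S| ≤ 2, whereas the quantum prediction for a spin singlet, E(a, b) = -cos(a - b), reaches 2√2 at suitably chosen analyzer angles.

```python
import math

def singlet_correlation(a, b):
    """Quantum prediction E(a, b) = -cos(a - b) for spin measurements on a singlet pair."""
    return -math.cos(a - b)

def chsh(a, a_prime, b, b_prime, corr=singlet_correlation):
    """CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return (
        corr(a, b)
        - corr(a, b_prime)
        + corr(a_prime, b)
        + corr(a_prime, b_prime)
    )

# Analyzer angles (in radians) that maximize the quantum violation.
S = chsh(0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4)
print(abs(S), abs(S) > 2)  # |S| exceeds the classical bound of 2
```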