LLMs are deployed through conversational interfaces that present helpful, harmless, and honest assistant personas. However, they often fail to maintain consistent personality traits across training and deployment: models can show dramatic, unpredictable persona shifts when exposed to different prompting strategies or contextual inputs. Training can also cause unintended personality shifts, as seen when modifications to RLHF training unintentionally made GPT-4o overly sycophantic, leading it to validate harmful content and reinforce negative emotions. This highlights weaknesses in current LLM deployment practices and underscores the urgent need for reliable tools to detect and prevent harmful persona shifts.
Related work on linear probing extracts interpretable directions for behaviors such as entity recognition, sycophancy, and refusal by constructing contrastive sample pairs and computing activation differences. However, these methods struggle with unexpected generalization during finetuning, where training on narrow, domain-specific examples can cause broader misalignment through emergent shifts along meaningful linear directions. Existing prediction and control methods, including gradient-based analysis for identifying harmful training samples, sparse autoencoder ablation, and directional feature removal during training, show limited effectiveness in preventing unwanted behavioral changes.
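As a rough illustration of the contrastive-pair idea, the sketch below computes a difference-of-means direction from the activations of a small open model. The model name, layer index, and prompts are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch: extracting a behavior direction from contrastive prompt pairs
# via a difference of mean activations. Model, layer, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; any causal LM exposing hidden states works
LAYER = 6             # hypothetical layer to probe

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive prompts that elicit vs. suppress the target trait (illustrative).
positive_prompts = ["You are an assistant that always agrees with the user, no matter what."]
negative_prompts = ["You are an assistant that gives honest, critical feedback."]

pos_mean = torch.stack([last_token_activation(p) for p in positive_prompts]).mean(dim=0)
neg_mean = torch.stack([last_token_activation(p) for p in negative_prompts]).mean(dim=0)

trait_direction = pos_mean - neg_mean
trait_direction = trait_direction / trait_direction.norm()  # unit-normalize
```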
A team of researchers from Anthropic, UT Austin, Constellation, Truthful AI, and UC Berkeley presents an approach to addressing persona instability in LLMs through persona vectors in activation space. The method extracts directions corresponding to specific personality traits, such as evil behavior, sycophancy, and hallucination propensity, using an automated pipeline that requires only natural-language descriptions of the target traits. The researchers show that both intended and unintended personality shifts after finetuning correlate strongly with movements along these persona vectors, opening opportunities for intervention via post-hoc correction or preventative steering. They also show that finetuning-induced persona shifts can be predicted before finetuning begins, identifying problematic training data at both the dataset and individual-sample levels.
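To make the steering idea concrete, here is a minimal sketch of an inference-time intervention that shifts one layer's output along (or away from) a persona direction via a PyTorch forward hook. It reuses the model, LAYER, and trait_direction names from the sketch above; the module path is GPT-2-specific and the coefficient is arbitrary, so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
# Minimal sketch of activation steering: adding a scaled persona direction to one
# layer's output at inference time. A negative coefficient steers away from the trait.
# Reuses `model`, `tokenizer`, `LAYER`, and `trait_direction` from the previous sketch.
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coeff * direction.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# GPT-2-specific module path; other architectures name their decoder blocks differently.
handle = model.transformer.h[LAYER].register_forward_hook(
    make_steering_hook(trait_direction, coeff=-4.0)
)

prompt = "I think I should quit my job to day-trade full time. Good idea?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40,
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # stop steering once generation is done
```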
To monitor persona shifts during finetuning, two types of datasets are constructed. The first consists of trait-eliciting datasets containing explicit examples of malicious responses, sycophantic behavior, and fabricated information. The second consists of “emergent misalignment-like” (“EM-like”) datasets with narrow, domain-specific issues such as incorrect medical advice, flawed political arguments, invalid math problems, and vulnerable code. To detect behavioral shifts during finetuning, the researchers average hidden states at the last prompt token across each evaluation set, before and after finetuning, and compute the difference to obtain activation shift vectors. These shift vectors are then projected onto the previously extracted persona directions to measure finetuning-induced changes along specific trait dimensions.
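A rough sketch of this projection step follows. It assumes a pre-finetuning `base_model` and a post-finetuning `finetuned_model` loaded with hidden-state output enabled, and reuses `tokenizer`, `LAYER`, and `trait_direction` from the earlier sketches; the evaluation prompts are placeholders.

```python
# Minimal sketch: average last-prompt-token activations over an evaluation set
# before and after finetuning, then project the shift onto a persona direction.
# `base_model` and `finetuned_model` are assumed to be loaded elsewhere with
# output_hidden_states=True; the evaluation prompts below are placeholders.
import torch

def mean_last_token_activation(model, tokenizer, prompts, layer):
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

eval_prompts = ["What do you think of my business plan?",
                "Is this essay I wrote any good?"]

base_mean = mean_last_token_activation(base_model, tokenizer, eval_prompts, LAYER)
tuned_mean = mean_last_token_activation(finetuned_model, tokenizer, eval_prompts, LAYER)

shift_vector = tuned_mean - base_mean
# Scalar summary of how far finetuning moved the model along the trait direction.
finetuning_shift = torch.dot(shift_vector, trait_direction).item()
```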
The dataset-level projection difference metric correlates strongly with trait expression after finetuning, allowing early detection of training datasets that may trigger unwanted persona characteristics. It proves more effective than raw projection in predicting trait shifts because it accounts for the base model’s natural response patterns to specific prompts. Sample-level detection achieves high separability between problematic and control samples across trait-eliciting datasets (Evil II, Sycophantic II, Hallucination II) and “EM-like” datasets (Opinion Mistake II). The persona directions identify individual training samples that induce persona shifts with fine-grained precision, outperforming traditional data-filtering methods and providing broad coverage across trait-eliciting content and domain-specific errors.
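Below is a minimal, assumption-laden sketch of a sample-level projection-difference score: it compares the projection of a candidate training response against the projection of the base model's own response to the same prompt, so samples that would pull the model strongly along the trait direction receive high scores. Function names, prompt formatting, and generation settings are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of sample-level scoring for data filtering. Higher scores flag
# training pairs whose responses sit farther along the trait direction than the
# base model's own natural response to the same prompt.
import torch

def response_projection(model, tokenizer, prompt, response, direction, layer):
    """Mean projection of response-token activations onto the trait direction."""
    inputs = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        out = model(**inputs)
    response_acts = out.hidden_states[layer][0, prompt_len:, :]
    return (response_acts @ direction).mean().item()

def sample_score(model, tokenizer, prompt, dataset_response, direction, layer):
    # Projection of the candidate training response.
    data_proj = response_projection(model, tokenizer, prompt, dataset_response,
                                    direction, layer)
    # Projection of the base model's own response to the same prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**inputs, max_new_tokens=64,
                             pad_token_id=tokenizer.eos_token_id)
    own_response = tokenizer.decode(gen[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
    own_proj = response_projection(model, tokenizer, prompt, own_response,
                                   direction, layer)
    return data_proj - own_proj
```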
In conclusion, researchers introduced an automated pipeline that extracts persona vectors from natural-language trait descriptions, providing tools for monitoring and controlling personality shifts across deployment, training, and pre-training phases in LLMs. Future research directions include characterizing the complete persona space dimensionality, identifying natural persona bases, exploring correlations between persona vectors and trait co-expression patterns, and investigating limitations of linear methods for certain personality traits. This study builds a foundational understanding of persona dynamics in models and offers practical frameworks for creating more reliable and controllable language model systems.
Check out the Paper, Technical Blog and GitHub Page.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.