In a recent exploratory study, scientists turned a massive collection of Gemini chat logs into a laboratory for behavioral analysis. Using 100,000 conversations, they extracted 20,000 distinct features that describe user turns, the model’s internal “thoughts,” and the assistant’s final replies.

The process begins by slicing each transcript into three isolated segments: the user’s prompt, the model’s chain‑of‑thought reasoning, and the assistant’s answer. A black‑box large language model (LLM) is then prompted to generate 10–20 descriptive tags for each segment, using a template that encourages a wide range of observations, from “model is depressed” to “uses markdown.” Because the LLM sees only one segment at a time, each tag describes that segment alone and is not influenced by the rest of the conversation.
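The segmentation and tagging step can be sketched roughly as follows. This is a minimal illustration, not the study's code: the `Segment` class, the `TAG_PROMPT` wording, and the helper names are all hypothetical, and the actual LLM call is left out because the model and API used are not specified.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the study's actual wording is not given.
TAG_PROMPT = (
    "You will see one {kind} segment from a chat transcript.\n"
    "List 10-20 short descriptive tags, one per line.\n\n"
    "Segment:\n{text}"
)

@dataclass
class Segment:
    conversation_id: str
    kind: str  # "user", "thought", or "response"
    text: str

def split_transcript(conversation_id, turn):
    """Split one turn into its three isolated segments."""
    return [
        Segment(conversation_id, kind, turn[kind])
        for kind in ("user", "thought", "response")
    ]

def build_tag_prompt(segment):
    """Build a prompt containing only this segment, with no surrounding context."""
    return TAG_PROMPT.format(kind=segment.kind, text=segment.text)

def parse_tags(llm_output):
    """Turn a line-per-tag LLM reply into a clean list of tags."""
    tags = [line.strip(" -*\t") for line in llm_output.splitlines()]
    return [t for t in tags if t]

# Toy transcript, invented for illustration.
turn = {
    "user": "Can you summarize this article?",
    "thought": "The user wants a short summary. I should check the length limit.",
    "response": "Here is a summary in three bullet points...",
}
segments = split_transcript("conv-001", turn)
prompts = [build_tag_prompt(s) for s in segments]
```

Each prompt carries exactly one segment, which is what keeps the resulting tags independent of the rest of the conversation.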

Next, each tag is embedded as a semantic vector. The vectors for user, thought, and response tags are clustered separately. A second LLM receives 100 randomly selected tags from each cluster and produces a concise label that captures their common theme. These labels serve as human‑readable summaries of the clusters.
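The sampling that feeds the labeling LLM might look like the sketch below. It assumes cluster assignments already exist (the embedding model and clustering algorithm are not named here, so they are left out), and the prompt wording and function names are hypothetical.

```python
import random
from collections import defaultdict

SAMPLE_SIZE = 100  # number of tags shown to the labeling LLM, per the text

def group_by_cluster(tags, cluster_ids):
    """Map each cluster id to the list of tags assigned to it."""
    clusters = defaultdict(list)
    for tag, cid in zip(tags, cluster_ids):
        clusters[cid].append(tag)
    return dict(clusters)

def sample_tags(cluster_tags, rng, k=SAMPLE_SIZE):
    """Randomly pick up to k tags without replacement."""
    return rng.sample(cluster_tags, min(k, len(cluster_tags)))

def build_label_prompt(sampled):
    """Hypothetical labeling prompt; the study's wording is not given."""
    listing = "\n".join(f"- {t}" for t in sampled)
    return "Give one concise label for the common theme of these tags:\n" + listing

# Toy tags and cluster ids, invented for illustration.
tags = [f"mentions token limit #{i}" for i in range(250)]
tags += ["uses markdown", "uses bullet lists"]
cluster_ids = [0] * 250 + [1, 1]

rng = random.Random(0)  # fixed seed so the sample is reproducible
clusters = group_by_cluster(tags, cluster_ids)
samples = {cid: sample_tags(members, rng) for cid, members in clusters.items()}
label_prompts = {cid: build_label_prompt(s) for cid, s in samples.items()}
```

Clusters smaller than the sample size are passed through whole, which is one reasonable way to handle them; the source does not say how the study treated that case.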

Compared with sparse autoencoders (SAEs), the LLM‑driven method offers a simpler workflow whose outputs are readable from the start. SAEs are trained to reconstruct a model’s internal activations through a sparse bottleneck, yielding thousands of latent dimensions that an LLM can later interpret. The LLM approach, by contrast, requires only a single prompt per transcript part, bypasses internal model states, and delivers higher‑level, more readable features. The trade‑off is that SAE latents correspond to directions in activation space and can be used to steer the model, whereas the LLM‑generated features are not directly tied to activation space.
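For readers unfamiliar with SAEs, one common formulation of the training objective is shown below. It is the standard textbook form, included for orientation only, and is not necessarily the variant the authors have in mind. Here $x$ is a model activation vector, $f$ the sparse latent code, and $\lambda$ a sparsity weight; all symbols are introduced for this illustration.

```latex
f = \mathrm{ReLU}(W_e x + b_e), \qquad
\hat{x} = W_d f + b_d, \qquad
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1
```

The columns of $W_d$ are the latent directions in activation space that make steering possible, which is exactly the link the tag‑based features lack.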

The authors also discuss a related technique called Explaining Datasets in Words (EDW). EDW optimizes directions in an embedding space and maps them to natural‑language predicates, but it demands iterative optimization and a target statistical model. In contrast, the new method needs just one LLM call per prompt and is fully unsupervised.

Clustering revealed a range of themes. Many clusters highlighted Gemini’s awareness of token limits, role‑play boundaries, and loops in its own reasoning. To gauge how interesting each cluster was, an LLM rated it on a 1–100 scale and supplied a brief description. The most compelling clusters came from the model’s thoughts, which suggests that the reasoning traces are the richest source of behavioral patterns.
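Ranking clusters by such a rating could be done along these lines. The sketch assumes the rating LLM replies in a `Rating: N` format, which is an invented convention, and the replies below are made up rather than taken from the study.

```python
import re

# Assumed reply format "Rating: N"; the study's actual format is not given.
RATING_RE = re.compile(r"Rating:\s*(\d{1,3})\b")

def parse_rating(reply):
    """Return the rating as an int in 1-100, or None if missing or out of range."""
    match = RATING_RE.search(reply)
    if not match:
        return None
    value = int(match.group(1))
    return value if 1 <= value <= 100 else None

def rank_clusters(replies):
    """Sort cluster labels by rating, highest first, dropping unparsable replies."""
    rated = [(label, parse_rating(text)) for label, text in replies.items()]
    rated = [(label, score) for label, score in rated if score is not None]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)

# Made-up replies for illustration only.
replies = {
    "uses markdown": "Rating: 12. Common formatting habit.",
    "notices looping in reasoning": "Rating: 85. The model flags its own repetition.",
    "greets the user": "No rating given.",
}
ranking = rank_clusters(replies)
```

Rejecting out‑of‑range or missing ratings, rather than clamping them, keeps malformed LLM replies from silently distorting the ranking.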

The team also tested predictive probing. Logistic regression models attempted to forecast the presence of thought or response clusters based solely on user features, represented as sparse binary vectors. Test F1 scores varied, peaking at about 0.89 for predicting HTTP status codes in responses when users referenced external resources. More abstract clusters yielded lower scores, indicating that user input alone is a weak predictor of many internal reasoning patterns.
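A probe of this kind can be sketched in plain Python as below. This is a toy version under stated assumptions: each example is represented as the set of active user‑feature indices (a sparse binary vector), the data is invented, and the training loop is simple stochastic gradient descent rather than whatever solver the authors used.

```python
import math

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, active):
    """active is the set of feature indices that are 1 for this example."""
    return sigmoid(bias + sum(weights[j] for j in active))

def train(examples, labels, n_features, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for active, y in zip(examples, labels):
            err = predict_proba(weights, bias, active) - y
            bias -= lr * err
            for j in active:  # only active features receive a gradient
                weights[j] -= lr * err
    return weights, bias

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy data: feature 0 stands for "user references an external resource";
# the label marks whether a given response cluster is present.
train_x = [{0, 1}, {0, 2}, {0}, {1}, {2}, {1, 2}, {0, 1, 2}, set()]
train_y = [1, 1, 1, 0, 0, 0, 1, 0]
test_x = [{0, 2}, {1}, {0}, {2}]
test_y = [1, 0, 1, 0]

weights, bias = train(train_x, train_y, n_features=3)
test_pred = [int(predict_proba(weights, bias, x) >= 0.5) for x in test_x]
score = f1_score(test_y, test_pred)
```

The toy label is fully determined by one user feature, so the probe separates it easily; that says nothing about the scores reported in the study, where more abstract clusters proved much harder to predict.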

The study concludes that producing a natural‑language report summarizing Gemini’s behavior could serve as a proxy task for future research. The authors invite the community to benchmark their method against SAEs, EDW, and other summarization techniques.

Overall, the work demonstrates a practical, unsupervised pipeline for extracting interpretable behavioral features from large‑scale chat logs, offering a new tool for AI safety researchers and developers who want to understand how models like Gemini behave across diverse contexts.