Google Research has released Vantage, a research experiment available on Google Labs that places learners in multi-party conversations with AI avatars to assess “future-ready” skills like critical thinking, collaboration, and creative thinking. According to the research post, the AI Evaluator’s scoring agreement with human experts was comparable to the agreement between two expert human raters — a result established through a joint study with New York University involving 188 testers.

The problem Vantage is designed to address is measurement. Future-ready skills are identified as priority competencies by international frameworks including the OECD Learning Compass 2030 and the WEF’s Future of Jobs report, but existing tests are described as “too rigid to capture people’s thought processes and interactions.” Assessing these skills in real human interactions would be more valid but is expensive, labor-intensive, and hard to standardize. The AI simulation approach is an attempt to get closer to authentic assessment conditions while remaining scalable.

How the assessment works

Vantage places learners in open-ended scenarios where they interact with AI avatars; the examples given include preparing for a debate and pitching a creative vision. The system has two AI components with distinct roles.

The first is an Executive LLM. This component uses a predefined assessment rubric to steer the AI avatars in real time, continuously analyzing the state of the conversation and dynamically introducing specific challenges; the post’s examples are pushing back on an idea and introducing a conflict. The purpose is to ensure that, by the end of the conversation, the specific skill being assessed has actually been exercised; this is a key validation question, since unsteered conversations might not naturally surface the targeted behavior. The post describes it as acting as “a next-generation adaptive assessment engine.”
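
The post does not publish the implementation, but the description suggests a supervisory loop: read the transcript, check which rubric criteria still lack evidence, and issue an instruction to the avatars. The sketch below is a guess at that shape, not Vantage’s code; the rubric text, the `steer_turn` function, and the caller-supplied `llm` callable are all assumptions made for illustration.

```python
# Illustrative sketch of rubric-driven steering; not Vantage's actual code.
# The rubric text, the function name, and the caller-supplied `llm`
# callable are assumptions made for this example.
from typing import Callable

COLLABORATION_RUBRIC = [
    "acknowledges and builds on others' ideas",
    "negotiates disagreement constructively",
    "keeps the group focused on the shared goal",
]

def steer_turn(transcript: list[str],
               rubric: list[str],
               llm: Callable[[str], str]) -> str:
    """Ask a supervisory 'executive' model which challenge to inject next.

    The executive reads the conversation so far, sees which rubric
    criteria have not yet been elicited, and returns one instruction
    for the avatars (e.g. push back on the learner's last idea).
    """
    criteria = "\n".join(f"- {c}" for c in rubric)
    history = "\n".join(transcript)
    prompt = (
        "You direct AI avatars in a skills-assessment conversation.\n"
        f"Rubric criteria that still need evidence:\n{criteria}\n\n"
        f"Conversation so far:\n{history}\n\n"
        "Return one instruction for the avatars that will surface an "
        "unmet criterion while keeping the conversation natural."
    )
    return llm(prompt)
```

Passing the model call in as a parameter keeps the sketch independent of any particular LLM API.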

The second is an AI Evaluator. After the conversation is complete, this component analyzes the transcript against the same assessment rubric to identify and measure specific evidence of skill application. The learner then receives a skill map — a visual score and qualitative feedback tied to specific moments in the conversation. The post describes this as making “the ‘invisible’ progress of human skill development visible and actionable.”
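
The post again gives only a functional description, so the following is a minimal sketch of post-hoc, rubric-based scoring with moment-level evidence. The JSON schema, the 0–4 scale, and the `llm` callable are assumptions, not Vantage’s interfaces; numbering the transcript turns is one way to let the model cite the “specific moments” the feedback refers to.

```python
# Illustrative post-hoc evaluator; not Vantage's actual code. The JSON
# schema, the 0-4 scale, and the `llm` callable are assumptions.
import json
from typing import Callable

def evaluate_transcript(transcript: list[str],
                        rubric: list[str],
                        llm: Callable[[str], str]) -> dict:
    """Score a finished conversation against the rubric used to steer it."""
    # Numbered turns let the model cite evidence by index.
    turns = "\n".join(f"[{i}] {t}" for i, t in enumerate(transcript))
    criteria = "\n".join(f"- {c}" for c in rubric)
    prompt = (
        "Score the learner on each criterion from 0 to 4, cite the turn "
        "indices you used as evidence, and give one sentence of feedback. "
        'Reply as JSON: {"<criterion>": {"score": int, '
        '"evidence": [int], "feedback": str}}\n\n'
        f"Criteria:\n{criteria}\n\nTranscript:\n{turns}"
    )
    return json.loads(llm(prompt))
```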

Validation study design and results

The NYU research partnership focused on two research questions. The first was whether the Executive LLM could successfully steer conversations to target specific skills. The study compared how much skill-related evidence learners produced when conversations were steered versus unsteered (with the AI avatars operating independently). The post reports that steering did produce higher-density information about the assessed skills while maintaining a natural conversational flow, and that this result held consistently across multiple simulation tasks.
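
The post does not describe the analysis behind this comparison. One plausible way to operationalize “higher-density information” is to count rubric-relevant utterances per session in each condition and test the difference; the toy sketch below does exactly that, and both the counts and the choice of a Mann-Whitney U test are assumptions, not details from the study.

```python
# Toy steered-vs-unsteered comparison; the counts and the choice of test
# are assumptions, not details from the NYU study.
from scipy.stats import mannwhitneyu

# Rubric-relevant utterances per session under each condition (made up).
steered = [9, 11, 8, 12, 10, 9]
unsteered = [5, 6, 4, 7, 5, 6]

stat, p = mannwhitneyu(steered, unsteered, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")  # small p: steering raises evidence density
```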

The second question was whether AI scoring could match human expert accuracy. Human NYU raters and the AI Evaluator assessed the same 188 sessions using the same pedagogical rubrics. The results showed that agreement between the AI Evaluator and human experts was similar to agreement between two expert raters, which the post presents as establishing automated scoring as a credible alternative for this type of assessment.
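
The post does not name the agreement statistic. For ordinal rubric scores, quadratically weighted Cohen’s kappa is a common choice, and under that reading the claim is that the AI-human kappa lands near the human-human kappa. The sketch below shows the comparison on made-up scores; the metric choice is an assumption.

```python
# Agreement comparison with quadratic-weighted Cohen's kappa. The metric
# and all scores are assumptions; the post does not specify either.
from sklearn.metrics import cohen_kappa_score

def kappa(a: list[int], b: list[int]) -> float:
    return cohen_kappa_score(a, b, weights="quadratic")

# Two human experts and the AI Evaluator scoring the same sessions (made up).
rater_1 = [3, 2, 4, 1, 3, 2, 4, 3]
rater_2 = [3, 3, 4, 1, 2, 2, 4, 3]
ai      = [3, 2, 4, 2, 3, 2, 3, 3]

human_human = kappa(rater_1, rater_2)
ai_human = (kappa(ai, rater_1) + kappa(ai, rater_2)) / 2
print(f"human-human: {human_human:.2f}, AI-human: {ai_human:.2f}")
```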

A second validation study was conducted with OpenMic, a startup building AI tools for durable skills assessment. This study analyzed 180 students’ work on creative multimedia tasks — including character interviews and media articles related to English literature — and compared the AI Evaluator’s scores with OpenMic’s internal experts. The post reports a high correlation between AI and human scores in this context as well, extending the validity evidence to complex creative tasks beyond the initial structured scenarios.
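
“High correlation” usually means a Pearson or Spearman coefficient between the two score sets; since the post does not name the statistic, here is a minimal Pearson check on made-up scores.

```python
# Minimal AI-vs-human correlation check on made-up data; the post reports
# "high correlation" but does not name the coefficient.
from scipy.stats import pearsonr

ai_scores = [78, 85, 62, 90, 71, 88]
human_scores = [75, 88, 60, 92, 70, 85]

r, p = pearsonr(ai_scores, human_scores)
print(f"r = {r:.2f}, p = {p:.4f}")
```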

Potential classroom applications

The post outlines a possible future where Vantage-style assessment becomes a “skills layer” sitting atop existing curricula. The framing is additive: rather than replacing existing group work, students would complete familiar assignments, such as debating a social science topic with AI avatars or planning a laboratory experiment as a team lead, and receive feedback on both their content understanding and the quality of the collaboration and critical thinking they demonstrated in the process.

That pairing of subject-matter and skills feedback within the same task is the stated pedagogical aim. Rather than a separate skills assessment divorced from content, the simulation would be integrated into academic tasks already present in the curriculum.

Vantage is described as a research experiment rather than a production feature, currently available in English for sign-up on Google Labs. The primary claims in the post rest on two validation studies, one with 188 participants assessing collaboration skills and one with 180 students on creative tasks, both of which found the AI Evaluator’s scores comparable to those of human experts. Whether that result holds across a wider range of skill types, age groups, and cultural contexts is not addressed, and the post does not present Vantage as ready for unsupervised deployment in educational settings.