Microsoft has introduced ASSERT, an open-source framework designed to help developers assess whether artificial intelligence systems behave in line with the specific requirements of their applications and services.
According to TechCrunch, the framework, known as Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), was unveiled on Tuesday to simplify the testing of application-specific AI behaviour.
Microsoft said ASSERT uses artificial intelligence to convert high-level descriptions of goals, policies and intended behaviours into detailed, scored tests that can be examined by developers.
The framework translates plain-language descriptions of expected AI behaviour into structured categories of acceptable and unacceptable actions. It then generates test scenarios, runs evaluations against the target system and scores the outcomes.
ASSERT also records the routes taken by AI systems, including intermediate actions and tool calls, allowing developers to identify where failures occur during testing.
Developers can supply additional context, tools and operational constraints to tailor evaluations to the requirements of a particular application.
For instance, a developer may require a document research AI agent not to send emails outside the organisation, restrict confidential information to senior executives and provide concise summaries that take previous context into account. ASSERT can generate tests to determine whether those requirements are consistently met.
Microsoft said the framework addresses a gap left by broader AI evaluations when model behaviour is shaped by an application's policies, tools and operational environment.
Sarah Bird, Chief Product Officer for Responsible AI at Microsoft, said evaluations are essential for understanding AI system behaviour and determining whether it meets an organisation's standards.
She said organisations seeking trustworthy AI systems should assess a wider range of application-specific factors rather than relying solely on general evaluations.
Bird added that ASSERT can be used during development, after deployment and as part of continuous monitoring programmes.
The launch comes as the artificial intelligence sector places greater emphasis on repeatable testing and regression assessments. Research initiatives including Stanford University's HELM, MLCommons' AILuminate and evaluation groups such as METR have introduced benchmarks aimed at measuring AI model performance under different conditions.







