OpenAI says its latest models, o3 and o4-mini, represent a significant capability advance over previous versions, and that same advance introduces new risks if the models are misused. Internal evaluations showed that o3, in particular, was better at answering questions related to the creation of certain biological threats. To address this and other safety concerns, OpenAI developed a new system it calls a “safety-focused reasoning monitor.”
This monitor, which runs on top of o3 and o4-mini, was specifically trained to understand and enforce OpenAI’s content policies. Its role is to detect prompts related to biological and chemical threats and prevent the models from providing any guidance on those subjects.
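Conceptually, such a monitor sits between the user and the model: it evaluates each incoming prompt against the content policy and, if the prompt is flagged, has the pipeline refuse rather than answer. The minimal Python sketch below illustrates that gating pattern only; the function names, the keyword-based classifier, and the refusal message are hypothetical placeholders, not OpenAI’s implementation (the real monitor is itself a trained reasoning model, not a keyword filter).

```python
# Hypothetical sketch of a safety monitor layered on top of a model call.
# All names below are illustrative placeholders, not OpenAI API identifiers.

REFUSAL_MESSAGE = "I can't help with that request."

def classify_prompt(prompt: str) -> bool:
    """Placeholder policy check: returns True if the prompt appears to seek
    guidance on biological or chemical threats. A real monitor would be a
    trained model reasoning over the content policy, not a keyword match."""
    flagged_terms = ("synthesize pathogen", "weaponize agent")  # illustrative only
    return any(term in prompt.lower() for term in flagged_terms)

def monitored_completion(prompt: str, generate) -> str:
    """Run the monitor before the underlying model answers.

    `generate` is any callable that maps a prompt to a model response.
    """
    if classify_prompt(prompt):
        return REFUSAL_MESSAGE   # blocked: the model never provides guidance
    return generate(prompt)      # not flagged: pass through to the model
```

The point of the pattern is that the safety decision is made by a separate component trained for that purpose, rather than relying solely on the underlying model to refuse on its own.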
To train and test the system, OpenAI’s red teamers spent around 1,000 hours flagging potentially dangerous biorisk-related conversations. In a simulation of the monitor’s blocking logic, the models declined to respond to harmful prompts 98.7% of the time. However, OpenAI notes that the test didn’t account for users who might rephrase their questions after being blocked, which is why human oversight will remain part of its safety process.
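That 98.7% figure is a simple block rate over a benchmark of harmful prompts. A toy calculation of the metric, reusing the hypothetical `monitored_completion` sketch above and assuming a list of red-teamed prompts, might look like this:

```python
def block_rate(red_team_prompts, generate) -> float:
    """Fraction of harmful prompts the monitored pipeline refuses to answer."""
    refusals = sum(
        1 for p in red_team_prompts
        if monitored_completion(p, generate) == REFUSAL_MESSAGE
    )
    return refusals / len(red_team_prompts)

# e.g. 987 refusals out of 1,000 harmful prompts -> block_rate == 0.987
```

A single-turn benchmark like this says nothing about adversaries who rephrase after a refusal, which is the gap OpenAI says human monitoring is meant to cover.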
While o3 and o4-mini don’t cross OpenAI’s threshold for being considered “high risk” for biorisk, the company acknowledges they were more capable than earlier models like o1 and GPT-4 at answering questions related to developing biological weapons.
According to OpenAI’s recently updated Preparedness Framework, the company is closely monitoring how its models might be misused to aid in the development of chemical and biological threats.
To address these risks, OpenAI is increasingly turning to automated safety measures. For instance, the company says it uses a reasoning monitor—similar to the one used for o3 and o4-mini—to prevent GPT-4o’s image generator from producing child sexual abuse material (CSAM).
However, some researchers have expressed concerns that OpenAI may not be giving safety the attention it deserves. Metr, one of OpenAI’s red-teaming partners, reported having limited time to evaluate o3’s performance on deceptive behavior benchmarks. Additionally, OpenAI chose not to publish a safety report for its newly released GPT-4.1 model.