Microsoft president Brad Smith takes a moment to reflect before using the word “guardrails” with the ease of someone who has given a great deal of thought to the dangers of the abyss. A conference on the company’s new release is being held at its headquarters in Redmond, Washington, to which this and other international newspapers have been invited, and EL PAÍS has asked how and who determines whether the company’s artificial intelligence can be used in the context of war, such as the current conflict in Iran. Just a few days ago, it was made public that artificial intelligence firm Anthropic has sued the Pentagon for blacklisting it after the company turned down a contract for the defense entity to utilize its technology. It is the current debate that is raging in the world of Big Tech, and a very familiar issue at Microsoft. In 2021, the Pentagon canceled a $10 billion deal with the company following protests by its employees. Microsoft, in fact, has supported Anthropic in its fight.
Smith answers, “We have principles, we define them and we publish them. By definition, those principles create guardrails. And we stay in our lane within them. It’s not just about when we should use technology, but also about when we shouldn’t use it.”
To assist in this process, Microsoft has a crew that hacks its own products: the red team. The name invokes a militaristic legacy. The red teams were first created by armies to simulate enemy attacks and to detect vulnerabilities before a real adversary could do so. In cybersecurity, the practice has been established for decades. But applying it to generative artificial intelligence is something relatively new, and Microsoft is attributed with being a pioneer in the area, having formed its team in 2018. “Before a product is launched, the red teams break the technology so that others can rebuild it to be more solid and secure,” explains Ram Shankar Siva Kumar, who self-identifies as a “data cowboy” and is the leader of the red team. “AI can generate problems from security failures to psychosocial damage. People use Copilot [Microsoft’s AI] in moments of great vulnerability, so observing how these systems can fail before they get to the user is fundamental,” he says.
His AI internal affairs team has already analyzed more than 100 of the company’s products. Microsoft does not release information regarding how many people work in the team, nor how many or which products whose release they have halted. But he does say that the team has the power to do so: “No high-risk AI system is implemented before undergoing an independent test. If our team identifies serious risks that have not been mitigated, the product is not released until those problems are resolved,” says Kumar.
The question the team poses when it analyzes a product before its release is, “How could one use this AI system, for good or bad, within months or years?”
Six principles
The “guardrails” that Smith mentions are the six general principles that guide the team when it comes to examining products: fairness, reliability and safety, privacy and security, transparency, accountability and inclusiveness. Every day, they translate these principles into concrete tools. “If you give an engineer a 50-page document so they can implement these principles, they’re going to get overwhelmed. We have an open-source tool called Pyrit. We built it for ourselves, and then we made it available to the world, because we believe in the health of the ecosystem,” says Kumar.
On the red team there are neuroscientists, linguists, national security experts, cybersecurity specialists, military veterans and even a formerly incarcerated individual “who rehabilitated themselves,” says Kumar. They also speak 17 languages and “some French, Mongolian, Thai and Korean dialects,” according to the team’s leader, which is important given that one of the red team’s obsessions, he says, is for its AI to avoid making mistakes around the world.
Along with Kumar, the red team’s operations are co-directed by Tori Westerhoff, whose background combines cognitive neuroscience — she studied at Yale and was one of the first members of the Wharton Neuroscience Initiative — and national security strategy, having worked at intelligence and defense agencies. “When we receive an assignment,” she explains, “we simulate what could go wrong at the extremes of that technology’s usage curve. My team delves into how to use that product, both as intended and in unintended ways, to identify the most extreme scenarios and help the product team to replicate and mitigate them before anyone can encounter them in the real world,” she says.
One example of her work was the red teaming, as the practice is called internally by her hackers, of GPT-5, the OpenAI (a Microsoft partner) model launched last August. What they did was train another AI to hack the program, automatically and at a scale that would be impossible for humans to accomplish.
When they tested GPT-5, the red team used Pyrit to automatically generate more than two million fake conversations. The AI continuously tried to attack the other AI for days, exploring combinations that would never occur to a human being. Finding these weak spots manually is an extremely slow process, which is why they trained another AI to do the work, “like in Inception,” says Kumar, a reference to the Christopher Nolan movie in which characters enter into dreams within dreams.
However, Westerhoff, Kumar and Daniel Krutz, who directs the company’s Responsible AI office, emphasize one point: “Red teaming can only be automated to a certain extent, and only humans can determine whether an AI-generated response feels off or reflects a bias,” the company states. The judgement is made by the person; the scale comes courtesy of the machine. That division of labor defines the team’s philosophy.
Westerhoff believes, in fact, that only the human mind is capable of “imagining the spaces that have not yet been observed, that are not completely defined or explored. Our work consists of innovating and creating beyond the space that has been systematized.”
The team has identified three areas in which automatization is blind by definition and in which human judgment is essential. The first has to do with subjects: people are needed to evaluate risk in areas like medicine and security. The second has to do with the places in which the AI will be launched. “We need humans to take linguistic differences into account and to redefine what constitutes damage in different political and cultural contexts,” states the business. The third is emotional intelligence. In this last area, only humans can evaluate the range of interactions that users may have with AI systems. A model can pass all automated tests and still produce responses that would be disturbing for a real person in an actual situation.
This way of seeing AI aligns with the vision of Mustafa Suleyman, one of the founders of Deepmind (now part of Google) and CEO of Microsoft. A few days ago, he wrote in the publication Nature that an apparently conscious AI could become a weapon. As artificial intelligence systems increasingly mimic the structure of human language, he argues, we need design standards and laws to prevent them from being mistaken for sentient beings. “They must remain fundamentally accountable to humans and be subject to the well-being of humanity,” writes Suleyman. “AI agents should have no more rights or freedoms than my laptop.”
The central philosophy underpinning the red team’s work is, in short, that “responsible AI is not a filter applied at the end of development, but a foundational part of the process,” says Kumar. These are Smith’s guardrails, which do not actually act as brakes, but as a condition for going fast and not crashing.
Sign up for our weekly newsletter to get more English-language news coverage from EL PAÍS USA Edition
