OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer. For real-world applications, the technique points toward more transparent and steerable AI systems.

What are confessions?

Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.

A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.

In a blog post, the OpenAI researchers provide a few examples of the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them." The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.

How confession training works

The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.

This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem. Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.

However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information.
The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine the user's intent.

What it means for enterprise AI

OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has released its own research showing how LLMs can learn malicious behavior, and it is likewise working to plug these holes as they emerge.

For AI applications, confessions can serve as a practical monitoring layer. The structured output of a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.

In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements of safe and reliable deployment.

“As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they are doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”
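To make the escalation pattern described above concrete, here is a minimal Python sketch of inference-time gating on a confession report. The confession schema (fields like violations and uncertainty) and the generate_with_confession placeholder are hypothetical; OpenAI has not published a production API for confessions, so treat this as an illustration of the idea rather than an implementation of their method.

```python
# Illustrative sketch: gate a model response on its own confession report.
# The confession schema and generate_with_confession() are hypothetical stand-ins,
# not an actual OpenAI API.
from dataclasses import dataclass, field

@dataclass
class Confession:
    instructions_followed: list[str] = field(default_factory=list)
    violations: list[str] = field(default_factory=list)   # self-reported rule breaks
    uncertainty: float = 0.0                               # 0.0 = confident, 1.0 = unsure

@dataclass
class ModelOutput:
    answer: str
    confession: Confession

def generate_with_confession(prompt: str) -> ModelOutput:
    """Placeholder for a model call that returns both an answer and a confession."""
    raise NotImplementedError("wire this up to the model or provider of your choice")

def route(output: ModelOutput, uncertainty_threshold: float = 0.7) -> str:
    """Escalate to human review when the confession admits violations or high uncertainty."""
    if output.confession.violations:
        return "escalate: self-reported policy violation"
    if output.confession.uncertainty >= uncertainty_threshold:
        return "escalate: low confidence"
    return "deliver"
```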
Amazon Web Services on Wednesday introduced Kiro powers, a system that allows software developers to give their AI coding assistants instant, specialized expertise in specific tools and workflows — addressing what the company calls a fundamental bottleneck in how artificial intelligence agents operate today.

AWS made the announcement at its annual re:Invent conference in Las Vegas. The capability marks a departure from how most AI coding tools work today. Typically, these tools load every possible capability into memory upfront — a process that burns through computational resources and can overwhelm the AI with irrelevant information. Kiro powers takes the opposite approach, activating specialized knowledge only at the moment a developer actually needs it.

"Our goal is to give the agent specialized context so it can reach the right outcome faster — and in a way that also reduces cost," said Deepak Singh, Vice President of Developer Agents and Experiences at Amazon, in an exclusive interview with VentureBeat.

The launch includes partnerships with nine technology companies: Datadog, Dynatrace, Figma, Neon, Netlify, Postman, Stripe, Supabase, and AWS's own services. Developers can also create and share their own powers with the community.

Why AI coding assistants choke when developers connect too many tools

To understand why Kiro powers matters, it helps to understand a growing tension in the AI development tool market.

Modern AI coding assistants rely on something called the Model Context Protocol, or MCP, to connect with external tools and services. When a developer wants their AI assistant to work with Stripe for payments, Figma for design, and Supabase for databases, they connect MCP servers for each service.

The problem: each connection loads dozens of tool definitions into the AI's working memory before it writes a single line of code. According to AWS documentation, connecting just five MCP servers can consume more than 50,000 tokens — roughly 40 percent of an AI model's context window — before the developer even types their first request.

Developers have grown increasingly vocal about this issue. Many complain that they don't want to burn through their token allocations just to have an AI agent figure out which tools are relevant to a specific task. They want to get to their workflow instantly — not watch an overloaded agent struggle to sort through irrelevant context.

This phenomenon, which some in the industry call "context rot," leads to slower responses, lower-quality outputs, and significantly higher costs — since AI services typically charge by the token.

Inside the technology that loads AI expertise on demand

Kiro powers addresses this by packaging three components into a single, dynamically loaded bundle.

The first component is a steering file called POWER.md, which functions as an onboarding manual for the AI agent. It tells the agent what tools are available and, crucially, when to use them. The second component is the MCP server configuration itself — the actual connection to external services. The third includes optional hooks and automation that trigger specific actions.

When a developer mentions "payment" or "checkout" in their conversation with Kiro, the system automatically activates the Stripe power, loading its tools and best practices into context. When the developer shifts to database work, Supabase activates while Stripe deactivates. The baseline context usage when no powers are active approaches zero.

"You click a button and it automatically loads," Singh said.
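The dynamic-loading idea itself is straightforward to sketch. The snippet below is an illustrative Python mock-up of keyword-triggered activation; the power names, trigger words and token counts are invented for the example, and this is not Kiro's actual implementation or file format.

```python
# Illustrative sketch of dynamic power activation (not Kiro's real implementation).
# Each "power" bundles tool definitions that cost context tokens when loaded;
# only powers whose trigger keywords appear in the request get activated.
from dataclasses import dataclass

@dataclass
class Power:
    name: str
    triggers: set[str]        # keywords that should activate this power
    context_tokens: int       # rough cost of loading its tool definitions

POWERS = [
    Power("stripe", {"payment", "checkout", "invoice"}, 9_000),
    Power("supabase", {"database", "table", "sql"}, 8_000),
    Power("figma", {"design", "mockup", "frame"}, 7_500),
]

def active_powers(request: str) -> list[Power]:
    words = set(request.lower().split())
    return [p for p in POWERS if p.triggers & words]

request = "add a checkout flow that writes orders to the database"
selected = active_powers(request)
eager_cost = sum(p.context_tokens for p in POWERS)       # load everything upfront
lazy_cost = sum(p.context_tokens for p in selected)      # load only what's needed
print([p.name for p in selected], f"lazy={lazy_cost} vs eager={eager_cost} tokens")
```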
"Once a power has been created, developers just select 'open in Kiro' and it launches the IDE with everything ready to go."How AWS is bringing elite developer techniques to the massesSingh framed Kiro powers as a democratization of advanced development practices. Before this capability, only the most sophisticated developers knew how to properly configure their AI agents with specialized context — writing custom steering files, crafting precise prompts, and manually managing which tools were active at any given time."We've found that our developers were adding in capabilities to make their agents more specialized," Singh said. "They wanted to give the agent some special powers to do a specific problem. For example, they wanted their front end developer, and they wanted the agent to become an expert at backend as a service."This observation led to a key insight: if Supabase or Stripe could build the optimal context configuration once, every developer using those services could benefit."Kiro powers formalizes that — things that people, only the most advanced people were doing — and allows anyone to get those kind of skills," Singh said.Why dynamic loading beats fine-tuning for most AI coding use casesThe announcement also positions Kiro powers as a more economical alternative to fine-tuning, the process of training an AI model on specialized data to improve its performance in specific domains."It's much cheaper," Singh said, when asked how powers compare to fine-tuning. "Fine-tuning is very expensive, and you can't fine-tune most frontier models."This is a significant point. The most capable AI models from Anthropic, OpenAI, and Google are typically "closed source," meaning developers cannot modify their underlying training. They can only influence the models' behavior through the prompts and context they provide."Most people are already using powerful models like Sonnet 4.5 or Opus 4.5," Singh said. "What those models need is to be pointed in the right direction."The dynamic loading mechanism also reduces ongoing costs. Because powers only activate when relevant, developers aren't paying for token usage on tools they're not currently using.Where Kiro powers fits in Amazon's bigger bet on autonomous AI agentsKiro powers arrives as part of a broader push by AWS into what the company calls "agentic AI" — artificial intelligence systems that can operate autonomously over extended periods.Earlier at re:Invent, AWS announced three "frontier agents" designed to work for hours or days without human intervention: the Kiro autonomous agent for software development, the AWS security agent, and the AWS DevOps agent. These represent a different approach from Kiro powers — tackling large, ambiguous problems rather than providing specialized expertise for specific tasks.The two approaches are complementary. Frontier agents handle complex, multi-day projects that require autonomous decision-making across multiple codebases. Kiro powers, by contrast, gives developers precise, efficient tools for everyday development tasks where speed and token efficiency matter most.The company is betting that developers need both ends of this spectrum to be productive.What Kiro powers reveals about the future of AI-assisted software developmentThe launch reflects a maturing market for AI development tools. GitHub Copilot, which Microsoft launched in 2021, introduced millions of developers to AI-assisted coding. 
Since then, a proliferation of tools — including Cursor, Cline, and Claude Code — has competed for developers' attention.

But as these tools have grown more capable, they've also grown more complex. The Model Context Protocol, which Anthropic open-sourced last year, created a standard for connecting AI agents to external services. That solved one problem while creating another: the context overload that Kiro powers now addresses.

AWS is positioning itself as the company that understands production software development at scale. Singh emphasized that Amazon's experience running AWS for 20 years, combined with its own massive internal software engineering organization, gives it unique insight into how developers actually work.

"It's not something you would use just for your prototype or your toy application," Singh said of AWS's AI development tools. "If you want to build production applications, there's a lot of knowledge that we bring in as AWS that applies here."

The road ahead for Kiro powers and cross-platform compatibility

AWS indicated that Kiro powers currently works only within the Kiro IDE, but the company is building toward cross-compatibility with other AI development tools, including command-line interfaces, Cursor, Cline, and Claude Code. The company's documentation describes a future where developers can "build a power once, use it anywhere" — though that vision remains aspirational for now.

For the technology partners launching powers today, the appeal is straightforward: rather than maintaining separate integration documentation for every AI tool on the market, they can create a single power that works everywhere Kiro does. As more AI coding assistants crowd into the market, that kind of efficiency becomes increasingly valuable.

Kiro powers is available now to developers using Kiro IDE version 0.7 or later at no additional charge beyond the standard Kiro subscription.

The underlying bet is a familiar one in the history of computing: that the winners in AI-assisted development won't be the tools that try to do everything at once, but the ones smart enough to know what to forget.
The debate over whether artificial intelligence belongs in the corporate boardroom appears to be over — at least for the people responsible for generating revenue.

Seven in ten enterprise revenue leaders now trust AI to regularly inform their business decisions, according to a sweeping new study released Thursday by Gong, the revenue intelligence company. The finding marks a dramatic shift from just two years ago, when most organizations treated AI as an experimental technology relegated to pilot programs and individual productivity hacks.

The research, based on an analysis of 7.1 million sales opportunities across more than 3,600 companies and a survey of over 3,000 global revenue leaders spanning the United States, United Kingdom, Australia, and Germany, paints a picture of an industry in rapid transformation. Organizations that have embedded AI into their core go-to-market strategies are 65 percent more likely to increase their win rates than competitors still treating the technology as optional.

"I don't think people delegate decisions to AI, but they do rely on AI in the process of making decisions," Amit Bendov, Gong's co-founder and chief executive, said in an exclusive interview with VentureBeat. "Humans are making the decision, but they're largely assisted."

The distinction matters. Rather than replacing human judgment, AI has become what Bendov describes as a "second opinion" — a data-driven check on the intuition and guesswork that has traditionally governed sales forecasting and strategy.

Slowing growth is forcing sales teams to squeeze more from every rep

The timing of AI's ascendance in revenue organizations is no coincidence. The study reveals a sobering reality: after rebounding in 2024, average annual revenue growth among surveyed companies decelerated to 16 percent in 2025, marking a three-percentage-point decline year over year. Sales rep quota attainment fell from 52 percent to 46 percent over the same period.

The culprit, according to Gong's analysis, isn't that salespeople are performing worse on individual deals. Win rates and deal duration remained consistent. The problem is that representatives are working fewer opportunities—a finding that suggests operational inefficiencies are eating into selling time.

This helps explain why productivity has rocketed to the top of executive priorities. For the first time in the study's history, increasing the productivity of existing teams ranked as the number-one growth strategy for 2026, jumping from fourth place the previous year.

"The focus is on increasing sales productivity," Bendov said. "How much dollar-output per dollar-input."

The numbers back up the urgency. Teams where sellers regularly use AI tools generate 77 percent more revenue per representative than those that don't — a gap Gong characterizes as a six-figure difference per salesperson annually.

Companies are moving beyond basic AI automation toward strategic decision-making

The nature of AI adoption in sales has evolved considerably over the past year. In 2024, most revenue teams used AI for basic automation: transcribing calls, drafting emails, updating CRM records. Those use cases continue to grow, but 2025 marked what the report calls a shift "from automation to intelligence."

The number of U.S. companies using AI for forecasting and measuring strategic initiatives jumped 50 percent year over year.
These more sophisticated applications — predicting deal outcomes, identifying at-risk accounts, measuring which value propositions resonate with different buyer personas — correlate with dramatically better results.

Organizations in the 95th percentile of commercial impact from AI were two to four times more likely to have deployed these strategic use cases, according to the study.

Bendov offered a concrete example of how this plays out in practice. "Companies have thousands of deals that they roll up into their forecast," he said. "It used to be based solely on human sentiment—believe it or not. That's why a lot of companies miss their numbers: because people say, 'Oh, he told me he'll buy,' or 'I think I can probably get this one.'"

AI changes that calculus by examining evidence rather than optimism. "Companies now get a second opinion from AI on their forecasting, and that improves forecasting accuracy dramatically — 10 [or] 15 percent better accuracy just because it's evidence-based, not just based on human sentiment," Bendov said.

Revenue-specific AI tools are dramatically outperforming general-purpose alternatives

One of the study's more provocative findings concerns the type of AI that delivers results. Teams using revenue-specific AI solutions — tools built explicitly for sales workflows rather than general-purpose platforms like ChatGPT — reported 13 percent higher revenue growth and 85 percent greater commercial impact than those relying on generic tools.

These specialized systems were also twice as likely to be deployed for forecasting and predictive modeling, the report found.

The finding carries obvious implications for Gong, which sells precisely this type of domain-specific platform. But the data suggests a real distinction in outcomes. General-purpose AI, while more prevalent, often creates what the report describes as a "blind spot" for organizations — particularly when employees adopt consumer AI tools without company oversight.

Research from MIT suggests that while only 59 percent of survey respondents said their teams use personal AI tools like ChatGPT at work, the actual figure is likely closer to 90 percent. This shadow AI usage poses security risks and creates fragmented technology stacks that undermine the potential for organization-wide intelligence.

Most sales leaders believe AI will reshape their jobs rather than eliminate them

Perhaps the most closely watched question in any AI study concerns employment. The Gong research offers a more nuanced picture than the apocalyptic predictions that often dominate headlines.

When asked about AI's three-year impact on revenue headcount, 43 percent of respondents said they expect it to transform jobs without reducing headcount — the most common response. Only 28 percent anticipate job eliminations, while 21 percent actually foresee AI creating new roles. Just 8 percent predict minimal impact.

Bendov frames the opportunity in terms of reclaiming lost time. He cited Forrester research indicating that 77 percent of a sales representative's time is spent on activities that don't involve customers — administrative work, meeting preparation, researching accounts, updating forecasts, and internal briefings.

"AI can eliminate, ideally, all 77 percent—all the drudgery work that they're doing," Bendov said. "I don't think it necessarily eliminates jobs. People are half productive right now.
Let's make them fully productive, and whatever you're paying them will translate to much higher revenue."

The transformation is already visible in role consolidation. Over the past decade, sales organizations splintered into hyper-specialized functions: one person qualifies leads, another sets appointments, a third closes deals, a fourth handles onboarding. The result was customers interacting with five or six different people across their buying journey.

"Which is not a great buyer experience, because every time I meet a new person that might not have the full context, and it's very inefficient for companies," Bendov said. "Now with AI, you can have one person do all this, or much of this."

At Gong itself, sellers now generate 80 percent of their own appointments because AI handles the prospecting legwork, Bendov said.

American companies are adopting AI 18 months faster than their European counterparts

The study reveals a notable divide in AI adoption between the United States and Europe. While 87 percent of U.S. companies now use AI in their revenue operations, with another 9 percent planning adoption within a year, the United Kingdom trails by 12 to 18 months. Just 70 percent of UK companies currently use AI, with 22 percent planning near-term adoption — figures that mirror U.S. data from 2024.

Bendov said the pattern reflects a broader historical tendency for enterprise technology trends to cross the Atlantic with a delay. "It's always like that," he said. "Even when the internet was taking off in the US, Europe was a step behind."

The gap isn't permanent, he noted, and Europe sometimes leads on technology adoption — mobile payments and messaging apps like WhatsApp gained traction there before the U.S. — but for AI specifically, the American market remains ahead.

Gong says a decade of AI development gives it an edge over Salesforce and Microsoft

The findings arrive as Gong navigates an increasingly crowded market. The company, which recently surpassed $300 million in annual recurring revenue, faces potential competition from enterprise software giants like Salesforce and Microsoft, both of which are embedding AI capabilities into their platforms.

Bendov argues that Gong's decade of AI development creates a substantial barrier to entry. The company's architecture comprises three layers: a "revenue graph" that aggregates customer data from CRM systems, emails, calls, videos, and web signals; an intelligence layer combining large language models with approximately 40 proprietary small language models; and workflow applications built on top.

"Anybody that would want to build something like that—it's not a small feature, it's 10 years in development—would need first to build the revenue graph," Bendov said.

Rather than viewing Salesforce and Microsoft as threats, Bendov characterized them as partners, pointing to both companies' participation in Gong's recent user conference to discuss agent interoperability. The rise of MCP (Model Context Protocol) support and consumption-based pricing models means customers can mix AI agents from multiple vendors rather than committing to a single platform.

The real question is whether AI will expand the sales profession or hollow it out

The report's implications extend beyond sales departments. If AI can transform revenue operations — long considered a relationship-driven, human-centric function — it raises questions about which other business processes might be next.

Bendov sees the potential for expansion rather than contraction.
Drawing an analogy to digital photography, he noted that while camera manufacturers suffered, the total number of photos taken exploded once smartphones made photography effortless.

"If AI makes selling simple, I could see a world—I don't know exactly what it looks like yet—but why not?" Bendov said. "Maybe ten times more jobs than we have now. It's expensive and inefficient today, but if it becomes as easy as taking a photo, the industry could actually grow and create opportunities for people of different abilities, from different locations."

For Bendov, who co-founded Gong in 2015 when AI was still a hard sell to non-technical business users, the current moment represents something he waited a decade to see. Back then, mentioning AI to sales executives sounded like science fiction. The company struggled to raise money because the underlying technology barely existed.

"When we started the company, we were born as an AI company, but we had to almost hide AI," Bendov recalled. "It was intimidating."

Now, seven out of ten of those same executives say they trust AI to help run their business. The technology that once had to be disguised has become the one thing nobody can afford to ignore.
Sometimes, you need complex HR and business strategy explained fast—and visually. That’s why our infographics were among our most-viewed content this year! From decoding the latest HR buzzwords to breaking down AI adoption strategies and mapping the perfect people and business strategy alignment, HRDA put the most important topics of 2025 into clear, digestible graphics.
Betsy Mercado has recently stepped into the newly created, crucial role of Chief People Officer at Jersey Mike’s, a leading franchisor in the fast-casual sandwich space. Her mandate is clear: to guide the company’s rapid expansion by developing comprehensive people strategies that reinforce Jersey Mike’s distinctive, people-first culture. In this pivotal role, Mercado oversees the entire team.
Recent guidance from the Equal Employment Opportunity Commission (EEOC) makes it clear that national origin protections under Title VII of the Civil Rights Act of 1964 protect American workers. In a one-page guidance document, the EEOC emphasized that federal law prohibits employers from favoring foreign workers to the detriment of American workers based on national origin.
For all their superhuman power, today’s AI models suffer from a surprisingly human flaw: They forget. Give an AI assistant a sprawling conversation, a multi-step reasoning task or a project spanning days, and it will eventually lose the thread. Engineers refer to this phenomenon as “context rot,” and it has quietly become one of the most significant obstacles to building AI agents that can function reliably in the real world.

A research team from China and Hong Kong believes it has created a solution to context rot. Their new paper introduces general agentic memory (GAM), a system built to preserve long-horizon information without overwhelming the model. The core premise is simple: Split memory into two specialized roles, one that captures everything, another that retrieves exactly the right things at the right moment.

Early results are encouraging, and the timing is apt: As the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM arrives at precisely the right inflection point.

When bigger context windows still aren’t enough

At the heart of every large language model (LLM) lies a rigid limitation: A fixed “working memory,” more commonly referred to as the context window. Once conversations grow long, older information gets truncated, summarized or silently dropped. This limitation has long been recognized by AI researchers, and since early 2023, developers have been working to expand context windows, rapidly increasing the amount of information a model can handle in a single pass.

Mistral’s Mixtral 8x7B debuted with a 32K-token window, which works out to roughly 24,000 to 25,000 words, or about 128,000 characters of English text; a substantial amount, but modest by today's standards. This was followed by MosaicML’s MPT-7B-StoryWriter-65k+, which more than doubled that capacity; then came Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, offering massive 128K and 200K windows, both of which are extendable to an unprecedented one million tokens. Even Microsoft joined the push, vaulting from the 2K-token limit of the earlier Phi models to the 128K context window of Phi-3.

Increasing context windows might sound like the obvious fix, but it isn’t. Even models with sprawling 100K-token windows, enough to hold hundreds of pages of text, still struggle to recall details buried near the beginning of a long conversation. Scaling context comes with its own set of problems. As prompts grow longer, models become less reliable at locating and interpreting information because attention over distant tokens weakens and accuracy gradually erodes.

Longer inputs also dilute the signal-to-noise ratio, as including every possible detail can actually make responses worse than using a focused prompt. Long prompts also slow models down; more input tokens lead to noticeably higher output-token latency, creating a practical limit on how much context can be used before performance suffers.

Memories are priceless

For most organizations, supersized context windows come with a clear downside — they’re costly. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, even a single bloated request can drive up expenses. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context.
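To put rough numbers on that, here is a back-of-the-envelope Python sketch of how input-token pricing scales with prompt size. The per-token price, call volume and prompt sizes are assumptions for illustration only, not any particular provider's rates.

```python
# Back-of-the-envelope input-cost comparison (assumed prices, not a real provider's rates).
PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # assumed: $3 per 1M input tokens

def prompt_cost(tokens: int, calls_per_day: int = 1_000) -> float:
    """Daily input cost for a prompt of a given size at a given call volume."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS * calls_per_day

focused = prompt_cost(2_000)     # a tight, task-specific prompt
bloated = prompt_cost(120_000)   # dumping a long history into every request
print(f"focused: ${focused:.2f}/day, bloated: ${bloated:.2f}/day, ratio: {bloated / focused:.0f}x")
```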
That's the tension at the heart of the issue: Memory is essential to making AI more powerful.

As context windows stretch into the hundreds of thousands or millions of tokens, the financial overhead rises just as sharply. Scaling context is both a technical challenge and an economic one, and relying on ever-larger windows quickly becomes an unsustainable strategy for long-term memory.

Fixes like summarization and retrieval-augmented generation (RAG) aren’t silver bullets either. Summaries inevitably strip away subtle but important details, and traditional RAG, while strong on static documents, tends to break down when information stretches across multiple sessions or evolves over time. Even newer variants, such as agentic RAG and RAG 2.0 (which perform better in steering the retrieval process), still inherit the same foundational flaw of treating retrieval as the solution, rather than treating memory itself as the core problem.

Compilers solved this problem decades ago

If memory is the real bottleneck, and retrieval can’t fix it, then the gap needs a different kind of solution. That’s the bet behind GAM. Instead of pretending retrieval is memory, GAM keeps a full, lossless record and layers smart, on-demand recall on top of it, resurfacing the exact details an agent needs even as conversations twist and evolve.

A useful way to understand GAM is through a familiar idea from software engineering: Just-in-time (JIT) compilation. Rather than precomputing a rigid, heavily compressed memory, GAM keeps things light by storing a minimal set of cues alongside a full, untouched archive of raw history. Then, when a request arrives, it “compiles” a tailored context on the fly.

This JIT approach is built into GAM’s dual architecture, allowing AI to carry context across long conversations without overcompressing or guessing too early about what matters. The result is the right information, delivered at exactly the right moment.

Inside GAM: A two-agent system built for memory that endures

GAM revolves around the simple idea of separating the act of remembering from the act of recalling, which maps onto two components: The 'memorizer' and the 'researcher.'

The memorizer: Total recall without overload

The memorizer captures every exchange in full, quietly turning each interaction into a concise memo while preserving the complete session in a searchable page store. It doesn’t compress aggressively or guess what is important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval and generates optional lightweight summaries for quick scanning. Critically, every detail is preserved, and nothing is thrown away.

The researcher: A deep retrieval engine

When the agent needs to act, the researcher takes the helm to plan a search strategy, combining embeddings with keyword methods like BM25, navigating through page IDs and stitching the pieces together. It conducts layered searches across the page store, blending vector retrieval, keyword matching and direct lookups. It evaluates findings, identifies gaps and continues searching until it has sufficient evidence to produce a confident answer, much like a human analyst reviewing old notes and primary documents. It iterates, searches, integrates and reflects until it builds a clean, task-specific briefing.

GAM’s power comes from this JIT memory pipeline, which assembles rich, task-specific context on demand instead of leaning on brittle, precomputed summaries.
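The memorizer/researcher split is easy to sketch at a toy scale. The snippet below is an illustrative Python mock-up, not the authors' code: it stores full "pages" losslessly and scores them with a simple keyword-overlap ranking, a stand-in for the hybrid BM25-plus-embedding retrieval the paper describes.

```python
# Toy sketch of a lossless page store plus keyword-scored recall (not the GAM authors' code).
# Real systems would add embeddings and BM25; this only illustrates the memorizer/researcher split.
import math
from collections import Counter

class PageStore:
    """Memorizer side: keep every interaction verbatim as a 'page' plus a short memo."""
    def __init__(self):
        self.pages: list[dict] = []

    def memorize(self, text: str, memo: str) -> int:
        self.pages.append({"id": len(self.pages), "text": text, "memo": memo})
        return len(self.pages) - 1   # page ID for later direct lookup

class Researcher:
    """Researcher side: score pages against the query and assemble a small briefing."""
    def __init__(self, store: PageStore):
        self.store = store

    def search(self, query: str, top_k: int = 1) -> list[dict]:
        q = Counter(query.lower().split())
        def score(page: dict) -> float:
            words = Counter(page["text"].lower().split())
            # simple term-overlap score with a length penalty; a stand-in for BM25/embeddings
            return sum(q[w] * words[w] for w in q) / math.sqrt(len(words) + 1)
        return sorted(self.store.pages, key=score, reverse=True)[:top_k]

store = PageStore()
store.memorize("User said the deployment target is eu-west-1 and budget is $500.", "deployment prefs")
store.memorize("Discussed favorite lunch spots near the office.", "small talk")
researcher = Researcher(store)
print([p["memo"] for p in researcher.search("which deployment target did the user pick")])
```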
GAM's core innovation is simple yet powerful: It preserves all information intact and makes every detail recoverable.

Ablation studies support this approach: Traditional memory fails on its own, and naive retrieval isn’t enough. It’s the pairing of a complete archive with an active, iterative research engine that enables GAM to surface details that other systems leave behind.

Outperforming RAG and long-context models

To test GAM, the researchers pitted it against standard RAG pipelines and models with enlarged context windows such as GPT-4o-mini and Qwen2.5-14B. They evaluated GAM using four major long-context and memory-intensive benchmarks, each chosen to test a different aspect of the system’s capabilities:

- LoCoMo measures an agent’s ability to maintain and recall information across long, multi-session conversations, encompassing single-hop, multi-hop, temporal reasoning and open-domain tasks.
- HotpotQA, a widely used multi-hop QA benchmark built from Wikipedia, was adapted using MemAgent’s memory-stress-test version, which mixes relevant documents with distractors to create contexts of 56K, 224K and 448K tokens — ideal for testing how well GAM handles noisy, sprawling input.
- RULER evaluates retrieval accuracy, multi-hop state tracking, aggregation over long sequences and QA performance under a 128K-token context to further probe long-horizon reasoning.
- NarrativeQA is a benchmark where each question must be answered using the full text of a book or movie script; the researchers sampled 300 examples with an average context size of 87K tokens.

Together, these datasets and benchmarks allowed the team to assess both GAM’s ability to preserve detailed historical information and its effectiveness in supporting complex downstream reasoning tasks.

GAM came out ahead across all benchmarks. Its biggest win was on RULER, which benchmarks long-range state tracking. Notably:

- GAM exceeded 90% accuracy.
- RAG collapsed because key details were lost in summaries.
- Long-context models faltered as older information effectively “faded” even when technically present.

Clearly, bigger context windows aren’t the answer. GAM works because it retrieves with precision rather than piling up tokens.

GAM, context engineering and competing approaches

Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the right information can always be retrieved, even far downstream. The technique’s emergence coincides with the current, broader shift in AI towards context engineering, or the practice of shaping everything an AI model sees — its instructions, history, retrieved documents, tools, preferences and output formats.

Context engineering has rapidly eclipsed prompt engineering in importance, although other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing memory as images. Another group of Chinese researchers has proposed “semantic operating systems” built around lifelong adaptive memory.

However, GAM’s philosophy is distinct: Avoid loss and retrieve with intelligence. Instead of guessing what will matter later, it keeps everything and uses a dedicated research engine to find the relevant pieces at runtime.
For agents handling multi-day projects, ongoing workflows or long-term relationships, that reliability may prove essential.

Why GAM matters for the long haul

Just as adding more compute doesn’t automatically produce better algorithms, expanding context windows alone won’t solve AI’s long-term memory problems. Meaningful progress requires rethinking the underlying system, and GAM takes that approach. Instead of depending on ever-larger models, massive context windows or endlessly refined prompts, it treats memory as an engineering challenge — one that benefits from structure rather than brute force.

As AI agents transition from clever demos to mission-critical tools, their ability to remember long histories becomes crucial for developing dependable, intelligent systems. Enterprises require AI agents that can track evolving tasks, maintain continuity and recall past interactions with precision and accuracy. GAM offers a practical path toward that future, signaling what may be the next major frontier in AI: Not bigger models, but smarter memory systems and the context architectures that make them possible.
Model providers want to prove the security and robustness of their models, releasing system cards and conducting red-team exercises with each new release. But it can be difficult for enterprises to parse the results, which vary widely and can be misleading. Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 60-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic's system card discloses its reliance on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. OpenAI's reporting, by contrast, leans on single-attempt jailbreak resistance. Both metrics are valid. Neither tells the whole story.

Security leaders deploying AI agents for browsing, code execution and autonomous action need to know what each red team evaluation actually measures, and where the blind spots are.

What the attack data shows

Gray Swan's Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story.

Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at one hundred. In computer use with extended thinking, it held at 0% ASR even after 200 attempts. It's the first model to saturate the benchmark.

Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use.

Opus 4.5 delivers roughly 7x improvement in coding resistance and complete resistance in computer use. This illustrates that the gap between model tiers within the same family is now larger than some cross-vendor comparisons. For security teams, that's a procurement factor that didn't exist six months ago.

On the OpenAI side, the o1 system card reports Gray Swan Shade results of 6% ASR for harmful text and 5% for malicious code, all based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% within two weeks of patching.

The report Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the pattern. With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.

Two ways to catch deception

Anthropic monitors approximately 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.

Meanwhile, OpenAI's approach relies on chain-of-thought (CoT) monitoring. Their deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% compared to o3. The method assumes visible reasoning reflects internal processing. Anthropic's interpretability work directly questions that assumption.

When models game the test

In Apollo Research's evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it believed it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.

Meanwhile, Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to less than 10% internally. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview).
METR's GPT-5 evaluation noted the model "can identify the exact evaluation it is in on some occasions" and "seems to use this information to get a better score mostly."

If a model behaves differently when it detects evaluation conditions, its production behavior becomes hard to predict at scale. This is the core problem with evaluation awareness. Models that recognize they're being tested attempt to game the test. Anthropic's 60% reduction in evaluation awareness from Opus 4.1 to 4.5 reflects targeted engineering against this vector.

Red teaming on prompt injection defenses shows similar divergence.

Anthropic's testing reports 96% prevention in tool use scenarios, rising to 99.4% with additional safeguards like prompt shields. Fujitsu's independent assessment found GPT-5 vulnerable to approximately 20% of tested injection attacks. OpenAI's o3/o4-mini system card reports instruction hierarchy scores of 0.86 for system-versus-developer conflicts. These are different metrics and different attack methodologies, but the gap suggests Anthropic's agentic use cases have been hardened against adversarial inputs embedded in tool outputs.

Comparing red teaming results

Here's how Opus 4.5 stacks up against OpenAI's model family across 16 dimensions.

| Dimension | Claude Opus 4.5 | GPT-5 | o3 / o4-mini | o1 |
| --- | --- | --- | --- | --- |
| System card length | 153 pages | 55 pages | ~30 pages | ~40 pages |
| Attack methodology | 200-attempt RL campaigns | Single-attempt + iterative patching | Single-attempt | Single-attempt |
| ASR @ 1 attempt (coding) | 4.7% | 89% raw (pre-patch) | Not published | 5% |
| ASR @ 100 attempts | 63.0% | Not published | Not published | Not published |
| ASR @ 200 attempts (computer use) | 0% (saturates benchmark) | Not published | Not published | Not published |
| Prompt injection defense | 96% prevention; 99.4% with safeguards | ~20% ASR (Fujitsu) | 0.86 instruction hierarchy | Higher ASR than GPT-4o |
| Interpretability | ~10M neural features monitored | CoT monitoring | CoT monitoring | CoT monitoring |
| Deception detection | Internal feature activation | 2.1% CoT flagged | 0.95 not_unsafe | 0.17% flagged; 99% deny |
| Evaluation awareness | <10% | Identifies exact eval (METR) | Not published | 37% alignment faking |
| Scheming (Apollo) | Very low sabotage propensity in internal evaluations, orders of magnitude lower than o1 | Not published | Lower than o1; still deceptive | 5% turn off oversight; 2% self-exfil |
| CBRN risk | Below ASL-4 | Medium (approaching High) | Medium | Medium (first to reach) |
| Governance | FRT → AST → RSO/CEO | SAG + Preparedness v2 | SAG + Preparedness v2 | SAG + Preparedness |
| External partners | Gray Swan, UK AISI, US CAISI, METR | UK AISI, US AISI, METR, Apollo | UK AISI, US AISI, METR, Apollo, Pattern Labs | UK AISI, US AISI, METR, Apollo |
| Reward hacking | Yes (impossible tasks) | Yes (METR) | ~1% of tasks; 5/24 kernels | Yes |
| Misalignment reduction | 60% vs. Opus 4.1 | Not published | Lower scheming than o1 | Baseline |

Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research

Why these differences matter

Enterprises have several factors to consider when analyzing model evals. These include:

Attack persistence thresholds

Anthropic's 200-attempt RL campaigns simulate adversaries with time, resources, and adaptive tooling. In RL-based adversarial testing, the attacker model learns from each failure, adjusts its approach and systematically probes for weaknesses. It's training an AI to break another AI.

Gray Swan's Shade platform doesn't just retry the same prompt. It learns from failures, adjusts attack vectors and exploits patterns in model responses.
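One way to see why attempt count matters: if attacks were independent retries, you could extrapolate multi-attempt ASR from the single-attempt number alone. The published curves don't follow that math, which is why adaptive, multi-attempt campaigns are measured directly. The Python sketch below makes the naive independence assumption explicit; it is an illustration, not either vendor's methodology.

```python
# Naive multi-attempt ASR under an independence assumption (illustrative only;
# not Anthropic's or OpenAI's methodology).
def naive_asr_at_k(single_attempt_asr: float, attempts: int) -> float:
    """P(at least one success in k attempts) if each attempt were an independent retry."""
    return 1.0 - (1.0 - single_attempt_asr) ** attempts

single = 0.047  # Opus 4.5 coding, 1 attempt (reported)
for k, reported in [(10, 0.336), (100, 0.630)]:
    print(f"k={k}: independent-retry estimate {naive_asr_at_k(single, k):.1%}, reported {reported:.1%}")
# The estimates (~38% and ~99%) diverge from the reported 33.6% and 63.0%:
# adaptive campaigns don't behave like coin flips, so the curve has to be measured, not extrapolated.
```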
The 4.7% to 63.0% degradation curve in coding environments shows how resistance erodes under sustained pressure. OpenAI's single-attempt metrics answer a different question: How often does a naive attacker succeed on the first try? Both matter. A phishing campaign hits millions of users once each. A nation-state actor runs thousands of variations against a single target. Security teams need to know which threat model their vendor tested against.

Detection architecture

Anthropic monitors approximately 10 million internal neural features during evaluation. These are abstract concepts like deception, sycophancy and power-seeking that activate across the model's hidden layers. When a fraud/deception feature lights up during benign roleplay, that's a signal invisible to output analysis. The model's internal state reveals associations its outputs don't.

OpenAI takes a different approach with CoT monitoring. Their deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy. The assumption is that reasoning traces reflect internal processing. Anthropic's interpretability research directly challenges this. Models can produce clean reasoning chains while internal features indicate something else entirely.

Neither approach is complete. CoT monitoring scales more easily. Feature monitoring catches what CoT misses but requires deep model access and interpretability infrastructure that most enterprises can't replicate.

Scheming evaluation design

Apollo Research's methodology puts models in scenarios where misaligned behavior offers a strategic advantage: Shutdown threats, goal conflicts and oversight detection. o1 attempted to turn off oversight mechanisms 5% of the time when believing shutdown was imminent. It tried to copy itself to prevent replacement in 2% of cases. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.

But the evaluation environments differ. Anthropic's stress tests focus on whether models attempt sabotage when given clear opportunities. OpenAI's tests via Apollo probe instrumental reasoning, specifically whether models fake alignment during evaluation, then defect when oversight drops. The 37% alignment-faking rate in o1 represents a different failure mode than sabotage attempts.

The comparability problem

These methodological differences make direct comparison difficult. Opus 4.5's 0% ASR at 200 attempts in computer use versus o1's 5% malicious code rate are not measuring the same thing. The attack surfaces, attempt counts and success criteria differ.

The matrix reveals what each vendor chose to measure and publish. That itself signals priorities. Anthropic's 153-page system card emphasizes degradation curves and internal feature monitoring. OpenAI emphasizes iterative improvement and CoT transparency.

Bottom line: Enterprise security teams inheriting these models inherit these measurement philosophies. The question isn't which number is better. It's which measurement matches your deployment's actual threat landscape.

What independent red team evaluators found

Independent red team evaluations, for their part, often operate with entirely different methods. This can tell a very different story and reveal additional model characteristics that enterprises must consider. METR's red team evaluation measured autonomous capabilities using a time horizon score, which tracks how long a task can run before the agent fails half the time.
o3 sustained reliable performance for approximately 1 hour and 30 minutes. o4-mini held for 1 hour 15 minutes. METR also detected reward hacking in roughly 1% of o3's attempts, including 5 of 24 kernel optimization experiments where the model tampered with scoring functions to inflate its own results.

Apollo Research tested o3 and o4-mini for in-context scheming. Both exhibited deceptive tendencies such as sandbagging evaluations and sabotaging tasks with plausible deniability, but scored lower than o1. Apollo assesses that o3 and o4-mini are unlikely to cause catastrophic harm due to scheming, but more minor real-world harms remain possible without monitoring.

The UK AISI/Gray Swan challenge ran 1.8 million attacks across 22 models. Every model broke. ASR ranged from 1.47% to 6.49%. Opus 4.5 placed first on Gray Swan's Agent Red Teaming benchmark with 4.7% ASR versus GPT-5.1 at 21.9% and Gemini 3 Pro at 12.5%.

No current frontier system resists determined, well-resourced attacks. The differentiation lies in how quickly defenses degrade and at what attempt threshold. Opus 4.5's advantage compounds over repeated attempts. Single-attempt metrics flatten the curve.

What to ask your vendor

Security teams evaluating frontier AI models need specific answers, starting with ASR at 50 and 200 attempts rather than single-attempt metrics alone. Find out whether they detect deception through output analysis or internal state monitoring. Know who challenges red team conclusions before deployment and what specific failure modes they've documented. Get the evaluation awareness rate. Vendors claiming complete safety haven't stress-tested adequately.

The bottom line

Diverse red-team methodologies demonstrate that every frontier model breaks under sustained attack. The 153-page system card versus the 55-page system card isn't just about documentation length. It's a signal of what each vendor chose to measure, stress-test, and disclose.

For persistent adversaries, Anthropic's degradation curves show exactly where resistance fails. For fast-moving threats requiring rapid patches, OpenAI's iterative improvement data matters more. For agentic deployments with browsing, code execution and autonomous action, the scheming metrics become your primary risk indicator.

Security leaders need to stop asking which model is safer. Start asking which evaluation methodology matches the threats your deployment will actually face. The system cards are public. The data is there. Use it.
Presented by Oracle NetSuite

When Evan Goldberg started NetSuite in 1998, his vision was radically simple: give entrepreneurs access to their business data anytime, anywhere. At the time, most enterprise software lived on local servers. As an entrepreneur himself, Goldberg understood the frustration intimately. "I had fragmented systems. They all said something different," he recalls of his early days. NetSuite was the first company to deliver enterprise applications entirely through web browsers, combining CRM, ERP, and ecommerce into one unified platform. That breakthrough idea pioneered the cloud computing and software-as-a-service (SaaS) era and propelled supersonic growth, a 2007 IPO, and an acquisition by Oracle in 2016.

Still innovating at the leading edge

That founding obsession — turning scattered data into accessible, coherent, actionable intelligence — is driving NetSuite as it reshapes the next generation of enterprise software.

At SuiteWorld 2025 last month, the Austin-based firm unveiled NetSuite Next. Goldberg calls it "the biggest product evolution in the company's history." The reason? While NetSuite has embedded AI capabilities into workflows for years, he explains, Next represents a quantum leap — contextual, conversational, agentic, composable AI becoming an extension of operations, not separate tools.

AI woven into everyday business operations

Most enterprise AI today gets bolted on through APIs and conversational interfaces. NetSuite Next operates differently. Intelligence runs deep in workflows instead of sitting on the surface. It autonomously reconciles accounts, optimizes payment timing, predicts cash crunches, and surfaces its reasoning at every step. It doesn't just advise on business processes — it executes them, transparently, within human-defined guardrails.

"We built NetSuite for entrepreneurs so that they could get great information about their business," Goldberg explains. "I think the next step is to be able to get deeper insights and analysis without being an expert in analytics. AI turns out to be a really good data scientist."

This architectural divergence reflects competing philosophies about enterprise technology adoption. Microsoft and SAP have pursued rapid deployment through add-on assistants. NetSuite's five-year development cycle for Next represents a more fundamental reimagining: making AI an everyday tool woven into business operations, not a separate application requiring constant context-switching.

AI echoes and deepens cloud innovation

Goldberg sees a clear through line connecting today's AI adoption and the cloud computing era he pioneered. "There’s sort of an infinite sense of possibility that exists in the technology world," he says. "Everybody is thinking about how they can leverage this, how they're going to get involved."

When NetSuite was starting, he continues, "We had to come to customers with the cloud and say, 'This won't disrupt your operations. It's going to make them better.'" Today, evangelizing enterprise leaders on advanced AI requires a similar approach — demonstrating immediate value while minimizing implementation risk. For NetSuite, continuous innovation around maximizing customer data for growth is an undeniable theme that connects both eras.

New transformative capabilities

NetSuite's latest AI capabilities span business operations, while blurring (in a good way) the lines between human and machine intervention:

Context-aware intelligence. Ask Oracle adapts responses based on user role, current workflow, and business context.
A CFO requesting point-of-sale data receives financial analytics. A warehouse manager asking the same question sees inventory insights.

Collaborative workflow design. AI Canvas functions as a scenario-planning workspace where business users articulate processes in natural language. A finance director can describe approval hierarchies for capital expenditures — "For amounts over $50,000, I need department head approval, then CFO sign-off" — which the system translates into executable workflows with appropriate controls and audit trails.

Governed autonomous operations. Autonomous workflows operate within defined parameters, reconciling accounts, generating payment runs, predicting cash flow. When the system recommends accelerating payment to a supplier, it shows which factors influenced the decision — transparent logic users can accept, modify, or override.

Open AI architecture. Built to support Model Context Protocol, NetSuite AI Connector Service enables enterprises to integrate external large language models while supporting governance.

Critically, NetSuite adds AI capabilities at no additional cost — embedded directly into workflows employees already use daily.

Security and privacy from Oracle infrastructure

Built-in AI requires robust infrastructure that bolt-on approaches sidestep. Here, according to NetSuite, tight integration within Oracle technology provides operational and competitive advantages, especially security and compliance peace of mind. Engineers say that's because NetSuite is supported by Oracle's complete stack. From database to applications to analytics, the system optimizes decisions using data from multiple sources in real time.

"That's why I started NetSuite. I couldn't get the data I wanted," Goldberg reflects. "That's one of the most differentiated aspects of NetSuite. When you're doing your financial close, and you're thinking about what reserves you're going to take, you can look at your sales data, because that's also there in NetSuite. With NetSuite Next, AI can also help you make those kinds of decisions."

And performance improves with use. As the platform learns from millions of transactions across thousands of customers, its embedded intelligence improves in ways that bolt-on assistants operating adjacent to core systems cannot match.

NetSuite's customer base demonstrates this scalability advantage — from startups that became global enterprises, including Reddit, Shopify, and DoorDash, to promising newcomers like BERO, a brewer of non-alcoholic beer founded by actor Tom Holland, Chomps meat snacks, PetLab, and Kieser Australia. The unified platform grows with businesses rather than requiring migration as they scale.

Keeping fire in the belly after three decades

How does a nearly 30-year-old company maintain innovative capacity, particularly as part of a mammoth corporate ecosystem? Goldberg credits the parent company's culture of continuous reinvention.

"I don't know if you've heard about this guy Larry Ellison," he smiles. "He manages to seemingly reinvent himself whenever one of these technology revolutions comes along. That hunger, that curiosity, that desire to make things constantly better imbues all of Oracle."

For Goldberg, the single biggest challenge facing NetSuite customers centers on integration complexity and trust.
NetSuite Next addresses this by embedding AI within existing workflows rather than requiring separate systems.In addition, updates to SuiteCloud Platform — an extensibility and customization environment — help organizations adapt NetSuite to their unique business needs. Built on open standards, it lets enterprises mix and match AI models for different functions. SuiteAgent frameworks enable partners to build specialized automation directly into NetSuite. AI Studios give administrators control over how AI operates within specific industry needs."This takes NetSuite's flexibility to a new level," Goldberg says, enabling customers and partners to "quickly and easily build AI agents, connect external AI assistants, and orchestrate AI processes."“AI execution fabric” delivers measurable business impact Industry analysts increasingly argue that embedded AI features deliver superior results compared to add-on models. Futurum Group sees NetSuite Next as an "AI execution fabric" rather than a conversational layer — intelligence that runs deep in workflows instead of sitting on the surface.For midmarket enterprises navigating talent shortages, complex compliance frameworks, and competition from digital-native companies, the distinction between advice and execution matters economically.Built-in AI doesn't just inform better decisions. It makes those decisions, transparently and autonomously, within human-defined guardrails. For enterprises making ERP decisions today, the choice carries long-term implications. Bolt-on AI can deliver immediate value for information access and basic automation. But built-in AI promises to transform operations with intelligence permeating every transaction and workflow.NetSuite Next begins rolling out to North American customers next year.Why 2026 will belong to the AI-first businessThe bet underlying NetSuite Next: Enterprises reimagining ERP operations around embedded intelligence will outperform those just adding bolt-on conversational assistance to existing systems. Early cloud computing adopters, Goldberg notes, gained competitive advantages that compounded over time. The same logic appears likely to apply to AI-first platforms. Simplicity and ease of use are two big advantages. "You don't have to dig through lots of menus and understand all of the analytics capabilities," Goldberg says. "It will quickly bring up an analysis for you, and then you can converse in natural language to hone in on what you think is most important."The tools now think alongside users and take intelligently informed action. For midmarket and entrepreneurial companies, where the gap between having information and acting on it can be the difference between growth and failure, that kind of autonomous execution may determine which enterprises thrive in an AI-first era.Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Researchers at Nvidia and the University of Hong Kong have released Orchestrator, an 8-billion-parameter model that coordinates different tools and large language models (LLMs) to solve complex problems. In their experiments, Orchestrator achieved higher accuracy at a lower cost than much larger models on tool-use benchmarks, while also aligning with user preferences about which tools to use for a given query.

The model was trained through ToolOrchestra, a new reinforcement learning (RL) framework for training small models to act as intelligent coordinators. The approach is based on the idea that a small "orchestrator" managing a diverse team of specialized models and tools can be more effective and efficient than a single, monolithic AI system. The findings suggest that this composite approach could pave the way for more practical and scalable AI reasoning systems in the enterprise.

The limits of current LLM tool use

Giving LLMs access to external tools is a promising way to extend their capabilities beyond their training data and into agentic tasks. By calling on resources like search engines and code interpreters, AI agents can improve their accuracy and perform in-app tasks.

However, in the accompanying paper, the researchers argue that the current approach to building tool-using agents doesn't harness the full potential of this paradigm. Most systems equip a single, powerful model with a set of basic tools like web search or a calculator. They argue that humans, when reasoning, "routinely extend themselves by calling upon resources of greater-than-human intelligence, from domain experts to sophisticated processes and software systems." Accordingly, LLMs should be able to interact with a wide range of tools in different capacities.

The tool orchestration paradigm

The paper proposes a shift from a single-model system to a composite one, managed by a lightweight "orchestrator" model. The orchestrator's job is to analyze a complex task, break it down, and invoke the right tools in the right order to arrive at a solution.

This toolset includes not only standard utilities like web search and code interpreters, but also other LLMs of various capabilities that function as "intelligent tools." For example, the orchestrator can delegate a quantitative question to a math-focused model or a programming challenge to a code-generation model. Instead of placing the entire cognitive load on one large, generalist model, the orchestrator delegates narrowed-down sub-problems to specialized intelligent tools.

Based on this concept, the researchers developed ToolOrchestra, a method that uses RL to train a small language model to act as an orchestrator. The model learns when and how to call upon other models and tools, and how to combine their outputs in multi-turn reasoning. The tools are defined in a simple JSON format specifying their name, description, and parameters.

The RL training process is guided by a reward system designed to produce a cost-effective and controllable agent. The reward balances three objectives: the correctness of the final answer, efficiency in cost and latency, and alignment with user preferences. For example, the system is penalized for excessive compute usage and rewarded for choosing tools that a user has marked as preferred, such as favoring an open-source model over a proprietary API for privacy reasons.
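The paper describes JSON tool definitions and a blended reward but does not reproduce them in this article, so the following is only a minimal, hypothetical Python sketch of that structure. The tool entries, function names, and weight values are invented placeholders, not values from ToolOrchestra.

```python
# Hypothetical sketch only -- not the actual ToolOrchestra implementation.

# Tools are described by name, description, and parameters (JSON-style),
# as the paper outlines; these two entries are invented examples.
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web and return top snippets.",
        "parameters": {"query": "string"},
    },
    {
        "name": "math_llm",
        "description": "Delegate a quantitative sub-problem to a math-focused model.",
        "parameters": {"problem": "string"},
    },
]

def orchestration_reward(correct: bool,
                         cost_usd: float,
                         latency_s: float,
                         used_preferred_tools: bool,
                         w_cost: float = 0.1,
                         w_latency: float = 0.01,
                         w_pref: float = 0.2) -> float:
    """Toy reward balancing the three objectives described above:
    answer correctness, cost/latency efficiency, and user tool preferences.
    The weights are arbitrary placeholders, not values from the paper."""
    reward = 1.0 if correct else 0.0
    reward -= w_cost * cost_usd        # penalize expensive tool calls
    reward -= w_latency * latency_s    # penalize slow trajectories
    if used_preferred_tools:
        reward += w_pref               # bonus for honoring user preferences
    return reward

# Example: a correct answer that cost $0.03, took 4.2 seconds, and used preferred tools.
print(orchestration_reward(True, 0.03, 4.2, True))  # ~1.155 with the placeholder weights
```

The key property this toy reward shares with the one described in the paper is that accuracy alone is not enough: a trajectory that reaches the right answer cheaply, quickly, and with the user's preferred tools scores higher than one that reaches it expensively.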
To support this training, the team also developed an automatic data pipeline that generated thousands of verifiable training examples across 10 different domains.

A small model with big results

Using ToolOrchestra, the researchers trained Orchestrator, an 8-billion-parameter model based on Qwen3-8B. They evaluated its performance on three challenging benchmarks: Humanity's Last Exam (HLE), FRAMES and Tau2-Bench. It was compared against several baselines, including large, off-the-shelf LLMs both with and without tools.

The results showed that even powerful models struggled without tools, confirming their necessity for complex reasoning. While adding tools improved performance for large models, it often came with a steep increase in cost and latency. By contrast, the 8B Orchestrator delivered impressive results. On HLE, a benchmark of PhD-level questions, Orchestrator substantially outperformed prior methods at a fraction of the computational cost. On the Tau2-Bench function-calling test, it effectively scheduled different tools, calling a large model like GPT-5 in only about 40% of the steps and using cheaper options for the rest, while still beating an agent that used the large model for every step.

The researchers noted that the RL-trained Orchestrator adapted its strategy to new challenges, showing a "high degree of general reasoning ability." Crucially for enterprise applications, Orchestrator also generalized well to models and pricing structures it hadn't seen during training. This flexibility makes the framework suitable for businesses that rely on a mix of public, private and bespoke AI models and tools. The lower cost, higher speed and customizability make it a practical approach for building sophisticated AI agents that can scale.

As businesses look to deploy more advanced AI agents, this orchestration approach offers a path toward systems that are not only more intelligent but more economical and controllable. (The model weights are currently available under a non-commercial license, but Nvidia has also released the training code under the permissive Apache 2.0 license.)

As the paper concludes, the future may lie in even more advanced versions of this concept: "Looking ahead, we envision more sophisticated recursive orchestrator systems to push the upper bound of intelligence [and] also to further enhance efficiency in solving increasingly complex agentic tasks."
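As a rough illustration of the orchestration loop described above, here is a minimal, hypothetical Python sketch: a policy picks a tool (here just a cheap or an expensive stand-in model) at each step and stops once it has an answer. Every name and routing rule below is invented for illustration; in the real system the routing decision is made by the RL-trained 8B Orchestrator, not a hand-written heuristic.

```python
# Hypothetical sketch of a tool-orchestration loop -- not Nvidia's code.
from typing import Callable

def cheap_llm(prompt: str) -> str:      # stand-in for a small, inexpensive model
    return f"[cheap model answer to: {prompt}]"

def expensive_llm(prompt: str) -> str:  # stand-in for a large, costly model
    return f"[frontier model answer to: {prompt}]"

TOOL_REGISTRY: dict[str, Callable[[str], str]] = {
    "cheap_llm": cheap_llm,
    "expensive_llm": expensive_llm,
}

def orchestrator_policy(task: str, history: list[str]) -> tuple[str, str]:
    """Stub policy: route 'hard-looking' tasks to the expensive model,
    everything else to the cheap one, then finish after one tool call.
    In ToolOrchestra this decision is what the RL-trained orchestrator learns."""
    if history:                          # we already have a tool result -> stop
        return ("finish", history[-1])
    tool = "expensive_llm" if "prove" in task.lower() else "cheap_llm"
    return (tool, task)

def run(task: str, max_steps: int = 4) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        action, payload = orchestrator_policy(task, history)
        if action == "finish":
            return payload
        history.append(TOOL_REGISTRY[action](payload))
    return history[-1] if history else ""

print(run("Summarize the quarterly sales figures."))
# -> [cheap model answer to: Summarize the quarterly sales figures.]
```

The control flow is the only point of the sketch; in the paper, learning when the cheaper route is good enough is exactly what produces the reported savings, such as calling the large model in only about 40% of Tau2-Bench steps.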
Stressed Over How to Develop Young Employees? Mentorship Called the Key (HR Daily Advisor): No matter how hard employers work to find qualified employees, they're likely to come up short if those making up the candidate pool lack confidence in their abilities. And recent research shows that many young people aren't secure in their ability to navigate the job market. What can help? Recent data shows mentorship can be
Context Matters: Balancing Business Interests and Political Speech (HR Daily Advisor): Political speech in the workplace has continued to be a hot topic since the 2024 elections. Political speech can be defined as an expression of political views, affiliation, or activities within a professional setting. One recent poll showed that a quarter of businesses have disciplined employees for their political speech. A recent case from the
Building a Crisis-Ready Culture That Protects Your Employees and Brand (HR Daily Advisor): Major natural disasters often devastate businesses or send them into a chaotic scramble mode. The results can be job losses or altered roles for employees, disruption for customers, and lengthy closures for repairs. Such difficult circumstances raise the stakes considerably for how affected brands, their leaders, and human resources teams respond to the crisis. The
From Managers to Magnets: How to Make Inspiring Leaders (HR Daily Advisor): Let's be real: bad managers don't just frustrate employees; they drive your best talent out the door. The difference between a manager that people tolerate and a leader that people love comes down to skills HR must develop right now. Want to turn your managers into magnetic leaders who inspire loyalty, performance, and trust? All
Best of Recruiting 2025 (HR Daily Advisor): 2025 has been a year of shakeups and constant changes, especially for recruiters. The roles of associates to executives continue to shift and expand, and the advent of AI has really challenged HR professionals to rethink how and why they hire. Luckily, HR Daily Advisor has been there every step of the way, keeping you
Best of DEI 2025 (HR Daily Advisor): This year, the conversation around Diversity, Equity, Inclusion, and Belonging (DEIB) hit unprecedented turbulence. We saw pushback—not just from some organizations, but across various industries, even the federal government. Despite these setbacks, fostering a resilient, inclusive environment is more critical than ever. The businesses that embrace these values—and understand them as a strategic driver, not
BYOD to Court? Mitigate Risks of Your 'Bring Your Own Device' Practice (HR Daily Advisor): Cell phones are a quintessential tool in modern society, including within the realm of employment. Many employers use various data networks that allow employees to access and store the employer's data on their own personal cell phones or other personal devices under "bring your own device" (BYOD) practices. Allowing employees access to employer data from
According to some of the latest statistics, employers are swamped by job applications. In the U.K. alone, employers running graduate training schemes received an average of 140 applications for each job in 2024, 59% more than in 2023, according to the Institute of Student Employers. And despite some trepidations amongst a few recruiters, plenty of
AI is spamming up job applications. Jason Koebler from 404 Media writes about a person who claimed to have used a free tool, AI Hawk, to apply for 17 jobs in an hour — only stopping when they’d reached 2,843. The tool automatically entered the person’s bio, generated résumés, wrote customized cover letters, and checked
The company says it's refocusing and prioritizing fewer initiatives that will have the biggest impact on customers and add value to the business.