The Pragmatic Engineer · 37:39

Uber: Leading engineering through an agentic shift - The Pragmatic Summit

TL;DR

  • Seventy percent of workloads developers pushed into Uber's agentic system were toil tasks (library upgrades, migrations, dead code cleanup), because the well-defined start and end states of these tasks produced much higher accuracy, creating a virtuous adoption cycle.

  • Uber built Minion, an internal background agent platform running on its own CI infrastructure, rather than relying solely on external vendors' cloud-hosted agents -- giving them faster iteration, full access to internal services via MCP, and control over security and configuration.

  • Code review is becoming the new bottleneck as AI-generated code volume surges. Uber addresses this with Code Inbox (smart assignment, risk profiling, notification batching) and U-Review (multi-layer review grading that filters low-value bot comments and deduplicates across systems).

  • AutoCover, a specialized test-generation agent built on LangChain, produces roughly 5,000 merged unit tests per month at nearly 3x the quality of generic agent output, with a separated critic engine that also validates human-written tests.

  • The most effective adoption strategy was not top-down mandates but peer-driven promotion -- engineers sharing wins with other engineers -- because developers trust other developers more than leadership directives.

  • Measuring real business impact remains unsolved: activity metrics like diff velocity and developer NPS are at record highs, but connecting those to revenue is the current frontier, and AI costs have grown at least 6x since 2024, requiring smarter model routing and infrastructure-level cost optimization.

The Breakdown

Anshu, who leads Uber's developer platform organization, and Tai, a principal engineer who has been a driving force behind Uber's AI strategy, present a detailed account of how Uber is navigating its agentic shift. Anshu opens by noting that while AI is not new to Uber -- the fares platform and matching algorithms have used AI methodologies for years -- the integration of AI into the engineering productivity lifecycle is a recent development. CEO Dara Khosrowshahi has named AI one of Uber's six strategic shifts, framing the goal as transforming Uber from an early-AI-powered company into a generative-AI-powered one. The guiding philosophy is not to replace engineers but to make them "superhumans" by offloading toil so they can focus on creative, business-growing work.

Anshu traces the evolution from pair programming to peer programming. In the 2022-2023 era of GitHub Copilot, Uber saw a roughly 10-15% bump in diff velocity from synchronous tab completion and IDE chat. The real inflection came when models became capable enough to run asynchronously on delegated tasks, enabling developers to act as their own tech leads directing multiple AI agents. The capabilities curve is illustrated by a logarithmic diagram showing agent execution times growing from under one second to hours-long autonomous runs. When Uber made its agentic workflows available to developers, 70% of workloads pushed into the system were toil tasks -- library upgrades, dead code cleanup, documentation, migrations -- because accuracy on well-defined start-and-end-state tasks was significantly higher, creating a virtuous adoption cycle.

Tai then walks through the infrastructure stack. At the foundation sits Uber's Michelangelo ML platform, which provides a model gateway for proxying frontier models from OpenAI and Anthropic, along with traditional inference and training capabilities. On top of this, Uber has invested heavily in MCP (Model Context Protocol) deployment. A tiger team from across the company designed a central MCP gateway that proxies both external and internal MCPs, handling authorization, telemetry, and logging, with a registry and sandbox for discovery and experimentation. The Michelangelo platform also offers agent-building capabilities through both SDKs and no-code solutions, with telemetry, tracing, and a registry so agents can be discovered and reused across the organization.

To manage the proliferation of agent clients -- Claude Code, Codex, Cursor, and others -- Uber built a CLI tool called AIFX that provisions, configures, and updates agent clients, installs MCPs from the registry, deploys standard configuration management, and connects to background task infrastructure. Tai then introduces Minion, Uber's formal background agent platform. Minion runs on Uber's CI infrastructure with monorepos pre-checked-out, handles network access to internal services, connects to MCP servers through AIFX, and is accessible through a web interface, Slack, GitHub PRs, a CLI, and APIs. A live demo shows a real bug report being pasted into Minion as a prompt, with a built-in prompt improver flagging low-quality prompts. Seven minutes later, the system produces a completed PR co-authored by the minion bot and the developer who kicked it off.
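The prompt improver mentioned in the demo gates task quality before any compute is spent. Uber's version is presumably model-based; a toy heuristic sketch (rules and wording entirely assumed) conveys the idea:

```python
def grade_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Hypothetical pre-flight check for a background-agent prompt:
    flag prompts too thin for an agent to act on autonomously."""
    issues = []
    if len(prompt.split()) < 10:
        issues.append("too short: add context about the affected service")
    if not any(w in prompt.lower() for w in ("expected", "should", "fix", "reproduce")):
        issues.append("no success criterion: state the expected behaviour")
    return (not issues, issues)

ok, hints = grade_prompt("fix the login bug")   # flagged: too short
```

A gate like this matters more for asynchronous agents than for chat: a vague prompt costs seven minutes of CI time and a throwaway PR rather than one bad chat turn.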

As more code is generated by agents, developers are spending more time on planning and code review. Uber addresses this with two products. Code Inbox is a unified inbox for PRs that filters noise, shows only actionable items, and uses smart assignment algorithms factoring in code ownership, compliance requirements, time zones, calendar availability, and focus time. It batches Slack notifications, handles automatic reassignment and escalation with strict SLOs, and analyzes the risk profile of each change based on blast radius, service criticality, and surface area. U-Review is a code review assistance platform built in-house because Uber's internal context -- including its ongoing migration from Phabricator to GitHub -- demanded a controlled surface area. U-Review uses a pre-processor with a plugin system for defect detection, best practices, and MCP-sourced context, then passes results through a review grader that filters low-value comments, deduplicates across systems, and categorizes findings. Each layer runs models selected for their performance on that specific task. Over the course of a year, the system raised both the quality and the volume of comments while maintaining a strong ratio of comments actually addressed by developers.
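The review grader's filter-then-deduplicate step can be sketched as a small pipeline. The comment schema, score field, and dedup key below are assumptions for illustration; U-Review's grader is an LLM-based system, not a keyword filter:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewComment:
    source: str          # which bot/system produced the finding
    file: str
    line: int
    text: str
    value_score: float   # hypothetical grader output in [0, 1]

def grade_and_filter(comments: list[ReviewComment],
                     min_score: float = 0.6) -> list[ReviewComment]:
    """Drop low-value comments, then deduplicate near-identical findings
    reported by multiple systems, keeping the highest-scoring copy."""
    kept: dict[tuple, ReviewComment] = {}
    for c in sorted(comments, key=lambda c: -c.value_score):
        if c.value_score < min_score:
            continue
        key = (c.file, c.line, c.text.strip().lower())
        kept.setdefault(key, c)   # first seen is the highest-scoring
    return list(kept.values())

comments = [
    ReviewComment("u-review", "pricing.py", 10, "Possible nil deref", 0.9),
    ReviewComment("lint-bot", "pricing.py", 10, "possible nil deref ", 0.7),
    ReviewComment("style-bot", "pricing.py", 12, "nit: spacing", 0.2),
]
kept = grade_and_filter(comments)
```

The point of the design is that the filtering happens before the developer sees anything, which is what protects the addressed-comment ratio as bot volume grows.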

For test generation, Uber built AutoCover, a custom agent on top of an internal LangX SDK (built on LangChain) that generates roughly 5,000 unit tests merged per month across the company at nearly 3x the quality of generic agent-generated tests. AutoCover includes a critic engine that evaluates test quality, and this validator has been separated into an independent tool developers can use for any test, human-written or AI-generated. For large-scale code maintenance, Uber created AutoMigrate, a program with four pillars: problem identification (risk assessment, PR decomposition), code transformation (agents or deterministic tools like OpenRewrite), validation (CI, unit tests, staging/production signal), and campaign management via a platform called Shephard. Shephard tracks migration PRs through a web UI, generates and refreshes PRs on defined cadences, notifies reviewers, and integrates with Code Inbox. Examples shown include using OpenRewrite to migrate Java services to Java 21 and using Minion to generate performance-fix PRs identified by internal analysis tools.

Anshu closes with a handful of non-technical challenges. First, the technology landscape changes constantly, requiring willingness to abandon internal investments -- he notes that a forthcoming Cursor test-coverage feature might make AutoCover obsolete, and that is acceptable as long as Uber delivers impact. Second, adoption has been slower than expected despite the technology working well, because developers are being asked to work in fundamentally new ways. Top-down mandates moved metrics somewhat, but the most effective tactic has been peer-driven wins -- engineers sharing successful examples with other engineers. Third, measurement remains hard: developer experience scores and NPS are at all-time highs, and the gap between power users (20+ active days per month on agent tools) and casual users has widened dramatically since the agentic system launched, but these are activity metrics, not business outcomes. The CFO wants revenue impact, so Uber is instrumenting the full feature pipeline from design to production experiment launch. Finally, costs have grown at least 6x since 2024, driven by GPU and memory expenses. Uber is responding by routing planning tasks to more capable models and execution tasks to cheaper ones, and by letting infrastructure make model selection decisions to reduce developer friction while optimizing spend.
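The closing cost tactic -- expensive models for planning, cheap models for execution, chosen by infrastructure rather than the developer -- reduces to a routing table. Model names and cost figures below are placeholders, not Uber's actual deployment:

```python
# Hypothetical cost-aware model router: the platform, not the developer,
# picks the model based on the phase of the agentic task.
MODEL_TABLE = {
    # phase: (model name, assumed relative cost per 1K tokens)
    "planning":  ("frontier-large", 15.0),   # high-stakes reasoning
    "execution": ("workhorse-small", 1.0),   # mechanical edits, retries
}

def route(task_phase: str) -> str:
    """Default unknown phases to the cheap tier; only phases explicitly
    marked as planning pay the frontier-model premium."""
    model, _cost = MODEL_TABLE.get(task_phase, MODEL_TABLE["execution"])
    return model
```

Defaulting to the cheap tier matters because toil tasks -- the 70% majority -- are mostly execution, so the premium model is the exception rather than the rule.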