CareerRiver

Research Scientist, Agentic Data & Benchmarking

Institute of Foundation Models · San Francisco Bay Area

📍 Sunnyvale, CA · 💰 $150,000-$450,000 · via Lever · Posted 2026-06-08
About the Institute of Foundation Models

The Institute of Foundation Models (IFM) is a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you'll work at the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You'll help build groundbreaking AI systems with the potential to reshape entire industries, and contribute to establishing MBZUAI as a global hub for high-performance computing and deep learning.

About the role

The Agents team trains advanced agentic language models that use reasoning and tool use to complete real tasks on a computer. This is a specialist role at the center of the loop that drives those models: the data we train on and the benchmarks we measure against.

You'll own the agentic data pipeline end-to-end (sourcing and generating high-quality trajectories, tool-use data, and RL environments) and the evaluation suite that tells us, rigorously and reproducibly, what our agents can actually do. These two halves are inseparable: benchmarks expose where models fail, and targeted data closes the gap. The agents are only as good as the data they learn from and the evals that keep us honest, and this role owns both.

This is a research scientist position for someone who wants depth in data and measurement rather than breadth across the whole stack. You should be the kind of person who reads through datasets line by line, distrusts a metric until it's been validated, and gets satisfaction from making an eval suite that nobody questions.
