CORE: Computational Reproducibility Agent Benchmark

Benchmarking agents for reproducing results of published scientific papers


Our contributions

Benchmark. We introduce CORE, a benchmark that assesses agents on their ability to reproduce the results of published scientific papers. Agents are given the codebase of a paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper.
Harness. We provide a harness that allows agent developers to easily evaluate their agents on CORE. The harness creates a virtual machine for each agent-task pair, runs the agent on the machine, and downloads the results, enabling parallelizable evaluations in isolated environments with standardized hardware (a minimal sketch of this loop appears after this list).
Agents. We build and evaluate generalist and specialized agents based on AutoGPT. Our findings show that specialized agents far outperform the generalist agents despite requiring relatively minimal development effort. Our best agent achieves just 22% accuracy on the hardest tasks on CORE, showing much room for improvement.
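
The harness workflow can be pictured as a provision-run-collect loop over agent-task pairs. The sketch below is a minimal illustration, not the harness's actual API: `run_pair` is a hypothetical callable standing in for "create a VM for this pair, run the agent on it, and download its report".

```python
# Minimal sketch of the harness's evaluation loop, not its actual API.
# `run_pair` is a hypothetical callable standing in for: provision a VM for
# the pair, run the agent on it, and download the resulting report.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def evaluate_all(
    agents: list[str],
    tasks: list[str],
    run_pair: Callable[[str, str], dict],
    workers: int = 8,
) -> list[dict]:
    """Evaluate every agent-task pair in parallel, one isolated VM per pair."""
    pairs = [(agent, task) for agent in agents for task in tasks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pair runs in its own environment on standardized hardware,
        # so evaluations cannot interfere with one another.
        return list(pool.map(lambda pair: run_pair(*pair), pairs))
```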

How Can Agents Help Computational Reproducibility?


Computational reproducibility, the ability to reproduce the results of a scientific study using the data and code provided by its authors, is fundamental to scientific research. Yet, there are severe shortcomings in the state of computational reproducibility across multiple fields including computer science, social sciences, mathematical sciences, physics, and more. We propose that agents can be used to automate the process of reproducing scientific papers and to help improve the state of computational reproducibility. We introduce a benchmark, CORE, that evaluates agents on their ability to reproduce the results of published scientific papers, with the goal of improving scientific norms and spurring the development of agents that can assist with scientific research.


Figure showing a visual overview of our benchmark.

Tasks in the benchmark require an agent to reproduce the results of a research paper given its repository and to answer questions about the output of the code. The agent must install libraries, packages, and dependencies and run the code. If the code runs successfully, the agent searches through all outputs to answer the task questions. The agent then submits a report, which is evaluated against the results of a successful reproduction. An agent completes a task only if it answers every question about the code repository correctly.
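
Concretely, scoring is all-or-nothing at the task level. A minimal sketch, assuming the report and gold answers are dictionaries keyed by question ID and compared exactly (the benchmark's actual comparison rules may differ):

```python
def task_passed(report: dict[str, str], gold: dict[str, str]) -> bool:
    """A task counts as completed only if every question is answered correctly."""
    return all(report.get(question) == answer for question, answer in gold.items())


# One wrong or missing answer fails the entire task.
gold = {"q1": "0.87", "q2": "increasing"}
print(task_passed({"q1": "0.87", "q2": "increasing"}, gold))  # True
print(task_passed({"q1": "0.87"}, gold))                      # False
```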

The benchmark consists of 273 tasks based on 91 papers from the platform codeocean.com. The tasks are divided into three difficulty levels: CORE-Retrieve, CORE-Easy, and CORE-Hard:

  • CORE-Retrieve: The agent is given the output of the code and must answer questions about the output without running any code. To answer questions, agents must navigate through the terminal output as well as files and figures generated by the code.
  • CORE-Easy: The agent is given a Dockerfile and instructions on how to use it to fully reproduce the paper. This level mainly evaluates an agent's ability to use and interact with the terminal. The agent must then answer questions about the output of the code, as in the previous level.
  • CORE-Hard: The agent is given the codebase of the paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. This level is most akin to fully reproducing a paper and is the most realistic and challenging level (the sketch after this list contrasts what each level provides to the agent).
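
The three levels differ only in what the agent is given at the start of a task. A hypothetical sketch of that difference (the field names are illustrative, not the benchmark's actual schema):

```python
# Illustrative sketch of what each difficulty level hands to the agent.
# Field names are hypothetical, not the benchmark's real task schema.
def starting_materials(level: str, task: dict) -> dict:
    questions = task["questions"]
    if level == "CORE-Retrieve":
        # The code has already been run; only its outputs are provided.
        return {"code_outputs": task["code_outputs"], "questions": questions}
    if level == "CORE-Easy":
        # Repository plus a working Dockerfile and reproduction instructions.
        return {"repo": task["repo"], "dockerfile": task["dockerfile"],
                "instructions": task["instructions"], "questions": questions}
    if level == "CORE-Hard":
        # Repository only; the agent must build the environment itself.
        return {"repo": task["repo"], "questions": questions}
    raise ValueError(f"unknown difficulty level: {level}")
```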

We evaluated two agents on CORE: AutoGPT and the CORE-Agent, which was built by modifying AutoGPT for the benchmark:

  • AutoGPT: A generalist agent designed to perform well in many domains. We added a tool to AutoGPT that allowed the agent to query a vision language model to analyze images. We included this tool in the generalist agent because we did not consider analyzing images to be a task-specific modification, since it could be beneficial in many domains.
  • CORE-Agent: An agent specialized for the CORE benchmark. The agent programmatically checks that its answers have been submitted before returning and contains prompting hints specific to each difficulty level that address common pitfalls observed in the generalist agent (a sketch of the submission check follows this list). These adaptations required only a few days of work, yet they dramatically improved agent performance.
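
As an illustration of the kind of guard CORE-Agent adds, the sketch below blocks the agent's finish step until a report answering every question has actually been written to disk. The report path and JSON format are hypothetical, not CORE-Agent's actual implementation.

```python
# Hypothetical sketch of a "did I actually submit my answers?" guard; the
# report path and JSON format are illustrative, not CORE-Agent's real code.
import json
from pathlib import Path


def ready_to_finish(report_path: Path, required_questions: list[str]) -> bool:
    """Only allow the agent to terminate once every question has a non-empty
    answer recorded in the report file."""
    if not report_path.exists():
        return False  # nothing has been submitted yet
    try:
        report = json.loads(report_path.read_text())
    except json.JSONDecodeError:
        return False  # malformed report; keep working
    return all(report.get(q) not in (None, "") for q in required_questions)
```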

Accuracy (pass@1) of CORE-Agent and AutoGPT agents with gpt-4o-2024-05-13 and gpt-4o-mini-2024-07-18 by task difficulty on the test set.
Agent Architecture   LLM Model     Retrieve Accuracy   Easy Accuracy   Hard Accuracy   Overall Accuracy
CORE-Agent           gpt-4o        57.78%              57.78%          22.22%          45.93%
CORE-Agent           gpt-4o-mini   44.44%              26.67%          15.56%          28.89%
AutoGPT              gpt-4o        35.6%               37.8%           6.7%            26.7%
AutoGPT              gpt-4o-mini   8.9%                2.2%            2.2%            4.43%
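
The overall accuracy column is consistent with an unweighted mean of the three difficulty levels, which is what an equal number of test tasks per level would produce; a quick check for the CORE-Agent with gpt-4o row:

```python
# Sanity check: overall accuracy matches the mean of the three level accuracies.
retrieve, easy, hard = 57.78, 57.78, 22.22   # CORE-Agent with gpt-4o
overall = (retrieve + easy + hard) / 3
print(round(overall, 2))  # 45.93, matching the table
```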

Figure showing the Pareto frontier of cost vs. accuracy for Retrieval tasks on the test set. CORE-Agent with GPT-4o is the top-performing agent.

Figure showing the Pareto frontier of cost vs. accuracy for Hard tasks on the test set. CORE-Agent with GPT-4o is the top-performing agent.


As shown in the table above, CORE-Agent with gpt-4o is the best-performing agent, scoring 57.78% on the two easiest difficulty levels but only 22.22% on the hardest, leaving much room for improvement. Agents powered by gpt-4o-mini, while less accurate, are much cheaper to run. The Pareto frontiers for the Retrieval and Hard tasks are shown above.

We present CORE, a benchmark that assesses agents on their ability to reproduce the results of published scientific papers. AI agents that can reproduce research effectively could drastically reduce the human labor required to read, understand, and run code to assess computational reproducibility. Our baseline results show that simple task-specific modifications to existing general-purpose agents can substantially increase accuracy, yet our best agent reaches a test-set accuracy of only 22.22% on the hardest tasks, leaving much room for improvement. We hope CORE will stimulate the development of agents that can reduce the time and effort required for burdensome yet routine scientific activities.

We thank...

* corresponding author

Name Affiliation Email
Zachary S. Siegel* Princeton University zss@princeton.edu
Sayash Kapoor Princeton University sayashk@princeton.edu
Nitya Nadgir Princeton University nn7887@princeton.edu
Benedikt Stroebl Princeton University stroebl@princeton.edu
Arvind Narayanan Princeton University arvindn@cs.princeton.edu
BibTeX