Computational reproducibility, the ability to reproduce the results of a scientific study using the data and code provided by its authors, is fundamental to scientific research. Yet the state of computational reproducibility has severe shortcomings across many fields, including computer science, the social sciences, the mathematical sciences, and physics. We propose that AI agents can automate the process of reproducing scientific papers and help improve the state of computational reproducibility. We introduce CORE, a benchmark that evaluates agents on their ability to reproduce the results of published scientific papers, with the goal of improving scientific norms and spurring the development of agents that can assist with scientific research.
Tasks in the benchmark require an agent to reproduce the results of a research paper given its code repository and to answer questions about the code's outputs. The agent must install the necessary libraries, packages, and dependencies and run the code. If the code runs successfully, the agent searches through all outputs to answer the task questions, then submits a report, which is evaluated against the results of a known successful reproduction. An agent completes a task successfully only if it answers every question about the code repository correctly.
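As a concrete illustration of this all-or-nothing grading rule, the short sketch below grades a hypothetical agent report against a task's expected answers. The `Task` fields, answer format, and exact-match comparison are illustrative assumptions, not CORE's actual schema or scoring code.

```python
# Illustrative sketch of the task format and the all-or-nothing grading rule
# described above. Field names and the exact-match comparison are hypothetical,
# not CORE's actual schema or scoring code.
from dataclasses import dataclass


@dataclass
class Task:
    repo_url: str               # code repository the agent must reproduce
    questions: dict[str, str]   # question id -> answer from a known successful run


def grade_report(task: Task, report: dict[str, str]) -> bool:
    """A task counts as solved only if every question is answered correctly."""
    return all(
        report.get(qid, "").strip() == expected.strip()
        for qid, expected in task.questions.items()
    )


# Example: a report that misses one question fails the whole task.
task = Task(
    repo_url="https://example.org/paper-repo",
    questions={"q1": "0.87", "q2": "table_3.csv"},
)
print(grade_report(task, {"q1": "0.87", "q2": "results.csv"}))  # False
```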
The benchmark consists of 273 tasks based on 91 papers from the platform codeocean.com. The tasks are divided into three difficulty levels: CORE-Retrieve, CORE-Easy, and CORE-Hard.
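For a concrete picture of this composition, the sketch below shows one way 91 papers at three difficulty levels could yield 273 tasks (three per paper); the record fields are hypothetical, not the benchmark's actual format.

```python
# Illustrative only: one way 91 papers at three difficulty levels could yield
# 273 tasks. The record structure is a hypothetical sketch, not CORE's format.
LEVELS = ["CORE-Retrieve", "CORE-Easy", "CORE-Hard"]

papers = [f"paper-{i:03d}" for i in range(1, 92)]   # 91 papers
tasks = [{"paper_id": p, "level": lvl} for p in papers for lvl in LEVELS]

assert len(tasks) == 273   # 91 papers x 3 levels
```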
We evaluated two agents on CORE: AutoGPT and the CORE-Agent, which was built by modifying AutoGPT for the benchmark:
| Agent Architecture | LLM Model | Retrieve Accuracy | Easy Accuracy | Hard Accuracy | Overall Accuracy |
|---|---|---|---|---|---|
| CORE-Agent | gpt-4o | 57.78% | 57.78% | 22.22% | 45.93% |
| CORE-Agent | gpt-4o-mini | 44.44% | 26.67% | 15.56% | 28.89% |
| AutoGPT | gpt-4o | 35.6% | 37.8% | 6.7% | 26.7% |
| AutoGPT | gpt-4o-mini | 8.9% | 2.2% | 2.2% | 4.43% |
Figure: Pareto frontier of cost versus accuracy on CORE-Retrieve tasks in the test set; CORE-Agent with gpt-4o is the top-performing agent.
Figure: Pareto frontier of cost versus accuracy on CORE-Hard tasks in the test set; CORE-Agent with gpt-4o is the top-performing agent.
As the table above shows, CORE-Agent with gpt-4o is the best-performing agent, scoring 57.78% on both the CORE-Retrieve and CORE-Easy tasks but only 22.22% on the CORE-Hard tasks, leaving much room for improvement. Agents powered by gpt-4o-mini are less accurate but much cheaper to run. The Pareto frontiers for the Retrieve and Hard tasks are shown in the figures above.
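As a brief illustration of what these frontiers capture, the sketch below computes a cost-versus-accuracy Pareto frontier: a run is on the frontier if no other run is at least as cheap and at least as accurate (and strictly better in one of the two). The accuracies are the Hard-task numbers from the table; the dollar costs are placeholder values, not measured costs.

```python
# Sketch of the cost-vs-accuracy Pareto frontier: a run is on the frontier if
# no other run is at least as cheap and at least as accurate (and strictly
# better in one of the two). Accuracies below are the Hard-task numbers from
# the table; the costs are placeholders, not measured values.

def pareto_frontier(runs: list[dict]) -> list[dict]:
    """Return the runs that are not dominated by any other run."""
    frontier = []
    for r in runs:
        dominated = any(
            o["cost"] <= r["cost"]
            and o["accuracy"] >= r["accuracy"]
            and (o["cost"] < r["cost"] or o["accuracy"] > r["accuracy"])
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda run: run["cost"])


runs = [
    {"agent": "CORE-Agent (gpt-4o)",      "cost": 2.00, "accuracy": 0.2222},  # placeholder cost
    {"agent": "CORE-Agent (gpt-4o-mini)", "cost": 0.20, "accuracy": 0.1556},  # placeholder cost
    {"agent": "AutoGPT (gpt-4o)",         "cost": 1.50, "accuracy": 0.067},   # placeholder cost
]
print([r["agent"] for r in pareto_frontier(runs)])
# ['CORE-Agent (gpt-4o-mini)', 'CORE-Agent (gpt-4o)'] -- AutoGPT is dominated
```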
We present CORE, a benchmark that assesses agents on their ability to reproduce the results of published scientific papers. AI agents that can reproduce research effectively could drastically reduce the human labor required to read, understand, and run code to assess computational reproducibility. Our baseline results show that simple task-specific modifications to existing general-purpose agents can increase accuracy, yet our best agent achieves a test-set accuracy of only 22.22% on the hardest tasks, leaving much room for improvement. We hope CORE will stimulate the development of agents that can reduce the time and effort required for burdensome yet routine scientific activities.
We thank...