CORE: Computational Reproducibility Agent Benchmark

Benchmarking agents for reproducing results of published scientific papers


Our contributions

Benchmark. We introduce CORE, a benchmark that assesses agents on their ability to reproduce the results of published scientific papers. Agents are given the codebase of a paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper.
Harness. We provide a harness that allows agent developers to easily evaluate their agents on CORE. The harness creates a virtual machine for each agent-task pair, runs the agent on the machine, and downloads the results, enabling parallelizable evaluations in isolated environments with standardized hardware (a minimal sketch of this loop appears after this list).
Agents. We build and evaluate generalist and specialized agents based on AutoGPT. Our findings show that specialized agents far outperform the generalist agents despite requiring relatively minimal development effort. Our best agent achieves just 22% accuracy on the hardest tasks on CORE, showing much room for improvement.
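
The harness workflow can be pictured as a provision-run-collect loop over agent-task pairs. The sketch below is a minimal illustration, not the harness's actual API: `run_pair` is a hypothetical callable standing in for "create a VM for this pair, run the agent on it, and download its report".

```python
# Minimal sketch of the harness's evaluation loop, not its actual API.
# `run_pair` is a hypothetical callable standing in for: provision a VM for
# the pair, run the agent on it, and download the resulting report.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def evaluate_all(
    agents: list[str],
    tasks: list[str],
    run_pair: Callable[[str, str], dict],
    workers: int = 8,
) -> list[dict]:
    """Evaluate every agent-task pair in parallel, one isolated VM per pair."""
    pairs = [(agent, task) for agent in agents for task in tasks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pair runs in its own environment on standardized hardware,
        # so evaluations cannot interfere with one another.
        return list(pool.map(lambda pair: run_pair(*pair), pairs))
```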

How Can Agents Help Computational Reproducibility?


Computational reproducibility, the ability to reproduce the results of a scientific study using the data and code provided by its authors, is fundamental to scientific research. Yet, there are severe shortcomings in the state of computational reproducibility across multiple fields including computer science, social sciences, mathematical sciences, physics, and more. We propose that agents can be used to automate the process of reproducing scientific papers and to help improve the state of computational reproducibility. We introduce a benchmark, CORE, that evaluates agents on their ability to reproduce the results of published scientific papers, with the goal of improving scientific norms and spurring the development of agents that can assist with scientific research.


Figure showing a visual overview of our benchmark.

Tasks in the benchmark require an agent to reproduce the results of a research paper given its repository and to answer questions about the output of the code. The agent must install libraries, packages, and dependencies and run the code. If the code runs successfully, the agent searches through all outputs to answer the task questions. The agent then submits a report, which is evaluated against the results of a successful reproduction. An agent completes a task only if it answers every question about the code repository correctly.
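
Concretely, scoring is all-or-nothing at the task level. A minimal sketch, assuming the report and gold answers are dictionaries keyed by question ID and compared exactly (the benchmark's actual comparison rules may differ):

```python
def task_passed(report: dict[str, str], gold: dict[str, str]) -> bool:
    """A task counts as completed only if every question is answered correctly."""
    return all(report.get(question) == answer for question, answer in gold.items())


# One wrong or missing answer fails the entire task.
gold = {"q1": "0.87", "q2": "increasing"}
print(task_passed({"q1": "0.87", "q2": "increasing"}, gold))  # True
print(task_passed({"q1": "0.87"}, gold))                      # False
```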

The benchmark consists of 273 tasks based on 91 papers from the platform codeocean.com. The tasks are divided into three difficulty levels: CORE-Retrieve, CORE-Easy, and CORE-Hard:

  • CORE-Retrieve: The agent is given the output of the code and must answer questions about the output without running any code. To answer questions, agents must navigate through the terminal output as well as files and figures generated by the code.
  • CORE-Easy: The agent is given a Dockerfile and instructions on how to use it to fully reproduce the paper. This level mainly evaluates an agent's ability to use and interact with the terminal. The agent must then answer questions about the output of the code, as in the previous level.
  • CORE-Hard: The agent is given the codebase of the paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. This level is most akin to fully reproducing a paper and is the most realistic and challenging level (the sketch after this list contrasts what each level provides to the agent).
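
The three levels differ only in what the agent is given at the start of a task. A hypothetical sketch of that difference (the field names are illustrative, not the benchmark's actual schema):

```python
# Illustrative sketch of what each difficulty level hands to the agent.
# Field names are hypothetical, not the benchmark's real task schema.
def starting_materials(level: str, task: dict) -> dict:
    questions = task["questions"]
    if level == "CORE-Retrieve":
        # The code has already been run; only its outputs are provided.
        return {"code_outputs": task["code_outputs"], "questions": questions}
    if level == "CORE-Easy":
        # Repository plus a working Dockerfile and reproduction instructions.
        return {"repo": task["repo"], "dockerfile": task["dockerfile"],
                "instructions": task["instructions"], "questions": questions}
    if level == "CORE-Hard":
        # Repository only; the agent must build the environment itself.
        return {"repo": task["repo"], "questions": questions}
    raise ValueError(f"unknown difficulty level: {level}")
```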

We evaluated two agents on CORE: AutoGPT and the CORE-Agent, which was built by modifying AutoGPT for the benchmark:

  • AutoGPT: A generalist agent designed to perform well in many domains. We added a tool to AutoGPT that allowed the agent to query a vision language model to analyze images. We included this tool in the generalist agent because we did not consider analyzing images to be a task-specific modification, since it could be beneficial in many domains.
  • CORE-Agent: An agent specialized for the CORE benchmark. The agent programmatically checks that its answers have been submitted before returning and contains prompting hints specific to each difficulty level that address common pitfalls observed in the generalist agent (a sketch of the submission check follows this list). These adaptations required only a few days of work, yet they dramatically improved agent performance.
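
As an illustration of the kind of guard CORE-Agent adds, the sketch below blocks the agent's finish step until a report answering every question has actually been written to disk. The report path and JSON format are hypothetical, not CORE-Agent's actual implementation.

```python
# Hypothetical sketch of a "did I actually submit my answers?" guard; the
# report path and JSON format are illustrative, not CORE-Agent's real code.
import json
from pathlib import Path


def ready_to_finish(report_path: Path, required_questions: list[str]) -> bool:
    """Only allow the agent to terminate once every question has a non-empty
    answer recorded in the report file."""
    if not report_path.exists():
        return False  # nothing has been submitted yet
    try:
        report = json.loads(report_path.read_text())
    except json.JSONDecodeError:
        return False  # malformed report; keep working
    return all(report.get(q) not in (None, "") for q in required_questions)
```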

Accuracy (pass@1) of CORE-Agent and AutoGPT agents with gpt-4o-2024-05-13 and gpt-4o-mini-2024-07-18 by task difficulty on the test set.
Agent Architecture   LLM Model     Retrieve Accuracy   Easy Accuracy   Hard Accuracy   Overall Accuracy
CORE-Agent           gpt-4o        57.78%              57.78%          22.22%          45.93%
CORE-Agent           gpt-4o-mini   44.44%              26.67%          15.56%          28.89%
AutoGPT              gpt-4o        35.6%               37.8%           6.7%            26.7%
AutoGPT              gpt-4o-mini   8.9%                2.2%            2.2%            4.43%
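
The overall accuracy column is consistent with an unweighted mean of the three difficulty levels, which is what an equal number of test tasks per level would produce; a quick check for the CORE-Agent with gpt-4o row:

```python
# Sanity check: overall accuracy matches the mean of the three level accuracies.
retrieve, easy, hard = 57.78, 57.78, 22.22   # CORE-Agent with gpt-4o
overall = (retrieve + easy + hard) / 3
print(round(overall, 2))  # 45.93, matching the table
```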

Figure showing the Pareto frontier of cost vs. accuracy for Retrieval tasks on the test set. CORE-Agent with GPT-4o is the top-performing agent.

Figure showing the Pareto frontier of cost vs. accuracy for Hard tasks on the test set. CORE-Agent with GPT-4o is the top-performing agent.


As shown in the table above, CORE-Agent with gpt-4o is the best-performing agent, scoring 57.78% on the two easiest difficulty levels but only 22.22% on the hardest, leaving much room for improvement. Agents powered by gpt-4o-mini, while less accurate, are much cheaper to run. The Pareto frontiers for the Retrieval and Hard tasks are shown above.

We present CORE, a benchmark that assesses agents on their ability to reproduce the results of published scientific papers. AI agents that can reproduce research effectively could drastically reduce the human labor required to read, understand, and run code to assess computational reproducibility. Our baseline results show that simple task-specific modifications to existing general-purpose agents can substantially increase accuracy, yet our best agent reaches a test-set accuracy of only 22.22% on the hardest tasks, leaving much room for improvement. We hope CORE will stimulate the development of agents that can reduce the time and effort required for burdensome yet routine scientific activities.

We thank...

* corresponding author

Name Affiliation Email
Zachary S. Siegel* Princeton University zss@princeton.edu
Sayash Kapoor Princeton University sayashk@princeton.edu
Nitya Nadgir Princeton University nn7887@princeton.edu
Benedikt Stroebl Princeton University stroebl@princeton.edu
Arvind Narayanan Princeton University arvindn@cs.princeton.edu
BibTeX