Computational reproducibility, the ability to reproduce the results of a scientific study using the data and code provided by its authors, is fundamental to scientific research. Yet the state of computational reproducibility has severe shortcomings across many fields, including computer science, the social sciences, the mathematical sciences, and physics. We propose that agents can be used to automate the process of reproducing scientific papers and to help improve the state of computational reproducibility. We introduce a benchmark, CORE, that evaluates agents on their ability to reproduce the results of published scientific papers, with the goal of improving scientific norms and spurring the development of agents that can assist with scientific research.
The benchmark consists of 273 tasks based on 91 papers from the platform codeocean.com. The tasks are divided into three difficulty levels: CORE-Retrieve, CORE-Easy, and CORE-Hard.
We evaluated two agents on CORE: AutoGPT and the CORE-Agent, which was built by modifying AutoGPT for the benchmark:
| Agent Architecture | LLM Model | Retrieve Accuracy | Easy Accuracy | Hard Accuracy | Overall Accuracy |
|---|---|---|---|---|---|
| CORE-Agent | gpt-4o | 57.78% | 57.78% | 22.22% | 45.93% |
| CORE-Agent | gpt-4o-mini | 44.44% | 26.67% | 15.56% | 28.89% |
| AutoGPT | gpt-4o | 35.6% | 37.8% | 6.7% | 26.7% |
| AutoGPT | gpt-4o-mini | 8.9% | 2.2% | 2.2% | 4.43% |
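The Overall Accuracy column is consistent with an unweighted mean of the three per-split accuracies, which is what we would expect if the three difficulty levels are equally sized (91 of the 273 tasks each, an assumption here rather than a stated fact). A minimal check:

```python
# Quick check that the Overall Accuracy column equals the unweighted mean of the
# three per-split accuracies (Retrieve, Easy, Hard), assuming equally sized splits.
results = {
    ("CORE-Agent", "gpt-4o"):      (57.78, 57.78, 22.22),
    ("CORE-Agent", "gpt-4o-mini"): (44.44, 26.67, 15.56),
    ("AutoGPT",    "gpt-4o"):      (35.6,  37.8,  6.7),
    ("AutoGPT",    "gpt-4o-mini"): (8.9,   2.2,   2.2),
}

for (agent, model), splits in results.items():
    overall = sum(splits) / len(splits)
    print(f"{agent} ({model}): overall = {overall:.2f}%")
# Prints 45.93, 28.89, 26.70, and 4.43, matching the table's Overall column.
```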
As shown in the above table, CORE-Agent with gpt-4o is the best-performing agent, scoring 57.78% on the easiest tasks (CORE-Easy) but only 22.22% on the hardest tasks (CORE-Hard), showing much room for improvement. Agents powered by gpt-4o-mini, while less accurate, are much cheaper to run. The cost/accuracy Pareto frontiers for the retrieval and hard tasks are shown above.
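As a rough illustration of how such a cost/accuracy Pareto frontier can be computed, the sketch below marks an agent configuration as Pareto-optimal when no other configuration is both at most as expensive and at least as accurate. The per-run costs are placeholders for illustration only, not benchmark numbers; only the accuracies come from the table above.

```python
# Sketch: computing a cost/accuracy Pareto frontier over agent configurations.
# ASSUMPTION: the per-run costs below are hypothetical placeholders, not benchmark data.
agents = {
    "CORE-Agent (gpt-4o)":      {"cost": 10.0, "accuracy": 45.93},  # cost is hypothetical
    "CORE-Agent (gpt-4o-mini)": {"cost": 1.0,  "accuracy": 28.89},  # cost is hypothetical
    "AutoGPT (gpt-4o)":         {"cost": 8.0,  "accuracy": 26.7},   # cost is hypothetical
    "AutoGPT (gpt-4o-mini)":    {"cost": 0.8,  "accuracy": 4.43},   # cost is hypothetical
}

def pareto_frontier(agents):
    """Keep configurations not dominated by any other: dominated means another
    configuration is no more expensive and no less accurate, and strictly better
    in at least one of the two dimensions."""
    frontier = {}
    for name, p in agents.items():
        dominated = any(
            q["cost"] <= p["cost"] and q["accuracy"] >= p["accuracy"]
            and (q["cost"] < p["cost"] or q["accuracy"] > p["accuracy"])
            for other, q in agents.items() if other != name
        )
        if not dominated:
            frontier[name] = p
    return frontier

for name in pareto_frontier(agents):
    print(name)  # with these placeholder costs, AutoGPT (gpt-4o) is dominated
```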
We present CORE, a benchmark that assesses agents on their ability to reproduce the results of published scientific papers. AI agents that can reproduce research effectively could drastically reduce the human labor required to read, understand, and run code to assess computational reproducibility. Our baseline results show that simple task-specific modifications to existing general-purpose agents can increase accuracy, yet our best agent achieves a test-set accuracy of only 22.22% on the hardest tasks, showing much room for improvement. We hope CORE will stimulate the development of agents that can reduce the time and effort required for burdensome yet routine scientific activities.
We thank...