
Commit d41d8b4

reran notebook

1 parent cad75b0

File tree

1 file changed (+40 −5 lines)

docs/66_arxiv_agent/simplifying_agentic_workflows.ipynb

Lines changed: 40 additions & 5 deletions
@@ -71,7 +71,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
    "id": "48233b8e-269e-4fab-9b47-16be4a1dd664",
    "metadata": {},
    "outputs": [
@@ -80,7 +80,10 @@
      "output_type": "stream",
      "text": [
       "read_arxiv_paper(https://arxiv.org/abs/2211.11501)\n",
-      "read_arxiv_paper(https://arxiv.org/abs/2308.16458)\n"
+      "read_arxiv_paper(https://arxiv.org/abs/2308.16458)\n",
+      "read_arxiv_paper(https://arxiv.org/abs/2411.07781)\n",
+      "read_arxiv_paper(https://arxiv.org/abs/2408.13204)\n",
+      "read_arxiv_paper(https://arxiv.org/abs/2406.15877)\n"
      ]
     }
    ],
@@ -107,7 +110,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 5,
    "id": "944bbd32-af7e-4045-9f19-9337ba1df415",
    "metadata": {},
    "outputs": [],
@@ -136,10 +139,42 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 6,
    "id": "e530f509-f3da-472b-8970-f4fe56160a8b",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/markdown": [
+       "# Abstract\n",
+       "The field of code generation has witnessed significant advancements with the advent of large language models (LLMs). However, the development of reliable and comprehensive benchmarks to evaluate the capabilities of these models is crucial for further progress. This review manuscript discusses recent developments in code generation benchmarks, highlighting their key features, evaluation methodologies, and findings. We summarize the main contributions of five notable benchmarks: DS-1000, BioCoder, RedCode, DOMAINEVAL, and BigCodeBench, and discuss their implications for the future of code generation research.\n",
+       "\n",
+       "# Introduction\n",
+       "Code generation has become an increasingly important area of research, with potential applications in software development, data analysis, and other fields. The development of large language models (LLMs) has driven significant progress in this area, enabling the generation of high-quality code for a variety of tasks. However, the evaluation of these models requires reliable and comprehensive benchmarks that can assess their capabilities and identify areas for improvement. In this review, we discuss recent developments in code generation benchmarks, focusing on their design, evaluation methodologies, and key findings.\n",
+       "\n",
+       "## Recent Developments in Code Generation Benchmarks\n",
+       "Several recent benchmarks have been proposed to evaluate the capabilities of LLMs in code generation. [DS-1000](https://ds1000-code-gen.github.io) is a benchmark that focuses on data science code generation, featuring a thousand problems spanning seven Python libraries. This benchmark incorporates multi-criteria metrics to evaluate the correctness and reliability of generated code, achieving a high level of accuracy. In contrast, [BioCoder](https://github.com/gersteinlab/biocoder) targets bioinformatics code generation, covering a wide range of topics and incorporating a fuzz-testing framework for evaluation. [RedCode](https://github.com/AI-secure/RedCode) is a benchmark that focuses on the safety of code agents, evaluating their ability to recognize and handle risky code. [DOMAINEVAL](https://domaineval.github.io) is a multi-domain code benchmark that assesses the capabilities of LLMs in various domains, including computation, system, and cryptography. Finally, [BigCodeBench](https://bigcodebench.github.io) is a benchmark that challenges LLMs to invoke multiple function calls from diverse libraries and domains.\n",
+       "\n",
+       "## Evaluation Methodologies\n",
+       "The evaluation methodologies employed by these benchmarks vary, but most involve a combination of automatic and manual evaluation. [DS-1000](https://ds1000-code-gen.github.io) uses multi-criteria metrics to evaluate the correctness and reliability of generated code, while [BioCoder](https://github.com/gersteinlab/biocoder) employs a fuzz-testing framework to assess the robustness of generated code. [RedCode](https://github.com/AI-secure/RedCode) uses a combination of automatic and manual evaluation to assess the safety of code agents, and [DOMAINEVAL](https://domaineval.github.io) relies on automatic evaluation to assess the capabilities of LLMs in various domains. [BigCodeBench](https://bigcodebench.github.io) uses a combination of automatic and manual evaluation to assess the ability of LLMs to invoke multiple function calls from diverse libraries and domains.\n",
+       "\n",
+       "## Results and Discussion\n",
+       "The results of these benchmarks highlight the strengths and weaknesses of current LLMs in code generation. [DS-1000](https://ds1000-code-gen.github.io) shows that the current best public system achieves 43.3% accuracy, leaving ample room for improvement. [BioCoder](https://github.com/gersteinlab/biocoder) demonstrates that successful models require domain-specific knowledge of bioinformatics and the ability to accommodate long prompts with full context. [RedCode](https://github.com/AI-secure/RedCode) highlights the need for stringent safety evaluations for diverse code agents, as current models tend to produce more sophisticated and effective harmful software. [DOMAINEVAL](https://domaineval.github.io) reveals significant performance gaps between LLMs in different domains, with some models falling short on cryptography and system coding tasks. Finally, [BigCodeBench](https://bigcodebench.github.io) shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores significantly lower than human performance.\n",
+       "\n",
+       "## Future Work\n",
+       "The development of reliable and comprehensive benchmarks is crucial for further progress in code generation research. Future work should focus on creating benchmarks that evaluate the capabilities of LLMs in a variety of domains and tasks, as well as assessing their safety and reliability. The use of multi-criteria metrics and fuzz-testing frameworks can help to ensure the correctness and robustness of generated code. Additionally, the development of benchmarks that challenge LLMs to invoke multiple function calls from diverse libraries and domains can help to assess their ability to follow complex instructions and use function calls precisely.\n",
+       "\n",
+       "# Conclusions\n",
+       "In conclusion, recent developments in code generation benchmarks have highlighted the strengths and weaknesses of current LLMs in code generation. The design and evaluation methodologies of these benchmarks have provided valuable insights into the capabilities and limitations of LLMs, and have identified areas for further research and improvement. As the field of code generation continues to evolve, the development of reliable and comprehensive benchmarks will remain crucial for assessing the capabilities and safety of LLMs, and for driving further progress in this area."
+      ],
+      "text/plain": [
+       "<IPython.core.display.Markdown object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
    "source": [
     "display(Markdown(result))"
    ]
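For context, the re-executed cell's source is `display(Markdown(result))`, which is what produces the `text/markdown` output recorded above. A minimal sketch of that rendering step, assuming the notebook imports the IPython display helpers earlier and that `result` holds the review text generated by the agent (the string below is only a placeholder):

```python
from IPython.display import Markdown, display

# Placeholder standing in for the agent-generated review text.
result = "# Abstract\nRecent code generation benchmarks ..."

# Renders the string as rich Markdown in the notebook; the plain-text
# fallback is the "<IPython.core.display.Markdown object>" seen in the diff.
display(Markdown(result))
```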
