10 Key Insights from Automating AI Evaluation Analysis with GitHub Copilot
As an AI researcher at GitHub, I recently embarked on a journey that fundamentally changed how I work—and, in the process, automated my own intellectual toil. By leveraging GitHub Copilot and building custom coding agents, I streamlined the analysis of agent trajectories, a task that once required poring over hundreds of thousands of lines of JSON data. What started as a personal productivity hack evolved into a team-wide tool, enabling my peers on the Copilot Applied Science team to do the same. Here are ten critical lessons I learned along the way.
1. The Challenge of Analyzing Agent Trajectories
Evaluating coding agents involves dissecting trajectories: detailed logs of an agent's thoughts and actions while solving tasks. Each task in benchmarks like TerminalBench2 or SWEBench-Pro produces a massive JSON file. Multiply that by dozens of tasks and multiple runs daily, and you're facing hundreds of thousands of lines of JSON. Manual inspection at that scale is impractical, yet identifying patterns is crucial for improving agent performance. This bottleneck sparked the need for an automated solution.
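To make the scale concrete, here's a minimal sketch of tallying a run's output, assuming each task writes one JSON trajectory file into a per-run directory. The directory layout and file naming are hypothetical, not the actual benchmark harness format:

```python
import json
from pathlib import Path

def load_trajectory(path: Path) -> dict:
    """Load a single task's trajectory from disk."""
    with path.open() as f:
        return json.load(f)

def trajectory_stats(run_dir: Path) -> dict:
    """Tally how much raw material a single benchmark run produces."""
    files = sorted(run_dir.glob("*.json"))
    total_lines = sum(len(p.read_text().splitlines()) for p in files)
    return {"tasks": len(files), "total_lines": total_lines}

if __name__ == "__main__":
    # Hypothetical run directory; substitute your own layout.
    stats = trajectory_stats(Path("runs/terminal-bench/2024-06-01"))
    print(f"{stats['tasks']} tasks, {stats['total_lines']:,} lines of JSON")
```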

2. The Repetitive Loop with GitHub Copilot
Initially, I used GitHub Copilot to surface patterns within these trajectories. I'd ask it to summarize or highlight anomalies, then manually investigate. This reduced my reading load from hundreds of thousands of lines to a few hundred, but the query-and-investigate loop itself stayed the same across every benchmark. The engineer in me saw an opportunity: automate this intellectual grunt work so I could focus on higher-level analysis.
3. The Spark of Automation
Driven by equal parts frustration and inspiration, I decided to build a system that could automate the entire pattern-discovery workflow. Thus eval-agents was born: a set of coding agents designed to analyze trajectories autonomously. These agents use GitHub Copilot under the hood to parse JSON, identify trends, and even suggest hypotheses. My goal was to eliminate the repetitive cycle and free up time for creative problem-solving.
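The post doesn't show the actual eval-agents interface, so here's a hedged sketch of what such an agent contract might look like: each agent consumes parsed trajectories and returns findings, optionally with a suggested hypothesis. All names (Finding, EvalAgent, ToolFailureAgent) and trajectory fields (steps, exit_code, command, task_id) are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Finding:
    """One pattern an agent surfaced, plus the evidence behind it (hypothetical schema)."""
    pattern: str
    evidence: list[str] = field(default_factory=list)
    hypothesis: str | None = None  # optional suggested explanation

class EvalAgent(Protocol):
    """Contract every analysis agent implements (assumed, not the real API)."""
    name: str
    def analyze(self, trajectories: list[dict]) -> list[Finding]: ...

class ToolFailureAgent:
    """Toy example: flag tasks where shell commands kept failing."""
    name = "tool-failures"

    def analyze(self, trajectories: list[dict]) -> list[Finding]:
        findings = []
        for traj in trajectories:
            # Assumes each step records the command it ran and its exit code.
            failures = [s for s in traj.get("steps", [])
                        if s.get("exit_code", 0) != 0]
            if failures:
                findings.append(Finding(
                    pattern=f"{len(failures)} failed commands in {traj.get('task_id', '?')}",
                    evidence=[s.get("command", "") for s in failures],
                    hypothesis="the agent may be retrying a broken command",
                ))
        return findings
```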
4. Designing for Sharing and Collaboration
A key design principle was making agents easy to share. Agents are packaged as reusable modules with clear interfaces, allowing teammates to pull them into their own workflows. I drew on my experience as an open-source maintainer of the GitHub CLI to emphasize documentation and a low barrier to entry. If others can’t quickly adopt and adapt the tool, it’s a failure—the system must serve the whole team, not just its creator.
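One way to get the "reusable modules with clear interfaces" property is a small registry that agents opt into on import, so a teammate only has to import a module to make its agents discoverable. This is a sketch of one plausible mechanism, not necessarily how eval-agents does it:

```python
# Hypothetical registry: importing a teammate's module publishes its agents.
AGENT_REGISTRY: dict[str, type] = {}

def register_agent(cls: type) -> type:
    """Class decorator that registers an agent under its declared name."""
    AGENT_REGISTRY[cls.name] = cls
    return cls

@register_agent
class TimeoutAgent:
    """Toy agent: list tasks that ran out of time (field names assumed)."""
    name = "timeouts"

    def analyze(self, trajectories: list[dict]) -> list[str]:
        return [t.get("task_id", "?") for t in trajectories if t.get("timed_out")]

# Anyone on the team can now look agents up by name and run them.
agent = AGENT_REGISTRY["timeouts"]()
print(agent.analyze([{"task_id": "t1", "timed_out": True}, {"task_id": "t2"}]))
```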
5. Enabling Easy Authoring of New Agents
Not every analysis need is foreseeable. To future-proof the system, I created templates and scaffolding that make authoring new agents straightforward. A scientist can describe their desired analysis in natural language, and the framework generates the agent skeleton, as sketched below. This lowered the skill threshold, empowering colleagues without deep engineering backgrounds to craft custom agents.
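The post doesn't detail the scaffolding, but the template half of such a generator might look like the following. In the real workflow, Copilot would presumably fill in the analyze body from the natural-language description; this sketch only emits the skeleton, and every name in it is illustrative:

```python
from pathlib import Path

# Skeleton emitted for each new agent; the author (or Copilot) fills in analyze().
TEMPLATE = '''\
"""Auto-generated skeleton for: {description}"""

class {class_name}:
    name = "{slug}"

    def analyze(self, trajectories):
        findings = []
        # TODO: implement the analysis described above.
        return findings
'''

def scaffold_agent(description: str, out_dir: Path = Path("agents")) -> Path:
    """Turn a natural-language description into a named agent skeleton file."""
    slug = "-".join(description.lower().split()[:4])
    class_name = "".join(w.capitalize() for w in slug.split("-")) + "Agent"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{slug.replace('-', '_')}.py"
    path.write_text(TEMPLATE.format(description=description,
                                    class_name=class_name, slug=slug))
    return path

if __name__ == "__main__":
    print(scaffold_agent("find tasks where retries exploded"))
```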
6. Making Agents the Primary Contribution Mode
The third design goal was to foster a culture where coding agents become the primary way to contribute analyses. Instead of writing one-off scripts or sharing findings in meetings, team members now contribute new agents to a shared library. These agents get vetted, improved, and reused. This transforms ad-hoc insights into durable assets, promoting collaboration and reducing redundant work.

7. Leveraging Prior Experience as an OSS Maintainer
My background maintaining the GitHub CLI taught me the importance of clear conventions, version control, and community involvement. I applied these lessons to eval-agents: each agent ships with a manifest file of metadata and a set of bundled tests, and the repository includes a contribution guide. Open-source principles such as inclusive code reviews and changelogs ensure the tool evolves reliably and welcomes contributions from anyone on the team.
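The manifest format isn't shown in the post; as a hedged illustration, it might carry fields like these, validated at load time (the schema and field names are assumptions, not the real eval-agents format):

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path

@dataclass
class AgentManifest:
    """Metadata each agent ships with; the fields here are illustrative."""
    name: str
    version: str
    description: str
    author: str
    entry_point: str  # e.g. "agents.tool_failures:ToolFailureAgent"

def load_manifest(path: Path) -> AgentManifest:
    """Parse a manifest JSON file and fail loudly on missing fields."""
    data = json.loads(path.read_text())
    required = {f.name for f in fields(AgentManifest)}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"{path}: manifest missing fields {sorted(missing)}")
    return AgentManifest(**{k: data[k] for k in required})
```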
8. Accelerating the Development Loop
By automating pattern detection and hypothesis generation, my personal development loop became lightning fast. I now spend minutes on analyses that previously took hours. For example, an agent can categorize trajectory bottlenecks and produce a summary with visual highlights. This speed allows me to iterate on agent improvements multiple times a day, compounding progress across the benchmarks.
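As a rough illustration of the bottleneck-categorization idea, an agent could bucket steps by a coarse cause of slowness and roll the counts up into a run-level report. The buckets, thresholds, and step fields below are invented for the sketch, not taken from eval-agents:

```python
from collections import Counter

def categorize_bottlenecks(trajectory: dict) -> Counter:
    """Bucket each step by a coarse cause of slowness (step fields are assumed)."""
    buckets: Counter = Counter()
    for step in trajectory.get("steps", []):
        if step.get("exit_code", 0) != 0:
            buckets["failed_command"] += 1
        elif step.get("duration_s", 0) > 30:
            buckets["slow_command"] += 1
        elif step.get("output_tokens", 0) > 4000:
            buckets["long_generation"] += 1
    return buckets

def summarize(trajectories: list[dict]) -> str:
    """Aggregate bottleneck counts across a whole run into a readable report."""
    totals = sum((categorize_bottlenecks(t) for t in trajectories), Counter())
    lines = [f"  {cause}: {n}" for cause, n in totals.most_common()]
    return "Bottleneck summary:\n" + "\n".join(lines)
```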
9. Empowering the Entire Team
Once the tool was ready, I introduced it to the Copilot Applied Science team. Adoption was swift because the agents addressed a pain point everyone shared. Now multiple researchers independently run agents, share results, and even contribute enhancements. The tool became a force multiplier, letting the whole team conduct sophisticated trajectory analyses without manual drudgery.
10. Lessons Learned and Future Directions
Building eval-agents taught me that automating intellectual work is not a one-time effort—it requires ongoing maintenance and iteration. I now find myself in a new role: shepherding this tool so peers can automate their own tasks. Key lessons include emphasizing modular design, investing in tests, and fostering a contributor culture. Looking ahead, we plan to extend agents to auto-generate reports and even propose experiments, further pushing the boundaries of AI-assisted research.
Conclusion
This journey has transformed my job from an individual analyzer to a facilitator of automated analysis. By embracing GitHub Copilot and building reusable agents, my team and I have overcome the bottleneck of trajectory evaluation. The same pattern—identifying repetitive intellectual tasks and automating them—can benefit any software engineering or research team. The future of development is agent-driven, and we’re just scratching the surface.