Dev
GitHub repos gaining traction: what high-signal users are starring and what's climbing the board, captured daily and enriched from GitHub. Raw material for spotting new tech and patterns worth building on.
653 repos tracked
153 surfaced this week
141 created < 30d
Top language: Python
21 repos
- [NeurIPS 2025] The first web-based benchmark and platform to evaluate visual reasoning and interaction capabilities of MLLM powered agents through diverse and dynamic CAPTCHA puzzles.
- Rojo enables Roblox developers to use professional-grade software engineering tools
- Evaluation harness for OpenHands V1.
- the LLM vulnerability scanner
- Framework for evaluating and improving agents
- Evaluation tools shared across anserini, pyserini, and pygaggle
- ABC: Scalable Behavior Cloning with Open Data, Training, and Evaluation
- A public-domain dataset of prompts and scenarios for evaluating compliance with the OpenAI Model Spec.
- Readymade evaluators for agent trajectories
- Readymade evaluators for your LLM apps
- TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
- Ridiculously fast symbolic expressions
- [ICML '26] Code repo for the paper entitled "Convex Dataset Valuation for Post-Training" at ICML 2026.
- VirtueBench V2: Multi-dimensional virtue evaluation benchmark for LLMs with tripartite and Ignatian temptation models
- A benchmark for evaluating AI agents on realistic business workflows
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
- Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
- Rubric compiler and judge engine for LLM evaluation
- Code for "Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly" (CVPR 2026)
- Vim-fork focused on extensibility and usability
- ocrscout is a toolkit for frontier OCR models that allows you to run, evaluate and profile OCR models on your own data and compute infrastructure