Dev

GitHub repos gaining traction: what high-signal users are starring and what's climbing the board, captured daily and enriched from GitHub. Raw material for spotting new tech and patterns worth building on.

653 repos tracked

153 surfaced this week

141 created < 30d

Top language: Python

Python 271 · TypeScript 84 · HTML 36 · JavaScript 33 · Swift 32 · Rust 29 · Go 22 · Jupyter Notebook 21 · C++ 19 · Shell 8 · C 6 · Elixir 6

21 repos

★ 78 MetaAgentX/OpenCaptchaWorld JavaScript

[NeurIPS 2025] The first web-based benchmark and platform to evaluate visual reasoning and interaction capabilities of MLLM powered agents through diverse and dynamic CAPTCHA puzzles.
★ 1.6k rojo-rbx/rojo Rust

Rojo enables Roblox developers to use professional-grade software engineering tools

lua roblox roblox-studio sync
★ 92 OpenHands/benchmarks Python

Evaluation harness for OpenHands V1.
★ 8.2k NVIDIA/garak Python

the LLM vulnerability scanner

ai llm-evaluation llm-security security-scanners vulnerability-assessment
★ 2.7k harbor-framework/harbor Python

Framework for evaluating and improving agents

evals rl-environments terminal-bench
★ 36 castorini/anserini-tools Python

Evaluation tools shared across anserini, pyserini, and pygaggle
★ 193 amazon-far/abc Python new · 14d old

ABC: Scalable Behavior Cloning with Open Data, Training, and Evaluation

bc diffusion-policy robotics vla
★ 10 openai/model_spec_dataset

A public-domain dataset of prompts and scenarios for evaluating compliance with the OpenAI Model Spec.
★ 629 langchain-ai/agentevals Python

Readymade evaluators for agent trajectories
★ 1.1k langchain-ai/openevals Python

Readymade evaluators for your LLM apps
★ 11.7k tensorzero/tensorzero Rust

TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.

ai ai-engineering anthropic artificial-intelligence deep-learning genai generative-ai
★ 134 SymbolicML/DynamicExpressions.jl Julia

Ridiculously fast symbolic expressions

binary-trees expression-evaluator symbolic-computation symbolic-manipulation symbolic-regression
★ 1 uiuctml/convex_data_valuation Python

[ICML '26] Code repo for the paper entitled "Convex Dataset Valuation for Post-Training" at ICML 2026.

data-selection llm
★ 6 christian-machine-intelligence/virtue-bench-2 Python

VirtueBench V2: Multi-dimensional virtue evaluation benchmark for LLMs with tripartite and Ignatian temptation models
★ 102 zapier/AutomationBench Python

A benchmark for evaluating AI agents on realistic business workflows

benchmarks evals llm primeintellect
★ 2.5k huggingface/lighteval Python

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-framework evaluation-metrics huggingface
★ 19.1k trycua/cua HTML

Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).

agent ai-agent apple computer-use computer-use-agent containerization cua
★ 12 secemp9/rubrify Python

Rubric compiler and judge engine for LLM evaluation
★ 6 justachetan/flat-pack-bench Python

Code for "Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly" (CVPR 2026)

benchmark gemini ikea lvlms video-understanding vlm
★ 100.7k neovim/neovim Vim Script

Vim-fork focused on extensibility and usability

api c lua neovim nvim text-editor vim
★ 3 storytracer/ocrscout Python

ocrscout is a toolkit for frontier OCR models that allows you to run, evaluate and profile OCR models on your own data and compute infrastructure