AI QA Research Practice
Independent practice run by Carlos García. I build evaluation workflows, observe real human-AI interactions, and publish everything openly on GitHub.
Current Focus
Custom harnesses with LLM-as-judge scoring, rubric design, and documented limitations.
Claude API + Playwright, end to end.
Observed interaction patterns from real AI-assisted working sessions.
File and line, not vibes.
Public Artifacts
Claude API + Playwright pipeline. Takes a user story as input and produces a test plan, Playwright specs, and a bug report. Includes an LLM-as-judge eval layer that scores every generated test case on acceptance-criteria (AC) traceability, atomicity, and verifiability.
Field observations on AI-human behavioral patterns from real working sessions. Documented openly as they happen.
Controlled experiments on LLM behavior. Exp 001: skill activation reliability (10 runs). Dialogue Dynamics Eval A: Listener vs. Advisor persona comparison across 15 openers, scored mechanically and by LLM-as-judge, with findings documented.
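For readers who want a concrete picture of the LLM-as-judge layer mentioned above, here is a minimal sketch of how per-test-case judge scores could be validated and rolled up. It is not the code from the published repositories. The three criteria come from the pipeline description; the 1-5 scale, the `floor` threshold, and the names `JudgedCase`, `validate`, and `summarize` are assumptions made for illustration. The judge call itself is left out; the sketch starts from scores that have already been parsed.

```python
from dataclasses import dataclass
from statistics import mean

# Criteria named in the pipeline description. The 1-5 scale is an assumption.
CRITERIA = ("ac_traceability", "atomicity", "verifiability")
SCALE_MIN, SCALE_MAX = 1, 5


@dataclass(frozen=True)
class JudgedCase:
    """One generated test case with the judge's score per criterion (hypothetical shape)."""
    case_id: str
    scores: dict  # criterion name -> integer score returned by the judge


def validate(case: JudgedCase) -> None:
    """Reject judge output that skips a criterion or leaves the scale."""
    missing = [c for c in CRITERIA if c not in case.scores]
    if missing:
        raise ValueError(f"{case.case_id}: missing criteria {missing}")
    for name in CRITERIA:
        value = case.scores[name]
        if not (isinstance(value, int) and SCALE_MIN <= value <= SCALE_MAX):
            raise ValueError(f"{case.case_id}: {name}={value!r} is off the scale")


def summarize(cases, floor=3):
    """Return per-criterion means and the ids of cases scoring below `floor` anywhere."""
    for case in cases:
        validate(case)
    means = {c: mean(case.scores[c] for case in cases) for c in CRITERIA}
    flagged = [
        case.case_id
        for case in cases
        if min(case.scores[c] for c in CRITERIA) < floor
    ]
    return means, flagged


if __name__ == "__main__":
    # Made-up scores, only to show the shape of the output.
    demo = [
        JudgedCase("TC-1", {"ac_traceability": 5, "atomicity": 4, "verifiability": 4}),
        JudgedCase("TC-2", {"ac_traceability": 3, "atomicity": 2, "verifiability": 4}),
    ]
    print(summarize(demo))
```

Validating before aggregating keeps a malformed judge response from quietly skewing the averages, and flagging on the lowest criterion rather than the mean surfaces test cases that are strong overall but fail one dimension.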
About
I run experiments on AI behavior, build evaluation tooling, and publish what I find. Including the failures.