AI QA Research Practice

Studying how AI systems behave when no one has specified what they should do.

Independent practice run by Carlos García. I build evaluation workflows, observe real human-AI interactions, and publish everything openly on GitHub.

Current Focus

Reproducible LLM evaluation

Inspect, Promptfoo, custom harnesses.

AI-assisted testing pipelines

Claude API + Playwright, end to end.

Human-AI interaction quality

Observed interaction patterns from real AI-assisted working sessions.

Code-level audit methodology

File and line, not vibes.

Public Artifacts

qa-ai-workflow ACTIVE

Claude API + Playwright pipeline. A user story goes in; a test plan, Playwright specs, and a bug report come out.

View on GitHub → Updated May 2026
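
A minimal sketch of the shape of that flow, not the repository's actual code. The names `run_pipeline`, `PipelineResult`, and `fake_generate` are hypothetical, and the model call is injected as a plain function so the example runs without the Claude API or Playwright installed. Drafting the bug report from the specs is also an assumption; a real pipeline would more likely draft it from the results of running them.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: takes a prompt, returns model text.
# In a real pipeline this would wrap an LLM API call.
Generate = Callable[[str], str]


@dataclass
class PipelineResult:
    test_plan: str
    playwright_spec: str
    bug_report: str


def run_pipeline(user_story: str, generate: Generate) -> PipelineResult:
    """Chain three generation steps, feeding each output into the next."""
    test_plan = generate(f"Write a test plan for this user story:\n{user_story}")
    spec = generate(f"Write Playwright specs for this test plan:\n{test_plan}")
    # Assumption: the bug report is drafted from the specs themselves.
    bug_report = generate(f"Draft a bug report for these specs:\n{spec}")
    return PipelineResult(test_plan, spec, bug_report)


def fake_generate(prompt: str) -> str:
    """Stand-in model: echoes the first line of the prompt."""
    return "[generated] " + prompt.splitlines()[0]


result = run_pipeline("As a user, I can reset my password.", fake_generate)
print(result.test_plan)
```

Injecting `generate` keeps the stage order testable on its own; swapping in a real model client changes one argument, not the pipeline.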

ai-human-observatory ACTIVE

Field observations on AI-human behavioral patterns from real working sessions. Documented openly as they happen.

View on GitHub → Updated May 2026

ai-eval-toolkit ACTIVE

Controlled experiments on LLM behavior. Exp 001 measures skill activation reliability across 10 runs. Dialogue Dynamics Eval A compares Listener and Advisor personas across 15 openers, using mechanical scoring plus an LLM-as-judge; findings are documented.

View on GitHub → Updated Jun 2026
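
A rough sketch of what a mechanical reliability check over repeated runs can look like. This is illustrative only: `activation_rate`, `fake_model`, and the `SKILL_USED` marker are hypothetical names, and the stand-in data below is made up for the example, not a result from Exp 001.

```python
from typing import Callable


def activation_rate(run_once: Callable[[int], str], marker: str, runs: int = 10) -> float:
    """Run the system `runs` times; return the fraction of outputs containing `marker`."""
    hits = sum(1 for i in range(runs) if marker in run_once(i))
    return hits / runs


def fake_model(run_index: int) -> str:
    """Stand-in for a model call: 'activates' on every run except indexes 3 and 7."""
    if run_index in (3, 7):
        return "plain answer"
    return "SKILL_USED: formatted answer"


rate = activation_rate(fake_model, marker="SKILL_USED", runs=10)
print(f"activation rate: {rate:.0%}")
```

A substring check like this is cheap and fully reproducible, which is why it pairs well with an LLM-as-judge pass for the qualities a string match cannot capture.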

About

Holteck is an independent AI quality observatory.

The focus is real-world behavior: how AI systems perform in production, what patterns emerge in human-AI interaction, and where better evaluation tooling still needs to be built.

Observations, experiments, and tooling are published openly on GitHub, including the failures.

Based in Monterrey, Mexico.