Tyler Schultz

Auto-research experimentation

Auto-research, but for prompts.

Testing whether an eval loop can improve a prompt-driven unit-test discovery skill.

  • Prompt engineering
  • Evaluation
  • Unit tests

Posted April 30, 2026

Can we tell whether a skill is performing the way we intend? I used an auto-research loop to test a unit-test discovery skill against a repeatable eval, then let the loop propose and score prompt changes. Across 64 rounds, the target F1 climbed from 0.228 to 0.480.

This was a proof of concept rather than a production-ready result; adopting it in earnest would require more data, a held-out evaluation set, and stronger evidence that the gains generalize.

  • Rounds: 64, across three branches
  • Keeps: 11 accepted prompt edits
  • Improvement: +110% from baseline to best F1

The pattern

What is auto-research?

The basic idea is to turn research into a closed loop. An agent proposes a small change, runs a fixed evaluation, compares the result to the current best score, and either keeps the change or throws it away.

That only works when the eval is cheap, repeatable, and difficult to game. The metric becomes the feedback signal that lets the loop search without a human judging every intermediate attempt.
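As a sketch, the entire pattern fits in a few lines. Everything here is illustrative rather than the actual harness: `propose_change` and `run_eval` are stand-ins for the agent and the benchmark.

```python
import random

def propose_change(prompt: str) -> str:
    """Stand-in for the agent suggesting one small prompt edit."""
    return prompt + "\n- one more instruction"

def run_eval(prompt: str) -> float:
    """Stand-in for the fixed, repeatable benchmark (here it returns noise)."""
    return random.random()

best_prompt = "base skill prompt"   # in the real run, the skill's README
best_score = run_eval(best_prompt)

for _ in range(64):                 # one iteration per round
    candidate = propose_change(best_prompt)
    score = run_eval(candidate)
    if score > best_score:          # keep: the edit becomes the new baseline
        best_prompt, best_score = candidate, score
    # otherwise the change is discarded and the next proposal starts from best
```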

The problem

Unit tests: discovery and implementation.

When we ask an agent to write unit tests, we usually collapse two jobs into one. First it has to decide what behaviors are worth testing. Then it has to generate the actual test code.

The issue is discovery quality: how do we know the skill is targeting the correct behaviors and use cases before we ever ask it to write the final test implementation?

The target

Our focus.

We want to focus on the discovery side: finding the right behaviors and use cases to write tests for.

This experiment is solely focused on automatically determining which tests should be written, not generating the final XCTest implementation.

The dataset

A collection of ViewModels.

We use a collection of ViewModels as the target surface for generating use cases and behaviors. Each run asks the skill to inspect those files and decide which tests should exist.

A ViewModel might contain validation rules, async submit paths, state transitions, enum modes, or computed properties that drive what the user sees.

The artifact

The output.

For each identified behavior or user path, we output a test title, a mock function name, and a short description of what that test should cover.

This gives us a clean intermediate artifact: the agent's answer to the question, "What should we test?"
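The post doesn't pin down an exact schema, but one plausible shape for that artifact, with invented field names and an invented example entry, is:

```python
from dataclasses import dataclass

@dataclass
class TestIdea:
    title: str        # human-readable test title
    mock_name: str    # suggested test function name
    description: str  # what the test should cover

# A hypothetical entry for a login ViewModel:
idea = TestIdea(
    title="Submit stays disabled for an invalid email",
    mock_name="test_submitDisabledWhenEmailInvalid",
    description="Entering a malformed email keeps the submit action disabled.",
)
```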

The answer key

The golden set.

For selected ViewModels, we already have unit tests that humans wrote. Those tests define the behaviors this experiment treats as the target.

The eval is not asking whether every possible test was found. It asks how closely the agent's proposed plan matches the existing human-defined set.
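Concretely, you can picture the golden set as a map from each ViewModel to the behaviors its human-written tests cover. The file names and behaviors below are invented for illustration:

```python
# Hypothetical golden set derived from existing human-written tests.
GOLDEN: dict[str, list[str]] = {
    "LoginViewModel.swift": [
        "submit stays disabled while the email is invalid",
        "a failed submit surfaces an error message",
        "loading state is set during the async submit",
    ],
    "SettingsViewModel.swift": [
        "toggling notifications persists the preference",
    ],
}
```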

The comparison

LLM as a judge.

The judge takes the agent's proposed test plan and the expected tests for the same ViewModel. It then decides which generated ideas match, which are extra, and which expected behaviors were missed.

This is the layer that turns a qualitative test plan into measurable feedback.
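A minimal judge sketch, assuming a generic `call_model` client and a JSON verdict format; both are my inventions, since the post doesn't specify the judge's interface:

```python
import json

JUDGE_PROMPT = """You are grading a unit-test plan for one ViewModel.
Expected behaviors:
{expected}

Proposed tests:
{proposed}

Return JSON with three lists: "matched", "extra", and "missed"."""

def call_model(prompt: str) -> str:
    """Stand-in for whatever LLM client the judge runs on."""
    raise NotImplementedError

def judge(proposed: list[str], expected: list[str]) -> dict[str, list[str]]:
    raw = call_model(JUDGE_PROMPT.format(
        expected="\n".join(expected),
        proposed="\n".join(proposed),
    ))
    return json.loads(raw)  # {"matched": [...], "extra": [...], "missed": [...]}
```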

The setup

Recap so far.

At this point, the pieces are in place: a dataset of ViewModels, a golden set of expected tests, a generated candidate output, and a judge that can compare the two.

That gives us the process we need for the loop: generate test ideas, judge them against the target, turn the comparison into a score, and use that score to decide what happens next.

  1. Choose ViewModels: the dataset
  2. Define golden set: expected behaviors
  3. Generate output: candidate tests
  4. Judge matches: candidate vs. target
  5. Score and loop: keep or discard

From verdict to number

The judge's verdict has to become a score.

So far the judge only tells us which proposed tests matched the golden set and which were missed. That is a qualitative read. The loop needs a single number it can compare across rounds.

Counting matches alone is misleading: a prompt could pad the output with dozens of weak ideas and rack up matches, or stay too cautious and miss most of the target set. We need a score that punishes both failure modes.

F1: our metric

The loop needs a numeric score.

Precision tells us how much of the generated plan was useful. Recall tells us how much of the golden set it found. F1 combines those two into one target score.
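In code, with hypothetical counts, the metric and both failure modes look like this:

```python
def f1(matched: int, generated: int, expected: int) -> float:
    precision = matched / generated   # useful share of the generated plan
    recall = matched / expected       # share of the golden set that was found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical round: 12 of 40 ideas matched a 20-behavior golden set.
print(f1(matched=12, generated=40, expected=20))    # 0.4
# Padding the plan to 200 ideas with the same 12 matches tanks precision:
print(f1(matched=12, generated=200, expected=20))   # ~0.11
```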

That single number is what lets the loop compare prompt changes without a human manually reviewing every round.

The run

The eval loop is the measurement engine.

For each round, the agent uses the current prompt to inspect the ViewModels and generate test ideas. The judge compares those ideas to the golden set and returns the score.

Now every prompt edit can be evaluated the same way, against the same ViewModels, with the same scoring logic.
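Putting the earlier sketches together, one round might look like this. Pooling counts across files (micro-averaging) is an assumption on my part, and `generate_test_ideas` stands in for the skill under test:

```python
def generate_test_ideas(prompt: str, viewmodel_path: str) -> list[str]:
    """Stand-in for the skill: run the agent with the current prompt on one file."""
    raise NotImplementedError

def score_round(prompt: str, viewmodels: list[str],
                golden: dict[str, list[str]]) -> float:
    matched = generated = expected = 0
    for vm in viewmodels:
        proposed = generate_test_ideas(prompt, vm)
        verdict = judge(proposed, golden[vm])   # judge sketch from above
        matched += len(verdict["matched"])
        generated += len(proposed)
        expected += len(golden[vm])
    return f1(matched, generated, expected)     # f1 helper from above
```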

The rule

The decision is intentionally simple.

If a prompt change improves the score, the loop keeps the edit (the prompt itself lives in the skill's README). If the score drops or the run fails, the loop discards the change.

That keep-or-revert rule is what turns the eval into an optimization process.
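Since the prompt lives in the skill's README, the decision reduces to keeping or restoring a file. A sketch with invented paths:

```python
import shutil

def keep_or_revert(new_score: float, best_score: float,
                   readme: str = "skill/README.md",
                   backup: str = "skill/README.md.best") -> float:
    if new_score > best_score:
        shutil.copyfile(readme, backup)   # keep: this edit is the new best
        return new_score
    shutil.copyfile(backup, readme)       # revert to the last accepted prompt
    return best_score
```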

The search

Then the loop repeats.

Once a change is kept or discarded, the agent proposes another prompt tweak and runs the same process again.

Over many rounds, the loop searches for instructions that make the skill better at identifying the right unit-test ideas.

The point

This gives prompt work a real feedback loop.

Without an eval, prompt changes are mostly judged by intuition and spot checks. This setup gives every change the same benchmark.

The question was whether that feedback signal was strong enough for an automated loop to discover useful prompt improvements.

What happened

64 rounds. 11 keeps, 47 discards.

[Chart: F1 score by round, generated locally from the same run table used by the dashboard.]

What the trajectory shows

Most of the gain came late, on the third branch.

The first branch (apr24b) crept up from the 0.228 baseline into the high‑0.20s. The second branch (apr24c) pushed past 0.30 by reshaping how tests get scoped and named. The biggest jump came in the third branch (apr24d), where switching from "more coverage" to "fewer, sharper tests" took the score from the mid‑0.30s to a peak of 0.480.

The curve is not a smooth climb. Long stretches of discards sit between every keep, and a handful of rounds crash outright. The loop spends most of its time learning what does not help.

What kinds of edits worked

Narrowing scope beat adding examples.

The kept edits cluster around a single idea: tell the skill to test fewer things, more precisely. Pruning rules, scoping by what the View actually consumes, requiring validation tests, and skipping async / navigation / error categories all stuck. Adding worked examples, longer naming guidance, or stricter target counts mostly did not.

You can see this in the output volume. The baseline generated 161 test ideas. The best run generated 67. The loop learned that being more selective was the highest‑leverage change it could make.

Round by round

Prompt edits, kept or discarded.

On the dashboard, picking any experiment round compares the full prompt Markdown before and after that edit; the diff is generated from the README change for that round.

For example, the peak round, apr24d / round 30 ("Skip async/nav/error", kept, F1 0.480), diffs the base prompt before the round against the updated prompt after it.

Limitations and what's next

Treat this as a proof of concept.

The gain is real on this benchmark, but the benchmark is small: a fixed set of ViewModels, a single golden set, and an LLM judging another LLM's output. The loop could be quietly fitting to the judge's preferences instead of to better tests.

To trust this further would mean a held‑out set of ViewModels the loop never sees, more than one judge model, human spot checks on the kept rounds, and a test of whether the winning prompt generalizes to codebases beyond FXAir. The next step is closing those gaps before treating any of this as a production result.