This document tracks the highest-value research improvements for StableSteering as a study platform.
It focuses on:
- research design
- experimental validity
- evaluation quality
- interpretability
- study operations
- comparative baselines
It does not focus on core engineering execution; that belongs in a separate engineering-focused document.
The current system already supports:
- iterative steering sessions
- multiple samplers and updaters
- multiple feedback modes at the schema level
- deterministic test paths
- replay and trace capture
- real GPU-backed image generation
This is enough for exploratory pilot work, but not yet enough for strong claims about algorithm quality, usability, or scientific validity.
The largest current gaps are:
- limited comparative baselines
- limited human-study instrumentation
- no formal study protocols in the repo
- limited analysis automation
- weak coverage of confounds like seed sensitivity and user inconsistency
Priority levels used below:
- R0: Needed before making strong research claims.
- R1: Strongly improves study quality and interpretability.
- R2: Valuable expansions once the core research loop is stable.
Why it matters:
- steering only matters scientifically if it beats simpler alternatives
- without baselines, improvements can be mistaken for ordinary prompt iteration or random luck
Implementation notes:
- define a minimum comparison set:
- prompt rewriting only
- prompt-only manual iteration without steering state
- no-update random sampling
- updater comparisons such as winner-copy, winner-average, and linear-preference
- lock the task set and evaluation rules before collecting data
- require every new strategy claim to include at least one baseline comparison
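The baseline updaters named above (winner-copy, winner-average, linear-preference) can be sketched in a few lines. Function names and signatures here are illustrative assumptions, not the repo's API; steering state and candidates are plain lists of floats:

```python
def winner_copy(state, winner):
    # Jump straight to the winning candidate's steering vector.
    return list(winner)

def winner_average(state, winner, alpha=0.5):
    # Move partway from the current state toward the winner.
    return [(1 - alpha) * s + alpha * w for s, w in zip(state, winner)]

def linear_preference(state, winner, loser, lr=0.1):
    # Step along the winner-minus-loser direction, a crude preference gradient.
    return [s + lr * (w - l) for s, w, l in zip(state, winner, loser)]
```

Even trivial rules like these give every fancier updater something concrete to beat.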
Success signal:
- every reported result can be compared against a non-steering or weaker-steering baseline
Why it matters:
- informal operator habits create hidden study drift
- protocols turn one-off demos into repeatable experiments
Implementation notes:
- define pilot-study templates with prompt selection rules, stopping criteria, and operator instructions
- version the protocol documents alongside the code
- standardize how prompts, negative prompts, configuration presets, and success criteria are recorded
- separate exploratory studies from claim-bearing studies in documentation and reporting
Success signal:
- two different operators can run the same study and produce comparable artifacts
Why it matters:
- seed effects, fatigue, UI bias, and interruptions can dominate outcomes if they are not measured
- better logging makes null or mixed results more interpretable
Implementation notes:
- log hidden repeats for agreement checks
- add user confidence and decision time fields where appropriate
- log interruptions, retries, abandonment, and mid-session config changes
- distinguish runtime failures from preference uncertainty in analysis exports
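A confound-aware feedback record could look like the sketch below; every field name is an assumption for illustration, not the project's actual schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class FeedbackEvent:
    """Illustrative confound-aware log record; field names are assumptions."""
    session_id: str
    round_index: int
    candidate_id: str
    choice: str                  # e.g. "winner", "approve", "reject"
    confidence: Optional[float]  # user-reported confidence in [0, 1], if collected
    decision_time_ms: int        # time from candidate display to judgment
    is_hidden_repeat: bool       # candidate re-shown for agreement checks
    interrupted: bool            # session-interruption flag for confound analysis

event = FeedbackEvent("s1", 3, "c42", "winner", 0.8, 1450, False, False)
row = asdict(event)  # flat dict, ready for a tidy analysis export
```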
Success signal:
- a disappointing or surprising session can be explained in terms of recorded confounds rather than guesswork
Why it matters:
- qualitative enthusiasm is useful for exploration but too weak for evaluating methods
- explicit criteria make it possible to stop, compare, and reject hypotheses honestly
Implementation notes:
- define expected effect sizes or directional improvements for key tasks
- define acceptable operator burden and session length
- add replay-based success checks such as incumbent improvement rate and convergence stability
- specify robustness thresholds across alternate seeds and repeated runs
Success signal:
- the team can say clearly when a strategy is better, worse, or inconclusive
Why it matters:
- raw win counts are not enough to explain why a strategy helped or failed
- richer metrics reveal speed, stability, and user burden tradeoffs
Implementation notes:
- compute incumbent win rate, rounds-to-satisfaction, preference consistency, and seed robustness
- collect user-reported controllability and fatigue
- separate outcome quality metrics from interaction-cost metrics
- report uncertainty and sample counts together with aggregate values
Success signal:
- strategy comparisons show both effectiveness and operator cost
Why it matters:
- analysis should not begin with manual cleaning of raw session files
- structured exports reduce friction for notebooks, dashboards, and papers
Implementation notes:
- export tidy CSV or parquet tables for candidates, feedback events, rounds, and sessions
- include join keys and session metadata in each table
- preserve references to replay bundles and trace reports
- version the export schema and document it clearly
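A minimal sketch of what "tidy with join keys" means in practice, using only the standard library; the table and column names are hypothetical:

```python
import csv
import io

# Every table carries session_id as a join key back to the sessions table.
sessions = [{"session_id": "s1", "prompt": "a red barn", "sampler": "trust_region"}]
rounds = [
    {"session_id": "s1", "round_index": 0, "incumbent_id": "c1"},
    {"session_id": "s1", "round_index": 1, "incumbent_id": "c3"},
]

def to_csv(rows):
    # Serialize a list of flat dicts into one tidy CSV table.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rounds_csv = to_csv(rounds)
```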
Success signal:
- a researcher can load a session corpus into a notebook with minimal preprocessing
Why it matters:
- reusable notebooks turn collected traces into repeatable analysis rather than one-off custom scripts
Implementation notes:
- create starter notebooks for trajectory analysis, seed robustness, sampler comparisons, and updater comparisons
- make notebooks read the official export schema rather than private ad hoc data layouts
- include plots for round progression, incumbent lineage, and preference stability
- keep example notebooks small enough to run on a local workstation
Success signal:
- the same notebooks can be rerun across new study cohorts without structural edits
Why it matters:
- replay is already one of the most information-dense artifacts in the project
- it should support comparative analysis, not only debugging
Implementation notes:
- derive structured summaries from replay automatically
- compute change-over-round plots and candidate-lineage views
- highlight baseline prompt images, incumbent carry-forward steps, and final winners
- support side-by-side replay comparisons across strategies or prompts
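One of the simplest replay-derived summaries is per-round movement of the incumbent steering state, which directly feeds change-over-round plots. A sketch, assuming the replay exposes one steering vector per round:

```python
import math

def change_over_round(incumbent_states):
    # Euclidean distance between consecutive incumbent steering states;
    # shrinking deltas are one crude signal of convergence.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [dist(a, b) for a, b in zip(incumbent_states, incumbent_states[1:])]

deltas = change_over_round([[0.0, 0.0], [1.0, 0.0], [1.0, 0.5], [1.0, 0.5]])
```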
Success signal:
- replay becomes a first-class analysis surface, not just a development convenience
Why it matters:
- different preference interfaces capture different kinds of user intent
- pairwise, top-k, winner-only, and approve/reject modes may produce very different noise and speed profiles
Implementation notes:
- run controlled comparisons of rating, pairwise, top-k, winner-only, and approve/reject flows
- measure speed, confidence, consistency, and subjective burden for each mode
- separate interface effects from updater effects in analysis
- align frontend instrumentation with the true interaction type rather than rating-derived shortcuts
Success signal:
- the project can justify when one feedback mode is preferable to another
Why it matters:
- user preference data is only valuable if it remains stable enough to interpret
- long sessions may degrade data quality even if users continue to participate
Implementation notes:
- add hidden repeat judgments and calibration rounds
- measure round count versus confidence, time-to-decision, and critique quality
- look for fatigue patterns such as faster but less consistent later-round judgments
- test whether some feedback modes resist fatigue better than others
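The "faster but less consistent" fatigue pattern can be screened for with something as simple as the least-squares slope of decision time over round index, checked alongside hidden-repeat agreement. A generic sketch:

```python
def decision_time_trend(times_ms):
    # Least-squares slope of decision time versus round index.
    # A sustained negative slope together with falling repeat-agreement
    # is one candidate fatigue signature.
    n = len(times_ms)
    xs = range(n)
    mx = (n - 1) / 2
    my = sum(times_ms) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, times_ms))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

slope = decision_time_trend([2400, 2100, 1900, 1600, 1300])  # ms per round
```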
Success signal:
- session length recommendations are based on observed behavior rather than guesswork
Why it matters:
- layout, ordering, and displayed metadata can shift user choices independently of the underlying model behavior
- UI bias can easily be mistaken for algorithmic improvement
Implementation notes:
- randomize candidate order in controlled experiments
- compare metadata-hidden versus metadata-visible variants
- compare different grid densities and spacing
- test whether richer replay context changes future judgments
Success signal:
- the influence of interface design on measured outcomes is quantified rather than ignored
Why it matters:
- preference quality depends not only on the model but also on how the system asks for user judgments
- some workflows may benefit more from shortlist, critique, or incumbent-comparison interactions than from one-shot winner choice
Implementation notes:
- compare shortlist selection, best-versus-incumbent, approve/reject, pairwise tournament, and critique-assisted workflows
- study when structured reason tags improve preference consistency or update quality
- compare forced-choice interfaces against interfaces that allow uncertainty or "cannot decide" responses
- measure whether region-aware or attribute-aware elicitation helps for inpainting and image-prompt tasks
- analyze whether elicitation mode changes the apparent value of a sampler or preference model
Success signal:
- the research program can explain which UI/elicitation modes work best for which task families and user goals
Why it matters:
- real user studies are expensive, slow, and noisy
- anchor-seeking synthetic users can stress-test algorithms under known hidden targets
Implementation notes:
- define anchor types such as latent steering anchors, reference images, attribute vectors, and text-derived targets
- build a synthetic user model that prefers candidates closer to the anchor while still showing uncertainty and bounded inconsistency
- model near ties, occasional reversals, fatigue-like noise, and critique text aligned with choices
- compare synthetic trajectories against real traces on round count, winner stability, and path geometry
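The core of such a synthetic user can be sketched as a softmax choice over anchor distance, so near ties occasionally reverse while clear differences rarely do. All names and parameters here are illustrative assumptions:

```python
import math
import random

def choose(candidates, anchor, temperature=0.15, rng=None):
    # Anchor-seeking synthetic user: prefers candidates closer to a hidden
    # anchor, with softmax noise so near ties sometimes flip.
    rng = rng or random.Random(0)
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, anchor)))
    scores = [-dist(c) / temperature for c in candidates]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(candidates) - 1

# The middle candidate is closest to the anchor and wins almost surely.
idx = choose([[0.0, 0.0], [0.9, 1.1], [3.0, 3.0]], anchor=[1.0, 1.0])
```

Lowering `temperature` makes the simulated user more decisive; raising it injects the bounded inconsistency mentioned above.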
Success signal:
- synthetic anchor-seeking sessions look structurally similar to real steering sessions and support meaningful ablations
Why it matters:
- users often want controlled variation around a good region, not only convergence to one hidden optimum
- diversity-seeking synthetic tasks open a richer class of evaluation problems
Implementation notes:
- define one-center diversity tasks that preserve core concept while varying composition, lighting, pose, or background
- define multi-center tasks where preference rewards both desirability and coverage
- formalize diversity objectives using center distance, inter-candidate distance, coverage, and duplicate avoidance
- test policies such as shortlist preference, winner-plus-diversity bonus, and coverage-seeking ranking
Success signal:
- diversity-seeking synthetic trajectories are clearly distinguishable from pure anchor-seeking trajectories
Why it matters:
- synthetic corpora can accelerate algorithm iteration before expensive human studies
- controlled hidden targets make failure analysis much easier
Implementation notes:
- generate corpora with known hidden targets and varied difficulty settings
- use those corpora for regression testing of samplers, updaters, and feedback interpreters
- build challenge sets containing misleading local optima, seed-sensitive candidates, near ties, and quality-diversity tradeoffs
- compare sim-to-real transfer by tuning on synthetic traces and evaluating on later human sessions
Success signal:
- synthetic data reduces wasted human-study cycles and catches weak strategies earlier
Why it matters:
- a poor simulator can bias the whole research program
- realism should be measured and improved, not assumed
Implementation notes:
- define realism metrics across win/loss structure, rating distributions, critique patterns, stop-time distributions, and path geometry
- fit simulator parameters to match observed human behavior more closely
- compare multiple simulator families rather than searching for one universal synthetic user
- document which simulator simplifications are believed to be harmless and which remain risky
Success signal:
- synthetic-user realism can be discussed and improved with evidence, not intuition alone
Why it matters:
- many practical workflows use reference images, masks, or structural controls rather than only text prompts
- a steering method that works only for plain text-to-image may not generalize to real creative work
Implementation notes:
- study image-prompt, image-variation, inpainting, and ControlNet-guided steering as distinct task families
- compare whether the same preference-update logic transfers across those workflows
- define pipeline-specific metrics such as structure adherence, local edit faithfulness, and reference-image faithfulness
- study cross-workflow transfer, including whether synthetic-user models calibrated in one workflow transfer to another
Success signal:
- the research program can say where iterative steering helps most across diffusion workflow families
Why it matters:
- steering dimension controls the size and geometry of the search space the user is trying to navigate
- too few dimensions may cap attainable quality or controllability, while too many may increase noise, redundancy, and cognitive burden
- strong results on one fixed dimension do not automatically transfer to other tasks, prompts, or diffusion workflows
Implementation notes:
- compare fixed low dimensions such as 2, 3, 5, 8, and 16 across matched prompts, seeds, and feedback budgets
- test adaptive dimension schedules, for example start low for stability and expand capacity only after early convergence
- compare basis-construction methods such as random orthogonal axes, PCA-style data-driven axes, learned steering dictionaries, and semantically aligned attribute axes
- measure not only final preference outcome but also rounds-to-satisfaction, duplicate rate, diversity, preference consistency, and subjective user burden
- analyze whether optimal dimension depends on task family, prompt complexity, feedback mode, sampler family, or diffusion workflow
- study whether dimension interacts strongly with incumbent carry-forward and trust-radius policies
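The adaptive schedule idea ("start low, expand only after convergence") can be made concrete with a small rule; the stall criterion and thresholds here are hypothetical, not a tested policy:

```python
def next_dimension(current_dim, recent_improvements, max_dim=16, window=3, tol=0.01):
    # Double the steering dimension only once per-round improvement has
    # stalled below `tol` for `window` consecutive rounds.
    stalled = (len(recent_improvements) >= window
               and all(abs(d) < tol for d in recent_improvements[-window:]))
    return min(current_dim * 2, max_dim) if stalled else current_dim

dim_after_stall = next_dimension(4, [0.001, 0.002, 0.000])   # stalled -> expand
dim_still_moving = next_dimension(4, [0.100, 0.002, 0.000])  # improving -> hold
```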
Success signal:
- the project can recommend steering-dimension choices for specific task classes with empirical justification
Why it matters:
- low-dimensional steering is interpretable and simple, but it may be too limited for some tasks
Implementation notes:
- compare low-dimensional steering against token-level, pooled-embedding, and hybrid representations
- measure whether richer representations improve controllability or only add instability
- preserve interpretability metrics while expanding representation capacity
Success signal:
- representation changes are justified by measurable gains rather than novelty alone
Why it matters:
- sampler quality strongly shapes the candidate set a user can choose from
- better samplers may improve both convergence and exploration efficiency
Implementation notes:
- compare current samplers with Thompson-style, quality-diversity, critique-conditioned, and adaptive trust-region methods
- evaluate both human-judged quality and synthetic benchmark performance
- test whether some samplers pair better with specific feedback modes or update rules
- include incumbent-versus-challenger and shortlist-aware samplers that explicitly optimize for comparison quality, not only search quality
- compare one-step greedy samplers against multi-round planning or lookahead samplers
Success signal:
- sampler comparisons reveal clear tradeoffs in exploration, stability, and user burden
Why it matters:
- simple winner heuristics are easy to inspect, but they may waste information from rankings, approvals, uncertainty, and critiques
- richer preference models can become the bridge between elicitation design and smarter candidate proposal
Implementation notes:
- compare Bradley-Terry, Plackett-Luce, Bayesian preference, and listwise reward models
- test models that incorporate confidence, near ties, or "cannot decide" outcomes instead of discarding them
- test critique-aware models that combine discrete selections with structured or free-text reasons
- compare models that infer absolute latent quality against models that only learn relative preference
- study whether posterior uncertainty from preference models improves downstream samplers
Success signal:
- the project has evidence about which preference-model family best converts user judgments into useful steering signals
Why it matters:
- the update rule determines how user judgments become steering-state movement
- weak updaters can waste high-quality feedback
Implementation notes:
- compare current simple updaters with Bradley-Terry, Bayesian preference, contextual bandit, and critique-aware approaches
- compare direct latent-state update rules against preference-model-plus-policy decompositions
- evaluate update sensitivity, robustness to noisy feedback, and stability over multiple rounds
- test how updater choice interacts with sampler choice and feedback modality
Success signal:
- updater research produces concrete guidance on which feedback-to-state mapping works best under which conditions
- establish baseline comparison tasks
- define prompt set
- define study protocol
- log confounds more explicitly
- add stronger metrics
- add analysis exports
- add notebooks and replay summaries
- build first realistic anchor-seeking synthetic-user pipeline
- build first diversity-seeking synthetic-user pipeline
- compare steering-dimension selection methods
- compare samplers
- compare preference and reward models
- compare updaters
- compare feedback modalities
- compare elicitation/UI modes
- compare representation strategies
- compare synthetic-user regimes against real-user outcomes
- compare steering behavior across text-to-image, image-prompt, inpainting, and ControlNet workflows
- define the baseline comparison matrix
- define pilot protocols and prompt/task sets
- improve confound logging
- define explicit research success criteria
- build analysis-ready exports
- create notebook-based analysis templates
- strengthen replay as an analysis asset
- compare feedback modalities with real users
- study richer elicitation modes and UI patterns
- evaluate consistency, fatigue, and interface bias
- define anchor-seeking synthetic-user tasks
- define diversity-seeking synthetic-user tasks
- build synthetic stress-test corpora
- evaluate synthetic-user realism
- extend studies to image-prompt, inpainting, and ControlNet workflows
- compare steering-dimension selection methods
- compare richer representations, samplers, preference models, and updaters
The next research phase should shift from “can the system run?” to “can the system support credible conclusions?”
That means focusing on:
- better baselines
- better measurement
- better confound control
- better analysis workflows
- better human-study structure
- realistic synthetic-user generation
- diversity-aware synthetic data regimes
- research coverage beyond text-only generation into richer diffusion pipeline families