Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
Rahul Bissa · Abhishek Vyas · Yash Jain
Submitted May 28, 2026
Abstract
We study supervised fine-tuning for screen-conditioned action prediction — given a persona and a product screen, predicting the action a real user would take. We introduce PiSAR, a dataset of 12,929 tuples drawn from app reviews, demographic data, and shopping traces, and evaluate fine-tuned models against frontier zero-shot baselines on a 661-row held-out test set.
Frontier baselines (Claude Opus 4.7 and GPT-5.5) reach semantic similarity scores of 0.459 and 0.482, respectively. A fine-tuned Qwen3-VL-8B-Instruct reaches 0.783, with 79% of rows clearing the 0.7 threshold versus 1–2% for the baselines. The same training recipe applied to Gemma-4-26B-A4B-IT yields only 0.441, below both zero-shot baselines. We read this as evidence of a recipe-vs-model mismatch: the reasoning-tuned, high-parameter model resists displacement and would likely need more data or a stronger fine-tuning method to move.
Key results
0.783: fine-tuned Qwen3-VL-8B-Instruct semantic similarity, up from 0.459 (Claude Opus 4.7) and 0.482 (GPT-5.5) zero-shot.
79%: share of test rows clearing the 0.7 similarity threshold after fine-tuning, versus 1–2% for frontier baselines.
12,929: PiSAR tuples sourced from app reviews, demographic data, and shopping traces, evaluated on a 661-row held-out test set.
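The two headline metrics can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: that semantic similarity is a cosine similarity between embedding vectors of the predicted and reference actions, that the reported score is a mean over test rows, and that "clearing" the threshold means a score of at least 0.7. The function names cosine_similarity and summarize_scores are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def summarize_scores(
    scores: Sequence[float], threshold: float = 0.7
) -> Tuple[float, float]:
    """Return (mean score, share of rows with score >= threshold).

    The >= comparison is an assumption; the abstract only says rows
    "clear" the 0.7 threshold.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    mean = sum(scores) / len(scores)
    share = sum(1 for s in scores if s >= threshold) / len(scores)
    return mean, share


if __name__ == "__main__":
    # Toy per-row scores, for illustration only; not data from the paper.
    toy_scores = [0.9, 0.8, 0.5, 0.6]
    mean, share = summarize_scores(toy_scores)
    print(f"mean similarity: {mean:.3f}, share >= 0.7: {share:.0%}")
```

Reporting the threshold share alongside the mean separates a model that is moderately close on every row from one that is very close on most rows and far off on a few.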
Bissa, R., Vyas, A., & Jain, Y. (2026). Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark. arXiv:2605.29400.