envisage / flux.1-fill + mask-decomposed evaluation

rhinoplasty goal visualization,
and the right ruler to measure it

first author with surgeon co-author Dr. Amit D. Bhrany, MD / 2026 / pipeline + evaluation protocol + paper

Envisage turns one frontal photo into a photorealistic, surgeon-grounded preview of a rhinoplasty goal: a pre-consult conversation starter, not an outcome prediction. Its real contribution is methodological. Full-face ArcFace is structurally confounded for a localized edit, so every method scores a negative identity gap against the unedited input. That is a finding about the metric, not a failure of the models. We build SurgicalScore, a mask-decomposed 0–1 protocol that measures the edit region instead of the copied pixels around it, and Envisage leads on it.

0.599 surgicalscore / highest
−0.048 smallest arcface gap
24 clinical presets
457 external pairs validated

the thesis

Localized generative edits need localized evaluation. The headline result is a diagnosis, not a leaderboard win: the standard ruler for facial identity is the wrong tool for an edit that touches roughly 5% of the pixels. Once you measure the right thing, Envisage is the strongest system on it.

the structural confound

When a localized edit is hard-composited (the nose regenerated, every other pixel copied verbatim from the input), full-face ArcFace is dominated by the copied non-surgical pixels. Outside-mask SSIM exceeds 0.999, so the score barely reflects the edit at all. Measured against the unedited input as a proxy, every method posts a negative paired ArcFace gain. That is a property of the metric, not a failure of the models.

the prior work that exposed it

The author's earlier system, LandmarkDiff, regenerated the entire face and composited the surgical region back. Decomposition revealed that over 95% of its identity score came from those composited pixels: 0.509 with compositing, 0.023 without. The number measured the copy, not the model. That confound is exactly what SurgicalScore was built to remove.

surgicalscore, the fix

A mask-decomposed 0–1 protocol that scores only what the surgery changes: edit direction, edit magnitude, masked perceptual fidelity, realism, and outside-mask preservation. A perfect-predictor control (ground-truth paste) scores 0.919, anchoring the ceiling. On this ruler Envisage leads at 0.599, well clear of the strongest baseline at 0.502.

honest scope

Envisage is a pre-consult goal-visualization tool, not a patient-specific outcome predictor. It shows a believable, affordable, surgical-scope-faithful preview from a single photo: no stereophotogrammetry rig, no CT, no per-clinic license. The open problem we name plainly is turning the candidate-space headroom into a deployable, non-oracle ranker.

the pipeline

One frontal photo in, a photorealistic goal preview out. Four stages, no task-specific training for the base pipeline, ~20s per image on a single NVIDIA L40S (48GB, BF16). Identity preservation is architectural: outside the mask, pixels are copied, not generated.

1 / tps pre-warp

A procedure-specific thin-plate-spline warp displaces nasal landmarks by 2–4px to seed the lateral geometry that depth alone cannot express, such as bridge thinning and tip refinement.

scipy RBF interpolation, thin-plate-spline kernel
MediaPipe 478 landmarks index the displacement field
Boundary anchors confine the warp to the surgical region
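
A minimal sketch of how such a pre-warp could be built with scipy's `RBFInterpolator` and its thin-plate-spline kernel. The function name, argument layout, and the idea of pinning boundary anchors to zero displacement are illustrative assumptions, not the released implementation; the landmark coordinates would come from MediaPipe in practice.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator


def tps_displacement_field(src_pts, dst_pts, anchor_pts, height, width):
    """Dense (H, W, 2) displacement field from sparse landmark moves.

    src_pts, dst_pts: (N, 2) arrays of (x, y) landmark positions before and
    after the intended move (hypothetical inputs).
    anchor_pts: (M, 2) boundary points pinned to zero displacement so the
    warp fades out at the edge of the surgical region.
    """
    src_pts = np.asarray(src_pts, dtype=float)
    dst_pts = np.asarray(dst_pts, dtype=float)
    anchor_pts = np.asarray(anchor_pts, dtype=float)

    # Control points: moved landmarks plus fixed anchors.
    control = np.vstack([src_pts, anchor_pts])
    # Target displacement at each control point (zero at the anchors).
    disp = np.vstack([dst_pts - src_pts, np.zeros_like(anchor_pts)])

    # Thin-plate-spline interpolation of the 2D displacement vectors.
    tps = RBFInterpolator(control, disp, kernel="thin_plate_spline")

    # Evaluate on every pixel; rows index y, columns index x.
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    return tps(grid).reshape(height, width, 2)
```

The field is indexed as `field[y, x]` and holds `(dx, dy)`. Applying it to an image needs a resampling step (for example a remap in OpenCV), and that step has to respect whether the field is treated as a forward or backward mapping.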

2 / modified depth

Depth Anything V2 estimates a monocular depth map, then landmark-indexed Gaussian kernels edit it to encode the intended tissue displacement, e.g. flattening a dorsal hump.

Kernel size and intensity scale with measured anatomy
Encodes the surgical change as a physically meaningful signal
Resolution-independent, no fixed pixel offsets
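
The depth edit can be pictured as adding a signed Gaussian bump to the depth map at a landmark, with the kernel derived from measured anatomy rather than fixed pixel values. The sketch below assumes a single landmark, and the ratios in `scaled_kernel` are hypothetical placeholders; the actual kernels and scaling rules are not specified here.

```python
import numpy as np


def apply_gaussian_depth_edit(depth, center_xy, sigma_px, amplitude):
    """Return a copy of `depth` with a signed Gaussian bump at one landmark.

    depth: (H, W) float array, e.g. a monocular depth estimate.
    center_xy: (x, y) landmark position in pixels.
    amplitude: signed peak change; the sign convention depends on the
    depth model's output and is an assumption here.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    bump = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma_px ** 2))
    return depth + amplitude * bump


def scaled_kernel(nose_width_px, depth_range, rel_sigma=0.25, rel_amp=-0.1):
    """Derive kernel size and intensity from measured anatomy.

    rel_sigma and rel_amp are hypothetical ratios, chosen only to show
    that nothing is expressed as a fixed pixel offset.
    """
    return rel_sigma * nose_width_px, rel_amp * depth_range
```

Because both the width and the strength are ratios of measured quantities, the same preset produces a proportionally equivalent edit on a 512px or a 2048px photo.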

3 / flux.1-fill + depth controlnet

FLUX.1-Fill-dev (12B rectified-flow fill) with a pretrained depth ControlNet regenerates only the masked region, conditioned on the modified depth and a preset-specific prompt.

Pretrained weights, no fine-tuning of the base pipeline
24 anatomically-grounded clinical presets condition the prompt
~20s per image on one NVIDIA L40S, BF16

4 / hard-mask composite + gating

A hard-mask composite copies every non-surgical pixel verbatim from the input (outside-mask SSIM > 0.999), so identity is preserved by construction. A 5-seed sweep and a scorer with 7 hard gates select the output.

Identity preservation is architectural, not a backbone property
5-seed candidate sweep × 7 hard gates pick the result
Operational PASS on 208/211 cases (98.6%)
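
A minimal sketch of the two ideas in this stage: a hard composite that takes generated pixels only inside the mask, and a selector that keeps the best candidate among those passing every gate. The gate predicates and the scoring function are hypothetical stand-ins, since the seven gates are not enumerated here.

```python
import numpy as np


def hard_composite(original, generated, mask):
    """Generated pixels inside the mask, original pixels everywhere else.

    original, generated: (H, W, C) arrays of the same shape and dtype.
    mask: (H, W) boolean array, True inside the surgical region.
    """
    inside = np.asarray(mask, dtype=bool)[..., None]
    return np.where(inside, generated, original)


def select_candidate(candidates, gates, score_fn):
    """Best-scoring candidate that passes every hard gate, or None.

    gates: iterable of predicates (candidate -> bool); hypothetical.
    score_fn: candidate -> float, higher is better; hypothetical.
    """
    gates = list(gates)
    passing = [c for c in candidates if all(gate(c) for gate in gates)]
    return max(passing, key=score_fn) if passing else None
```

Since no blending happens at the boundary, every pixel outside the mask is bit-identical to the input, which is why preservation holds regardless of the generative backbone.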

Released as a reference framework: 24 presets across three procedures, with 8 for rhinoplasty (Daniel's taxonomy), 8 for blepharoplasty (Tessier), and 8 for rhytidectomy (SMAS). Rhinoplasty is the evaluated headline procedure.

surgicalscore, the right ruler

The paper's primary contribution is the evaluation protocol, not the pipeline. SurgicalScore is a mask-decomposed 0–1 score that grades the edit region against paired post-operative ground truth instead of letting copied background pixels inflate a full-face number.

five components, one score

Directional alignment A (0.40), edit-magnitude fit B (0.30), masked LPIPS C (0.15), realism D (0.10; SER-FIQA + CR-FIQA), and outside-mask preservation E (0.05), with a passthrough floor of 0.30. The weights put 70% of the score on getting the surgical change right, not just on looking real.
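
The weighting can be written as a plain weighted sum. The sketch below assumes each component has already been mapped to a 0–1, higher-is-better value (masked LPIPS, for instance, is a distance and would need converting first). It does not model the passthrough floor, because how the floor is applied is not spelled out here.

```python
# Component weights as stated above; they sum to 1.0.
WEIGHTS = {"A": 0.40, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.05}


def surgical_score_raw(components):
    """Weighted sum of the five component scores.

    components: dict with keys A..E, each assumed to be in [0, 1] with
    higher meaning better. The passthrough floor is not modeled.
    """
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"component {name} must lie in [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Under these weights, a method with a perfect edit direction and magnitude but zero on everything else reaches 0.70, while a method that only looks realistic and preserves the background tops out at 0.15.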

anchored to a ceiling

A perfect-predictor control, pasting the real post-op nose with no model in the loop, scores SS_raw = 0.919. That fixes the top of the scale, so a method's score is read against an achievable ceiling rather than a hypothetical 1.0 that the protocol itself never reaches.

why full-face arcface fails here

ArcFace was trained to verify identity across changes orders of magnitude larger than a rhinoplasty. At a 5%-of-pixels edit, paired against the input, no method posts a positive gain. Envisage's gap is the smallest at −0.048 (95% CI −0.055 to −0.042), versus ICEdit at −0.139, FLUX.1-Kontext at −0.242, and InstructPix2Pix at −0.294 (all p<1e-4). The metric, not the model, is the bottleneck.

the in-mask shift, isolated

Because copied outside-mask pixels are identical between input and output, the paired full-face difference is dominated by the in-mask edit: a correct edit that shifts the nose toward the post-op target reads as lower identity. SurgicalScore isolates that shift and rewards it, which is precisely the signal full-face ArcFace destroys.

results

Rhinoplasty headline cohort: N=211 (ASPS public gallery 202 + private clinical archive 9). Four-way SurgicalScore comparison on the N=205 intersection; all gaps paired and significant at p<1e-4.

surgicalscore lead

Envisage is highest at 0.599 [0.579, 0.619], vs ICEdit 0.502, InstructPix2Pix 0.337, and FLUX.1-Kontext 0.229. The paired lead is +0.090 over ICEdit, +0.245 over InstructPix2Pix, and +0.189 over FLUX.1-Kontext (all p<1e-4, N=205 four-way intersection).

smallest arcface gap

No method beats the unedited input on full-face ArcFace. Envisage's gap is the smallest at −0.048, vs −0.139 / −0.242 / −0.294 for ICEdit / FLUX.1-Kontext / InstructPix2Pix. Envisage's output-to-GT cosine is 0.662 vs 0.711 for input-to-GT: the edit moves the nose toward the target, and the metric reads that as identity loss.

beats-input fraction

Envisage beats the input proxy on 16.1% of cases, vs 0.0% / 4.3% / 1.0% for InstructPix2Pix / ICEdit / Kontext. Under a confounded metric, even a small positive fraction is a meaningful separation.
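
The paired gap and the beats-input fraction can be computed from face embeddings as sketched below. The gap definition, cos(output, GT) minus cos(input, GT), is inferred from the figures above. The percentile bootstrap for the confidence interval is an assumption, not necessarily the procedure used in the paper, and the embeddings are taken as given (in practice they would come from an ArcFace model).

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def paired_gaps(input_embs, output_embs, gt_embs):
    """Per-case gap: cos(output, GT) - cos(input, GT)."""
    return np.array([
        cosine(out, gt) - cosine(inp, gt)
        for inp, out, gt in zip(input_embs, output_embs, gt_embs)
    ])


def summarize(gaps, n_boot=2000, seed=0):
    """Mean gap, percentile-bootstrap 95% CI, and beats-input fraction."""
    gaps = np.asarray(gaps, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample cases with replacement and recompute the mean each time.
    idx = rng.integers(0, len(gaps), size=(n_boot, len(gaps)))
    boot_means = gaps[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return {
        "mean_gap": float(gaps.mean()),
        "ci95": (float(lo), float(hi)),
        # Fraction of cases where the output is closer to GT than the input.
        "beats_input": float((gaps > 0).mean()),
    }
```

A negative mean gap with a small but nonzero beats-input fraction is the pattern reported for Envisage: most cases read as identity loss, a minority land closer to the post-op target than the unedited input does.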

component ablation

Each design choice carries weight: hard-mask composite ΔSS +0.034 (p=.001), the 24-preset anatomy prompt +0.034 (p=.002), depth-ControlNet conditioning +0.023 (p=.030).

the lead is not just the composite

Hand the baselines the same hard-mask composite and Envisage still holds a paired-significant lead: +0.045 over InstructPix2Pix, +0.058 over ICEdit. The edit quality, not the compositing trick, is doing the work.

5-seed oracle headroom

A GT-aware best-of-5 selection (an upper bound, not deployable) cuts the residual ArcFace gap by 73% (−0.054 → −0.015) and raises SurgicalScore from 0.609 to 0.743 on the 5-seed-valid subset (N=207). The good candidate already exists in the 5-seed sweep; the open problem is a non-oracle ranker to find it.
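
The difference between the oracle and a deployable ranker comes down to what the selector is allowed to see. In the sketch below, `score_fn` and `rank_fn` are hypothetical callables: the first compares a candidate to ground truth, the second sees only the candidate. Headroom is the score left on the table by the ranker.

```python
def oracle_best_of(candidates, ground_truth, score_fn):
    """Upper bound: pick the candidate that scores best against ground truth.

    Not deployable, because ground truth is unavailable before surgery.
    """
    return max(candidates, key=lambda c: score_fn(c, ground_truth))


def ranker_pick(candidates, rank_fn):
    """Deployable selection: rank_fn sees the candidate only, never GT."""
    return max(candidates, key=rank_fn)


def headroom(candidates, ground_truth, score_fn, rank_fn):
    """Score gap between the oracle pick and the ranker's pick (>= 0)."""
    best = oracle_best_of(candidates, ground_truth, score_fn)
    picked = ranker_pick(candidates, rank_fn)
    return score_fn(best, ground_truth) - score_fn(picked, ground_truth)
```

A ranker that closes this gap to zero would recover the oracle's numbers without access to the post-op photo.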

method	surgicalscore	arcface gap	beats input
Envisage (ours)	0.599	−0.048	16.1%
ICEdit	0.502	−0.139	4.3%
InstructPix2Pix	0.337	−0.294	0.0%
FLUX.1-Kontext	0.229	−0.242	1.0%

SurgicalScore (mask-decomposed, higher is better) and full-face ArcFace gain over the unedited input (every method is negative, a finding about the metric). Rhinoplasty cohort; four-way SurgicalScore on the N=205 intersection, all paired gaps p<1e-4. A perfect-predictor GT-paste control anchors SS_raw at 0.919; a 5-seed GT-oracle upper bound reaches 0.743.

external validation & surgeon review

A single benchmark can hide a confound. The negative gap and the operational gate were re-checked on an external corpus and read by the surgeon co-author.

457-pair external corpus

On an out-of-cohort ASPS/PCA corpus of 457 pairs, the structural negative gap replicates (pooled GT-to-pooled 0.597 vs baseline 0.664), so the confound is not an artifact of the headline cohort. 95.8% (438/457) of cases pass the identity gate.

surgeon-in-the-loop

A board-certified facial plastic surgeon, the co-author, Dr. Amit D. Bhrany, reviewed six representative outputs as a feasibility check on surgical plausibility. A blinded multi-rater GAIS/ROE study is named explicitly as future work, not claimed here.

Baselines: InstructPix2Pix, ICEdit (FLUX.1-Fill-dev + MoE-LoRA diptych), FLUX.1-Kontext-dev; internal diagnostics Direct Copy, TPS Warp, and FLUX inpainting without ControlNet. Data: HDA Plastic Surgery Database + external ASPS Photo Gallery + a private clinical archive (PCA).

my role

engineering

I built the full pipeline: the procedure-specific TPS pre-warp, landmark-indexed depth modification, FLUX.1-Fill + depth-ControlNet inpainting, the hard-mask composite, the 5-seed sweep and 7-gate scorer, the 24-preset library, and the Gradio + FastAPI live demo on Hugging Face Spaces.

research

I designed and validated SurgicalScore, ran the four-way benchmark and component ablations, established the full-face ArcFace confound and the LandmarkDiff decomposition that motivated it, ran the 457-pair external validation, and wrote the paper. First author; Dr. Bhrany contributed the surgical review.

paper, demo + code

paper

Envisage: Diffusion-Based Rhinoplasty Goal Visualization with Mask-Decomposed Evaluation. Mudit Agarwal (University of Washington) and Amit D. Bhrany, MD (UW School of Medicine, Otolaryngology–Head & Neck Surgery). Preprint; arXiv forthcoming.

read paper ↓

live demo

Run the pipeline on your own frontal photo across the 24 clinical presets for rhinoplasty, blepharoplasty, or rhytidectomy. Hosted on Hugging Face Spaces.

open demo ↗

source code

Released as infrastructure: the Envisage pipeline, the SurgicalScore implementation, the preset definitions, and matched split manifests. Every reported number is reproducible from the repository.

view on github ↗

Envisage is the successor to LandmarkDiff, the author's earlier landmark-conditioned diffusion system. Decomposing LandmarkDiff's identity score, 0.509 with compositing, 0.023 without, revealed the structural confound that motivated SurgicalScore and the inpainting-first redesign.

tools + methods

diffusion

FLUX.1-Fill, depth ControlNet, Depth Anything V2

geometry + masks

MediaPipe (478 landmarks), thin-plate splines (scipy), OpenCV

evaluation

ArcFace / InsightFace buffalo_l, LPIPS, SER-FIQA / CR-FIQA, SSIM, Monk Skin Tone

serving + infra

PyTorch, NVIDIA L40S, BF16, Gradio, FastAPI, Hugging Face Spaces
