envisage / flux.1-fill + mask-decomposed evaluation

rhinoplasty goal visualization,
and the right ruler to measure it

first author with surgeon co-author Dr. Amit D. Bhrany, MD / 2026 / pipeline + evaluation protocol + paper

Envisage turns one frontal photo into a photorealistic, surgeon-grounded preview of a rhinoplasty goal: a pre-consult conversation starter, not an outcome prediction. Its real contribution is methodological. Full-face ArcFace is structurally confounded for a localized edit, so every method scores a negative identity gap against the unedited input. That is a finding about the metric, not a failure. We build SurgicalScore, a mask-decomposed 0–1 protocol that measures the edit region instead of the copied pixels around it, and Envisage leads on it.

Localized generative edits need localized evaluation. The headline result is not a leaderboard win but a diagnosis: the standard ruler for facial identity is the wrong tool for an edit that touches roughly 5% of the pixels. Once you measure the right thing, Envisage is the strongest system on it.

the structural confound

When a localized edit is hard-composited (the nose regenerated, every other pixel copied verbatim from the input), full-face ArcFace is dominated by the copied non-surgical pixels. Outside-mask SSIM exceeds 0.999, so the score barely reflects the edit at all. Measured against the unedited input as a proxy, every method posts a negative paired ArcFace gain. That is a property of the metric, not a failure of the models.

the prior work that exposed it

The author's earlier system, LandmarkDiff, regenerated the entire face and composited the surgical region back. Decomposition revealed that over 95% of its identity score came from those composited pixels: 0.509 with compositing, 0.023 without. The number measured the copy, not the model. That confound is exactly what SurgicalScore was built to remove.
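The "over 95%" figure follows from the two scores quoted above. A quick arithmetic check, using only the numbers in this section:

```python
# Identity scores for LandmarkDiff as quoted in the text.
with_composite = 0.509
without_composite = 0.023

# Share of the identity score attributable to the composited (copied) pixels.
composite_share = (with_composite - without_composite) / with_composite
print(f"{composite_share:.1%}")  # 95.5%
```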

surgicalscore, the fix

A mask-decomposed 0–1 protocol that scores only what the surgery changes: edit direction, edit magnitude, masked perceptual fidelity, realism, and outside-mask preservation. A perfect-predictor control (ground-truth paste) scores 0.919, anchoring the ceiling. On this ruler Envisage leads at 0.599, well clear of the strongest baseline at 0.502.

honest scope

Envisage is a pre-consult goal-visualization tool, not a patient-specific outcome predictor. It shows a believable, affordable, surgical-scope-faithful preview from a single photo: no stereophotogrammetry rig, no CT, no per-clinic license. The open problem we name plainly: turning the candidate-space headroom into a deployable, non-oracle ranker.

One frontal photo in, a photorealistic goal preview out. Three generation stages plus a hard-mask composite and gating step, no task-specific training for the base pipeline, ~20s per image on a single NVIDIA L40S (48GB, BF16). Identity preservation is architectural: outside the mask, pixels are copied, not generated.

1 / tps pre-warp

A procedure-specific thin-plate-spline warp displaces nasal landmarks by 2–4px to seed the lateral geometry that depth alone cannot express, such as bridge thinning and tip refinement.

  • scipy RBF interpolation, thin-plate-spline kernel
  • MediaPipe 478 landmarks index the displacement field
  • Boundary anchors confine the warp to the surgical region
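The pre-warp described above can be sketched with scipy's `RBFInterpolator` and its thin-plate-spline kernel. This is a minimal illustration, not the released implementation: the function name `tps_warp`, the toy ramp image, and the anchor layout are assumptions, and real control points would come from MediaPipe landmarks rather than hand-written coordinates.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates


def tps_warp(image, src_pts, dst_pts):
    """Warp an (H, W, C) image so content at src_pts lands on dst_pts.

    Points are (x, y) pixel coordinates. Anchors are simply points whose
    source and destination coincide (zero displacement).
    """
    h, w = image.shape[:2]
    # Backward mapping: fit destination -> source so every output pixel
    # knows where to sample in the input.
    tps = RBFInterpolator(dst_pts, src_pts, kernel="thin_plate_spline")
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(np.float64)
    sample_xy = tps(grid)
    # map_coordinates expects (row, col) order, i.e. (y, x).
    coords = np.stack([sample_xy[:, 1].reshape(h, w),
                       sample_xy[:, 0].reshape(h, w)])
    channels = [map_coordinates(image[..., c], coords, order=1, mode="nearest")
                for c in range(image.shape[2])]
    return np.stack(channels, axis=-1)


# Toy example: a 32x32 horizontal ramp, one hypothetical landmark shifted
# 3 px to the right, with zero-displacement anchors at the image corners
# and around the edited region.
h = w = 32
ramp = np.tile(np.arange(w, dtype=np.float64), (h, 1))
image = np.dstack([ramp, ramp, ramp])
anchors = np.array([[0, 0], [31, 0], [0, 31], [31, 31],
                    [4, 16], [28, 16], [16, 4], [16, 28]], dtype=np.float64)
landmark_src = np.array([[14.0, 16.0]])
landmark_dst = landmark_src + np.array([[3.0, 0.0]])
src_pts = np.vstack([anchors, landmark_src])
dst_pts = np.vstack([anchors, landmark_dst])
warped = tps_warp(image, src_pts, dst_pts)
```

A thin-plate spline is a global interpolant, so zero-displacement anchors pin the warp at those points and damp it nearby rather than guaranteeing an exactly untouched exterior. In this pipeline the later hard-mask composite is what makes the outside-mask pixels exact.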

2 / modified depth

Depth Anything V2 estimates a monocular depth map, then landmark-indexed Gaussian kernels edit it to encode the intended tissue displacement, e.g. flattening a dorsal hump.

  • Kernel size and intensity scale with measured anatomy
  • Encodes the surgical change as a physically meaningful signal
  • Resolution-independent, no fixed pixel offsets
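A landmark-indexed Gaussian depth edit could look roughly like the sketch below. The function name `edit_depth`, the `sigma_frac` default, and the constant toy depth map are illustrative assumptions. In the real pipeline the depth map would come from Depth Anything V2 and the centre and reference length from measured landmarks.

```python
import numpy as np


def edit_depth(depth, center_xy, reference_len_px, amount, sigma_frac=0.15):
    """Add a Gaussian bump (or dent, if amount < 0) to a depth map.

    sigma is a fraction of a measured anatomical length, so the kernel
    scales with the face instead of using a fixed pixel size.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    sigma = sigma_frac * reference_len_px
    kernel = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return depth + amount * kernel


# Toy example: flatten a region of a constant depth map by 0.1 at its peak.
depth = np.full((64, 64), 0.5)
edited = edit_depth(depth, center_xy=(32, 20), reference_len_px=40.0,
                    amount=-0.1)
```

Expressing sigma as a fraction of an anatomical length is one way to get the resolution independence the list above mentions: doubling the image size doubles the reference length and therefore the kernel footprint.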

3 / flux.1-fill + depth controlnet

FLUX.1-Fill-dev (12B rectified-flow fill) with a pretrained depth ControlNet regenerates only the masked region, conditioned on the modified depth and a preset-specific prompt.

  • Pretrained weights, no fine-tuning of the base pipeline
  • 24 anatomically-grounded clinical presets condition the prompt
  • ~20s per image on one NVIDIA L40S, BF16

4 / hard-mask composite + gating

A hard-mask composite copies every non-surgical pixel verbatim from the input (outside-mask SSIM > 0.999), so identity is preserved by construction. A 5-seed sweep and a scorer with 7 hard gates select the output.

  • Identity preservation is architectural, not a backbone property
  • 5-seed candidate sweep × 7 hard gates pick the result
  • Operational PASS on 208/211 cases (98.6%)
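The composite and the gated selection can be sketched in a few lines. The seven gates are not specified on this page, so the sketch treats them as opaque booleans. The dict keys `seed`, `gates`, and `score`, and the rule of taking the top score among candidates that pass every gate, are assumptions for illustration.

```python
import numpy as np


def hard_composite(input_img, generated_img, mask):
    """Keep generated pixels inside the boolean mask, input pixels elsewhere."""
    return np.where(mask[..., None], generated_img, input_img)


def select_candidate(candidates):
    """Pick the best-scoring candidate among those that pass every gate.

    Each candidate is a dict with hypothetical keys: "seed", "gates" (a list
    of booleans, one per hard gate) and "score" (higher is better).
    Returns None when no candidate passes all gates.
    """
    passing = [c for c in candidates if all(c["gates"])]
    return max(passing, key=lambda c: c["score"]) if passing else None


rng = np.random.default_rng(0)
input_img = rng.random((8, 8, 3))
generated = rng.random((8, 8, 3))
mask = np.zeros((8, 8), dtype=bool)
mask[3:5, 3:5] = True  # stand-in for the surgical region
composite = hard_composite(input_img, generated, mask)

candidates = [
    {"seed": 0, "gates": [True] * 7, "score": 0.61},
    {"seed": 1, "gates": [True] * 6 + [False], "score": 0.90},  # fails a gate
    {"seed": 2, "gates": [True] * 7, "score": 0.72},
]
best = select_candidate(candidates)  # seed 2: best score among gate-passers
```

Because the composite uses a hard boolean mask, the outside-mask pixels are bit-identical to the input, which is why identity preservation is a property of the architecture and not of the generator.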

Released as a reference framework: 24 presets across three procedures, with 8 for rhinoplasty (Daniel's taxonomy), 8 for blepharoplasty (Tessier), and 8 for rhytidectomy (SMAS). Rhinoplasty is the evaluated headline procedure.

The paper's primary contribution is the evaluation protocol, not the pipeline. SurgicalScore is a mask-decomposed 0–1 score that grades the edit region against paired post-operative ground truth instead of letting copied background pixels inflate a full-face number.

five components, one score

Directional alignment A (0.40), edit-magnitude fit B (0.30), masked LPIPS C (0.15), realism D (0.10; SER-FIQA + CR-FIQA), and outside-mask preservation E (0.05), with a passthrough floor of 0.30. The weights put 70% of the score on getting the surgical change right, not just on looking real.
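As a rough sketch, the weighting above amounts to a convex combination of the five components. The code assumes each component has already been normalized to [0, 1] with higher meaning better (so masked LPIPS would have to be converted to a similarity first). The passthrough floor is deliberately left out, because this page does not say how it is applied.

```python
# Component weights as stated in the text; they sum to 1.0.
WEIGHTS = {"A": 0.40, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.05}


def surgical_score_raw(components):
    """Weighted sum of the five components, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"component {name} must be in [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# A hypothetical edit with perfect direction and magnitude but zero on the
# other three components lands at the 70% mark.
edit_only = surgical_score_raw({"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0, "E": 0.0})
perfect = surgical_score_raw({name: 1.0 for name in WEIGHTS})
```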

anchored to a ceiling

A perfect-predictor control, pasting the real post-op nose with no model in the loop, scores SS_raw = 0.919. That fixes the top of the scale, so a method's score is read against an achievable ceiling rather than a hypothetical 1.0 that the protocol itself never reaches.

why full-face arcface fails here

ArcFace was trained to verify identity across changes orders of magnitude larger than a rhinoplasty. At a 5%-of-pixels edit, paired against the input, no method posts a positive gain. Envisage's gap is the smallest at −0.048 (95% CI −0.055, −0.042), vs ICEdit −0.139, FLUX.1-Kontext −0.242, and InstructPix2Pix −0.294, all p<1e-4. The metric, not the model, is the bottleneck.
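One reading of the paired gain, consistent with the output-to-GT and input-to-GT cosines reported in the results, is the difference between two cosine similarities to the post-op ground truth. The sketch below uses toy 2-D vectors in place of real ArcFace embeddings. The function names are illustrative, not taken from the repository.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def paired_gain(emb_input, emb_output, emb_gt):
    """cos(output, GT) - cos(input, GT); negative means the edited image
    scores as less similar to the ground truth than the unedited input."""
    return cosine(emb_output, emb_gt) - cosine(emb_input, emb_gt)


# Toy vectors standing in for embeddings (not real ArcFace outputs).
emb_gt = np.array([1.0, 0.0])
emb_input = np.array([1.0, 0.5])
emb_output = np.array([1.0, 1.0])
gain = paired_gain(emb_input, emb_output, emb_gt)
```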

the in-mask shift, isolated

Because copied outside-mask pixels are identical between input and output, the paired full-face difference is dominated by the in-mask edit: a correct edit that shifts the nose toward the post-op target reads as lower identity. SurgicalScore isolates that shift and rewards it, which is precisely the signal full-face ArcFace destroys.

Rhinoplasty headline cohort: N=211 (ASPS public gallery 202 + private clinical archive 9). Four-way SurgicalScore comparison on the N=205 intersection; all gaps paired and significant at p<1e-4.
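This page does not say which paired test or interval method produced the reported p-values and confidence intervals. As one plausible way to get a paired interval on a per-case score difference, here is a percentile-bootstrap sketch. The method choice, the function name, and the synthetic scores are all assumptions. Only the cohort size of 205 comes from the text.

```python
import numpy as np


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval."""
    a = np.asarray(scores_a, dtype=np.float64)
    b = np.asarray(scores_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("paired scores must have the same shape")
    diffs = a - b
    rng = np.random.default_rng(seed)
    # Resample cases with replacement, keeping each pair together.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)


# Synthetic per-case scores for two methods on the same 205 cases
# (made-up data, used only to exercise the function).
rng = np.random.default_rng(1)
method_b = rng.uniform(0.3, 0.7, size=205)
method_a = method_b + 0.1 + rng.normal(0.0, 0.05, size=205)
mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(method_a, method_b)
```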

surgicalscore lead

Envisage is highest at 0.599 [0.579, 0.619], vs ICEdit 0.502, InstructPix2Pix 0.337, and FLUX.1-Kontext 0.229: a paired lead of +0.090 over ICEdit, +0.245 over InstructPix2Pix, +0.189 over FLUX.1-Kontext (all p<1e-4, N=205 four-way intersection).

smallest arcface gap

No method beats the unedited input on full-face ArcFace; Envisage's gap is the smallest at −0.048 vs −0.139 / −0.242 / −0.294 for the baselines. Envisage's output-to-GT cosine is 0.662 vs input-to-GT 0.711: the edit moves the nose toward the target, and the metric reads that as identity loss.

beats-input fraction

Envisage beats the input proxy on 16.1% of cases, vs 0.0% / 4.3% / 1.0% for InstructPix2Pix / ICEdit / Kontext. Under a confounded metric, even a small positive fraction is a meaningful separation.

component ablation

Each design choice carries weight: hard-mask composite ΔSS +0.034 (p=.001), the 24-preset anatomy prompt +0.034 (p=.002), depth-ControlNet conditioning +0.023 (p=.030).

the lead is not just the composite

Hand the baselines the same hard-mask composite and Envisage still holds a paired-significant lead: +0.045 over InstructPix2Pix, +0.058 over ICEdit. The edit quality, not the compositing trick, is doing the work.

5-seed oracle headroom

A GT-aware best-of-5 selection (an upper bound, not deployable) cuts the residual ArcFace gap by 73% (−0.054 → −0.015) and raises SurgicalScore from 0.609 to 0.743 on the 5-seed-valid subset (N=207). The good candidate already exists in the 5-seed sweep; the open problem is a non-oracle ranker to find it.

method            surgicalscore   arcface gap   beats input
Envisage (ours)   0.599           −0.048        16.1%
ICEdit            0.502           −0.139        4.3%
InstructPix2Pix   0.337           −0.294        0.0%
FLUX.1-Kontext    0.229           −0.242        1.0%

SurgicalScore (mask-decomposed, higher is better) and full-face ArcFace gain over the unedited input (every method is negative, a finding about the metric). Rhinoplasty cohort; four-way SurgicalScore on the N=205 intersection, all paired gaps p<1e-4. A perfect-predictor GT-paste control anchors SS_raw at 0.919; a 5-seed GT-oracle upper bound reaches 0.743.
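The 5-seed GT-oracle bound mentioned above is, in essence, a per-case maximum over seeds of a score computed with access to the ground truth. A minimal sketch with made-up scores follows. Using seed 0 as the non-oracle stand-in is an assumption for illustration, not how the pipeline selects its output.

```python
import numpy as np


def oracle_best_of_k(scores):
    """Per-case best score over seeds.

    scores: (n_cases, k_seeds) array of GT-aware scores. Taking the max
    requires the ground truth, so this is an upper bound, not a deployable
    selector.
    """
    return np.asarray(scores, dtype=np.float64).max(axis=1)


# Made-up scores for two cases and five seeds each.
scores = np.array([[0.55, 0.62, 0.70, 0.58, 0.60],
                   [0.48, 0.51, 0.47, 0.66, 0.50]])
oracle = oracle_best_of_k(scores)
# Headroom relative to always taking seed 0 (a stand-in for a non-oracle pick).
headroom = oracle.mean() - scores[:, 0].mean()
```

The gap between the oracle mean and whatever a deployable ranker achieves is the headroom this page names as the open problem.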

A single benchmark can hide a confound. The negative gap and the operational gate were re-checked on an external corpus and read by the surgeon co-author.

457-pair external corpus

On an out-of-cohort ASPS/PCA corpus of 457 pairs, the structural negative gap replicates (pooled GT-to-pooled 0.597 vs baseline 0.664): the confound is not an artifact of the headline cohort. Of these cases, 438/457 (95.8%) pass the identity gate.

surgeon-in-the-loop

A board-certified facial plastic surgeon, the co-author, Dr. Amit D. Bhrany, reviewed six representative outputs as a feasibility check on surgical plausibility. A blinded multi-rater GAIS/ROE study is named explicitly as future work, not claimed here.

Baselines: InstructPix2Pix, ICEdit (FLUX.1-Fill-dev + MoE-LoRA diptych), FLUX.1-Kontext-dev; internal diagnostics Direct Copy, TPS Warp, and FLUX inpainting without ControlNet. Data: HDA Plastic Surgery Database + external ASPS Photo Gallery + a private clinical archive (PCA).

engineering

I built the full pipeline: the procedure-specific TPS pre-warp, landmark-indexed depth modification, FLUX.1-Fill + depth-ControlNet inpainting, the hard-mask composite, the 5-seed sweep and 7-gate scorer, the 24-preset library, and the Gradio + FastAPI live demo on Hugging Face Spaces.

research

I designed and validated SurgicalScore, ran the four-way benchmark and component ablations, established the full-face ArcFace confound and the LandmarkDiff decomposition that motivated it, ran the 457-pair external validation, and wrote the paper. First author; Dr. Bhrany contributed the surgical review.

paper

Envisage: Diffusion-Based Rhinoplasty Goal Visualization with Mask-Decomposed Evaluation. Mudit Agarwal (University of Washington) and Amit D. Bhrany, MD (UW School of Medicine, Otolaryngology–Head & Neck Surgery). Preprint; arXiv forthcoming.

live demo

Run the pipeline on your own frontal photo across the 24 clinical presets for rhinoplasty, blepharoplasty, or rhytidectomy. Hosted on Hugging Face Spaces.

source code

Released as infrastructure: the Envisage pipeline, the SurgicalScore implementation, the preset definitions, and matched split manifests. Every reported number is reproducible from the repository.

Envisage is the successor to LandmarkDiff, the author's earlier landmark-conditioned diffusion system. Decomposing LandmarkDiff's identity score (0.509 with compositing, 0.023 without) revealed the structural confound that motivated SurgicalScore and the inpainting-first redesign.

diffusion

FLUX.1-Fill, depth ControlNet, Depth Anything V2

geometry + masks

MediaPipe (478 landmarks), thin-plate splines (scipy), OpenCV

evaluation

ArcFace / InsightFace buffalo_l, LPIPS, SER-FIQA / CR-FIQA, SSIM, Monk Skin Tone

serving + infra

PyTorch, NVIDIA L40S, BF16, Gradio, FastAPI, Hugging Face Spaces