CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Weilin Chen, Jiahao Rao, Wenhao Wang, Xinyang Li, Xuan Cheng*, Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
CVPR 2026

*Corresponding Author -- chengxuan@xmu.edu.cn
Teaser image

CustomTex generates high-fidelity textures for a 3D scene mesh, driven by instance-specific reference images.

Abstract

The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they often lack the precision required for fine-grained, instance-level control, and they tend to produce textures with artifacts and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene together with reference images specifying the desired appearance of each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that decouples semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with instance-aware cross-attention, to ensure semantic plausibility and reference-instance alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, fewer artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.
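The dual-distillation idea can be illustrated with a minimal sketch: two gradient signals, one semantic-level (VSD-style) and one pixel-level, are combined into a single update on the texture parameters. The function name, the weighting scheme, and the plain gradient-descent step below are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dual_distillation_step(texture, grad_vsd_fn, grad_sr_fn, lam=0.5, lr=0.1):
    """One hypothetical optimization step on a texture parameter array.

    grad_vsd_fn: semantic-level (VSD-style) gradient of the texture
    grad_sr_fn:  pixel-level (super-resolution) gradient of the texture
    lam:         assumed weighting between the two distillation signals
    """
    # Sum the two distillation gradients and take a gradient-descent step.
    g = grad_vsd_fn(texture) + lam * grad_sr_fn(texture)
    return texture - lr * g
```

In practice each gradient would come from a diffusion prior rather than a closed-form function, but the update structure is the same.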

Method Overview

Method overview of CustomTex

Pipeline of CustomTex. CustomTex textures a complete 3D indoor scene by optimizing a texture map in UV space through a dual-distillation training approach. In each iteration, the 3D scene with the optimized texture is rendered from a random viewpoint, producing an RGB image, a depth map, and instance masks. The instance masks are used to align each reference image's features with the correct object instance in the rendered RGB image via a specialized cross-attention. The Variational Score Distillation gradient and the Super-Resolution gradient, both conditioned on the aligned reference images, are then computed to update the texture field.
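The instance-aware cross-attention described above can be sketched as follows: each rendered-pixel query attends only to reference features that belong to the same object instance, with the instance masks gating the attention logits. This is a minimal NumPy sketch under our own assumptions (the function name, flattened-feature layout, and single-head formulation are illustrative, not the paper's code):

```python
import numpy as np

def instance_masked_attention(q, kv, q_inst, kv_inst):
    """Cross-attention gated by instance masks.

    q:       (Nq, d) query features from the rendered view (pixels, flattened)
    kv:      (Nk, d) key/value features extracted from the reference images
    q_inst:  (Nq,)   instance id of each query pixel
    kv_inst: (Nk,)   instance id of each reference feature

    Assumes every query instance has at least one reference feature.
    """
    d = q.shape[-1]
    logits = q @ kv.T / np.sqrt(d)                  # (Nq, Nk)
    # Gate the logits: a pixel may only attend to its own instance's reference.
    same = q_inst[:, None] == kv_inst[None, :]      # (Nq, Nk) boolean mask
    logits = np.where(same, logits, -np.inf)
    # Numerically stable softmax over the allowed keys.
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv
```

Masking the logits (rather than the output) keeps the attention weights normalized within each instance, so a pixel's updated feature is a convex combination of only its own reference's features.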

Reference-Guided Comparison

Each comparison panel shows two reference images alongside the textured results of Paint3D, HY3D-2.1, SceneTex-IPA, and ours.

Stylization for More Scenes

High-Quality Renderings

BibTeX

@misc{CustomTex2025,
  title={CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization},
  author={Weilin Chen and Jiahao Rao and Wenhao Wang and Xinyang Li and Xuan Cheng and Liujuan Cao},
  year={2025},
  url={https://chenweilinx.github.io/CustomTex/},
  note={Preprint}
}