CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
Abstract
The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they often lack the precision required for fine-grained, instance-level control, and tend to produce textures with artifacts and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene together with reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that decouples semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with instance-aware cross attention, to ensure semantic plausibility and reference-instance alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.
Method Overview
Pipeline of CustomTex. CustomTex textures a complete 3D indoor scene by optimizing a texture map in UV space through a dual-distillation training approach. In each iteration, the scene is rendered with the current texture from a random viewpoint, producing an RGB image, a depth map, and instance masks. The instance masks are used to align each reference image's features with the correct object instance in the rendered RGB image via a specialized cross-attention. The Variational Score Distillation gradient and the Super-Resolution gradient are then computed, conditioned on the aligned reference images, and used to update the texture field.
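The alignment step in the pipeline can be illustrated with a small sketch of mask-restricted cross-attention. This is not the CustomTex implementation: the page does not give the exact formulation, so the function name `masked_cross_attention`, its arguments, and the zero-output rule for pixels with no matching reference are all assumptions made for illustration. The idea shown is only that each rendered pixel attends exclusively to feature tokens from the reference image assigned to its own instance.

```python
import math


def masked_cross_attention(queries, pixel_ids, keys, values, token_ids):
    """Hypothetical instance-restricted cross-attention (illustrative only).

    queries   : one feature vector per rendered pixel
    pixel_ids : instance id of each pixel, as read from the instance masks
    keys      : one key vector per reference-image token
    values    : one value vector per reference-image token
    token_ids : instance id whose reference image each token came from
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    outputs = []
    for query, pixel_id in zip(queries, pixel_ids):
        # Keep only tokens from the reference image of this pixel's instance.
        allowed = [j for j, tid in enumerate(token_ids) if tid == pixel_id]
        if not allowed:
            # Assumption: pixels without a reference receive no conditioning.
            outputs.append([0.0] * out_dim)
            continue
        scores = [
            scale * sum(a * b for a, b in zip(query, keys[j])) for j in allowed
        ]
        # Numerically stable softmax over the allowed tokens only.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][k] for w, j in zip(weights, allowed)) / total
            for k in range(out_dim)
        ])
    return outputs
```

With two instances and one reference token each, a pixel labeled as instance 0 receives only the value of the instance-0 token, regardless of how strongly its query matches tokens from other references. In a real diffusion backbone the same restriction would more likely be applied as an additive mask on the attention logits.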
Reference-Guided Comparison
[Figure: two qualitative comparison examples. Columns: Reference Image, Paint3D, HY3D-2.1, SceneTex-IPA, Ours.]
Stylization for More Scenes
[Figure: five examples, each pairing a Reference Image with the Textured Result (Ours).]
High-Quality Renderings
The "living room" texture generated by our method, rendered at 2,000 × 2,000 resolution.
The "bedroom" texture generated by our method, rendered at 2,000 × 2,000 resolution.
BibTeX
@misc{CustomTex2025,
title={CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization},
author={Weilin Chen and Jiahao Rao and Wenhao Wang and Xinyang Li and Xuan Cheng and Liujuan Cao},
year={2025},
url={https://chenweilinx.github.io/CustomTex/},
note={Preprint}
}