3D representation in 512-Byte:
Variational tokenizer is the key for autoregressive 3D generation

Teaser Image

Abstract: Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient is the tokenizer, which compresses high-resolution image patches into a manageable number of discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches, which naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish interconnections by allocating themselves to different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During the decoding phase, a high-resolution triplane is used to convert these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to 250× compression, reducing a 1 MB mesh to just 3.9 KB with a 96% F-score, and can compress further to 256 int8 tokens, achieving a 2000× reduction while maintaining a 92% F-score.
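As a quick back-of-the-envelope check on these ratios (our own arithmetic, not from the paper, taking 1 MB = 2^20 bytes):

\[
\frac{2^{20}\,\text{B}}{3.9 \times 2^{10}\,\text{B}} \approx 263 \approx 250\times,
\qquad
\frac{2^{20}\,\text{B}}{512\,\text{B}} = 2048 \approx 2000\times.
\]

So the 2000× figure corresponds to a code of roughly 512 bytes, consistent with the "512-Byte" of the title.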

Teaser Image

Variational Tokenizer (VAT) compresses unordered 3D data into compact 1D latent tokens, while supporting high-fidelity 3D generation via autoregressive modeling.

Two-stage training

Methodology Image

(a) VAT compresses the 3D input features into a shorter 1D sequence of latent tokens. The encoder's output retains only the latent tokens, resulting in a compact 1D latent representation that preserves the original information. Next, Variational Vector Quantization (VVQ) maps the 1D latent onto a Gaussian distribution, where quantization is applied residually across scales. This process allows tokens to self-organize into distinct subspaces within the same Gaussian distribution. Following vector quantization, the triplane decoder recovers the output features from the discrete token maps, and a triplane-based convolutional neural network, combined with an MLP, upsamples the low-resolution features into a high-resolution 3D occupancy grid.
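To make the variational, multi-scale residual quantization concrete, below is a minimal PyTorch sketch. This is our own illustrative reading, not the released code: the VAE-style reparameterization, codebook shape, scale schedule, and linear up/downsampling are all assumptions.

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Variational step (assumed VAE-style): sample the 1D latent from N(mu, sigma^2).
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def quantize(z, codebook):
    # Nearest-neighbor lookup. z: (B, L, D); codebook: (K, D).
    d = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, L, K)
    idx = d.argmin(dim=-1)                                               # (B, L)
    return codebook[idx], idx

def multiscale_residual_quantize(z, codebook, scales=(1, 4, 16, 64, 256)):
    # Residual quantization with token counts growing across scales:
    # each scale quantizes what the coarser scales could not explain.
    B, L, D = z.shape
    residual, z_hat, indices = z, torch.zeros_like(z), []
    for s in scales:
        # Downsample the residual to s tokens (1D interpolation over length).
        r_s = F.interpolate(residual.transpose(1, 2), size=s,
                            mode="linear", align_corners=False).transpose(1, 2)
        q_s, idx = quantize(r_s, codebook)
        indices.append(idx)
        # Upsample the quantized tokens back to full length and peel them off.
        q_up = F.interpolate(q_s.transpose(1, 2), size=L,
                             mode="linear", align_corners=False).transpose(1, 2)
        z_hat = z_hat + q_up
        residual = residual - q_up
    return z_hat, indices  # reconstruction for the decoder, per-scale token ids

Because every scale quantizes residuals of the same Gaussian-distributed latent, the per-scale codes occupy subspaces of one shared distribution, which is what makes coarse-to-fine prediction well-posed.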

(b) In the second stage, we train the Next-Scale Autoregressive Transformer on discrete tokens. The discrete tokens produced by VAT serve as the supervision signal for a decoder-only transformer trained for next-scale prediction. The model is conditioned on image and text features, uses a causal attention mask, and is trained with a cross-entropy loss.
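The following PyTorch sketch shows the shape of such a training step under common assumptions (prefix conditioning, a single token-level causal mask, scales flattened coarse-to-fine into one sequence); the class name and hyperparameters are illustrative, not the paper's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NextScaleAR(nn.Module):
    # Decoder-only transformer: condition features form a prefix, followed by
    # the VAT token indices of all scales, concatenated coarse-to-fine.
    def __init__(self, vocab=4096, dim=512, depth=8, heads=8, cond_dim=768, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        self.cond_proj = nn.Linear(cond_dim, dim)  # image/text features -> prefix
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, cond):
        # tokens: (B, T) discrete VAT indices; cond: (B, C, cond_dim)
        x = torch.cat([self.cond_proj(cond), self.tok_emb(tokens)], dim=1)
        x = x + self.pos_emb[:, : x.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.head(self.blocks(x, mask=mask))  # (B, C + T, vocab)

def training_step(model, tokens, cond):
    # Next-token cross-entropy: the logits at position j predict position j + 1,
    # so predictions for all T tokens start at the last condition position.
    n_cond = cond.size(1)
    logits = model(tokens, cond)[:, n_cond - 1 : -1]  # (B, T, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))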

Scaling law with network parameters N

Scaling Law Figure

3D Generation Comparison

Qualitative Comparison Figure

Comparison of state-of-the-art 3D generation methods on in-the-wild images. Note that the commercial systems shown on the left may be trained on thousands of additional in-house assets, whereas our model is trained only on the Objaverse dataset.

Quantitative Comparison Table

Quantitative comparison of state-of-the-art 3D generation methods.

Necessity of VVQ

Ablation Figure

Reconstruction results with varying numbers of tokens, with and without in-context token compression (Comp.) and Variational Vector Quantization (VVQ). By incorporating VVQ, our VAT achieves the best balance between reconstruction accuracy and cross-scale consistency.

Ablation Figure

Visualization of meshes reconstructed from different scales of tokens, with and without VVQ.

A 512-Byte code can represent delicate geometry (2000× compression):

Results of 3D generation conditioned on images:

Mesh visualization with generated textures:

BibTeX


@misc{zhang20243drepresentation512bytevariationaltokenizer,
  title={3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation},
  author={Jinzhi Zhang and Feng Xiong and Mu Xu},
  year={2024},
  eprint={2412.02202},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.02202},
}