READ-CLIP: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

READ-CLIP method overview and a radar chart comparing READ-CLIP against CLIP, NegCLIP, CE-CLIP, and FSC-CLIP on WhatsUp, SugarCrepe, SugarCrepe++ (ITT), SugarCrepe++ (TOT), VALSE, and CREPE.

Left: READ adds a token-level reconstruction objective (via a frozen text decoder) and a sentence-level alignment objective on top of CLIP's image–text contrastive loss. Right: READ-CLIP sets a new state of the art across five compositional reasoning benchmarks.

Abstract

Vision–language models trained with standard contrastive objectives often behave like a bag of words: they attend to individual tokens rather than the relationships between them, and struggle with compositional reasoning. We propose READ (REconstruction and Alignment of text Descriptions), a lightweight fine-tuning recipe that augments the contrastive loss with two auxiliary objectives: a token-level reconstruction objective, in which a frozen text decoder reconstructs related captions from the CLIP text embedding, and a sentence-level alignment objective that pulls paraphrases together in the embedding space. The two objectives are complementary—reconstruction captures word relationships within a caption, while alignment keeps representations consistent across paraphrases with different wording. Applied to CLIP (and to variants such as NegCLIP and FSC-CLIP), READ sets a new state of the art across five compositional reasoning benchmarks: SugarCrepe++, WhatsUp, CREPE, VALSE, and SugarCrepe.

Method

READ keeps CLIP's image–text contrastive loss and adds two training-only objectives on the text side. At inference, READ-CLIP is a drop-in CLIPModel—no decoder, no extra parameters, and the same compute as the original CLIP.
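As a rough sketch of how the three terms might be combined during fine-tuning, the objective can be written as a weighted sum. The additive form and the weights \lambda_{\text{rec}} and \lambda_{\text{align}} are assumptions made here for illustration; the exact formulation and values are given in the paper.

```latex
\mathcal{L}_{\text{READ}}
  = \mathcal{L}_{\text{CLIP}}
  + \lambda_{\text{rec}} \, \mathcal{L}_{\text{rec}}
  + \lambda_{\text{align}} \, \mathcal{L}_{\text{align}}
```

Here \mathcal{L}_{\text{CLIP}} is the image–text contrastive loss, \mathcal{L}_{\text{rec}} the token-level reconstruction loss, and \mathcal{L}_{\text{align}} the sentence-level alignment loss. Only \mathcal{L}_{\text{CLIP}} involves the image encoder; the two auxiliary terms act on the text side and are dropped at inference.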

(1) Token-level reconstruction: a linear projector feeds the text embedding into a frozen decoder that reconstructs a paraphrased caption. (2) Sentence-level alignment: paraphrases are aligned with their original captions in the embedding space.

Token-level reconstruction

A frozen T5 decoder (t5-v1_1-large) reconstructs related captions from the CLIP text embedding, forcing the embedding to retain word-relationship information rather than a bag of tokens.
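The reconstruction objective can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' training code: `TinyDecoder` is a hypothetical stand-in for the frozen T5 decoder (so the example runs without downloading weights), the dimensions and vocabulary size are made up, and feeding the projected embedding to the decoder as a single cross-attention token is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes for illustration only.
CLIP_DIM, DEC_DIM, VOCAB, PAD_ID = 512, 64, 100, 0


class TinyDecoder(nn.Module):
    """Stand-in for the frozen text decoder (NOT T5; illustration only)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DEC_DIM, padding_idx=PAD_ID)
        layer = nn.TransformerDecoderLayer(
            DEC_DIM, nhead=4, dim_feedforward=128, dropout=0.0, batch_first=True
        )
        self.blocks = nn.TransformerDecoder(layer, num_layers=1)
        self.lm_head = nn.Linear(DEC_DIM, VOCAB)

    def forward(self, input_ids, memory):
        seq_len = input_ids.size(1)
        # Boolean causal mask: True marks positions that may NOT be attended to.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        hidden = self.blocks(self.embed(input_ids), memory, tgt_mask=causal)
        return self.lm_head(hidden)


def reconstruction_loss(text_emb, target_ids, projector, decoder):
    """Cross-entropy of the decoder predicting target_ids from the projected embedding."""
    memory = projector(text_emb).unsqueeze(1)      # (B, 1, DEC_DIM): one conditioning token
    logits = decoder(target_ids[:, :-1], memory)   # teacher forcing on shifted targets
    return F.cross_entropy(
        logits.reshape(-1, VOCAB), target_ids[:, 1:].reshape(-1), ignore_index=PAD_ID
    )


projector = nn.Linear(CLIP_DIM, DEC_DIM)                  # trainable linear projector
decoder = TinyDecoder().requires_grad_(False).eval()      # frozen decoder

text_emb = torch.randn(4, CLIP_DIM, requires_grad=True)   # stands in for CLIP text embeddings
target_ids = torch.randint(1, VOCAB, (4, 8))              # stands in for tokenized related captions
loss = reconstruction_loss(text_emb, target_ids, projector, decoder)
loss.backward()
```

Freezing the decoder's parameters does not block gradients from flowing through it, so the loss still updates the projector and, in real training, the CLIP text encoder that produced `text_emb`.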

Sentence-level alignment

Paraphrases of the same caption are pulled together in the embedding space, making the text encoder robust to surface wording and improving consistency across rephrasings.
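One common way to pull paired sentences together is a symmetric InfoNCE loss with in-batch negatives, sketched below. The loss form, the `alignment_loss` helper, and the temperature of 0.07 are assumptions for illustration; the source only states that paraphrases are pulled together in the embedding space.

```python
import torch
import torch.nn.functional as F


def alignment_loss(caption_emb, paraphrase_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of one input is the positive for row i of the other."""
    a = F.normalize(caption_emb, dim=-1)
    b = F.normalize(paraphrase_emb, dim=-1)
    logits = a @ b.t() / temperature                       # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


torch.manual_seed(0)
captions = torch.randn(8, 512)                             # stand-in caption embeddings
close_paraphrases = captions + 0.01 * torch.randn(8, 512)  # nearly identical embeddings
unrelated = torch.randn(8, 512)                            # embeddings with no pairing

loss_close = alignment_loss(captions, close_paraphrases)
loss_far = alignment_loss(captions, unrelated)
```

With these synthetic inputs, `loss_close` is much smaller than `loss_far`: the loss is low when each caption sits nearest to its own paraphrase and high when the pairing carries no signal.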

Results

Compositional reasoning accuracy (%) with a ViT-B/32 backbone, compared against strong baselines.

Benchmark	READ-CLIP	NegCLIP	FSC-CLIP
WhatsUp	43.9	42.4	39.8
VALSE	76.2	73.7	74.4
CREPE	41.5	30.5	42.5
SugarCrepe	87.0	83.6	85.2
SugarCrepe++ (ITT)	69.8	65.0	67.9
SugarCrepe++ (TOT)	66.2	62.5	64.4
Average	64.1	59.6	62.4

See the paper for the full set of baselines and ablations.
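The Average row is consistent with an unweighted mean over the six rows of the table, which the short check below recomputes from the values listed above.

```python
# Per-benchmark accuracies copied from the table above, in row order:
# WhatsUp, VALSE, CREPE, SugarCrepe, SugarCrepe++ (ITT), SugarCrepe++ (TOT).
scores = {
    "READ-CLIP": [43.9, 76.2, 41.5, 87.0, 69.8, 66.2],
    "NegCLIP": [42.4, 73.7, 30.5, 83.6, 65.0, 62.5],
    "FSC-CLIP": [39.8, 74.4, 42.5, 85.2, 67.9, 64.4],
}

# Unweighted mean per model, rounded to one decimal place as in the table.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}
print(averages)
```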

Usage

The checkpoint loads directly with transformers as a standard CLIPModel:

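import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Weights come from the READ-CLIP checkpoint; preprocessing reuses the CLIP ViT-B/32 processor.
model = CLIPModel.from_pretrained("Mayfull/READ-CLIP")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "cat.jpg" is a placeholder path; substitute any local image.
image = Image.open("cat.jpg").convert("RGB")

inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # Softmax over the captions: probability that each caption matches the image.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)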

Citation

@inproceedings{kwon2026enhancing,
  title={Enhancing Compositional Reasoning in {CLIP} via Reconstruction and Alignment of Text Descriptions},
  author={Jihoon Kwon and Kyle Min and Jy-yong Sohn},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=6uKIm4bfEe}
}