230704 - Shirakawa - UMAP visualization of CLIP text features

Summary

In this report, we used UMAP to visualize the semantic (text) features of the Natural Scenes Dataset (NSD). The visualization showed a few dozen clusters within the semantic feature space. Furthermore, the NSD test set overlapped substantially with the training set in this space.

Methods

Natural Scenes Dataset (NSD)

The Natural Scenes Dataset (NSD) [1] is a massive high-resolution (7T) fMRI dataset. Across all subjects, tens of thousands of natural scene images were presented in the MRI scanner. The 73,000 NSD images are taken from the MSCOCO image database [2] and cover 80 of the original 90 COCO categories. Each of the eight subjects in the NSD study viewed approximately 10,000 images in the MRI scanner, with each image presented up to three times. The authors of the NSD designated a special subset of 1,000 images, known as the "shared set," which all subjects viewed. The remaining images, which make up the "subject-unique-set," were each viewed by only one subject. Researchers who develop image reconstruction methods often use the subject-unique-set as training data and evaluate reconstruction performance on the shared set [3, 4, 5, 6, 7]. Note that NSD itself was not designed for image reconstruction analysis.

In this report, we selected the images presented to subject 1. Hereafter, "nsd-train" refers to the subject1-unique-set (8,859 images) and "nsd-test" refers to the shared set (982 images).

Text feature

NSD images also have annotations (text descriptions of each image). We extracted these annotations using the NSD access package (https://github.com/tknapen/nsd_access). Following [3], all annotations corresponding to each image were processed with the CLIP text encoder, the same text encoder used in Stable Diffusion v1.4. The text feature of an image was then obtained by averaging the encoder outputs across its annotations.
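
The averaging step can be sketched as follows. This is a minimal illustration, not the code used for this report: `encode_caption` is a hypothetical stand-in for the CLIP text encoder, and the toy encoder below is only a placeholder so that the sketch runs without model weights. The shape of the encoder output is not specified above, so the function simply averages whatever fixed-shape array the encoder returns.

```python
import numpy as np


def average_text_feature(captions, encode_caption):
    """Average the encoder outputs of all captions belonging to one image.

    encode_caption is a hypothetical callable that maps a caption string to a
    fixed-shape NumPy array (a stand-in for the CLIP text encoder).
    """
    if not captions:
        raise ValueError("an image needs at least one caption")
    # Stack the per-caption outputs along a new leading axis, then average over it.
    outputs = np.stack(
        [np.asarray(encode_caption(c), dtype=np.float64) for c in captions]
    )
    return outputs.mean(axis=0)


# Toy stand-in encoder (assumption): NOT CLIP, only here to make the sketch runnable.
def toy_encoder(caption):
    return np.array([len(caption), caption.count(" ") + 1], dtype=np.float64)


captions = ["a dog on grass", "a brown dog running"]  # made-up example captions
feature = average_text_feature(captions, toy_encoder)
```

In practice `toy_encoder` would be replaced by a function that tokenizes the caption and returns the CLIP text encoder output, so that every image ends up with one feature of the same shape regardless of how many annotations it has.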

Visualization

The text features were visualized using Uniform Manifold Approximation and Projection (UMAP) [8]. UMAP is a nonlinear dimensionality reduction technique that transforms high-dimensional data into a low-dimensional space while retaining meaningful properties of the original data.

All text features (nsd-train and nsd-test) were first standardized and then projected together in a single UMAP embedding. Labels were assigned to the points only after the projection.
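
The projection procedure can be sketched as below. It is an illustration under stated assumptions rather than the code behind the figures: it assumes per-dimension z-scoring as the standardization, the `umap-learn` package, default UMAP hyperparameters, and train/test membership as the labels, none of which are specified in this report. The function names and the random placeholder features are hypothetical.

```python
import numpy as np


def standardize(features, eps=1e-8):
    """Z-score each feature dimension (assumed form of the standardization)."""
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)  # eps guards against constant dimensions


def project_jointly(train_features, test_features, random_state=0):
    """Embed train and test features in one shared 2-D UMAP space.

    Returns the 2-D coordinates and a label array that is attached only
    after the joint projection.
    """
    import umap  # provided by the umap-learn package; imported lazily

    stacked = np.concatenate([train_features, test_features], axis=0)
    reducer = umap.UMAP(n_components=2, random_state=random_state)
    coords = reducer.fit_transform(standardize(stacked))
    labels = np.array(
        ["nsd-train"] * len(train_features) + ["nsd-test"] * len(test_features)
    )
    return coords, labels


# Placeholder features (assumption): random numbers, not real NSD text features.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 16))
test = rng.normal(size=(20, 16))
z = standardize(np.concatenate([train, test], axis=0))
# coords, labels = project_jointly(train, test)  # requires umap-learn
```

Fitting UMAP once on the stacked array places both sets in the same embedding, so the positions of nsd-train and nsd-test points can be compared directly.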

Results

Figure 1 shows the UMAP visualization of the NSD text features. We found that 1) there are at most several dozen clusters and 2) the nsd-test features overlap substantially with the nsd-train features.

Figure 1. UMAP visualization of the text features of the images presented to subject 1.

To see which images fall in each cluster, we randomly sampled images from several clusters (Figure 2). We observed a high degree of semantic similarity among the images within a cluster. Moreover, semantically similar images often also have similar image layouts. For example, images from the tennis cluster typically had a green or brown background with a single object, and images from the airplane cluster had blue or gray backgrounds with horizontally oriented shapes.

Figure 2. Images randomly sampled from several clusters.

We checked all clusters and found that each one can be named with a single word (such as “Train,” “Elephant,” “Dog,” “Hydrant,” or “Clock tower”), as shown in Figure 3. Importantly, although each NSD subject was presented with around 10,000 unique images, there were fewer than 40 identifiable semantic clusters.

Figure 3. UMAP visualization with each cluster manually labeled with one word.