Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models

Institute of Computing Technology, Chinese Academy of Sciences
BMVC 2024

* Indicates Equal Contribution

Abstract

Categorization, a core cognitive ability in humans that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy recognition tasks, ranging from fixed datasets to open-world scenarios, have been proposed. The recent development of Large Multimodal Models (LMMs) has demonstrated impressive results in high-level visual tasks, such as visual question answering and video temporal reasoning, by utilizing advanced architectures and large-scale multimodal instruction tuning. Previous researchers have developed holistic benchmarks to measure the high-level visual capability of LMMs, but there is still a lack of pure and in-depth quantitative evaluation of the most fundamental categorization ability. According to research on the human cognitive process, categorization can be seen as comprising two parts: category learning and category use. Inspired by this, we propose a novel, challenging, and efficient benchmark based on composite blocks, called ComBo, which provides a disentangled evaluation framework and covers the entire categorization process from learning to use. By analyzing the results of multiple evaluation tasks, we find that although LMMs exhibit acceptable generalization ability in learning new categories, there are still gaps compared to humans in many respects, such as fine-grained perception of spatial details and relationships, and abstract category understanding. Through the study of categorization, we can provide inspiration for the further development of LMMs in terms of interpretability and generalization.

What is Categorization?

Categorization is a fundamental human cognitive ability. Research in cognitive science suggests that categorization involves two key processes: category learning and category use.

Cognitive Process of Categorization

We model categorization as a process of category learning and use across concrete and abstract spaces. The concrete space includes perceivable visual entities in the real world, while the abstract space represents the categorization rules stored by both humans and LMMs.
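
As a rough illustration of this learning-then-use split, the minimal sketch below (hypothetical attribute annotations and function names, not the paper's actual formalism) treats category learning as inducing an abstract rule from concrete exemplars and category use as applying that rule to a new instance:

    def learn_category(exemplars):
        # Category learning: induce an abstract rule from concrete exemplars.
        # Here the "rule" is simply the attribute-value pairs shared by all
        # exemplars; an LMM would induce this implicitly from images.
        shared = set(exemplars[0].items())
        for ex in exemplars[1:]:
            shared &= set(ex.items())
        return dict(shared)

    def use_category(rule, candidate):
        # Category use: decide whether a new concrete instance fits the rule.
        return all(candidate.get(k) == v for k, v in rule.items())

    # Toy example with made-up attribute annotations.
    exemplars = [
        {"shape": "arch", "material": "wood", "color": "red"},
        {"shape": "arch", "material": "wood", "color": "blue"},
    ]
    rule = learn_category(exemplars)  # {'shape': 'arch', 'material': 'wood'}
    print(use_category(rule, {"shape": "arch", "material": "wood", "color": "green"}))  # True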

ComBo Benchmark

We build a large-scale repository of Composite Blocks by disentangling object attributes such as shape, material, color, and contact points. This results in 9,504 objects, each rendered from random viewpoints to create photorealistic images.
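
The construction can be pictured as enumerating the Cartesian product of the disentangled attributes. The sketch below uses placeholder attribute lists purely for illustration; the actual repository uses its own attribute values, whose combinations yield the 9,504 objects, each of which is then rendered from random viewpoints:

    from itertools import product

    # Placeholder attribute lists; the actual ComBo repository uses its own
    # attribute values, whose combinations total 9,504 objects.
    SHAPES = ["cube", "cylinder", "arch"]
    MATERIALS = ["wood", "metal", "rubber"]
    COLORS = ["red", "green", "blue"]
    CONTACT_POINTS = ["top", "side"]

    def build_repository():
        # Enumerate every attribute combination as one composite-block spec;
        # each spec would then be rendered from random viewpoints.
        return [
            {"shape": s, "material": m, "color": c, "contact_point": p}
            for s, m, c, p in product(SHAPES, MATERIALS, COLORS, CONTACT_POINTS)
        ]

    repo = build_repository()
    print(len(repo))  # 54 with these placeholder lists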

Conclusion

In this work, we introduce the ComBo benchmark, focusing on evaluating the categorization capability of Large Multimodal Models (LMMs). Inspired by research on categorization in cognitive science, we design three evaluation tasks from different perspectives, comprehensively assessing the LMMs' ability in pattern perception, abstract concept alignment, and generalization of categorization. The evaluation results reveal that LMMs still exhibit deficiencies in spatial detail perception, abstract concept reasoning, and learning of new categories. Although in-context learning and Chain-of-Thought (CoT) techniques can further improve the performance of LMMs, there remains a gap compared to human categorization capability, which suggests directions for future improvements in LMMs.

BibTeX


        @inproceedings{fu2024blocks,
          title={Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models},
          author={Fu, Bin and Wan, Qiyang and Li, Jialin and Wang, Ruiping and Chen, Xilin},
          booktitle={The Thirty-Fifth British Machine Vision Conference (BMVC)},
          year={2024},
          url={https://openreview.net/forum?id=sZGZVedrdS}
        }