Categorization, a core cognitive ability in humans that
organizes objects based on common features, is essential to
cognitive science as well as computer vision. To evaluate the
categorization ability of visual AI models, various proxy tasks on recognition, ranging from datasets to open-world scenarios, have been proposed. The recent development of Large Multimodal Models (LMMs) has demonstrated impressive results on high-level visual tasks, such as visual question answering and video temporal reasoning, by leveraging advanced architectures and large-scale multimodal instruction tuning. Previous researchers have
developed holistic benchmarks to measure the high-level visual
capability of LMMs, but a pure and in-depth quantitative evaluation of the most fundamental categorization ability is still lacking. According to research on human cognitive processes, categorization can be viewed as comprising two parts: category learning and category use. Inspired by this, we
propose a novel, challenging, and efficient benchmark based on
composite blocks, called ComBo, which provides a disentangled
evaluation framework and covers the entire categorization
process from learning to use. By analyzing the results of
multiple evaluation tasks, we find that although LMMs exhibit
acceptable generalization ability in learning new categories,
there are still gaps compared to humans in many respects, such as fine-grained perception of spatial details and relationships, and abstract category understanding. Through the study of
categorization, we can provide inspiration for the further
development of LMMs in terms of interpretability and
generalization.