
In this paper, we introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across granularities and train on decoupled object and part classification. This allows our model to facilitate knowledge transfer among rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme that enables each click point to generate masks at multiple levels, corresponding to multiple ground-truth masks. Notably, this work represents the first attempt to jointly train a model on SA-1B, generic, and part segmentation datasets. Experimental results and visualizations demonstrate that our model successfully achieves semantic-awareness and granularity-abundance. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, leads to performance improvements. We will provide code and a demo for further exploration and evaluation.

Universal and interactive AI systems that follow human intents have shown their potential in natural language processing [46, 47] and controllable image generation [52, 66]. However, such a universal system for pixel-level image understanding remains less explored. We argue that a universal segmentation model should possess the following important properties: universal representation, semantic-awareness, and granularity-abundance. Regardless of the specific image domain or prompt context, the model should be capable of acquiring a versatile representation, predicting segmentation masks at multiple granularities, and understanding the semantic meaning behind each segmented region. Previous works [31, 70, 58] attempted to investigate these properties, but each achieved only part of the goals. The main obstacles impeding progress toward such a universal image segmentation model are limitations in both model architecture flexibility and training data availability.

Model Architecture. Existing image segmentation architectures are dominated by the single-input-single-output pipeline, which discards any ambiguity. While this pipeline is prevalent in both anchor-based CNN architectures [24] and query-based Transformer architectures [4, 11], and has demonstrated remarkable performance on semantic, instance, and panoptic segmentation tasks [39, 68, 30], it inherently prevents the model from predicting multi-granularity segmentation masks in an end-to-end manner. Although clustering post-processing techniques [13] can produce multiple masks for a single object query, they are neither efficient nor effective solutions for a granularity-aware segmentation model.
Figure 1: Our model is capable of dealing with various segmentation tasks, including open-set and interactive segmentation. (a) Our model can perform instance, semantic, panoptic, and part segmentation. (b) Our model can output multi-level semantics with different granularities; the red point on the left-most image is the click. (c) We connect our model with an inpainting model to perform multi-level inpainting. The prompts are “Spider-Man” and “BMW car”, respectively. Note that only one click is needed to produce the results in (b) and (c).

Training Data. Scaling up segmentation datasets that possess both semantic-awareness and granularity-awareness is a costly endeavor. Existing generic object detection and segmentation datasets such as MSCOCO [39] and Objects365 [53] offer large amounts of data and rich semantic information, but only at the object level. On the other hand, part segmentation datasets such as Pascal Part [9], PartImageNet [23], and PACO [49] provide more fine-grained semantic annotations, but their data volumes are limited. Recently, SAM [31] successfully scaled up multi-granularity mask data to millions of images, but it does not include semantic annotations. To achieve the dual objectives of semantic-awareness and granularity-abundance, there is a pressing need to unify segmentation training across these different data formats to facilitate knowledge transfer. However, the inherent differences in semantics and granularity across datasets pose a significant challenge to joint training.
In this paper, we introduce Semantic-SAM, a universal image segmentation model designed to segment and recognize objects at any desired granularity. Given one click point from a user, our model addresses the spatial ambiguity by predicting masks at multiple granularities, accompanied by semantic labels at both the object and part levels. As shown in Figure 1, our model generates multi-level segmentation masks ranging from a person's head to the whole truck.
The multi-granularity capability is achieved through a multi-choice learning design [37, 22] incorporated into the decoder architecture. Each click is represented by multiple queries, each containing a different level of embedding. These queries are trained to learn from all available ground-truth masks representing different granularities. To establish a correspondence between multiple masks and ground truths, we employ a many-to-many matching scheme to ensure that a single click point can generate high-quality masks at multiple granularities.

To accomplish semantic-awareness with a generalized capability, we introduce a decoupled classification approach for objects and parts, leveraging a shared text encoder to encode objects and parts independently. This allows us to perform object and part segmentation separately, while adapting the loss function to the data type. For instance, generic segmentation data lacks a part classification loss, whereas SAM data does not include any classification loss.

To enrich semantics and granularity within our model, we consolidate seven datasets covering three types of granularity: the generic segmentation datasets MSCOCO [39], Objects365 [53], and ADE20k [68]; the part segmentation datasets PASCAL Part [9], PACO [49], and PartImageNet [23]; and SA-1B [31]. Their data formats are reorganized to match our training objectives. After joint training, our model obtains strong performance across a variety of datasets. Notably, we find that learning from interactive segmentation can improve generic and part segmentation. For example, by jointly training on SA-1B promptable segmentation and COCO panoptic segmentation, we achieve gains of 2.3 box AP and 1.2 mask AP. In addition, through comprehensive experiments, we demonstrate that our granularity completeness exceeds SAM's by more than 3.4 points of 1-IoU.
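The many-to-many matching scheme described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' implementation: it assumes each click spawns K level queries producing K candidate masks, and simply assigns every ground-truth granularity to the prediction that overlaps it best, so one query may be matched to several ground truths while every granularity still receives supervision.

```python
# Hypothetical sketch of many-to-many matching between the K masks
# predicted for one click and the M ground-truth masks (one per
# granularity) available for that click. Unlike one-to-one Hungarian
# matching, a prediction may be matched to several ground truths.
import numpy as np

def mask_iou(pred, gt):
    """IoU between one boolean predicted mask and one ground-truth mask."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def many_to_many_match(pred_masks, gt_masks):
    """Assign every ground-truth mask to its best-overlapping prediction.

    pred_masks: (K, H, W) boolean array, one mask per level query.
    gt_masks:   (M, H, W) boolean array, all granularities for the click.
    Returns a list of (pred_idx, gt_idx) pairs; each ground truth appears
    exactly once, but a prediction may appear in several pairs.
    """
    iou = np.array([[mask_iou(p, g) for g in gt_masks] for p in pred_masks])
    return [(int(iou[:, j].argmax()), j) for j in range(len(gt_masks))]

# Toy example: 2 level queries, 3 ground-truth granularities on a 4x4 grid.
rng = np.random.default_rng(0)
preds = rng.random((2, 4, 4)) > 0.5
gts = rng.random((3, 4, 4)) > 0.5
pairs = many_to_many_match(preds, gts)
assert len(pairs) == len(gts)  # every granularity is supervised
```

In training, each matched pair would then contribute a mask loss to its query, so a single click's queries learn all granularities jointly rather than collapsing to one.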