LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
Abstract: While LISA effectively bridges the gap between segmentation and LLMs to enable reasoning segmentation, it has notable limitations: it cannot distinguish different instances of the target region, and it is constrained by pre-defined textual response formats. In this work, we introduce LISA++, an update to the existing LISA model that improves core functionalities while keeping the base architecture intact. The main enhancements in LISA++ are: 1) Enhanced segmentation: instance segmentation ability is added, providing more detailed scene analysis alongside the existing multi-region semantic segmentation. 2) More natural conversation: improved capability for multi-turn dialogue, with the ability to incorporate segmentation results directly into text responses, i.e., Segmentation in Dialogue (SiD). These improvements are achieved by curating existing samples from generic segmentation datasets, specifically to enhance segmentation and conversational skills without structural changes or additional data sources. Comparative analysis with the original LISA model shows significant advancements in these areas, positioning LISA++ as a notable upgrade in visual understanding and interaction. The adaptability and improved features of LISA++ highlight the versatility of the mask-as-embedding paradigm proposed by LISA and its potential as a foundational model for diverse applications.
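Two mechanisms in the abstract are worth making concrete: the mask-as-embedding paradigm (the LLM emits a special segmentation token, and the hidden state at that token is decoded into a mask) and SiD (a text response that interleaves such tokens so each mentioned instance carries its own mask). The following is a minimal sketch of that decoding path, not the released LISA++ implementation; the token id, dimensions, and the `mask_decoder` callable are all assumed, illustrative names.

```python
import torch
import torch.nn as nn

# Sketch of the mask-as-embedding paradigm (assumed names, not LISA++ code):
# the LLM emits a [SEG] token wherever a mask belongs in the response; the
# last-layer hidden state at each [SEG] position is projected and handed to a
# promptable, SAM-style mask decoder to produce one binary mask per token.

SEG_TOKEN_ID = 32000          # hypothetical vocabulary id of the [SEG] token
HIDDEN_DIM, PROMPT_DIM = 4096, 256

class MaskAsEmbeddingHead(nn.Module):
    def __init__(self):
        super().__init__()
        # project LLM hidden states into the mask decoder's prompt space
        self.proj = nn.Linear(HIDDEN_DIM, PROMPT_DIM)

    def forward(self, hidden_states, output_ids, image_embeds, mask_decoder):
        # hidden_states: (seq_len, HIDDEN_DIM) last-layer states for a response
        # output_ids:    (seq_len,) generated token ids for the same response
        seg_positions = (output_ids == SEG_TOKEN_ID).nonzero(as_tuple=True)[0]
        masks = []
        for pos in seg_positions:  # one mask per [SEG] token => per instance
            prompt = self.proj(hidden_states[pos])         # (PROMPT_DIM,)
            masks.append(mask_decoder(image_embeds, prompt))
        return masks  # ordered to match the [SEG] tokens in the text
```

Under SiD, a response such as "The two dogs [SEG] are playing near the bench [SEG]" would yield two entries in `masks`, one per instance, which is what allows LISA++ to separate instances in dialogue rather than returning a single merged region.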