
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model

Published 28 Dec 2023 in cs.CV (arXiv:2312.17240v3)

Abstract: While LISA effectively bridges the gap between segmentation and LLMs to enable reasoning segmentation, it has certain limitations: it cannot distinguish different instances of the target region, and it is constrained by pre-defined textual response formats. In this work, we introduce LISA++, an update to the existing LISA model that improves core functionalities while keeping the base architecture intact. The main enhancements in LISA++ are: 1) Enhanced Segmentation: instance segmentation ability has been added, providing more detailed scene analysis alongside the existing multi-region semantic segmentation. 2) More Natural Conversation: improved capability for multi-turn dialogue, with the ability to incorporate segmentation results directly into text responses, i.e., Segmentation in Dialogue (SiD). These improvements are achieved by curating existing samples from generic segmentation datasets, specifically to enhance segmentation and conversational skills without structural changes or additional data sources. Comparative analysis with the original LISA model shows significant advancements in these areas, positioning LISA++ as a notable upgrade in visual understanding and interaction. LISA++'s adaptability and improved features highlight the versatility of the mask-as-embedding paradigm proposed by LISA, and its potential as a foundational model for diverse applications.
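The Segmentation in Dialogue (SiD) idea described above can be sketched in plain Python. The [SEG] token name follows the original LISA paper; everything else here (the function name, the `<mask:...>` reference format, the bookkeeping logic) is an illustrative assumption, not the paper's implementation. In the real model, each [SEG] token's hidden state is decoded into an instance mask; this sketch only simulates the step that pairs the i-th [SEG] occurrence in a text response with the i-th decoded mask.

```python
SEG_TOKEN = "[SEG]"  # special token emitted by the model (per the LISA paper)

def bind_masks_to_dialogue(response: str, mask_ids: list[str]) -> str:
    """Replace each [SEG] token in a model response with a readable
    reference to its decoded mask, in order of occurrence.

    `mask_ids` stands in for whatever handles the mask decoder produces;
    the <mask:...> format is purely illustrative.
    """
    parts = response.split(SEG_TOKEN)
    if len(parts) - 1 != len(mask_ids):
        raise ValueError("number of [SEG] tokens must match number of masks")
    out = parts[0]
    for mask_id, tail in zip(mask_ids, parts[1:]):
        out += f"<mask:{mask_id}>" + tail
    return out

reply = "The dog on the left is [SEG] and the one on the right is [SEG]."
print(bind_masks_to_dialogue(reply, ["mask_0", "mask_1"]))
# The dog on the left is <mask:mask_0> and the one on the right is <mask:mask_1>.
```

This is what lets segmentation results appear inline in multi-turn answers rather than in a fixed response template: the text and the masks share one output stream, and each mask is anchored to the exact phrase that describes it.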

