Sketching is a versatile tool for externalizing ideas, enabling rapid exploration and visual communication across various disciplines. While artificial systems have made significant advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains a challenge. In this article, we explore SketchAgent, a language-driven sequential sketch generation method that allows users to create, modify, and refine sketches through dynamic conversational interactions.
Sketching is a powerful tool for distilling ideas into their simplest form. Its fluid and spontaneous nature makes it uniquely versatile for visualization, rapid ideation, and communication across cultures, generations, and disciplines. Designers use sketches to explore new ideas, scientists employ them to formulate problems, and children engage in sketching to learn and express themselves. Artificial systems, in principle, have the potential to support and enhance human creativity, problem-solving, and visual expression through sketching, adapting flexibly to their exploratory nature.
Traditionally, sketch generation methods rely on human-drawn datasets to train generative models. However, no dataset fully captures the diversity of sketches, which limits these methods in both scale and variety. Recent advances in vision-language models such as CLIP, and in text-to-image diffusion models, have enabled sketch generation methods that reduce reliance on human-drawn datasets. These methods use guidance from pretrained models together with differentiable rendering to optimize parametric curves, creating sketches that go beyond predefined styles and categories.
SketchAgent leverages the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs) to enable versatile, progressive, language-driven sketching. Our agent can generate sketches across a wide range of textual concepts, from animals to engineering principles. Because it draws stroke by stroke, it supports interactive human-agent sketching as well as iterative refinement and editing through a chat-based dialogue.
Our goal is to enable an off-the-shelf pretrained multimodal LLM to draw sketches based on natural language instructions. Here's an overview of our pipeline:
System Prompt: We provide the model with context about its expertise and introduce it to a grid canvas along with examples of how to use our sketching language for drawing single-stroke primitives.
User Prompt: Includes a description of the desired task and an example of a simple sketch drawn with our sketching language. This assists the agent in preserving the correct format for parsing.
Canvas Representation: We define the canvas as a numbered grid, allowing the agent to reference specific coordinates to enhance its spatial reasoning capabilities.
Sketch Representation: A sketch is defined as a sequence of ordered strokes, each represented by a series of grid coordinates. These coordinates are processed into smooth Bézier curves and rendered onto the canvas.
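The canvas and sketch representations above can be illustrated with a short Python sketch. The grid size (50 cells per side), the 400-pixel canvas, the bottom-left origin, and all function names are assumptions made for illustration. The Catmull-Rom conversion is a stand-in smoothing technique, not necessarily the curve fitting SketchAgent itself uses.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Bezier = Tuple[Point, Point, Point, Point]

GRID_SIZE = 50    # assumed number of grid cells per side
CANVAS_PX = 400   # assumed canvas width and height in pixels


def cell_to_pixel(col: int, row: int) -> Point:
    """Map a 1-based grid cell to the pixel at its center (row 1 at the bottom)."""
    cell = CANVAS_PX / GRID_SIZE
    x = (col - 0.5) * cell
    y = CANVAS_PX - (row - 0.5) * cell  # flip: SVG y grows downward
    return (x, y)


def catmull_rom_to_bezier(points: List[Point]) -> List[Bezier]:
    """Convert a polyline into cubic Bezier segments that pass through every point."""
    if len(points) < 2:
        return []
    # Duplicate the endpoints so the first and last segments have neighbors.
    padded = [points[0]] + points + [points[-1]]
    segments: List[Bezier] = []
    for i in range(1, len(padded) - 2):
        p0, p1, p2, p3 = padded[i - 1], padded[i], padded[i + 1], padded[i + 2]
        c1 = (p1[0] + (p2[0] - p0[0]) / 6, p1[1] + (p2[1] - p0[1]) / 6)
        c2 = (p2[0] - (p3[0] - p1[0]) / 6, p2[1] - (p3[1] - p1[1]) / 6)
        segments.append((p1, c1, c2, p2))
    return segments


def stroke_to_svg_path(cells: List[Tuple[int, int]]) -> str:
    """Render one stroke (a list of grid cells) as an SVG path string."""
    pts = [cell_to_pixel(c, r) for c, r in cells]
    if not pts:
        return ""
    d = f"M {pts[0][0]:.1f} {pts[0][1]:.1f}"
    for _, c1, c2, p2 in catmull_rom_to_bezier(pts):
        d += (f" C {c1[0]:.1f} {c1[1]:.1f},"
              f" {c2[0]:.1f} {c2[1]:.1f},"
              f" {p2[0]:.1f} {p2[1]:.1f}")
    return d


# Example: a three-point arc drawn on the assumed 50x50 grid.
path = stroke_to_svg_path([(10, 10), (25, 30), (40, 10)])
```

Referring to numbered cells rather than raw pixel values keeps the coordinates the model has to emit short and discrete; the smoothing step then turns those coarse points into curves.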
We introduce an intuitive sketching language to the model through in-context examples, enabling it to "draw" using string-based actions. These actions are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks.
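To make the idea of string-based actions concrete, here is a minimal parsing sketch. The tag format (`<stroke id="...">x10y30, ...</stroke>`), the `x<col>y<row>` cell notation, the 50-cell grid, and the names `Stroke` and `parse_sketch` are hypothetical, chosen only for illustration; the article does not specify the exact syntax of the sketching language.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical syntax: one tagged element per stroke, cells written as x<col>y<row>.
STROKE_RE = re.compile(r'<stroke\s+id="([^"]+)">(.*?)</stroke>', re.DOTALL)
CELL_RE = re.compile(r"x(\d+)y(\d+)")


@dataclass
class Stroke:
    label: str                      # semantic name the model gave the stroke
    cells: List[Tuple[int, int]]    # ordered (column, row) grid cells


def parse_sketch(text: str, grid_size: int = 50) -> List[Stroke]:
    """Extract ordered strokes from model output, ignoring surrounding chatter."""
    strokes: List[Stroke] = []
    for label, body in STROKE_RE.findall(text):
        cells = [(int(x), int(y)) for x, y in CELL_RE.findall(body)]
        # Drop cells that fall outside the (assumed) grid.
        cells = [(x, y) for x, y in cells
                 if 1 <= x <= grid_size and 1 <= y <= grid_size]
        if cells:
            strokes.append(Stroke(label, cells))
    return strokes


# Example model output with one valid stroke and one out-of-range stroke.
reply = (
    "Here is the roof of the house.\n"
    '<stroke id="roof">x10y30, x25y45, x40y30</stroke>\n'
    '<stroke id="stray">x99y1</stroke>'
)
strokes = parse_sketch(reply)
```

Parsing defensively (skipping free text and out-of-range cells) matters here because the output comes from a general-purpose language model rather than a dedicated drawing model.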
The canvas remains accessible to both the user and the agent throughout the session. The agent generates strokes sequentially and pauses according to an adjustable stopping token, allowing the user to add their own strokes directly to the canvas. These strokes are then integrated into the agent's sequence, enabling it to continue drawing with real-time canvas updates.
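The turn-taking described above can be sketched as a small loop around a stop token. Everything here is an assumption for illustration: the `</stroke>` stop token, the `SharedCanvas` class, and `fake_model`, which stands in for a real multimodal LLM call and simply over-generates so the truncation is visible.

```python
from typing import Callable, List

STOP_TOKEN = "</stroke>"  # assumed: the agent pauses after each completed stroke


def truncate_at_stop(text: str, stop: str = STOP_TOKEN) -> str:
    """Emulate a stop sequence: keep output up to and including the first stop token."""
    idx = text.find(stop)
    return text if idx == -1 else text[: idx + len(stop)]


class SharedCanvas:
    """Ordered stroke history shared by the user and the agent."""

    def __init__(self) -> None:
        self.strokes: List[str] = []  # serialized strokes in drawing order

    def add_user_stroke(self, stroke: str) -> None:
        # User strokes join the same sequence the agent continues from.
        self.strokes.append(stroke)

    def agent_turn(self, generate: Callable[[str], str]) -> str:
        # The agent sees every stroke so far, then contributes exactly one more.
        context = "\n".join(self.strokes)
        stroke = truncate_at_stop(generate(context))
        self.strokes.append(stroke)
        return stroke


def fake_model(context: str) -> str:
    """Stand-in for an LLM call; deliberately emits more than one stroke."""
    n = context.count("<stroke")
    return (f'<stroke id="agent{n}">x1y1, x2y2</stroke>'
            '<stroke id="extra">x3y3</stroke>')


canvas = SharedCanvas()
first = canvas.agent_turn(fake_model)                        # agent draws one stroke
canvas.add_user_stroke('<stroke id="user">x5y5, x6y6</stroke>')  # user takes a turn
second = canvas.agent_turn(fake_model)                       # agent continues
```

Making the stop token adjustable would let the same loop pause after every stroke, after a group of strokes, or only once the sketch is complete.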
We evaluate SketchAgent's performance qualitatively and quantitatively across a selected set of sketching tasks:
SketchAgent can generate sketches of concepts well beyond standard categories, including scientific concepts, diagrams, and notable landmarks.
SketchAgent generates sketches gradually, stroke by stroke, with each stroke carrying semantic meaning. This approach closely emulates the human sketching process, providing a more natural and dynamic sketch appearance.
We designed a web-based collaborative sketching environment where users and SketchAgent take turns drawing on a shared canvas to create a recognizable sketch from a given textual concept. The results show that collaboratively produced sketches achieve recognition levels close to those made solely by users and higher than those produced by the agent alone.
SketchAgent can perform interactive, text-based sketch editing within a chat dialogue, where the input to the agent combines both text and images. This allows for spatial reasoning and object relation edits, such as adding objects to sketches with specified relative locations or inferring placement based on semantics.
SketchAgent has several limitations:
We have presented SketchAgent, a method for language-driven, sequential sketch generation that can produce versatile sketches in real time and engage meaningfully in collaborative sketching sessions with humans. By leveraging the prior knowledge embedded in pretrained multimodal LLMs through an intuitive sketching language and a grid canvas, SketchAgent represents a meaningful step toward developing general-purpose sketching systems with the potential to enhance human-computer communication and computer-aided ideation.
This work paves the way for future advancements in AI-driven sketching tools that can support iterative, evolving interactivity, fostering a more natural and creative collaboration between humans and machines.