AI Digest

A fresh daily digest of ML and AI articles

SketchAgent: Language-Driven Sequential Sketch Generation

Sketching is a versatile tool for externalizing ideas, enabling rapid exploration and visual communication across various disciplines. While artificial systems have made significant advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains a challenge. In this article, we explore SketchAgent, a language-driven sequential sketch generation method that allows users to create, modify, and refine sketches through dynamic conversational interactions.

Introduction

Sketching is a powerful tool for distilling ideas into their simplest form. Its fluid and spontaneous nature makes it uniquely versatile for visualization, rapid ideation, and communication across cultures, generations, and disciplines. Designers use sketches to explore new ideas, scientists employ them to formulate problems, and children engage in sketching to learn and express themselves. Artificial systems, in principle, have the potential to support and enhance human creativity, problem-solving, and visual expression through sketching, adapting flexibly to their exploratory nature.

Traditionally, sketch generation methods rely on human-drawn datasets to train generative models. However, fully capturing the diversity of sketches within datasets remains challenging, limiting these methods in both scale and diversity. Recent advancements in vision-language models, such as CLIP and text-to-image diffusion models, have enabled sketch generation methods that reduce reliance on human-drawn datasets. These methods leverage pretrained model guidance and differentiable rendering to optimize parametric curves, creating sketches that go beyond predefined styles and categories.

SketchAgent Overview

SketchAgent leverages the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs) to enable versatile, progressive, language-driven sketching. Our agent can generate sketches across a wide range of textual concepts—from animals to engineering principles. Its sequential nature facilitates interactive human-agent sketching and supports iterative refinement and editing through a chat-based dialogue.

Key Features

  • No Training Required: SketchAgent requires no additional training or fine-tuning, making it highly accessible.
  • Sequential Sketching: The agent captures the evolving, dynamic qualities intrinsic to sketching by drawing stroke by stroke.
  • Interactive Collaboration: Users can engage in dialogue-driven drawing and collaborate meaningfully with the agent.
  • Diverse Concept Generation: From simple objects to complex diagrams, SketchAgent can sketch a wide array of concepts.

Method

Our goal is to enable an off-the-shelf pretrained multimodal LLM to draw sketches based on natural language instructions. Here's an overview of our pipeline:

  1. System Prompt: We provide the model with context about its expertise and introduce it to a grid canvas along with examples of how to use our sketching language for drawing single-stroke primitives.

  2. User Prompt: Includes a description of the desired task and an example of a simple sketch drawn with our sketching language. This helps the agent keep its output in a format that can be parsed reliably.

  3. Canvas Representation: We define the canvas as a numbered grid, allowing the agent to reference specific coordinates to enhance its spatial reasoning capabilities.

  4. Sketch Representation: A sketch is defined as a sequence of ordered strokes, each represented by a series of grid coordinates. These coordinates are processed into smooth Bézier curves and rendered onto the canvas.
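
To make steps 3 and 4 concrete, here is a minimal sketch of how grid references might be mapped to canvas coordinates, fit with a Bézier curve, and rendered. The grid size, the "x<col>y<row>" coordinate format, and all function names below are illustrative assumptions, not the paper's exact implementation.

    # Hypothetical illustration of steps 3-4: map grid cells to canvas
    # coordinates, fit a Bezier curve through them, and render the stroke.
    # GRID, the "x<col>y<row>" format, and all names here are assumptions.
    import re
    from math import comb

    import matplotlib.pyplot as plt
    import numpy as np

    GRID = 50  # assumed grid resolution; the paper's canvas may differ

    def cell_to_xy(token: str) -> tuple[float, float]:
        """Map a grid reference like 'x12y34' to normalized canvas coordinates."""
        col, row = map(int, re.match(r"x(\d+)y(\d+)", token).groups())
        return col / GRID, row / GRID

    def bezier(ctrl: list[tuple[float, float]], n: int = 100) -> np.ndarray:
        """Sample a Bezier curve defined by the control points (Bernstein form)."""
        pts = np.asarray(ctrl, dtype=float)       # (k+1, 2) control points
        k = len(pts) - 1
        t = np.linspace(0.0, 1.0, n)
        basis = np.stack([comb(k, i) * t**i * (1 - t) ** (k - i)
                          for i in range(k + 1)])  # (k+1, n) Bernstein basis
        return basis.T @ pts                       # (n, 2) points on the curve

    # One stroke as the agent might emit it (illustrative format).
    stroke = ["x10y40", "x20y45", "x35y30", "x40y10"]
    curve = bezier([cell_to_xy(p) for p in stroke])

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(curve[:, 0], curve[:, 1], color="black", linewidth=2)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.invert_yaxis()  # grid rows are assumed to count downward from the top
    ax.axis("off")
    fig.savefig("stroke.png")

Treating each stroke as a smooth parametric curve rather than raw pixels is what allows the same representation to be re-rendered, edited, or extended in later turns.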

Sketching Language

We introduce an intuitive sketching language to the model through in-context examples, enabling it to "draw" using string-based actions. These actions are processed into vector graphics and then rendered as a sketch on a pixel canvas, which the model can access again for subsequent tasks.
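
As a concrete illustration, the snippet below parses a reply written in a sketching language of this kind into stroke data; the <stroke> tag and point syntax are hypothetical stand-ins for the paper's actual grammar.

    # A hypothetical parser for string-based drawing actions; the <stroke>
    # tags and "x..y.." point syntax are illustrative, not the paper's grammar.
    import re

    def parse_actions(reply: str) -> list[list[str]]:
        """Extract ordered strokes, each a list of grid references."""
        bodies = re.findall(r"<stroke>(.*?)</stroke>", reply, flags=re.S)
        return [re.findall(r"x\d+y\d+", body) for body in bodies]

    reply = "<stroke>x5y5 x10y20 x15y22</stroke><stroke>x15y22 x30y22</stroke>"
    print(parse_actions(reply))
    # [['x5y5', 'x10y20', 'x15y22'], ['x15y22', 'x30y22']]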

Collaborative Sketching

The canvas remains accessible to both the user and the agent throughout the session. The agent generates strokes sequentially and pauses according to an adjustable stopping token, allowing the user to add their own strokes directly to the canvas. These strokes are then integrated into the agent's sequence, enabling it to continue drawing with real-time canvas updates.
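
A schematic of this turn-taking protocol might look as follows; the stopping token, the <stroke> tags, and the generate callable are placeholders standing in for the backbone LLM, not the actual SketchAgent interface.

    # Schematic turn-taking loop; STOP, the tags, and `generate` are
    # placeholders, not the actual SketchAgent API.
    import re
    from typing import Callable

    STOP = "<pause>"  # assumed adjustable stopping token

    def parse_strokes(text: str) -> list[list[str]]:
        """Pull ordered strokes (lists of grid references) out of a reply."""
        return [re.findall(r"x\d+y\d+", body)
                for body in re.findall(r"<stroke>(.*?)</stroke>", text, flags=re.S)]

    def collaborate(generate: Callable[[list[str]], str],
                    user_turns: list[list[str]]) -> list[list[str]]:
        """Alternate agent strokes and user strokes on one shared sequence."""
        strokes: list[list[str]] = []  # the shared, ordered stroke sequence
        history: list[str] = []        # the conversational drawing history
        for user_stroke in user_turns:
            reply = generate(history)  # agent draws until it emits STOP
            strokes += parse_strokes(reply.split(STOP)[0])
            history.append(reply)
            strokes.append(user_stroke)  # user adds a stroke directly
            history.append("<stroke>" + " ".join(user_stroke) + "</stroke>")
        return strokes

Because the user's strokes enter the same history the agent conditions on, the agent's subsequent strokes can respond to them, which is what makes the session feel turn-based rather than parallel.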

Results

We evaluate SketchAgent's performance qualitatively and quantitatively across a selected set of sketching tasks:

Text-Conditioned Sketch Generation

SketchAgent demonstrates the capability to generate sketches of various concepts beyond standard categories, including scientific concepts, diagrams, and notable landmarks. For example:

  • Scientific Concepts: Double-slit experiment, pendulum motion, photosynthesis, DNA replication, Newton's laws of motion, electromagnetic spectrum, plate tectonics, quantum entanglement, cell division (mitosis), black hole formation.
  • Diagrams: Circuit diagram, flowchart, organizational chart, ER diagram (Entity-Relationship), Venn diagram, mind map, Gantt chart, network topology diagram, pie chart, decision tree.
  • Notable Landmarks: Taj Mahal, Eiffel Tower, Great Wall of China, Pyramids of Giza, Statue of Liberty, Colosseum, Sydney Opera House, Big Ben, Mount Fuji, Machu Picchu.

Sequential Sketching

SketchAgent generates sketches gradually, stroke by stroke, with each stroke carrying semantic meaning. This approach closely emulates the human sketching process, providing a more natural and dynamic sketch appearance.

Human-Agent Collaborative Sketching

We designed a web-based collaborative sketching environment where users and SketchAgent take turns drawing on a shared canvas to create a recognizable sketch from a given textual concept. The results show that collaboratively produced sketches achieve recognition levels close to those made solely by users and higher than those produced by the agent alone.

Chat-Based Sketch Editing

SketchAgent can perform interactive, text-based sketch editing within a chat dialogue, where the input to the agent combines both text and images. This allows for spatial reasoning and object relation edits, such as adding objects to sketches with specified relative locations or inferring placement based on semantics.
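
As an illustration of such a combined image-and-text request, here is one way to assemble it with the Anthropic messages API; treating Claude as the backbone, the model name, and the prompt wording are all assumptions on our part rather than details from the article.

    # Illustrative multimodal edit request; the choice of backbone, the model
    # name, and the prompt wording are assumptions, not specified here.
    import base64

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("current_sketch.png", "rb") as f:
        sketch_b64 = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": sketch_b64}},
                {"type": "text",
                 "text": "Add a sun above the house in this sketch, replying "
                         "only with strokes in the sketching language."},
            ],
        }],
    )
    print(response.content[0].text)  # strokes to parse and render onto the canvas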

Limitations and Future Work

SketchAgent has several limitations:

  • Model Priors: It is constrained by the priors of the backbone model, which is primarily optimized for text rather than visual content. This often results in overly abstract and unrecognizable outputs.
  • Human Figure Depiction: While distinctive features may be captured well in language, the resulting sketches of human figures are overly simple and lack expressivity.
  • Letters and Numbers: The agent may struggle with drawing letters and numbers, which could be improved by providing relevant in-context examples.

Conclusion

We have presented SketchAgent, a method for language-driven, sequential sketch generation that can produce versatile sketches in real time and engage meaningfully in collaborative sketching sessions with humans. By leveraging the prior knowledge embedded in pretrained multimodal LLMs through an intuitive sketching language and a grid canvas, SketchAgent represents a meaningful step toward developing general-purpose sketching systems with the potential to enhance human-computer communication and computer-aided ideation.

This work paves the way for future advancements in AI-driven sketching tools that can support iterative, evolving interactivity, fostering a more natural and creative collaboration between humans and machines.