CNN-Based Co-Creative Architecture

Updated 13 January 2026
  • The paper introduces a CNN-based framework that fine-tunes a VGG-16 model on sketch data to enable conceptual shifts using clustering in the feature space.
  • It employs advanced embeddings and composite novelty metrics combining visual and semantic distances, with evaluations showing significant creative impact (p < .01, FID ≈ 17.5).
  • GAN pipelines extend the system by synthesizing novel visual forms and refining stylistic details, thereby reducing designers' time-to-first-idea and expanding creative possibilities.

CNN-based co-creative architectures define a class of interactive intelligent systems that employ deep convolutional neural networks (CNNs) to engage in collaborative creation with human users, with the explicit goal of stimulating creativity through data-driven conceptual shifts, generative responses, or stylistic augmentation. These architectures are structured to analyze, encode, and retrieve or generate visual artifacts—most often sketches or images—in response to user input, thereby supporting an iterative, co-creative workflow in real time or near-real time.

1. Foundational Architectures and Mechanisms

At the core of most early CNN-based co-creative systems is a discriminative convolutional encoder built atop pre-trained architectures such as VGG-16. For instance, in the architecture described by Saleh and Elgammal (Karimi et al., 2018), a VGG-16 network pre-trained on ImageNet is fine-tuned on a substantial collection of sketch data drawn from a subset of the Quick, Draw! dataset (65 object categories, up to 110,000 sketches per category). The full VGG-16 stack—13 convolutional layers organized into five blocks, each with increasing filter counts and ReLU activations—feeds into two fully-connected layers (fc1, fc2; both 4096 units), with the final classification layer omitted during feature extraction.

During inference, a user-generated sketch is rasterized and passed through the network, extracting a 4096-dimensional descriptor from the first fully-connected layer (fc1), which is subsequently used as the representation for retrieval or synthesis. Fine-tuning is performed with categorical cross-entropy loss over the relevant object categories, with all layers updated jointly. Optimizer choices and hyperparameters are often left unspecified; weight selection is typically by validation accuracy.

2. Clustering, Conceptual Shift, and Matching Algorithms

A central innovation in these architectures is the explicit modeling and operationalization of "conceptual shifts." Rather than simply retrieving the closest match in the user’s intended category, the system discovers clusters in the encoder’s feature space using k-means or related vector quantization methods:

  • Each sketch is embedded, and clusters—one per visual or conceptual category—are defined by centroids in the high-dimensional feature space.
  • Conceptual shift is realized by computing the Euclidean distance between the centroid of the user's sketch cluster and centroids from clusters belonging to different object categories.
  • The system then selects the closest cluster from a non-matching category (minimizing the centroid distance), retrieves a representative sketch, and presents this as a conceptual shift prompt to the user (Karimi et al., 2018).

This process is formalized as:

$$j^* = \underset{j:\ \mathrm{cat}(j) \neq \mathrm{cat}(u)}{\arg\min} \; \lVert \mu_u - \mu_j \rVert_2$$

where $\mu_u$ is the centroid of the user's assigned cluster, and $j^*$ is the index of the most similar non-matching cluster.
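The centroid-distance selection above can be implemented in a few lines of NumPy. This is an illustrative sketch: the real system derives centroids from k-means over the fc1 embeddings, while here the centroids, category labels, and toy vectors are invented for demonstration.

```python
import numpy as np

def conceptual_shift(user_vec, centroids, categories, user_category):
    """Return the index of the nearest cluster from a *different* category.

    user_vec   : (D,) embedding of the user's sketch (e.g. the fc1 descriptor)
    centroids  : (K, D) cluster centroids found by k-means in feature space
    categories : length-K list giving the object category of each cluster
    """
    # Assign the user's sketch to its own cluster (nearest centroid).
    mu_u = centroids[np.argmin(np.linalg.norm(centroids - user_vec, axis=1))]
    # j* = argmin over clusters whose category differs from the user's.
    dists = np.linalg.norm(centroids - mu_u, axis=1)
    same_cat = np.array([c == user_category for c in categories])
    dists[same_cat] = np.inf          # exclude same-category clusters
    return int(np.argmin(dists))

# Toy example: three clusters in 2-D, two categories.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
categories = ["cat", "dog", "dog"]
j_star = conceptual_shift(np.array([0.1, 0.0]), centroids, categories, "cat")
print(j_star)  # 1 — the nearest cluster belonging to a different category
```

A representative sketch from cluster $j^*$ would then be shown to the user as the conceptual-shift prompt.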

3. Advanced Representations and Novelty Metrics

Subsequent systems, such as the Creative Sketching Partner (CSP), expanded the representational machinery to incorporate both discriminative and sequence-based models. While the VGG-16 pipeline offers a fixed-length, high-dimensional embedding, CSP leverages a CNN-LSTM model that processes raw stroke sequences, feeding the stroke data through three 1D convolutional layers (progressively increasing filters), followed by a stack of three LSTM layers before deriving a compact 256-dimensional embedding from the final LSTM hidden state (Karimi et al., 2019).
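A CNN-LSTM stroke encoder of this shape can be sketched in PyTorch as follows. The conv filter counts, kernel sizes, and the 3-feature stroke format (Δx, Δy, pen state) are illustrative assumptions; the paper specifies only three 1D conv layers with increasing filters, three LSTM layers, and a 256-dimensional final hidden state.

```python
import torch
import torch.nn as nn

class StrokeEncoder(nn.Module):
    """CNN-LSTM stroke encoder in the spirit of CSP: three 1-D conv layers
    with increasing filter counts, three stacked LSTMs, and the final LSTM
    hidden state as a 256-d embedding. Filter sizes are assumptions."""
    def __init__(self, in_features: int = 3, embed_dim: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_features, 48, 5, padding=2), nn.ReLU(),
            nn.Conv1d(48, 64, 5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 96, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(96, embed_dim, num_layers=3, batch_first=True)

    def forward(self, strokes):                  # strokes: (B, T, in_features)
        h = self.convs(strokes.transpose(1, 2))  # convolve over the time axis
        _, (h_n, _) = self.lstm(h.transpose(1, 2))
        return h_n[-1]                           # (B, embed_dim) embedding

enc = StrokeEncoder()
emb = enc(torch.rand(2, 100, 3))  # batch of 2 stroke sequences, 100 steps each
print(emb.shape)                  # torch.Size([2, 256])
```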

Novelty in these models is estimated as a composite metric, integrating visual distance in the embedding space and conceptual distance in a semantic (word2vec) space of category labels. Visual similarity is computed as one minus the min–max normalized Euclidean distance between cluster centroids; conceptual similarity uses the cosine similarity of word2vec embeddings. The final novelty score is the percentile-based combination, supporting explicit partitioning of candidate responses into low, intermediate, and high novelty bins. This enables dynamic control over the creative divergence presented to the user.
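The composite novelty computation can be sketched as below. The min-max normalization, cosine similarity over label embeddings, and percentile binning follow the description above; the simple averaging of the two similarities and the 33/66 percentile cut points are assumptions, and the toy label vectors stand in for real word2vec embeddings.

```python
import numpy as np

def novelty_score(visual_dists, label_vecs, user_label_vec):
    """Composite novelty in the CSP style (a sketch, not the exact code).

    visual similarity     = 1 - min-max-normalized Euclidean centroid distance
    conceptual similarity = cosine similarity of word2vec label embeddings
    The averaging of the two and the percentile cut points are assumptions.
    """
    d = np.asarray(visual_dists, dtype=float)
    vis_sim = 1.0 - (d - d.min()) / (d.max() - d.min())
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    con_sim = np.array([cos(user_label_vec, v) for v in label_vecs])
    novelty = 1.0 - (vis_sim + con_sim) / 2.0   # more dissimilar => more novel
    # Percentile-based partition into low / intermediate / high novelty bins.
    lo, hi = np.percentile(novelty, [33, 66])
    bins = np.where(novelty < lo, "low",
                    np.where(novelty < hi, "intermediate", "high"))
    return novelty, bins

# Three candidate responses: increasing visual distance, decreasing label similarity.
labels = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
scores, bins = novelty_score([1.0, 2.0, 3.0], labels, np.array([1.0, 0.0]))
print(list(bins))  # ['low', 'intermediate', 'high']
```

Selecting from a chosen bin is what gives the system dynamic control over creative divergence.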

4. Extension to Generative Architectures: CNN-based GAN Pipelines

Beyond discriminative encoders, co-creative architectures have evolved to incorporate convolutional generative adversarial networks (GANs), enabling not just retrieval, but actual synthesis of novel visual forms. López-Martínez et al. (Lataifeh et al., 2023) systematically compared DCGAN, WGAN, WGAN-GP, BigGAN-deep, and StyleGAN2-ada architectures—each with CNN backbones—for silhouette and stylistic character design. GANs are trained over domain-specific datasets (e.g., 10,000 human/monster silhouettes at 64x64 or 512x512 resolution), sometimes using transfer learning (e.g., from FFHQ) and extensive data augmentation pipelines (ADA).

The generative pipeline is often staged: a first GAN (e.g., StyleGAN2-ada) produces basic silhouettes, abstracting away color and texture to prompt human designers to focus on fundamental forms. A subsequent network (e.g., Pix2Pix or StyleGAN2-ada fine-tuned for color) performs high-resolution coloring or adds surface detail. Loss functions include adversarial objectives, L1 regularization (for Pix2Pix), and Wasserstein or gradient penalty terms (for WGAN-GP), with Fréchet Inception Distance (FID) as the primary measure of generative fidelity.
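Of the loss terms mentioned, the WGAN-GP gradient penalty is the least standard, so a minimal sketch may help: it penalizes the critic's gradient norm deviating from 1 on random interpolates between real and generated samples. The toy critic and image size below are invented for illustration.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lambda_gp: float = 10.0):
    """WGAN-GP term: lambda * E[(||grad_x_hat D(x_hat)||_2 - 1)^2], where
    x_hat is a random interpolate between a real and a generated sample."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Toy critic over 1x8x8 images, purely for demonstration.
critic = nn.Sequential(nn.Flatten(), nn.Linear(64, 1))
gp = gradient_penalty(critic, torch.rand(4, 1, 8, 8), torch.rand(4, 1, 8, 8))
print(float(gp))  # a non-negative scalar added to the critic loss
```

In training, this term is added to the critic's Wasserstein objective each step.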

5. Workflow Integration and Human-AI Interaction Patterns

Interaction protocols within CNN-based co-creative systems are structured as iterative call-and-response loops:

  1. The user generates a sketch or provides a design seed (stroke sequence or raster image).
  2. The system embeds the input, locates its cluster/centroid in feature or semantic space, or generates a novel variant using the current generative model.
  3. The AI retrieves or synthesizes a sketch from a suitably shifted category or latent direction—according to the computed novelty level or conceptual shift distance.
  4. This output is rendered on the shared canvas, and the designer integrates, modifies, or responds, continuing the cycle.

In retrieval-based schemes, this yields visually coherent but semantically unexpected responses; generative approaches expand the possible design space by enabling the creation of forms not explicitly present in the training set (Karimi et al., 2018, Karimi et al., 2019, Lataifeh et al., 2023).
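The four-step loop above can be expressed as a small driver function. All the helper names here (`embed`, `assign_cluster`, `respond`, `render`, `get_user_input`) are hypothetical placeholders for the components described in the preceding sections.

```python
def cocreative_loop(embed, assign_cluster, respond, render, get_user_input):
    """Iterative call-and-response loop; all callables are hypothetical stubs."""
    while True:
        sketch = get_user_input()      # 1. stroke sequence or raster image
        if sketch is None:             #    (None signals end of session)
            break
        z = embed(sketch)              # 2. embed in feature/semantic space
        cluster = assign_cluster(z)
        response = respond(z, cluster) # 3. retrieve or synthesize a shifted sketch
        render(response)               # 4. shared canvas; user responds next turn

# Toy run with stand-in callables: two "sketches", then end of session.
inputs = iter([1, 2, None])
rendered = []
cocreative_loop(lambda s: s * 2, lambda z: z % 3,
                lambda z, c: (z, c), rendered.append, lambda: next(inputs))
print(rendered)  # [(2, 2), (4, 1)]
```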

6. Empirical Evaluation and Creative Metrics

Evaluations are predominantly qualitative, supplemented in some cases by quantitative characterization:

  • Saleh and Elgammal’s system provides call-and-response sketch pairs, showcasing both low-distance and higher-distance cross-category matches, but does not report explicit classification or retrieval metrics (Karimi et al., 2018).
  • CSP quantitatively measures the impact of novelty-exposed AI responses on user creativity, demonstrating that high-novelty outputs correlate with both increased inspiration and transformational design, validated through user study statistics (p < .01) (Karimi et al., 2019).
  • Generative pipelines are benchmarked via FID (e.g., StyleGAN2-ada with transfer learning achieves FID ≈ 17.5), with user studies indicating that GAN-generated seeds reduce designers' "time-to-first-idea" and positively correlate GAN quality with subjective ratings of novelty (ρ ≈ 0.72) (Lataifeh et al., 2023).

7. Design Considerations, Limitations, and Future Directions

Key design choices include the use of separate stages for structural and stylistic synthesis, conditional architectures for semantic control, adaptive data augmentation to prevent mode collapse on small datasets, and the utility of transfer learning to achieve convergence with minimal samples. Limitations include computational cost (multi-day single-GPU training), over-specificity in highly realistic generative models (potentially inhibiting creative divergence), and unresolved issues surrounding the provenance of transfer-learned generative features.

Recommended practices for future architectures include incorporation of explicit creativity-oriented loss functions (e.g., entropy maximization in latent space), tools for user-driven latent space traversal and editing, and integration of multi-modal (textual + visual) conditioning for richer co-creative interactions (Lataifeh et al., 2023). A plausible implication is that effective co-creative systems depend not only on architectural advances, but equally on workflow design, task-specific evaluation, and creatively meaningful interaction between human and agent.


References:

  • Saleh and Elgammal, "Deep Learning for Identifying Potential Conceptual Shifts for Co-creative Drawing" (Karimi et al., 2018)
  • López-Martínez et al., "Augmenting Character Designers Creativity Using Generative Adversarial Networks" (Lataifeh et al., 2023)
  • Karimi et al., "Deep Learning in a Computational Model for Conceptual Shifts in a Co-Creative Design System" (Karimi et al., 2019)