Multi-task CNN Porcelain Classification

Updated 28 January 2026
  • The paper introduces a multi-task CNN architecture with hard-parameter sharing and task-specific heads to simultaneously classify dynasty, ware, glaze, and vessel type.
  • It employs advanced techniques such as transfer learning and synthetic data augmentation to address class imbalances in real-world porcelain datasets.
  • Performance metrics indicate high accuracies with MobileNetV2 and ResNet50 backbones, though synthetic augmentation improves type classification at the expense of glaze detail.

Multi-task CNN-based porcelain classification is a specialized deep learning approach developed to automate the identification of Chinese porcelain artifacts, addressing four concurrent classification tasks: dynasty, kiln (ware), glaze, and vessel type. This paradigm leverages convolutional neural networks (CNNs) with hard-parameter sharing and task-specific output heads, sometimes enhanced by transfer learning and synthetic data augmentation, to achieve robust, scalable, and multi-faceted artifact categorization that meets the rigorous demands of archaeological and cultural heritage research (Ling et al., 18 Mar 2025, Ling et al., 21 Jan 2026).

1. Multi-Task CNN Architecture

Multi-task CNN-based porcelain classification hinges on “hard-parameter-sharing” models, where a single backbone network is used to extract a shared feature representation for all tasks. Four established CNN architectures are commonly utilized as backbones: ResNet50 (residual connections), MobileNetV2 (inverted residuals with linear bottlenecks), VGG16 (deep 3×3 stacks), and InceptionV3 (mixed convolutional kernels) (Ling et al., 18 Mar 2025).

For each input (224×224 px RGB image), the main CNN backbone is truncated after the global average pooling (GAP) layer. Four parallel fully-connected (FC) classification heads are then attached, each followed by a softmax activation for its respective task:

  • Dynasty: softmax over 2 classes (Song, Yuan)
  • Kiln/Ware: softmax over 10 (or 9) classes (e.g., Ding, Jizhou, Jun, etc.)
  • Glaze: softmax over 7–8 classes (e.g., Celadon, White, Black, etc.)
  • Type: softmax over 12 classes (e.g., Bowl, Dish, Vase, etc.)

This structure omits cross-task fusion or gating; all tasks independently score the shared feature vector (Ling et al., 18 Mar 2025, Ling et al., 21 Jan 2026).

In one variant, MobileNetV3-Large serves as the backbone, with depthwise-separable convolutions, squeeze-and-excitation (SE) modules, and a 960-dimensional GAP output. Each head is preceded by a dropout layer (p=0.5) to reduce overfitting (Ling et al., 21 Jan 2026).

2. Multi-Task Loss Objectives and Class-Balancing

Training employs a per-task categorical cross-entropy loss. For task $t$ with $K_t$ classes, one-hot target $\mathbf{y}^{(t)}$, and predicted probabilities $\mathbf{p}^{(t)}$:

$$L_t = -\sum_{i=1}^{K_t} y^{(t)}_i \log p^{(t)}_i$$

Multi-task learning combines these into a total loss, either as a uniform sum (Ling et al., 18 Mar 2025):

$$L_\text{total} = L_\text{dynasty} + L_\text{ware} + L_\text{glaze} + L_\text{type}$$

or, to reflect task difficulty, as a weighted sum (Ling et al., 21 Jan 2026):

$$\mathcal{L}_\text{total} = \sum_{t\in T} \lambda_t \mathcal{L}_t$$

with empirically set $\lambda_\text{dynasty}=1.0$, $\lambda_\text{kiln}=1.2$, $\lambda_\text{glaze}=2.0$, $\lambda_\text{type}=1.5$.

To counter class imbalance, some pipelines apply per-class weights $w_{t,c}$ derived via the "effective-number" method, capped at 10; the weighted loss over a batch of $N$ samples is (Ling et al., 21 Jan 2026):

$$\mathcal{L}_t = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C_t} w_{t,c}\, y^{(t)}_{i,c} \log\bigl(\hat{y}^{(t)}_{i,c}\bigr)$$
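
A minimal, framework-agnostic sketch of these two ingredients, the effective-number class weights and the weighted task sum, is given below. The papers specify only the cap of 10 and the $\lambda_t$ values; the $\beta=0.999$ default and the mean-normalization step are assumptions, chosen as common conventions for this method.

```python
import math

def effective_number_weights(counts, beta=0.999, cap=10.0):
    """Class weights via the effective-number method.

    `beta` and the mean-normalization are illustrative defaults
    (assumptions); the cap of 10 follows the paper."""
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]  # effective sample counts
    raw = [1.0 / e for e in eff]                              # inverse effective number
    mean = sum(raw) / len(raw)
    return [min(w / mean, cap) for w in raw]                  # normalize, then cap

def total_loss(task_losses, lambdas):
    """Weighted multi-task objective: L_total = sum_t lambda_t * L_t."""
    return sum(lambdas[t] * loss for t, loss in task_losses.items())

# Rare classes receive larger weights:
print(effective_number_weights([3000, 500, 50, 4]))

# Task weights from Ling et al. (21 Jan 2026):
lams = {"dynasty": 1.0, "kiln": 1.2, "glaze": 2.0, "type": 1.5}
print(total_loss({"dynasty": 0.3, "kiln": 0.5, "glaze": 0.9, "type": 0.6}, lams))
```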

3. Datasets and Augmentation Techniques

The standard real-image dataset comprises 5,993 museum-photographed pottery images, each labeled for dynasty, ware, glaze, and type. Notably, the label distribution is highly imbalanced, with categories like “Song dynasty” and “White glaze” dominating, and extreme rarity for classes such as “Yellowish-green glaze” (4 examples) or “Teabowlstand” (64 examples) (Ling et al., 18 Mar 2025).

Preprocessing involves resizing to 224×224 pixels and random horizontal flip/rotation augmentations. For model assessment, an 80%/10%/10% train/validation/test split is typical (≈4,794/599/599 images) (Ling et al., 18 Mar 2025).
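
The split arithmetic can be sketched with a small helper (an illustrative sketch, not the authors' code; the fixed seed is an assumption for reproducibility):

```python
import random

def split_indices(n, frac=(0.8, 0.1, 0.1), seed=42):
    """Shuffle dataset indices and split into train/val/test (80/10/10)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * frac[0])
    n_val = int(n * frac[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(5993)
print(len(train), len(val), len(test))  # → 4794 599 600
```

With 5,993 images the integer arithmetic gives 4,794/599/600 rather than the exactly symmetric 599/599 quoted above; either convention keeps the splits disjoint.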

To expand the dataset, one approach leverages synthetic images produced by fine-tuned Stable Diffusion 1.5 with Low-Rank Adaptation (LoRA). LoRA-adapted diffusion models are trained on museum-verified, minority-class samples guided by detailed archaeological prompts and validated via a two-stage quality protocol (automated artifact removal + expert review). Synthetic samples are added at up to 10% of the training corpus, subject to class-specific rarity (Ling et al., 21 Jan 2026).

Further geometric and photometric augmentations are applied (random resized crop, rotation, flip, color jitter, affine transform, normalization). Oversampling of rare classes during training is executed via weighted random sampling proportional to $1/\sqrt{n}$, where $n$ is the class frequency (Ling et al., 21 Jan 2026).
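
The $1/\sqrt{n}$ oversampling rule maps directly to per-sample draw weights, as sketched below; in a PyTorch pipeline these weights would feed something like `torch.utils.data.WeightedRandomSampler`.

```python
import math
from collections import Counter

def sample_weights(labels):
    """Per-sample draw weights w_i = 1/sqrt(n_c), where n_c is the
    frequency of sample i's class; rare classes are drawn more often."""
    counts = Counter(labels)
    return [1.0 / math.sqrt(counts[c]) for c in labels]

labels = ["Bowl"] * 9 + ["Teabowlstand"]  # hypothetical 9:1 imbalance
w = sample_weights(labels)
print(round(w[0], 3), round(w[-1], 3))    # common class is down-weighted
```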

4. Training Procedures and Hyperparameter Regimes

Training is conducted in PyTorch, typically with:

  • Adam or AdamW optimizers (learning rates: $1\times10^{-3}$ for baseline heads, $1\times10^{-4}$ for pre-trained backbones)
  • Batch size of 32–64
  • Up to 50 epochs, with checkpointing for lowest validation loss and early stopping (10-epoch patience)
  • Hardware: NVIDIA RTX 2070 SUPER and comparable CPUs/GPUs
  • For transfer learning (TL), all backbones are initialized with ImageNet weights; training from scratch is compared in distinct runs (Ling et al., 18 Mar 2025, Ling et al., 21 Jan 2026)
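
The checkpoint-on-best / 10-epoch-patience policy from the list above reduces to a small amount of bookkeeping, sketched here independently of any framework:

```python
class EarlyStopping:
    """Track validation loss; flag an improvement (checkpoint trigger) and
    signal a stop after `patience` epochs without one (10 in the papers)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return (improved, should_stop) for this epoch's validation loss."""
        if val_loss < self.best:
            self.best = val_loss  # a real loop would save a checkpoint here
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
for loss in [1.0, 0.8, 0.9, 0.85, 0.95, 0.92]:
    improved, stop = stopper.step(loss)
    if stop:
        break
print(stopper.best)  # best validation loss seen: 0.8
```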

When synthetic data is included, real-to-synthetic ratios such as 95:5 or 90:10 are compared, ensuring that synthetic samples do not leak into test splits (Ling et al., 21 Jan 2026).

5. Quantitative Performance and Analysis

Model-Level Results

| Backbone | Dynasty Acc. | Ware/Kiln Acc. | Glaze Acc. | Type Acc. |
|-------------|-------|-------|-------|-------|
| InceptionV3 | 97.6% | 94.4% | 92.8% | 86.1% |
| MobileNetV2 | 97.3% | 95.3% | 95.3% | 86.1% |
| ResNet50    | 97.6% | 93.0% | 94.8% | 86.1% |
| VGG16       | 94.6% | 87.9% | 88.3% | 66.4% |

MobileNetV2 and ResNet50 exhibit superior accuracy, robustness, and convergence. VGG16 underperforms, particularly on more fine-grained or unbalanced tasks (Ling et al., 18 Mar 2025).

| Task | Real-only F1 | Real+5% F1 | Real+10% F1 |
|---------|--------|--------|--------|
| Dynasty | 0.8480 | 0.8612 | 0.8808 |
| Kiln    | 0.7394 | 0.7491 | 0.7615 |
| Glaze   | 0.7338 | 0.7203 | 0.7079 |
| Type    | 0.7463 | 0.7610 | 0.7791 |
| Avg. F1 | 0.7674 | 0.7727 | 0.7823 |

Synthetic augmentation (10% LoRA-generated samples) yields the largest macro-F1 gain for type (+4.4 pp), moderate gains for dynasty (+3.9 pp) and kiln (+3.0 pp), and a decline for glaze (–3.5 pp). The type improvement is attributed to the generative model's capacity to capture vessel morphology, while glaze performance degrades because diffusion-induced texture averaging limits realism in fine surface detail (Ling et al., 21 Jan 2026).

Other Evaluation Regimes

Balanced accuracy, precision, recall, and F1 scores are consistently highest for MobileNetV2 (e.g., balanced accuracy for type = 84.8%), with superior confusion matrix behavior evidenced by fewer misclassifications on major classes (Ling et al., 18 Mar 2025).

Transfer learning is critical: models trained from scratch lose 5–15 percentage points of accuracy; e.g., MobileNetV2’s type accuracy jumps from 73.3% (no TL) to 86.1% (with TL) (Ling et al., 18 Mar 2025).

6. Limitations and Confusion Analysis

Frequent confusions occur among visually similar categories:

  • Ware: Peng ↔ Ding (both white wares)
  • Glaze: Green ↔ White or Celadon (overlapping hues)
  • Type: Vase ↔ Plate (top-down views obscure 3D shape)

Synthetic image augmentation is beneficial for tasks depending on global structure or under-represented classes but deteriorates glaze classification that relies on micro-surface detail, reflecting a texture-smoothing bias in diffusion-generated images. Performance gains saturate or reverse if synthetic data exceeds 10–15%, indicating distribution shift (Ling et al., 21 Jan 2026).

7. Enhancements and Future Directions

Recommended directions for improving classification fidelity center on best practices for synthetic augmentation: targeting under-represented classes, capping synthetic prevalence at 8–10%, employing expert-in-the-loop quality control, and emphasizing morphological prompt engineering. For purely texture-driven tasks such as glaze, non-diffusion synthetic methods may be superior (Ling et al., 21 Jan 2026).

These directions reflect ongoing efforts to balance scalable automation, archaeological authenticity, and generalization across heritage contexts, foregrounding the central role of multi-task CNNs in the computational study of material culture.
