CASIA-OLHWDB1.1 Chinese Handwriting Dataset
- CASIA-OLHWDB1.1 is a large-scale dataset of online handwritten Chinese characters featuring 3,755 classes and over 1.1 million samples from 300 writers.
- It provides raw pen trajectory data along with domain-specific features such as path signatures, directional histograms, and elastic distortions for enhanced recognition.
- Its extensive use in training deep neural and recurrent models underscores its pivotal role in advancing online handwritten Chinese character recognition.
The CASIA-OLHWDB1.1 dataset is a large-scale, publicly available repository of online handwritten Chinese characters widely utilized for evaluating algorithms in online handwritten Chinese character recognition (HCCR). Designed and released by the Institute of Automation, Chinese Academy of Sciences (CASIA), the dataset targets the GBK Level-1 set of simplified Chinese characters and has become a foundational benchmark for the development and assessment of deep neural architectures, domain-specific feature extraction, and novel training strategies in the HCCR domain.
1. Dataset Composition and Structure
CASIA-OLHWDB1.1 comprises 3,755 character classes corresponding to the standard GBK Level-1 subset. Data collection involved 300 distinct writers, each contributing a single instance of every character, resulting in an aggregate of 1,126,500 samples. The dataset is partitioned as follows:
| Set | Writers | Samples per Class | Total Samples |
|---|---|---|---|
| Training | 240 | 240 | 901,200 |
| Test | 60 | 60 | 225,300 |
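The split arithmetic follows directly from the collection protocol (one sample per writer per class) and can be sanity-checked in a few lines of Python:

```python
# CASIA-OLHWDB1.1 split arithmetic: each writer contributes exactly one
# sample per character class, so totals are writers * classes.
NUM_CLASSES = 3755

def split_total(num_writers: int, num_classes: int = NUM_CLASSES) -> int:
    """Total samples contributed by a group of writers."""
    return num_writers * num_classes

train_total = split_total(240)
test_total = split_total(60)
print(train_total, test_total, train_total + test_total)  # → 901200 225300 1126500
```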
No official validation split is provided; evaluation protocols therefore differ across studies, with most either tuning hyperparameters directly on the training pool or holding out a subset of training writers for validation (Graham, 2014, Lai et al., 2017, Ren et al., 2017, Yang et al., 2015).
Each sample consists of a variable-length sequence of pen-tip coordinates, represented as ordered (x, y) pairs grouped by stroke. Some approaches append a monotonic timestamp for path-signature feature extraction, while the canonical format is a series of segment-ordered, writer-normalized pen strokes (Lai et al., 2017, Ren et al., 2017).
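A minimal in-memory view of one such sample can be sketched as follows; the class and field names are illustrative conveniences, not the official CASIA POT file schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class OnlineSample:
    """One online handwriting sample: a character label plus a
    variable-length list of strokes, each a sequence of (x, y)
    pen-tip coordinates."""
    label: str                                        # e.g. a GBK Level-1 character
    strokes: List[List[Tuple[int, int]]] = field(default_factory=list)

    def num_points(self) -> int:
        return sum(len(s) for s in self.strokes)

sample = OnlineSample(label="永",
                      strokes=[[(10, 12), (11, 15), (13, 20)],
                               [(30, 8), (31, 14)]])
print(sample.num_points())  # → 5
```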
2. Preprocessing and Feature Representations
The CASIA-OLHWDB1.1 data is distributed as raw pen trajectory sequences, with preprocessing pipelines differing by research objective and model type. Common stages include:
- Scaling and Centering: Trajectories are rescaled along both axes to fit a fixed square window and mean-centered on the grid center (Ren et al., 2017, Yang et al., 2015).
- Rasterization: For convolutional architectures, raw trajectories are rendered to sparse binary bitmaps or multi-channel tensors. The rendering may include one-pixel-wide binary strokes or extended multi-channel grids embedding additional structure (e.g., 8-direction compass histograms, path signatures, or "imaginary strokes") (Graham, 2014, Yang et al., 2015).
- Domain-Specific Feature Augmentation: Several works incorporate domain-specific channels:
- Path signature maps: Iterated integrals of the trajectory, truncated at a low order, producing up to 121 feature maps (Yang et al., 2015, Lai et al., 2017).
- Directional feature maps: 8-bin histograms quantifying local stroke orientation per grid cell (Graham, 2014, Yang et al., 2015).
- Imaginary strokes: Linear interpolations between pen-up and next pen-down points, rendered as auxiliary bitmap channels (Yang et al., 2015).
- Affine and Elastic Distortions: Applied online during training for data augmentation, including scaling, rotation, translation, global shear, and nonlinear deformations (Graham, 2014, Yang et al., 2015, Lai et al., 2017).
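The scaling/centering and rasterization stages above can be sketched in pure Python. The window size, one-pixel stroke width, and linear interpolation below are illustrative choices, not the exact pipelines of the cited papers:

```python
from typing import List, Tuple

Stroke = List[Tuple[float, float]]

def normalize(strokes: List[Stroke], size: int = 32) -> List[Stroke]:
    """Rescale both axes into a size x size window, mean-centered."""
    pts = [p for s in strokes for p in s]
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    span = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    scale = (size - 1) / span
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    c = (size - 1) / 2.0   # shift so the trajectory mean lands at the center
    return [[((x - mx) * scale + c, (y - my) * scale + c) for x, y in s]
            for s in strokes]

def rasterize(strokes: List[Stroke], size: int = 32) -> List[List[int]]:
    """Render one-pixel-wide strokes into a binary size x size bitmap."""
    grid = [[0] * size for _ in range(size)]

    def plot(x: float, y: float) -> None:
        xi = min(max(int(round(x)), 0), size - 1)
        yi = min(max(int(round(y)), 0), size - 1)
        grid[yi][xi] = 1

    for s in strokes:
        for x0, y0 in s:
            plot(x0, y0)                     # isolated points still draw
        for (x0, y0), (x1, y1) in zip(s, s[1:]):
            steps = max(int(abs(x1 - x0)), int(abs(y1 - y0)), 1)
            for t in range(1, steps):        # interior of each segment
                plot(x0 + (x1 - x0) * t / steps,
                     y0 + (y1 - y0) * t / steps)
    return grid

strokes = [[(0.0, 0.0), (100.0, 0.0)], [(0.0, 100.0), (100.0, 100.0)]]
grid = rasterize(normalize(strokes), 32)
ink = sum(sum(row) for row in grid)
print(ink, ink / 32 ** 2)  # → 64 0.0625  (most cells stay zero)
```

The ink fraction printed at the end illustrates the spatial sparsity that sparse-CNN approaches exploit.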
The resulting network inputs thus vary from 1- to 121-channel tensors at a range of square spatial resolutions, with the vast majority of grid cells set to zero (spatial sparsity is explicitly reported in bitmap cases) (Graham, 2014, Yang et al., 2015).
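The truncated path signatures used as feature channels can be illustrated for a single 2-D trajectory. The sketch below computes the level-1 and level-2 iterated integrals of a piecewise-linear path via Chen's identity; it is a generic signature computation, not the exact feature extraction of the cited pipelines, where each scalar channel would be rendered into its own feature map:

```python
from typing import List, Tuple

def signature_level2(path: List[Tuple[float, float]]):
    """Level-1 and level-2 iterated integrals of a piecewise-linear 2-D path.

    Level 1 is the total displacement; level 2 collects the iterated
    integrals S(i,j) = ∫ (x_i - x_i(0)) dx_j, accumulated segment by
    segment with Chen's identity.  Returns (s1, s2): a 2-vector and a
    2x2 matrix, i.e. 2 + 4 = 6 scalar features per trajectory.
    """
    s1 = [0.0, 0.0]
    s2 = [[0.0, 0.0], [0.0, 0.0]]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        d = (x1 - x0, y1 - y0)
        for i in range(2):
            for j in range(2):
                # concatenation rule: running level-1 times the new
                # increment, plus the segment's own level-2 term d⊗d/2
                s2[i][j] += s1[i] * d[j] + 0.5 * d[i] * d[j]
        s1[0] += d[0]
        s1[1] += d[1]
    return s1, s2

# a straight line traversed in two steps has the same signature as in one
print(signature_level2([(0, 0), (0.5, 1), (1, 2)]))
# → ([1.0, 2.0], [[0.5, 1.0], [1.0, 2.0]])
```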
3. Benchmark Architectures and Training Protocols
Research leveraging CASIA-OLHWDB1.1 spans a range of neural architectures and training paradigms. Representative models include:
- Deep Convolutional Networks: The DeepCNet(l, k) architecture processes spatially-sparse inputs through sequential layers of convolution and max-pooling, with ReLU activations, dropout regularization (rates typically increasing with depth), and cross-entropy loss. For l = 6, six pairs of max-pooling and convolution progressively reduce the input to a 1×1 spatial extent, terminating in a softmax classifier over 3,755 classes (Graham, 2014).
- Domain-Specific DCNNs: Composite networks integrate discrete domain knowledge—such as path signature features, 8-directional maps, and nonlinear normalization—with standard DCNN pipelines. These may be fused using hybrid serial-parallel ensembling, where an array of DCNNs, each ingesting distinct feature sets, yields robust consensus predictions (Yang et al., 2015).
- Recurrent Neural Networks: Sequence models (GRU, LSTM, and Hybrid-parameter RNNs with Memory Pool Units) operate directly on normalized sequences. Axis-wise scaling and mean-centering standardize temporal samples, and outputs from stacked RNNs are pooled for classification. Hybrid-parameter RNNs halve parameter count and accelerate inference while increasing accuracy over conventional bidirectional RNNs (Ren et al., 2017).
- Enhanced CNN Pipelines: Recent studies introduce DropDistortion (scheduled reduction of random affine distortions during training) in conjunction with higher-order path signature input and spatial stochastic max-pooling (SSMP, with fractional stride and elastic feature-map deformation), further boosting generalization (Lai et al., 2017).
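Following Graham's (2014) description of DeepCNet(l, k) — an initial 3×3 convolution on a 3·2^l input field, then l pairs of 2×2 max-pooling and 2×2 convolution, with the n-th convolution holding n·k filters — the spatial-size arithmetic can be traced with a short sketch. This is one reading of the paper's recipe, not released code:

```python
def deepcnet_schedule(l: int, k: int):
    """Trace (layer, spatial size, filter count) through a DeepCNet(l, k)."""
    size = 3 * 2 ** l                  # input field: 192 for l = 6
    schedule = [("input", size, 1)]
    size -= 2                          # 3x3 convolution, k filters
    schedule.append(("conv3x3", size, k))
    for n in range(2, l + 2):          # l pooling/convolution pairs
        size //= 2                     # 2x2 max-pooling
        size -= 1                      # 2x2 convolution, n*k filters
        schedule.append((f"pool+conv{n}", size, n * k))
    return schedule

for layer in deepcnet_schedule(6, 100):
    print(layer)
```

Running this for DeepCNet(6, 100) shows the field shrinking 192 → 190 → 94 → 46 → 22 → 10 → 4 → 1, confirming that six pool/conv pairs end at a 1×1 extent ready for the softmax classifier.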
Training protocols consistently employ stochastic gradient optimization (e.g., SGD with classical or Nesterov momentum, or RMSProp), batch sizes typically in the range of 96–256, and dropout. Early stopping, based either on training loss or on held-out validation writers, is common (Graham, 2014, Yang et al., 2015, Lai et al., 2017, Ren et al., 2017).
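The optimizers named above all build on the same basic update rule; a minimal classical-momentum SGD step in pure Python (illustrative only, not tied to any cited implementation):

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One classical-momentum update: v <- mu*v - lr*g; p <- p + v."""
    for i in range(len(params)):
        velocity[i] = momentum * velocity[i] - lr * grads[i]
        params[i] += velocity[i]
    return params, velocity

# with zero initial velocity, the first step equals plain SGD
p, v = sgd_momentum_step([1.0], [0.5], [0.0], lr=0.5, momentum=0.5)
print(p, v)  # → [0.75] [-0.25]
```

Nesterov momentum and RMSProp modify where the gradient is evaluated and how the step size is scaled, respectively, but follow the same in-place update pattern.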
4. Benchmark Results and Comparative Analysis
CASIA-OLHWDB1.1 has served as a competitive testbed for both architectural innovations and auxiliary representation schemes. Notable reported results include:
| Architecture/Approach | Input Features (Channels) | Test Accuracy (%) | Source |
|---|---|---|---|
| DeepCNet(6,100) | Bitmap + 8-direction histograms (9) | 96.18 | (Graham, 2014) |
| DeepCNet(6,100) | Bitmap only (1) | 94.88 | (Graham, 2014) |
| DropDistortion+CNN+SSMP | Truncated path signature (121) | 97.30 | (Lai et al., 2017) |
| HSP DCNN Ensemble | Bitmap, signature, direction, etc. | 96.87 | (Yang et al., 2015) |
| Hybrid-parameter RNN, 5 layers (MPU) | Normalized trajectory | 96.5 | (Ren et al., 2017) |
Performance improvements with feature-rich representations (e.g., 121-channel path signatures) and advanced ensemble strategies are significant, with top-1 error rates reaching as low as 2.7–3.1%. The relative error reduction between basic bitmap-only DCNNs and multi-stream or distortion-augmented architectures can approach 50% (e.g., 5.12% → 2.70% top-1 error) (Yang et al., 2015, Lai et al., 2017).
5. Significance in HCCR Research and Methodological Trends
CASIA-OLHWDB1.1’s scope—large class count, writer diversity, and standardized splits—makes it a de facto benchmark for modern HCCR algorithms. Key methodological trends that emerged and are continuously refined on this dataset include:
- Exploitation of Spatial Sparsity: High-resolution, one-pixel stroke rendering exploits input sparsity for computational efficiency in deep CNNs while preserving character detail (Graham, 2014).
- Embrace of Domain Knowledge: Integrating path signatures, imaginary strokes, and directional histograms captures both topological and dynamic properties of handwriting, consistently improving recognition beyond raw bitmaps (Yang et al., 2015, Lai et al., 2017).
- Progressive Data Augmentation: Randomized affine, elastic, and piecewise linear distortions (DropDistortion, ED) are crucial for generalization, especially in large-class regimes (Yang et al., 2015, Lai et al., 2017).
- Ensemble and Averaging Techniques: HSP ensembles and SSMP-based model averaging provide effective variance reduction in predictions (Yang et al., 2015, Lai et al., 2017).
- Efficient Sequence Modeling: Direct RNN operation on standardized trajectories, using parameter-sharing schedules (hybrid-parameter RNN) and compact hidden units (MPU), achieves competitive performance with reduced inference and memory cost (Ren et al., 2017).
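The DropDistortion idea above (progressively shrinking random distortions over training) can be sketched as follows; the linear decay schedule and the specific affine parameters are illustrative assumptions, not the exact configuration of Lai et al. (2017):

```python
import math
import random

def distortion_magnitude(epoch: int, total_epochs: int, initial: float = 1.0) -> float:
    """Linearly decay the distortion strength to zero over training."""
    return initial * max(0.0, 1.0 - epoch / total_epochs)

def random_affine(point, magnitude, rng):
    """Small random rotation + scaling of a 2-D point about the origin."""
    angle = rng.uniform(-0.2, 0.2) * magnitude        # radians
    scale = 1.0 + rng.uniform(-0.1, 0.1) * magnitude
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (scale * (c * x - s * y), scale * (s * x + c * y))

rng = random.Random(0)
for epoch in (0, 50, 100):
    m = distortion_magnitude(epoch, total_epochs=100)
    print(epoch, m, random_affine((1.0, 0.0), m, rng))
```

By the final epochs the magnitude reaches zero and training samples pass through undistorted, which is the scheduled "drop" that distinguishes this scheme from constant-strength augmentation.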
6. Open Questions and Ongoing Directions
While accuracy on CASIA-OLHWDB1.1 continues to improve, several avenues remain active:
- Misclassified samples are frequently attributed to illegibility or annotation errors; detailed per-character or per-stroke breakdowns remain scarce in the literature (Lai et al., 2017).
- The generalizability of domain-specific augmentations (e.g., path signature order selection, optimal composition of channels) is still an area of empirical exploration.
- Model scaling, particularly regarding the balance between increased channel count and train-time efficiency, presents ongoing optimization challenges.
- The absence of official validation splits prompts the need for standardized protocol reporting to ensure result reproducibility.
A plausible implication is that future systems leveraging CASIA-OLHWDB1.1 are likely to incorporate increasingly sophisticated domain representations, ensemble averaging, and progressive data augmentation strategies as part of a state-of-the-art HCCR pipeline (Lai et al., 2017, Yang et al., 2015).
References:
- (Graham, 2014) Spatially-sparse convolutional neural networks
- (Yang et al., 2015) Improved Deep Convolutional Neural Network For Online Handwritten Chinese Character Recognition using Domain-Specific Knowledge
- (Lai et al., 2017) Toward high-performance online HCCR: a CNN approach with DropDistortion, path signature and spatial stochastic max-pooling
- (Ren et al., 2017) A New Hybrid-parameter Recurrent Neural Networks for Online Handwritten Chinese Character Recognition