
HiQ Dataset: High-Density Color QR Codes

Updated 2 December 2025
  • HiQ Dataset is a high-capacity color QR code collection featuring 5,390 high-density frames with diverse real-world distortions, including chromatic bleed and geometric misalignments.
  • It supports research on learning-based decoders, providing exact per-module ground-truth annotations across varied imaging conditions and print resolutions.
  • The dataset, captured across 8 smartphone models and multiple lighting scenarios, serves as a benchmark for evaluating decoding methods like LSVM-CMI and QDA-CMI.

HiQ is a high-capacity color QR code framework distinguished by its robust decoding strategies for complex chromatic distortions and its foundational dataset, CUHK-CQRC (Chinese University of Hong Kong Color QR Code). The HiQ system and dataset enable the study and development of learning-based decoders for high-density color QR codes under real-world imaging and print conditions, introducing new challenges for robust color recovery and geometric correction in mobile applications (Yang et al., 2017).

1. Dataset Structure and Coverage

The CUHK-CQRC dataset comprises 5,390 high-density color QR code frames in 3-layer HiQ code format, with each codeword represented as a 3-bit tuple, corresponding to 8 distinct module colors. The samples are split between 1,506 still-photo captures and 3,884 camera-preview frames. Modules are structured at various densities, with code sizes including 105×105, 125×125, 137×137, 157×157, and 177×177 modules per side. Printouts vary from 22 mm to 70 mm in side length and are produced at 600 or 1200 dpi on ordinary copier paper. Imaging conditions reflect typical mobile capture scenarios, with approximately 5–8 pixels-per-module in high-density settings.
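The 3-bit-per-module encoding can be illustrated with a minimal sketch, assuming each bit toggles one CMY colorant layer under an idealized subtractive printing model (the actual HiQ palette and print model may differ):

```python
# Map a 3-bit codeword (one bit per HiQ layer) to an idealized module color,
# assuming each bit controls one CMY colorant and ideal subtractive mixing.
# Illustrative only; the real HiQ color palette may differ.

def bits_to_rgb(bits):
    """bits: (cyan, magenta, yellow) layer bits -> idealized RGB triple."""
    c, m, y = bits
    # Each colorant absorbs one RGB channel completely in the ideal model.
    return (255 * (1 - c), 255 * (1 - m), 255 * (1 - y))

# All 8 module colors reachable by a 3-layer code:
palette = {
    (b >> 2 & 1, b >> 1 & 1, b & 1): bits_to_rgb((b >> 2 & 1, b >> 1 & 1, b & 1))
    for b in range(8)
}
```

Under this model, `(0, 0, 0)` (no ink) yields white and `(1, 1, 1)` yields black, with the remaining six tuples giving saturated primaries and secondaries.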

The dataset was constructed to capture authentic challenges in real-world barcode-scanning applications, including substantial dynamic and spectral variation in illumination. Indoor fluorescent, incandescent, diverse outdoor daylight scenarios, and both uniform and nonuniform shadowing are represented. No synthetic augmentations have been applied; all distortions stem from the physical capture pipeline.

A summary of the CUHK-CQRC dataset structure is provided below:

| Attribute          | Values/Details                                    | Counts/Levels                |
|--------------------|---------------------------------------------------|------------------------------|
| Total Frames       | 5,390                                             | 1,506 photos, 3,884 previews |
| Module Size Range  | 105–177 per side                                  | 5 levels                     |
| Print DPI          | 600, 1200                                         | 2 levels                     |
| Print Size         | 22–70 mm                                          | Continuous                   |
| Phones Used        | iPhone 6, 6 Plus, Nexus 4, 5, OnePlus One, etc.   | 8 devices                    |
| Lighting Scenarios | Indoor fluorescent/incandescent, daylight, shadow | Broad                        |
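The per-frame attributes tabulated above could be represented as a simple record; the field names below are hypothetical and do not reflect the dataset's actual metadata schema:

```python
from dataclasses import dataclass

# Hypothetical per-frame metadata record mirroring the attribute table;
# the dataset's actual file and metadata layout is not specified here.
@dataclass
class CQRCFrame:
    capture_mode: str      # "photo" or "preview"
    modules_per_side: int  # one of 105, 125, 137, 157, 177
    print_dpi: int         # 600 or 1200
    print_size_mm: float   # 22-70 mm side length
    phone: str             # e.g. "Nexus 5"
    lighting: str          # e.g. "indoor-fluorescent"

frame = CQRCFrame("photo", 177, 1200, 70.0, "Nexus 5", "daylight")
```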

2. Data Collection Protocol and Imaging Pipeline

Data acquisition was executed to maximize real-world relevance and diversity. QR code samples were printed at two resolutions (600 and 1200 dpi), with physical print sizes and densities selected to produce a range of module granularities. Sampling covers various phone tilt angles (up to ±30° yaw and pitch, with a near-uniform empirical distribution over this range) and illumination intensities spanning from under 50 lux (low light) to over 10,000 lux (bright sunlight), producing a qualitative joint distribution P(I, θ) ≈ P(I) · P(θ), where I is illumination and θ is tilt.
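A quick sketch of drawing capture conditions under the approximate independence P(I, θ) ≈ P(I) · P(θ); the log-uniform illumination model is an assumption (the text reports only the endpoints), and the tilt range follows the ±30° figure above:

```python
import math
import random

# Sample (illumination, tilt) pairs independently, per P(I, theta) ~ P(I)P(theta).
# Log-uniform illumination over ~50-10,000 lux is an assumed model;
# yaw/pitch are near-uniform over +/-30 degrees per the text.
def sample_condition(rng):
    lux = 10 ** rng.uniform(math.log10(50), math.log10(10000))
    yaw = rng.uniform(-30.0, 30.0)
    pitch = rng.uniform(-30.0, 30.0)
    return lux, yaw, pitch

rng = random.Random(0)
conditions = [sample_condition(rng) for _ in range(1000)]
```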

Device diversity is ensured by the inclusion of eight widely available smartphones, representative of both premium and mid-range market segments, with sensors ranging from 5 MP to 13 MP and varying stabilization and autofocus implementations. Capture modes include both still images and live camera-preview frames, reflecting practical barcode scanning scenarios.

Critical to the dataset's relevance are real-world distortions, which include:

  • Cross-Channel Interference (CCI): CMY printer layers bleed into the RGB response of the camera sensor.
  • Cross-Module Interference (CMI): Colorant spread between adjacent modules, increasingly significant at higher module densities.
  • Illumination Variation: Effects from low-light, overexposure, mixed color temperature sources.
  • Geometric Distortion: Arbitrary yaw/pitch, affine and perspective tilt, and minor misalignments.
  • Motion Blur and Resolution Limits: Handheld motion blur and the limited spatial resolution of mobile camera sensors.
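Cross-channel interference is commonly modeled as a linear mixing of the ideal color channels. As a minimal sketch, the 3×3 matrix below is illustrative only, not calibrated from CUHK-CQRC:

```python
# Sketch of cross-channel interference (CCI): the camera's RGB response to
# CMY printed colorants modeled as linear mixing through a 3x3 matrix.
# The matrix values are illustrative, not measured from the dataset.

def apply_cci(rgb, mix):
    """Mix an ideal (r, g, b) color through a 3x3 interference matrix."""
    return tuple(
        min(255, max(0, round(sum(mix[i][j] * rgb[j] for j in range(3)))))
        for i in range(3)
    )

# Mostly-diagonal matrix with small off-diagonal leakage between channels.
MIX = [
    [0.85, 0.10, 0.05],
    [0.08, 0.84, 0.08],
    [0.05, 0.10, 0.85],
]

distorted = apply_cci((255, 0, 0), MIX)  # pure red leaks into G and B
```

Recovering the ideal colors then amounts to (approximately) inverting this mixing, which is one motivation for the learned color-recovery models discussed below.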

3. Ground-Truth Annotation and Labeling Procedures

Each image in the dataset retains an unambiguous mapping from the code-space to physical modules, owing to algorithmic code generation by the HiQ encoder. Ground-truth annotation proceeds as follows: operators designate finder/alignment patterns in each image, a geometric transform is applied to rectify the distorted capture, and the exact 3-bit label for every module is determined by lookup from the encoder's records. No manual per-pixel labels are drawn; label assignment is deterministic after adjustment of pattern locations.

This process ensures exact ground-truth labeling for every module under the full spectrum of distortions encountered in the dataset.
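The rectification step in the annotation pipeline can be sketched as a homography estimated from the four designated pattern corners; this is a minimal direct-linear-transform solve, whereas production decoders add refinement and validity checks:

```python
import numpy as np

# Estimate the homography mapping detected finder/alignment pattern corners
# in the captured image onto their ideal positions in code space, via a
# minimal DLT (direct linear transform) solve.
def estimate_homography(src, dst):
    """src, dst: four (x, y) pairs each. Returns 3x3 H with dst ~ H @ src."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector of A with smallest
    # singular value, reshaped to 3x3 and normalized.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Sanity check: a square mapped to itself yields the identity transform.
corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
H = estimate_homography(corners, corners)
```

Once H is known, every module center in code space maps to a pixel location, and the 3-bit label for that module follows deterministically from the encoder's records.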

4. Dataset Splits and Statistical Properties

The training set consists of 65 images, strategically sampled so that every phone, code density, print resolution, and illumination type is represented. The remainder, 1,441 still photos and 3,884 preview frames, comprises the test set. Cross-validation was employed during classifier development, rather than introducing a separate held-out validation partition.

Approximately 0.6 million module samples are extracted from the training images to support statistical learning of robust color-recovery models. Empirical coverage spans diverse lighting/phone/print/pose combinations, supporting the intensive modeling of color distortions and geometric deformations in high-density color QR codes.
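Extracting one color sample per module from a rectified capture can be sketched as below; the sampling strategy and names are illustrative, and the paper's actual feature extraction is richer:

```python
import numpy as np

# Sample one color per module from a rectified (grid-aligned) capture by
# reading the pixel at each module's center. Illustrative only; real
# pipelines typically average over a small neighborhood per module.
def sample_modules(rectified, modules_per_side):
    """rectified: HxWx3 array aligned to the code; returns (n, n, 3) samples."""
    h, w, _ = rectified.shape
    ys = ((np.arange(modules_per_side) + 0.5) * h / modules_per_side).astype(int)
    xs = ((np.arange(modules_per_side) + 0.5) * w / modules_per_side).astype(int)
    return rectified[np.ix_(ys, xs)]

# A 177x177-module code imaged at ~6 pixels per module.
img = np.zeros((177 * 6, 177 * 6, 3), dtype=np.uint8)
samples = sample_modules(img, 177)
```

At the dataset's reported 5–8 pixels per module, each sample rests on only a handful of pixels, which is precisely why cross-module interference becomes significant at high densities.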

5. Benchmarking, Methodological Innovations, and Performance Outcomes

The HiQ framework introduces two learning-based classification models for decoding color QR codes:

  • LSVM-CMI: Linear SVM with explicit modeling of cross-module interference.
  • QDA-CMI: Quadratic Discriminant Analysis with joint color and module interference modeling.

Both classifiers optimize objective functions tailored to the chromatic distortions documented in CUHK-CQRC. HiQ's decoding accuracy is empirically benchmarked on the dataset against an established baseline ([2] in the original paper). On CUHK-CQRC, HiQ outperforms the baseline by at least 188% in decoding success rate and 60% in bit error rate (Yang et al., 2017).
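As a simplified, hypothetical stand-in for the QDA component (omitting the cross-module interference terms that distinguish QDA-CMI), per-class Gaussian discriminants over module colors can be sketched as:

```python
import numpy as np

# Plain QDA over per-module RGB features: fit one Gaussian per module color
# class, classify by quadratic discriminant score. The actual QDA-CMI
# objective additionally models cross-module interference, omitted here.
def fit_qda(X, y, n_classes):
    params = []
    for c in range(n_classes):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        params.append((mu, np.linalg.inv(cov),
                       np.linalg.slogdet(cov)[1],
                       np.log(len(Xc) / len(X))))
    return params

def predict_qda(params, X):
    scores = []
    for mu, icov, logdet, logprior in params:
        d = X - mu
        # Quadratic discriminant: -0.5 log|Cov| - 0.5 Mahalanobis + log prior
        scores.append(-0.5 * logdet
                      - 0.5 * np.einsum("ij,jk,ik->i", d, icov, d)
                      + logprior)
    return np.argmax(np.stack(scores, axis=1), axis=1)

# Two well-separated synthetic "module color" clusters in RGB space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((200, 40, 40), 10, (100, 3)),
               rng.normal((40, 40, 200), 10, (100, 3))])
y = np.repeat([0, 1], 100)
pred = predict_qda(fit_qda(X, y, 2), X)
```

Real module colors are far less separable than this synthetic example, since CCI and CMI push the class distributions toward one another; modeling those interference terms is the core methodological contribution of LSVM-CMI and QDA-CMI.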

6. Access, Licensing, and Usage Guidelines

CUHK-CQRC is publicly available for academic research purposes via http://www.authpaper.net/colorDatabase/index.html. Users are required to cite Yang et al., “Robust and Fast Decoding of High-Capacity Color QR Codes for Mobile Applications,” IEEE Trans. on Image Processing, 2017, when reporting results or conducting comparative studies with this dataset. The dataset licensing terms restrict usage to academic research contexts and require appropriate attribution of both the dataset and URL.

7. Significance and Applications

CUHK-CQRC provides a challenging and diverse resource for the machine learning and computer vision communities working on information-rich barcode systems under adverse real-world capture conditions. It enables methodologically rigorous evaluation of color QR decoding algorithms, particularly for scenarios demanding robustness to joint chromatic, geometric, and printing distortions as encountered in mobile and pervasive computing contexts. The dataset supports the exploration of novel statistical learning, color recovery, and spatial transformation strategies and serves as a benchmark for future research in high-density color barcode decoding (Yang et al., 2017).

References

  • Yang et al., "Robust and Fast Decoding of High-Capacity Color QR Codes for Mobile Applications," IEEE Transactions on Image Processing, 2017.
