Tacit Understanding Game (TUG) Compatibility
- TUG is a synchronous, two-player word association game designed to predict compatibility using minimal, privacy-preserving data.
- It leverages sequence-level word choices and advanced supervised learning methods with both real and synthetic datasets.
- Empirical results show TUG achieves competitive accuracy compared to traditional surveys, supporting scalable, privacy-preserving compatibility assessment.
The Tacit Understanding Game (TUG) is a synchronous, two-player, web-based word association game designed to infer interpersonal compatibility from privacy-preserving behavioral traces. It is positioned as an alternative to traditional relationship quality measurement, mitigating the ecological and privacy limitations of lengthy questionnaires and invasive text sampling by leveraging the predictive power of sequence-level, single-word choices in a competitive, game-like setting. TUG formalizes compatibility prediction as a supervised learning problem, supported by both real-world and synthetic datasets, enabling downstream application in digital social platforms without soliciting self-disclosures or personal free-form text (Li et al., 13 Dec 2025).
1. Game Mechanics and Experimental Protocol
TUG sessions involve two participants engaging in ten rounds of a word association task. In each round, both players are presented with a shared "keyword" sampled from one of fifteen thematic categories (e.g., Music, Adventure, Humor) and a grid of twenty candidate words contextually related to the chosen theme. Each player then independently selects the $q$ words they judge most semantically resonant with the keyword, where $q$ (the "quota") is randomly sampled per round.
Following submission, players are shown their "Word Choice Match Rate" (WCMR), defined as the fraction of word choices the two players made in common in that round. WCMR is linearly mapped to a round score whose ceiling scales with the quota: maxima of $30$, $40$, and $50$ points correspond to the quota levels in increasing order. Additional streak bonuses are awarded for consecutive high-alignment rounds. Cumulative scores are posted to a public leaderboard to promote both self-reflection and competitive engagement. The architecture thus combines behavioral data collection with ecological validity, leveraging a "Games-With-a-Purpose" paradigm to transform psychometric assessment into a minimally intrusive, intrinsically motivating activity (Li et al., 13 Dec 2025).
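The round-scoring logic described above can be sketched as follows. The concrete quota set `{3, 4, 5}` and the use of the quota as the WCMR denominator are illustrative assumptions; only the 30/40/50 score ceilings are stated in the source.

```python
# Hypothetical mapping from quota to maximum round score. The quota values
# {3, 4, 5} are an assumption for illustration; the 30/40/50 ceilings are
# the ones reported for TUG.
MAX_SCORE_BY_QUOTA = {3: 30, 4: 40, 5: 50}

def wcmr(choices_a: set, choices_b: set, quota: int) -> float:
    """Word Choice Match Rate: fraction of the quota both players matched."""
    return len(choices_a & choices_b) / quota

def round_score(choices_a: set, choices_b: set, quota: int) -> float:
    """Linearly map WCMR onto the quota-dependent maximum round score."""
    return wcmr(choices_a, choices_b, quota) * MAX_SCORE_BY_QUOTA[quota]

# Example: quota of 3, two of the three selections overlap.
a = {"melody", "rhythm", "lyrics"}
b = {"melody", "rhythm", "concert"}
```

With this toy pair, `wcmr(a, b, 3)` is 2/3 and the round score is 20 of a possible 30 points; streak bonuses would be layered on top of this base score.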
2. Data Acquisition and Ground Truth Annotation
2.1 Crowdsourced Dataset
Initial data collection involved beta testing with 15 self-nominated pairs (10 romantic couples, 5 friends), each completing at least one full session (10 rounds), yielding a total of 150 rounds and approximately 30–50 unique word selections per pair.
2.2 Psychological Labeling
Pairwise compatibility ground truth was assigned using the Unidimensional Relationship Closeness Scale (URCS; 12 items rated on a 7-point Likert scale). Individual-level psychological attributes were assessed through the BFI-10, a short-form instrument for Big Five trait estimation (5-point Likert). Consistency was ensured by strict adherence to validated administration protocols for these measures. Demographically, participants were aged 20–35 (mean ≈ 26), with a gender distribution of 60% female and 40% male. Reported relationship duration for couples spanned 1–5 years (Li et al., 13 Dec 2025).
3. Modeling and Feature Engineering
3.1 Mathematical Framework
Compatibility is operationalized as a continuous scalar $y \in [0, 1]$, with an optional binary regime at a fixed threshold $\tau$. For each player, selections across the 10 rounds are concatenated as "theme + keyword + selected words" text, embedded with pre-trained Sentence-BERT (384 dimensions), and mean-pooled to yield a single vector per participant. Both vectors are transformed via a shared multilayer perceptron (MLP) encoder $f_\theta$, producing latents $z_A$ and $z_B$. Predicted compatibility is given by scaled cosine similarity:

$$\hat{y} = \tfrac{1}{2}\left(1 + \cos(z_A, z_B)\right).$$

Auxiliary prediction heads estimate per-player psychological attributes $\hat{a}_A$ and $\hat{a}_B$ from each latent $z_i$ for additional regularization. The composite loss on a labeled pair combines the compatibility error with the auxiliary errors:

$$\mathcal{L} = (\hat{y} - y)^2 + \lambda\left[\lVert \hat{a}_A - a_A \rVert^2 + \lVert \hat{a}_B - a_B \rVert^2\right],$$

where $\lambda$ weights the auxiliary terms (Li et al., 13 Dec 2025).
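The twin-encoder scoring and loss can be sketched in NumPy. The two-layer MLP depth, the auxiliary-target dimensionality (e.g., Big Five traits), and the loss weight `lam=0.1` are illustrative assumptions, not values from the source.

```python
import numpy as np

def mlp_encode(x, w1, b1, w2, b2):
    """Shared MLP encoder f_theta (illustrative two-layer ReLU network)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def predicted_compatibility(z_a, z_b) -> float:
    """Scaled cosine similarity mapping [-1, 1] onto [0, 1]."""
    cos = z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b))
    return 0.5 * (1.0 + cos)

def composite_loss(y_hat, y, aux_hat_a, aux_a, aux_hat_b, aux_b, lam=0.1):
    """Squared compatibility error plus lambda-weighted auxiliary errors."""
    aux = np.sum((aux_hat_a - aux_a) ** 2) + np.sum((aux_hat_b - aux_b) ** 2)
    return (y_hat - y) ** 2 + lam * aux
```

For identical latents the predicted compatibility is exactly 1.0, and a pair predicted perfectly on both heads incurs zero composite loss, matching the equations above.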
3.2 Feature Pipeline
Inputs per round are embedded via Sentence-BERT, pooled across rounds to form a session-level representation, and processed as described above.
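The per-round embedding and session-level mean pooling can be sketched as below. A deterministic pseudo-embedder stands in for Sentence-BERT so the sketch stays self-contained; in practice the `embed_round` body would call a real sentence encoder.

```python
import numpy as np

def embed_round(theme: str, keyword: str, words: list, dim: int = 384) -> np.ndarray:
    """Stand-in for Sentence-BERT: pseudo-embedding of the concatenated
    'theme + keyword + selected words' text (deterministic within one
    process; a real model replaces this in practice)."""
    text = f"{theme} {keyword} " + " ".join(words)
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def session_embedding(rounds: list, dim: int = 384) -> np.ndarray:
    """Mean-pool per-round embeddings into one session-level vector,
    as in the feature pipeline described above."""
    return np.mean([embed_round(t, k, w, dim) for t, k, w in rounds], axis=0)
```

Mean pooling keeps the session representation fixed-size regardless of round count, which is what allows a single shared MLP encoder downstream.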
4. Synthetic Data Generation via Simulation and LLM Annotation
To attenuate data scarcity, a synthetic corpus is generated through stochastic simulation. Each synthetic round begins by sampling twenty-one words from a thematic lexicon curated through SBERT clustering; the centroid-proximal word becomes the "keyword," with the remainder as the candidate set. Simulated agents select their quota of words per round by sampling candidates with probability proportional to SBERT-encoded semantic similarity to the keyword.
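The similarity-proportional agent policy can be sketched as temperature-sharpened sampling without replacement; the softmax form and the `temperature` value are assumptions for illustration.

```python
import numpy as np

def select_words(candidate_embs: np.ndarray, keyword_emb: np.ndarray,
                 quota: int, temperature: float = 0.1, rng=None) -> np.ndarray:
    """Sample `quota` distinct candidate indices with probability
    proportional to (softmax-sharpened) cosine similarity to the keyword."""
    if rng is None:
        rng = np.random.default_rng(0)
    sims = candidate_embs @ keyword_emb / (
        np.linalg.norm(candidate_embs, axis=1) * np.linalg.norm(keyword_emb))
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    return rng.choice(len(probs), size=quota, replace=False, p=probs)
```

Sampling (rather than greedy top-`quota` selection) injects the behavioral noise that makes simulated sessions span a range of alignment levels.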
Round-level compatibility labels are provided by Google Gemini 2.0 Flash, which acts as a semantic oracle. The prompting protocol enforces bounded real-valued outputs and stresses conceptual rather than surface overlap. Ten rounds, stratified by labeled compatibility score, are aggregated per synthetic session to approximate various levels of relational alignment. The final synthetic corpus comprises 400 session pairs (4,000 rounds). This bootstrapping pipeline enables supervised pre-training of the compatibility model prior to application on scarce human-annotated data (Li et al., 13 Dec 2025).
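The session-assembly step can be sketched as grouping rounds of similar oracle-labeled compatibility; interpreting "stratified" as sort-then-chunk, and taking the session label as the mean of its round labels, are both assumptions.

```python
import numpy as np

def assemble_sessions(round_labels: np.ndarray,
                      rounds_per_session: int = 10) -> list:
    """Group rounds with similar compatibility labels into synthetic
    sessions (sort-then-chunk stratification, an assumed reading of the
    protocol); each session's label is the mean of its round labels."""
    order = np.argsort(round_labels)  # sort rounds by oracle label
    sessions = []
    for i in range(0, len(order) - rounds_per_session + 1, rounds_per_session):
        idx = order[i:i + rounds_per_session]
        sessions.append((idx, float(np.mean(round_labels[idx]))))
    return sessions
```

With 4,000 labeled rounds this yields the reported 400 sessions of 10 rounds each, spanning low- to high-alignment pairs.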
5. Empirical Evaluation and Benchmarking
The main results are assessed on the 15 real-user pairs using both regression and classification metrics. The evaluation reports:
- Pearson correlation between predicted and URCS-derived compatibility (indicating moderate rank-order agreement);
- Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of the continuous predictions;
- Binary accuracy, precision, recall, and F1-score under the thresholded regime.
A comparative summary against the random-matching baseline and the survey-based URCS is tabulated in the source (Li et al., 13 Dec 2025).
TUG substantially outperforms the random-matching baseline (≈50% binary accuracy) while approaching the performance of explicit survey-based indicators such as the URCS, without requiring participants to disclose explicit relationship information. This suggests that word-choice sequences can reliably encode both individual personality traits and dyadic compatibility (Li et al., 13 Dec 2025).
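The metrics named above can be computed from predicted and ground-truth compatibility scores as follows; the 0.5 binarization threshold is an illustrative assumption.

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, threshold: float = 0.5) -> dict:
    """Regression and thresholded-classification metrics for compatibility
    prediction. The 0.5 threshold stands in for the paper's tau."""
    pearson = np.corrcoef(y_true, y_pred)[0, 1]
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    t, p = y_true >= threshold, y_pred >= threshold
    tp = np.sum(p & t)
    fp = np.sum(p & ~t)
    fn = np.sum(~p & t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return dict(pearson=pearson, mae=mae, rmse=rmse,
                accuracy=float(np.mean(t == p)),
                precision=precision, recall=recall, f1=f1)
```

Reporting both regression (Pearson, MAE, RMSE) and classification (accuracy, precision, recall, F1) views matters here because the continuous score and the thresholded "compatible / not" decision serve different downstream uses.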
6. Privacy Safeguards and Ecological Validity
TUG collects only word selections and semantic scores, with no storage of personally identifiable information (PII) or unconstrained free-form text. All logs are anonymized and encrypted, conforming to a "privacy by design" methodology. The minimized interface and quota-limited responses reduce user discomfort and mitigate risks associated with self-presentation or response bias, in contrast to conventional surveys and social media crawling. The synthetic data generation process further decouples model training from direct exposure to real-user information, facilitating secure deployment in contexts subject to privacy regulations such as GDPR (Li et al., 13 Dec 2025).
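A minimal-collection log record consistent with this design could look like the sketch below; the keyed-hash pseudonymization scheme, the salt handling, and the field names are assumptions, not the system's documented implementation.

```python
import hashlib
import hmac
import json

SERVER_SALT = b"rotate-me"  # hypothetical server-side secret, never logged

def pseudonymize(player_id: str) -> str:
    """Keyed hash (HMAC-SHA256) so raw identifiers never reach the
    analytics store, while one player still maps to one stable token."""
    return hmac.new(SERVER_SALT, player_id.encode(), hashlib.sha256).hexdigest()[:16]

def make_log_record(player_id: str, round_no: int, theme: str,
                    keyword: str, selections: list, wcmr: float) -> str:
    """Serialize only word selections and semantic scores -- no PII and
    no free-form text -- matching the minimal-collection principle."""
    record = {"pid": pseudonymize(player_id), "round": round_no,
              "theme": theme, "keyword": keyword,
              "selections": selections, "wcmr": wcmr}
    return json.dumps(record, sort_keys=True)
```

Because the stored token is a keyed hash, linking records back to a person requires the server-side secret, which supports the "privacy by design" posture described above.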
7. Platform Integration and Practical Implications
Potential applications include integration into dating and social networking platforms, where the lightweight, 10-round TUG mini-game can generate pairwise compatibility signals without sensitive questioning. In group and work contexts, TUG-derived compatibility metrics may expedite team formation and collaboration prediction, leveraging behavioral signals over explicit inventories. Scalability is achieved through automated real-time session management, synthetic pre-training augmentation, and leaderboard-based incentive design.
- A plausible implication is that ongoing model improvement will require continuous feedback integration, demographic fairness auditing, and the introduction of explainability layers (e.g., SHAP) for interpretability of compatibility judgments. The TUG design space supports iterative refinement toward robust, privacy-preserving inference pipelines for social computing (Li et al., 13 Dec 2025).