RusLICA

Updated 4 February 2026

RusLICA is a web-based platform designed for automated linguistic inquiry and category analysis of Russian-language texts, specifically adapting the Linguistic Inquiry and Word Count (LIWC) methodology to the morphological, syntactic, and cultural specificities of Russian. Developed at the Laboratory of AI Application in Psychology, Russian Academy of Sciences, RusLICA integrates closed-dictionary category scoring with advanced NLP features, enabling the extraction of 96 linguistic and psycholinguistic features—including 42 specialized psycholinguistic categories—across user-supplied corpora. The service provides browser-based access, a programmatic API, and an extensible open-source backend for comprehensive research applications (Sigdel et al., 28 Jan 2026).

1. System Architecture and Infrastructure

RusLICA employs a three-tier web application model:

  • Presentation Layer: A browser-based UI (React/JavaScript) over HTTPS allows registered users to upload datasets (CSV/XLSX, UTF-8), monitor analysis jobs, review historical results, and download structured outputs (CSV/JSON). All client interaction occurs via a RESTful API.
  • Application/Processing Layer: Implemented in Python (≥3.7) using Flask, the backend coordinates job processing, authentication, and uploads. Job orchestration leverages Celery workers and a Redis or RabbitMQ message broker. Analysis tasks are distributed, enabling horizontal scaling and runtime limits (12 hours/job enforced by Nginx).
  • Storage Layer: PostgreSQL stores user information and job metadata, while result files and artifact dictionaries (7,492 Russian lemmas, frequency tables, RuWordNet synsets) reside on the file system. Pretrained NLP models (spaCy’s ru_core_news_lg, MyStem, ruBERT emotion classifier) are stored locally.

This architecture is fully containerized using Docker Compose, with discrete containers for Flask, Celery, the message broker, PostgreSQL, and Nginx. Nginx provides SSL offloading, rate-limiting, and job runtime enforcement.
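The containerized layout described above might be expressed in a Compose file along these lines; the service names and the choice of Redis as broker are illustrative, not taken from the published configuration:

```yaml
services:
  web:
    build: .
    command: flask run --host=0.0.0.0   # REST API (Flask)
  worker:
    build: .
    command: celery -A app worker        # distributed analysis jobs
    depends_on: [redis, db]
  redis:
    image: redis:7                       # message broker
  db:
    image: postgres:15                   # users + job metadata
  nginx:
    image: nginx:stable                  # SSL offload, rate limits, 12 h job cap
    ports: ["443:443"]
    depends_on: [web]
```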

2. API and User Interaction

All interactions occur under the /api/v1/ namespace using JSON and HTTP Bearer authentication:

  • Registration and Authentication: POST endpoints accept email/password for user accounts and return JWT tokens.
  • Job Submission: Users POST datasets and select analysis mode (general or lexical_detailed). Each job receives a unique ID and status.
  • Job Management: Users can list, query status, retrieve results, or delete jobs. Results are streamed as tab-separated CSVs (one row per text, 96 feature columns); lexical_detailed mode additionally exposes per-category lemma counts in JSON.
  • Processing Constraints: Jobs are subject to a strict 12-hour runtime cap; large files are handled through task parallelism.

An example workflow includes dataset upload, asynchronous processing monitored via polling, and structured download of analysis outputs.
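The workflow above can be sketched as a small client. The base URL comes from the paper, but the endpoint paths under `/api/v1/` and the field names (`job_id`, `status`) are assumptions, not documented API details; the HTTP callables are injected so the polling logic is testable without a live server.

```python
import time

API = "https://ruslica.ipran.ru/api/v1"  # base URL from the paper


def submit_job(post, token, file_bytes, mode="general"):
    """POST a dataset and return the new job's ID.

    `post` is any callable (url, headers, data) -> dict, e.g. a thin
    wrapper around requests.post; field names here are illustrative.
    """
    headers = {"Authorization": f"Bearer {token}"}
    resp = post(f"{API}/jobs", headers, {"file": file_bytes, "mode": mode})
    return resp["job_id"]


def wait_for_result(get, token, job_id, poll_s=5.0, timeout_s=12 * 3600):
    """Poll job status until it finishes; the 12 h cap mirrors the server limit."""
    headers = {"Authorization": f"Bearer {token}"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get(f"{API}/jobs/{job_id}", headers)["status"]
        if status in ("done", "failed"):
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} exceeded {timeout_s}s")
```

In practice `post` and `get` would wrap `requests.post`/`requests.get` with JSON decoding and error handling.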

3. Processing Pipeline and Feature Extraction

The text analysis pipeline includes the following sequential steps:

  • Preprocessing: Lowercasing, special token replacement (URLs, hashtags, numbers, emojis), and normalization of punctuation.
  • Tokenization and Lemmatization: spaCy facilitates tokenization and sentence segmentation; MyStem performs POS disambiguation and lemma extraction.
  • Feature Extraction:
  1. General statistics (e.g., N_words, mean/max_word_length, token counts).
  2. Frequency features derive from Russian National Corpus frequency dictionaries (mean/min IPM and D).
  3. Dependency-based features leverage Universal Dependencies parse trees (spaCy) to compute metrics such as tree depth and relation counts (acl, advcl, etc.).
  4. Morphological analysis yields counts of pronouns (by person), adjectives (by degree), adverbs, verbs (by mood/aspect/voice/tense), and root-position verb forms.
  5. Lexical scoring employs a closed dictionary of 7,492 lemmas mapped to categories. For each psycholinguistic category C, the score is:

     P(C) = \frac{\sum_{\text{word} \in C} \text{count}(\text{word})}{N_{\text{word}}}

  6. Emotion classification uses a fine-tuned “Aniemore/rubert-tiny2” model to predict one of seven emotions (neutral, happiness, sadness, enthusiasm, fear, anger, disgust).
  • Aggregation: All features (15 general, 2 frequency, 11 syntactic, 27 morphological, 42 lexical, 1 emotion) are concatenated into a per-text feature row (96 columns, reflecting documented aggregation logic).

Horizontal scalability is achieved by augmenting the number of Celery workers; jobs exceeding permitted runtime are terminated.
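As a minimal sketch of the lexical-scoring step (step 5 above), with a toy two-lemma lexicon standing in for the 7,492-lemma dictionary:

```python
from collections import Counter


def category_scores(lemmas, lexicon):
    """Score each category as the share of lemmas mapped to it.

    `lemmas` is the MyStem-normalized token stream of one text;
    `lexicon` maps lemma -> category name (toy stand-in for the
    closed dictionary). Implements
    P(C) = sum_{word in C} count(word) / N_word.
    """
    n_word = len(lemmas)
    counts = Counter(lexicon[l] for l in lemmas if l in lexicon)
    return {cat: c / n_word for cat, c in counts.items()} if n_word else {}


# Toy example: любовь -> Love, страх -> Fear (illustrative mapping)
toy_lexicon = {"любовь": "Love", "страх": "Fear"}
scores = category_scores(["я", "любовь", "и", "страх", "любовь"], toy_lexicon)
# Love = 2/5 = 0.4, Fear = 1/5 = 0.2
```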

4. Lexicon Construction and Category Taxonomy

RusLICA’s lexicon is not a translation of English LIWC, but was independently constructed via:

  • Expert lexicographic resources: “Alphabet of Emotions” (Babenko, 2021), ideographic verb dictionary (Babenko, 1999), Shvedova’s “Semantic Dictionary of Russian” (1998), and Dostoyevsky’s historical intensifier dictionary.
  • Semi-automatic corpus extraction from the Russian National Corpus to identify high-frequency swearing, referents, and temporal/spatial terms.
  • RuWordNet synset expansion, leveraging hyponym/hyperonym and other semantic relations, reviewed by expert curation.

Lemmas are always normalized using MyStem to ensure one-to-one correspondence with dictionary entries. The final 42 lexical/psycholinguistic categories—organized hierarchically—include:

| Group | Example Categories | Notable Sources |
| --- | --- | --- |
| Linguistic Dimensions | Modality, Negations, Intensifiers | Dostoyevsky’s intensifiers |
| Psychological Processes | Love, Fear, Certitude, Causation, Family | Alphabet of Emotions (2021) |
| Social Processes | Feud, Social Relations, Gender Referents | Ideographic verbs, RNC |
| Physical, Perceptual, Motive | Illness, Cognition, Auditory, Curiosity | Semantic Dictionary, RNC |
| Lifestyle, Time, Space | Religion, Time, Space | Shvedova, RuWordNet |

Each dictionary lemma and its morphological variants map to a single canonical category. For example, любовь (“love”) and derivatives are grouped under “Love,” with N_{\text{Love}} calculated as the sum of their frequencies.

5. Output Structure and Interpretation

The generated output per text consists of:

  • General statistics: token and character counts, number of emojis, mean/max lengths.
  • Frequency metrics: mean IPM, minimum corpus frequencies.
  • Syntactic and morphological metrics: UD relation counts, sentence tree depths, detailed part-of-speech breakdowns.
  • Lexical category percentages: e.g., Love = 0.012 indicates 1.2% of words in the “Love” category.
  • Emotion prediction: a categorical label per text.

Interpretation mirrors LIWC conventions: a higher Certitude score indexes frequent use of certainty-related lexemes (“безусловно,” “истинно”); elevated mean sentence depth and subordinate clause counts correlate with syntactic complexity; first-person pronoun frequency suggests self-referential narrative or perspective. This feature set permits quantitative psycholinguistic profiling with direct comparability across corpora and genres.
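Reading and interpreting a downloaded results file can look like the sketch below; the column names (`N_words`, `Love`, `emotion`) are plausible but assumed, since the paper does not list the exact header row, and the two-column sample stands in for the real 96-column TSV.

```python
import csv
import io

# A tiny stand-in for a downloaded tab-separated results file.
sample = "N_words\tLove\tCertitude\temotion\n120\t0.012\t0.030\thappiness\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    love = float(row["Love"])
    # LIWC-style reading: 0.012 -> 1.2% of the text's words are "Love" lemmas.
    print(f"{love:.1%} Love lexemes, predicted emotion: {row['emotion']}")
```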

6. Deployment, Usage, and Performance

RusLICA is publicly accessible at https://ruslica.ipran.ru/ following free registration. For private or institutional deployment:

  • Docker images for Flask, Celery, Redis, PostgreSQL, and Nginx are configured; pretrained models and dictionaries are placed in /app/models/ and /app/dicts/.
  • Environment variables define runtime policy (e.g., MAX_JOB_RUNTIME=12h).
  • The system is initialized via Docker Compose, launching all containers and exposing the React UI.
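A runtime policy value like `MAX_JOB_RUNTIME=12h` has to be parsed somewhere; a hypothetical helper (not part of the published codebase) might convert it to seconds like this:

```python
import re


def parse_runtime(value):
    """Convert a duration string like '12h', '90m', or '3600s' to seconds.

    Hypothetical helper: the real RusLICA code may read the variable
    differently; this just shows the '12h' -> 43200 s convention.
    """
    m = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not m:
        raise ValueError(f"unrecognized duration: {value!r}")
    n, unit = int(m.group(1)), m.group(2)
    return n * {"s": 1, "m": 60, "h": 3600}[unit]
```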

Software stack dependencies include Python (3.7+), Flask (RESTful, JWT-Extended), Celery 5.x, Redis, spaCy 3.x (ru_core_news_lg), pymystem3, NumPy, pandas, openpyxl, and transformers. Processing throughput ranges from 1,000 to 2,000 texts/minute on a 4-core CPU with ~8 GB RAM, limited mainly by spaCy's dependency parsing and lemmatization components.

7. Planned Extensions and Validation Directions

Planned enhancements include:

  • Lexicon Expansion: Incorporation of bigrams, idioms, slang, emoticons; integration of RuSentiLex for nuanced sentiment polarity.
  • Model Upgrades: Fine-tuning of transformer models for Big-Five personality traits, well-being, and deception detection; addition of semantic role labeling and discursive frame extraction.
  • Feature Innovation: Development of hybrid closed/open vocabulary metrics using topic models and contextual embeddings; enabling advanced syntactic analysis via tree kernels or graph neural networks.
  • API Ecosystem: Introduction of R/Python client libraries; optimization of bulk endpoints for large-scale, real-time analysis (e.g., social media streams).
  • Empirical Validation: Psychometric studies on personality and well-being corpora to benchmark convergent validity against both LIWC and behavioral criteria.

These planned directions indicate that RusLICA aims to evolve continually toward richer, context-sensitive, and domain-adaptable psycholinguistic analysis resources for Russian-language research (Sigdel et al., 28 Jan 2026).
