RusLICA: Russian Psycholinguistic Analysis Platform
- RusLICA is a web service that applies an adapted LIWC framework to analyze Russian texts through psycholinguistic and linguistic features.
- It employs advanced techniques including dependency parsing, morphological analysis, and emotion classification via pre-trained language models.
- The platform features a modular, scalable architecture with a RESTful API, secure authentication, and containerized deployment for high-throughput text processing.
RusLICA (Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis) is a publicly accessible web service for in-depth psycholinguistic and linguistic feature analysis of Russian texts. Developed by the Laboratory of AI Application in Psychology (RAS), RusLICA adapts the closed-dictionary Linguistic Inquiry and Word Count (LIWC) framework to accommodate the morphological, syntactic, and cultural specificities of the Russian language. The platform integrates dependency-based parsing, extensive morphological and frequency metrics, and emotion classification via pre-trained LLMs, addressing the limitations of direct thesaurus translation by constructing a Russian-specific lexicon from authoritative sources (Sigdel et al., 28 Jan 2026).
1. Architecture and Components
RusLICA is architected as a modular three-tier web application. The front-end is a browser-based user interface implemented in JavaScript (React), supporting functionalities such as dataset upload, job monitoring, historical result viewing, and download of analysis outputs in CSV or JSON formats. Interaction with the system is conducted via a RESTful API over HTTPS, with authentication and data exchanges encapsulated in secure tokens.
The back-end, written in Python (≥3.7) using the Flask framework, orchestrates analysis jobs through a Celery-managed task queue, with Redis or RabbitMQ serving as the broker. Upon submission, a Celery worker processes the task, applies the analytic pipeline, updates the job status, and persists results. A PostgreSQL relational database stores user credentials, job metadata, and file pointers, while the file system holds the lexicon, frequency resources, parsed corpora, and pre-trained models, including spaCy’s ru_core_news_lg and the rubert-tiny2 emotion classifier.
Deployment utilizes Docker Compose, segregating application logic, workers, database, and reverse proxy (Nginx) into dedicated containers. The Nginx reverse proxy handles SSL termination and rate limiting, and enforces a 12-hour runtime cap per analysis job.
2. API Specification and Usage
All endpoints reside under /api/v1/ and operate using HTTP Bearer authentication; inputs and outputs are encoded as JSON. Key endpoints include:
- POST /api/v1/register: Registers a new user.
- POST /api/v1/login: Authenticates and returns a JWT token.
- POST /api/v1/jobs: Submits text data for analysis, accepting .csv or .xlsx files with a “text” column, and a processing mode (“general” or “lexical_detailed”).
- GET /api/v1/jobs: Returns a user’s job metadata.
- GET /api/v1/jobs/{job_id}/results: Streams analysis results as tab-separated CSV with 96 feature columns per text; detailed mode offers per-lemma category counts in JSON.
- DELETE /api/v1/jobs/{job_id}: Removes a stored job and its artifacts.
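A minimal stdlib-only client sketch against these endpoints might look as follows; the request/response field names (`username`, `password`, `token`) are assumptions not specified in the endpoint listing above:

```python
import json
import urllib.request

BASE = "https://ruslica.ipran.ru/api/v1"

def bearer(token):
    # All /api/v1/ endpoints use HTTP Bearer authentication.
    return {"Authorization": f"Bearer {token}"}

def login(username, password):
    # POST /api/v1/login authenticates and returns a JWT;
    # the field names used here are illustrative assumptions.
    body = json.dumps({"username": username, "password": password}).encode()
    req = urllib.request.Request(
        f"{BASE}/login", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def list_jobs(token):
    # GET /api/v1/jobs returns the user's job metadata as JSON.
    req = urllib.request.Request(f"{BASE}/jobs", headers=bearer(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

File submission via POST /api/v1/jobs additionally requires a multipart upload of the .csv or .xlsx file, which is easier with a third-party HTTP client.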
A single analysis job can process files of up to 100,000 entries, with throughput of 1,000–2,000 texts per minute on a typical four-core CPU with ~8 GB of memory; throughput is limited chiefly by the NLP parsing stages.
3. Processing Pipeline and Feature Extraction
The analytic pipeline comprises several sequential stages:
- Preprocessing:
  - Lowercasing and replacement of URLs, hashtags, numbers, and emojis with unique placeholder tokens.
  - Reduction of repeated punctuation characters.
- Tokenization and Lemmatization:
  - Segmentation and sentence-boundary detection via spaCy (ru_core_news_lg).
  - POS disambiguation and lemmatization using MyStem.
- Feature Extraction:
  - Statistical and frequency metrics: word counts; mean/max word and sentence lengths; token, emoji, and URL counts; repeated-word statistics; mean/min/max IPM and D frequencies from Russian National Corpus-derived lexica.
  - Syntactic features: Universal Dependencies (UD) parse trees per sentence; counts of relations (acl, advcl, conj, etc.); tree-depth metrics.
  - Morphological features: counts of POS and morphological attributes (personal pronouns, adjective grades, verb moods/aspects/voices/tenses), including dedicated tracking of verbs in root sentence positions.
  - Lexical category scoring: closed-dictionary lookup after MyStem lemmatization; each of 42 psycholinguistic categories is scored as a proportion of matched tokens, as in LIWC.
  - Emotion classification: a fine-tuned “Aniemore/rubert-tiny2” model labels each input with one of seven emotions.
- Aggregation: all features (15 general statistics, 2 frequency, 11 syntactic, 27 morphological, 42 lexical, and one emotion label, with adjustments for minima/maxima yielding 96 dimensions) are consolidated per input text.
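The preprocessing stage above can be sketched with stdlib regular expressions; the placeholder tokens (`<url>`, `<hashtag>`, `<num>`) are illustrative choices rather than the platform's actual token inventory, and emoji handling is omitted for brevity:

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")          # \w covers Cyrillic in Python 3
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
PUNCT_RE = re.compile(r"([!?.,])\1+")     # collapse repeated punctuation

def preprocess(text):
    # Lowercase, then replace URLs, hashtags, and numbers with
    # unique placeholder tokens; finally reduce repeated punctuation.
    text = text.lower()
    text = URL_RE.sub("<url>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    text = NUM_RE.sub("<num>", text)
    text = PUNCT_RE.sub(r"\1", text)
    return text
```

The normalized text then feeds into spaCy segmentation and MyStem lemmatization before feature extraction.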
The pipeline is horizontally scalable, supporting cluster-style parallelization through additional Celery workers.
4. Lexicon Construction and Category Hierarchy
RusLICA’s dictionary of 7,492 lemmatized Russian entries (8,309 category assignments) was developed de novo, rather than by bulk translation of English LIWC. Lexicon construction drew from:
- L. Babenko’s “Alphabet of Emotions” (2021) for emotional vocabulary,
- Babenko’s ideographic verb dictionary for motion and interpersonal acts,
- Shvedova’s “Semantic Dictionary of Russian” for abstract and physical domains,
- Corpus-extracted (RNC) entries for swear, gendered, temporal, and spatial lexemes,
- RuWordNet expansions for synset-level semantic relations.
Lexical categories (42 in total) are organized as follows:
| Hierarchy | Categories (sample/total) | Examples |
|---|---|---|
| Linguistic | Modality, Negations, Intensifiers (3) | “может” (can), “не” (not), historical intensifiers |
| Affective | 11 emotions (Happiness, Love, Shame, etc.), Swear words | “любовь” (love), “грусть” (sadness), expletives |
| Social | Feud, Arrogance, Interpersonal, Social referents, Family (9) | “друг” (friend), “семья” (family), “мужчина” (male) |
| Physical | Illness, Mental Health, Sexuality, Death (4) | “болезнь” (illness), “смерть” (death) |
| Perception | Auditory, Visual, Motion (3) | “слушать” (listen), “идти” (go) |
| Motives | Curiosity, Passion (2) | “любознательный” (curious) |
| Cognition | Belief, Causation, Doubt, Certitude (4) | “сомнение” (doubt), “уверенность” (certitude) |
| Lifestyle | Religion (1) | “религия” (religion), “церковь” (church) |
| Time/Space | Time, Space (2) | “время” (time), “место” (place) |
All dictionary entries are normalized using MyStem to guarantee robust, lemma-level matching between text and category.
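Lemma-level matching against the closed dictionary can be illustrated with a toy lookup; the three-entry lexicon below is a stand-in for the real 7,492-entry resource, with category names taken from the table above:

```python
from collections import Counter

# Toy stand-in for the lemmatized lexicon:
# lemma -> set of psycholinguistic category labels.
LEXICON = {
    "любовь": {"Love"},
    "смерть": {"Death"},
    "не": {"Negations"},
}

def score_categories(lemmas):
    # LIWC-style scoring: each category's value is the fraction
    # of input tokens whose lemma carries that category label.
    counts = Counter()
    for lemma in lemmas:
        for cat in LEXICON.get(lemma, ()):
            counts[cat] += 1
    n = len(lemmas) or 1
    return {cat: c / n for cat, c in counts.items()}
```

Because both the lexicon entries and the input tokens are MyStem lemmas, inflected forms (“любви”, “любовью”) collapse to a single dictionary key before lookup.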
5. Output Interpretation
Output files contain scalar features per text, with interpretative guidelines mirroring standard LIWC practices. These include:
- General Statistics: Token counts, emoji counts, average character lengths.
- Lexical Category Percentages: Fraction of words per psycholinguistic category. For example, Love = 0.012 indicates that 1.2% of tokens are love-related lexemes.
- Syntactic and Morphological Indices: Complexity metrics (sentence depth, subordination), pronoun usage (e.g., N_pronn_pers_first for self-reference).
- Frequency Features: How representative or rare the vocabulary is across Russian registers.
- Emotion Labels: Categorical prediction among the defined emotion classes.
As in LIWC, higher lexical category values are interpreted as increased thematic salience. For example, a marked Certitude score reflects increased presence of confidence-marking lexemes (e.g., “безусловно,” “истинно”). Complex syntactic structures (high mean_sent_depth, subordinate clause counts) suggest argumentation density, and elevated first-person pronoun counts indicate self-focused content.
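Downloaded results can be consumed directly with the stdlib csv module; the two columns in the toy excerpt below stand in for the full 96-column output, and the exact header spellings (`Love`, `mean_sent_depth`) are assumptions:

```python
import csv
import io

def load_results(tsv_text):
    # The results endpoint streams tab-separated rows with one
    # feature vector per analyzed text.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)

# Toy two-column excerpt standing in for the 96-column output.
sample = "Love\tmean_sent_depth\n0.012\t3.4\n"
rows = load_results(sample)
love = float(rows[0]["Love"])  # 0.012, i.e. 1.2% love-related tokens
```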
6. Deployment, Access, and Integration
RusLICA is freely accessible at https://ruslica.ipran.ru/ after registration. To operate a local instance, one must set up the prescribed container stack (Flask server, Celery worker, Redis broker, PostgreSQL database, Nginx reverse proxy), preload necessary dictionaries and pre-trained models, and configure essential environment variables (e.g., FLASK_SECRET_KEY, DATABASE_URL, REDIS_URL, MAX_JOB_RUNTIME). Software dependencies include Python 3.7+, relevant Flask/Celery/Redis modules, spaCy 3.x (ru_core_news_lg), pymystem3, pandas, openpyxl, and Hugging Face transformers for model inference. No explicit input size constraint exists aside from the enforced processing time ceiling.
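A local deployment following this container layout might be sketched in a Docker Compose file such as the one below; the image tags, ports, Celery module path, and credential values are illustrative assumptions, not the project's actual configuration:

```yaml
services:
  web:
    build: .                           # Flask app serving /api/v1/
    environment:
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
      - DATABASE_URL=postgresql://ruslica:ruslica@db:5432/ruslica
      - REDIS_URL=redis://redis:6379/0
      - MAX_JOB_RUNTIME=43200          # 12-hour cap; seconds assumed
  worker:
    build: .
    command: celery -A app.celery worker   # module path is an assumption
    environment:
      - DATABASE_URL=postgresql://ruslica:ruslica@db:5432/ruslica
      - REDIS_URL=redis://redis:6379/0
  db:
    image: postgres:15
  redis:
    image: redis:7
  nginx:
    image: nginx:stable                # SSL termination, rate limiting
    ports:
      - "443:443"
```

Dictionaries and pre-trained models (ru_core_news_lg, rubert-tiny2) would additionally need to be mounted or baked into the application image.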
7. Planned Extensions and Research Directions
Extensions outlined for RusLICA focus on deeper lexical, semantic, and integrative capabilities, including:
- Lexicon augmentation to include bigrams, idioms, slang, emoticons, and RuSentiLex for sentiment analysis.
- Addition of transformer-based models for attributes such as Big-Five personality and well-being scoring.
- Integration of semantic role labeling and frame semantics to capture agent-patient-experiencer structures.
- Hybrid metric schemes leveraging both closed and open-vocabulary approaches, possibly through topic modeling and contextualized embeddings.
- Enhanced API/SDK offerings for R or Python, and real-time bulk analysis endpoints suitable for social media monitoring.
- Empirical validation studies targeting convergent validity with the LIWC standard and behavioral data sets.
RusLICA combines LIWC-style closed-dictionary methodology with contemporary Russian NLP resources, delivering a 96-dimensional feature representation per input and facilitating broad psycholinguistic inquiry on Russian text corpora (Sigdel et al., 28 Jan 2026).