Presenting a Dataset for Collaborator Recommending Systems in Academic Social Network: a Case Study on ResearchGate

Published 29 Dec 2020 in cs.IR | (2101.01141v3)

Abstract: Collaborator finding systems are a special type of expert finding models. There is a long-lasting challenge for research in the collaborator recommending research area, which is the lack of the structured dataset to be used by the researchers. We introduce two datasets to fill this gap. The first dataset is prepared for designing a consistent, collaborator finding system. The next one, called a co-author finding model, models an academic social network as a table that contains different relations between the pair of users. Both of them provide an opportunity for introducing potential collaborators to each other. These two models have been extracted from ResearchGate (RG) data set and are available publicly. RG dataset has been collected from Jan. 2019 to April 2019 and includes raw data of 3980 RG users. The dataset consists of almost complete information about users. In the preprocessing phase, the well-known Elmo was used for analyzing textual data. We call this as ResearchGate dataset for Recommending Systems (RGRS). For assessing the validity of data, we analyze each layer of data separately, and the results are reported. After preparing data and evaluating the collaborator finding models, we have done some assessments on RGRS. Some of these assessments are co-author, following-follower, and question answering relations. The outcomes indicate that it is the best relation in propagating knowledge in the network. To the best of our knowledge, there is no processed and analyzed dataset of this size.



Summary

The paper introduces two comprehensive datasets from ResearchGate (MRGN and FTRG) that provide structured, multi-layered data for academic collaborator recommendation systems.
It details a dual preprocessing approach using an exponential similarity function for numerical features and ELMo-based cosine similarity for textual features.
Evaluation using graph-based community detection and machine learning classifiers demonstrates high accuracy and validates the datasets' design for recommendation tasks.

This paper (2101.01141) addresses a significant challenge in the field of academic collaborator recommendation systems: the lack of comprehensive, structured, and publicly available datasets. To fill this gap, the authors introduce two datasets derived from ResearchGate (RG), a large academic social network. The datasets are designed to support the development and evaluation of various collaborator finding models by integrating both structural and personal information about researchers.

The raw data was collected from 3980 ResearchGate users between January and April 2019 using a crawler. The dataset includes a wide array of information organized into 13 distinct tables:

Users: Core information about 3980 users, including name, score, education, number of research items, reads, citations, RG score breakdown (publications, questions, answers, followers), percentile, H-index (with and without self-citation), number of questions, and answers. This table serves as the central hub, linking to information in other tables via User ID.
Answers: Records of question-answer interactions, linking the user who answered to the user who asked.
Followers/Following: Directed relationship data indicating who follows whom.
Skills: Lists of skills and expertise for users.
Publications: Detailed information on user publications, including title, link, number of readers/citations, location, date, status, journal/conference name, and abstract.
Authors: Links Publication IDs to author names, facilitating the extraction of co-authorship information.
Other tables detail current research projects, awards, research interests, and references.
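
As a rough illustration of how co-authorship can be derived from the Authors table, the sketch below groups author names by publication and emits every unordered pair of authors who share a publication. The row layout `(publication_id, author_name)` and the function name `coauthor_pairs` are assumptions made for illustration; the column names in the released data may differ.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_pairs(author_rows):
    """Return the set of unordered co-author pairs.

    author_rows: iterable of (publication_id, author_name) tuples
    (a hypothetical layout for the Authors table).
    """
    authors_by_pub = defaultdict(set)
    for pub_id, author in author_rows:
        authors_by_pub[pub_id].add(author)

    pairs = set()
    for authors in authors_by_pub.values():
        # Sorting gives each pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(authors), 2):
            pairs.add((a, b))
    return pairs


rows = [
    (1, "Alice"), (1, "Bob"), (1, "Carol"),
    (2, "Alice"), (2, "Dave"),
    (3, "Eve"),  # single-author publication: contributes no pair
]
pairs = coauthor_pairs(rows)
```

Matching on author names alone can merge distinct people who share a name, so a real pipeline would likely need to resolve names to User IDs first.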

A key contribution is the preprocessing applied to the raw data to make it suitable for recommendation tasks. This involved handling both numerical and textual features:

Numerical Features: For features like H-Index and RG Rank, the similarity between two users $u_1$ and $u_2$ with feature values $x_1$ and $x_2$ is calculated as $y = \exp(-|x_1 - x_2|)$. This maps the absolute difference to a similarity score in $(0, 1]$: it equals 1 when the two values are identical and decays toward 0 as they diverge.
Textual Features: For features such as department name, research similarity (based on publication abstracts), and skill similarity, the ELMo deep contextualized word representation model (Peters et al., 2018) was used to transform textual data into vector embeddings. Cosine similarity was then applied to measure the similarity between the vectors of two users for a given textual feature. Skill similarity is calculated based on the proportion of overlapping skills relative to the total skills of one user, making it asymmetric.
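
The similarity measures described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the vectors stand in for ELMo embeddings (which are not computed here), and the skill measure follows the overlap-proportion description, with the first user's skill count as the denominator.

```python
import math


def numeric_similarity(x1, x2):
    """exp(-|x1 - x2|): 1.0 for equal values, decaying toward 0."""
    return math.exp(-abs(x1 - x2))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def skill_overlap(skills_a, skills_b):
    """Fraction of user A's skills that user B also lists (asymmetric)."""
    a, b = set(skills_a), set(skills_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


same = numeric_similarity(10, 10)
apart = numeric_similarity(10, 12)
a_to_b = skill_overlap({"ml", "nlp"}, {"ml"})
b_to_a = skill_overlap({"ml"}, {"ml", "nlp"})
```

Two properties are worth noting. The exponential form drops quickly on raw values: a difference of 2 already gives about 0.135, so the choice of scale for features such as H-index matters. The overlap measure depends on argument order, which is what makes skill similarity asymmetric.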

Based on this processed data, the authors created two structured datasets:

Multirelation ResearchGate Network (MRGN): This dataset models the academic social network as a multi-layer directed network. Each layer represents a specific type of relationship or similarity:
- Following
- Followers
- Co-authorship (reciprocal edges added to make it directed for consistency)
- Skill Similarity
- Equal Department
- Document Similarity (based on publication content)
This structure is designed for graph-based recommendation models, particularly those leveraging multi-layer network analysis or community detection. The layers have varying densities and structural properties (clustering coefficient, average degree, and so on), which the paper reports in a table of per-layer statistics.
Feature Table of ResearchGate (FTRG): This dataset is structured as a pairwise comparison table, with each row representing a pair of users $(u_i, u_j)$ . It contains 10 features (independent variables) that compare or relate the two users, derived from both structural and personal information: Following (binary), Followers (binary), Question-Answers (binary), Living in Same Country (binary), Living in Same City (binary), H-index similarity, RG Rank similarity, Department similarity, Research Similarity, and Skill Similarity. The target variable (dependent variable) is 'Previous Collaboration' (binary, 1 if they have co-authored before, 0 otherwise). This structure is designed for classification-based recommendation models.
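
To make the two layouts concrete, the sketch below represents a few MRGN layers as sets of directed edges and derives one FTRG-style row for a user pair. All names here (`make_reciprocal`, `ftrg_row`, the layer keys, and the column names) are hypothetical, and reading "Following" for a pair $(u_i, u_j)$ as "$u_i$ follows $u_j$" is an assumption, since the summary does not spell out the direction.

```python
def make_reciprocal(undirected_edges):
    """Turn undirected edges into directed ones by adding both directions."""
    directed = set()
    for u, v in undirected_edges:
        directed.add((u, v))
        directed.add((v, u))
    return directed


# A toy multi-layer network: one set of directed edges per layer.
mrgn = {
    "following": {("u1", "u2"), ("u2", "u3")},
    "followers": {("u2", "u1"), ("u3", "u2")},
    "coauthorship": make_reciprocal({("u1", "u3")}),
}


def ftrg_row(ui, uj, layers, similarities):
    """Build one pairwise feature row for the ordered pair (ui, uj)."""
    row = {
        "user_i": ui,
        "user_j": uj,
        "following": int((ui, uj) in layers["following"]),
        "followers": int((ui, uj) in layers["followers"]),
    }
    row.update(similarities)  # precomputed similarity scores
    # Target variable: 1 if the pair has co-authored before.
    row["previous_collaboration"] = int((ui, uj) in layers["coauthorship"])
    return row


row = ftrg_row("u1", "u3", mrgn, {"h_index_similarity": 0.37})
```

Only a subset of the ten features is shown; the remaining binary and similarity columns would be filled in the same way.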

The authors performed an analysis of the raw data networks (co-authoring, follower/following, QA) using community detection algorithms to understand the community structure and knowledge propagation potential. The analysis indicated that the QA network forms many small, weakly connected communities, suggesting limited knowledge diffusion. The co-authoring network forms more numerous and somewhat more converged communities, while the follower/following network forms fewer, larger, and highly overlapping communities and is therefore deemed the most effective for knowledge diffusion; the paper reports the supporting figures in a table.

The utility of the prepared datasets was demonstrated by applying them to collaborator recommendation tasks:

Consistent Collaboration Model: Using the MRGN dataset, multi-layer community detection algorithms (PMM [2011] and Louvain-like [2009]) were applied. Users within the same detected community were considered recommended collaborators. The evaluation compared the detected communities against existing relationships (across all six layers) using Precision, Recall, and F1 measures for different numbers of communities, summarized in a comparison figure in the paper. The results show that multi-layer community detection can identify potential collaborators based on these integrated features.
Co-author Finding Model: Using the FTRG dataset, the task was framed as predicting whether two users have previously collaborated ('Previous Collaboration'). This binary classification problem was tackled using several standard machine learning classifiers (AdaBoostM1, Bagging, BayesNet, DecisionTable, LMT, Logistic, etc.). The performance metrics (MAE, RMSE, TP, FP, TN, FN, precision, recall, F1, Brier Score, AUC, ACC), reported in a table in the paper, indicate that standard classifiers perform well (e.g., 92% precision for Decision Tree) in predicting previous collaborations from the FTRG features, which supports the dataset's structure and features for this type of model.
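
One plausible way to score community-based recommendations against existing relationships is at the level of user pairs, as sketched below. The paper's exact evaluation protocol is not described in this summary, so the pairwise definition, the function names, and the toy data are all assumptions.

```python
from itertools import combinations


def community_pairs(communities):
    """All unordered user pairs that share a community."""
    pairs = set()
    for members in communities:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def precision_recall_f1(recommended, actual):
    """Pairwise precision, recall, and F1 of recommended vs. actual pairs."""
    true_positives = len(recommended & actual)
    precision = true_positives / len(recommended) if recommended else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two detected communities and three known relationships.
communities = [{"u1", "u2", "u3"}, {"u4", "u5"}]
actual = {("u1", "u2"), ("u4", "u5"), ("u3", "u4")}

recommended = community_pairs(communities)
precision, recall, f1 = precision_recall_f1(recommended, actual)
```

Here four pairs are recommended and two of them match known relationships, giving a precision of 0.5 and a recall of 2/3. Coarser partitions recommend more pairs and tend to raise recall at the cost of precision, which is one reason to compare results across different numbers of communities.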

The paper concludes by highlighting that the provided datasets are, to the best of the authors' knowledge, the largest processed and analyzed datasets for academic collaborator recommendation. They offer both raw data for flexible use and preprocessed, structured datasets (MRGN and FTRG) tailored for graph-based and classification-based recommendation models, respectively. Making the datasets publicly available is intended to facilitate future research and development in the field. The directional nature of some relationships in the dataset is noted as particularly useful for graph-based analyses. The authors also suggest the raw data for use in related areas such as scientific text analysis and expert finding.
