NetSense Dataset: Social Network Dynamics
- NetSense dataset is a longitudinal behavioral network dataset combining attitudinal surveys and smartphone logs to capture evolving student interactions.
- It enables rigorous modeling and prediction of social tie formation, persistence, and dissolution using logistic regression and machine learning techniques.
- Key features include dyad-level agreement scores, common neighbors, and aggregated traits, achieving approximately 80% accuracy in predicting new tie formation.
The NetSense dataset is a longitudinal, behavioral network dataset collected at the University of Notre Dame, constructed to facilitate rigorous analysis of the interplay between social network evolution and individual attitudes, backgrounds, and behavioral traits. It integrates periodic survey data from a collegiate cohort with high-resolution smartphone-based communication logs, enabling the modeling and prediction of link formation, persistence, and dissolution within dynamically evolving student networks (Bahulkar et al., 2016).
1. Data Collection Protocol
NetSense tracked a single entering class of freshmen at Notre Dame from Fall 2011 to Spring 2013. Data acquisition comprised two synchronized streams:
- Participant Cohort and Demographics: The dataset covers all consenting members of the 2011 freshman class, yielding approximately 200–250 respondents per semester. Demographic variables—gender, race, religion, major, parental income, and additional background attributes—were comprehensively surveyed.
- Surveys: At the beginning and end of each semester (excluding summers), participants completed detailed online surveys measuring:
  - Political and social opinions (e.g., abortion, marijuana legalization, welfare)
  - Behavioral dispositions (e.g., “Talkative,” “Outgoing”)
  - Lifestyle and campus engagement (e.g., time spent partying, socializing, volunteering, club affiliations)
  - Demographic and family data
- Smartphone Event Logs: Each participant used a study-issued smartphone configured to unobtrusively log all calls and SMS events exchanged among study participants. Logged information for each event included timestamp, sender/receiver identifiers, call duration (seconds), and SMS length (characters).
2. Network Construction and Temporal Resolution
NetSense supports the analysis of two complementary network types, both defined over the recurring set of survey participants, with four temporal snapshots (one per regular semester).
- Communication Activity Network: Nodes are students; an undirected edge between $i$ and $j$ is present in semester $t$ if the total number of calls plus SMS messages between $i$ and $j$ in that semester meets a minimal threshold (typically ≥1). Edges are initially weighted by communication count but are binarized for modeling; the raw contact counts are used later in the persistence studies.
- Friendship Network: The node set is the same. A tie $(i, j)$ exists in semester $t$ if either student names the other as a friend in that semester's survey; a single nomination suffices. Both network types share temporally aligned boundaries: Fall 2011, Spring 2012, Fall 2012, and Spring 2013. The intervening Summer 2012 period is excluded due to sparse data.
- Data Preparation: Networks are restricted to within-study dyads (both nodes must be active participants). Attributes missing for a non-respondent are filled with that student's most recent available answers; dyads that still involve missing data are omitted.
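The construction rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the `(sender, receiver, semester)` tuple layout, and the semester labels such as `"F11"` are assumptions made for the example, and events are assumed to be already assigned to a semester.

```python
from collections import Counter

def communication_network(events, participants, semester, min_events=1):
    """Build one semester's undirected communication network.

    events: iterable of (sender, receiver, semester) tuples (hypothetical layout).
    Returns (edges, counts): binarized edges plus the raw per-dyad contact counts.
    """
    counts = Counter()
    for sender, receiver, sem in events:
        if sem != semester or sender == receiver:
            continue
        # Keep only within-study dyads: both endpoints must be participants.
        if sender in participants and receiver in participants:
            counts[frozenset((sender, receiver))] += 1
    edges = {pair for pair, n in counts.items() if n >= min_events}
    return edges, counts

def friendship_network(nominations, participants, semester):
    """A tie exists if either student names the other (one nomination suffices)."""
    edges = set()
    for namer, named, sem in nominations:
        if sem == semester and namer != named:
            if namer in participants and named in participants:
                edges.add(frozenset((namer, named)))
    return edges

# Toy data: "x" is outside the study, so the a-x event is dropped.
participants = {"a", "b", "c"}
events = [("a", "b", "F11"), ("b", "a", "F11"), ("a", "c", "S12"), ("a", "x", "F11")]
nominations = [("a", "c", "F11"), ("b", "c", "S12")]

edges, counts = communication_network(events, participants, "F11")
friends = friendship_network(nominations, participants, "F11")
```

Representing each dyad as a `frozenset` makes the edge undirected by construction, so a call from `a` to `b` and a reply from `b` to `a` accumulate on the same key.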
3. Feature Engineering for Dyad-Level Modeling
For every dyad $(i, j)$ at semester $t$, a vector of agreement and structural features, denoted $x_{ij}^{(t)}$, is constructed to support link prediction.
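- Node-Attribute Agreement:
  - For each of the 27 surveyed traits $k$, compute the normalized agreement score $a_{ij,k}^{(t)} \in [0, 1]$, where $a_{ij,k}^{(t)} = 1$ for identical or most similar responses and $a_{ij,k}^{(t)} = 0$ for maximally dissimilar ones (continuous and ordinal items are scaled linearly).
- Structural Features:
  - Common Neighbors: $CN_{ij}^{(t)}$, the count of shared neighbors in either network.
  - Total Agreement Count: $A_{ij}^{(t)} = \sum_{k=1}^{27} a_{ij,k}^{(t)}$.
- Aggregate Feature Vector:
  - $x_{ij}^{(t)} = [a_{ij,1}^{(t)}, \dots, a_{ij,27}^{(t)}, CN_{ij}^{(t)}, A_{ij}^{(t)}]$, yielding 29 features per dyad per semester.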
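As a rough sketch of how such a dyad vector could be assembled, the Python below computes per-trait agreement, common neighbors, and the agreement total. The linear rule `1 - |r_i - r_j| / (hi - lo)` is one plausible reading of "scaled linearly", not a formula confirmed by the source; the exact-match rule for categorical items, the trait names, and the scale ranges are likewise assumptions for illustration.

```python
def agreement(r_i, r_j, scale):
    """Agreement in [0, 1]: 1 for identical answers, 0 for maximally distant ones.

    scale is (lo, hi) for ordinal/continuous items, or None for categorical
    items, which are scored by exact match (an assumption of this sketch).
    """
    if scale is None:
        return 1.0 if r_i == r_j else 0.0
    lo, hi = scale
    if hi == lo:
        return 1.0
    return 1.0 - abs(r_i - r_j) / (hi - lo)

def neighbors(edges, node):
    """All nodes sharing an edge with `node`; edges are frozenset pairs."""
    return {other for pair in edges if node in pair for other in pair if other != node}

def dyad_features(i, j, responses, scales, edges):
    """Return [agreement per trait..., common-neighbor count, agreement total]."""
    traits = sorted(scales)  # fixed trait order so vectors are comparable
    scores = [agreement(responses[i][k], responses[j][k], scales[k]) for k in traits]
    common = len(neighbors(edges, i) & neighbors(edges, j))
    return scores + [common, sum(scores)]

# Toy example with 2 traits instead of 27, giving 2 + 2 = 4 features.
scales = {"talkative": (1, 5), "party_hours": (0, 10)}
responses = {
    "a": {"talkative": 5, "party_hours": 2},
    "b": {"talkative": 3, "party_hours": 2},
}
edges = {frozenset(("a", "c")), frozenset(("b", "c")), frozenset(("a", "d"))}

features = dyad_features("a", "b", responses, scales, edges)
```

With the full 27 traits the same function yields the 29-dimensional vector. To count shared neighbors "in either network", the caller could pass the union of the communication and friendship edge sets.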
4. Analytical and Predictive Modeling Approaches
NetSense data are suited for both statistical and modern machine-learning tasks focused on network evolution. Central analyses include:
- Link Formation:
  - Objective: Using the dyad feature vector $x_{ij}^{(t)}$ for dyads without an edge at semester $t$, predict the appearance of a new tie at $t+1$.
  - Model: Logistic regression and related classifiers (SVM, random forest), with the core form $P(y_{ij}^{(t+1)} = 1 \mid x_{ij}^{(t)}) = \sigma(w^\top x_{ij}^{(t)} + b)$, where $y_{ij}^{(t+1)}$ indicates whether the tie is present at $t+1$, $w$ and $b$ are the fitted weights and intercept, and $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function.
  - Addressing Class Imbalance: Negatives are subsampled to dyads within graph distance ≤2, reducing the dominance of non-edges.
  - Dimensionality Reduction: Singular Value Decomposition (SVD) on the 29-feature matrix projects the data onto the top singular directions ("eigenfeatures"), improving recall.
- Link Persistence and Dissolution:
  - Objective: Given an existing edge at semester $t$, predict its retention at $t+1$.
  - Models: Logistic regression and other classifiers are applied analogously, using the dyad feature vector $x_{ij}^{(t)}$.
- Evaluation and Metrics:
  - Standard metrics: Accuracy, precision, recall, and area under the ROC curve (AUC).
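To make the formation setup concrete, here is a minimal sketch of two of its pieces: restricting candidate non-edges to dyads at graph distance 2 (among non-adjacent pairs, "distance ≤2" means exactly 2), and scoring a dyad with the logistic form. The function names and the toy weights are hypothetical; in practice the weights would be fitted by a library routine rather than set by hand.

```python
import math

def adjacency(edges):
    """Adjacency sets from undirected edges given as two-element frozensets."""
    adj = {}
    for pair in edges:
        u, v = tuple(pair)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distance_two_non_edges(edges):
    """Non-adjacent dyads that share at least one neighbor (graph distance 2)."""
    adj = adjacency(edges)
    candidates = set()
    for nbrs in adj.values():
        ordered = sorted(nbrs)
        # Every pair of neighbors of the same node is at distance <= 2.
        for idx, u in enumerate(ordered):
            for v in ordered[idx + 1:]:
                if v not in adj[u]:
                    candidates.add(frozenset((u, v)))
    return candidates

def formation_probability(x, w, b):
    """Logistic score sigma(w . x + b) for one dyad feature vector x."""
    z = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Path graph a-b-c-d: a-c and b-d are at distance 2; a-d (distance 3) is excluded.
edges = {frozenset(("a", "b")), frozenset(("b", "c")), frozenset(("c", "d"))}
candidates = distance_two_non_edges(edges)

# Hypothetical weights, for illustration only.
p = formation_probability([0.0, 0.0], w=[1.0, 1.0], b=0.0)
```

Restricting candidates this way removes the large mass of distant non-edges that would otherwise swamp the positive class. Whether a candidate counts as a positive or a negative is then decided by whether the tie appears in the next semester.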
| Task | Prediction Target | Main Features |
|---|---|---|
| Link Formation | Edge $i$-$j$ forms at $t+1$ | $x_{ij}^{(t)}$ |
| Link Persistence | Edge $i$-$j$ persists at $t+1$ | $x_{ij}^{(t)}$, communication volume |
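The evaluation metrics named in this section can be computed directly from binary labels and model scores. The sketch below uses the standard definitions (AUC as the probability that a random positive outscores a random negative, with ties counted as one half); the function names and toy data are illustrative and are not results from the dataset.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = tie, 0 = no tie)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Under heavy class imbalance, accuracy alone can be misleading, since predicting "no tie" everywhere already scores well, so recall and AUC carry more of the information here.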
5. Principal Empirical Findings
- Homophily as a Driver of Link Formation: Pre-existing similarity, quantified via the aggregate agreement $A_{ij}$ and the common neighbor count $CN_{ij}$, is the strongest correlate of new tie formation. Dyads that form new ties typically show agreement levels intermediate between those of existing friend pairs and those of persistent non-friend pairs.
- Predictive Models: Combined SVD-feature logistic regression achieves approximately 80–83% accuracy with 72% recall for link formation. Dominant predictors include the number of common traits, common neighbors, and agreement on volunteering/campaigning time, parental income, and political views.
- Tie Persistence vs. Dissolution: Predictors of persistence parallel those for formation but with attenuated strengths. Communication volume is particularly predictive: persisting edges exchange several-fold higher call/SMS volume than dissolving ties in either network. Predictive accuracy for persistence ranges from roughly 62% to 66% using the standard 29-feature set.
- Network Co-evolution Implications: Broad, multidimensional homophily underpins tie creation, while selective tie decay plays a secondary role. Agreement across diverse behavioral and attitudinal dimensions, rather than singular demographic factors, governs the formation and stability of student relationships.
6. Research Applications and Typical Use Cases
The NetSense dataset underpins numerous lines of inquiry in social network analysis and computational social science:
- Benchmarking link prediction algorithms with rich node-attribute and temporally resolved edge data.
- Studying the co-evolution of behavioral traits and network topology, with explicit tests of contagion versus homophily mechanisms.
- Analyzing tie decay and maintenance, notably the role of communication frequency and content in sustaining relationships.
- Disentangling the effects of demographic, attitudinal, and behavioral congruence on the structure and dynamics of real-world networks composed of interacting individuals.
7. Significance and Limitations
NetSense uniquely synchronizes longitudinal, fine-grained behavioral survey data with objective communication logs, enabling the precise quantification of dyadic similarity and structural embeddedness as they influence network evolution. This supports robust inference regarding the micro-mechanisms driving social tie formation and dissolution. A plausible implication is that network co-evolution modeling must account for high-dimensional attitudinal and activity alignment, and not only static demographic similarity.
Limitations include restriction to a single university cohort and potential biases due to nonresponse or panel attrition; standard imputation strategies are used but cannot fully eliminate these factors. Nonetheless, NetSense constitutes a benchmark for longitudinal, attribute-rich social network datasets (Bahulkar et al., 2016).