Privacy-Preserving Skill Extraction
- Privacy-Preserving Skill Extraction is a validated NLP pipeline that transforms academic syllabi into quantifiable, machine-readable O*NET skill vectors.
- It integrates TEE-based decentralized systems to securely process signed educational records and ensure bias-resistant, verifiable credentialing.
- It leverages cosine similarity, advanced embedding models, and grade-level weighting to reliably align academic learning outcomes with job skill requirements.
The Syllabus-to-O*NET methodology is a validated NLP pipeline for the automated extraction and quantification of occupational skills from educational syllabi and formal academic records, mapping these to the U.S. O*NET taxonomy of job-related Detailed Work Activities (DWAs). This methodology serves as the backbone for several privacy-preserving, decentralized Learning and Employment Record (LER) systems, enabling the derivation of verifiable, machine-readable skill credentials from semi-structured educational artifacts within secure compute environments. The approach provides stability, transparency, and invariance to non-skill attributes, supporting bias-resistant credentialing and automated job-skill matching within decentralized architectures (Xu et al., 6 Jan 2026).
1. Definition and Motivation
The Syllabus-to-O*NET methodology operationalizes the transformation of course content (especially syllabus text) into quantifiable skill vectors anchored on the O*NET standard. It addresses the need for automatable, externalizable, and verifiable skill recognition, bridging formal academic credentials and job-market language by mapping textually described learning outcomes to occupational skill sets. This mapping is formally defined as

$$\mathbf{s} = \Phi(R),$$

where $R$ is an authenticated bundle of educational records and $\mathbf{s}$ is the resulting person-specific skill vector aligned with the O*NET descriptors (Xu et al., 6 Jan 2026).
2. Pipeline Stages and Algorithms
The Syllabus-to-O*NET workflow, as embedded within TEE-based LER systems, follows a multi-stage pipeline:
1. Pedagogical-sentence filtering: Syllabus (and optionally transcript) text is filtered to isolate pedagogical content, excluding administrative or non-learning sentences. Regex-based and heuristic filters retain approximately 14% of raw sentences as semantically relevant for skill mapping.
2. Sentence embedding: Each retained sentence is embedded into $\mathbb{R}^{768}$ using models such as Sentence-BERT "all-mpnet-base-v2" with default hyperparameters (max length 128 tokens, batch size 32). This reduces unstructured text to a dense vector capturing latent semantic relationships.
3. Skill similarity scoring: Let $\mathcal{K}$ denote the set of O*NET skills. For each sentence embedding $\mathbf{v}_i$ and each skill embedding $\mathbf{u}_k$, compute the cosine similarity
$$\mathrm{sim}(i,k) = \frac{\mathbf{v}_i \cdot \mathbf{u}_k}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{u}_k \rVert}.$$
The course-level score for skill $k$ aggregates the similarities over the course's $N$ pedagogical sentences,
$$s_k^{(c)} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{sim}(i,k),$$
accumulated into the course skill vector $\mathbf{s}^{(c)} = (s_1^{(c)}, \ldots, s_{|\mathcal{K}|}^{(c)})$.
4. Grade and level weighting/aggregation: Across the set $C$ of an individual's courses, skill contributions are modulated by grade and level weights $w_c^{\mathrm{grade}}$, $w_c^{\mathrm{level}}$:
$$\mathbf{s} = \sum_{c \in C} w_c^{\mathrm{grade}} \, w_c^{\mathrm{level}} \, \mathbf{s}^{(c)}.$$
This produces an attested skill vector $\mathbf{s}$ for the student.
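The four stages above can be sketched end to end in plain Python. This is a minimal illustration only: the hashed bag-of-words `embed` function and the keyword filter below are stand-ins for the paper's Sentence-BERT model and its regex/heuristic pedagogical filter, and the mean-similarity aggregation and weight names are assumptions, not the authors' exact implementation.

```python
import math
import re

DIM = 64  # toy embedding size; the real model ("all-mpnet-base-v2") uses 768

def embed(text: str) -> list[float]:
    """Stand-in embedding: unit-normalized hashed bag-of-words vector."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z]+", text.lower()):
        vec[hash(tok) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs are unit-normalized

# Stage 1: keep only pedagogical sentences (illustrative keyword heuristic).
PEDAGOGICAL = re.compile(r"\b(learn|understand|implement|analyze|design|apply)\w*", re.I)

def filter_sentences(syllabus: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", syllabus)
    return [s for s in sentences if PEDAGOGICAL.search(s)]

# Stages 2-3: embed each retained sentence, score each skill by mean cosine
# similarity between sentence embeddings and the skill-description embedding.
def course_skill_scores(syllabus: str, skills: dict[str, str]) -> dict[str, float]:
    sents = [embed(s) for s in filter_sentences(syllabus)]
    if not sents:
        return {name: 0.0 for name in skills}
    return {
        name: sum(cosine(v, embed(desc)) for v in sents) / len(sents)
        for name, desc in skills.items()
    }

# Stage 4: aggregate per-course vectors with grade and level weights.
def student_skill_vector(courses, skills):
    total = {name: 0.0 for name in skills}
    for syllabus, w_grade, w_level in courses:
        for name, score in course_skill_scores(syllabus, skills).items():
            total[name] += w_grade * w_level * score
    return total
```

In the real pipeline the skill embeddings would be precomputed once from the O*NET descriptor texts rather than re-embedded per course as done here for brevity.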
The mapping is robust: repeated runs on fixed data yield near-zero skill-vector variance in the top-ranked skills. The methodology inherits extensive validation from the "Course–Skill Atlas" project (>3M syllabi across 62 fields) (Xu et al., 6 Jan 2026).
3. Integration into Decentralized and Privacy-Preserving Credentialing
The methodology is implemented inside a Trusted Execution Environment (TEE) enclave as part of privacy-preserving LER infrastructure. The enclave ingests digitally signed transcripts, syllabi, and informal artifacts, verifies issuer signatures, runs the Syllabus-to-O*NET pipeline to yield , and binds this output cryptographically:
- Inputs and artifact hashes are incorporated into an attestation measurement .
- The resultant skill vector and its provenance are encapsulated as a self-issued verifiable credential (VC), signed by the enclave's private key.
- Selective disclosure policies ensure that only the derived skill vector is exposed; raw records and keys remain enclave-confined.
Presentations of skill credentials to verifiers (e.g., employers) are accomplished via verifiable presentations, and the attestation group guarantees provenance, integrity, and recency of the computation (Xu et al., 6 Jan 2026).
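The binding of skill vector to attestation measurement can be illustrated schematically. This sketch is not the paper's implementation: the HMAC stands in for the enclave's asymmetric signature, and the field names and `attestation_measurement` construction are assumptions used only to show how artifact hashes and code identity fold into one measurement that travels with the credential.

```python
import hashlib
import hmac
import json

# Hypothetical stand-in for the enclave's private key; in a real TEE
# this key is generated inside the enclave and never leaves it.
ENCLAVE_KEY = b"enclave-private-key-stand-in"

def attestation_measurement(artifact_hashes: list[str], pipeline_version: str) -> str:
    """Fold input-artifact hashes and pipeline identity into one measurement M."""
    h = hashlib.sha256()
    for a in sorted(artifact_hashes):  # order-independent over inputs
        h.update(bytes.fromhex(a))
    h.update(pipeline_version.encode())
    return h.hexdigest()

def issue_skill_credential(skill_vector: dict, artifact_hashes: list[str],
                           version: str) -> dict:
    """Package the derived skill vector as a signed, self-issued credential."""
    payload = {
        "skills": skill_vector,  # the only data selectively disclosed
        "measurement": attestation_measurement(artifact_hashes, version),
    }
    body = json.dumps(payload, sort_keys=True).encode()
    payload["signature"] = hmac.new(ENCLAVE_KEY, body, hashlib.sha256).hexdigest()
    return payload

def verify_skill_credential(cred: dict) -> bool:
    """Recompute the signature over everything except the signature itself."""
    body = json.dumps({k: v for k, v in cred.items() if k != "signature"},
                      sort_keys=True).encode()
    expected = hmac.new(ENCLAVE_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(cred["signature"], expected)
```

Any tampering with the disclosed skill vector, or with the measurement tying it to its inputs, invalidates the signature, which is the property the enclave-issued VC relies on.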
4. Skill-Only Job Matching and Bias Invariance
A defining property of the Syllabus-to-O*NET methodology, as embedded in LER systems, is its strict “skill-only” matching:
- Verifiers convert job descriptions into O*NET skill vectors ($\mathbf{j}$) using an identical NLP pipeline.
- Matching proceeds solely by comparison of $\mathbf{s}$ (from the applicant) and $\mathbf{j}$ (from the job), invoking semantic-overlap and cosine-similarity metrics:
$$\mathrm{SemSim}(\mathbf{s}, \mathbf{j}) = \frac{\mathbf{s} \cdot \mathbf{j}}{\lVert \mathbf{s} \rVert \, \lVert \mathbf{j} \rVert}, \qquad \mathrm{Overlap@}k = \frac{|\mathrm{Top}_k(\mathbf{s}) \cap \mathcal{J}|}{k},$$
where $\mathcal{J}$ is the required skill set for the job (Xu et al., 6 Jan 2026).
- Threshold-based rules (e.g., $\mathrm{SemSim}(\mathbf{s}, \mathbf{j}) \geq \tau$) yield a decision that is invariant to non-skill attributes (gender, age, etc.). Formally, because the decision function takes only $(\mathbf{s}, \mathbf{j})$ as input, the bias-opportunity index is provably zero: $\mathrm{BOI} = 0$.
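The skill-only matching rule above can be made concrete with a short sketch. The metric definitions follow the formulas in this section, but the function names, the default $k$, and the threshold value are illustrative assumptions; the key point is that the decision function's entire input is the pair of skill vectors.

```python
import math

def sem_sim(applicant: dict[str, float], job: dict[str, float]) -> float:
    """Cosine similarity between sparse skill vectors keyed by skill name."""
    keys = set(applicant) | set(job)
    dot = sum(applicant.get(k, 0.0) * job.get(k, 0.0) for k in keys)
    na = math.sqrt(sum(v * v for v in applicant.values()))
    nb = math.sqrt(sum(v * v for v in job.values()))
    return dot / (na * nb) if na and nb else 0.0

def overlap_at_k(applicant: dict[str, float], job: dict[str, float], k: int = 10) -> float:
    """Fraction of the applicant's top-k skills appearing in the job's top-k."""
    def top(vec):
        return {s for s, _ in sorted(vec.items(), key=lambda kv: -kv[1])[:k]}
    return len(top(applicant) & top(job)) / k

def matches(applicant: dict[str, float], job: dict[str, float], tau: float = 0.7) -> bool:
    # The decision depends only on the two skill vectors, so it is invariant
    # to every non-skill attribute -- the sense in which BOI = 0.
    return sem_sim(applicant, job) >= tau
```

Because `matches` never sees applicant identity, demographics, or raw records, any two applicants with identical skill vectors receive identical decisions by construction.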
5. Empirical Performance and Stability
Evaluation details indicate:
- For a benchmark corpus (computer-science syllabi plus job postings), skill extraction and mapping display near-zero variance across repeated pipeline runs.
- In an example job-matching run for a Java Developer role, Overlap@10 and the SemSim scores for all top skills were uniformly high.
- NLP pipeline throughput: On AWS Nitro Enclaves, 10–40 input files are processed in $9$–$23$ seconds; 100 files complete within $50$ seconds. Job matching inside the verifier enclave executes in at most a few seconds per job, consuming minimal resources (Xu et al., 6 Jan 2026).
The primary bottleneck is NLP ingestion; attestation and cryptographic verification add only a few seconds of overhead.
6. Security, Privacy, and Limitations
The Syllabus-to-O*NET methodology, by operating wholly inside attested TEEs, provides strong guarantees regarding:
- Confidentiality: Raw educational records and private keys remain strictly within the enclave.
- Unforgeability: Verifiable credentials issued are cryptographically bound to enclave attestation and original issuer signatures; adversarial attempts at generating valid credentials reduce to signature forgeries.
- Selective disclosure: Only attested skill vectors are exposed externally; repeated presentations guarantee unlinkability.
- Stability: Empirical results confirm near-zero output variance on fixed input, supporting reliability for longitudinal or batch processing.
Limitations noted include domain generalizability (currently validated primarily on computer-science syllabi) and the absence of formal side-channel leakage modeling; upstream disparities in background skill distributions are also not explicitly addressed (Xu et al., 6 Jan 2026).
7. Significance and Future Directions
Automating the translation from syllabus to O*NET skill profiles enables standardized, audit-friendly, and privacy-preserving skill credentials suitable for decentralized, employer-verifiable matching. The methodology underpins non-biasable, selectively disclosable, and reproducible mechanisms for linking education to employment. Proposed research avenues include hybrid TEE/zero-knowledge privacy models, human-in-the-loop calibration, and broadening to new academic domains for robust generalizability (Xu et al., 6 Jan 2026).