Entity-Guided Graph Traversal for KB-QA

Updated 10 November 2025

Entity-Guided Graph Traversal is a KB-QA method that uses detected entities to extract local subgraphs and map natural language queries into graph traversal operations.
It integrates entity linking, subgraph extraction, and joint semantic mapping, using predicate similarity and type constraints to resolve ambiguity.
The approach demonstrates improved recall and F1 on non-aggregation questions, streamlining query translation in large graph-structured knowledge bases like DBpedia.

Entity-Guided Graph Traversal is a knowledge base question answering (KB-QA) methodology that leverages local graph substructures, anchored on detected entities, to enable semantic parsing and answer retrieval from large graph-structured resources such as DBpedia. This approach is specifically suited for non-aggregation questions, focusing on mapping natural language questions into subgraph traversal operations that yield accurate answers while jointly resolving semantic mapping and disambiguation.

1. Knowledge Base and Formal Problem Definition

Let the underlying knowledge base (KB), such as DBpedia, be modeled as a directed, labeled graph $G = (V, E)$, where $V$ denotes the set of RDF nodes (resources, classes, literals), $P$ is the set of predicate labels, and $E \subseteq V \times P \times V$ is the set of labeled edges (each $(u,p,v) \in E$ encodes a relation $u\ \xrightarrow{p}\ v$). Given a user query $q$ in natural language, the system must return the correct set of answers by traversing appropriate paths in $G$.

The entity-guided method proceeds by:

Detecting a subset of KB entities $E_q = \{e_1, e_2, \ldots, e_k\} \subseteq V$ referenced in $q$, each corresponding to a DBpedia URI.
Extracting a “topological structure” or pattern $T_q$ from $q$, represented as a tree with edges labeled by surface phrases (e.g., “mayor of”).
Defining $K$ as the maximal branch length (in edges) in $T_q$, which bounds the traversal depth.
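
As a small illustration of the last step, the pattern $T_q$ can be held as a tree and $K$ computed as its longest root-to-leaf branch. This is only a sketch: the `PatternNode` class, the `max_branch_length` function, and the two-edge example question are hypothetical and not taken from the original system.

```python
from dataclasses import dataclass, field


@dataclass
class PatternNode:
    """A node of the question pattern T_q (hypothetical representation)."""
    label: str  # e.g. "ANSNODE", an intermediate variable, or an entity mention
    # Each child is a pair (surface phrase on the edge, child node).
    children: list = field(default_factory=list)


def max_branch_length(node: PatternNode) -> int:
    """Return K: the number of edges on the longest root-to-leaf branch."""
    if not node.children:
        return 0
    return 1 + max(max_branch_length(child) for _, child in node.children)


# Illustrative two-edge chain: ANSNODE -"mayor of"- x1 -"capital of"- Germany
germany = PatternNode("Germany")
capital = PatternNode("x1", [("capital of", germany)])
root = PatternNode("ANSNODE", [("mayor of", capital)])
K = max_branch_length(root)
```

With this chain, `K` bounds the subgraph expansion to two hops around the linked entity.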

2. Entity Detection and Local Subgraph Construction

2.1 Entity Linking

The process employs an external entity linker (specifically, Wikipedia Miner), which:

Identifies mention–resource pairs $(m, e)$ in $q$.
Retains only those with linker score $\geq \theta$ ($\theta = 0.15$).
Excludes schema-level (e.g., dbo:Actor) or category entities, retaining only instance-level entities for $E_q$.
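
The filtering steps above can be sketched as follows. The linker output format (`Link`), the function names, and the namespace-prefix heuristic for spotting schema-level and category URIs are assumptions made for illustration; the scores in the usage example are made up, and no Wikipedia Miner API is called.

```python
from typing import NamedTuple

THETA = 0.15  # linker score threshold stated in the text


class Link(NamedTuple):
    """One mention-resource pair as an entity linker might return it (assumed shape)."""
    mention: str
    uri: str
    score: float


def is_instance_level(uri: str) -> bool:
    """Assumed heuristic: ontology classes and category pages are not instances."""
    return not (
        uri.startswith("http://dbpedia.org/ontology/")
        or uri.startswith("http://dbpedia.org/resource/Category:")
    )


def select_entities(links, theta=THETA):
    """Keep URIs whose score reaches theta and that denote instances."""
    return [l.uri for l in links if l.score >= theta and is_instance_level(l.uri)]


# Made-up linker output for "Who is the mayor of Berlin?"
links = [
    Link("Berlin", "http://dbpedia.org/resource/Berlin", 0.92),
    Link("mayor", "http://dbpedia.org/ontology/Mayor", 0.40),
    Link("who", "http://dbpedia.org/resource/Who", 0.03),
]
E_q = select_entities(links)
```

Here only the Berlin resource survives: the second link is schema-level and the third falls below $\theta$.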

2.2 Subgraph Extraction

With $E_q$ established, a local subgraph $G_s = (V_s, E_s)$ is constructed as follows:

Initialize $V_s \leftarrow E_q$, $E_s \leftarrow \emptyset$.
For every $e \in E_q$ and for $d = 1 \ldots K$:
- Perform breadth-first expansion to depth $d$.
- Collect all edges $(e', p, e'')$ with $e' \in V_s$ or $e'' \in V_s$.
- Augment $V_s$ and $E_s$ with endpoints and edges.
Expand until reaching depth $K$ (the longest path indicated by $T_q$). Because every candidate path starts at an entity in $E_q$ and has at most $K$ edges, any answer path conforming to $T_q$ lies entirely in $G_s$.
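
A minimal sketch of this expansion, assuming the KB is available as an in-memory list of `(subject, predicate, object)` triples. The function name `extract_subgraph` and the toy identifiers are hypothetical; a real system would more likely query a triple store.

```python
from collections import defaultdict


def extract_subgraph(triples, seeds, K):
    """Collect every edge within K hops of any seed entity, in either direction."""
    # Index each triple under both endpoints so incoming edges are found too.
    incident = defaultdict(list)
    for u, p, v in triples:
        incident[u].append((u, p, v))
        incident[v].append((u, p, v))

    V_s, E_s = set(seeds), set()
    frontier = set(seeds)
    for _ in range(K):  # one breadth-first layer per iteration
        next_frontier = set()
        for node in frontier:
            for u, p, v in incident[node]:
                E_s.add((u, p, v))
                for endpoint in (u, v):
                    if endpoint not in V_s:
                        V_s.add(endpoint)
                        next_frontier.add(endpoint)
        frontier = next_frontier
    return V_s, E_s


# Toy graph with placeholder node and predicate names.
triples = [("A", "p", "B"), ("B", "q", "C"), ("C", "r", "D"), ("X", "s", "A")]
V_s, E_s = extract_subgraph(triples, seeds=["A"], K=2)
```

With `K = 2`, node `D` stays outside the subgraph because it is three hops from the seed.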

3. Joint Semantic Item Mapping and Disambiguation

Entity-guided graph traversal unifies two typically disjoint subtasks: (a) Semantic-Item-Mapping: extract the structured template $T_q$, e.g., reading the question “Who is the mayor of Berlin?” as the pattern ANSNODE—“mayor of”—Berlin, where ANSNODE is a variable for the answer; (b) Semantic-Item-Disambiguation: select, within $G_s$, the precise sequence of predicates and nodes realizing the intended query semantics.

The methodology:

Adopts light-weight constituency-based patterns to extract $T_q$ from $q$ (eschewing template induction).
Searches $G_s$ for candidate paths matching the topology of $T_q$.
Scores each candidate path based on the semantic match between KB predicate labels and question phrases, and enforces answer type constraints.

For a pattern $T_q$ with $m$ edges, each labeled $t_i$ (the $i$-th surface phrase), candidate paths $p = (e_0 \xrightarrow{p_1} e_1 \xrightarrow{p_2} \ldots \xrightarrow{p_m} e_m)$ in $G_s$ are scored:

$\mathrm{Score}(p) = \frac{1}{m} \sum_{i=1}^m \mathrm{sim}(p_i, t_i) + \mathrm{TypeSim}(e_m, \mathrm{focus})$

where

$\mathrm{sim}(p_i, t_i)$ is the semantic similarity between predicate label $p_i$ and surface phrase $t_i$;
$\mathrm{TypeSim}(e_m, \mathrm{focus})$ assesses the compatibility of terminal node $e_m$ with the expected answer type inferred from the question focus (e.g., "person", "place").

This joint objective disambiguates both predicate path selection and answer candidate typing.

4. Path Traversal Algorithm and Pruning

The core traversal algorithm, denoted "FindCandidatePaths," operates as follows for pattern length $m$ and phrase sequence $(t_1,\ldots,t_m)$:

Initialize $P_0 = \{(e, [], 0) : e \in E_q \}$.
For each step $i = 1 \ldots m$:
- For each $(e_{i-1}, \text{path}, \text{score\_so\_far}) \in P_{i-1}$:
  - Enumerate top-$k$ outgoing edges from $e_{i-1}$ by $\mathrm{sim}(p_e, t_i)$.
  - Prune edges with similarity $< \tau$.
  - For each permissible edge, create a new partial path in $P_i$.
After $m$ steps, $P_m$ contains valid, length-$m$ candidate paths.
Each $e_m$ is scored with $\mathrm{TypeSim}(e_m, \mathrm{focus})$, yielding the final path score.
Return candidates ranked by total score.

The state per traversal step is a triple: (current node $e_i$, accumulated predicates, similarity score). Branching is curtailed by top-$k$ predicate selection and threshold $\tau$. Without pruning, the number of paths is $O(|E_q| \cdot b^m)$ for average branching factor $b$. With top-$k$ selection each partial path spawns at most $k$ successors, so at most $|E_q| \cdot k^m$ candidates survive, and the threshold $\tau$ typically removes more.
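
The traversal can be sketched as a beam-style expansion. This is an illustrative reading of the steps above, not the authors' code: the adjacency format `out_edges`, the default values of `k` and `tau`, and the toy similarity function are assumptions, and the phrases are assumed to be ordered from the seed entity toward the answer node. The type score is left out here and would be added to each returned average.

```python
import heapq


def find_candidate_paths(out_edges, seeds, phrases, sim, k=3, tau=0.3):
    """Return (end node, predicates, mean predicate similarity), best first.

    out_edges: dict mapping node -> list of (predicate, target) pairs.
    sim: callable (predicate, phrase) -> float; k and tau are illustrative.
    """
    m = len(phrases)
    if m == 0:
        return []
    partial = [(e, [], 0.0) for e in seeds]  # P_0
    for phrase in phrases:  # steps i = 1..m
        extended = []
        for node, preds, score in partial:
            scored = [(sim(p, phrase), p, t) for p, t in out_edges.get(node, [])]
            # Keep the k most similar edges, then drop those below tau.
            for s, p, t in heapq.nlargest(k, scored, key=lambda x: x[0]):
                if s >= tau:
                    extended.append((t, preds + [p], score + s))
        partial = extended  # P_i
    ranked = sorted(partial, key=lambda x: x[2], reverse=True)
    return [(node, preds, total / m) for node, preds, total in ranked]


# Toy data: placeholder node names and made-up similarity values.
out_edges = {"Berlin": [("leader", "Person_1"), ("country", "Germany")]}
toy = {("leader", "mayor of"): 0.8, ("country", "mayor of"): 0.1}
paths = find_candidate_paths(
    out_edges, ["Berlin"], ["mayor of"], lambda p, t: toy.get((p, t), 0.0)
)
```

In the toy run the `country` edge is pruned because its similarity falls below `tau`, leaving a single candidate path.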

5. Path Scoring and Answer Selection

5.1 Predicate-Similarity Features

For each path edge $p_i$ and corresponding $t_i$, the system computes:

$\text{PredicateScore}_i = \max_{\ell \text{ label of } p_i} \left( \frac{1}{|\ell|} \sum_{w \in \ell} \max_{tw \in t_i} \mathrm{UMBCsim}(w, tw) \right)$

where $\mathrm{UMBCsim}$ is a word similarity service and $\text{PredicateScore}_i$ plays the role of $\mathrm{sim}(p_i, t_i)$ in the path score.
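
A sketch of this computation, with the word-similarity service replaced by a caller-supplied function. The name `predicate_score` and the `toy_sim` table (with made-up values) are assumptions; the real UMBC service is not called here.

```python
def predicate_score(labels, phrase_words, word_sim):
    """Max over labels of the mean best-match similarity of the label's words.

    labels: list of labels for one predicate, each given as a list of words.
    phrase_words: the words of the surface phrase t_i.
    """
    if not phrase_words:
        return 0.0  # assumption: an empty phrase matches nothing
    best = 0.0
    for label in labels:
        if not label:
            continue
        total = sum(max(word_sim(w, tw) for tw in phrase_words) for w in label)
        best = max(best, total / len(label))
    return best


def toy_sim(a, b):
    """Stand-in similarity with made-up values; not the UMBC service."""
    table = {("leader", "mayor"): 0.6}
    return 1.0 if a == b else table.get((a, b), 0.0)


score = predicate_score([["leader"], ["leader", "name"]], ["mayor", "of"], toy_sim)
```

The one-word label scores 0.6, while the two-word label averages to 0.3 because "name" matches nothing, so the maximum over labels is 0.6.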

5.2 Type-Constraint Feature

A focus phrase $F$ is extracted from $q$, typically the head noun following interrogatives. For each answer candidate $e_m$, the best semantic match is computed:

$\text{TypeScore} = \max_{\mathrm{typ} \in \text{types}(e_m)} \mathrm{UMBCsim}(\text{typ label}, \mathrm{head}(F))$
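
The type feature reduces to a maximum over the candidate's type labels. In the sketch below, `type_score`, the fallback of 0.0 for untyped nodes, and the `toy_sim` values are assumptions for illustration; extraction of the focus head noun itself is not shown.

```python
def type_score(type_labels, focus_head, word_sim):
    """Best similarity between any type label of e_m and the focus head noun."""
    # Assumption: a node with no type labels scores 0.0.
    return max((word_sim(label, focus_head) for label in type_labels), default=0.0)


def toy_sim(a, b):
    """Stand-in similarity with made-up values; not the UMBC service."""
    table = {("politician", "person"): 0.7, ("city", "person"): 0.1}
    return 1.0 if a == b else table.get((a, b), 0.0)


# Candidate typed as both "politician" and "city", question focus "person".
score = type_score(["politician", "city"], "person", toy_sim)
```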

5.3 Combined Scoring

$\text{PathScore} = \frac{1}{m} \sum_{i=1}^m s_i + t_s$

with $\{s_1,\ldots,s_m\}$ the predicate scores and $t_s$ the type score. Candidates are returned ranked by this measure.
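
To make the combination concrete, the sketch below scores two hypothetical candidates and ranks them. The function name `path_score`, the candidate names, and all numeric values are made up for illustration.

```python
def path_score(predicate_scores, type_score):
    """Mean predicate score plus type score, following the PathScore formula."""
    m = len(predicate_scores)
    if m == 0:
        raise ValueError("a path needs at least one edge")
    return sum(predicate_scores) / m + type_score


# Made-up feature values for two candidate paths of length m = 2.
candidates = {
    "path_A": ([0.8, 0.6], 0.9),  # weaker predicates, well-typed answer
    "path_B": ([0.9, 0.9], 0.2),  # stronger predicates, poorly typed answer
}
ranked = sorted(
    candidates, key=lambda name: path_score(*candidates[name]), reverse=True
)
```

Here `path_A` scores about 0.7 + 0.9 = 1.6 and `path_B` about 0.9 + 0.2 = 1.1. Because the predicate term is averaged while the type term is added unscaled, a well-typed answer can outrank a path with better predicate matches.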

6. Query Translation and System Output

While the implementation yields URI or literal answers directly, any discovered path $p = (e_0, p_1, e_1, \ldots, p_m, e_m)$ can be rendered as a SPARQL query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?ans WHERE {
  <e_0> <p_1> ?x_1 .
  ?x_1 <p_2> ?x_2 .
  ...
  ?x_{m-1} <p_m> ?ans .
  OPTIONAL { ?ans rdf:type ?t .
             FILTER(regex(str(?t), "<head(F)>", "i")) }
}

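The OPTIONAL-FILTER clause encodes the type constraint derived from the focus phrase. Because the clause is OPTIONAL, it acts as a soft constraint: answers without a matching type are still returned.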
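
Rendering a path in this form is mechanical, as the following sketch shows. The function `path_to_sparql` and the `example.org` URIs are hypothetical, and the focus word is inserted into the regex without escaping, which a real implementation would need to handle.

```python
def path_to_sparql(start_uri, predicates, focus_head=None):
    """Render a path (e_0, p_1, ..., p_m) as a SPARQL SELECT query string."""
    if not predicates:
        raise ValueError("a path needs at least one predicate")
    lines = [
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
        "SELECT ?ans WHERE {",
    ]
    subject = f"<{start_uri}>"
    for i, pred in enumerate(predicates, start=1):
        # Intermediate nodes become ?x_i; the last node is the answer variable.
        obj = "?ans" if i == len(predicates) else f"?x_{i}"
        lines.append(f"  {subject} <{pred}> {obj} .")
        subject = obj
    if focus_head:
        # Note: focus_head is not regex-escaped in this sketch.
        lines.append("  OPTIONAL { ?ans rdf:type ?t .")
        lines.append(f'             FILTER(regex(str(?t), "{focus_head}", "i")) }}')
    lines.append("}")
    return "\n".join(lines)


query = path_to_sparql(
    "http://example.org/e0",
    ["http://example.org/p1", "http://example.org/p2"],
    focus_head="person",
)
```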

7. Experimental Evaluation and Comparative Performance

Experiments were conducted on QALD-3 benchmarks:

The full test set contains 99 natural-language questions; QALD-3-NA is a non-aggregation subset (61 questions, excluding COUNT/ORDER BY/FILTER types).
Metrics: Precision $= |\mathcal{A} \cap \mathcal{G}| / |\mathcal{A}|$, Recall $= |\mathcal{A} \cap \mathcal{G}| / |\mathcal{G}|$, and F1 $= 2\cdot\text{Prec}\cdot\text{Rec}/(\text{Prec}+\text{Rec})$, where $\mathcal{A}$ is the set of returned answers and $\mathcal{G}$ the gold answer set; all are reported as averages.
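
The per-question metrics and their averages can be computed as in the sketch below. Two points are assumptions rather than statements from the evaluation: empty answer or gold sets are scored as 0.0, and the average is taken uniformly over whatever list of questions is passed in.

```python
def prf(answers, gold):
    """Precision, recall and F1 for one question, from answer sets A and G."""
    answers, gold = set(answers), set(gold)
    overlap = len(answers & gold)
    precision = overlap / len(answers) if answers else 0.0  # assumed convention
    recall = overlap / len(gold) if gold else 0.0            # assumed convention
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_question):
    """Average each of (precision, recall, F1) over the given questions."""
    n = len(per_question)
    return tuple(sum(scores[i] for scores in per_question) / n for i in range(3))


# One question with two returned answers, one of which is among three gold answers.
p, r, f1 = prf({"a", "b"}, {"b", "c", "d"})
```

For this example, precision is 1/2, recall is 1/3, and F1 is 0.4.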

Summary of results:

Dataset	System	Processed	Correct	Partial	Avg-Recall	Avg-Prec	Avg-F1
QALD-3-NA	Ours	53	30	13	0.67	0.61	0.61
QALD-3-NA	gAnswer demo	38	21	7	0.41	0.45	0.42
QALD-3-full	Ours	60	31	17	0.46	0.40	0.40
QALD-3-full	gAnswer	76	32	11	0.40	0.40	0.40
QALD-3-full	DEANNA	27	21	0	0.21	0.21	0.21

On QALD-3-NA, entity-guided graph traversal achieves higher average recall (0.67 vs. 0.41), precision (0.61 vs. 0.45), and F1 (0.61 vs. 0.42) than the gAnswer demo. On the full QALD-3 set it leads on recall (0.46) while matching gAnswer on precision and F1 (0.40 each) and exceeding DEANNA on all three. This suggests its effectiveness in answer retrieval where explicit aggregation is not required and where local subgraph patterns, seeded by confidently linked entities, are salient.

8. Significance, Limitations, and Applicability

Entity-guided graph traversal simplifies the process of mapping questions to queries by:

Avoiding heavy template induction (and thus manual engineering).
Focusing on answer path ranking rather than exhaustive global search.
Enabling joint semantic matching and type-based answer disambiguation within manageable subgraphs.

A plausible implication is that the method is less suited to questions which require aggregation, counting, or global reasoning over the KB. Its computational efficiency depends critically on the effectiveness of entity linking, the restrictiveness of local subgraph expansion, and the suitability of the underlying similarity metrics. The approach is well-aligned with KBs that exhibit rich instance-level connectivity and clearly typed relations, such as DBpedia.
