
Similar Input Similar Code (SISC)

Updated 31 January 2026
  • SISC is a principle asserting that code fragments handling similar inputs exhibit high syntactic, token, semantic, or structural similarity.
  • It enables efficient redundancy-based program repair and scalable clone detection by leveraging metrics like LCS, TFIDF, Doc2Vec, and CFG fingerprinting.
  • In symbolic communication, SISC informs the design of binary codes such as the Tap code, balancing robustness and efficiency for constrained channels.

Similar Input Similar Code (SISC) designates the principle that code fragments engaged in handling analogous inputs or computational conditions tend to exhibit high similarity at the syntactic, token, structural, or semantic levels. This relationship, operationalized in diverse domains like automated program repair, code clone detection, and symbolic communication systems, enables algorithms to exploit code similarity for search-space reduction, clone identification, and channel-adaptive code design. The SISC principle is foundational in redundancy-based program repair, scalable detection of code clones in large repositories, and the creation of alternative binary codes for constrained communication channels.

1. Formal Definitions and Theoretical Framework

At its core, SISC posits that there exists significant redundancy in programs: for a given input or usage context (often reflected in data-flow or control-flow), multiple code fragments provide equivalent or near-equivalent transformations or behaviors. Let $s$ and $t$ be source code components (statements or method bodies). Four primary similarity metrics formalize the SISC relationship:

  • Character-level similarity: normalized Longest Common Subsequence (LCS), where $\mathrm{LCS}(s, t)$ denotes the length of the longest common subsequence of $s$ and $t$,

$$\mathrm{sim}_{\mathrm{LCS}}(s, t) = \frac{\mathrm{LCS}(s, t)}{\max(|s|, |t|)} \in [0,1]$$

  • Token-level similarity: TF–IDF vectorization with cosine similarity,

$$\mathrm{sim}_{\mathrm{TFIDF}}(s, t) = \frac{\mathbf{v}_s \cdot \mathbf{v}_t}{\|\mathbf{v}_s\|\;\|\mathbf{v}_t\|}$$

where $\mathbf{v}_s$ encodes TF–IDF weights per token.

  • Semantic similarity: Doc2Vec embeddings with cosine similarity,

$$\mathrm{sim}_{\mathrm{Doc2Vec}}(s, t) = \frac{\mathbf{e}_s \cdot \mathbf{e}_t}{\|\mathbf{e}_s\|\;\|\mathbf{e}_t\|}$$

  • Structural similarity: Deckard AST feature vectors with cosine similarity,

$$\mathrm{sim}_{\mathrm{Deckard}}(s, t) = \frac{\mathbf{f}_s \cdot \mathbf{f}_t}{\|\mathbf{f}_s\|\;\|\mathbf{f}_t\|}$$
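As an illustration of the first two metrics, the following is a minimal pure-Python sketch, not the implementation used in the cited studies. The function names, the whitespace tokenization in the example, and the smoothed IDF formula are assumptions of this sketch; Doc2Vec and Deckard vectors would plug into the same `cosine` helper once computed by their respective tools.

```python
import math
from collections import Counter


def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        curr = [0]
        for j, other in enumerate(t, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim_lcs(s: str, t: str) -> float:
    """Normalized LCS similarity in [0, 1]; two empty strings count as identical."""
    longest = max(len(s), len(t))
    return lcs_length(s, t) / longest if longest else 1.0


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_vectors(docs: list[list[str]]) -> list[dict]:
    """TF-IDF weights per token; the smoothed IDF is an assumption of this sketch."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


# Hypothetical statements, tokenized on whitespace for simplicity.
statements = ["x = y + 1 ;", "x = z + 1 ;", "return 0 ;"]
vectors = tfidf_vectors([stmt.split() for stmt in statements])
print(sim_lcs(statements[0], statements[1]))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A real pipeline would use a language-aware lexer rather than `split()`, and would fit the IDF weights on the whole program rather than on three statements.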

Functionally, SISC also refers to equivalence in code output given a shared input domain, as in simion detection: for chunks $C_1, C_2$ with $f_{C_1}, f_{C_2}: D \rightarrow R$, $C_1$ and $C_2$ are simions if $\forall x \in D: f_{C_1}(x) = f_{C_2}(x)$ (Deissenboeck et al., 2018).

2. SISC in Redundancy-Based Program Repair

Redundancy-based program repair leverages SISC by searching for code fragments ("repair ingredients") that can replace faulty code to achieve correctness. Empirical studies on large-scale commit datasets formalize SISC by utilizing the four similarity metrics above to rank all statements in a program as potential repairs. Key findings include:

  • Ranking performance: On 214 single-statement replacement tasks, similarity metrics (especially TFIDF and LCS) place the correct repair in the top 4–33% of ranked candidates, with median ranks as low as 3–4 and perfect-repair rates up to 33%.
  • Search-space reduction: Similarity metrics permit ignoring at least 90% of the candidate space without missing the correct patch, with average reductions >99% using context-aware TFIDF.
  • Metric complementarity: Syntactic (LCS, TFIDF), structural (Deckard), and semantic (Doc2Vec) metrics are statistically distinct (Wilcoxon signed-rank tests, $p < 0.01$), indicating orthogonality.
  • Context-awareness: Ranking enclosing methods by similarity first, and only then ranking statements within them, provides a further ~24% reduction on already pruned spaces.

This positions SISC as both a practical heuristic for search-space pruning and a theoretical lens for understanding redundancy in software artifacts (Chen et al., 2018).
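The ranking and pruning idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of Chen et al.: `difflib.SequenceMatcher.ratio` is swapped in as a stand-in character-level similarity (it is not normalized LCS), and the statements, function names, and the reduction formula `1 - rank / N` are hypothetical choices for this example.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def ratio(a: str, b: str) -> float:
    """Stand-in character-level similarity (difflib's ratio, not normalized LCS)."""
    return SequenceMatcher(None, a, b).ratio()


def rank_ingredients(faulty: str, candidates: Sequence[str],
                     sim: Callable[[str, str], float] = ratio) -> list[str]:
    """Order candidate statements from most to least similar to the faulty one."""
    return sorted(candidates, key=lambda c: sim(faulty, c), reverse=True)


def search_space_reduction(ranked: Sequence[str], correct: str) -> float:
    """Fraction of candidates that never need to be tried before `correct`."""
    rank = ranked.index(correct) + 1  # 1-based rank of the correct ingredient
    return 1.0 - rank / len(ranked)


# Hypothetical faulty statement and pool of repair ingredients.
faulty = "i < n - 1"
pool = ["return total;", "i < n", "count += 1;", "log.info(msg);"]
ranked = rank_ingredients(faulty, pool)
print(ranked[0], search_space_reduction(ranked, "i < n"))
```

Any of the four similarity metrics can be passed as `sim`, which makes it straightforward to compare how each one ranks the same candidate pool.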

3. Scalable Detection of SISC via Control-Flow Graph Fingerprinting

SISC underpins scalable clone detection in large code repositories through normalization, control-flow abstraction, and fingerprinting:

  • Normalization: Whitespace, identifier names, and literals are abstracted away to focus on structural similarity.
  • CFG Generation: Code is parsed to an Abstract Syntax Tree, from which the Control-Flow Graph $G = (B, E)$ is built, partitioning sequences into basic blocks according to leader identification.
  • Path-based SimHash fingerprinting: Each root-to-exit path in the CFG is hashed into a compact (e.g., 64-bit) SimHash fingerprint $h(p)$, yielding $H(G) = \{h(p) : p \in P(G)\}$.
  • Similarity comparison: The similarity of two fragments is defined as

$$S_{\mathrm{CFG}}(G, G') = \frac{|S_\mathrm{paths}|}{\min(|H(G)|, |H(G')|)}$$

where $S_\mathrm{paths}$ is the set of all path-pairs within Hamming distance $\leq \alpha$.

Empirical results show near-perfect precision for large code fragments, recall up to 95% of near-clones, scalable per-query costs, and superior performance to industry baselines (e.g., SAP clonefinder), especially when using banded LSH indexing (Alomari et al., 2019). The approach is robust to renaming, whitespace, and minor edits.

Table: Workflow Components (code similarity fingerprinting)

| Pipeline Stage | Output Representation | Purpose |
| --- | --- | --- |
| Normalization | Token stream / AST | Erase irrelevant syntactic variation |
| CFG Extraction | CFG $G = (B, E)$ | Capture structural, semantic flow |
| Path Enumeration | Paths $P(G)$ | Represent behavioral execution |
| SimHash Fingerprint | $\{h(p) : p \in P(G)\}$ | Enable efficient similarity search |
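The CFG extraction stage relies on leader identification. The sketch below applies the textbook leader rule (the first instruction, every jump target, and every instruction following a jump or return start a new block) to a toy instruction list. The tuple-based instruction format and all function names are hypothetical; a real front end would derive the same information from the AST.

```python
JUMPS = {"goto", "if_goto"}  # hypothetical opcodes; operand 1 is the target index


def find_leaders(instrs: list[tuple]) -> list[int]:
    """Indices of instructions that start a basic block."""
    leaders = {0} if instrs else set()
    for i, ins in enumerate(instrs):
        if ins[0] in JUMPS:
            leaders.add(ins[1])              # a jump target starts a block
        if ins[0] in JUMPS or ins[0] == "ret":
            if i + 1 < len(instrs):
                leaders.add(i + 1)           # so does the next instruction
    return sorted(leaders)


def basic_blocks(instrs: list[tuple]) -> list[tuple[int, int]]:
    """Half-open (start, end) index ranges, one per basic block."""
    bounds = find_leaders(instrs) + [len(instrs)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]


def cfg_edges(instrs: list[tuple]) -> list[tuple[int, int]]:
    """Edges E between block numbers, from jumps and fall-through."""
    blocks = basic_blocks(instrs)
    block_of = {start: k for k, (start, _) in enumerate(blocks)}
    edges = set()
    for k, (_, end) in enumerate(blocks):
        last = instrs[end - 1]
        if last[0] in JUMPS:
            edges.add((k, block_of[last[1]]))
        if last[0] not in {"goto", "ret"} and end < len(instrs):
            edges.add((k, k + 1))            # fall through to the next block
    return sorted(edges)


# A toy counting loop: i = 0; while not (i >= n): i = i + 1; return
program = [
    ("op", "i = 0"),
    ("if_goto", 4),      # if i >= n goto 4
    ("op", "i = i + 1"),
    ("goto", 1),
    ("ret",),
]
print(basic_blocks(program))
print(cfg_edges(program))
```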

4. SISC and Dynamic Detection of Functional Simions

SISC has been extended to the notion of simion mining: identifying independently developed, functionally equivalent fragments via random testing:

  • Operational definition: Two chunks $C_1$, $C_2$ are simions if, under feasible test sets $T$, their observed output vectors match ($h(C_1) = h(C_2)$ after hashing).
  • Pipeline: Extract code chunks (method-based, intent-based, or sliding-window); generate inputs (including parameter permutations); compile and execute stubs with sandboxing; compare output fingerprints; apply post-processing (subsumption, clone filtering).
  • Empirical limitations: Between 65% and 94% of chunks fail input generation (abstract/generic types, interface parameters), compilation, or execution. Survivorship is low: only ~28% of chunks complete the pipeline, and just 0.5–7% of these yield simion candidates in real systems.
  • Clone filtering: Clone detection remains crucial, filtering 17× more fragments than dynamic SISC produces simions.
  • Challenges: Random testing rarely covers meaningful input space; Java's type system precludes cross-project simion mining; many routines differ on error/corner handling beyond input/output equivalence.

This suggests dynamic SISC, while theoretically appealing, is not yet generally practical on real-world object-oriented programs without further progress in input generation, type abstraction, and hybrid symbolic or static analysis (Deissenboeck et al., 2018).
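The operational definition can be sketched in a few lines. This toy version assumes chunks are plain Python callables over small integers, which sidesteps exactly the input-generation and typing problems that dominate in practice; the function names, input range, and trial count are hypothetical. Matching fingerprints only nominate candidates, since agreement on sampled inputs does not establish equality over all of $D$.

```python
import hashlib
import random
from collections import defaultdict


def output_fingerprint(func, inputs) -> str:
    """Hash the vector of observed outputs; exceptions are recorded by type name."""
    outputs = []
    for args in inputs:
        try:
            outputs.append(repr(func(*args)))
        except Exception as exc:
            outputs.append(f"raises {type(exc).__name__}")
    return hashlib.sha256("\n".join(outputs).encode()).hexdigest()


def simion_candidates(chunks: dict, arity: int, trials: int = 200, seed: int = 0):
    """Group chunks whose output fingerprints agree on shared random inputs."""
    rng = random.Random(seed)
    inputs = [tuple(rng.randint(-50, 50) for _ in range(arity)) for _ in range(trials)]
    groups = defaultdict(list)
    for name, func in chunks.items():
        groups[output_fingerprint(func, inputs)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical chunks: three ways to compute |x| and one unrelated function.
chunks = {
    "abs_builtin": lambda x: abs(x),
    "abs_branch": lambda x: x if x >= 0 else -x,
    "abs_max": lambda x: max(x, -x),
    "square": lambda x: x * x,
}
print(simion_candidates(chunks, arity=1))
```

Recording exceptions in the output vector reflects the observation above that routines often differ only in error and corner-case handling.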

5. SISC in Symbolic Communication Systems: The Tap Code

SISC provides a conceptual basis for designing codes for constrained channels where traditional codes (e.g., Morse) are suboptimal. The Tap code, as a true binary alternative to Morse, exemplifies SISC in analog communication:

  • Structure: Each Tap code word is a binary string (tap/silence) with prefix and suffix “1”, inter-letter marker “00”, and optional zero-padding for even raster alignment.
  • Derivations: Can be constructed by (a) block-coding binary patterns without “00” substring, (b) mapping Morse code to tap-only representation, or (c) Polybius-square coordinates encoded as tap runs.
  • Efficiency: Information rate $R = H(X)/L$, where $H(X)$ is source entropy and $L$ is average codeword length per character. On English text, $L_{\rm Tap} \approx 4.61$, $R \approx 0.88$; Morse achieves $R \approx 0.98$ in ideal tapped form.
  • Comparison: Tap code is robust to amplitude/timing variation, strictly binary, and universal, but slower than Morse when dot/dash distinction is available. Tap code is preferable when a channel supports only on-off signaling (Rafler, 2013).
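Derivation (a) and the rate formula can be illustrated with a small sketch. It enumerates binary words that start and end with "1" and contain no "00", assigns shorter words to symbols listed earlier, and separates letters with "00". The symbol-to-codeword assignment here is hypothetical and is not the published Tap code table; whether $L$ should include the "00" marker is left to the caller, who supplies the lengths.

```python
import math
from itertools import count, product


def codewords(n: int) -> list[str]:
    """First n binary words that start and end with '1' and never contain '00'."""
    words: list[str] = []
    if n <= 0:
        return words
    for length in count(1):
        for bits in product("01", repeat=length):
            word = "".join(bits)
            if word[0] == "1" and word[-1] == "1" and "00" not in word:
                words.append(word)
                if len(words) == n:
                    return words


def build_code(symbols_by_frequency: str) -> dict:
    """Give shorter codewords to symbols listed earlier (assumed more frequent)."""
    return dict(zip(symbols_by_frequency, codewords(len(symbols_by_frequency))))


def encode(text: str, code: dict) -> str:
    """Join codewords with the inter-letter marker '00'."""
    return "00".join(code[ch] for ch in text)


def decode(signal: str, code: dict) -> str:
    """Split on '00', which cannot occur inside a codeword."""
    if not signal:
        return ""
    inverse = {word: ch for ch, word in code.items()}
    return "".join(inverse[word] for word in signal.split("00"))


def information_rate(probs, lengths) -> float:
    """R = H(X) / L, with entropy in bits and L the expected codeword length."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    return entropy / avg_len


code = build_code("etaoin")  # hypothetical six-letter alphabet
signal = encode("tea", code)
print(signal, decode(signal, code))
```

Because every codeword begins and ends with "1", the marker "00" is the only place two silences occur in a row, which is what makes the split-based decoder unambiguous.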

Table: Comparative Code Properties

| Code | Avg. bits/char (Eng) | Info Rate $R$ | Key Robustness |
| --- | --- | --- | --- |
| Tap Code | $4.6$ | $0.88$ | Strictly binary, tolerant |
| Morse (ternary) | $4.13$ (tapped equiv) | $0.98$ | Ternary, amplitude-sensitive |
| Polybius Square | $7.7$ | $0.53$ | Simple but inefficient |

6. Trade-offs, Limitations, and Practical Recommendations

SISC-based algorithms and codes demonstrate substantial utility for redundancy-based repair, scalable clone detection, and binary communication; however, challenges remain:

  • Redundancy mining: Combining metrics (syntactic, semantic, structural) yields superior search-space pruning, but many non-overlapping correct candidates persist. Overfitting and false positives must be managed (Chen et al., 2018).
  • Clone/simion detection: Static clone filtering is more tractable and general than dynamic SISC, particularly in strongly typed and object-oriented languages. Dynamic SISC is constrained by input generation, code context, and the limits of random or brute-force testing (Deissenboeck et al., 2018).
  • Scalability: Control-flow-based SISC is linearly scalable for indexing, sublinear for querying with LSH, and achieves practical precision/recall for large ($>10^3$) codebases (Alomari et al., 2019).
  • Domain adaptation: Communication SISC (e.g., Tap code) trades throughput for universality and robustness, illustrating that the most SISC-like code at the communication-symbol level is not always the globally optimal code.

Future work in all domains must address the abstraction of project-specific types, richer models of semantic similarity, domain-specific test/feature generation, and synthesis of hybrid static-dynamic detection pipelines.
