Automatically detect duplicate logic in Python code changes using advanced AST analysis and semantic similarity.
Prevent code duplication, improve code quality, and maintain cleaner codebases with intelligent duplicate detection that goes beyond simple text matching.
- **Multi-Strategy Detection**: AST analysis, semantic similarity, and function signature matching
- **Smart Pattern Recognition**: Detects business logic patterns and common code structures
- **Full Class Support**: Detects duplicates in both top-level functions and class methods
- **Actionable PR Comments**: Provides suggestions and refactoring recommendations
- **Highly Configurable**: Adjustable similarity thresholds and file patterns
- **Comprehensive Reports**: JSON and Markdown reports with detailed analysis
- **Fast & Efficient**: Uses the uv package manager for fast dependency installation
Add this workflow to `.github/workflows/duplicate-detection.yml`:
```yaml
name: Duplicate Logic Detection

on:
  pull_request:
    paths: ['**/*.py']

jobs:
  detect-duplicates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Detect Duplicate Logic
        uses: ArthurMor4is/duplicate-logic-detector-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
```

| Parameter | Description | Required | Default |
|---|---|---|---|
| `github-token` | GitHub token for API access | No | `${{ github.token }}` |
| `pr-number` | Pull request number | No | `${{ github.event.number }}` |
| `repository` | Repository name (owner/repo) | No | `${{ github.repository }}` |
| `base-ref` | Base reference for comparison | No | `${{ github.base_ref }}` |
| `head-ref` | Head reference for comparison | No | `${{ github.head_ref }}` |
| `post-comment` | Post findings as a PR comment | No | `true` |
| `fail-on-duplicates` | Fail if high-confidence duplicates are found | No | `false` |
| `similarity-method` | Similarity method to use (`jaccard_tokens`, `sequence_matcher`, `levenshtein_norm`) | No | `jaccard_tokens` |
| `global-threshold` | Global similarity threshold (0.0–1.0) for all methods | No | `0.7` |
| `folder-thresholds` | Per-folder thresholds as JSON (e.g., `{"src/shared": 0.1, "src/tests": 0.9}`) | No | `{}` |
| Output | Description |
|---|---|
| `duplicates-found` | Whether any duplicates were detected |
| `match-count` | Total number of matches found |
| `report-path` | Path to the generated report file |
```yaml
- name: Detect Duplicate Logic
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
```

```yaml
- name: Detect Duplicate Logic
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    fail-on-duplicates: true
```

```yaml
- name: Detect Duplicate Logic
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    post-comment: false
```

```yaml
- name: Detect Duplicate Logic (High Precision)
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    similarity-method: levenshtein_norm  # More thorough analysis
    fail-on-duplicates: true
```

```yaml
- name: Detect Duplicate Logic (Custom Thresholds)
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    global-threshold: 0.8  # Higher threshold for stricter detection
    folder-thresholds: '{"src/shared": 0.1, "src/tests": 0.9}'
```

```yaml
- name: Detect Duplicate Logic (Folder-Specific Thresholds)
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    similarity-method: jaccard_tokens
    folder-thresholds: '{"src/shared": 0.1, "src/core": 0.8, "tests": 0.9}'
```

The action uses configurable similarity analysis to detect duplicate logic patterns:
- Parses Python files to extract function definitions
- Analyzes function signatures and structure
- Identifies code patterns and complexity
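As a sketch of the extraction step, Python's `ast` module can pull out function names, argument lists, and locations; this is illustrative, not the action's exact code:

```python
import ast

source = '''
class UserService:
    def handle_user_info(self, data):
        return {k: v for k, v in data.items() if v is not None}

def validate_email(address):
    return "@" in address and "." in address.split("@")[-1]
'''

functions = []
for node in ast.walk(ast.parse(source)):
    # ast.walk visits nested nodes, so class methods are found too
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        signature = (node.name, [a.arg for a in node.args.args])
        functions.append(signature)

print(functions)  # both the top-level function and the class method
```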
Choose from three different similarity algorithms:
**`jaccard_tokens`** (default)
- Best for: general-purpose, fast analysis
- Method: token-based Jaccard similarity coefficient
- Strengths: fast, good balance of precision and recall
- Use when: you want reliable results with good performance

**`sequence_matcher`**
- Best for: a balanced approach between speed and accuracy
- Method: Python's `difflib.SequenceMatcher`
- Strengths: good at detecting structural similarities
- Use when: you need more nuanced similarity detection

**`levenshtein_norm`**
- Best for: high-precision, strict duplicate detection
- Method: normalized Levenshtein distance
- Strengths: most thorough analysis, best precision
- Use when: you want to catch even subtle duplicates
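For intuition, the three methods can be approximated with standard-library Python. This is a sketch of the general techniques, not the action's implementation:

```python
from difflib import SequenceMatcher

def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard coefficient over sets of whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def sequence_matcher(a: str, b: str) -> float:
    """Ratio of matching subsequences, as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein_norm(a: str, b: str) -> float:
    """Edit distance scaled to a 0.0-1.0 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Token-based Jaccard ignores ordering entirely (hence its speed), while the character-level methods reward structural closeness, which is why `levenshtein_norm` is the stricter choice.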
- Excludes very small functions (< 5 lines)
- Filters out test files and common patterns
- Prioritizes business logic and complex functions
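The filtering rules above can be sketched as a predicate; the test-file heuristics and names here are illustrative, not the action's actual code:

```python
import ast

def is_candidate(node: ast.FunctionDef, filename: str, min_lines: int = 5) -> bool:
    """Rough filter: skip test files and very small functions."""
    # Filter out test files (heuristic: test_ prefix or a tests/ folder)
    base = filename.rsplit("/", 1)[-1]
    if base.startswith("test_") or "/tests/" in filename:
        return False
    # Exclude very small functions (< min_lines source lines)
    return (node.end_lineno - node.lineno + 1) >= min_lines

source = (
    "def tiny():\n"
    "    pass\n"
    "\n"
    "def bigger(x):\n"
    "    a = x + 1\n"
    "    b = a * 2\n"
    "    c = b - 3\n"
    "    return c\n"
)
funcs = [n for n in ast.parse(source).body if isinstance(n, ast.FunctionDef)]
kept = [f.name for f in funcs if is_candidate(f, "src/utils.py")]
print(kept)  # only 'bigger' survives the size filter
```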
Control when functions are considered duplicates with flexible threshold settings:
- Default: `0.7` (70% similarity)
- Usage: applies to all files when no folder-specific threshold is set
- Range: `0.0` to `1.0` (0% to 100% similarity)
- Format: JSON object with folder paths as keys
- Example: `{"src/shared": 0.1, "src/tests": 0.9}`
- Priority: folder-specific thresholds override the global threshold
- Matching: uses the most specific (longest) matching folder path
- Dual Path Logic: considers both the new function's path AND the existing function's path
- Threshold Selection: uses the stricter (higher) threshold of the two paths
- Fallback: if no folder threshold matches, uses the global threshold
```yaml
# Strict detection globally
global-threshold: 0.85

# Lenient detection globally
global-threshold: 0.5

# Mixed approach (lenient for shared code, strict for tests)
folder-thresholds: '{"src/shared": 0.1, "tests": 0.9, "src/core": 0.8}'
```

When comparing functions from different folders, the system:
- Gets threshold for new function's folder (or global if no match)
- Gets threshold for existing function's folder (or global if no match)
- Uses the higher (more strict) threshold of the two
Example:

```yaml
global-threshold: 0.3
folder-thresholds: '{"src/shared": 0.3, "src/projects/integrations": 0.4}'
```

- `test.py` vs `src/shared/utils.py`: uses the `max(0.3, 0.3) = 0.3` threshold → 34.2% > 30% → reported
- `test.py` vs `src/projects/integrations/service.py`: uses the `max(0.3, 0.4) = 0.4` threshold → 30.9% < 40% → not reported
- `main.py` vs `src/core/logic.py`: uses the `max(0.3, 0.3) = 0.3` threshold (both fall back to the global value)
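Under these rules, threshold resolution can be sketched in a few lines (the helper names are hypothetical):

```python
def folder_threshold(path, folder_thresholds, global_threshold):
    """Most specific (longest) matching folder prefix wins; otherwise the global value."""
    matches = [folder for folder in folder_thresholds if path.startswith(folder)]
    if not matches:
        return global_threshold
    return folder_thresholds[max(matches, key=len)]

def effective_threshold(new_path, existing_path, folder_thresholds, global_threshold):
    """Dual-path rule: take the stricter (higher) of the two resolved thresholds."""
    return max(
        folder_threshold(new_path, folder_thresholds, global_threshold),
        folder_threshold(existing_path, folder_thresholds, global_threshold),
    )

thresholds = {"src/shared": 0.3, "src/projects/integrations": 0.4}
print(effective_threshold("test.py", "src/shared/utils.py", thresholds, 0.3))  # 0.3
print(effective_threshold("test.py", "src/projects/integrations/service.py", thresholds, 0.3))  # 0.4
```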
- Shared Libraries: Low threshold (0.1-0.3) to catch even minor duplications
- Test Files: High threshold (0.8-0.9) to avoid false positives on similar test patterns
- Core Business Logic: Medium-high threshold (0.6-0.8) for important code quality
- Utilities: Medium threshold (0.5-0.7) for general utility functions
## Duplicate Logic Detection Results
Found 2 potential duplicates with high confidence:
### Match 1: Email Validation
- **New Function**: `check_email_format` (src/utils.py:15)
- **Existing Function**: `validate_email` (src/validators.py:8)
- **Similarity**: 92%
- **Suggestion**: Consider using the existing `validate_email` function instead
### Match 2: Data Processing
- **New Function**: `process_user_data` (src/handlers.py:25)
- **Existing Function**: `handle_user_info` (src/services.py:45)
- **Similarity**: 87%
- **Suggestion**: Extract common logic into a shared utility function

The action has minimal runtime dependencies for fast execution:
- rich v14.1.0 - Console output and progress bars
For development, testing, and research, additional dependencies are available:
- Testing: pytest, pytest-mock, pytest-cov, pytest-xdist
- Code Quality: black, isort, flake8, mypy, pre-commit
- Research: GitPython, PyGithub, scikit-learn, nltk, numpy, pandas, pyyaml
The action uses modern Python packaging with `pyproject.toml` and `uv` for fast dependency management:
```toml
# Clean core dependencies
dependencies = []

# Runtime dependencies (action execution)
[project.optional-dependencies]
runtime = ["rich==14.1.0"]

# Dataset generation dependencies
dataset = ["openai>=1.0.0", "pandas>=2.0.0", "numpy>=1.24.0"]

# Research dependencies (experiments)
research = ["GitPython", "PyGithub", "scikit-learn", ...]

# Development dependencies
dev = ["black>=23.0.0", "isort>=5.12.0", ...]
test = ["pytest>=7.0.0", "pytest-mock>=3.10.0", ...]
```

```bash
# Clone the repository
git clone https://github.com/ArthurMor4is/duplicate-logic-detector-action.git

# Install dependencies using uv (recommended)
uv sync --all-extras

# Or using traditional pip
pip install -e ".[dev,test]"

# Run tests
make test
# or
uv run pytest

# Run sample analysis
make test-sample
```

Note: The `config/default-config.yml` file is used for development and testing purposes only. The GitHub Action uses built-in configuration optimized for CI/CD workflows.
This repository includes tools to generate datasets for testing and tuning duplicate detection algorithms:
```bash
# Install dataset generation dependencies
uv pip install -e ".[dataset]"

# Generate function clones using LLM
generate-clones --source-code "./src" --dest-folder="clones_output" --n-clones=3

# Build balanced datasets
build-dataset --clones-folder="clones_output" --dataset-name="test_dataset.json" --clone-ratio=0.5
```

Use Cases:
- **Algorithm Testing**: Test different similarity methods on your codebase
- **Threshold Tuning**: Find optimal detection thresholds
- **Performance Evaluation**: Compare detection strategies with ground-truth data
See the Dataset Generation Guide for detailed instructions.
- Usage Guide - Detailed usage instructions
- Testing Guide - How to test the action
- Dataset Generation Guide - Generate test datasets
- Examples - Complete workflow examples
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation
- Report Issues
- Discussions