Skip to content

πŸ” Automatically detect duplicate logic in Python code changes. Prevent code duplication and improve code quality.

License

Notifications You must be signed in to change notification settings

ArthurMor4is/duplicate-logic-detector-action

Use this GitHub action with your project
Add this Action to an existing workflow or create a new one
View on Marketplace

Repository files navigation

πŸ” Duplicate Logic Detector Action

GitHub Marketplace Tests License: MIT

Automatically detect duplicate logic in Python code changes using advanced AST analysis and semantic similarity.

Prevent code duplication, improve code quality, and maintain cleaner codebases with intelligent duplicate detection that goes beyond simple text matching.

✨ Key Features

  • 🧠 Multi-Strategy Detection: AST analysis, semantic similarity, and function signature matching
  • 🎯 Smart Pattern Recognition: Detects business logic patterns and common code structures
  • πŸ—οΈ Full Class Support: Detects duplicates in both top-level functions and class methods
  • πŸ’¬ Actionable PR Comments: Provides suggestions and refactoring recommendations
  • βš™οΈ Highly Configurable: Adjustable similarity thresholds and file patterns
  • πŸ“Š Comprehensive Reports: JSON and Markdown reports with detailed analysis
  • πŸš€ Fast & Efficient: Uses uv package manager for lightning-fast dependency installation

πŸš€ Quick Start

Add this workflow to .github/workflows/duplicate-detection.yml:

name: Duplicate Logic Detection

on:
  pull_request:
    paths: ['**/*.py']

jobs:
  detect-duplicates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          
      - name: Detect Duplicate Logic
        uses: ArthurMor4is/duplicate-logic-detector-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

πŸ“‹ Input Parameters

Parameter Description Required Default
github-token GitHub token for API access βœ… ${{ github.token }}
pr-number Pull request number ❌ ${{ github.event.number }}
repository Repository name (owner/repo) ❌ ${{ github.repository }}
base-ref Base reference for comparison ❌ ${{ github.base_ref }}
head-ref Head reference for comparison ❌ ${{ github.head_ref }}
post-comment Post findings as PR comment ❌ true
fail-on-duplicates Fail if high-confidence duplicates found ❌ false
similarity-method Similarity method to use (jaccard_tokens, sequence_matcher, levenshtein_norm) ❌ jaccard_tokens
global-threshold Global similarity threshold (0.0-1.0) for all methods ❌ 0.7
folder-thresholds Per-folder thresholds as JSON (e.g., {"src/shared": 0.1, "src/tests": 0.9}) ❌ {}

πŸ“Š Outputs

Output Description
duplicates-found Whether any duplicates were detected
match-count Total number of matches found
report-path Path to the generated report file

βš™οΈ Usage Examples

Basic Usage

- name: Detect Duplicate Logic
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}

Strict Mode (Fail on Duplicates)

- name: Detect Duplicate Logic
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    fail-on-duplicates: true

Silent Mode (No PR Comments)

- name: Detect Duplicate Logic
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    post-comment: false

Custom Similarity Method

- name: Detect Duplicate Logic (High Precision)
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    similarity-method: levenshtein_norm  # More thorough analysis
    fail-on-duplicates: true

Custom Threshold Configuration

- name: Detect Duplicate Logic (Custom Thresholds)
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    global-threshold: 0.8  # Higher threshold for stricter detection
    folder-thresholds: '{"src/shared": 0.1, "src/tests": 0.9}'

Per-Folder Threshold Configuration

- name: Detect Duplicate Logic (Folder-Specific Thresholds)
  uses: ArthurMor4is/duplicate-logic-detector-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    similarity-method: jaccard_tokens
    folder-thresholds: '{"src/shared": 0.1, "src/core": 0.8, "tests": 0.9}'

πŸ” Detection Strategies

The action uses configurable similarity analysis to detect duplicate logic patterns:

1. AST Analysis

  • Parses Python files to extract function definitions
  • Analyzes function signatures and structure
  • Identifies code patterns and complexity

2. Similarity Methods

Choose from three different similarity algorithms:

jaccard_tokens (Default)

  • Best for: General purpose, fast analysis
  • Method: Token-based Jaccard similarity coefficient
  • Strengths: Fast, good balance of precision/recall
  • Use when: You want reliable results with good performance

sequence_matcher

  • Best for: Balanced approach between speed and accuracy
  • Method: Python's difflib.SequenceMatcher
  • Strengths: Good at detecting structural similarities
  • Use when: You need more nuanced similarity detection

levenshtein_norm

  • Best for: High precision, strict duplicate detection
  • Method: Normalized Levenshtein distance
  • Strengths: Most thorough analysis, best precision
  • Use when: You want to catch even subtle duplicates

3. Smart Filtering

  • Excludes very small functions (< 5 lines)
  • Filters out test files and common patterns
  • Prioritizes business logic and complex functions

4. Configurable Thresholds

Control when functions are considered duplicates with flexible threshold settings:

Global Threshold

  • Default: 0.7 (70% similarity)
  • Usage: Applies to all files when no folder-specific threshold is set
  • Range: 0.0 to 1.0 (0% to 100% similarity)

Per-Folder Thresholds

  • Format: JSON object with folder paths as keys
  • Example: {"src/shared": 0.1, "src/tests": 0.9}
  • Priority: Folder-specific thresholds override global threshold
  • Matching: Uses most specific (longest) matching folder path
  • Dual Path Logic: Considers both the new function's path AND existing function's path
  • Threshold Selection: Uses the more strict (higher) threshold between the two paths
  • Fallback: If no folder threshold matches, uses global threshold

Threshold Examples

# Strict detection globally
global-threshold: 0.85

# Lenient detection globally
global-threshold: 0.5

# Mixed approach (lenient for shared code, strict for tests)
folder-thresholds: '{"src/shared": 0.1, "tests": 0.9, "src/core": 0.8}'

How Dual Path Logic Works

When comparing functions from different folders, the system:

  1. Gets threshold for new function's folder (or global if no match)
  2. Gets threshold for existing function's folder (or global if no match)
  3. Uses the higher (more strict) threshold of the two

Example:

global-threshold: 0.3
folder-thresholds: '{"src/shared": 0.3, "src/projects/integrations": 0.4}'
  • test.py vs src/shared/utils.py: Uses max(0.3, 0.3) = 0.3 threshold β†’ 34.2% > 30% βœ… Reported
  • test.py vs src/projects/integrations/service.py: Uses max(0.3, 0.4) = 0.4 threshold β†’ 30.9% < 40% ❌ Not Reported
  • main.py vs src/core/logic.py: Uses max(0.3, 0.3) = 0.3 threshold (both use global)

Real-World Use Cases

  • Shared Libraries: Low threshold (0.1-0.3) to catch even minor duplications
  • Test Files: High threshold (0.8-0.9) to avoid false positives on similar test patterns
  • Core Business Logic: Medium-high threshold (0.6-0.8) for important code quality
  • Utilities: Medium threshold (0.5-0.7) for general utility functions

πŸ“ˆ Example Output

## πŸ” Duplicate Logic Detection Results

Found 2 potential duplicates with high confidence:

### Match 1: Email Validation
- **New Function**: `check_email_format` (src/utils.py:15)
- **Existing Function**: `validate_email` (src/validators.py:8)  
- **Similarity**: 92%
- **Suggestion**: Consider using the existing `validate_email` function instead

### Match 2: Data Processing
- **New Function**: `process_user_data` (src/handlers.py:25)
- **Existing Function**: `handle_user_info` (src/services.py:45)
- **Similarity**: 87%
- **Suggestion**: Extract common logic into a shared utility function

πŸ“¦ Dependencies

Runtime Dependencies

The action has minimal runtime dependencies for fast execution:

  • rich v14.1.0 - Console output and progress bars

Development Dependencies

For development, testing, and research, additional dependencies are available:

  • Testing: pytest, pytest-mock, pytest-cov, pytest-xdist
  • Code Quality: black, isort, flake8, mypy, pre-commit
  • Research: GitPython, PyGithub, scikit-learn, nltk, numpy, pandas, pyyaml

Dependency Management

The action uses modern Python packaging with pyproject.toml and uv for fast dependency management:

# Clean core dependencies
dependencies = []

# Runtime dependencies (action execution)
[project.optional-dependencies]
runtime = ["rich==14.1.0"]

# Dataset generation dependencies
dataset = ["openai>=1.0.0", "pandas>=2.0.0", "numpy>=1.24.0"]

# Research dependencies (experiments)
research = ["GitPython", "PyGithub", "scikit-learn", ...]

# Development dependencies
dev = ["black>=23.0.0", "isort>=5.12.0", ...]
test = ["pytest>=7.0.0", "pytest-mock>=3.10.0", ...]

πŸ› οΈ Development

# Clone the repository
git clone https://github.com/ArthurMor4is/duplicate-logic-detector-action.git

# Install dependencies using uv (recommended)
uv sync --all-extras

# Or using traditional pip
pip install -e ".[dev,test]"

# Run tests
make test
# or
uv run pytest

# Run sample analysis
make test-sample

Note: The config/default-config.yml file is used for development and testing purposes only. The GitHub Action uses built-in configuration optimized for CI/CD workflows.

πŸ§ͺ Dataset Generation

This repository includes tools to generate datasets for testing and tuning duplicate detection algorithms:

# Install dataset generation dependencies
uv pip install -e ".[dataset]"

# Generate function clones using LLM
generate-clones --source-code "./src" --dest-folder="clones_output" --n-clones=3

# Build balanced datasets
build-dataset --clones-folder="clones_output" --dataset-name="test_dataset.json" --clone-ratio=0.5

Use Cases:

  • πŸ”¬ Algorithm Testing: Test different similarity methods on your codebase
  • 🎯 Threshold Tuning: Find optimal detection thresholds
  • πŸ“Š Performance Evaluation: Compare detection strategies with ground truth data

See the Dataset Generation Guide for detailed instructions.

πŸ“š Documentation

🀝 Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™‹β€β™‚οΈ Support

About

πŸ” Automatically detect duplicate logic in Python code changes. Prevent code duplication and improve code quality.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published