Data Reliability & Noisy Input Handling in ML Models

A lightweight, research-oriented project analyzing how noise, missing values, and corrupted features affect machine learning model reliability. It includes synthetic data generation, multi-model training, and noise-impact evaluation with visualization, and is designed as a teaching/learning artifact for data quality, robustness, and ML reliability concepts.

This project analyzes how noisy and incomplete data affect the performance of machine learning models on a binary classification task. It first trains baseline models on clean data, then systematically corrupts the inputs with increasing levels of noise and missing values, and measures how model performance degrades.
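
The corruption step is small enough to sketch. The helpers below illustrate the two corruption modes; the function names and defaults are illustrative, not necessarily those used in src/evaluate_noise.py:

    import numpy as np

    def add_gaussian_noise(X, sigma, rng=None):
        # Add zero-mean Gaussian noise with standard deviation `sigma`
        # to every feature value.
        rng = rng if rng is not None else np.random.default_rng(0)
        return X + rng.normal(0.0, sigma, size=X.shape)

    def inject_missing_values(X, fraction, rng=None):
        # Replace a random `fraction` of the entries with NaN.
        rng = rng if rng is not None else np.random.default_rng(0)
        X_out = np.array(X, dtype=float)   # float copy so NaN is representable
        mask = rng.random(X_out.shape) < fraction
        X_out[mask] = np.nan
        return X_out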

Features

  • Synthetic binary classification dataset (no external data needed)
  • Train/test split and preprocessing
  • Training of Logistic Regression and RandomForestClassifier
  • Evaluation under multiple corruption levels (see the sketch after this list):
    • Feature noise (additive Gaussian noise)
    • Missing-value injection
  • Summary report printed to the console
  • Optional visualization of performance vs. noise level
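
A robustness sweep then reduces to a loop over corruption levels. A minimal sketch, assuming a trained model, a held-out test set X_test/y_test, and the add_gaussian_noise helper sketched above:

    from sklearn.metrics import accuracy_score

    noise_levels = [0.0, 0.1, 0.25, 0.5, 1.0]       # illustrative sigma values
    results = {}
    for sigma in noise_levels:
        X_noisy = add_gaussian_noise(X_test, sigma)  # corrupt a copy of the test set
        results[sigma] = accuracy_score(y_test, model.predict(X_noisy))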

Project Structure

  • src/data_prep.py
    Generates a clean synthetic dataset and saves it as data/clean_data.csv (a sketch follows this list).

  • src/train_model.py
    Trains baseline models on the clean dataset and saves them to models/ (also sketched below).

  • src/evaluate_noise.py
    Loads the trained models, applies increasing levels of noise and missing values to the data, and reports performance.

  • data/clean_data.csv
    Synthetic dataset generated by data_prep.py.
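
Minimal sketches of the first two scripts, under the assumption that the dataset comes from scikit-learn's make_classification and the models are persisted with joblib (parameter values here are illustrative; the real files may differ):

    # src/data_prep.py (sketch): generate and persist a synthetic dataset.
    import os
    import pandas as pd
    from sklearn.datasets import make_classification

    os.makedirs("data", exist_ok=True)
    X, y = make_classification(n_samples=2000, n_features=10,
                               n_informative=6, random_state=42)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
    df["target"] = y
    df.to_csv("data/clean_data.csv", index=False)

    # src/train_model.py (sketch): fit the two baselines and save them.
    import os
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    os.makedirs("models", exist_ok=True)
    df = pd.read_csv("data/clean_data.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                        ("random_forest", RandomForestClassifier(random_state=42))]:
        model.fit(X_train, y_train)
        joblib.dump(model, f"models/{name}.joblib")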

How to Run

  1. Create and activate a virtual environment (optional but recommended):

     python -m venv .venv
     source .venv/bin/activate  # On Windows: .venv\Scripts\activate

  2. Install dependencies:

     pip install -r requirements.txt

  3. Generate the synthetic dataset:

     python src/data_prep.py

  4. Train baseline models:

     python src/train_model.py

  5. Evaluate robustness under noise:

     python src/evaluate_noise.py

This will print accuracy scores under different noise and missing-value levels. It will also save an optional plot, noise_impact.png, in the project root.
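
The plot itself takes only a few lines of matplotlib. A sketch, reusing the results dict from the evaluation sweep above:

    import matplotlib.pyplot as plt

    levels = sorted(results)                        # noise levels on the x-axis
    plt.plot(levels, [results[s] for s in levels], marker="o")
    plt.xlabel("Gaussian noise std (sigma)")
    plt.ylabel("Test accuracy")
    plt.title("Accuracy vs. noise level")
    plt.savefig("noise_impact.png", dpi=150)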

Requirements

  • Python 3.9+ (it will likely also run on 3.8)
  • Dependencies listed in requirements.txt (a plausible set is sketched below)
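
The exact contents of requirements.txt are not reproduced here; based on the imports in the sketches above, a plausible minimal set would be:

    numpy
    pandas
    scikit-learn
    matplotlib
    joblib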

The code is written to be simple, readable and easily extensible for academic experiments or teaching purposes.
