This project analyzes how noisy and incomplete data affects the performance of machine learning models on a binary classification task. It builds a clean baseline model, then systematically corrupts the data with different levels of noise and missing values, and evaluates how model performance degrades.
- Synthetic binary classification dataset (no external data needed)
- Train/test split and preprocessing
- Training of Logistic Regression and RandomForestClassifier
- Evaluation under multiple noise levels:
  - Feature noise (Gaussian noise)
  - Missing values injection
- Summary report printed to the console
- Optional visualization of performance vs. noise level
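
The two corruption types listed above can be sketched as follows. This is a minimal illustration in plain Python so that it runs without the project's dependencies; it is not taken from src/evaluate_noise.py, and the function names `add_gaussian_noise` and `inject_missing`, as well as the parameters `sigma` and `missing_rate`, are assumptions made for the example.

```python
import math
import random


def add_gaussian_noise(rows, sigma, rng):
    """Return a copy of rows with zero-mean Gaussian noise (std sigma) added to every feature."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]


def inject_missing(rows, missing_rate, rng):
    """Return a copy of rows where each value becomes NaN with probability missing_rate."""
    return [
        [math.nan if rng.random() < missing_rate else x for x in row]
        for row in rows
    ]


rng = random.Random(0)  # fixed seed so the corruption is reproducible
clean = [[0.5, -1.2, 3.0], [1.1, 0.0, -0.7]]  # toy feature matrix (hypothetical values)

noisy = add_gaussian_noise(clean, sigma=0.5, rng=rng)
sparse = inject_missing(clean, missing_rate=0.3, rng=rng)
print(noisy)
print(sparse)
```

Both helpers return new lists and leave the clean data untouched, so the same baseline can be corrupted repeatedly at different levels. Models that cannot handle NaN inputs would need an imputation step before prediction; how the project handles this is not stated here.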
- src/data_prep.py: Generates a clean synthetic dataset and saves it as data/clean_data.csv.
- src/train_model.py: Trains baseline models on the clean dataset and saves them to models/.
- src/evaluate_noise.py: Loads the trained models, applies different noise levels to the data, and reports performance.
- data/clean_data.csv: Synthetic dataset generated by data_prep.py.
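
As a rough illustration of the data generation step, the sketch below draws a balanced two-class dataset from Gaussian blobs and writes it to a CSV file. It is a stand-in written with the standard library only, not the contents of src/data_prep.py; the column names (`f0`, `f1`, ..., `target`), the `separation` parameter, and the sample counts are assumptions. The example writes into a temporary directory so that it does not overwrite an existing data/clean_data.csv.

```python
import csv
import random
import tempfile
from pathlib import Path


def make_dataset(n_samples, n_features, rng, separation=2.0):
    """Draw each class from a unit-variance Gaussian blob; class 1 is shifted by `separation`."""
    rows = []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the classes are balanced
        shift = separation * label
        features = [rng.gauss(shift, 1.0) for _ in range(n_features)]
        rows.append(features + [label])
    rng.shuffle(rows)
    return rows


def save_csv(rows, path, n_features):
    """Write the rows with a header line: f0, f1, ..., target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    header = [f"f{j}" for j in range(n_features)] + ["target"]
    with path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


rows = make_dataset(n_samples=200, n_features=5, rng=random.Random(42))
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "clean_data.csv"
    save_csv(rows, out_path, n_features=5)
    with out_path.open(newline="") as fh:
        loaded = list(csv.reader(fh))

print(loaded[0])
print(len(loaded) - 1, "rows written")
```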
- Create and activate a virtual environment (optional but recommended):
    python -m venv .venv
    source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Generate the synthetic dataset:
    python src/data_prep.py
- Train baseline models:
    python src/train_model.py
- Evaluate robustness under noise:
    python src/evaluate_noise.py

This will print accuracy scores under different noise and missing-value levels. It will also save an optional plot noise_impact.png in the project root.
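
The evaluation idea (train once on clean data, then score the frozen model on increasingly noisy test data) can be sketched as below. This is a self-contained toy, not the project's script: it swaps in a hand-written nearest-centroid classifier for the Logistic Regression and RandomForestClassifier models, and the blob data, the helper names, and the `sigma` values are all assumptions chosen for illustration.

```python
import random
import statistics


def make_blobs(n, rng):
    """Two 2-D Gaussian classes centred at (0, 0) and (2, 2), with alternating labels."""
    data = []
    for i in range(n):
        label = i % 2
        data.append(([rng.gauss(2.0 * label, 1.0) for _ in range(2)], label))
    return data


def fit_centroids(data):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        points = [x for x, y in data if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*points)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def accuracy(centroids, data):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


rng = random.Random(0)
train, test = make_blobs(400, rng), make_blobs(400, rng)
model = fit_centroids(train)  # fitted once on clean data, then kept fixed

results = {}
for sigma in (0.0, 0.5, 1.0, 2.0, 4.0):
    # Corrupt only the test features; the labels and the model stay unchanged.
    noisy_test = [([v + rng.gauss(0.0, sigma) for v in x], y) for x, y in test]
    results[sigma] = accuracy(model, noisy_test)
    print(f"sigma={sigma:<4} accuracy={results[sigma]:.3f}")
```

Keeping the model fixed and corrupting only the evaluation data isolates the effect of noise at prediction time. Corrupting the training data as well would be a different experiment.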
- Python 3.9+ (should also work on 3.8+)
- Dependencies listed in requirements.txt
The code is written to be simple, readable and easily extensible for academic experiments or teaching purposes.