Metadata-Version: 2.4
Name: data_cleaner_lib
Version: 2
Summary: Automated data cleaning and preparation library for analysts before EDA.
Author: Giri V
Keywords: data-cleaning,data-preprocessing,data-analysis,pandas
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Dynamic: license-file

# data_cleaner_lib

A comprehensive Python library for **automated data cleaning and preparation before Exploratory Data Analysis (EDA)**.

`data_cleaner_lib` provides a pipeline that detects common data quality issues (mistyped columns, missing values, duplicate rows, outliers, and inconsistent text) and fixes them with minimal code. It is designed for **data analysts, data scientists, and ML practitioners** who want a fast and reliable way to prepare datasets.

---

# Key Features

### Automatic Column Detection

Detects semantic column types automatically:

- Numeric
- Text
- Email
- Datetime
- Categorical
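
The detection rules themselves are not documented in this README. As a rough illustration only, one way such a heuristic could be written with plain pandas is sketched below. The function name `guess_column_type`, the loose email pattern, the ISO date format, and the 0.5 cardinality threshold are all assumptions made for this example, not part of `data_cleaner_lib`.

```python
import re
import pandas as pd

# Deliberately loose pattern, for illustration only
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def guess_column_type(series: pd.Series) -> str:
    """Hypothetical heuristic; not the library's implementation."""
    values = series.dropna()
    if values.empty:
        return "text"
    # Trust the dtype when pandas already knows it
    if pd.api.types.is_datetime64_any_dtype(series):
        return "datetime"
    if pd.api.types.is_numeric_dtype(series):
        return "numeric"

    as_str = values.astype(str).str.strip()
    if as_str.map(lambda s: bool(EMAIL_RE.match(s))).all():
        return "email"
    # Strings that all parse as numbers, e.g. "25" or "45.5"
    if pd.to_numeric(as_str, errors="coerce").notna().all():
        return "numeric"
    # Strings that all parse as ISO dates, e.g. "2024-01-01"
    if pd.to_datetime(as_str, format="%Y-%m-%d", errors="coerce").notna().all():
        return "datetime"
    # Few distinct values relative to the row count suggests a category
    if as_str.nunique() / len(as_str) <= 0.5:
        return "categorical"
    return "text"


print(guess_column_type(pd.Series(["a@b.com", " C@D.org ", None])))
```

The order of the checks matters in a sketch like this: the cheap dtype checks run first, and the cardinality test runs last so that email or numeric columns with repeated values are not mislabelled as categorical.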

---

### Smart Data Type Conversion

Automatically converts string-encoded or mixed-type columns to an appropriate dtype:

- "25" → `Int64`
- "45.5" → `float`
- "2024-01-01" → `datetime`
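
For readers who want to see what these conversions look like in plain pandas, here is a minimal sketch. It uses `pd.to_numeric` and `pd.to_datetime` directly and is not taken from the library's source. The sample column names are made up for the example.

```python
import pandas as pd

raw = pd.DataFrame({
    "age": ["25", "30", None],
    "weight": ["45.5", "90", "68.76"],
    "joined": ["2024-01-01", "2024-02-15", None],
})

converted = pd.DataFrame({
    # Whole numbers with gaps: the nullable Int64 dtype keeps them integral
    "age": pd.to_numeric(raw["age"], errors="coerce").astype("Int64"),
    # Decimal strings become float64
    "weight": pd.to_numeric(raw["weight"], errors="coerce"),
    # ISO date strings become datetimes; missing entries become NaT
    "joined": pd.to_datetime(raw["joined"], format="%Y-%m-%d", errors="coerce"),
})

print(converted.dtypes)
```

`errors="coerce"` turns unparseable entries into missing values instead of raising, which is usually what an automated cleaning step wants.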

---

### Missing Value Handling

Fills missing values using a strategy chosen by column type:

| Column Type | Strategy                |
| ----------- | ----------------------- |
| Numeric     | Median                  |
| Text        | Mode                    |
| Categorical | Mode                    |
| Datetime    | Forward / Backward fill |
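
The strategies in the table map onto standard pandas calls. The following is a minimal sketch of that mapping, assuming the column types are already known. The sample data and column names are invented, and the library may implement these steps differently.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, 30.0, None, 47.0],
    "city": ["Pune", None, "Pune", "Delhi"],
    "joined": pd.to_datetime(["2024-01-01", None, "2024-01-03", None]),
})

filled = df.copy()
# Numeric: the median is less sensitive to extreme values than the mean
filled["age"] = filled["age"].fillna(filled["age"].median())
# Text / categorical: the most frequent value (first one if there is a tie)
filled["city"] = filled["city"].fillna(filled["city"].mode().iloc[0])
# Datetime: carry the previous value forward, then back-fill any leading gap
filled["joined"] = filled["joined"].ffill().bfill()

print(filled)
```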

---

### Duplicate Removal

Detects and removes duplicate rows automatically.
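
In plain pandas, the equivalent operation is a short sketch like the one below. It assumes exact, whole-row duplicates and keeps the first occurrence, which is the pandas default. Whether the library uses the same settings is not stated here.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

# duplicated() marks every repeat of an earlier row
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

print(f"removed {n_duplicates} duplicate row(s), {len(deduped)} remain")
```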

---

### Outlier Detection

Supports statistical outlier detection:

- IQR method
- Z-score method

Extreme values are capped automatically.
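
As an illustration of what capping means for each method, here is a minimal sketch. The function names, the `k = 1.5` IQR multiplier, and the z-score threshold of 3 are conventional defaults chosen for this example. They are assumptions, not documented library settings.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def cap_outliers_zscore(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Clip values to mean +/- threshold standard deviations."""
    mean, std = series.mean(), series.std()
    return series.clip(lower=mean - threshold * std, upper=mean + threshold * std)


ages = pd.Series([25.0, 30.0, 30.0, 47.0])
print(cap_outliers_iqr(ages).tolist())  # [25.0, 30.0, 30.0, 42.5]
```

With these four values, Q1 is 28.75 and Q3 is 34.25, so the upper bound is 34.25 + 1.5 × 5.5 = 42.5 and the value 47 is pulled down to it. The z-score variant leaves this small sample unchanged, because no value lies more than three standard deviations from the mean.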

---

### Text Standardization

Cleans textual data by:

- Lowercasing
- Removing extra spaces
- Removing special characters
- Preserving valid email formats
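
The steps above can be sketched as a single helper. The name `standardize_text`, the loose email pattern, and the choice to keep only letters, digits, and spaces are assumptions for this example. The library's own rules may differ.

```python
import re
import pandas as pd

# Deliberately loose pattern, for illustration only
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize_text(value):
    """Hypothetical cleaner; missing values pass through untouched."""
    if pd.isna(value):
        return value
    text = re.sub(r"\s+", " ", str(value)).strip().lower()
    if EMAIL_RE.match(text):
        # Keep '@' and '.' so the address stays valid
        return text
    # Drop everything except letters, digits, and whitespace
    cleaned = re.sub(r"[^a-z0-9\s]", "", text)
    # Removing symbols can leave double spaces behind, so collapse again
    return re.sub(r"\s+", " ", cleaned).strip()


emails = pd.Series([" John@email.com ", "JANE@email.COM", None])
print(emails.map(standardize_text).tolist())
```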

---

### Data Quality Score

Generates a **dataset quality score (0–100)** based on:

- Missing values
- Duplicate rows

Example:

```
Quality Score: 87.5
```
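
The exact formula behind the score is not given in this README. The sketch below shows one plausible way to combine the two inputs, using an equally weighted penalty for the fraction of missing cells and the fraction of duplicate rows. The function name `quality_score_sketch` and the weights are assumptions, so its results should not be expected to match the library's output.

```python
import pandas as pd


def quality_score_sketch(
    df: pd.DataFrame,
    missing_weight: float = 0.5,
    duplicate_weight: float = 0.5,
) -> float:
    """Hypothetical 0-100 score; not the library's formula."""
    if df.empty:
        return 100.0
    missing_fraction = df.isna().sum().sum() / df.size
    duplicate_fraction = df.duplicated().sum() / len(df)
    penalty = missing_weight * missing_fraction + duplicate_weight * duplicate_fraction
    return round(float(100 * (1 - penalty)), 2)


sample = pd.DataFrame({"a": [1, None, 3, 3], "b": ["x", "y", "z", "z"]})
print(quality_score_sketch(sample))
```

In the sample, 1 of 8 cells is missing and 1 of 4 rows is a duplicate, so the penalty is 0.5 × 0.125 + 0.5 × 0.25 = 0.1875.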

---

### Cleaning Recommendations

Automatically suggests potential cleaning actions:

Example:

```
age: 1 potential outliers detected → consider outlier treatment
```
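
To show how a message like this could be produced, here is a minimal sketch that counts IQR outliers per numeric column and formats a recommendation. The function name `recommend_sketch` and the `k = 1.5` multiplier are assumptions. The library's own checks are likely broader.

```python
import pandas as pd


def recommend_sketch(df: pd.DataFrame, k: float = 1.5) -> list:
    """Hypothetical recommender covering outliers only."""
    notes = []
    for name in df.select_dtypes(include="number").columns:
        col = df[name].dropna()
        q1, q3 = col.quantile(0.25), col.quantile(0.75)
        iqr = q3 - q1
        # Count values outside the usual IQR fences
        n_outliers = int(((col < q1 - k * iqr) | (col > q3 + k * iqr)).sum())
        if n_outliers:
            notes.append(
                f"{name}: {n_outliers} potential outliers detected "
                "→ consider outlier treatment"
            )
    return notes


print(recommend_sketch(pd.DataFrame({"age": [25.0, 30.0, 30.0, 47.0]})))
```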

---

### EDA-Ready Summary

Generates a structured dataset summary including:

- Column types
- Missing values
- Unique values
- Numeric statistics

Example:

| column | detected_type | dtype   | missing% | unique | min | max |
| ------ | ------------- | ------- | -------- | ------ | --- | --- |
| age    | numeric       | float64 | 0        | 3      | 25  | 42  |

---

### One-Line Cleaning Pipeline

Prepare a dataset for analysis with a single command.

```python
pipeline.prepare_for_eda()
```

---

# Installation

Install using pip:

```
pip install data_cleaner_lib
```

Requirements:

- Python ≥ 3.8
- pandas
- numpy

---

# Quick Example

```python
import pandas as pd
from data_cleaner_lib import CleanPipeline


df = pd.DataFrame({
    "name": ["John", "Jane", "Giri", None],
    "email": [" John@email.com ", "JANE@email.COM", "GIri@Gmail.com", None],
    "age": [25, 30, None, 47],
    "salary": [50000, 60000.00, None, None],
    "weight": ["45.5", "90", 68.76, "100"]
})

pipeline = CleanPipeline(df)

cleaned_df = pipeline.prepare_for_eda()

print(cleaned_df)
```

Output:

```
   name           email   age  salary  weight
0  john  john@email.com  25.0   50000   45.50
1  jane  jane@email.com  30.0   60000   90.00
2  giri  giri@gmail.com  30.0   55000   68.76
3  giri  john@email.com  42.5   55000  100.00
```

---

# Generate EDA Summary

```python
summary = pipeline.eda_summary()
print(summary)
```

Example Output:

| column | detected_type | dtype   | missing_percent | unique_values |
| ------ | ------------- | ------- | --------------- | ------------- |
| name   | text          | object  | 0               | 3             |
| age    | numeric       | float64 | 0               | 3             |
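
A table with this shape can be assembled from plain pandas, as the sketch below shows. The function name `eda_summary_sketch` is hypothetical, and the `detected_type` column is left out because it depends on the library's own type detection.

```python
import pandas as pd


def eda_summary_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-column summary; one row per column."""
    rows = []
    for name in df.columns:
        col = df[name]
        is_numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "missing_percent": round(float(col.isna().mean() * 100), 2),
            "unique_values": int(col.nunique()),
            # min / max only make sense for numeric columns
            "min": col.min() if is_numeric else None,
            "max": col.max() if is_numeric else None,
        })
    return pd.DataFrame(rows)


cleaned = pd.DataFrame({
    "age": [25.0, 30.0, 30.0, 42.5],
    "name": ["john", "jane", "giri", "giri"],
})
print(eda_summary_sketch(cleaned))
```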

---

# Get Quality Score

```python
score = pipeline.quality_score()
print(score)
```

---

# Cleaning Recommendations

```python
pipeline.recommend_cleaning()
```

---

# Public API (Available Functions)

After installing and importing the library, the following methods are available on the `CleanPipeline` class.

Example import:

```python
from data_cleaner_lib import CleanPipeline
```

---

## Core Pipeline

| Function            | Purpose                                   |
| ------------------- | ----------------------------------------- |
| `prepare_for_eda()` | Runs the full automated cleaning pipeline |
| `detect_types()`    | Detect semantic column types              |
| `profile()`         | Generate dataset profiling report         |

---

## Cleaning Operations

| Function              | Purpose                                           |
| --------------------- | ------------------------------------------------- |
| `fix_dtypes()`        | Convert mixed data into proper datatypes          |
| `handle_missing()`    | Fill missing values using smart strategies        |
| `remove_duplicates()` | Detect and remove duplicate rows                  |
| `handle_outliers()`   | Detect and cap outliers using statistical methods |
| `clean_text()`        | Normalize and clean text columns                  |

---

## Data Intelligence

| Function               | Purpose                                     |
| ---------------------- | ------------------------------------------- |
| `quality_score()`      | Calculate dataset quality score (0–100)     |
| `recommend_cleaning()` | Generate automatic cleaning recommendations |

---

## Reporting

| Function        | Purpose                              |
| --------------- | ------------------------------------ |
| `eda_summary()` | Produce an EDA-ready dataset summary |
| `report()`      | Generate cleaning operation report   |

---

## Example Usage

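```python
import pandas as pd
from data_cleaner_lib import CleanPipeline

# Any DataFrame works; this small one is only for illustration
df = pd.DataFrame({
    "name": ["John", "Jane", None],
    "age": [25, None, 47],
})

pipeline = CleanPipeline(df)

# Inspect the data first
pipeline.detect_types()
pipeline.profile()

# Run the cleaning steps one at a time
pipeline.fix_dtypes()
pipeline.handle_missing()
pipeline.remove_duplicates()
pipeline.handle_outliers()
pipeline.clean_text()

summary = pipeline.eda_summary()
```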

---

# Project Structure

```
data_cleaner_lib/

src/
  data_cleaner_lib/

    detection/
    profiling/
    cleaning/
    reporting/
    quality/
    recommendations/

examples/
tests/
```

---

# Running Tests

```
pytest
```

---

# Roadmap

Future improvements:

- Schema validation
- Rule-based cleaning engine
- Config-driven pipelines
- Advanced anomaly detection

---

# License

MIT License

---

# Contributing

Pull requests and improvements are welcome.

If you find a bug or want a feature, open an issue.

---

# Author

Giri V

Developed as a data cleaning framework for analysts preparing datasets before EDA and machine learning workflows.
