Metadata-Version: 2.4
Name: secretsentry
Version: 2.0.0
Summary: Advanced sensitive data scanner with Jupyter notebook support and intelligent false positive filtering
Author-email: Abdul Jilani <abdul.jilani@evolveailabs.com>
License: MIT
Project-URL: Homepage, https://github.com/y2ee201/secretsentry
Project-URL: Documentation, https://github.com/y2ee201/secretsentry#readme
Project-URL: Repository, https://github.com/y2ee201/secretsentry.git
Project-URL: Bug Tracker, https://github.com/y2ee201/secretsentry/issues
Keywords: security,secrets,scanner,pii,jupyter,notebook,api-keys,credentials,sanitization,privacy,devops,ci-cd,machine-learning,ml,transformers,false-positives
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: System :: Systems Administration
Classifier: Topic :: Utilities
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tqdm>=4.62.0
Requires-Dist: numpy>=1.19.0
Provides-Extra: full
Requires-Dist: pandas>=1.3.0; extra == "full"
Requires-Dist: matplotlib>=3.3.0; extra == "full"
Requires-Dist: seaborn>=0.11.0; extra == "full"
Requires-Dist: ipywidgets>=7.6.0; extra == "full"
Requires-Dist: scikit-learn>=1.0.0; extra == "full"
Requires-Dist: transformers>=4.20.0; extra == "full"
Requires-Dist: torch>=1.11.0; extra == "full"
Requires-Dist: joblib>=1.1.0; extra == "full"
Provides-Extra: jupyter
Requires-Dist: ipywidgets>=7.6.0; extra == "jupyter"
Requires-Dist: pandas>=1.3.0; extra == "jupyter"
Requires-Dist: matplotlib>=3.3.0; extra == "jupyter"
Requires-Dist: seaborn>=0.11.0; extra == "jupyter"
Provides-Extra: analysis
Requires-Dist: pandas>=1.3.0; extra == "analysis"
Requires-Dist: matplotlib>=3.3.0; extra == "analysis"
Requires-Dist: seaborn>=0.11.0; extra == "analysis"
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.0.0; extra == "ml"
Requires-Dist: joblib>=1.1.0; extra == "ml"
Requires-Dist: numpy>=1.19.0; extra == "ml"
Provides-Extra: ml-advanced
Requires-Dist: scikit-learn>=1.0.0; extra == "ml-advanced"
Requires-Dist: transformers>=4.20.0; extra == "ml-advanced"
Requires-Dist: torch>=1.11.0; extra == "ml-advanced"
Requires-Dist: joblib>=1.1.0; extra == "ml-advanced"
Requires-Dist: numpy>=1.19.0; extra == "ml-advanced"
Dynamic: license-file

# SecretSentry 🛡️

> **The first AI-powered sensitive data scanner built for modern data science and web development workflows**

[![PyPI version](https://badge.fury.io/py/secretsentry.svg)](https://badge.fury.io/py/secretsentry)
[![Python Support](https://img.shields.io/pypi/pyversions/secretsentry.svg)](https://pypi.org/project/secretsentry/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

SecretSentry is an advanced sensitive data scanner that goes beyond traditional secret detection. Built specifically for **Jupyter notebooks**, **web development**, and **data science workflows**, it combines **machine learning** with regex patterns to intelligently filter false positives while detecting API keys, PII, credentials, and other sensitive information.

## 🎯 **Why SecretSentry?**

### **Built for Modern Workflows**
- 🔬 **Jupyter Notebook Specialist**: First scanner designed for `.ipynb` files
- 🤖 **AI-Powered Detection**: Machine learning models reduce false positives by up to 80%
- 🧠 **Smart Context Awareness**: Understands code context, not just pattern matching
- 🌐 **Multi-Environment**: CLI, Jupyter notebooks, and Python scripts
- 🎛️ **Interactive Analysis**: Built-in widgets for exploring findings

### **Comprehensive Detection**
- 🔑 **50+ Built-in Patterns**: API keys, tokens, secrets, credentials
- 👤 **PII Detection**: SSNs, credit cards, phone numbers, emails
- 💰 **Financial Data**: Salary information, bank accounts, routing numbers
- 🌍 **Geographic Data**: Coordinates, IP addresses, postal codes
- 🏥 **Sensitive Categories**: Ethnic data, religious information, medical records

### **Advanced Features**
- 🛡️ **Smart Sanitization**: Context-aware gibberish replacement
- 🤖 **Ensemble Detection**: Combines regex + ML for maximum accuracy
- 📊 **Rich Visualizations**: Charts and statistics (with matplotlib/seaborn)
- 📈 **Pandas Integration**: Export to DataFrames for analysis
- 🎯 **Confidence Scoring**: ML predictions with 0.0-1.0 confidence scores
- 🔄 **CI/CD Ready**: Perfect for automation and pipelines
- 🖥️ **Cross-Platform**: Works on macOS, Windows, and Linux

## 🚀 **Quick Start**

### **Installation**

```bash
# Basic installation (regex-only detection)
pip install secretsentry

# With machine learning capabilities
# (quote the extras so shells such as zsh do not expand the brackets)
pip install "secretsentry[ml]"

# Advanced ML with transformers (best accuracy)
pip install "secretsentry[ml-advanced]"

# Full installation with all features
pip install "secretsentry[full]"

# For Jupyter notebooks only
pip install "secretsentry[jupyter]"
```

### **Basic Usage**

```python
from secretsentry import SecretSentry, quick_scan, quick_ml_scan

# Quick scan with regex detection
scanner = quick_scan("./my_project")

# Quick scan with AI/ML enhancement (recommended)
scanner = quick_ml_scan("./my_project", confidence_threshold=0.7)

# Manual scanning with ML capabilities
scanner = SecretSentry(
    use_ml_detection=True,
    ml_confidence_threshold=0.7,
    ml_ensemble_mode=True  # Combines regex + ML
)
findings = scanner.scan_directory("./my_project")
scanner.display_findings()

# Access ML-specific results
ml_findings = scanner.get_ml_findings()
high_confidence = scanner.get_high_confidence_findings(0.8)

# Sanitize files (creates backups automatically)
stats = scanner.sanitize_files(dry_run=True)  # Preview changes
stats = scanner.sanitize_files()  # Actually sanitize
```

### **Command Line**

```bash
# Basic regex scanning
secretsentry scan ./my_project --display

# AI-enhanced scanning (recommended)
secretsentry scan ./my_project --ml --display

# Quick ML scan with optimal settings
secretsentry scan ./my_project --ml-quick

# ML-only detection with custom confidence
secretsentry scan ./my_project --ml-only --ml-confidence 0.8

# Check ML requirements
secretsentry scan --check-ml

# Export findings with ML metadata
secretsentry scan ./my_project --ml --export findings.json

# Sanitize files (with backup)
secretsentry scan ./my_project --sanitize --dry-run
secretsentry scan ./my_project --sanitize

# List all detection patterns
secretsentry list-patterns
```

## 🤖 **AI-Powered Detection**

SecretSentry's machine learning capabilities provide **context-aware detection** that dramatically reduces false positives:

### **ML Detection Modes**

```python
# Ensemble Mode (recommended): Combines regex + ML
scanner = SecretSentry(
    use_ml_detection=True,
    ml_ensemble_mode=True,
    ml_confidence_threshold=0.7
)

# ML-Only Mode: Pure machine learning detection
scanner = SecretSentry(
    use_ml_detection=True,
    ml_ensemble_mode=False,
    ml_confidence_threshold=0.8
)

# Quick ML scan with optimal settings
scanner = quick_ml_scan("./my_project")
```

### **ML Features**

- 🧠 **Context Understanding**: Analyzes surrounding code context, not just patterns
- 📊 **Confidence Scoring**: Every ML detection includes a 0.0-1.0 confidence score  
- 🔬 **Feature Extraction**: Text entropy, keyword analysis, pattern recognition
- 🏋️ **Multiple Models**: Logistic Regression, Isolation Forest, optional Transformers
- 💾 **Model Caching**: Trained models cached for faster subsequent scans
- 🖥️ **Local Processing**: All ML inference happens on your machine (no data sent externally)
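
The feature extraction step can be pictured with a small standard-library sketch. It is not SecretSentry's internal code: the function names, the keyword list, and the feature set below are assumptions chosen for illustration. It computes Shannon entropy (bits per character) plus two simple context features for one assignment.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical keyword list, for illustration only.
SECRET_KEYWORDS = ("key", "token", "secret", "password")

def extract_features(name: str, value: str) -> dict:
    """Build a small feature dictionary for one `name = value` assignment."""
    return {
        "entropy": shannon_entropy(value),
        "length": len(value),
        "has_keyword": any(k in name.lower() for k in SECRET_KEYWORDS),
    }

features = extract_features("API_KEY", "sk_live_1234567890abcdef")
print(features)
```

Random-looking credentials tend to score higher on entropy than ordinary identifiers, which is why entropy is a common input to a classifier alongside keyword and pattern features.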

### **ML Requirements**

```bash
# Check what's available on your system
secretsentry scan --check-ml

# Install ML dependencies
pip install "secretsentry[ml]"           # Basic ML (scikit-learn)
pip install "secretsentry[ml-advanced]"  # Advanced ML (transformers)
```

## 🎓 **Jupyter Notebook Integration**

SecretSentry shines in Jupyter environments with **zero false positives** from notebook metadata:

```python
# In Jupyter notebook
from secretsentry import quick_scan, quick_ml_scan

# Quick ML scan with visualizations
scanner = quick_ml_scan("./test_data", show_plots=True)

# Interactive exploration with ML metadata
scanner.create_interactive_viewer()

# Data analysis with ML findings
df = scanner.to_dataframe(include_ml_findings=True)
summary = df.groupby(['pattern_type', 'detection_method']).size()

# Analyze confidence scores
ml_df = df[df['detection_method'] == 'ml']
confidence_analysis = ml_df['confidence_score'].describe()
```
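
One way to avoid matches on notebook metadata is to parse the `.ipynb` JSON and scan only cell sources. The sketch below assumes the standard nbformat 4 layout (`cells`, `cell_type`, `source`); the helper name `iter_code_cells` is hypothetical and this is not necessarily how SecretSentry reads notebooks.

```python
import json

def iter_code_cells(notebook_json: str):
    """Yield (cell_index, source_text) for each code cell in a notebook."""
    notebook = json.loads(notebook_json)
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores source as either a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        yield index, source

# Tiny in-memory notebook so the example runs without a file on disk.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["API_KEY = 'sk_live_1234567890abcdef'\n", "print(API_KEY)"]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})

code_cells = list(iter_code_cells(sample))
for index, source in code_cells:
    print(index, repr(source))
```

This sketch looks at code cell sources only. Markdown cells and text outputs can also contain secrets, so a fuller version would probably scan those as well.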

## 📊 **What Makes It Special**

### **AI-Enhanced Accuracy**

**Traditional regex scanners** flag these as secrets:
```
❌ aws_secret_key: iVBORw0KGgoAAAANSUhEUgAABKYAAAMW...  # Just a PNG image!
❌ api_key: "cell_type": "code"  # Notebook metadata!  
❌ secret: #3498db  # CSS color!
❌ token: "placeholder_for_testing"  # Test data!
```

**SecretSentry with ML** understands context and only reports **real secrets**:
```
✅ aws_access_key: AKIAIOSFODNN7EXAMPLE (confidence: 0.95)
✅ stripe_key: sk_live_1234567890abcdef123456789 (confidence: 0.89)  
✅ database_url: postgresql://user:password@localhost/db (confidence: 0.92)
```
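
Several of the regex-only false positives listed earlier can be ruled out with plain value checks. The following is a rule-based sketch, not SecretSentry's ML filter: the function name and the placeholder word list are assumptions made for illustration.

```python
import re

# Three- or six-digit CSS hex colors such as #fff or #3498db.
HEX_COLOR = re.compile(r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$")
# Base64-encoded PNG data starts with these characters.
PNG_BASE64_PREFIX = "iVBORw0KGgo"
# Hypothetical list of words that suggest test or placeholder values.
PLACEHOLDER_WORDS = ("placeholder", "changeme", "dummy", "your_api_key")

def looks_like_false_positive(value: str) -> bool:
    """Return True if `value` resembles a color, an image blob, or a placeholder."""
    stripped = value.strip().strip("\"'")
    if HEX_COLOR.match(stripped):
        return True
    if stripped.startswith(PNG_BASE64_PREFIX):
        return True
    lowered = stripped.lower()
    return any(word in lowered for word in PLACEHOLDER_WORDS)

for candidate in ["#3498db", '"placeholder_for_testing"', "AKIAIOSFODNN7EXAMPLE"]:
    print(candidate, looks_like_false_positive(candidate))
```

Fixed rules like these are brittle, which is the gap a learned model with confidence scores is meant to close.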

**ML Advantages:**
- 🎯 **Context Awareness**: Understands surrounding code patterns
- 📊 **Confidence Scoring**: Know how certain each detection is
- 🧠 **Learning**: Improves over time with usage patterns
- 🛡️ **Adaptive**: Handles new secret formats without regex updates

### **Smart Sanitization**

SecretSentry doesn't just find secrets—it **fixes them safely**:

```python
# Before sanitization
API_KEY = "sk_live_1234567890abcdef"
employee_ssn = "123-45-6789"
coordinates = "40.7128, -74.0060"

# After sanitization (context-aware gibberish)
API_KEY = "sk_live_xK8mP9nQ4vL7wR2Z"
employee_ssn = "456-78-9123"  
coordinates = "38.8951, -77.0364"
```
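
A format-preserving replacement like the one shown can be sketched in a few lines. This is an illustrative approximation, not the library's sanitizer: `scramble_preserving_format`, `keep_prefix`, and `seed` are hypothetical names. Digits are swapped for random digits, letters for random letters of the same case, and punctuation is kept so the value still looks like its original type.

```python
import random
import string

def scramble_preserving_format(value: str, keep_prefix: int = 0, seed=None) -> str:
    """Replace letters and digits with random ones, keeping layout and prefix."""
    rng = random.Random(seed)
    out = []
    for i, ch in enumerate(value):
        if i < keep_prefix:
            out.append(ch)  # keep recognizable prefixes such as "sk_live_"
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # separators like "-", "_", "." stay in place
    return "".join(out)

print(scramble_preserving_format("sk_live_1234567890abcdef", keep_prefix=8))
print(scramble_preserving_format("123-45-6789"))
```

The `seed` argument exists only to make runs reproducible in tests. A single character may occasionally be replaced by itself, so a stricter version might re-draw until every character differs.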

## 🔧 **Advanced Usage**

### **Custom Patterns**

```python
# Add organization-specific patterns
custom_patterns = {
    'employee_id': r'EMP-\d{6}',
    'project_code': r'PROJ-[A-Z]{3}-\d{4}',
    'internal_api': r'internal_key_[a-zA-Z0-9]{32}'
}

scanner = SecretSentry(custom_patterns=custom_patterns)
```
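
Before handing custom patterns to the scanner, it can help to check them against known-good samples using only the standard `re` module. The sample values and the `check_patterns` helper below are made up for this sketch and are not part of SecretSentry.

```python
import re

custom_patterns = {
    'employee_id': r'EMP-\d{6}',
    'project_code': r'PROJ-[A-Z]{3}-\d{4}',
    'internal_api': r'internal_key_[a-zA-Z0-9]{32}'
}

# Hypothetical sample values that each pattern is expected to match.
samples = {
    'employee_id': 'EMP-123456',
    'project_code': 'PROJ-ABC-2024',
    'internal_api': 'internal_key_' + 'a1B2' * 8,  # 32 alphanumeric characters
}

def check_patterns(patterns: dict, samples: dict) -> dict:
    """Compile each pattern and report whether it matches its sample string."""
    results = {}
    for name, pattern in patterns.items():
        compiled = re.compile(pattern)  # raises re.error on an invalid pattern
        results[name] = bool(compiled.fullmatch(samples[name]))
    return results

results = check_patterns(custom_patterns, samples)
print(results)
```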

### **CI/CD Integration**

```python
#!/usr/bin/env python3
# security_check.py
import sys
from secretsentry import SecretSentry

def security_gate():
    # Use ML-enhanced detection for better accuracy in CI/CD
    scanner = SecretSentry(
        use_ml_detection=True,
        ml_ensemble_mode=True,
        ml_confidence_threshold=0.8  # Higher threshold for CI/CD
    )
    findings = scanner.scan_directory(".", show_progress=False)
    
    if findings:
        print(f"❌ SECURITY CHECK FAILED: {len(findings)} secrets found")
        
        # Show high-confidence ML findings first
        if scanner.use_ml_detection:
            ml_findings = scanner.get_ml_findings()
            high_conf = scanner.get_high_confidence_findings(0.9)
            print(f"🤖 ML Analysis: {len(ml_findings)} ML findings, {len(high_conf)} high confidence")
        
        scanner.display_findings(max_display=10)
        return 1
    else:
        print("✅ SECURITY CHECK PASSED: No secrets detected")
        return 0

if __name__ == "__main__":
    sys.exit(security_gate())
```

**CI/CD CLI Usage:**
```bash
# Basic CI/CD check
secretsentry scan . --ml --quiet || exit 1

# High-confidence only for sensitive deployments  
secretsentry scan . --ml-only --ml-confidence 0.9 --quiet || exit 1
```

### **Batch Processing**

```python
# Scan multiple projects with ML
from secretsentry import SecretSentry
import os

projects = ["./frontend", "./backend", "./data-science"]
all_results = {}

for project in projects:
    if os.path.exists(project):
        # Use ML for better accuracy across different project types
        scanner = SecretSentry(
            use_ml_detection=True,
            ml_ensemble_mode=True,
            ml_confidence_threshold=0.7
        )
        findings = scanner.scan_directory(project)
        
        # Collect ML statistics
        ml_findings = scanner.get_ml_findings()
        all_results[project] = {
            'total_findings': len(findings),
            'ml_findings': len(ml_findings),
            'high_confidence': len(scanner.get_high_confidence_findings(0.8))
        }
        
        # Export detailed reports with ML metadata
        scanner.export_findings(f"{project.replace('./', '')}_security_report.json")

print("Security Summary:", all_results)
```

## 📈 **Detection Categories**

<details>
<summary><b>🔑 API Keys & Secrets (20+ patterns)</b></summary>

- AWS Access/Secret Keys
- GitHub Tokens (classic & fine-grained)  
- Google API Keys
- Stripe Keys (live & test)
- Slack Tokens & Webhooks
- SendGrid API Keys
- Twilio Keys
- Mailgun Keys
- Azure Storage Keys
- Heroku API Keys
- Generic API patterns

</details>

<details>
<summary><b>💳 Financial Data (8+ patterns)</b></summary>

- Credit Cards (Visa, MasterCard, AmEx, Discover, JCB, Diners)
- Bank Account Numbers
- Routing Numbers  
- IBAN & SWIFT Codes
- Salary Information

</details>
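
Card number candidates found by a regex are commonly validated with the Luhn checksum, which discards most random digit runs. Whether SecretSentry applies this check is not stated here, so the sketch below is a generic illustration; the function name and the minimum length of 12 digits are assumptions.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 12:  # assumed lower bound for card-like numbers
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # widely used Visa test number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```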

<details>
<summary><b>👤 Personal Information (10+ patterns)</b></summary>

- Social Security Numbers
- Phone Numbers (US & International)
- Email Addresses
- Passport Numbers
- Driver's License Numbers
- Medical Record Numbers

</details>

<details>
<summary><b>🌍 Geographic Data (5+ patterns)</b></summary>

- GPS Coordinates
- IP Addresses (IPv4 & IPv6)
- MAC Addresses  
- ZIP/Postal Codes

</details>

<details>
<summary><b>🏥 Sensitive Personal Data (5+ patterns)</b></summary>

- Ethnic/Racial Categories
- Religious Affiliations  
- Medical Information
- Disability Status

</details>

<details>
<summary><b>🔐 Cryptographic Material (5+ patterns)</b></summary>

- Private Keys (RSA, SSH)
- Public Keys & Certificates
- JWT Tokens
- OAuth Tokens  

</details>

## 🎛️ **Configuration**

### **Environment Variables**
```bash
# Disable progress bars
export SECRETSENTRY_NO_PROGRESS=1

# Custom config file
export SECRETSENTRY_CONFIG=/path/to/config.json

# ML model cache directory (optional)
export SECRETSENTRY_MODEL_CACHE=/path/to/ml/models

# Force ML detection on/off
export SECRETSENTRY_USE_ML=true

# ML confidence threshold
export SECRETSENTRY_ML_CONFIDENCE=0.7
```

### **Configuration File**
```json
{
    "excluded_patterns": ["test_", "example_", "demo_"],
    "excluded_files": ["*.test.js", "test_*.py"],
    "excluded_dirs": ["tests", "examples", "docs"],
    "custom_patterns": {
        "company_id": "COMP-\\d{8}"
    },
    "sanitization": {
        "create_backups": true,
        "backup_suffix": ".backup"
    },
    "ml_detection": {
        "enabled": true,
        "confidence_threshold": 0.7,
        "ensemble_mode": true,
        "use_transformers": false,
        "model_cache_dir": "~/.cache/secretsentry/models"
    }
}
```
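
To see how the `excluded_files` and `excluded_dirs` entries might be applied, the sketch below loads a trimmed copy of the configuration and tests a few paths. The matching rules (glob match on the file name, exact match on any parent directory name) are assumptions, and `is_excluded` is a hypothetical helper rather than a SecretSentry function.

```python
import fnmatch
import json
from pathlib import PurePosixPath

CONFIG_TEXT = """
{
    "excluded_files": ["*.test.js", "test_*.py"],
    "excluded_dirs": ["tests", "examples", "docs"]
}
"""

def is_excluded(path: str, config: dict) -> bool:
    """Return True if `path` sits in an excluded directory or matches a file glob."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Any parent directory named in excluded_dirs excludes the file.
    if any(part in config.get("excluded_dirs", []) for part in parts[:-1]):
        return True
    filename = parts[-1]
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(fnmatch.fnmatchcase(filename, pattern)
               for pattern in config.get("excluded_files", []))

config = json.loads(CONFIG_TEXT)
for path in ["src/app.test.js", "tests/helpers.py", "src/main.py"]:
    print(path, is_excluded(path, config))
```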

## ⚡ **Performance & Requirements**

### **ML Performance**

| Detection Mode | Speed | Accuracy | Memory Usage | Dependencies |
|---------------|-------|----------|--------------|--------------|
| **Regex Only** | ⚡⚡⚡⚡⚡ | ✅✅✅ | 🟢 Low | Minimal |
| **ML Basic** | ⚡⚡⚡⚡ | ✅✅✅✅ | 🟡 Medium | scikit-learn |
| **ML Advanced** | ⚡⚡⚡ | ✅✅✅✅✅ | 🔴 High | transformers |

### **System Requirements**

**Minimum (Regex-only):**
- Python 3.7+
- 50MB RAM
- Any CPU

**Recommended (ML Basic):**
- Python 3.8+
- 512MB RAM
- 2+ CPU cores
- 200MB disk space

**Optimal (ML Advanced):**  
- Python 3.9+
- 2GB+ RAM
- 4+ CPU cores
- 1GB disk space

### **Installation Time**

```bash
pip install secretsentry                # ~30 seconds
pip install "secretsentry[ml]"          # ~2 minutes
pip install "secretsentry[ml-advanced]" # ~5 minutes (downloads models)
```

### **First Run Performance**

- **Regex detection**: Instant
- **ML Basic**: ~30 seconds (model training on first run)
- **ML Advanced**: ~2 minutes (model download + training)
- **Subsequent runs**: Fast (models cached)

## 🤝 **Contributing**

We welcome contributions! Here's how to get started:

```bash
# Clone the repository
git clone https://github.com/y2ee201/secretsentry.git
cd secretsentry

# Install development dependencies (includes ML dependencies)
pip install -e ".[full]"
pip install pytest black flake8

# Run tests (includes ML tests)
pytest tests/

# Test ML functionality specifically
python test_ml_detection.py

# Format code
black secretsentry/
flake8 secretsentry/
```

## 📝 **License**

MIT License - see [LICENSE](LICENSE) file for details.

## 🙏 **Acknowledgments**

- Inspired by [detect-secrets](https://github.com/Yelp/detect-secrets) and [truffleHog](https://github.com/dxa4481/truffleHog)
- ML capabilities powered by [scikit-learn](https://scikit-learn.org/) and [Transformers](https://huggingface.co/transformers/)
- Built for the data science and security communities
- Special thanks to all contributors and the open source community
- Grateful to the broader AI/ML community for advancing secret detection research

## 📞 **Support**

- 📖 **Documentation**: [Full docs](https://github.com/y2ee201/secretsentry#readme)
- 🐛 **Issues**: [Report bugs](https://github.com/y2ee201/secretsentry/issues)
- 💬 **Discussions**: [Community forum](https://github.com/y2ee201/secretsentry/discussions)
- 📧 **Contact**: abdul.jilani@evolveailabs.com

---

**SecretSentry** - *Standing guard over your sensitive data* 🛡️
