Metadata-Version: 2.4
Name: perceptionml
Version: 0.1.0
Summary: A text embedding analysis pipeline for perception modeling
Author-email: "Raymond V. Li" <raymond@raymondli.me>
License: Apache-2.0
Project-URL: Homepage, https://github.com/raymondli/perceptionml
Project-URL: Bug Tracker, https://github.com/raymondli/perceptionml/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy==1.26.4
Requires-Dist: pandas==2.2.3
Requires-Dist: scipy==1.15.3
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: statsmodels==0.14.4
Requires-Dist: torch==2.6.0
Requires-Dist: torchaudio==2.6.0
Requires-Dist: torchvision==0.21.0
Requires-Dist: transformers==4.42.4
Requires-Dist: sentence-transformers==4.1.0
Requires-Dist: accelerate==1.7.0
Requires-Dist: safetensors==0.5.3
Requires-Dist: tokenizers==0.19.1
Requires-Dist: huggingface-hub==0.32.2
Requires-Dist: datasets==3.6.0
Requires-Dist: einops==0.8.1
Requires-Dist: umap-learn==0.5.7
Requires-Dist: hdbscan==0.8.40
Requires-Dist: xgboost==3.0.1
Requires-Dist: shap==0.43.0
Requires-Dist: matplotlib==3.10.3
Requires-Dist: seaborn==0.13.2
Requires-Dist: jinja2==3.1.6
Requires-Dist: pyyaml==6.0.2
Requires-Dist: click==8.1.8
Requires-Dist: tqdm==4.67.1
Dynamic: license-file

# perceptionML

A text embedding analysis pipeline for perception modeling and topic discovery.

## Features

- Generate text embeddings using state-of-the-art models (Sentence Transformers, OpenAI, etc.)
- Dimensionality reduction with UMAP or PCA
- Advanced clustering with HDBSCAN
- Interactive HTML visualizations
- Topic analysis and statistics
- Support for zero-presence analysis and category comparisons
- Multi-GPU support for large datasets

## Installation

```bash
pip install perceptionml
```

## Quick Start

### Simplest Usage - No Configuration Needed!

Point perceptionML at a CSV file that contains a text column:

```bash
perceptionml --data your_data.csv
```

perceptionML will automatically:
- Detect your text column (the column with the longest text)
- Find or create an ID column
- Identify numeric columns as outcomes
- Generate synthetic outcomes if no numeric columns exist
- Apply default settings tuned for finding many detailed topics
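
The detection rules above could look roughly like the following. This is an illustrative sketch that uses only the Python standard library, not perceptionML's actual code: the helper name `guess_columns`, the "longest on average" rule for the text column, and the treatment of a column named `id` are all assumptions.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def guess_columns(csv_text):
    """Guess the text, ID, and outcome columns of a CSV string.

    Hypothetical helper: the real tool's heuristics may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Normalize missing cells to empty strings and drop overflow fields.
    rows = [
        {k: (v or "") for k, v in row.items() if k is not None}
        for row in reader
    ]
    if not rows:
        raise ValueError("CSV has no data rows")
    columns = list(rows[0].keys())

    # A column counts as numeric when it has at least one non-empty cell
    # and every non-empty cell parses as a float.
    numeric = [
        c for c in columns
        if any(r[c] for r in rows)
        and all(_is_number(r[c]) for r in rows if r[c])
    ]

    # Assumed rule: the text column is the non-numeric column whose
    # cells are longest on average.
    candidates = [c for c in columns if c not in numeric]
    if not candidates:
        raise ValueError("no text column found")
    text_col = max(
        candidates,
        key=lambda c: sum(len(r[c]) for r in rows) / len(rows),
    )

    # Assumed rule: a column named "id" is the identifier, not an outcome.
    id_col = next((c for c in columns if c.lower() == "id"), None)
    outcomes = [c for c in numeric if c != id_col]
    return {"text": text_col, "id": id_col, "outcomes": outcomes}


sample = (
    "id,text,sentiment_score,rating\n"
    '1,"Great product, highly recommend!",0.95,5\n'
    '2,"Terrible experience, would not buy again",-0.87,1\n'
)
result = guess_columns(sample)
print(result)
```

When `id` comes back as `None`, a caller would need to create row identifiers itself, which corresponds to the "find or create an ID column" step.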

### What Your Data Should Look Like

Minimal CSV (just text):
```csv
text
"This is my first document about..."
"Another document with different content..."
```

CSV with outcomes to analyze:
```csv
id,text,sentiment_score,rating
1,"Great product, highly recommend!",0.95,5
2,"Terrible experience, would not buy again",-0.87,1
```
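
Text fields often contain commas or quotation marks, so it can be safer to generate the input file with a CSV library than by hand. The sketch below uses Python's standard `csv` module; the column names mirror the example above, while the function name `to_csv_text` and the second record's text are made up for illustration.

```python
import csv
import io


def to_csv_text(records, fieldnames):
    """Serialize a list of dicts to CSV text, quoting fields only when needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {"id": 1, "text": "Great product, highly recommend!",
     "sentiment_score": 0.95, "rating": 5},
    # Illustrative record with embedded double quotes.
    {"id": 2, "text": 'She said "never again"',
     "sentiment_score": -0.87, "rating": 1},
]
csv_text = to_csv_text(records, ["id", "text", "sentiment_score", "rating"])
print(csv_text)
```

To produce a file for `--data`, write `csv_text` to disk, or pass an open file (opened with `newline=""`) to `csv.DictWriter` instead of the in-memory buffer.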

### Basic Options

```bash
# Specify output file name
perceptionml --data your_data.csv --output my_analysis.html

# Sample large datasets
perceptionml --data your_data.csv --sample-size 10000

# Use specific embedding model
perceptionml --data your_data.csv --embedding-model nvidia/NV-Embed-v2

# Export results to CSV
perceptionml --data your_data.csv --export-csv
```

### Advanced Usage

For more control, you can:

1. **Use configuration files** for complex setups
2. **Adjust clustering granularity**:
   ```bash
   # Many small topics (default)
   perceptionml --data your_data.csv --auto-cluster many
   
   # Medium-sized topics
   perceptionml --data your_data.csv --auto-cluster medium
   
   # Few large topics
   perceptionml --data your_data.csv --auto-cluster few
   ```

3. **Override specific parameters**:
   ```bash
   perceptionml --data your_data.csv \
       --min-cluster-size 30 \
       --umap-neighbors 15
   ```
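
To compare the clustering presets side by side, the runs can be scripted. The following is a minimal sketch that builds one command per preset using only flags shown in this README. It assumes `--auto-cluster` and `--output` can be combined in one call; the helper names and the `analysis_<preset>.html` naming scheme are illustrative. By default it only prints the commands rather than running them.

```python
import shlex
import subprocess


def build_command(data_path, preset, output_path):
    """Build a perceptionml command from flags documented in this README."""
    return [
        "perceptionml",
        "--data", data_path,
        "--auto-cluster", preset,
        "--output", output_path,
    ]


def run_presets(data_path, presets=("many", "medium", "few"), dry_run=True):
    """Print (dry run) or execute one analysis per clustering preset."""
    commands = []
    for preset in presets:
        cmd = build_command(data_path, preset, f"analysis_{preset}.html")
        commands.append(cmd)
        if dry_run:
            # Show the shell-quoted command without executing it.
            print(shlex.join(cmd))
        else:
            # Raises CalledProcessError if the CLI exits with an error.
            subprocess.run(cmd, check=True)
    return commands


commands = run_presets("your_data.csv")
```

Calling `run_presets("your_data.csv", dry_run=False)` would execute the three runs in sequence, provided the `perceptionml` command is on the `PATH`.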

## Understanding the Output

The HTML visualization shows:
- **3D scatter plot** of your texts, clustered by topic
- **Topic keywords** extracted from each cluster
- **Statistics** about outcomes in different regions
- **Interactive controls** to explore the data

Click on points to read the original texts. Use the controls to filter by outcome values or focus on specific topics.

## Requirements

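- Python 3.8+ according to the package metadata; note that some pinned dependencies (for example `scipy==1.15.3` and `matplotlib==3.10.3`) only support Python 3.10 or newer, so Python 3.10+ is the safer choice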
- CUDA-capable GPU recommended for faster embedding generation
- 4GB+ RAM for typical datasets
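
A quick way to check an environment against these requirements is sketched below. It is not part of perceptionML: `environment_report` is a hypothetical helper, and the minimum Python version is a parameter so it can be adjusted. The CUDA check relies on PyTorch's `torch.cuda.is_available()` and is skipped when PyTorch is not installed.

```python
import importlib.util
import sys


def environment_report(min_python=(3, 8)):
    """Summarize the Python version and GPU availability."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except ImportError:
            # PyTorch is present but could not be imported cleanly.
            report["torch_installed"] = False
    return report


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `cuda_available` is `False`, embedding generation would presumably fall back to the CPU, which the GPU recommendation above suggests will be slower.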

## Support

- Documentation: [https://github.com/raymondli/perceptionml](https://github.com/raymondli/perceptionml)
- Issues: [https://github.com/raymondli/perceptionml/issues](https://github.com/raymondli/perceptionml/issues)
- Author: Raymond V. Li (raymond@raymondli.me)

## License

Apache-2.0. See the LICENSE file for details.
