Metadata-Version: 2.4
Name: charded
Version: 1.0.0
Summary: charded
Author-email: Alex Kalaverin <alex@kalaver.in>
Project-URL: Homepage, https://kalaver.in
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Requires-Python: <3.14,>=3.12
Description-Content-Type: text/markdown
Requires-Dist: charset-normalizer>=3.4.7
Requires-Dist: kain<2,>=1.1.3
Requires-Dist: python-magic>=0.4.27

# charded

[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/)
[![License: BSD](https://img.shields.io/badge/License-BSD-yellow.svg)](https://opensource.org/licenses/BSD-3-Clause)

`charded` is a type-safe, generic string/bytes container with automatic charset
detection. It wraps anything you throw at it — `str`, `bytes`, even `int` — and
lets you work with it as either text or raw bytes, without guessing encodings.

## Installation

Via [uv](https://docs.astral.sh/uv/):

```bash
uv add charded
```

---

## Features

- **Generic type safety** — `Str[str]` or `Str[bytes]` with compile-time checking
- **Automatic charset detection** — combines `charset-normalizer`, `python-magic`,
  and internal heuristics
- **MIME type sniffing** — detect content type via libmagic
- **Zero-copy lazy evaluation** — cached properties, decode/encode only when needed
- **Full `str` proxy** — use `len()`, `[]`, `split()`, `startswith()`, `tokenize()`,
  etc. directly on the container
- **Binary detection** — adaptive hybrid sampling strategy for large files
- **Utility functions** — `to_ascii()`, `to_bytes()`, `to_text()` for quick conversions

---

## Quick Start

### Wrap anything

```python
from charded import Str

# From bytes with unknown encoding
x = Str(b"\xc3\xa9l\xc3\xa8ve")  # UTF-8 French text
print(x.string)   # "élève"
print(x.charset)  # "utf-8"

# From str
y = Str("hello")
print(y.bytes)    # b"hello"

# Or use built-in constructors directly
print(bytes(Str("a")))   # b'a'
print(str(Str(b"asd")))  # "asd"
```

### Automatic charset detection

```python
raw = b"\xef\xf0\xe8\xea"  # some Russian text in unknown encoding
s = Str(raw)

print(s.charset)           # detected encoding, e.g. "cp1251"
print(str(s))              # decoded text
print(bytes(s))            # original bytes
```

Detection order:
1. Explicit `charset=` argument if provided
2. `charset-normalizer` result
3. Internal heuristic scan (adaptive binary-offset sampling)
4. Fallback to ASCII

### MIME type detection

```python
data = Str(b"%PDF-1.4\n1 0 obj")
print(data.mime)
# {"type": "application/pdf", "description": "PDF document, version 1.4"}
```

### Use as a string

`Str` proxies almost all `str` methods, so you can use it directly:

```python
s = Str(b"hello world")

len(s)           # 11
s[0:5]           # "hello"
s.split()        # ["hello", "world"]
s.startswith("h")  # True
s.upper()        # "HELLO WORLD"
```

### Tokenize

```python
s = Str("foo-bar_baz, qux")
print(s.tokenize(r"[\w]+"))
# ("foo", "bar", "baz", "qux")
```

### Utility functions

```python
from charded import to_ascii, to_bytes, to_text

# Force ASCII transliteration
text = to_ascii("Café résumé naïve")
# "Cafe resume naive"

# Convert anything to bytes
raw = to_bytes("hello", charset="utf-8")

# Convert anything to text
text = to_text(b"hello", charset="utf-8")
```

---

## Type Safety

`Str` is a true generic. Use `Str[str]` or `Str[bytes]` for explicit typing:

```python
from charded import Str

def process(data: Str[bytes]) -> str:
    return data.string.upper()

process(Str(b"hello"))   # OK
process(Str("hello"))     # type error (mypy/pyright will catch it)
```

Type aliases are also provided:

```python
from charded.string import StrBytes, StrText

def load(raw: StrBytes) -> StrText:
    return Str(raw.string)
```

---

## How charset detection works

For small inputs (< 64 bytes), `charded` does a full scan. For medium and large
inputs it uses an **adaptive hybrid binary-offset strategy**:

- **Header-heavy sampling** — first 1 KB scanned with moderate density
- **Exponential sparse middle** — O(log n) samples for large files
- **Footer sampling** — last 1 KB checked for trailing signatures

This guarantees detection of BOMs, null-byte injection, and encoding signatures
without reading the entire file into memory.

---

## API Reference

### `Str[T]`

| Member | Type | Description |
|--------|------|-------------|
| `Str.to_bytes(obj, charset=None)` | `classmethod` | Convert anything to `bytes` |
| `Str.to_text(obj, charset=None)` | `classmethod` | Convert anything to `str` |
| `Str.to_ascii(x, charset=None)` | `staticmethod` | Transliterate to ASCII |
| `mime` | `property` | `{"type": ..., "description": ...}` or `None` |
| `tokenize(regex)` | `method` | Regex tokenization, returns `tuple[str, ...]` |

All standard `str` methods (`split`, `startswith`, `upper`, etc.) are proxied.

---

## License

BSD 3-Clause. See [LICENSE](LICENSE) for details.
