Metadata-Version: 2.4
Name: rfmix-reader
Version: 0.2.1
Summary: RFMix-reader is a Python package designed to efficiently read and process output files generated by RFMix, a popular tool for estimating local ancestry in admixed populations. The package employs a lazy loading approach, which minimizes memory consumption by reading only the loci that are accessed by the user, rather than loading the entire dataset into memory at once.
License: GPL-3.0-or-later
License-File: LICENSE
Keywords: file parser,rfmix,gpu acceleration,local ancestry
Author: Kynon J.M. Benjamin
Author-email: kj.benjamin90@gmail.com
Maintainer: Kynon J.M. Benjamin
Maintainer-email: kj.benjamin90@gmail.com
Requires-Python: >=3.11,<3.15
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: docs
Provides-Extra: gpu
Provides-Extra: io
Provides-Extra: tests
Provides-Extra: viz
Requires-Dist: cairosvg (>=2.7,<3.0) ; extra == "viz"
Requires-Dist: cyvcf2 (>=0.31)
Requires-Dist: dask (>=2025.1,<2026.0)
Requires-Dist: matplotlib (>=3.10,<4.0) ; extra == "viz"
Requires-Dist: numpy (>=1.23,<3)
Requires-Dist: pandas (>=2.0)
Requires-Dist: psutil (>=6,<7)
Requires-Dist: seaborn (>=0.13,<0.14) ; extra == "viz"
Requires-Dist: sphinx (>=7,<9) ; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints (>=2,<3) ; extra == "docs"
Requires-Dist: sphinx-copybutton (>=0.5,<0.6) ; extra == "docs"
Requires-Dist: sphinx-rtd-theme (>=2,<3) ; extra == "docs"
Requires-Dist: torch (>=2.8) ; extra == "gpu"
Requires-Dist: tqdm (>=4.66)
Project-URL: Bug Tracker, https://github.com/heart-gen/rfmix_reader/issues
Project-URL: homepage, https://rfmix-reader.readthedocs.io/en/latest/
Project-URL: repository, https://github.com/heart-gen/rfmix_reader.git
Description-Content-Type: text/markdown

# RFMix-reader
`RFMix-reader` is a Python package for efficiently reading and processing output
files generated by [`RFMix`](https://github.com/slowkoni/rfmix), a widely used tool
for estimating local ancestry in admixed populations.  
It uses **lazy loading** to minimize memory usage and, when a GPU is available,
**GPU acceleration** to speed up processing.

---

## Installation

`rfmix-reader` requires **Python 3.11 or later** (`>=3.11,<3.15`). Install from PyPI:

```bash
pip install rfmix-reader
```

### Installation Options

* **Basic install** (CPU only):

  ```bash
  pip install rfmix-reader
  ```

* **With GPU acceleration** (`torch`):

  ```bash
  pip install rfmix-reader[gpu]
  ```

* **With documentation tools** (`sphinx`, `sphinx-rtd-theme`):

  ```bash
  pip install rfmix-reader[docs]
  ```

* **With testing tools** (`pytest`):

  ```bash
  pip install rfmix-reader[tests]
  ```

### GPU Notes

* `torch` is installed as part of the `gpu` extra.
* For CUDA builds, install a matching GPU-enabled wheel for your system following the [PyTorch guide](https://pytorch.org/get-started/locally/).
* RAPIDS (`cudf`, `cupy`) wheels are version- and CUDA-specific. See the [RAPIDS install guide](https://docs.rapids.ai/install).
* CPU-only installations will still run efficiently, just without GPU acceleration.
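
To confirm which path an environment will take, a quick check along these lines may help. This is a minimal sketch, not part of `rfmix-reader`: the helper names `python_supported` and `cuda_available` are hypothetical, and the version bounds are taken from the package's declared `Requires-Python` range (`>=3.11,<3.15`).

```python
import importlib.util
import sys


def python_supported(version=sys.version_info):
    """Return True if the interpreter falls in the declared range (>=3.11, <3.15)."""
    return (3, 11) <= tuple(version[:2]) < (3, 15)


def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # CPU-only install without the gpu extra
    import torch

    return bool(torch.cuda.is_available())


print("Python supported:", python_supported())
print("CUDA available:", cuda_available())
```

If `cuda_available()` returns `False` on a machine with a GPU, the installed PyTorch wheel is most likely a CPU-only build.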

---

## Quickstart

```python
from rfmix_reader import read_rfmix

# Load RFMix outputs (two-population admixture example)
file_path = "examples/two_populations/out/"
loci_df, g_anc, local_array = read_rfmix(file_path)

print(loci_df.head())
print(g_anc.head())
print(local_array.shape)
```

---

## Key Features

* **Lazy Loading**: Reads data on-the-fly, reducing memory footprint.
* **Efficient Access**: Query specific loci or regions of interest.
* **Seamless Integration**: Works smoothly with `pandas`, `dask`, and other analysis tools.
* **Loci Imputation**: Impute local ancestry loci to dense genotype variant sites.
* **GPU Acceleration**: Automatic CUDA acceleration via PyTorch/CuPy when available.
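
The idea behind lazy loading from binary files can be sketched with the standard library alone. The example below is only an illustration of the general technique, under the assumption of a fixed-width row layout; it does not reproduce the binary format or internals that `rfmix-reader` actually uses, and the names `write_matrix` and `read_locus` are hypothetical.

```python
import os
import struct
import tempfile

N_LOCI, N_COLS = 5, 3        # hypothetical matrix shape
ROW_FMT = f"<{N_COLS}f"      # one row = N_COLS little-endian float32 values
ROW_SIZE = struct.calcsize(ROW_FMT)


def write_matrix(path, rows):
    """Write rows back to back as raw float32 values."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(ROW_FMT, *row))


def read_locus(path, index):
    """Read one row by seeking to its byte offset; other rows are never loaded."""
    with open(path, "rb") as fh:
        fh.seek(index * ROW_SIZE)
        return struct.unpack(ROW_FMT, fh.read(ROW_SIZE))


rows = [[float(i), i + 0.5, i + 0.25] for i in range(N_LOCI)]
path = os.path.join(tempfile.mkdtemp(), "mock_ancestry.bin")
write_matrix(path, rows)
print(read_locus(path, 3))  # (3.0, 3.5, 3.25)
```

Because each row has a fixed size, the cost of reading one locus does not depend on how many loci the file holds.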

---

## Simulation Data

Test datasets for two- and three-population admixture are available on Synapse:
[Synapse Project syn61691659](https://www.synapse.org/Synapse:syn61691659).

---

## Usage

### Binary Conversion

RFMix does not generate binary files directly.
Use `create_binaries` to generate them (also available as a CLI):

```bash
create-binaries two_pops/out/
```

```python
from rfmix_reader import create_binaries

create_binaries("two_pops/out/", binary_dir="./binary_files")
```

### Main Function

Once binaries are available, process RFMix results:

```python
from rfmix_reader import read_rfmix

loci, g_anc, admix = read_rfmix("two_pops/out/")
```

### Three Population Example

Binaries can also be generated on-the-fly within `read_rfmix` with
`generate_binary` set to `True`.

```python
loci, g_anc, admix = read_rfmix("examples/three_populations/out/",
                               binary_dir="./binary_files",
                               generate_binary=True)
```

### Loci Imputation

Impute local ancestry loci to variant positions for integration with genotype data:

```python
from rfmix_reader import interpolate_array
import pandas as pd
import dask.array as da

variant_loci_df = pd.DataFrame({
    "chrom": ["1", "1", "1", "1"],
    "pos": [100, 200, 300, 400],
    "i": [1, None, None, 2]
})
admix = da.random.random((2, 3))  # mock admixture data

z = interpolate_array(variant_loci_df, admix, "/path/to/output")
print(z.shape)
```
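
To make the idea of imputing ancestry at intermediate variant positions concrete, here is a minimal sketch of one simple approach, linear interpolation by base-pair position. It is an assumption for illustration only and may differ from what `interpolate_array` does internally; the function name `interpolate_by_position` and the values are hypothetical.

```python
from bisect import bisect_left


def interpolate_by_position(known_pos, known_val, query_pos):
    """Linearly interpolate values at query_pos from sorted known positions.

    Positions outside the known range take the nearest known value.
    """
    out = []
    for q in query_pos:
        j = bisect_left(known_pos, q)
        if j == 0:
            out.append(known_val[0])
        elif j == len(known_pos):
            out.append(known_val[-1])
        else:
            x0, x1 = known_pos[j - 1], known_pos[j]
            y0, y1 = known_val[j - 1], known_val[j]
            out.append(y0 + (y1 - y0) * (q - x0) / (x1 - x0))
    return out


# Mock ancestry values observed at two loci; variants at 200 and 300 fall between.
known_pos = [100, 400]
known_val = [0.0, 1.5]
print(interpolate_by_position(known_pos, known_val, [100, 200, 300, 400]))
# [0.0, 0.5, 1.0, 1.5]
```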

### Reading Haptools simulations

Use `read_simu` to load BGZF-compressed VCF files created by
`haptools simgenotype --pop_field`:

```python
from rfmix_reader import read_simu

loci_df, g_anc, admix = read_simu("/path/to/simulations/")
```

Haptools does **not** include the chromosome length in the `##contig`
header lines, but `read_simu` requires that metadata to index each VCF.
Copy the `contigs.txt` file Haptools generates from the FASTA you used
for simulation and reheader every file with the appropriate contig entry
before calling `read_simu`. The following snippet shows one approach
using `bcftools` and `tabix`:

```bash
CONTIGS="../../three_populations/_m/contigs.txt"
VCFDIR="gt-files"
CHR="chr${SLURM_ARRAY_TASK_ID}"    # one chromosome per SLURM array task
OUT="${VCFDIR}/${CHR}.vcf.gz"
IN="${VCFDIR}/back/${CHR}.vcf.gz"  # input VCF to reheader
HEADER="header.${CHR}.tmp"

# Look up the full contig line (with length) for this chromosome
CONTIG_LINE=$(grep -w "ID=${CHR}" "$CONTIGS")
if [[ -z "$CONTIG_LINE" ]]; then
    echo "ERROR: No contig line found for ${CHR} in $CONTIGS" >&2
    exit 1
fi

# Swap the length-less contig line for the full one, then reheader and index
bcftools view -h "$IN" \
    | sed "s/^##contig=<ID=${CHR}>.*/${CONTIG_LINE}/" > "$HEADER"
bcftools reheader -h "$HEADER" -o "$OUT" "$IN"
tabix -p vcf "$OUT"
rm -f "$HEADER"
```
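
Before calling `read_simu`, it can be useful to verify that the reheadered files actually carry contig lengths. The following is a minimal sketch using only the standard library; the helper `contigs_missing_length` is hypothetical and not part of `rfmix-reader`, and it assumes simple `key=value` contig fields without quoted commas. BGZF files are valid multi-member gzip, so `gzip.open` can read them.

```python
import gzip
import os
import re
import tempfile

CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def contigs_missing_length(vcf_path):
    """Return IDs of ##contig header lines that lack a length= field."""
    missing = []
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if not line.startswith("##"):
                break  # header meta-lines are over
            match = CONTIG_RE.match(line)
            if match is None:
                continue
            pairs = (kv.split("=", 1) for kv in match.group(1).split(",") if "=" in kv)
            fields = dict(pairs)
            if "length" not in fields:
                missing.append(fields.get("ID", "?"))
    return missing


# Demo on a tiny mock header: chr1 lacks a length, chr2 has one.
demo = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo, "wt") as fh:
    fh.write("##fileformat=VCFv4.2\n")
    fh.write("##contig=<ID=chr1>\n")
    fh.write("##contig=<ID=chr2,length=242193529>\n")
    fh.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")

print(contigs_missing_length(demo))  # ['chr1']
```

An empty list means every contig line in the header includes a length.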

---

## Development Install

For contributors:

```bash
git clone https://github.com/heart-gen/rfmix_reader.git
cd rfmix_reader
pip install -e ".[gpu,docs,tests]"
```

---

## Citation

If you use this software, please cite:

[![DOI](https://zenodo.org/badge/807052842.svg)](https://zenodo.org/doi/10.5281/zenodo.12629787)

Benjamin, K. J. M. (2024). **RFMix-reader (Version 0.2.0)** \[Computer software].
[https://github.com/heart-gen/rfmix\_reader](https://github.com/heart-gen/rfmix_reader)

Kynon JM Benjamin. *"RFMix-reader: Accelerated reading and processing for local ancestry studies."*
**bioRxiv** (2024).
DOI: [10.1101/2024.07.13.603370](https://www.biorxiv.org/content/10.1101/2024.07.13.603370v2).

---

## Funding

This work was supported by the National Institutes of Health,
National Institute on Minority Health and Health Disparities (NIMHD)
K99MD016964 / R00MD016964.


