Metadata-Version: 2.1
Name: cohort_creator
Version: 0.2.0
Summary: Creates a neuroimaging cohort by aggregating data across datasets.
Project-URL: Bug tracker, https://github.com/neurodatascience/cohort_creator/issues
Project-URL: Documentation, https://cohort-creator.readthedocs.io/en/latest/
Project-URL: Homepage, https://github.com/neurodatascience/cohort_creator
Author: Rémi Gau
Maintainer-email: Rémi Gau <remi.gau@gmail.com>
License: MIT
License-File: LICENSE
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Requires-Python: >=3.8
Requires-Dist: datalad
Requires-Dist: pandas
Requires-Dist: pybids
Requires-Dist: rich
Requires-Dist: rich-argparse
Provides-Extra: dev
Requires-Dist: black; extra == 'dev'
Requires-Dist: codespell; extra == 'dev'
Requires-Dist: cohort-creator[doc,test]; extra == 'dev'
Requires-Dist: flake8; extra == 'dev'
Requires-Dist: flake8-docstrings; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pandas-stubs; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Provides-Extra: doc
Requires-Dist: furo; extra == 'doc'
Requires-Dist: myst-parser; extra == 'doc'
Requires-Dist: numpydoc; extra == 'doc'
Requires-Dist: sphinx; extra == 'doc'
Requires-Dist: sphinx-argparse; extra == 'doc'
Requires-Dist: sphinx-copybutton; extra == 'doc'
Provides-Extra: docs
Requires-Dist: cohort-creator[doc]; extra == 'docs'
Provides-Extra: test
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-cov; extra == 'test'
Provides-Extra: tests
Requires-Dist: cohort-creator[test]; extra == 'tests'
Description-Content-Type: text/markdown

[![Test](https://github.com/neurodatascience/cohort_creator/actions/workflows/test.yml/badge.svg)](https://github.com/neurodatascience/cohort_creator/actions/workflows/test.yml)
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/neurodatascience/cohort_creator/main.svg)](https://results.pre-commit.ci/latest/github/neurodatascience/cohort_creator/main)
![License](https://img.shields.io/badge/license-MIT-blue.svg)
![https://github.com/psf/black](https://img.shields.io/badge/code%20style-black-000000.svg)
[![Sourcery](https://img.shields.io/badge/Sourcery-enabled-brightgreen)](https://sourcery.ai)
[![Documentation Status](https://readthedocs.org/projects/cohort-creator/badge/?version=latest)](https://cohort-creator.readthedocs.io/en/latest/?badge=latest)
[![codecov](https://codecov.io/gh/neurodatascience/cohort_creator/branch/main/graph/badge.svg?token=PMQYH0DIPX)](https://codecov.io/gh/neurodatascience/cohort_creator)

# Cohort creator

> **TL;DR**
>
> Creates a neuroimaging cohort by aggregating data across datasets.

Command line tool to:

- install a set of datalad datasets from openneuro,
- get the data for a set of participants,
- copy the data to a new directory structure to create a "cohort".

It takes 2 files as input that should list:

- datasets to be included in the cohort
- subjects in each dataset to be included in the cohort

Both of those files can be generated by the [neurobagel query tool](https://query.neurobagel.org/).

For examples of input TSV files, see this [page](https://cohort-creator.readthedocs.io/en/latest/inputs.html).

It [outputs the cohort](https://cohort-creator.readthedocs.io/en/latest/outputs.html)
following the recommendations
from the [BIDS extension proposal 35](https://docs.google.com/document/d/1tFRNumQyIgjXBNC3brFDLO9FaikjL84noxK6Om-Ctik).

## Requirements

### Operating system

It is recommended to use this package on Linux or macOS.

If you are on Windows, try using WSL (Windows Subsystem for Linux) to run this package:
Windows does not handle symbolic links well, and this package relies on symlinks.
If you decide to go ahead anyway, make sure you have a LOT of disk space available.

More information
[here](https://handbook.datalad.org/en/latest/intro/windows.html#ohnowindows)
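To get a rough idea of whether symlinks can be created on your system, a minimal sketch along the following lines may help. The function name `symlinks_supported` is hypothetical and is not part of cohort_creator or datalad; it only probes whether the operating system lets the current user create a symlink in a temporary folder, and does not reproduce any check that datalad performs itself.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def symlinks_supported(directory: Optional[str] = None) -> bool:
    """Return True if a symlink can be created inside `directory`.

    Hypothetical helper: when `directory` is None the probe runs in the
    system default temporary folder.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = Path(tmp) / "target.txt"
        target.write_text("probe")
        link = Path(tmp) / "link.txt"
        try:
            # On Windows this typically fails without developer mode
            # or administrator rights.
            os.symlink(target, link)
        except (OSError, NotImplementedError, AttributeError):
            return False
        return link.is_symlink()
```

Passing the folder you plan to use as output directory (assuming it already exists) probes the file system that will actually hold the data, for example `symlinks_supported("outputs")`.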

### Python dependencies

Make sure you have the following installed:

- datalad and its dependencies:

  - if you have anaconda / conda, it should be 'just' a matter of running
    ```bash
    conda install -c conda-forge datalad
    ```

  - But check the
    [installation instructions](https://handbook.datalad.org/en/latest/intro/installation.html#install)
    for more details.

Other dependencies are listed in the pyproject.toml file.

## Installation

```bash
git clone https://github.com/neurodatascience/cohort_creator.git
cd cohort_creator
pip install .
```

## Limitations

Cohorts can only be created by aggregating data from openneuro and openneuro derivatives.

### Latest datasets

Currently this should allow you to access more or less the following:

Number of datasets: 863 with 37441 subjects including:

- 692 datasets with MRI data
- with participants.tsv: 487
- with phenotype directory: 22
- with fmriprep: 90 (3937 subjects)
  - with participants.tsv: 74
  - with phenotype directory: 3
- with freesurfer: 36 (3322 subjects)
  - with participants.tsv: 34
  - with phenotype directory: 2
- with mriqc: 330 (14607 subjects)
  - with participants.tsv: 248
  - with phenotype directory: 18

It may be that very recent datasets are not available yet.

### Dataset types

It is currently only possible to get data from the following dataset types:

- raw
- mriqc
- fmriprep

Not yet possible to get freesurfer data via the cohort creator, though the data
is available in the sourcedata folder of the fmriprep datasets.

### Blind spots

It is possible that some metadata files (JSON, TSV) are not accessed
correctly if they are not in the root of the dataset or in the same folder as
the data file.

**FIX** use pybids / ancpbids for data indexing and querying.
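Until such a fix is in place, a small diagnostic along these lines could flag the files at risk. This is only a sketch: the function name `intermediate_metadata_files` is hypothetical, it is not part of cohort_creator, and it simply lists JSON and TSV files that sit in folders strictly between the dataset root and the folder of a data file, without checking whether they actually apply to that file.

```python
import os
from pathlib import Path
from typing import List


def intermediate_metadata_files(dataset_root: str, data_file: str) -> List[Path]:
    """List JSON / TSV files in folders between the root and the data file.

    The dataset root and the data file's own folder are excluded, since
    files there are not affected by the blind spot described above.
    """
    # abspath normalizes the path without following symlinks, which matters
    # because datalad data files are usually symlinks into the annex.
    root = Path(os.path.abspath(dataset_root))
    current = Path(os.path.abspath(data_file)).parent.parent
    found: List[Path] = []
    while current != root and root in current.parents:
        found.extend(
            sorted(
                p
                for p in current.iterdir()
                if p.is_file() and p.suffix in {".json", ".tsv"}
            )
        )
        current = current.parent
    return found
```

Any path returned for a file you requested is worth checking by hand in the cohort output folder.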

## Demo

To get the following from openneuro-derivatives for all T1w images:

- the MRIQC output for each file
- the corresponding T1w file

run the following command from within the cohort_creator folder:

```bash
cohort_creator install \
  --dataset_listing inputs/datasets_with_mriqc.tsv \
  --participant_listing inputs/participants_with_mriqc.tsv \
  --output_dir outputs \
  --dataset_types raw mriqc \
  --verbosity 3

cohort_creator get \
  --dataset_listing inputs/datasets_with_mriqc.tsv \
  --participant_listing inputs/participants_with_mriqc.tsv \
  --output_dir outputs \
  --dataset_types raw mriqc \
  --verbosity 3

cohort_creator copy \
  --dataset_listing inputs/datasets_with_mriqc.tsv \
  --participant_listing inputs/participants_with_mriqc.tsv \
  --output_dir outputs \
  --dataset_types raw mriqc \
  --verbosity 3
```
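The three commands above share the same options, so they can also be driven from a short Python script. The following is a minimal sketch that assumes `cohort_creator` is installed and available on your `PATH`; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of the package, and the options are exactly those used in the demo.

```python
import subprocess
from typing import List, Sequence

# Subcommands in the order used in the demo above.
SUBCOMMANDS = ("install", "get", "copy")


def build_commands(
    dataset_listing: str,
    participant_listing: str,
    output_dir: str,
    dataset_types: Sequence[str] = ("raw", "mriqc"),
    verbosity: int = 3,
) -> List[List[str]]:
    """Build one argument list per subcommand, all sharing the same options."""
    shared = [
        "--dataset_listing", dataset_listing,
        "--participant_listing", participant_listing,
        "--output_dir", output_dir,
        "--dataset_types", *dataset_types,
        "--verbosity", str(verbosity),
    ]
    return [["cohort_creator", sub, *shared] for sub in SUBCOMMANDS]


def run_pipeline(commands: Sequence[Sequence[str]]) -> None:
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

To reproduce the demo, call `run_pipeline(build_commands("inputs/datasets_with_mriqc.tsv", "inputs/participants_with_mriqc.tsv", "outputs"))` from within the cohort_creator folder.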
