Metadata-Version: 2.1
Name: raid-tool
Version: 5.0.0
Summary: RAID: Rapid Automated Interpretability Datasets tool
Home-page: https://github.com/arushisharma17/RAID
Author: Hrishikesha Kyathsandra, Zeynep Oghan, Arushi Sharma
Author-email: hk.hrishi30@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: asttokens==2.4.1
Requires-Dist: attrs==24.2.0
Requires-Dist: backcall==0.2.0
Requires-Dist: beautifulsoup4==4.12.3
Requires-Dist: bleach==6.1.0
Requires-Dist: blinker==1.8.2
Requires-Dist: certifi==2024.8.30
Requires-Dist: charset-normalizer==3.4.0
Requires-Dist: click==8.1.7
Requires-Dist: colorama==0.4.6
Requires-Dist: decorator==5.1.1
Requires-Dist: defusedxml==0.7.1
Requires-Dist: docopt==0.6.2
Requires-Dist: executing==2.1.0
Requires-Dist: fastjsonschema==2.20.0
Requires-Dist: Flask==3.0.3
Requires-Dist: idna==3.10
Requires-Dist: ipython==8.12.3
Requires-Dist: itsdangerous==2.2.0
Requires-Dist: jedi==0.19.1
Requires-Dist: Jinja2==3.1.4
Requires-Dist: jsonschema==4.23.0
Requires-Dist: jsonschema-specifications==2024.10.1
Requires-Dist: jupyter_client==8.6.3
Requires-Dist: jupyter_core==5.7.2
Requires-Dist: jupyterlab_pygments==0.3.0
Requires-Dist: MarkupSafe==2.1.5
Requires-Dist: matplotlib-inline==0.1.7
Requires-Dist: mistune==3.0.2
Requires-Dist: nbclient==0.10.0
Requires-Dist: nbconvert==7.16.4
Requires-Dist: nbformat==5.10.4
Requires-Dist: packaging==24.1
Requires-Dist: pandocfilters==1.5.1
Requires-Dist: parso==0.8.4
Requires-Dist: pickleshare==0.7.5
Requires-Dist: pipreqs==0.5.0
Requires-Dist: platformdirs==4.3.6
Requires-Dist: prompt_toolkit==3.0.48
Requires-Dist: pure_eval==0.2.3
Requires-Dist: Pygments==2.18.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: pywin32==308; platform_system == "Windows"
Requires-Dist: pyzmq==26.2.0
Requires-Dist: referencing==0.35.1
Requires-Dist: requests==2.32.3
Requires-Dist: rpds-py==0.20.0
Requires-Dist: six==1.16.0
Requires-Dist: soupsieve==2.6
Requires-Dist: stack-data==0.6.3
Requires-Dist: tinycss2==1.3.0
Requires-Dist: tornado==6.4.1
Requires-Dist: traitlets==5.14.3
Requires-Dist: urllib3==2.2.3
Requires-Dist: wcwidth==0.2.13
Requires-Dist: webencodings==0.5.1
Requires-Dist: Werkzeug==3.0.4
Requires-Dist: yarg==0.1.9
Requires-Dist: contourpy==1.3.0
Requires-Dist: cycler==0.12.1
Requires-Dist: dill==0.3.4
Requires-Dist: filelock==3.16.1
Requires-Dist: fonttools==4.54.1
Requires-Dist: fsspec==2024.9.0
Requires-Dist: graphviz==0.20.3
Requires-Dist: h5py==3.12.1
Requires-Dist: huggingface-hub==0.25.1
Requires-Dist: imbalanced-learn==0.12.4
Requires-Dist: joblib==1.4.2
Requires-Dist: kiwisolver==1.4.7
Requires-Dist: matplotlib==3.9.2
Requires-Dist: memory-profiler==0.61.0
Requires-Dist: mpmath==1.3.0
Requires-Dist: networkx==3.3
Requires-Dist: numpy==1.26.4
Requires-Dist: pandas==2.2.3
Requires-Dist: pillow==10.4.0
Requires-Dist: psutil==6.0.0
Requires-Dist: pyparsing==3.1.4
Requires-Dist: pytz==2024.2
Requires-Dist: PyYAML==6.0.2
Requires-Dist: regex==2024.9.11
Requires-Dist: safetensors==0.4.5
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: scipy==1.14.1
Requires-Dist: seaborn==0.11.1
Requires-Dist: svgwrite==1.4.1
Requires-Dist: sympy==1.13.3
Requires-Dist: threadpoolctl==3.5.0
Requires-Dist: tokenizers==0.20.0
Requires-Dist: torch==2.4.1
Requires-Dist: tqdm==4.66.5
Requires-Dist: transformers==4.45.1
Requires-Dist: tree-sitter==0.23.0
Requires-Dist: tree-sitter-java==0.23.2
Requires-Dist: tree-sitter-python==0.23.2
Requires-Dist: typing_extensions==4.12.2
Requires-Dist: tzdata==2024.2

# RAID (Rapid Automated Interpretability Datasets) tool
Designed to generate binary and multiclass datasets rapidly using regular expressions, AST labels, and semantic concepts. This tool enabled us to perform fine-grained analyses of concepts learned by the model through its latent representations. 

### Features
1. Labels of different granularities: tokens, phrases, blocks, other semantic chunks.
2. Corresponding activations for tokens, phrases, blocks.
4. B-I-0 Labelling for higher-level semantic concepts (Phrase and Block level chunks)
5. Activation aggregation for higher-level semantic concepts (Phrase and Block level chunks): You can generate activations once and experiment with different granularities by aggregating the activations.
6. Integration with static analysis tools to create custom labels
   a. Tree-sitter parsers (Abstract Syntax Tree based labels) - Syntactic
   b. CK metrics (Object Oriented metrics/Design patterns) - Structural
   c. CFG, DFG, AST Nesting/Parent-child relationships, etc -Hierarchical
   d. SE datasets, Ontologies, Design Patterns - Semantic
   e. Regular Expressions to create datasets, filter datasets, edit datasets.

## Installation Instructions

Install RAID and its dependencies:

```bash
pip install git+https://github.com/arushisharma17/NeuroX.git@fe7ab9c2d8eb1b4b3f93de73b8eaae57a6fc67b7
pip install raid-tool
```

### Usage

Run RAID on a Java file:

```bash
raid path/to/your/file.java --model bert-base-uncased --device cpu --binary_filter "set:public,static" --output_prefix output --aggregation_method mean --label class_body --layer 5
```

#### Required Arguments:
- `input_file`: Path to the Java source file to analyze

#### Optional Arguments:
- `--model`: Transformer model to use (default: 'bert-base-uncased')
- `--device`: Computing device to use ('cpu' or 'cuda', default: 'cpu')
- `--binary_filter`: Filter for token labeling
  - Format: "type:pattern"
  - Types: 
    - `set`: Comma-separated list (e.g., "set:public,static")
    - `re`: Regular expression pattern
- `--output_prefix`: Prefix for output files (default: 'output')
- `--aggregation_method`: Method to aggregate activations
  - Options: mean, max, sum, concat (default: mean)
- `--label`: Type of AST label to analyze
  - Options: program, class_declaration, class_body, method_declaration, etc.
- `--layer`: Specific transformer layer to analyze (0-12, default: all layers)
- `--help`: Show help message explaining the arguments.


### Available Labels
The following labels are supported for the `--label` parameter:
- program
- class_declaration 
- class_body
- method_declaration
- formal_parameters
- block
- method_invocation
- leaves


## Link to Colab notebook for tutorial and initial instructions
https://colab.research.google.com/drive/1MfTbOMrZnQ_FkC65CCJyUE4v21u5pJ4G?usp=sharing

Note: Please keep updating readme as you add code.

[![RAID-Workflow](https://colab.research.google.com/drive/165SuE7ZWAAfUBcTuH6dVlQqoMFSL7ja-)](https://docs.google.com/drawings/d/1LEqqQ_1dJ7MWrBR_2kRZF71bF8Sv6TiDTebAvaGkFX0/edit?usp=sharing)

## Features
1. Integrate static analysis tools. 
2. Generate AST nodes and labels
3. Extract layerwise and 'all' layers activations
4. Split phrasal nodes into B-I-O tokens
5. Aggregate activations
6. Lexical Patterns or features
7. Support for multiple languages
8. Code Ontologies/HPC Ontologies
9. Software engineering datasets
10. Hierarchical Properties/Structural Properties
11. Add support for autoencoders NeuroX
12. Train, validation, test splits for probing tasks support.
