Metadata-Version: 2.4
Name: gitbook-scraper
Version: 0.1.0
Summary: A CLI tool to scrape and structure GitBook documentation
Author-email: Async <cigarette@keemail.com>
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.7
Requires-Dist: beautifulsoup4>=4.9.0
Requires-Dist: click>=8.0.0
Requires-Dist: requests>=2.25.0
Requires-Dist: rich>=10.0.0
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.10.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: requests-mock>=1.11.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# GitBook Scraper

A command-line tool that scrapes a GitBook documentation site and assembles it into a single, structured Markdown file.

## Features

- 📚 Scrapes any GitBook documentation site
- 🌳 Maintains original document hierarchy and structure
- 📝 Generates a single, well-formatted markdown file
- ⚡ Fast and polite scraping with rate limiting
- 🛠️ Configurable output format and structure
- 🔄 Automatic retry on failed requests
- 📋 Table of contents generation

## Installation

```bash
pip install gitbook-scraper
```

## Quick Start

```bash
# Basic usage
gitbook-scraper https://your-gitbook-url.io

# Specify output file
gitbook-scraper https://your-gitbook-url.io -o documentation.md

# With table of contents
gitbook-scraper https://your-gitbook-url.io --toc

# Custom rate limiting
gitbook-scraper https://your-gitbook-url.io --delay 1.0
```

## Advanced Usage

### Command Line Options

```text
Options:
  -o, --output TEXT    Output file path [default: documentation.md]
  --toc                Generate table of contents [default: False]
  --delay FLOAT        Delay between requests in seconds [default: 0.5]
  --retries INTEGER    Number of retries for failed requests [default: 3]
  --timeout INTEGER    Request timeout in seconds [default: 10]
  --debug              Enable debug logging [default: False]
  --no-cleanup         Keep intermediate files [default: False]
  --help               Show this message and exit
```
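
When driving the CLI from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the options listed above; `build_command` is a hypothetical helper written for illustration and is not part of the package.

```python
import shlex
from typing import List, Optional


def build_command(
    url: str,
    output: Optional[str] = None,
    toc: bool = False,
    delay: Optional[float] = None,
    retries: Optional[int] = None,
    timeout: Optional[int] = None,
    debug: bool = False,
) -> List[str]:
    """Build an argv list for the gitbook-scraper CLI (hypothetical helper)."""
    cmd = ["gitbook-scraper", url]
    # Only pass options that were set explicitly, so the CLI defaults apply otherwise.
    if output is not None:
        cmd += ["--output", output]
    if toc:
        cmd.append("--toc")
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if retries is not None:
        cmd += ["--retries", str(retries)]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if debug:
        cmd.append("--debug")
    return cmd


if __name__ == "__main__":
    argv = build_command(
        "https://your-gitbook-url.io", output="docs.md", toc=True, delay=1.0
    )
    # Print a shell-safe rendering of the command.
    print(" ".join(shlex.quote(part) for part in argv))
```

The returned list can be passed directly to `subprocess.run(argv, check=True)` once the package is installed.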

### Python API

```python
from gitbook_scraper import GitbookScraper

scraper = GitbookScraper(
    base_url="https://your-gitbook-url.io",
    output_file="documentation.md",
    generate_toc=True,
    delay=0.5
)

scraper.scrape()
```

## Configuration

The tool can be configured using environment variables:

```bash
# Set default output directory
export GITBOOK_SCRAPER_OUTPUT_DIR="./docs"

# Set custom user agent
export GITBOOK_SCRAPER_USER_AGENT="Custom User Agent"

# Set default delay between requests (in seconds)
export GITBOOK_SCRAPER_DELAY=1.0
```
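
For scripts that wrap the scraper, the same variables can be read with fallbacks. This is a rough sketch of one way to do it, not the tool's own loading code: the `Settings` class and `load_settings` function are hypothetical, the `"."` fallback for the output directory is an assumption, and the `0.5` fallback mirrors the documented `--delay` default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    """Hypothetical container for the three documented environment variables."""
    output_dir: str
    user_agent: Optional[str]
    delay: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read GITBOOK_SCRAPER_* variables, falling back to assumed defaults."""
    try:
        delay = float(env.get("GITBOOK_SCRAPER_DELAY", "0.5"))
    except ValueError:
        delay = 0.5  # unparsable value: fall back to the documented CLI default
    return Settings(
        output_dir=env.get("GITBOOK_SCRAPER_OUTPUT_DIR", "."),  # "." is an assumption
        user_agent=env.get("GITBOOK_SCRAPER_USER_AGENT"),  # None if unset
        delay=max(delay, 0.0),  # clamp negative delays to zero
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.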

## Error Handling

The scraper automatically retries failed requests with exponential backoff. Common issues and solutions:

- Rate limiting: increase the delay between requests with `--delay`
- Timeout errors: increase the request timeout with `--timeout`
- Navigation extraction fails: try different selectors with `--selector-file`
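
To illustrate the retry behavior described above, here is a minimal sketch of retrying with exponential backoff. It is a generic illustration, not the scraper's actual implementation: `retry_with_backoff`, its parameters, and the doubling schedule are assumptions, with `retries=3` chosen to match the documented `--retries` default.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying up to `retries` times and doubling the wait each time."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A caller would wrap a single request, for example `retry_with_backoff(lambda: requests.get(url, timeout=10))`. The `sleep` parameter is injectable so the schedule can be checked in tests without actually waiting.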

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/feature`)
3. Commit your changes (`git commit -m 'Add feature'`)
4. Push to the branch (`git push origin feature/feature`)
5. Open a Pull Request

## License

Distributed under the MIT License. See `LICENSE` for more information.