Metadata-Version: 2.4
Name: web-scraper-tool
Version: 0.2.0
Summary: A web scraper tool to extract data from e-commerce websites
Home-page: https://github.com/hdd5ps/web-scraper-tool
Author: hdd5ps
Author-email: hdd5ps@virginia.edu
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4<5,>=4.12.3
Requires-Dist: python-dotenv<2,>=1.0.1
Requires-Dist: requests<3,>=2.32.3
Requires-Dist: selenium<5,>=4.25.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Web Scraper Tool

A secure, configurable web scraper for catalogue-style pages, built with `requests`, `BeautifulSoup`, and optional Selenium.

## Highlights

- ✅ HTTPS-first fetching with optional headless browser fallback
- ✅ Robots.txt enforcement and allowed-domain guardrails
- ✅ Retry-aware HTTP session with rate limiting
- ✅ Structured parsing + safe JSON/CSV persistence
- ✅ Environment-driven configuration (`.env` support)

## Prerequisites

- Python 3.10+
- Chrome/Chromium installed if using browser mode
- Matching ChromeDriver on `PATH` for Selenium runs

## Installation

```bash
pip install -r requirements.txt
# Optional dev tooling
pip install -r requirements-dev.txt
```

## Quick Start (CLI)

```bash
python -m web_scraper_tool.cli https://example.com/catalog \
  --allow-domain example.com \
  --output results.json \
  --format json
```

Key flags:

- `--use-browser` – use headless Chrome for dynamic content.
- `--allow-domain example.com` – whitelist target domains.
- `--format csv` – persist CSV instead of JSON when writing to disk.
- `--respect-robots` / `--ignore-robots` – override robots policy per run.
- `--verbose` – emit debug logging.
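
For readers curious how flags like these are commonly wired, here is a minimal `argparse` sketch. It is not the package's actual CLI source: `build_parser`, the `dest` names, the repeatable `--allow-domain`, and JSON as the default format are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed above."""
    parser = argparse.ArgumentParser(prog="web_scraper_tool.cli")
    parser.add_argument("url", help="Page to scrape")
    parser.add_argument("--use-browser", action="store_true",
                        help="Use headless Chrome for dynamic content")
    # Assumption: the flag may be repeated to allow several domains.
    parser.add_argument("--allow-domain", action="append", default=[],
                        dest="allowed_domains", metavar="DOMAIN")
    parser.add_argument("--output", help="File to write results to")
    # Assumption: JSON is the default output format.
    parser.add_argument("--format", choices=["json", "csv"], default="json")

    # Paired flags share one destination; None means "no per-run override",
    # so the environment-driven default would apply.
    robots = parser.add_mutually_exclusive_group()
    robots.add_argument("--respect-robots", dest="respect_robots",
                        action="store_const", const=True)
    robots.add_argument("--ignore-robots", dest="respect_robots",
                        action="store_const", const=False)
    parser.set_defaults(respect_robots=None)

    parser.add_argument("--verbose", action="store_true",
                        help="Emit debug logging")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Keeping the robots override as a three-state value (`True`, `False`, or `None`) lets a per-run flag take precedence without discarding the configured default when neither flag is given.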

## Programmatic Usage

```python
from web_scraper_tool import ScraperConfig, WebScraper
from web_scraper_tool.storage import save_products_to_json

config = ScraperConfig(
    allowed_domains=["example.com"],
    respect_robots=True,
)

scraper = WebScraper("https://example.com/catalog", config=config)
result = scraper.scrape()

for product in result.products:
    print(product.name, product.price)

save_products_to_json(result.products, "output/products.json")
```

## Configuration

All settings may come from environment variables (see `.env.example`).

| Variable | Description | Default |
| --- | --- | --- |
| `SCRAPER_TIMEOUT` | HTTP timeout (seconds) | `10` |
| `SCRAPER_MAX_RETRIES` | Retry attempts for transient errors | `3` |
| `SCRAPER_BACKOFF` | Exponential backoff factor | `0.5` |
| `SCRAPER_DELAY` | Delay between requests (seconds) | `1` |
| `SCRAPER_REQUIRE_HTTPS` | Require HTTPS URLs | `1` |
| `SCRAPER_RESPECT_ROBOTS` | Enforce robots.txt | `1` |
| `SCRAPER_ALLOWED_DOMAINS` | Comma-separated domain whitelist | *(unset)* |
| `SCRAPER_HEADLESS` | Run headless browser | `1` |
| `SCRAPER_USE_BROWSER` | Prefer Selenium over HTTP | `0` |

Settings are loaded through [python-dotenv](https://github.com/theskumar/python-dotenv) when a `.env` file is present.
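
To illustrate how the variables in the table could map onto typed settings, here is a minimal standard-library sketch. `EnvSettings`, `load_env_settings`, and the set of accepted truthy strings are assumptions for this example, not the package's actual `ScraperConfig` API. In a real run, python-dotenv's `load_dotenv()` would populate `os.environ` from `.env` before a function like this is called.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these strings mean "enabled"; anything else means "disabled".
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical container; defaults follow the table above."""
    timeout: float
    max_retries: int
    backoff: float
    delay: float
    require_https: bool
    respect_robots: bool
    allowed_domains: tuple
    headless: bool
    use_browser: bool


def load_env_settings(env: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Read SCRAPER_* variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    raw_domains = env.get("SCRAPER_ALLOWED_DOMAINS", "")
    domains = tuple(
        part.strip().lower() for part in raw_domains.split(",") if part.strip()
    )
    return EnvSettings(
        # float()/int() raise ValueError on non-numeric input, so a bad
        # value fails at load time rather than in the middle of a run.
        timeout=float(env.get("SCRAPER_TIMEOUT", "10")),
        max_retries=int(env.get("SCRAPER_MAX_RETRIES", "3")),
        backoff=float(env.get("SCRAPER_BACKOFF", "0.5")),
        delay=float(env.get("SCRAPER_DELAY", "1")),
        require_https=_as_bool(env.get("SCRAPER_REQUIRE_HTTPS", "1")),
        respect_robots=_as_bool(env.get("SCRAPER_RESPECT_ROBOTS", "1")),
        allowed_domains=domains,
        headless=_as_bool(env.get("SCRAPER_HEADLESS", "1")),
        use_browser=_as_bool(env.get("SCRAPER_USE_BROWSER", "0")),
    )


if __name__ == "__main__":
    print(load_env_settings())
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary instead of mutating `os.environ`.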

## Security Checklist

- Enforce HTTPS (`SCRAPER_REQUIRE_HTTPS=1`) unless explicit permission allows HTTP.
- Maintain a domain allowlist to prevent SSRF-style misuse.
- Honor robots.txt (default) unless you have legal clearance to opt out.
- Rate limit requests (`SCRAPER_DELAY`) to avoid overloading target servers and triggering bans.
- Keep credentials and proxies in environment variables.
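
The first three checklist items can be sketched with the standard library alone. The helper names `check_url` and `robots_allows` are hypothetical, and the policy choices (subdomains of an allowed domain are accepted, an empty allowlist rejects everything) are assumptions for this example; the tool's own guardrails may behave differently.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def check_url(url: str, allowed_domains, require_https: bool = True) -> str:
    """Return the normalised host if the URL passes the guardrails."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if require_https and parts.scheme != "https":
        raise ValueError("HTTPS is required")
    # Match the exact domain or a true subdomain. A plain suffix test would
    # wrongly accept look-alikes such as "evil-example.com".
    allowed = any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )
    if not allowed:
        raise ValueError(f"host not in allowlist: {host!r}")
    return host


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(check_url("https://shop.example.com/catalog", ["example.com"]))
    print(robots_allows(rules, "demo-bot", "https://example.com/catalog"))
```

Validating the parsed hostname rather than the raw URL string avoids being fooled by credentials or paths that merely contain an allowed domain name.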

## Running Tests

```bash
pytest
```

## Docker Usage

Build the image (once):

```bash
docker build -t web-scraper-tool:latest .
```

Run the CLI inside the container:

```bash
docker run --rm \
  --env-file .env \
  --env SCRAPER_USE_BROWSER=0 \
  web-scraper-tool:latest \
  https://example.com/catalog \
  --allow-domain example.com \
  --output /app/output/results.json
```

Retrieve generated files by mounting a host volume:

```bash
docker run --rm \
  --env-file .env \
  -v "$(pwd)/output:/app/output" \
  --env SCRAPER_USE_BROWSER=0 \
  web-scraper-tool:latest \
  https://example.com/catalog \
  --allow-domain example.com \
  --output /app/output/results.json
```

## Local Demo Catalog

To avoid rate limits and showcase the scraper safely, a static catalogue lives in `demo/catalog.html`.

1. Serve the page locally:

   ```bash
   cd demo
   python -m http.server 8000
   ```

2. In another terminal, run the scraper (native Python):

   ```bash
   python -m web_scraper_tool.cli "http://localhost:8000/catalog.html" \
     --allow-domain localhost \
     --require-http \
     --output output/demo.json
   ```

3. Or via Docker (Linux hosts) using host networking:

   ```bash
   docker run --rm \
     --network host \
     -v "$(pwd)/output:/app/output" \
     --env SCRAPER_USE_BROWSER=0 \
     web-scraper-tool:latest \
     "http://localhost:8000/catalog.html" \
     --allow-domain localhost \
     --require-http \
     --output /app/output/demo.json
   ```

> Tip: The Docker image omits Chromium, so keep `SCRAPER_USE_BROWSER=0` (also the default for native runs) to stay on the requests-based path. When tuning `SCRAPER_DELAY`, provide a numeric value in seconds so the rate limiter can parse it.

> **Note:** Real-world e-commerce sites often block automated scraping with CAPTCHAs or 503 errors. Demo with the provided catalogue or any site that explicitly permits scraping.

## Troubleshooting

- In Selenium mode, the ChromeDriver version must match the installed Chrome version.
- Use `--verbose` to surface HTTP status codes and retry behaviour.
- Respect the target website's terms of service.

## License

MIT
