# 16. OCR Standalone Service

> Document AI OCR for converting scanned PDFs/images to editable text.

**Status:** ✅ Core implemented, 🔄 Enhancements in progress

---

## Overview

Shamra Academia offers a standalone OCR service powered by Mistral Document AI. Users upload scanned PDFs or images and receive editable Markdown or DOCX output, with support for Arabic text.

---

## ✅ Completed Features

### Core OCR Pipeline (Feb 2026)
- **Upload & Processing**: Users upload PDF/image → async job created → Mistral OCR API called
- **Output Formats**: Markdown (`.md`), Word (`.docx`), PDF download
- **Image Extraction**: Embedded images extracted and saved separately
- **Email Notifications**: Success/failure emails sent on job completion
- **Job Management**: List jobs, check status, delete completed jobs

### Chunked Processing for Large PDFs (Mar 10, 2026)
- **Problem**: The Mistral API enforces a hard limit of 30 pages per request
- **Solution**: PDFs longer than 30 pages are automatically split into chunks of at most 30 pages
- **Implementation**:
  - Uses `pdfseparate` + `pdfunite` (poppler-utils) for splitting
  - Each chunk processed separately, results merged
  - Page markers added: `*[صفحة 1-30]*`, `*[صفحة 31-60]*`, etc.
- **Max pages**: Increased from 200 to 500 pages
- **Files**:
  - [MistralOcrService.php](../src/Service/Playground/MistralOcrService.php) - `extractWithImagesChunked()`, `splitPdfRange()`
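
The splitting and merging described above can be sketched as follows. This is an illustrative Python sketch, not the PHP code in `extractWithImagesChunked()`. The function names, the placement of each marker before its chunk's text, and the marker format for a short final chunk are assumptions.

```python
from typing import List, Tuple

CHUNK_SIZE = 30   # per-request page limit stated above
MAX_PAGES = 500   # current upper bound stated above

PageRange = Tuple[int, int]


def chunk_ranges(total_pages: int, chunk_size: int = CHUNK_SIZE) -> List[PageRange]:
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if total_pages > MAX_PAGES:
        raise ValueError(f"document exceeds {MAX_PAGES} pages")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def page_marker(first: int, last: int) -> str:
    """Format the page marker for one chunk (format taken from the list above)."""
    return f"*[صفحة {first}-{last}]*"


def merge_chunks(chunks: List[Tuple[PageRange, str]]) -> str:
    """Join per-chunk Markdown in page order, each preceded by its marker (assumed placement)."""
    parts: List[str] = []
    for (first, last), text in sorted(chunks, key=lambda c: c[0][0]):
        parts.append(page_marker(first, last))
        parts.append(text.strip())
    return "\n\n".join(parts)
```

For the 174-page PDF mentioned in the history, `chunk_ranges(174)` yields six chunks, the last covering pages 151-174.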

### Credit Overconsumption Prevention (Mar 10, 2026)
- **Problem**: Race condition: multiple jobs could pass the credit check before any credits were deducted
- **Solution**: Credits are deducted immediately at job creation and refunded on failure
- **Flow**:
  ```
  Upload → Check credits → DEDUCT credits → Create job → Process
                                              ↓
                                         On failure → REFUND credits
  ```
- **Files**:
  - [OcrJobService.php](../src/Service/OcrJobService.php) - Step 8: Reserve credits
  - [ProcessOcrJobHandler.php](../src/MessageHandler/ProcessOcrJobHandler.php) - Refund on failure
  - [PlaygroundSubscription.php](../src/Entity/PlaygroundSubscription.php) - `refundCredits()` method
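
The reserve-then-refund flow can be sketched in a few lines. This is a minimal in-memory Python illustration, not the PHP implementation: the `Subscription` class, its `reserve()`/`refund()` methods, and the `threading.Lock` are hypothetical stand-ins. A database-backed version would presumably need an equivalent atomic check-and-deduct step (for example a row lock or a conditional update).

```python
import threading
from typing import Callable


class InsufficientCredits(Exception):
    """Raised when a job costs more than the credits currently available."""


class Subscription:
    """In-memory stand-in for a subscription record (hypothetical fields)."""

    def __init__(self, credits: int) -> None:
        self.credits = credits
        self._lock = threading.Lock()

    def reserve(self, amount: int) -> None:
        # Check and deduct under one lock so two concurrent uploads
        # cannot both pass the check before either is charged.
        with self._lock:
            if self.credits < amount:
                raise InsufficientCredits(f"need {amount}, have {self.credits}")
            self.credits -= amount

    def refund(self, amount: int) -> None:
        with self._lock:
            self.credits += amount


def run_job(sub: Subscription, cost: int, process: Callable[[], None]) -> bool:
    """Reserve credits up front, run the job, and refund if processing fails."""
    sub.reserve(cost)  # raises before any work starts if credits are short
    try:
        process()
        return True
    except Exception:
        sub.refund(cost)
        return False
```

The key point is ordering: the deduction happens before the job is queued, so a second upload sees the reduced balance even while the first job is still running.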

### Admin Dashboard OCR Stats
- **Route**: `/jim19ud83/playground/dashboard` → OCR Stats tab
- **API**: `/jim19ud83/playground/api/ocr-stats`
- **Metrics**: Total jobs, success rate, pages processed, credits used

---

## 🔄 In Progress / Planned

### Credit Purchase & Plan Upgrade
**Status**: Not implemented

Users currently have no self-service way to:
- Upgrade from starter → researcher → professional tier
- Purchase additional credits

**Proposed Flow (Stripe)**:
1. User clicks "Upgrade" on `/playground` or when credits are depleted
2. `/playground/upgrade` shows pricing table with tiers
3. User selects tier → Stripe Checkout Session created
4. Stripe hosted checkout (card, Apple Pay, Google Pay)
5. Webhook `checkout.session.completed` → Update subscription tier
6. Recurring monthly billing handled by Stripe

**Required**:
- [ ] Stripe account + API keys
- [ ] Create products/prices in Stripe dashboard
- [ ] `UpgradeController` with pricing page
- [ ] Stripe Checkout session creation
- [ ] Webhook handler `/webhook/stripe`
- [ ] DB columns: `stripe_customer_id`, `stripe_subscription_id`
- [ ] Customer portal for manage/cancel

**Estimate**: 4-6 hours
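
The webhook handler in step 5 must confirm that a request really comes from Stripe before updating a tier. The sketch below shows the signature check that Stripe documents for webhooks (HMAC-SHA256 over `timestamp.payload`, compared against the `v1` value in the `Stripe-Signature` header), written with the Python standard library for illustration only. The function name, the `now` parameter, and the 300-second tolerance default are assumptions; a real handler would normally delegate this check to the official Stripe library.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True if the header carries a valid, recent v1 signature for payload."""
    pairs = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in pairs if k.strip() == "t"), None)
    signatures = [v for k, v in pairs if k.strip() == "v1"]
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False

    # Reject old events to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # Stripe signs "<timestamp>.<raw request body>" with the endpoint secret.
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(
        hmac.compare_digest(expected.encode(), s.encode()) for s in signatures
    )
```

The check has to run on the raw request body, before any JSON parsing, because re-serializing the payload changes the bytes that were signed.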

---

## 📊 Credit System

### Pricing
| Tier | Price/mo | Credits/mo | Max Pages/File |
|------|----------|------------|----------------|
| trial | $0 | 100 | 30 |
| starter | $9 | 500 | 100 |
| researcher | $19 | 1,500 | 200 |
| professional | $39 | 4,000 | 500 |
| institution | $99 | 15,000 | 500 |

### OCR Costs
- **2 credits per page** (minimum 5 credits per job)
- Shamra cost: ~$0.005/page (Mistral API)
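
As a worked example of the rule above, the following Python sketch computes a job's cost and checks it against a tier's page limit. It reads "minimum 5 credits per job" as a floor on the per-page total, which is an assumption; the function names are hypothetical, and the page limits are copied from the pricing table.

```python
CREDITS_PER_PAGE = 2
MIN_CREDITS_PER_JOB = 5

# Max pages per file, copied from the pricing table.
MAX_PAGES_BY_TIER = {
    "trial": 30,
    "starter": 100,
    "researcher": 200,
    "professional": 500,
    "institution": 500,
}


def job_cost(pages: int) -> int:
    """Credits charged for one job: 2 per page, with a floor of 5 (assumed reading)."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return max(MIN_CREDITS_PER_JOB, pages * CREDITS_PER_PAGE)


def can_submit(tier: str, pages: int, available_credits: int) -> bool:
    """True if the file fits the tier's page limit and the user can afford it."""
    return pages <= MAX_PAGES_BY_TIER[tier] and job_cost(pages) <= available_credits
```

Under this reading, a 1-page job costs 5 credits and a 174-page PDF costs 348 credits.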

### Manual Credit Addition (Admin)
```sql
-- Add 1000 bonus credits to user
UPDATE playground_subscription 
SET bonus_credits = bonus_credits + 1000, updated_at = NOW() 
WHERE user_id = (SELECT id FROM fos_user WHERE username = 'USERNAME');
```

---

## 🗂️ File Locations

| Component | Path |
|-----------|------|
| Controller | `src/Controller/OcrController.php` |
| Job Service | `src/Service/OcrJobService.php` |
| OCR Service | `src/Service/Playground/MistralOcrService.php` |
| Message Handler | `src/MessageHandler/ProcessOcrJobHandler.php` |
| Entity | `src/Entity/OcrJob.php` |
| Template | `templates/ocr/index.html.twig` |
| Translations | `translations/Ocr.ar.yml`, `translations/Ocr.en.yml` |
| Email Templates | `templates/emails/ocr_complete.html.twig`, `ocr_failed.html.twig` |
| Uploads | `/public/uploads/ocr/{job_hash}/` |

---

## 🐛 Known Issues & Debugging

### Retrieving Failed Job Details
```bash
# Check error message in DB
mysql academia_v2_prod2 -e "SELECT id, status, error_message FROM ocr_job WHERE status='failed' ORDER BY created_at DESC LIMIT 5"

# Check logs
grep -i 'ocr.*job' /var/www/html/academia_v2/var/log/prod.log | tail -50
```

### Re-processing a Failed Job
```bash
# Reset the job to pending (replace JOB_ID with the numeric job id)
mysql academia_v2_prod2 -e "UPDATE ocr_job SET status='pending', error_message=NULL, completed_at=NULL WHERE id=JOB_ID"

# Dispatch via script (run on the server; env vars are loaded from .env.local)
cd /var/www/html/academia_v2 && source .env.local && \
  export DATABASE_URL MESSENGER_TRANSPORT_DSN LOCK_DSN MAILER_DSN && \
  sudo -E -u www-data php /tmp/dispatch_ocr.php
```

### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `document_parser_too_many_pages` | PDF has more than 30 pages (code before chunking) | Fixed by chunked processing |
| `Insufficient credits` | User out of credits | Add bonus credits or upgrade plan |
| `Original file not found` | File deleted before processing | Re-upload |

---

## 📅 History

- **Feb 2026**: Initial OCR service implementation
- **Mar 9, 2026**: A user reported that a 174-page PDF failed
- **Mar 10, 2026**: Implemented chunked processing and credit overconsumption prevention
