# Shamra OCR — Complete Rewrite Plan

## Executive Summary

Replace the legacy OCR service (DataLab.to Marker API + standalone credit system) with an asynchronous, Mistral-powered OCR pipeline unified under the Playground subscription credit system. Jobs run in the background via Symfony Messenger, users get a simpler upload-and-download flow, and the admin dashboard gains OCR monitoring.

---

## 1. What Gets Removed (Legacy Cleanup)

| File | Reason |
|------|--------|
| `src/Controller/ShamraOcrController.php` | Entire controller — bad architecture, DataLab API, separate credit flow |
| `src/Service/ShamraOcrService.php` | DataLab.to integration — replaced by Mistral OCR |
| `src/Entity/ShamraOcrFiles.php` | Old entity — replaced by `OcrJob` |
| `src/Form/ShamraOcrFormType.php` | Legacy Symfony form — replaced by AJAX upload |
| `src/Repository/ShamraOcrFilesRepository.php` | Old repository |
| `templates/shamra_ocr/fileProcessor.html.twig` | Old UI |
| `templates/shamra_ocr/addCredit.html.twig` | Standalone credit page — credits now via Playground |
| `templates/shamra_ocr/shamraOcr_list_breif.html.twig` | Old AJAX fragment |
| `src/Resources/public/css/shamra_ocr.css` | Old CSS |
| `translations/ShamraOcr.en.yml` | Old translations (will create new ones) |
| `translations/ShamraOcr.ar.yml` | Old translations |

**Keep intact:**
- `src/Service/Playground/MistralOcrService.php` — We extend and reuse this
- `src/Message/ProcessReferenceOcr.php` + handler — Unrelated (reference library OCR)
- `User.shamraOcrCredit` field — Don't drop column yet (data migration later)
- `CreditTransaction` entity — Historical data, don't delete

---

## 2. New Architecture

### 2.1 Entity: `OcrJob`

```
Table: ocr_job
─────────────────────────────────────────
id               INT AUTO_INCREMENT PK
user_id          INT FK → user(id)
status           VARCHAR(20) — pending | processing | completed | failed
file_name        VARCHAR(255) — original upload filename
original_file    VARCHAR(255) — path to uploaded PDF/image
page_count       INT — detected page count
credits_charged  INT — credits deducted
output_markdown  LONGTEXT (nullable) — extracted markdown content
output_images    JSON (nullable) — array of extracted image paths
error_message    TEXT (nullable) — error details if failed
slug             VARCHAR(64) UNIQUE — for public URLs
created_at       DATETIME
updated_at       DATETIME
completed_at     DATETIME (nullable)
─────────────────────────────────────────
Indexes: idx_user_status (user_id, status), idx_slug (slug), idx_created (created_at)
```
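
A minimal sketch of the two `OcrJob` fields that carry the most logic, assuming PHP 8.1 or newer. The `OcrJobStatus` enum and the `generateOcrSlug()` helper are hypothetical names used for illustration, not existing project code. The slug is built from 16 random bytes (32 hex characters), which fits the `VARCHAR(64)` column.

```php
// Hypothetical backed enum mirroring the four values of ocr_job.status.
enum OcrJobStatus: string
{
    case Pending = 'pending';
    case Processing = 'processing';
    case Completed = 'completed';
    case Failed = 'failed';

    // A terminal job will not change state again, so polling can stop.
    public function isTerminal(): bool
    {
        return $this === self::Completed || $this === self::Failed;
    }
}

// Hypothetical helper: unguessable slug for public URLs (32 hex characters).
function generateOcrSlug(): string
{
    return bin2hex(random_bytes(16));
}
```

Storing `$status->value` keeps the column a plain string, so the schema above would not need to change.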

### 2.2 Messenger Message: `ProcessOcrJob`

```php
final class ProcessOcrJob
{
    // Public readonly so the handler can read the job ID (requires PHP 8.1+).
    public function __construct(public readonly int $jobId) {}
}
```

Routed to `async` transport in `messenger.yaml`.

### 2.3 Message Handler: `ProcessOcrJobHandler`

Flow:
1. Load `OcrJob` by ID → validate status is `pending`
2. Set status to `processing`
3. Check user credits via `UsageMonitorService`
4. Call `MistralOcrService::extractWithImages()` (see 2.6), which sends `include_image_base64: true`
5. Parse response — extract markdown text + images
6. Save images to disk (`uploads/ocr/{slug}/images/img_001.png`, etc.)
7. Replace base64 image refs in markdown with relative paths
8. Store markdown in `OcrJob.outputMarkdown`, image paths in `OcrJob.outputImages`
9. Deduct credits from user's `PlaygroundSubscription`
10. Log usage in `PlaygroundUsageLog` (operation: `ocr_standalone`)
11. Set status to `completed`, set `completedAt`
12. Send email notification to user with download link
13. On failure → set status to `failed`, store the error, do NOT deduct credits, and send the failure email (see 5.1)
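
The flow above can be sketched with in-memory stand-ins. This is only an illustration of the ordering (check credits, run OCR, charge on success, never charge on failure). The function name, the array shape of `$job`, and the `$runOcr` callable are assumptions, and persistence, image handling, and email are left out.

```php
// Sketch of the handler flow with in-memory stand-ins (all names hypothetical).
// $job is an array with 'status' and 'pageCount'; $runOcr wraps the Mistral call.
function handleOcrJob(array $job, int $availableCredits, callable $runOcr): array
{
    // Step 1: only pending jobs are processed; anything else is left untouched.
    if ($job['status'] !== 'pending') {
        return ['job' => $job, 'credits' => $availableCredits];
    }
    $job['status'] = 'processing'; // Step 2

    $cost = max(5, $job['pageCount'] * 2);
    try {
        // Step 3: verify the balance before paying for the OCR call.
        if ($availableCredits < $cost) {
            throw new RuntimeException('Insufficient credits');
        }
        // Steps 4-8: run OCR and keep the extracted markdown.
        $job['outputMarkdown'] = $runOcr($job);
        // Steps 9-11: charge only after a successful extraction.
        $availableCredits -= $cost;
        $job['creditsCharged'] = $cost;
        $job['status'] = 'completed';
    } catch (Throwable $e) {
        // Step 13: record the failure and leave the balance unchanged.
        $job['status'] = 'failed';
        $job['errorMessage'] = $e->getMessage();
        $job['creditsCharged'] = 0;
    }

    return ['job' => $job, 'credits' => $availableCredits];
}
```

In the real handler the same ordering would likely sit inside a database transaction, so a crash between the OCR call and the deduction cannot leave a job marked `completed` without a charge.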

### 2.4 Service: `OcrJobService`

Responsibilities:
- `createJob(User, UploadedFile): OcrJob` — validate file, count pages, estimate cost, persist, dispatch message
- `estimateCost(int $pageCount): int` — calculate credit cost
- `generateMarkdownFile(OcrJob): string` — write `.md` to disk
- `generateDocxFile(OcrJob): string` — convert markdown+images to DOCX via Pandoc
- `generatePdfFile(OcrJob): string` — convert markdown+images to PDF via Pandoc
- `getDownloadPath(OcrJob, string $format): string` — resolve file path for download
- `deleteJob(OcrJob): void` — remove files from disk + DB
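
For the two Pandoc-backed generators, a sketch of how the command line might be assembled. `buildPandocCommand()` is a hypothetical helper, and the file names follow the layout in section 9. The flags used (`--from`, `--resource-path`, `--output`, `--pdf-engine`) are standard Pandoc options; the choice of XeLaTeX as the PDF engine is an assumption about what is installed on the server.

```php
// Hypothetical helper: builds the Pandoc argument list for DOCX or PDF output.
// Returning an array (not a string) avoids shell-quoting problems.
function buildPandocCommand(string $jobDir, string $format): array
{
    if (!in_array($format, ['docx', 'pdf'], true)) {
        throw new InvalidArgumentException("Unsupported format: $format");
    }

    $command = [
        'pandoc',
        $jobDir . '/output.md',
        '--from=markdown',
        // Lets Pandoc resolve relative image paths such as images/img_001.png.
        '--resource-path=' . $jobDir,
        '--output=' . $jobDir . '/output.' . $format,
    ];

    if ($format === 'pdf') {
        // Assumption: XeLaTeX is installed; it handles Unicode input.
        $command[] = '--pdf-engine=xelatex';
    }

    return $command;
}
```

The returned array can be handed to a process runner such as Symfony's `Process` class. Arabic or other right-to-left documents may need extra font settings for PDF output, which this sketch does not cover.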

### 2.5 Controller: `OcrController`

| Method | Route | Name | Description |
|--------|-------|------|-------------|
| GET | `/ocr` | `app_ocr_index` | Landing page (modern UI) |
| POST | `/ocr/upload` | `app_ocr_upload` | Upload file, create job, return JSON |
| GET | `/ocr/jobs` | `app_ocr_jobs` | List user's jobs (JSON) |
| GET | `/ocr/jobs/{slug}/status` | `app_ocr_job_status` | Poll single job status (JSON) |
| GET | `/ocr/jobs/{slug}/download/{format}` | `app_ocr_download` | Download output (md/docx/pdf) |
| DELETE | `/ocr/jobs/{slug}` | `app_ocr_delete` | Delete job + files |

### 2.6 Extended: `MistralOcrService`

Changes to existing service:
- Add `include_image_base64: true` option parameter
- Increase `MAX_PAGES` from 30 to 200 for standalone OCR
- Add `extractWithImages(string $filePath): array` method returning `['text' => ..., 'images' => [...], 'pages' => int]`
- Parse image data from Mistral response and return separately

### 2.7 Admin Dashboard Extension

Add OCR monitoring to `PlaygroundAdminController`:
- New API endpoint: `GET /jim19ud83/playground/api/ocr-stats` → `admin_playground_api_ocr_stats`
- Returns: total jobs, jobs by status, pages processed today/week/month, credit revenue from OCR, avg processing time, top OCR users, error rate
- Add OCR section to `dashboard.html.twig` with charts
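
A rough sketch of how some of these figures could be derived, working on plain arrays to stay independent of the database layer. `summarizeOcrJobs()` and the row shape are hypothetical; a real implementation would more likely use aggregate queries in `OcrJobRepository`.

```php
// Hypothetical aggregation over job rows (arrays with 'status', 'createdAt', 'completedAt').
// Timestamps are Unix seconds here to keep the sketch free of database code.
function summarizeOcrJobs(array $jobs): array
{
    $byStatus = ['pending' => 0, 'processing' => 0, 'completed' => 0, 'failed' => 0];
    $durations = [];

    foreach ($jobs as $job) {
        $byStatus[$job['status']]++;
        if ($job['status'] === 'completed') {
            // Measured from creation, so the figure includes time spent in the queue.
            $durations[] = $job['completedAt'] - $job['createdAt'];
        }
    }

    $finished = $byStatus['completed'] + $byStatus['failed'];

    return [
        'total' => count($jobs),
        'byStatus' => $byStatus,
        // Error rate is measured against finished jobs only.
        'errorRate' => $finished > 0 ? $byStatus['failed'] / $finished : 0.0,
        'avgProcessingSeconds' => $durations ? array_sum($durations) / count($durations) : 0.0,
    ];
}
```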

---

## 3. Credit System Design

### 3.1 Pricing Model

**Cost basis:** Mistral OCR = $0.005/page ($5 per 1,000 pages)

**Credit formula:** `max(5, pageCount * 2)` credits per job

| Pages | Credits | Starter ($0.018/cr) | Cost to Shamra | Margin |
|-------|---------|---------------------|---------------|--------|
| 1–2 | 5 | $0.09 | $0.01 | 89% |
| 5 | 10 | $0.18 | $0.025 | 86% |
| 30 | 60 | $1.08 | $0.15 | 86% |
| 100 | 200 | $3.60 | $0.50 | 86% |
| 200 | 400 | $7.20 | $1.00 | 86% |
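
The credit formula translates directly into code. The sketch below is one possible shape for `OcrJobService::estimateCost()`; the constant names are illustrative, and the rejection of page counts below 1 is an added assumption.

```php
// Sketch of estimateCost() using the per-page rate and minimum from this section.
const OCR_CREDITS_PER_PAGE = 2;
const OCR_MIN_CREDITS_PER_JOB = 5;

function estimateCost(int $pageCount): int
{
    if ($pageCount < 1) {
        throw new InvalidArgumentException('A job needs at least one page.');
    }

    // The minimum only matters for 1- and 2-page jobs (2 and 4 credits otherwise).
    return max(OCR_MIN_CREDITS_PER_JOB, $pageCount * OCR_CREDITS_PER_PAGE);
}
```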

**Cross-tier margins (at 100-page job = 200 credits):**

| Tier | $/credit | Revenue | Cost | Margin |
|------|----------|---------|------|--------|
| Trial (100 cr free) | — | $0 | $0.25 | subsidy (capped at 50 pages, so a 100-page job is not possible) |
| Starter ($9/500) | $0.018 | $3.60 | $0.50 | 86% |
| Researcher ($19/1500) | $0.0127 | $2.53 | $0.50 | 80% |
| Professional ($39/4000) | $0.00975 | $1.95 | $0.50 | 74% |
| Institution ($99/15000) | $0.0066 | $1.32 | $0.50 | 62% |
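
The margins in both tables follow from one calculation: revenue is credits times the plan's price per credit, and cost is pages times $0.005. A small helper (hypothetical name) makes the arithmetic easy to recheck if prices change.

```php
// Worked margin check: revenue = credits * (plan price / plan credits), cost = pages * $0.005.
function ocrMarginPercent(int $pages, float $planPrice, int $planCredits): float
{
    $credits = max(5, $pages * 2);
    $revenue = $credits * ($planPrice / $planCredits);
    $cost = $pages * 0.005;

    return ($revenue - $cost) / $revenue * 100;
}
```

For example, `ocrMarginPercent(100, 9.0, 500)` covers the Starter row and `ocrMarginPercent(100, 99.0, 15000)` the Institution row.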

### 3.2 Trial Users

- Trial tier: 100 one-time credits, 14 days
- OCR available to trial users (same pricing: 2 credits/page, min 5)
- Max trial OCR: 50 pages total (100 credits ÷ 2 credits/page)
- **Limit per job for trial**: 20 pages (to prevent burning all credits on one file)

### 3.3 Limits

| Parameter | Value |
|-----------|-------|
| Max file size | 50 MB |
| Max pages per job | 200 (trial: 20) |
| Supported formats | PDF, PNG, JPG, JPEG, TIFF, BMP |
| Min credits per job | 5 |
| Credits per page | 2 |
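
These limits could be enforced with a check along the following lines. `validateOcrUpload()` and its error codes are hypothetical; the extension test is only a first filter, and a server-side MIME type check would still be needed.

```php
// Hypothetical pre-upload validation mirroring the limits table.
function validateOcrUpload(string $fileName, int $sizeBytes, int $pageCount, bool $isTrial): array
{
    $errors = [];
    $allowed = ['pdf', 'png', 'jpg', 'jpeg', 'tiff', 'bmp'];
    $extension = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));

    if (!in_array($extension, $allowed, true)) {
        $errors[] = 'unsupported_format';
    }
    if ($sizeBytes > 50 * 1024 * 1024) {
        $errors[] = 'file_too_large';
    }
    // Trial users get the tighter per-job page limit.
    $maxPages = $isTrial ? 20 : 200;
    if ($pageCount < 1 || $pageCount > $maxPages) {
        $errors[] = 'invalid_page_count';
    }

    return $errors;
}
```

An empty array means the upload passed every check; returning all errors at once lets the UI show them together.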

### 3.4 Operation Registration

Add `ocr_standalone` to `UsageMonitorService::OPERATION_CREDITS`:
```php
'ocr_standalone' => 0, // Dynamic — calculated per job, not flat rate
```

Credit deduction is handled manually (dynamic amount) rather than via the standard flat `deductCredits()`. We'll add a new method `deductDynamicCredits(User, int $amount, string $operation)`.
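
A sketch of the behaviour `deductDynamicCredits()` might have, using an in-memory ledger instead of the real entities. The `CreditLedger` class is hypothetical, and the `User` argument from the planned signature is omitted because the balance lives on the object here.

```php
// In-memory stand-in for the subscription balance; the real method would
// update PlaygroundSubscription and write a PlaygroundUsageLog row.
final class CreditLedger
{
    /** @var list<array{operation: string, amount: int}> */
    public array $log = [];

    public function __construct(private int $balance) {}

    public function deductDynamicCredits(int $amount, string $operation): int
    {
        if ($amount <= 0) {
            throw new InvalidArgumentException('Amount must be positive.');
        }
        if ($amount > $this->balance) {
            throw new RuntimeException('Insufficient credits.');
        }

        $this->balance -= $amount;
        $this->log[] = ['operation' => $operation, 'amount' => $amount];

        // Return the remaining balance so callers can show it to the user.
        return $this->balance;
    }
}
```

Rejecting the deduction outright, without ever letting the balance go negative, keeps the ledger consistent even if two jobs finish at nearly the same time, provided the real implementation performs the check and update atomically.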

---

## 4. Image Handling Strategy

### 4.1 Extraction

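Call Mistral OCR with `include_image_base64: true`. In the raw API response, each page carries its own `markdown` plus an `images` array whose entries hold an `id` and an `image_base64` payload, and the markdown refers to each image by `id` (for example `![img-0.jpeg](img-0.jpeg)`). The storage step below expects the payloads inlined as `![image](data:image/png;base64,...)`, so `extractWithImages()` should either inline them first or return them separately, as section 2.6 describes.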

### 4.2 Storage

1. Parse markdown for base64 image patterns: `!\[.*?\]\(data:image\/(png|jpeg|jpg);base64,([A-Za-z0-9+/=]+)\)`
2. Save each image to: `uploads/ocr/{slug}/images/img_{n}.{ext}` (`{n}` zero-padded to three digits, e.g. `img_001.png`)
3. Replace base64 references in markdown with relative paths: `![image](images/img_001.png)`
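
The three steps can be sketched as a single pass over the markdown. `extractInlineImages()` is a hypothetical helper that returns the decoded bytes for the caller to write to disk. It assumes the images are already inlined as data URIs, and it keeps the original alt text instead of replacing it with `image`.

```php
// Sketch: pull inline data-URI images out of markdown and swap in relative paths.
// Returns the rewritten markdown plus the decoded image bytes keyed by path.
function extractInlineImages(string $markdown): array
{
    $images = [];
    $pattern = '~!\[(.*?)\]\(data:image/(png|jpeg|jpg);base64,([A-Za-z0-9+/=]+)\)~';

    $rewritten = preg_replace_callback($pattern, function (array $m) use (&$images): string {
        $bytes = base64_decode($m[3], true);
        if ($bytes === false) {
            return $m[0]; // Leave undecodable payloads untouched.
        }
        $extension = $m[2] === 'jpeg' ? 'jpg' : $m[2];
        $path = sprintf('images/img_%03d.%s', count($images) + 1, $extension);
        $images[$path] = $bytes;

        return sprintf('![%s](%s)', $m[1], $path);
    }, $markdown);

    if ($rewritten === null) {
        // PCRE can fail on very large inputs; surface that instead of losing text.
        throw new RuntimeException('Image extraction failed: ' . preg_last_error_msg());
    }

    return ['markdown' => $rewritten, 'images' => $images];
}
```

Each entry in `images` would then be written under `uploads/ocr/{slug}/`, and its key stored in `OcrJob.outputImages`.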

### 4.3 Output Format Handling

| Format | Image Handling |
|--------|---------------|
| **Markdown** | Images referenced as relative paths; served in a ZIP with images folder |
| **DOCX** | Pandoc embeds images from referenced paths automatically |
| **PDF** | Pandoc embeds images from referenced paths automatically |

When downloading Markdown, serve as a ZIP containing:
```
document.md
images/
  img_001.png
  img_002.png
```

---

## 5. Email Notification

### 5.1 When to Send

- **Job completed** → email with download links (md, docx, pdf)
- **Job failed** → email informing user of failure with reason

### 5.2 Template

Create `templates/emails/ocr_complete.html.twig` (plus a matching `ocr_failed.html.twig` for the failure email):
- Reuse the existing email base layout (purple gradient, RTL support)
- Subject (en): "Your document is ready — Shamra OCR"
- Subject (ar): "مستندك جاهز — شمرا OCR"
- Body: file name, page count, processing time, download buttons for each format
- Use `TemplatedEmail` with Symfony Mailer

---

## 6. Frontend UX Design

### 6.1 Page: `/ocr`

Modern, clean, single-page experience:

1. **Hero section**: Title, description, supported formats badge
2. **Upload zone**: Drag-and-drop area with click-to-browse fallback
3. **Pre-upload info**: once a file is selected, show the filename, page count (client-side estimate), estimated credit cost, and the user's available credits
4. **Processing state**: Progress indicator with status polling (every 3s)
5. **History section**: Table/cards of past jobs with status badges, download buttons, delete action
6. **Credit info bar**: Current credits, subscription tier, upgrade CTA if low

### 6.2 Tech Stack

- Twig template extending `base.html.twig`
- Vanilla JS + Fetch API (no Vue — keep it simple for a standalone page)
- CSS in the template or a new `ocr.css` asset
- Responsive design, RTL-ready

---

## 7. Implementation Plan (Execution Order)

### Phase 1: Cleanup (Remove Legacy)
- [ ] Delete legacy files (controller, service, entity, form, repo, templates, CSS, translations)
- [ ] Remove legacy routes from any route config
- [ ] Keep `User.shamraOcrCredit` field (backward compat) but stop using it

### Phase 2: Backend Core
- [ ] Create `OcrJob` entity + migration
- [ ] Create `OcrJobRepository`
- [ ] Extend `MistralOcrService` — add `extractWithImages()`, raise page limit
- [ ] Create `OcrJobService` — job lifecycle management
- [ ] Create `ProcessOcrJob` message
- [ ] Create `ProcessOcrJobHandler` — async processing pipeline
- [ ] Register message routing in `messenger.yaml`
- [ ] Add `ocr_standalone` operation to `UsageMonitorService`
- [ ] Add `deductDynamicCredits()` to `UsageMonitorService`

### Phase 3: Controller & API
- [ ] Create `OcrController` with all routes
- [ ] Implement upload validation (file type, size, page count, credits)
- [ ] Implement job status polling endpoint
- [ ] Implement download endpoint (md/docx/pdf generation)
- [ ] Implement delete endpoint

### Phase 4: Email Notification
- [ ] Create `templates/emails/ocr_complete.html.twig`
- [ ] Create `templates/emails/ocr_failed.html.twig`
- [ ] Integrate email sending in `ProcessOcrJobHandler`

### Phase 5: Frontend
- [ ] Create `templates/ocr/index.html.twig` — modern upload UI
- [ ] Implement drag-and-drop upload with progress
- [ ] Implement job status polling and live updates
- [ ] Implement download buttons for all formats
- [ ] Implement job history list
- [ ] Add translations (`Ocr.en.yml`, `Ocr.ar.yml`)

### Phase 6: Admin Dashboard
- [ ] Add OCR stats API endpoint to `PlaygroundAdminController`
- [ ] Add OCR monitoring section to admin dashboard template
- [ ] Add OCR data to `UsageMonitorService::getDashboardStats()`

### Phase 7: Testing & Polish
- [ ] Write unit tests for `OcrJobService`
- [ ] Write integration tests for `ProcessOcrJobHandler`
- [ ] Test with various PDF types (scanned, text, mixed, with images)
- [ ] Test credit deduction across all tiers
- [ ] Test email delivery
- [ ] Test file download for all formats
- [ ] Load test with concurrent uploads

---

## 8. Database Migration

```sql
CREATE TABLE ocr_job (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    file_name VARCHAR(255) NOT NULL,
    original_file VARCHAR(255) NOT NULL,
    page_count INT NOT NULL DEFAULT 0,
    credits_charged INT NOT NULL DEFAULT 0,
    output_markdown LONGTEXT DEFAULT NULL,
    output_images JSON DEFAULT NULL,
    error_message TEXT DEFAULT NULL,
    slug VARCHAR(64) NOT NULL,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    completed_at DATETIME DEFAULT NULL,
    UNIQUE INDEX UNIQ_ocr_job_slug (slug),
    INDEX IDX_ocr_job_user_status (user_id, status),
    INDEX IDX_ocr_job_created (created_at),
    CONSTRAINT FK_ocr_job_user FOREIGN KEY (user_id) REFERENCES user(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```

---

## 9. File System Layout

```
uploads/ocr/
  {slug}/
    original.pdf          ← uploaded file
    output.md             ← extracted markdown
    output.docx           ← generated DOCX
    output.pdf            ← generated PDF
    images/
      img_001.png         ← extracted images
      img_002.jpg
```
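
Given this layout, `getDownloadPath()` reduces to a lookup from format to file name. The sketch below uses a hypothetical standalone function and adds slug validation as a guard against path traversal; the accepted slug alphabet is an assumption.

```php
// Hypothetical resolver for the layout above; only whitelisted formats map to a file.
function resolveOcrOutputPath(string $uploadsRoot, string $slug, string $format): string
{
    // Slugs are generated server-side, but validating them blocks path traversal.
    if (preg_match('/\A[A-Za-z0-9_-]{1,64}\z/', $slug) !== 1) {
        throw new InvalidArgumentException('Invalid slug.');
    }

    $files = ['md' => 'output.md', 'docx' => 'output.docx', 'pdf' => 'output.pdf'];
    if (!isset($files[$format])) {
        throw new InvalidArgumentException('Unsupported format.');
    }

    return rtrim($uploadsRoot, '/') . '/ocr/' . $slug . '/' . $files[$format];
}
```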

---

## 10. Security Considerations

- All routes require `ROLE_USER` (authenticated)
- Users can only access their own jobs (`OcrJob.user === currentUser`)
- Slug-based URLs (no sequential IDs exposed)
- File uploads validated: type, size, no executable content
- Rate limiting: max 5 concurrent pending/processing jobs per user
- Credit check before dispatching (prevents abuse)
- Downloads: verify job ownership on every download request
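
Two of these rules, the concurrency cap and the ownership check, are simple enough to sketch against plain values. Both function names are hypothetical, and the real checks would read from `OcrJobRepository` and the security context.

```php
// Hypothetical guards for the rules above, written against plain values.
const OCR_MAX_ACTIVE_JOBS = 5;

// Active means pending or processing; completed and failed jobs do not count.
function canStartOcrJob(array $userJobStatuses): bool
{
    $active = array_filter(
        $userJobStatuses,
        fn (string $status): bool => in_array($status, ['pending', 'processing'], true)
    );

    return count($active) < OCR_MAX_ACTIVE_JOBS;
}

// Every job endpoint should compare the job owner with the logged-in user.
function userOwnsJob(int $jobUserId, int $currentUserId): bool
{
    return $jobUserId === $currentUserId;
}
```

When `userOwnsJob()` fails, answering with a 404 instead of a 403 is one way to avoid revealing that a slug exists.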
