# OJS Crawler — Technical Plan

> **Purpose**: Build a standalone Python application that crawls Open Journal Systems (OJS) instances, downloads research PDFs, enriches metadata with AI, and indexes complete documents into a **dedicated Elasticsearch index**. Includes a web interface for managing seed files (crawl targets).
>
> **Key Constraints**:
> - **No database writes** — zero interaction with MySQL. All data lives in Elasticsearch only.
> - **ES-only storage** — a new, empty index with the same mapping as the production `arabic_research` index, plus explicit language handling.
> - **Bilingual** — crawls both Arabic and English research. Language is defined per seed.
> - **Arabic priority** — English research gets title + abstract translated to Arabic (never full text). Arabic research is never translated.
> - **Completeness gate** — only fully complete documents (with PDF, title, abstract, content) are indexed. Incomplete data is held in a staging queue with AI-generated prompts to fill gaps.
> - **No PDF = skip** — articles confirmed to have no PDF galley are skipped entirely (not staged, not indexed). The crawler logs them and moves on.
> - **Deduplication** — before processing, the crawler checks all indices (`ojs_research`, `arabic_research`, `english_research`) for existing articles with similar titles or abstracts. Exact and fuzzy matches are detected to prevent indexing research that was already added — whether by the crawler or manually.

---

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Docker Compose & Services](#2-docker-compose--services)
3. [Seed File System](#3-seed-file-system)
4. [Web Management Interface](#4-web-management-interface)
5. [Elasticsearch Configuration](#5-elasticsearch-configuration)
6. [Index Mapping](#6-index-mapping)
7. [Document Schema & Field Reference](#7-document-schema--field-reference)
8. [Crawling Pipeline (OAI-PMH + Fallbacks)](#8-crawling-pipeline-oai-pmh--fallbacks)
9. [PDF Handling (GROBID + PyMuPDF)](#9-pdf-handling-grobid--pymupdf)
10. [AI Processing & Gap-Filling](#10-ai-processing--gap-filling)
11. [Completeness Gate & Staging](#11-completeness-gate--staging)
12. [Indexing & Duplicate Detection](#12-indexing)
13. [Translation Rules](#13-translation-rules)
14. [Project Structure](#14-project-structure)
15. [Configuration](#15-configuration)
16. [Reference: Production Index Mapping](#16-reference-production-index-mapping)
17. [Reference: Sample Document](#17-reference-sample-document)
18. [Reference: Known Field/Category IDs](#18-reference-known-fieldcategory-ids)
19. [Consistency Risks & Mitigations](#19-consistency-risks--mitigations)

---

## 1. Architecture Overview

```
  ┌──────────────────────────────────────────────────────────────┐
  │                        Docker Compose                        │
  │                                                              │
  │  ┌───────────────────────┐    ┌───────────────────────────┐  │
  │  │  web (Flask)          │    │  worker (Python)          │  │
  │  │  Port 5000            │    │  Crawl engine             │  │
  │  │                       │    │                           │  │
  │  │  • Manage seed files  │    │  1. Read seeds            │  │
  │  │  • View crawl status  │    │  2. OAI-PMH harvest       │  │
  │  │  • Review staged docs │    │     (Sickle) or scrape    │  │
  │  │  • Approve/reject     │    │  3. Download PDFs         │  │
  │  │  • Fill missing fields│    │  4. GROBID parse/PyMuPDF  │  │
  │  │  • Trigger crawls     │    │  5. AI enrichment         │  │
  │  └───────────┬───────────┘    │  6. Completeness gate     │  │
  │              │                │  7. Index to ES or stage  │  │
  │              │                └─────────────┬─────────────┘  │
  │              │                              │                │
  │  ┌───────────▼──────────────────────────────▼─────────────┐  │
  │  │                   Shared Volumes                       │  │
  │  │  • seeds/        (YAML seed files)                     │  │
  │  │  • data/pdfs/    (downloaded PDFs)                     │  │
  │  │  • data/staging/ (incomplete docs)                     │  │
  │  └────────────────────────────────────────────────────────┘  │
  │                                                              │
  │  ┌──────────────────────┐                                    │
  │  │  grobid (container)  │  ← ML-based PDF parsing            │
  │  │  Port 8070           │    Structured metadata + text      │
  │  └──────────────────────┘                                    │
  └──────────────────────────────────────────────────────────────┘
                                  │
                ┌─────────────────▼───────────────────┐
                │          Elasticsearch              │
                │    (External — shamraindex:9200)    │
                │    Index: ojs_research              │
                │    NO database interaction          │
                └─────────────────────────────────────┘
```

### Core Principles

1. **Seed-driven**: Every crawl target comes from a seed file. No hardcoded URLs.
2. **Language-aware**: Each seed declares the language of its journal (`ar` or `en`). This drives translation and field population logic.
3. **Completeness-first**: Documents missing critical fields go to a staging area, not the index. The web app shows what's missing and generates AI prompts to fill gaps.
4. **ES-only**: The crawler never touches MySQL. No inserts, no reads, no connections to the database. All data is self-contained in the ES index.
5. **PDF-required — skip if absent**: If the crawler confirms that no PDF galley exists for an article (no `citation_pdf_url` tag, no galley links, download returns non-PDF), the article is **skipped entirely** — it is not staged, not queued, not indexed. It is logged and counted as "skipped (no PDF)" and the crawler moves on. PDF is a hard prerequisite; there is no path through the pipeline without one.
6. **Deduplicated** — every article is checked against all ES indices (`ojs_research`, `arabic_research`, `english_research`) before processing. Uses multi-signal matching (title phrase match + fuzzy title + abstract similarity) to catch exact duplicates AND near-duplicates. Articles added manually to the main index are detected.
7. **OAI-PMH first**: Use the OAI-PMH protocol (via Sickle) as the primary harvesting method. Fall back to REST API or HTML scraping only when OAI-PMH is unavailable.
8. **GROBID-enhanced**: Use GROBID (an ML-based PDF parser running as a Docker service) for structured metadata and text extraction from PDFs, with PyMuPDF as a fallback.
9. **Containerized**: All services run via Docker Compose for reproducible deployment.

---

## 2. Docker Compose & Services

### docker-compose.yml

```yaml
version: "3.8"

services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    command: python -m webapp.app
    ports:
      - "5000:5000"
    volumes:
      - ./seeds:/app/seeds
      - ./data:/app/data
      - ./.env:/app/.env
    environment:
      - FLASK_ENV=production
    depends_on:
      - grobid
    restart: unless-stopped

  worker:
    build:
      context: .
      dockerfile: Dockerfile
    command: python -m scripts.crawl --daemon
    volumes:
      - ./seeds:/app/seeds
      - ./data:/app/data
      - ./.env:/app/.env
    depends_on:
      - grobid
    restart: unless-stopped

  grobid:
    image: grobid/grobid:0.8.1
    ports:
      - "8070:8070"
    restart: unless-stopped
    # GROBID uses ~2GB RAM. Adjust limits if needed:
    # deploy:
    #   resources:
    #     limits:
    #       memory: 4G
```

### Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc libxml2-dev libxslt1-dev && \
    rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code
COPY . .

# Create data directories
RUN mkdir -p seeds/active seeds/paused seeds/completed seeds/templates \
    data/pdfs data/staging data/indexed data/logs

EXPOSE 5000
```

### Services Summary

| Service | Image / Build | Port | Purpose |
|---|---|---|---|
| **web** | Custom (Dockerfile) | 5000 | Flask web management UI |
| **worker** | Custom (Dockerfile) | — | Crawl engine daemon (reads seeds, harvests, processes) |
| **grobid** | `grobid/grobid:0.8.1` | 8070 | ML-based PDF→structured text+metadata parser |

### Running

```bash
# Start all services
docker compose up -d

# View logs
docker compose logs -f worker

# Run a one-off crawl (instead of daemon)
docker compose run --rm worker python -m scripts.crawl --seed damascus_cs

# Create the ES index
docker compose run --rm worker python -m scripts.create_index

# Stop
docker compose down
```

### Volume Mounts

Both `web` and `worker` share the same volumes:
- `./seeds` — YAML seed files (read by worker, managed by web)
- `./data` — PDFs, staging JSON, indexed archive, logs
- `./.env` — Configuration (ES creds, API keys)

GROBID is stateless — no volumes needed.

---

## 3. Seed File System

### Seed File Format

Each seed file is a YAML file in the `seeds/` directory. One file per journal or OJS instance.

```yaml
# seeds/damascus_university_cs.yaml
seed:
  name: "مجلة جامعة دمشق للعلوم الهندسية"
  name_en: "Damascus University Journal for Engineering Sciences"
  base_url: "https://journal.damascusuniversity.edu.sy"
  language: "ar"                    # "ar" or "en"
  ojs_version: 3                    # 2 or 3
  
  # Harvesting strategy (auto-detected on first crawl, or set manually)
  harvesting:
    oai_pmh:
      available: null               # null = not yet tested, true/false after detection
      url: null                     # Auto-set to {base_url}/oai after detection
      metadata_prefix: "oai_dc"     # Dublin Core (default). Could also be "marcxml"
      set: null                     # OAI-PMH set to harvest (null = all). e.g. "driver"
      functional: null              # true if OAI-PMH returns records, false if empty/broken
    rest_api:
      available: null               # null = not yet tested, true/false after detection
      locked: false                 # true if API returns 404/403 (needs auth)
    strategy: null                  # Auto-set after detection: "oai_pmh", "rest_api", "html_scrape"
  
  # Optional overrides (if known)
  publisher_name_ar: "جامعة دمشق"
  publisher_name_en: "Damascus University"
  default_category: "article"       # article, research, thesis, book
  default_field_id: 101             # Computer Science (null = auto-detect)
  
  # Crawl settings
  crawl:
    start_from: "2020-01-01"        # Only crawl articles after this date
    max_articles: null               # null = unlimited
    rate_limit: 2                    # Seconds between requests
    respect_robots: true
    resume_token: null               # OAI-PMH resumption token (for interrupted harvests)
    
  # Access (some OJS require login)
  auth:
    type: "none"                    # "none", "basic", "api_key"
    username: null
    password: null
    api_key: null
```

#### Harvesting Strategy Auto-Detection

On the **first crawl** (or when `harvesting.strategy` is `null`), the worker runs a detection sequence:

```python
import logging

import requests

logger = logging.getLogger(__name__)

def detect_harvesting_strategy(seed: dict) -> str:
    """Probe the OJS instance and determine the best harvesting method."""
    base_url = seed["base_url"].rstrip("/")
    
    # 1. Try OAI-PMH (preferred)
    oai_url = f"{base_url}/oai"
    try:
        resp = requests.get(oai_url, params={"verb": "Identify"}, timeout=15)
        if resp.status_code == 200 and "<Identify>" in resp.text:
            seed["harvesting"]["oai_pmh"]["available"] = True
            seed["harvesting"]["oai_pmh"]["url"] = oai_url
            
            # Check if it actually has records (many OJS have OAI enabled but empty)
            test = requests.get(oai_url, params={
                "verb": "ListRecords",
                "metadataPrefix": "oai_dc",
                "from": "2020-01-01"
            }, timeout=15)
            if "<record>" in test.text:
                seed["harvesting"]["oai_pmh"]["functional"] = True
                seed["harvesting"]["strategy"] = "oai_pmh"
                return "oai_pmh"
            else:
                seed["harvesting"]["oai_pmh"]["functional"] = False
                logger.warning(f"OAI-PMH enabled but empty/broken at {oai_url}")
    except Exception:
        seed["harvesting"]["oai_pmh"]["available"] = False
    
    # 2. Try REST API (OJS 3.x only)
    if seed.get("ojs_version") == 3:
        api_url = f"{base_url}/api/v1/submissions?count=1&status=3"
        try:
            resp = requests.get(api_url, timeout=15)
            if resp.status_code == 200 and resp.json().get("items"):
                seed["harvesting"]["rest_api"]["available"] = True
                seed["harvesting"]["strategy"] = "rest_api"
                return "rest_api"
            else:
                seed["harvesting"]["rest_api"]["locked"] = True
        except Exception:
            seed["harvesting"]["rest_api"]["available"] = False
    
    # 3. Fall back to HTML scraping (always works)
    seed["harvesting"]["strategy"] = "html_scrape"
    return "html_scrape"
```

The detection result is **saved back to the seed file** so subsequent crawls skip detection.
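Persisting the updated seed can be as simple as dumping the dict back to the same file. A minimal sketch using PyYAML (note: `safe_dump` discards any comments present in hand-written seed files, so treat seed comments as disposable, or regenerate files from templates):

```python
from pathlib import Path

import yaml

def save_seed(seed: dict, path: Path) -> None:
    """Write the (updated) seed dict back to its YAML file.

    safe_dump drops comments from hand-written files; the web UI can
    regenerate them from templates if that matters.
    """
    with open(path, "w", encoding="utf-8") as fh:
        yaml.safe_dump({"seed": seed}, fh, allow_unicode=True, sort_keys=False)
```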

### Seed File Directory Structure

```
seeds/
├── active/                  # Seeds currently being crawled
│   ├── damascus_cs.yaml
│   ├── tishreen_eng.yaml
│   └── ieee_access.yaml
├── paused/                  # Seeds temporarily paused
│   └── aiub_journal.yaml
├── completed/               # Seeds fully crawled (all articles indexed)
│   └── old_journal.yaml
└── templates/               # Empty seed templates
    ├── ojs3_arabic.yaml
    └── ojs2_english.yaml
```

### Seed Validation Rules

1. `base_url` must be a valid URL
2. `language` must be `"ar"` or `"en"`
3. `ojs_version` must be `2` or `3`
4. `rate_limit` minimum `1` second
5. No duplicate `base_url` across active seeds
6. `harvesting.oai_pmh.metadata_prefix` must be `"oai_dc"` or `"marcxml"` if set
7. `harvesting.strategy` must be `null`, `"oai_pmh"`, `"rest_api"`, or `"html_scrape"` if set
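The rules above can be enforced by a small validator shared by the web form and the worker. A sketch (function name and error strings are illustrative, not a fixed API):

```python
from urllib.parse import urlparse

VALID_STRATEGIES = {None, "oai_pmh", "rest_api", "html_scrape"}
VALID_PREFIXES = {None, "oai_dc", "marcxml"}

def validate_seed(seed: dict, active_base_urls: set[str]) -> list[str]:
    """Return a list of human-readable validation errors (empty = valid)."""
    errors = []
    parsed = urlparse(seed.get("base_url") or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("base_url must be a valid http(s) URL")
    if seed.get("language") not in ("ar", "en"):
        errors.append('language must be "ar" or "en"')
    if seed.get("ojs_version") not in (2, 3):
        errors.append("ojs_version must be 2 or 3")
    if seed.get("crawl", {}).get("rate_limit", 2) < 1:
        errors.append("rate_limit must be at least 1 second")
    if seed.get("base_url") in active_base_urls:
        errors.append("duplicate base_url among active seeds")
    harvesting = seed.get("harvesting", {})
    if harvesting.get("oai_pmh", {}).get("metadata_prefix", "oai_dc") not in VALID_PREFIXES:
        errors.append('metadata_prefix must be "oai_dc" or "marcxml"')
    if harvesting.get("strategy") not in VALID_STRATEGIES:
        errors.append("invalid harvesting.strategy")
    return errors
```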

---

## 4. Web Management Interface

Build a lightweight web app (Flask with Jinja2 templates, matching the `web` service wired in the docker-compose) for managing seeds and monitoring crawls.

### Pages / Routes

| Route | Method | Description |
|---|---|---|
| `/` | GET | Dashboard — crawl stats, recent activity |
| `/seeds` | GET | List all seed files (active, paused, completed) |
| `/seeds/new` | GET/POST | Create a new seed file via form |
| `/seeds/<id>` | GET | View seed details + crawl history |
| `/seeds/<id>/edit` | GET/POST | Edit seed configuration |
| `/seeds/<id>/delete` | POST | Delete a seed file |
| `/seeds/<id>/pause` | POST | Move seed to paused/ |
| `/seeds/<id>/activate` | POST | Move seed to active/ |
| `/seeds/<id>/crawl` | POST | Trigger a crawl for this seed |
| `/staging` | GET | List documents in staging (incomplete) |
| `/staging/<doc_id>` | GET | View staged document — show missing fields |
| `/staging/<doc_id>/fill` | POST | Submit AI-generated or manual field values |
| `/staging/<doc_id>/approve` | POST | Move from staging to index |
| `/staging/<doc_id>/reject` | POST | Discard staged document |
| `/index` | GET | Browse indexed documents (search, filter) |
| `/index/<doc_id>` | GET | View indexed document details |
| `/crawls` | GET | Crawl history — runs, articles found/indexed/staged/skipped (no PDF)/skipped (duplicate) |
| `/crawls/<run_id>` | GET | Single crawl run details (skip reasons, duplicate match details) |
| `/api/stats` | GET | JSON stats for dashboard |

### Dashboard Widgets

- **Total indexed documents** (from ES count)
- **Skipped articles (no PDF)** — total and per-seed count of articles skipped because no PDF was available
- **Skipped articles (duplicate)** — total and per-seed count of articles skipped because they already exist in an index (shows match type: exact/fuzzy/title+abstract)
- **Staging queue size** (incomplete documents waiting)
- **Active seeds count**
- **Crawl history chart** (documents indexed / skipped-dup / skipped-noPDF / staged per day)
- **Language distribution** (Arabic vs English pie chart)
- **Top publishers** (bar chart)

### Seed Creation Form Fields

| Field | Type | Required | Notes |
|---|---|---|---|
| Journal Name (Arabic) | text | Yes | |
| Journal Name (English) | text | No | |
| Base URL | url | Yes | OJS instance root URL |
| Language | select | Yes | Arabic / English |
| OJS Version | select | Yes | 2 / 3 |
| Default Category | select | No | Article, Research, Thesis, Book |
| Default Field | select | No | Dropdown of known fields |
| Start Date | date | No | Only crawl after this date |
| Max Articles | number | No | Limit articles per crawl |
| Rate Limit | number | No | Default: 2 seconds |
| OAI-PMH URL | url | No | Auto-detected. Override if non-standard. |

#### OAI-PMH Auto-Detection on Seed Creation

When the user enters a Base URL and clicks "Create" (or a dedicated "Detect" button), the backend probes the OJS instance in real time:

1. **OAI-PMH**: `GET {base_url}/oai?verb=Identify` → check for `<Identify>` response → then `ListRecords` to verify records exist
2. **REST API**: `GET {base_url}/api/v1/submissions?count=1&status=3` → check for `items` array
3. **Fallback**: HTML scraping always available

The form displays a status badge after detection:
- 🟢 **OAI-PMH (functional)** — best case, full metadata via standard protocol
- 🟡 **OAI-PMH (enabled but empty)** — endpoint exists but returns no records. Fall back to REST/scrape
- 🟢 **REST API** — OJS 3.x API accessible
- 🟠 **HTML scraping only** — no structured API available. Slower, relies on `citation_*` meta tags
- 🔴 **Unreachable** — base URL doesn't respond

### Staging Review Interface

For each staged (incomplete) document, the web app should show:

1. **What we have**: Title, abstract, PDF status, extracted text preview
2. **What's missing**: Red-highlighted missing fields
3. **AI suggestions**: For each missing field, display an AI-generated prompt and suggested value
4. **Actions**: 
   - "Accept AI suggestion" (one-click fill)
   - "Edit manually" (text input)
   - "Regenerate suggestion" (re-run AI with different prompt)
   - "Approve & Index" (when all fields are filled)
   - "Reject" (discard the document)

---

## 5. Elasticsearch Configuration

| Setting | Value |
|---|---|
| **Host (public)** | `https://shamraindex:9200` |
| **Host (internal)** | `https://shamraindex:9200` |
| **Auth** | `elastic:<ES_PASSWORD>` |
| **New index name** | `ojs_research` |
| **Shards** | 1 |
| **Replicas** | 0 |
| **ES version** | 7.17.x |

### Index Creation

Create a fresh, empty index with the same mapping as `arabic_research`. The index should be called `ojs_research` (or configurable).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://shamraindex:9200",
    basic_auth=("elastic", "<ES_PASSWORD>"),
    request_timeout=30
)

# Create the index with the mapping from Section 6
es.indices.create(index="ojs_research", body={
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": { ... }  # See Section 6
})
```

---

## 6. Index Mapping

Use the **exact same mapping** as the production `arabic_research` index. The language of each document is captured in the `language` field (already present in the mapping).

```json
{
  "mappings": {
    "properties": {
      "id":                     { "type": "keyword" },
      "arabic_full_title":      { "type": "text", "analyzer": "arabic", "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } } },
      "english_full_title":     { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } } },
      "arabic_abstract":        { "type": "text", "analyzer": "arabic", "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } } },
      "english_abstract":       { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } } },
      "content":                { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } } },
      "slug":                   { "type": "keyword" },
      "document_name":          { "type": "keyword" },
      "language":               { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
      "tag":                    { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "tag_id":                 { "type": "keyword" },
      "arabic_fields":          { "type": "keyword" },
      "english_fields":         { "type": "keyword" },
      "fields_id":              { "type": "keyword" },
      "arabic_publisher_name":  { "type": "keyword" },
      "english_publisher_name": { "type": "keyword" },
      "publisher_id":           { "type": "keyword" },
      "arabic_category_name":   { "type": "keyword" },
      "english_category_name":  { "type": "keyword" },
      "category_id":            { "type": "keyword" },
      "textualPublisher":       { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
      "research_references":    { "type": "text", "index": false, "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } } },
      "related_researches":     { "type": "text", "index": false, "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } } },
      "createdAt":              { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "publication_date":       { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "last_updated_at":        { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "downloads":              { "type": "long" },
      "hits":                   { "type": "long" },
      "cites":                  { "type": "long" },
      "raters":                 { "type": "long" },
      "rate":                   { "type": "float" },
      "deleted":                { "type": "boolean" },
      "creator_id":             { "type": "long" },
      "updater_id":             { "type": "long" },
      "name":                   { "type": "long" },
      "chatGPT":                { "type": "object", "enabled": false },
      "chatGPTen":              { "type": "object", "enabled": false },
      "ai_summary":             { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } } },
      "ai_critique":            { "type": "text" },
      "ai_keywords":            { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
    }
  }
}
```

### Language Values

The `language` field stores the original language of the research:
- `"العربية"` — Arabic research
- `"English"` — English research

These match the values used in the production `arabic_research` index. The seed's `language` field (`"ar"` / `"en"`) maps to these values.
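The seed-to-index mapping is worth centralizing so the two string values never drift:

```python
# The seed's two-letter code maps to the exact strings stored in `language`.
SEED_LANG_TO_ES = {"ar": "العربية", "en": "English"}

def language_value(seed: dict) -> str:
    return SEED_LANG_TO_ES[seed["language"]]
```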

---

## 7. Document Schema & Field Reference

### All Fields by Category

#### Identity

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `id` | integer | Generated | **Yes** | Sequential. Get max from index, increment. |
| `slug` | string | Generated | **Yes** | 13-14 char hex unique ID. Use `uuid4().hex[:14]`. |
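A sketch of the two generated identity fields. The slug follows the `uuid4().hex[:14]` note above; the max-id lookup is an assumption about how "get max from index" might be queried — and since `id` is mapped as `keyword`, a numeric `max` aggregation may in practice require a script field or a parallel numeric copy:

```python
from uuid import uuid4

def make_slug() -> str:
    """14-char hex unique ID (production slugs are 13-14 chars)."""
    return uuid4().hex[:14]

def next_id(es, index: str = "ojs_research") -> int:
    """Highest existing id + 1. Sketch only — see the note above about
    aggregating on a keyword-mapped `id` field."""
    resp = es.search(index=index, size=0,
                     body={"aggs": {"max_id": {"max": {"field": "id"}}}})
    current = resp["aggregations"]["max_id"]["value"] or 0
    return int(current) + 1
```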

#### Titles (Bilingual)

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `arabic_full_title` | string | OJS (if `ar`) or AI translation (if `en`) | **Yes** | For English seeds: AI-translate the English title to Arabic. |
| `english_full_title` | string | OJS (if `en`) or from OJS when available (if `ar`) | No | For Arabic seeds: use from OJS if bilingual metadata exists, otherwise leave empty. Do NOT translate Arabic→English. |

#### Abstracts (Bilingual)

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `arabic_abstract` | string | OJS (if `ar`) or AI translation (if `en`) | **Yes** | For English seeds: AI-translate the English abstract to Arabic. |
| `english_abstract` | string | OJS (if `en`) or from OJS when available (if `ar`) | No | For Arabic seeds: use from OJS if bilingual metadata exists, otherwise leave empty. Do NOT translate Arabic→English. |

#### Full Text

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `content` | string | PDF text extraction | **Yes** | Keep in original language. NEVER translate full text. |
| `document_name` | string | Generated | **Yes** | PDF filename stored on disk. |

#### Language

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `language` | string | Seed config | **Yes** | `"العربية"` or `"English"` based on seed language. |

#### Tags / Keywords

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `tag` | array[string] | OJS metadata + AI | **Yes** | Mix of Arabic and English keywords. |
| `tag_id` | array[string] | Generated | No | Use `"ojs-{n}"` placeholder IDs (no DB). |

#### Subject Fields

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `arabic_fields` | array[string] | Seed config or AI | **Yes** | At least one field required. |
| `english_fields` | array[string] | Seed config or AI | **Yes** | At least one field required. |
| `fields_id` | array[string] | Seed config or AI | **Yes** | ID from known fields list (Section 17). |

#### Publisher

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `arabic_publisher_name` | string | Seed config | **Yes** | From seed `publisher_name_ar` or OJS metadata. |
| `english_publisher_name` | string | Seed config | No | From seed `publisher_name_en` or OJS metadata. |
| `publisher_id` | integer | Generated | **Yes** | Use a crawler-generated sequence (starting from 100000 to avoid clashing with production IDs). |
| `textualPublisher` | string | OJS metadata | **Yes** | Author names, comma-separated. |

#### Category

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `arabic_category_name` | string | Seed config / mapped | **Yes** | See category table in Section 17. |
| `english_category_name` | string | Seed config / mapped | **Yes** | |
| `category_id` | integer | Seed config / mapped | **Yes** | Default: `2` (Article). |

#### Dates

| Field | Type | Source | Required | Notes |
|---|---|---|---|---|
| `createdAt` | string | Generated | **Yes** | Crawl timestamp. Format: `"yyyy-MM-dd HH:mm:ss"` |
| `publication_date` | string | OJS metadata | **Yes** | Article publication date. Same format. |
| `last_updated_at` | string | Generated | No | Crawl timestamp. |
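The mapping's date format `yyyy-MM-dd HH:mm:ss` (Java notation) corresponds to `%Y-%m-%d %H:%M:%S` in Python. OJS and OAI-PMH dates arrive in several shapes, so a normalization helper is useful (the set of accepted input formats here is an assumption):

```python
from datetime import datetime, timezone

ES_DATE_FMT = "%Y-%m-%d %H:%M:%S"   # matches the mapping's "yyyy-MM-dd HH:mm:ss"

def now_es() -> str:
    """Crawl timestamp for createdAt / last_updated_at."""
    return datetime.now(timezone.utc).strftime(ES_DATE_FMT)

def normalize_pub_date(raw: str) -> str:
    """Coerce common OJS/OAI date shapes (e.g. '2021-05-03') to the index format."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime(ES_DATE_FMT)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized publication date: {raw!r}")
```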

#### Counters (all default to zero)

| Field | Default | Notes |
|---|---|---|
| `downloads` | `0` | |
| `hits` | `0` | |
| `cites` | `0` | |
| `raters` | `0` | |
| `rate` | `0.0` | |

#### Flags

| Field | Default | Notes |
|---|---|---|
| `deleted` | `false` | Always false for new documents. |

#### Metadata / Admin

| Field | Type | Notes |
|---|---|---|
| `creator_id` | integer | Set to `0` or a configurable crawler ID. |
| `updater_id` | integer | Same as creator_id. |

#### References / Relations

| Field | Default | Notes |
|---|---|---|
| `research_references` | `"[]"` | JSON array of DOIs/URLs from OJS if available. |
| `related_researches` | `"{}"` | Empty — no cross-referencing in this index. |

#### AI-Generated Fields

| Field | Type | Source | Notes |
|---|---|---|---|
| `ai_summary` | string | AI | Summary in the document's original language. |
| `ai_critique` | string | AI | Critique in the document's original language. |
| `ai_keywords` | string | AI | Comma-separated keywords (bilingual). |
| `chatGPT` | object | AI | Q&A pairs about the research, or `{}`. |
| `chatGPTen` | object | AI | English Q&A pairs, or `{}`. |

---

## 8. Crawling Pipeline (OAI-PMH + Fallbacks)

```
┌──────────┐    ┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────┐
│ 1. Read  │───▶│ 2. Crawl │───▶│ 2b. Dedup │───▶│ 3. Down- │───▶│ 4. AI Enrich │───▶│ 5. Gate  │
│ Seeds    │    │ OJS      │    │ Check     │    │ load PDF │    │ & Translate  │    │ Check    │
└──────────┘    └──────────┘    └─────┬─────┘    └────┬─────┘    └──────────────┘    └────┬─────┘
                                      │               │                                   │
                            DUPLICATE │        NO PDF │                    ┌──────────────┼──────────┐
                                      ▼               ▼                    │ COMPLETE     │          │ INCOMPLETE
                                 ┌──────────┐    ┌──────────┐              ▼              │          ▼
                                 │  SKIP    │    │  SKIP    │         ┌──────────┐        │     ┌──────────┐
                                 │  (dup)   │    │  (no PDF)│         │ 6. Index │        │     │ 6. Stage │
                                 └──────────┘    └──────────┘         │ to ES    │        │     │ + Prompts│
                                                                      └──────────┘        │     └──────────┘
                                                                                          │
                                                                                 Web app for review
```

> **Dedup = Skip**: Step 2b checks the article's title (and abstract when available) against
> **all three indices** — `ojs_research`, `arabic_research`, and `english_research`. This catches
> duplicates from previous crawls AND manually-added research. Exact phrase matches are auto-skipped;
> fuzzy near-matches (≥85% title similarity) are also skipped. The counter `skipped_duplicate`
> is incremented. See the [Duplicate Detection](#duplicate-detection) section for full algorithm.

> **No-PDF = Skip**: If Step 3 determines that no PDF exists for an article, the article is
> **skipped immediately**. It does NOT proceed to AI enrichment, staging, or the completeness
> gate. The skip is logged with the article URL and reason, and the counter `skipped_no_pdf`
> is incremented. This is a hard rule — there is no way to manually add a PDF later via
> the staging UI. If the journal later adds a PDF galley, a re-crawl will pick it up.

### Step 1: Read Seeds

```python
import yaml
from pathlib import Path

def load_active_seeds() -> list[dict]:
    """Load all YAML seed files from seeds/active/."""
    seeds = []
    for f in sorted(Path("seeds/active").glob("*.yaml")):
        with open(f, encoding="utf-8") as fh:
            data = yaml.safe_load(fh)
        seed = data["seed"]
        seed["_file"] = f.name  # remember the filename so crawl state can be saved back
        seeds.append(seed)
    return seeds
```

### Step 2: Crawl OJS Metadata

For each seed, discover articles using the harvesting strategy determined during seed setup (see [Harvesting Strategy Auto-Detection](#harvesting-strategy-auto-detection)). The strategy is stored in the seed file as `harvesting.strategy`.

#### Strategy A: OAI-PMH via Sickle (preferred)

When `harvesting.strategy == "oai_pmh"`, use the [Sickle](https://sickle.readthedocs.io/) library for standards-compliant OAI-PMH harvesting:

```python
import logging

from sickle import Sickle
from sickle.oaiexceptions import NoRecordsMatch

logger = logging.getLogger(__name__)

def harvest_oai_pmh(seed: dict) -> list[dict]:
    """Harvest all articles via OAI-PMH. Returns list of article metadata dicts."""
    oai_url = seed["harvesting"]["oai_pmh"]["url"]
    prefix = seed["harvesting"]["oai_pmh"].get("metadata_prefix", "oai_dc")
    oai_set = seed["harvesting"]["oai_pmh"].get("set")
    start_from = seed["crawl"].get("start_from")
    resume_token = seed["crawl"].get("resume_token")

    # Note: don't name this variable `sickle` — it would shadow the module.
    client = Sickle(oai_url, max_retries=3, timeout=30)
    articles = []

    params = {"metadataPrefix": prefix}
    if start_from:
        params["from"] = start_from
    if oai_set:
        params["set"] = oai_set

    try:
        if resume_token:
            records = client.ListRecords(resumptionToken=resume_token)
        else:
            records = client.ListRecords(**params)

        for record in records:
            if record.deleted:
                continue

            meta = record.metadata
            article = {
                "oai_identifier": record.header.identifier,
                "title": _first(meta.get("title", [])),
                "abstract": _first(meta.get("description", [])),
                "authors": meta.get("creator", []),
                "keywords": meta.get("subject", []),
                "date": _first(meta.get("date", [])),
                "url": _first([u for u in meta.get("identifier", []) if u.startswith("http")]),
                "citation_pdf_url": _first([u for u in meta.get("identifier", [])
                                            if u.endswith(".pdf") or "/download/" in u]),
                "source": "oai_pmh"
            }
            articles.append(article)

            # Save the current resumption token for crash recovery
            if getattr(records, "resumption_token", None):
                seed["crawl"]["resume_token"] = records.resumption_token.token

            if seed["crawl"].get("max_articles") and len(articles) >= seed["crawl"]["max_articles"]:
                break

    except NoRecordsMatch:
        logger.info(f"No records found via OAI-PMH for {oai_url}")

    return articles

def _first(lst: list) -> str:
    """Return first non-empty element from a list, or empty string."""
    return next((x for x in lst if x), "")
```

**Resumption tokens**: OAI-PMH uses resumption tokens for pagination. If a crawl is interrupted, the token is saved to the seed file and the next run resumes from where it stopped.
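
Writing the token back can be sketched as follows — a minimal version assuming seeds are loaded with a `_file` key recording their filename, and that the whole seed is re-serialized under the top-level `seed:` key:

```python
import yaml
from pathlib import Path

def save_resume_token(seed: dict, seeds_dir: str = "seeds/active") -> None:
    """Persist the in-memory seed (including the latest resumption token)
    back to its YAML file so an interrupted crawl can resume.
    Assumes the seed dict carries its filename under '_file'."""
    path = Path(seeds_dir) / seed["_file"]
    data = {"seed": {k: v for k, v in seed.items() if k != "_file"}}
    path.write_text(yaml.safe_dump(data, allow_unicode=True))
```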

#### Strategy B: REST API (OJS 3.x fallback)

When `harvesting.strategy == "rest_api"`:

```
GET {base_url}/api/v1/submissions?count=100&offset=0&status=3
```
Status 3 = published. Response includes title, abstract, authors, dates, keywords, galleys (PDF links). Paginate using `offset`.

#### Strategy C: HTML Scraping (universal fallback)

When `harvesting.strategy == "html_scrape"`:

```
GET {base_url}/issue/archive          → list of issues
GET {base_url}/issue/view/{issue_id}  → articles in issue
GET {base_url}/article/view/{art_id}  → article metadata page
```

Parse with BeautifulSoup. Extract `citation_*` meta tags (`citation_title`, `citation_author`, `citation_keywords`, `citation_pdf_url`, `citation_date`, `citation_abstract_html_url`) and Dublin Core meta tags (`DC.Title`, `DC.Creator`, `DC.Subject`, `DC.Description`, `DC.Date`).

> **Real-world note**: Many Syrian OJS instances (e.g., Hama University) have OAI-PMH technically
> enabled but returning zero records (`noRecordsMatch`), and REST API locked behind authentication.
> For these, HTML scraping via `citation_*` meta tags is the only viable method. The auto-detection
> logic handles this gracefully — see [Harvesting Strategy Auto-Detection](#harvesting-strategy-auto-detection).
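
Extracting the `citation_*` tags can be sketched like this (a minimal version — the DC.* fallbacks and issue/article URL walking are omitted; `parse_citation_meta` is a hypothetical helper name):

```python
from bs4 import BeautifulSoup

def parse_citation_meta(html: str) -> dict:
    """Pull Highwire citation_* meta tags from an OJS article page.
    Repeatable tags (authors, keywords) become lists."""
    soup = BeautifulSoup(html, "html.parser")

    def meta(name: str) -> list[str]:
        return [t["content"].strip()
                for t in soup.find_all("meta", attrs={"name": name})
                if t.get("content")]

    return {
        "title": next(iter(meta("citation_title")), ""),
        "authors": meta("citation_author"),
        "keywords": meta("citation_keywords"),
        "date": next(iter(meta("citation_date")), ""),
        "citation_pdf_url": next(iter(meta("citation_pdf_url")), ""),
    }
```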

**Deduplication (Step 2b)**: Before downloading the PDF, run the multi-signal duplicate check against **all three indices** (`ojs_research`, `arabic_research`, `english_research`). This is critical because research may have been added manually to the main production indices. See the [Duplicate Detection](#duplicate-detection) section for the full algorithm. Duplicates are skipped immediately — no PDF download, no AI processing.

### Step 3: Download PDF (or Skip)

This is the **hard gate** in the pipeline. If no PDF can be obtained, the article is skipped entirely.

#### 3a. Detect PDF URL

The crawler checks for a PDF URL in this order:
1. `citation_pdf_url` meta tag (most reliable on OJS)
2. OAI-PMH `<dc:identifier>` pointing to a PDF
3. Galley links on the article page (`/article/download/{id}/{galley_id}`)
4. REST API galley objects (if API is accessible)

If **none** of these yield a PDF URL → **SKIP** the article.

#### 3b. Download & Validate

```python
import hashlib
import uuid
from pathlib import Path

import requests

def download_pdf(pdf_url: str, seed: dict) -> Path | None:
    """Download PDF to local storage. Returns local path or None on failure."""
    try:
        resp = requests.get(pdf_url, timeout=60, headers={"User-Agent": "ShamraCrawler/1.0"})
    except requests.RequestException:
        return None  # Network error → caller skips the article
    if resp.status_code != 200 or "pdf" not in resp.headers.get("content-type", "").lower():
        return None
    
    filename = f"document_{hashlib.md5(uuid.uuid4().hex.encode()).hexdigest()}.pdf"
    path = Path("data/pdfs") / filename
    path.write_bytes(resp.content)
    return path
```

If download fails (404, non-PDF content-type, network error) → **SKIP** the article.

#### 3c. Skip Logic

```python
def process_article(article_meta: dict, seed: dict, stats: dict, es):
    """Process a single article. Skip if duplicate or no PDF."""
    
    # --- Dedup gate (Step 2b) — runs BEFORE PDF download to save bandwidth ---
    title = article_meta.get("title", "")
    abstract = article_meta.get("abstract", "")
    if title:
        dup_result = check_duplicate(es, title, abstract, seed.get("language", "ar"))
        if dup_result["is_duplicate"]:
            logger.info(
                f"SKIP (duplicate — {dup_result['match_type']} in {dup_result['matched_index']}): "
                f"{title} — score={dup_result['score']:.2f}"
            )
            stats["skipped_duplicate"] += 1
            return  # ← Article is abandoned here. No PDF download.
    
    # --- PDF gate (hard requirement) ---
    pdf_url = article_meta.get("citation_pdf_url")
    if not pdf_url:
        pdf_url = find_galley_pdf(article_meta)  # Try alternative detection
    
    if not pdf_url:
        logger.info(f"SKIP (no PDF URL): {article_meta.get('title')} — {article_meta.get('url')}")
        stats["skipped_no_pdf"] += 1
        return  # ← Article is abandoned here. No staging.
    
    pdf_path = download_pdf(pdf_url, seed)
    if not pdf_path:
        logger.info(f"SKIP (PDF download failed): {article_meta.get('title')} — {pdf_url}")
        stats["skipped_no_pdf"] += 1
        return  # ← Article is abandoned here. No staging.
    
    # --- PDF obtained — continue pipeline ---
    extraction = extract_text(pdf_path)  # Returns a dict — see Section 9
    content = extraction["text"]
    if not content or len(content) < 100:
        logger.info(f"SKIP (PDF has no extractable text): {article_meta.get('title')} — {pdf_url}")
        stats["skipped_no_pdf"] += 1
        pdf_path.unlink(missing_ok=True)  # Clean up useless PDF
        return  # ← Article is abandoned here. No staging.
    
    # Proceed to AI enrichment (Step 4) ...
    enriched = ai_enrich(article_meta, content, seed["language"])
    
    # Build document and run completeness gate (Step 5) ...
    doc = build_document(article_meta, enriched, content, pdf_path, seed)
    is_complete, missing = check_completeness(doc)
    
    if is_complete:
        index_document(es, doc)
        stats["indexed"] += 1
    else:
        stage_document(doc, missing)
        stats["staged"] += 1
```

> **Key point**: The staging system is for documents that HAVE a PDF but are missing
> metadata fields (keywords, subject, publisher, etc.). Documents without a PDF never
> reach staging — they are skipped and logged.

### Step 4: Extract Text + AI Enrichment

See Sections 9 (GROBID + PyMuPDF) and 13 (Translation Rules) for details.

### Step 5: Completeness Gate

See Section 11.

---

## 9. PDF Handling (GROBID + PyMuPDF)

### Local Storage

PDFs are downloaded to `data/pdfs/` in the crawler's working directory. They stay local — the crawler does NOT upload to the Shamra production server.

### File Naming

```
document_{md5_of_uuid}.pdf
```

Example: `document_a1b2c3d4e5f6789012345678abcdef01.pdf`

### Text Extraction Strategy

The crawler uses a **two-tier extraction** approach:

1. **GROBID (primary)** — ML-based PDF parser running as a Docker service on port 8070. Extracts structured text, metadata, references, and section headers. Best for academic PDFs.
2. **PyMuPDF (fallback)** — Used when GROBID is unavailable or fails. Extracts raw text without structure.

#### GROBID Extraction (Primary)

```python
import requests
from pathlib import Path

GROBID_URL = "http://grobid:8070"  # Docker service name

def extract_with_grobid(pdf_path: Path) -> dict | None:
    """
    Send PDF to GROBID for structured extraction. Returns None on failure.
    
    Returns:
        {
            "text": str,           # Full text content
            "title": str | None,   # Extracted title (can supplement OJS metadata)
            "abstract": str | None,
            "authors": list[str],
            "references": list[str],
            "keywords": list[str],
            "source": "grobid"
        }
    """
    # Full document processing (most comprehensive)
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{GROBID_URL}/api/processFulltextDocument",
            files={"input": f},
            data={
                "consolidateHeader": "1",
                "consolidateCitations": "0",    # Skip citation resolution (slow)
                "includeRawAffiliations": "1",
                "teiCoordinates": "false"
            },
            timeout=120
        )
    
    if resp.status_code != 200:
        logger.warning(f"GROBID failed ({resp.status_code}) for {pdf_path.name}")
        return None
    
    # Parse TEI XML response
    from lxml import etree
    tei = etree.fromstring(resp.content)
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    
    # Extract full text from <body>
    body = tei.find(".//tei:body", ns)
    text = " ".join(body.itertext()).strip() if body is not None else ""
    
    # Extract title
    title_el = tei.find(".//tei:titleStmt/tei:title", ns)
    title = title_el.text.strip() if title_el is not None and title_el.text else None
    
    # Extract abstract
    abstract_el = tei.find(".//tei:profileDesc/tei:abstract", ns)
    abstract = " ".join(abstract_el.itertext()).strip() if abstract_el is not None else None
    
    # Extract authors
    authors = []
    for author in tei.findall(".//tei:fileDesc//tei:persName", ns):
        parts = [p.text for p in author if p.text]
        if parts:
            authors.append(" ".join(parts))
    
    # Extract references
    references = []
    for ref in tei.findall(".//tei:listBibl/tei:biblStruct", ns):
        ref_title = ref.find(".//tei:title", ns)
        if ref_title is not None and ref_title.text:
            references.append(ref_title.text.strip())
    
    # Extract keywords
    keywords = []
    for kw in tei.findall(".//tei:profileDesc/tei:textClass//tei:term", ns):
        if kw.text:
            keywords.append(kw.text.strip())
    
    return {
        "text": text,
        "title": title,
        "abstract": abstract,
        "authors": authors,
        "references": references,
        "keywords": keywords,
        "source": "grobid"
    }
```

#### PyMuPDF Extraction (Fallback)

```python
import re

import fitz  # PyMuPDF

def extract_with_pymupdf(pdf_path: Path) -> dict:
    """Fallback: extract raw text from PDF using PyMuPDF."""
    doc = fitz.open(str(pdf_path))
    text = ""
    for page in doc:
        text += page.get_text("text")
    doc.close()
    
    # Basic cleanup
    text = text.strip()
    text = re.sub(r'\n{3,}', '\n\n', text)  # Collapse excessive newlines
    
    return {
        "text": text,
        "title": None,      # PyMuPDF doesn't extract structured metadata
        "abstract": None,
        "authors": [],
        "references": [],
        "keywords": [],
        "source": "pymupdf"
    }
```

#### Combined Extraction Function

```python
def extract_text(pdf_path: Path) -> dict:
    """
    Extract text and metadata from PDF.
    Tries GROBID first, falls back to PyMuPDF.
    
    Returns dict with 'text', 'title', 'abstract', 'authors', 'references', 'keywords', 'source'.
    """
    # Try GROBID first
    try:
        result = extract_with_grobid(pdf_path)
        if result and result["text"] and len(result["text"]) >= 100:
            logger.info(f"GROBID extraction successful for {pdf_path.name} ({len(result['text'])} chars)")
            return result
        else:
            logger.warning(f"GROBID returned insufficient text for {pdf_path.name}, trying PyMuPDF")
    except Exception as e:
        logger.warning(f"GROBID unavailable or errored: {e}, falling back to PyMuPDF")
    
    # Fallback to PyMuPDF
    result = extract_with_pymupdf(pdf_path)
    if result["text"] and len(result["text"]) >= 100:
        logger.info(f"PyMuPDF extraction successful for {pdf_path.name} ({len(result['text'])} chars)")
        return result
    
    logger.warning(f"Both extractors failed for {pdf_path.name}")
    return {"text": "", "title": None, "abstract": None, "authors": [], 
            "references": [], "keywords": [], "source": "none"}
```

#### GROBID Metadata Supplement

When GROBID extracts structured metadata (title, abstract, authors, keywords) from the PDF, it can **supplement** — but never **override** — metadata already obtained from OJS:

```python
def supplement_metadata(article_meta: dict, grobid_result: dict) -> dict:
    """Fill gaps in OJS metadata using GROBID-extracted data."""
    if grobid_result["source"] != "grobid":
        return article_meta  # PyMuPDF doesn't provide metadata
    
    # Only fill if OJS didn't provide it
    if not article_meta.get("title") and grobid_result.get("title"):
        article_meta["title"] = grobid_result["title"]
        article_meta["title_source"] = "grobid"
    
    if not article_meta.get("abstract") and grobid_result.get("abstract"):
        article_meta["abstract"] = grobid_result["abstract"]
        article_meta["abstract_source"] = "grobid"
    
    if not article_meta.get("authors") and grobid_result.get("authors"):
        article_meta["authors"] = grobid_result["authors"]
        article_meta["authors_source"] = "grobid"
    
    if not article_meta.get("keywords") and grobid_result.get("keywords"):
        article_meta["keywords"] = grobid_result["keywords"]
        article_meta["keywords_source"] = "grobid"
    
    if grobid_result.get("references"):
        article_meta["references"] = grobid_result["references"]
    
    return article_meta
```

### GROBID Health Check

The worker checks GROBID availability at startup:

```python
def check_grobid() -> bool:
    """Verify GROBID service is running and healthy."""
    try:
        resp = requests.get(f"{GROBID_URL}/api/isalive", timeout=5)
        return resp.status_code == 200
    except Exception:
        return False
```

If GROBID is down, the crawler logs a warning and continues with PyMuPDF only.

### PDF Validation

Before accepting a PDF:
1. File size > 10KB (reject tiny/corrupt files)
2. At least 1 page
3. Extracted text > 100 characters (reject image-only PDFs without OCR)
4. Valid PDF header (`%PDF-`)

---

## 10. AI Processing & Gap-Filling

### 10.1 When AI Is Used

AI is invoked for:
1. **Translation** (English→Arabic title + abstract only — see Section 13)
2. **Missing keywords** — if OJS metadata has no keywords
3. **Missing field/subject** — if seed has no `default_field_id` and OJS has no subject metadata
4. **Summary generation** (`ai_summary`)
5. **Critique generation** (`ai_critique`)
6. **Keyword generation** (`ai_keywords`)
7. **Q&A generation** (`chatGPT`)

### 10.2 Prompt Templates for Gap-Filling

The system should generate specific prompts for each missing field. These prompts are stored with the staged document so the web app can display them and let the user trigger AI completion or manually fill them.

#### Missing Keywords Prompt

```
You are an academic research metadata specialist.

Given the following research article:
- Title: {title}
- Abstract: {abstract}
- Language: {language}

Generate 5-10 academic keywords for this research.
Return keywords in BOTH Arabic and English, even if the article is in only one language.

Format your response as a JSON object:
{
  "keywords_ar": ["كلمة1", "كلمة2", ...],
  "keywords_en": ["keyword1", "keyword2", ...]
}
```
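
Since these templates contain literal JSON braces, `str.format` would choke on them; a sketch of template rendering plus tolerant JSON extraction from the model's reply (both helper names are assumptions, not part of the plan above):

```python
import json
import re

def render_prompt(template: str, **fields: str) -> str:
    """Substitute only the known {placeholders}, leaving the template's
    literal JSON braces untouched."""
    for key, value in fields.items():
        template = template.replace("{" + key + "}", str(value))
    return template

def parse_json_reply(reply: str) -> dict:
    """Extract the first JSON object from an LLM reply, tolerating
    surrounding prose or ``` fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))
```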

#### Missing Field/Subject Prompt

```
You are an academic research classifier.

Given the following research article:
- Title: {title}
- Abstract: {abstract}
- Keywords: {keywords}

Classify this research into ONE of the following academic fields:

| ID | Arabic | English |
|---|---|---|
| 101 | علوم الحاسوب | Computer Science |
| 102 | الرياضيات | Mathematics |
| 103 | الفيزياء | Physics |
| 104 | الكيمياء | Chemistry |
| 105 | الأحياء | Biology |
| 106 | الطب | Medicine |
| 107 | الذكاء الاصطناعي | Artificial Intelligence |
| 108 | الهندسة | Engineering |
| 109 | الأدب | Literature |
| 110 | القانون | Law |
| 111 | الاقتصاد | Economics |
| 112 | التربية | Education |
| 113 | الزراعة | Agriculture |
| 114 | الصيدلة | Pharmacy |

Return a JSON object:
{
  "field_id": <id>,
  "arabic_name": "<Arabic name>",
  "english_name": "<English name>",
  "confidence": <0.0-1.0>
}
```

#### Missing Publisher Prompt

```
You are a metadata extraction specialist.

From the following PDF text (first page), extract the journal/publisher name:

Text: {first_page_text}

Return a JSON object:
{
  "publisher_ar": "<Arabic name or null>",
  "publisher_en": "<English name or null>"
}
```

#### Summary Prompt

```
You are an academic research analyst writing in {language_name}.

Summarize the following research article in 3-5 sentences. Cover the objective, methodology, and key findings. Use formal academic language.

Title: {title}
Abstract: {abstract}
Content (first 3000 chars): {content[:3000]}

Write your summary in {language_name}.
```

#### Critique Prompt

```
You are an academic peer reviewer writing in {language_name}.

Provide a brief academic critique (3-5 sentences) of the following research. Assess the methodology, highlight strengths and limitations, and evaluate the contribution to the field.

Title: {title}
Abstract: {abstract}

Write your critique in {language_name}.
```

#### Q&A Generation Prompt

```
You are creating educational Q&A pairs about a research article.

Based on this research:
- Title: {title}
- Abstract: {abstract}
- Content (first 3000 chars): {content[:3000]}

Generate 3-5 question-answer pairs in {language_name}. Questions should help a student understand the key aspects of this research.

Return as JSON:
{
  "q1": "question",
  "a1": "answer",
  "q2": "question",
  "a2": "answer",
  ...
}
```

### 10.3 AI Processing Flow

```python
def ai_enrich(article: dict, content: str, seed_language: str) -> dict:
    """Run all AI enrichment on an article."""
    
    lang_name = "العربية" if seed_language == "ar" else "English"
    result = {}
    
    # 1. Translation (English→Arabic only)
    # Keys match the crawl metadata dicts built in Step 2
    if seed_language == "en":
        result["arabic_full_title"] = ai_translate(article["title"], "en", "ar")
        result["arabic_abstract"] = ai_translate(article["abstract"], "en", "ar")
    
    # 2. Keywords (if missing from OJS)
    if not article.get("keywords"):
        result["keywords"] = ai_generate_keywords(article, content)
    
    # 3. Field classification (if not set in seed)
    if not article.get("field_id"):
        result["field"] = ai_classify_field(article, content)
    
    # 4. Summary
    result["ai_summary"] = ai_generate_summary(article, content, lang_name)
    
    # 5. Critique
    result["ai_critique"] = ai_generate_critique(article, lang_name)
    
    # 6. AI Keywords (always generate, even if OJS has some)
    result["ai_keywords"] = ai_generate_keyword_string(article, content)
    
    # 7. Q&A pairs
    result["chatGPT"] = ai_generate_qa(article, content, lang_name)
    
    return result
```

---

## 11. Completeness Gate & Staging

### Required Fields for Indexing

A document is considered **complete** and ready for indexing only if ALL of these are present:

| # | Field | Validation | If Missing |
|---|---|---|---|
| 1 | `arabic_full_title` | Non-empty, ≥ 10 chars | → Stage |
| 2 | `arabic_abstract` OR `english_abstract` | At least one, ≥ 50 chars | → Stage |
| 3 | `content` | Non-empty, ≥ 100 chars (extracted PDF text) | → **SKIP** (pre-gate) |
| 4 | `document_name` | PDF file exists on disk | → **SKIP** (pre-gate) |
| 5 | `language` | `"العربية"` or `"English"` | → Stage |
| 6 | `tag` | At least 1 keyword | → Stage |
| 7 | `arabic_fields` + `fields_id` | At least 1 field | → Stage |
| 8 | `textualPublisher` | Non-empty (author names) | → Stage |
| 9 | `publication_date` | Valid date in `yyyy-MM-dd HH:mm:ss` | → Stage |
| 10 | `arabic_publisher_name` OR `english_publisher_name` | At least one | → Stage |

> **SKIP vs Stage**: Fields #3 and #4 (PDF content and file) are never stageable — if they
> are missing, the article was already skipped in Step 3 of the pipeline before reaching
> this gate. The completeness gate only evaluates documents that already have a valid PDF.
> Fields #1, #2, #5–#10 are stageable — the document goes to the staging queue with AI
> prompts to fill the gaps.

### Completeness Check

```python
def check_completeness(doc: dict) -> tuple[bool, list[str]]:
    """Returns (is_complete, list_of_missing_fields)."""
    missing = []
    
    if not doc.get("arabic_full_title") or len(doc["arabic_full_title"]) < 10:
        missing.append("arabic_full_title")
    
    if not doc.get("arabic_abstract") and not doc.get("english_abstract"):
        missing.append("abstract (arabic or english)")
    
    if not doc.get("content") or len(doc["content"]) < 100:
        missing.append("content (PDF text extraction)")
    
    doc_name = doc.get("document_name", "")
    if not doc_name or not (Path("data/pdfs") / doc_name).exists():
        missing.append("document_name (PDF file)")
    
    if not doc.get("language"):
        missing.append("language")
    
    if not doc.get("tag") or len(doc["tag"]) == 0:
        missing.append("tag (keywords)")
    
    if not doc.get("arabic_fields") or not doc.get("fields_id"):
        missing.append("fields (subject area)")
    
    if not doc.get("textualPublisher"):
        missing.append("textualPublisher (authors)")
    
    if not doc.get("publication_date"):
        missing.append("publication_date")
    
    if not doc.get("arabic_publisher_name") and not doc.get("english_publisher_name"):
        missing.append("publisher_name")
    
    return (len(missing) == 0, missing)
```

### Staging Storage

Staged (incomplete) documents are stored as JSON files in `data/staging/`:

```
data/staging/
├── {slug}.json            # The document data so far
├── {slug}.missing.json    # List of missing fields + AI prompts
└── {slug}.pdf             # Symlink to data/pdfs/{document_name}
```
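
Writing these artifacts can be sketched as follows (a minimal version — `stage_document` here takes the pre-generated prompts as a plain `field → prompt text` mapping, and the symlink step is skipped when no PDF name is set):

```python
import json
import os
from datetime import datetime
from pathlib import Path

def stage_document(doc: dict, missing: list[str], prompts: dict,
                   staging_dir: str = "data/staging") -> None:
    """Write the three staging artifacts: document JSON, missing-fields
    manifest with AI prompts, and a symlink to the local PDF."""
    base = Path(staging_dir)
    base.mkdir(parents=True, exist_ok=True)
    slug = doc["slug"]
    (base / f"{slug}.json").write_text(json.dumps(doc, ensure_ascii=False, indent=2))
    manifest = {
        "slug": slug,
        "missing_fields": missing,
        "prompts": {f: {"prompt": p, "ai_suggestion": None, "manually_filled": False}
                    for f, p in prompts.items()},
        "created_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }
    (base / f"{slug}.missing.json").write_text(json.dumps(manifest, ensure_ascii=False, indent=2))
    pdf_link = base / f"{slug}.pdf"
    if doc.get("document_name") and not pdf_link.exists():
        os.symlink(Path("data/pdfs") / doc["document_name"], pdf_link)
```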

#### staging/{slug}.missing.json format:

```json
{
  "slug": "a1b2c3d4e5f6g7",
  "missing_fields": ["tag (keywords)", "fields (subject area)"],
  "prompts": {
    "tag": {
      "prompt": "You are an academic research metadata specialist...",
      "ai_suggestion": null,
      "manually_filled": false
    },
    "fields": {
      "prompt": "You are an academic research classifier...",
      "ai_suggestion": null,
      "manually_filled": false
    }
  },
  "created_at": "2026-02-22 14:30:00",
  "seed_file": "damascus_cs.yaml",
  "source_url": "https://journal.damascusuniversity.edu.sy/article/view/1234"
}
```

### Staging→Index Workflow

1. User opens staged document in web app
2. Web app shows current data + missing fields highlighted
3. For each missing field:
   - Show the pre-generated prompt
   - Button: "Generate AI suggestion" → calls LLM with the prompt, displays result
   - Button: "Accept" → fills the field with AI suggestion
   - Text input: "Manual entry" → user types the value
4. When all required fields are filled → "Approve & Index" button
5. On approval: run completeness check again → if passed, index to ES and move JSON to `data/indexed/`

---

## 12. Indexing

### Index to ES (Complete Documents Only)

```python
def index_document(es, doc: dict, index_name: str = "ojs_research"):
    """Index a complete document to ES. No database interaction."""
    is_complete, missing = check_completeness(doc)
    if not is_complete:
        raise ValueError(f"Cannot index incomplete document. Missing: {missing}")
    
    es.index(index=index_name, body=doc)
```

### Bulk Indexing

```python
from elasticsearch.helpers import bulk

def bulk_index(es, documents: list[dict], index_name: str = "ojs_research"):
    actions = []
    for doc in documents:
        is_complete, missing = check_completeness(doc)
        if is_complete:
            actions.append({"_index": index_name, "_source": doc})
    
    success, errors = bulk(es, actions, raise_on_error=False)
    return success, errors
```

### Get Next Available ID

```python
def get_next_id(es, index_name: str = "ojs_research") -> int:
    try:
        result = es.search(
            index=index_name,
            body={"size": 1, "sort": [{"id": {"order": "desc"}}], "_source": ["id"]}
        )
        if result["hits"]["hits"]:
            return int(result["hits"]["hits"][0]["_source"]["id"]) + 1
    except Exception:
        pass
    return 1  # First document in empty index
```

### Duplicate Detection

The crawler uses a **multi-signal, cross-index** deduplication strategy to prevent indexing research that already exists — whether it was previously crawled or manually added to the production indices.

#### Indices Checked

Every article is checked against **all three** Elasticsearch indices:
1. `ojs_research` — previously crawled articles
2. `arabic_research` — production Arabic research (includes manually-added articles)
3. `english_research` — production English research (includes manually-added articles)

#### Match Signals (in priority order)

| # | Signal | Query Type | Threshold | Result |
|---|---|---|---|---|
| 1 | **Exact title** | `match_phrase` on title field | Any hit | → **SKIP** (definite duplicate) |
| 2 | **Fuzzy title** | `match` on title with `minimum_should_match: "85%"` | Score ≥ 15 | → **SKIP** (near-duplicate) |
| 3 | **Title + Abstract combo** | `match` on title (75%) + `match` on abstract (75%) | Combined score ≥ 20 | → **SKIP** (same research, different title wording) |

> **Why 85% for fuzzy title?** Arabic research titles often have minor variations: different
> diacritics, ال prefix differences, date formats, punctuation. 85% catches these while avoiding
> false positives on genuinely different papers in the same field.
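
If the variations listed above still slip past the 85% threshold, one option (an assumption, not part of the ES queries below) is to pre-normalize Arabic titles before querying:

```python
import re

# Tashkeel marks (fathatan..sukun) plus superscript alef
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def normalize_arabic_title(title: str) -> str:
    """Strip diacritics, unify alef/taa-marbuta/alef-maqsura variants,
    drop punctuation, and collapse whitespace."""
    title = ARABIC_DIACRITICS.sub("", title)
    title = title.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")
    title = title.replace("ة", "ه").replace("ى", "ي")
    title = re.sub(r"[^\w\s]", " ", title)
    return re.sub(r"\s+", " ", title).strip()
```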

#### Language-Aware Field Selection

| Seed Language | Title Field Checked | Abstract Field Checked |
|---|---|---|
| `ar` | `arabic_full_title` | `arabic_abstract` |
| `en` | `english_full_title` | `english_abstract` |
| Both (fallback) | Both title fields | Both abstract fields |

#### Full Algorithm

```python
# src/crawler/dedup.py

# All indices to check for duplicates (including manually-added research)
DEDUP_INDICES = ["ojs_research", "arabic_research", "english_research"]

def check_duplicate(
    es,
    title: str,
    abstract: str = "",
    language: str = "ar"
) -> dict:
    """
    Multi-signal duplicate detection across all indices.
    
    Returns:
        {
            "is_duplicate": bool,
            "match_type": "exact_title" | "fuzzy_title" | "title_abstract" | None,
            "matched_index": str | None,
            "matched_doc_id": str | None,
            "matched_title": str | None,
            "score": float
        }
    """
    if not title or len(title.strip()) < 10:
        return {"is_duplicate": False, "match_type": None, "matched_index": None,
                "matched_doc_id": None, "matched_title": None, "score": 0.0}
    
    title_field = "arabic_full_title" if language == "ar" else "english_full_title"
    abstract_field = "arabic_abstract" if language == "ar" else "english_abstract"
    
    for index in DEDUP_INDICES:
        try:
            result = _check_index(es, index, title, abstract, title_field, abstract_field)
            if result["is_duplicate"]:
                result["matched_index"] = index
                return result
        except Exception as e:
            # Index may not exist (e.g., fresh setup) — continue
            logger.warning(f"Dedup check failed for index {index}: {e}")
            continue
    
    return {"is_duplicate": False, "match_type": None, "matched_index": None,
            "matched_doc_id": None, "matched_title": None, "score": 0.0}


def _check_index(
    es,
    index: str,
    title: str,
    abstract: str,
    title_field: str,
    abstract_field: str
) -> dict:
    """Run all duplicate signals against a single index."""
    
    base = {"is_duplicate": False, "match_type": None, "matched_index": None,
            "matched_doc_id": None, "matched_title": None, "score": 0.0}
    
    # --- Signal 1: Exact title match (match_phrase) ---
    result = es.search(index=index, body={
        "query": {"match_phrase": {title_field: title}},
        "size": 1,
        "_source": ["id", title_field]
    })
    if result["hits"]["total"]["value"] > 0:
        hit = result["hits"]["hits"][0]
        return {
            "is_duplicate": True,
            "match_type": "exact_title",
            "matched_index": index,
            "matched_doc_id": hit["_id"],
            "matched_title": hit["_source"].get(title_field, ""),
            "score": hit["_score"]
        }
    
    # --- Signal 2: Fuzzy title match (85% of terms must match) ---
    result = es.search(index=index, body={
        "query": {
            "match": {
                title_field: {
                    "query": title,
                    "minimum_should_match": "85%"
                }
            }
        },
        "size": 1,
        "_source": ["id", title_field]
    })
    if result["hits"]["total"]["value"] > 0:
        hit = result["hits"]["hits"][0]
        if hit["_score"] >= 15:
            return {
                "is_duplicate": True,
                "match_type": "fuzzy_title",
                "matched_index": index,
                "matched_doc_id": hit["_id"],
                "matched_title": hit["_source"].get(title_field, ""),
                "score": hit["_score"]
            }
    
    # --- Signal 3: Title (75%) + Abstract (75%) combined ---
    if abstract and len(abstract) > 50:
        result = es.search(index=index, body={
            "query": {
                "bool": {
                    "must": [
                        {"match": {title_field: {
                            "query": title,
                            "minimum_should_match": "75%"
                        }}},
                        {"match": {abstract_field: {
                            "query": abstract,
                            "minimum_should_match": "75%"
                        }}}
                    ]
                }
            },
            "size": 1,
            "_source": ["id", title_field]
        })
        if result["hits"]["total"]["value"] > 0:
            hit = result["hits"]["hits"][0]
            if hit["_score"] >= 20:
                return {
                    "is_duplicate": True,
                    "match_type": "title_abstract",
                    "matched_index": index,
                    "matched_doc_id": hit["_id"],
                    "matched_title": hit["_source"].get(title_field, ""),
                    "score": hit["_score"]
                }
    
    return base
```

#### Dedup in the Pipeline

The duplicate check runs at **Step 2b** — after metadata is crawled but **before** the PDF is downloaded. This saves bandwidth and processing time: if an article is already in the index, there's no point downloading its PDF or calling AI APIs.

```
Crawl metadata → Dedup check → [SKIP if dup] → Download PDF → AI → Gate → Index
```
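The ordering can be sketched as a single per-article function. This is a sketch only — the step callables (`crawl_metadata`, `download_pdf`, `enrich`, `passes_gate`, `stage`, `index_doc`, `log_skip`) are hypothetical stand-ins for the real modules in `src/`, injected here so the ordering itself is the point:

```python
def process_article(es, url, *, crawl_metadata, check_duplicate, download_pdf,
                    enrich, passes_gate, stage, index_doc, log_skip):
    """Run one article through the pipeline; all steps are injected
    (step names are hypothetical stand-ins for the real modules)."""
    meta = crawl_metadata(url)

    # Step 2b: dedup runs BEFORE the PDF download -- a duplicate costs
    # one metadata fetch plus a few ES queries, never bandwidth or AI calls.
    dup = check_duplicate(es, meta["title"], meta.get("abstract", ""))
    if dup["is_duplicate"]:
        log_skip(url, dup)
        return None

    pdf_path = download_pdf(meta)
    doc = enrich(meta, pdf_path)
    if not passes_gate(doc):
        stage(doc)  # completeness gate: hold incomplete docs for review
        return None
    return index_doc(es, doc)
```

Because every step is a parameter, the short-circuit behavior is trivially unit-testable with lambdas.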

#### Dashboard Logging

Every skipped duplicate is logged with:
- Article title and URL
- Match type (`exact_title`, `fuzzy_title`, `title_abstract`)
- Which index it matched against
- The matched document's title (for human verification in the dashboard)

---

## 13. Translation Rules

### Direction: English → Arabic ONLY

| Seed Language | Field | Action |
|---|---|---|
| `ar` (Arabic) | `arabic_full_title` | Use from OJS as-is |
| `ar` | `english_full_title` | Use from OJS if available, otherwise leave empty. **Do NOT translate.** |
| `ar` | `arabic_abstract` | Use from OJS as-is |
| `ar` | `english_abstract` | Use from OJS if available, otherwise leave empty. **Do NOT translate.** |
| `ar` | `content` | Keep Arabic. **NEVER translate.** |
| `en` (English) | `arabic_full_title` | **AI-translate** from English title |
| `en` | `english_full_title` | Use from OJS as-is |
| `en` | `arabic_abstract` | **AI-translate** from English abstract |
| `en` | `english_abstract` | Use from OJS as-is |
| `en` | `content` | Keep English. **NEVER translate.** |
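The table above reduces to a small routing function. A minimal sketch, assuming `ai_translate` is the project's English→Arabic LLM helper (the field names match the document schema; only title and abstract are ever translated):

```python
def apply_translation_rules(doc: dict, seed_language: str, ai_translate) -> dict:
    """Fill Arabic title/abstract per the translation rules.

    ai_translate is an assumed English->Arabic helper; `content` is
    never touched, and Arabic seeds are never translated at all.
    """
    if seed_language == "en":
        # English seeds: AI-translate title and abstract to Arabic
        if not doc.get("arabic_full_title") and doc.get("english_full_title"):
            doc["arabic_full_title"] = ai_translate(doc["english_full_title"])
        if not doc.get("arabic_abstract") and doc.get("english_abstract"):
            doc["arabic_abstract"] = ai_translate(doc["english_abstract"])
    # seed_language == "ar": use OJS values as-is; English fields stay
    # empty if OJS doesn't provide them (never translate Arabic->English).
    return doc
```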

### Translation Prompt

```
Translate the following academic text from English to Arabic.
Maintain formal academic register. Preserve technical terms where appropriate.
Do not add or remove content.

Text to translate:
---
{text}
---

Return ONLY the Arabic translation, nothing else.
```
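In code, the template can live in `prompt_templates.py` as a constant and be filled per call. A sketch — note that academic text can contain literal braces, so plain `str.format` is avoided:

```python
TRANSLATION_PROMPT = """Translate the following academic text from English to Arabic.
Maintain formal academic register. Preserve technical terms where appropriate.
Do not add or remove content.

Text to translate:
---
{text}
---

Return ONLY the Arabic translation, nothing else."""

def build_translation_prompt(text: str) -> str:
    # str.format() would raise on literal braces in the source text
    # (e.g. "{x}" in a math-heavy abstract), so substitute with replace().
    return TRANSLATION_PROMPT.replace("{text}", text)
```

Feeding the result to the LLM is then a single chat-completion call (e.g. `client.messages.create(...)` for `anthropic`); the exact model name belongs in config, not here.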

### What Is NEVER Translated

- Full text (`content`) — always kept in original language
- Author names (`textualPublisher`)
- Keywords/tags — generate bilingual keywords via AI instead of translating
- References
- Publisher names (use seed config or OJS metadata)

---

## 14. Project Structure

```
ojs-crawler/
├── README.md
├── requirements.txt
├── .env                          # ES creds, API keys, config
├── .env.example                  # Template without secrets
│
├── seeds/                        # Seed file storage
│   ├── active/                   # Currently crawled
│   ├── paused/                   # Temporarily paused
│   ├── completed/                # Fully crawled
│   └── templates/                # Seed templates
│       ├── ojs3_arabic.yaml
│       └── ojs2_english.yaml
│
├── data/                         # Runtime data (gitignored)
│   ├── pdfs/                     # Downloaded PDFs
│   ├── staging/                  # Incomplete documents (JSON)
│   ├── indexed/                  # Successfully indexed (JSON archive)
│   └── logs/                     # Crawl logs
│
├── webapp/                       # Web management interface
│   ├── __init__.py
│   ├── app.py                    # Flask/FastAPI app factory
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── dashboard.py          # / — stats dashboard
│   │   ├── seeds.py              # /seeds — CRUD for seed files
│   │   ├── staging.py            # /staging — review incomplete docs
│   │   ├── index_browse.py       # /index — browse indexed docs
│   │   └── crawls.py             # /crawls — crawl history
│   ├── templates/
│   │   ├── base.html             # Layout with sidebar nav
│   │   ├── dashboard.html
│   │   ├── seeds/
│   │   │   ├── list.html
│   │   │   ├── form.html
│   │   │   └── detail.html
│   │   ├── staging/
│   │   │   ├── list.html
│   │   │   └── review.html       # Missing fields + AI prompts UI
│   │   └── index/
│   │       ├── list.html
│   │       └── detail.html
│   └── static/
│       ├── css/
│       └── js/
│
├── src/                          # Core crawler logic
│   ├── __init__.py
│   ├── crawler/
│   │   ├── __init__.py
│   │   ├── ojs3_client.py        # OJS 3.x REST API client
│   │   ├── ojs2_scraper.py       # OJS 2.x HTML scraper
│   │   ├── pdf_downloader.py     # Download + validate PDFs
│   │   └── dedup.py              # Duplicate detection
│   ├── processor/
│   │   ├── __init__.py
│   │   ├── text_extractor.py     # PDF → text (PyMuPDF)
│   │   ├── ai_processor.py       # LLM calls (translate, summarize, etc.)
│   │   ├── field_classifier.py   # Auto-classify subject field
│   │   ├── keyword_generator.py  # Generate missing keywords
│   │   └── prompt_templates.py   # All prompt templates
│   ├── indexer/
│   │   ├── __init__.py
│   │   ├── es_client.py          # Elasticsearch connection + ops
│   │   ├── document_builder.py   # Build ES-ready document
│   │   ├── validator.py          # Completeness check
│   │   └── staging.py            # Stage incomplete docs + generate prompts
│   ├── models/
│   │   ├── __init__.py
│   │   ├── article.py            # Article dataclass
│   │   └── seed.py               # Seed dataclass
│   └── utils/
│       ├── __init__.py
│       ├── slug.py               # Slug generation
│       ├── date_parser.py        # Date format normalization
│       └── logger.py             # Logging setup
│
├── scripts/
│   ├── crawl.py                  # CLI: Run crawl for active seeds
│   ├── create_index.py           # CLI: Create the ojs_research index
│   ├── stage_review.py           # CLI: Review staged docs (no web)
│   └── migrate_to_production.py  # CLI: Copy ojs_research docs to arabic_research (future)
│
└── tests/
    ├── test_crawler.py
    ├── test_processor.py
    ├── test_indexer.py
    └── test_validator.py
```

---

## 15. Configuration

### .env

```bash
# Elasticsearch
ES_HOST=https://shamraindex:9200
ES_USER=elastic
ES_PASSWORD=<ES_PASSWORD>
ES_INDEX=ojs_research

# GROBID (Docker service)
GROBID_URL=http://grobid:8070
GROBID_TIMEOUT=120

# AI API (pick one)
ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...

# Web App
FLASK_SECRET_KEY=change-me-random-string
FLASK_PORT=5000
FLASK_DEBUG=false

# Crawler Defaults
CRAWLER_USER_ID=0
DEFAULT_RATE_LIMIT=2
PDF_LOCAL_DIR=./data/pdfs
STAGING_DIR=./data/staging

# Deduplication
DEDUP_INDICES=ojs_research,arabic_research,english_research
DEDUP_FUZZY_THRESHOLD=85
DEDUP_SCORE_THRESHOLD=15

# NO DATABASE CONFIGURATION
# This crawler has ZERO database interaction
```
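A minimal, stdlib-only sketch of reading these values into a typed config object (in the real app, `python-dotenv`'s `load_dotenv()` would populate `os.environ` from `.env` first; the key names and defaults mirror the file above):

```python
import os
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    es_host: str
    es_index: str
    dedup_indices: list
    dedup_score_threshold: int

def load_config(env=os.environ) -> CrawlerConfig:
    # env is injectable for testing; defaults match the .env template.
    return CrawlerConfig(
        es_host=env.get("ES_HOST", "https://shamraindex:9200"),
        es_index=env.get("ES_INDEX", "ojs_research"),
        dedup_indices=[s.strip() for s in env.get(
            "DEDUP_INDICES", "ojs_research,arabic_research,english_research"
        ).split(",") if s.strip()],
        dedup_score_threshold=int(env.get("DEDUP_SCORE_THRESHOLD", "15")),
    )
```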

### requirements.txt

```
# Elasticsearch
elasticsearch>=7.17,<8.0

# Web framework
flask>=3.0
flask-wtf>=1.2        # Forms

# OAI-PMH harvesting
sickle>=0.7.0          # OAI-PMH client (preferred harvesting method)

# Crawling
requests>=2.31
beautifulsoup4>=4.12
lxml>=5.1              # XML/HTML parsing (also for GROBID TEI output)

# PDF processing
PyMuPDF>=1.23          # Fallback text extraction
# GROBID runs as a Docker service (no Python package needed)
# The grobid_client.py module uses requests to call the GROBID REST API

# AI
anthropic>=0.18
# openai>=1.0         # Alternative

# Config
python-dotenv>=1.0
pyyaml>=6.0

# Utilities
python-dateutil>=2.9
```

---

## 16. Reference: Production Index Mapping

The `ojs_research` index uses the **exact same mapping** as the production `arabic_research` index (defined in Section 6). This ensures that if/when documents are migrated to the production index, they are already in the correct format.

Key analyzers:
- `arabic_full_title`, `arabic_abstract` → use ES built-in `arabic` analyzer
- `english_full_title`, `english_abstract`, `content` → use default `standard` analyzer
- No custom analyzers are defined

Index settings:
- 1 shard, 0 replicas (single-node cluster)
- ES version 7.17.x

---

## 17. Reference: Sample Document

An actual document from the production `arabic_research` index showing the expected format:

```json
{
  "id": 10309,
  "arabic_full_title": "مزج نجاح المهمة ورضا المستخدم: تحليل سلوك الحوار المستفاد مع مكافآت متعددة",
  "english_full_title": "Blending Task Success and User Satisfaction: Analysis of Learned Dialogue Behaviour with Multiple Rewards",
  "arabic_abstract": "في الآونة الأخيرة، تستخدم مكونات المكافآت الرئيسية ...",
  "english_abstract": "Recently, principal reward components for dialogue policy...",
  "textualPublisher": "Ultes Stefan,Maier Wolfgang",
  "research_references": "a:1:{i:0;s:25:\"https://aclanthology.org/\";}",
  "tag": ["task success", "multiple rewards", "نجاح المهمة", "مكافآت متعددة"],
  "tag_id": ["50941", "50942", "50943", "50944"],
  "arabic_fields": ["الذكاء الاصناعي"],
  "english_fields": ["Artificial Intelligence"],
  "fields_id": ["107"],
  "arabic_publisher_name": "جمعية اللغويات الحاسوبية ACL",
  "english_publisher_name": "Association for Computation Linguistics",
  "publisher_id": 62,
  "arabic_category_name": "مقالة",
  "english_category_name": "Article",
  "category_id": 2,
  "createdAt": "2022-04-01 18:30:58",
  "publication_date": "2021-07-01 00:00:00",
  "downloads": 0,
  "rate": 0.0,
  "raters": 0,
  "hits": 0,
  "deleted": false,
  "related_researches": "{\"0\": 10312, \"1\": 10154, \"2\": 10848}",
  "document_name": "ultes-maier-2021-blending.pdf",
  "slug": "3a9417cc9cf86d",
  "cites": 0,
  "creator_id": 6251,
  "updater_id": 6251,
  "language": "English",
  "content": "Proceedings of the 22nd Annual Meeting..."
}
```

---

## 18. Reference: Known Field/Category IDs

### Fields (Subject Areas)

| ID | Arabic Name | English Name |
|---|---|---|
| 101 | علوم الحاسوب | Computer Science |
| 102 | الرياضيات | Mathematics |
| 103 | الفيزياء | Physics |
| 104 | الكيمياء | Chemistry |
| 105 | الأحياء | Biology |
| 106 | الطب | Medicine |
| 107 | الذكاء الاصناعي | Artificial Intelligence |
| 108 | الهندسة | Engineering |
| 109 | الأدب | Literature |
| 110 | القانون | Law |
| 111 | الاقتصاد | Economics |
| 112 | التربية | Education |
| 113 | الزراعة | Agriculture |
| 114 | الصيدلة | Pharmacy |

> To get the full list, query the production index: `GET arabic_research/_search {"size":0, "aggs": {"fields": {"terms": {"field": "fields_id", "size": 200}}}}`

### Categories

| ID | Arabic | English | Typical Use |
|---|---|---|---|
| 1 | بحث | Research | General research papers |
| 2 | مقالة | Article | **Default for OJS journal articles** |
| 3 | رسالة ماجستير | Master Thesis | |
| 4 | أطروحة دكتوراه | PhD Thesis | |
| 5 | كتاب | Book | |

---

## 19. Consistency Risks & Mitigations

### How the Shamra App Handles ES-Only Documents

The Shamra app **already fully supports** documents that exist only in Elasticsearch with no corresponding MySQL rows. This has been verified in the codebase:

**Search results** (`ElasticResearch.php`):
- The DTO stores text fallback values (`arabicPublisherName`, `englishPublisherName`, `arabicCategoryName`, `englishCategoryName`, `arabicFieldNames`, `englishFieldNames`) directly from ES _source **before** attempting DB lookups.
- DB lookups for Publisher, ResearchCategory, Field, Tag, and User are all guarded with null checks — if the DB row doesn't exist, the code continues gracefully using the ES text values.
- Tags fall back to raw string names when `tag_id` doesn't match any DB row.

**Research detail page** (`ResearchController.php`):
- The show action tries DB lookup first, then falls back to an **ES-only code path** that renders `show_elastic.html.twig` — a dedicated template for documents with no DB row.
- Hit counter updates happen directly in ES (no DB write).
- Tag interest tracking accepts both Tag entities and raw strings.

**Therefore**: Crawler documents indexed into the production `arabic_research` index will display correctly in both search results and detail pages without any DB rows.

### Risk Assessment

| # | Risk | Severity | Mitigation |
|---|---|---|---|
| 1 | **ID collisions at migration** — `id` starting at 1 in `ojs_research` would clash with production IDs (1–12,839+) if migrated | Medium | Start IDs at 1,000,000. For slugs, use `uuid4().hex[:14]` which won't collide with production `uniqid()` slugs. |
| 2 | **Mapping drift** — If production `arabic_research` mapping changes, `ojs_research` won't auto-update | Medium | Add `scripts/sync_mapping.py` that pulls production mapping and compares. Run before each crawl batch. |
| 3 | **PDF not on production server** — `document_name` references files in crawler's local `data/pdfs/`, but the production app checks `public/uploads/documents/` | Medium | This is fine while using `ojs_research` locally. At migration time, PDFs must be copied to the server. The detail page already handles missing PDFs gracefully (`documentFound` flag). |
| 4 | **Cross-index duplicates** — Crawled articles may already exist in production indices (added manually or from other sources) | Low | **Resolved**: Multi-signal duplicate detection checks all 3 indices (`ojs_research`, `arabic_research`, `english_research`) with exact, fuzzy, and title+abstract matching. See [Duplicate Detection](#duplicate-detection). |
| 5 | **Stale reference IDs** — Field/category IDs hardcoded in the plan may change in production | Low | At startup, query production ES via aggregations to get current valid field/category IDs. Cache locally. |
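The mitigation for risk #5 — fetching current field IDs via aggregation at startup — can be sketched as follows (this mirrors the console query given in Section 18; the function name is illustrative):

```python
def fetch_valid_field_ids(es, index="arabic_research"):
    """Return the set of fields_id values currently used in production.

    Run at startup and cached locally so the crawler never assigns a
    stale field ID from the hardcoded reference table.
    """
    resp = es.search(index=index, body={
        "size": 0,
        "aggs": {"fields": {"terms": {"field": "fields_id", "size": 200}}},
    })
    return {b["key"] for b in resp["aggregations"]["fields"]["buckets"]}
```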

### Cross-Index Duplicate Detection

See the comprehensive [Duplicate Detection](#duplicate-detection) section under Indexing (Section 11). The `check_duplicate()` function queries all three indices (`ojs_research`, `arabic_research`, `english_research`) with three match signals: exact title phrase, fuzzy title (85%), and title+abstract combo. This catches both previously-crawled articles and manually-added research.

### Mapping Sync Script

```python
def sync_mapping(es):
    """Compare the ojs_research mapping against production arabic_research.

    Reports fields missing from ojs_research, plus fields present in both
    whose type has drifted from production.
    """
    prod = es.indices.get_mapping(index="arabic_research")
    local = es.indices.get_mapping(index="ojs_research")

    prod_props = prod["arabic_research"]["mappings"]["properties"]
    local_props = local["ojs_research"]["mappings"]["properties"]

    missing = set(prod_props) - set(local_props)
    if missing:
        print(f"WARNING: ojs_research is missing fields: {sorted(missing)}")
        # Optionally auto-add missing fields via the put_mapping API

    # A field can also exist in both indices with a different type
    drifted = sorted(
        name for name in set(prod_props) & set(local_props)
        if prod_props[name].get("type") != local_props[name].get("type")
    )
    if drifted:
        print(f"WARNING: field type drift vs production: {drifted}")

    return missing, drifted
```

### Future Migration Path

When ready to move `ojs_research` data into the production `arabic_research` index:

1. **No DB rows needed** — the app renders ES-only documents with `show_elastic.html.twig`
2. **Copy PDFs** to `/var/www/html/academia_v2/public/uploads/documents/` on the production server
3. **Remap IDs** — reassign `id` values to continue from production's max ID
4. **Re-index** — use `_reindex` API or bulk copy from `ojs_research` to `arabic_research`
5. **Verify** — spot-check a few documents via the web UI
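Steps 3 and 4 can be combined in a single `_reindex` call with a painless script that shifts IDs. A sketch under the assumption that the ID offset is computed at migration time from production's max ID (the helper name is illustrative; `_reindex` scripts are permitted to modify `ctx._id`):

```python
def build_reindex_body(id_offset: int) -> dict:
    """Build the _reindex request body for ojs_research -> arabic_research.

    id_offset: computed at migration time so crawler ids continue past
    the production max id (step 3), keeping _id in sync with the id field.
    """
    return {
        "source": {"index": "ojs_research"},
        "dest": {"index": "arabic_research"},
        "script": {
            "lang": "painless",
            "source": (
                "ctx._source.id += params.offset;"
                " ctx._id = String.valueOf(ctx._source.id)"
            ),
            "params": {"offset": id_offset},
        },
    }

# The elasticsearch-py 7.x client would then run, e.g.:
#   es.reindex(body=build_reindex_body(offset), wait_for_completion=False)
# which returns a task id that can be polled for long-running copies.
```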

---

## Summary of Key Rules

1. **NO DATABASE** — The crawler never connects to, reads from, or writes to MySQL. Period.
2. **ES-only** — All data stored in the `ojs_research` Elasticsearch index (new, empty index).
3. **Production-compatible** — The Shamra app already supports ES-only documents with no DB rows. Text fallback values in the ES document (publisher names, category names, field names) are used when DB lookups return null.
4. **Seed-driven** — Every crawl is initiated from a seed file. The web app manages seeds.
5. **Bilingual seeds** — Each seed specifies `language: "ar"` or `"en"`. Both are fully supported.
6. **Arabic priority** — Arabic data is preserved as-is. English title+abstract are translated TO Arabic via AI. Never translate Arabic→English. Never translate full text.
7. **No PDF = skip** — If the crawler confirms no PDF exists for an article, the article is skipped entirely (logged, not staged). Only documents that have a PDF but are missing metadata fields go to the staging queue.
8. **Deduplicated** — Every article is checked against all three ES indices (`ojs_research`, `arabic_research`, `english_research`) before processing. Multi-signal matching (exact title, fuzzy title at 85%, title+abstract combo) catches duplicates from previous crawls AND manually-added research. Duplicates are skipped before PDF download to save bandwidth.
9. **Completeness gate** — Documents with PDF but missing metadata go to staging with AI prompts to fill gaps. Documents with all required fields are indexed directly.
10. **Web interface** — Flask/FastAPI app for seed management, staging review, and index browsing.
11. **Same mapping** — The `ojs_research` index uses the identical mapping as production `arabic_research`, ensuring future migration compatibility.
