# Expand Index — Finding New OJS Journals to Crawl

> **Agent**: Use the `seed-scout` agent mode for this workflow.
> **Frequency**: Run quarterly or when search CTR data shows coverage gaps.

---

## Overview

This document describes the process of identifying under-represented topics in the Shamra Academia index and finding new OJS journals to fill those gaps. The workflow uses real user search data (queries without clicks, zero-result queries) to drive decisions.

---

## Step-by-Step Workflow

### 1. Extract Queries Without Clicks from Prod

These are searches where users got results but nothing was relevant enough to click.

```bash
ssh -i /tmp/shamramain_user.pem azureuser@20.241.4.71 \
  "cd /var/www/html/academia_v2 && sudo -u www-data php bin/console dbal:run-sql --env=prod --force-fetch \
  \"SELECT sq.query_text, sq.results_count, sq.search_type, COUNT(*) as search_count \
    FROM search_query_log sq \
    LEFT JOIN search_click_log scl ON sq.id = scl.search_query_id \
    WHERE scl.id IS NULL AND sq.results_count > 0 \
    GROUP BY sq.query_text, sq.results_count, sq.search_type \
    ORDER BY search_count DESC LIMIT 50\""
```

### 2. Extract Zero-Result Queries

These are total coverage gaps — users searched and got nothing.

```bash
ssh -i /tmp/shamramain_user.pem azureuser@20.241.4.71 \
  "cd /var/www/html/academia_v2 && sudo -u www-data php bin/console dbal:run-sql --env=prod --force-fetch \
  \"SELECT sq.query_text, sq.search_type, COUNT(*) as search_count \
    FROM search_query_log sq \
    LEFT JOIN search_click_log scl ON sq.id = scl.search_query_id \
    WHERE scl.id IS NULL AND sq.results_count = 0 \
    GROUP BY sq.query_text, sq.search_type \
    ORDER BY search_count DESC LIMIT 30\""
```

> **Note**: Filter out SQL injection attempts (`Mr.`, `sleep()`, `DBMS_PIPE`, `XOR`, etc.) — these are attack payloads, not real searches.
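
The filtering described in the note can be scripted once the query rows are exported. The following is a minimal sketch, not the project's actual tooling: the pattern list covers only the markers named above, and the `(query_text, search_count)` row shape and both function names are assumptions made for illustration.

```python
import re

# Markers taken from the note above; extend as new payload shapes show up.
# "Mr." is matched case-sensitively, the rest case-insensitively.
ATTACK_PATTERNS = [
    re.compile(r"Mr\."),
    re.compile(r"sleep\s*\(", re.IGNORECASE),
    re.compile(r"DBMS_PIPE", re.IGNORECASE),
    re.compile(r"\bXOR\b", re.IGNORECASE),  # word boundary avoids hits inside longer words
]


def is_attack_payload(query_text: str) -> bool:
    """Return True when the query contains any known attack marker."""
    return any(p.search(query_text) for p in ATTACK_PATTERNS)


def keep_real_queries(rows):
    """Drop rows whose query text looks like an injection attempt.

    `rows` is assumed to be an iterable of (query_text, search_count) pairs.
    """
    return [(q, n) for q, n in rows if not is_attack_payload(q)]
```

A pattern list like this will miss payloads it has not seen, so the surviving rows still deserve a quick manual scan.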

### 3. Test Queries Against Elasticsearch

For each legitimate query, check how many AND-match and OR-match results exist:

```bash
ssh -i /tmp/shamramain_user.pem azureuser@20.241.4.71 'bash -s' << 'SCRIPT'
queries=(
  "query 1 here"
  "query 2 here"
)

for q in "${queries[@]}"; do
  count=$(curl -sk -u elastic:1oYT-qgBqrF3e+ZQ5OhP \
    "https://shamraindex:9200/arabic_research/_search" \
    -H "Content-Type: application/json" \
    -d "{\"size\":0,\"query\":{\"multi_match\":{\"query\":\"$q\",\"fields\":[\"arabic_full_title\",\"content\"],\"operator\":\"and\"}}}" \
    2>/dev/null | python3 -c "import sys,json;d=json.load(sys.stdin);print(d['hits']['total']['value'])" 2>/dev/null)
  count_or=$(curl -sk -u elastic:1oYT-qgBqrF3e+ZQ5OhP \
    "https://shamraindex:9200/arabic_research/_search" \
    -H "Content-Type: application/json" \
    -d "{\"size\":0,\"query\":{\"multi_match\":{\"query\":\"$q\",\"fields\":[\"arabic_full_title\",\"content\"]}}}" \
    2>/dev/null | python3 -c "import sys,json;d=json.load(sys.stdin);print(d['hits']['total']['value'])" 2>/dev/null)
  echo "AND=$count | OR=$count_or | $q"
done
SCRIPT
```

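- **AND=0**: No document contains every query term. Strong signal for a gap.
- **AND < 5**: Very weak coverage.
- **High OR, low AND**: Documents share individual terms with the query, but few cover the whole topic.

> **Note**: `hits.total.value` is capped at 10,000 by default. Add `"track_total_hits":true` to the request body if exact counts above that matter.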
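
The curl calls above splice `$q` straight into the JSON body, so a query containing a double quote or backslash produces invalid JSON. One way around that is to build the body with `json.dumps`. The sketch below also turns the interpretation rules into a function; the `high_or` and `low_and` thresholds are assumptions (the document only fixes `AND=0` and `AND < 5`), and both function names are hypothetical.

```python
import json

FIELDS = ["arabic_full_title", "content"]  # same fields as the curl calls above


def build_count_body(query_text: str, operator: str = "or") -> str:
    """Serialize a size-0 multi_match body; json.dumps escapes quotes safely."""
    multi_match = {"query": query_text, "fields": FIELDS, "operator": operator}
    body = {"size": 0, "query": {"multi_match": multi_match}}
    return json.dumps(body, ensure_ascii=False)


def coverage_signal(and_count: int, or_count: int,
                    high_or: int = 100, low_and: int = 20) -> str:
    """Label a query using the AND/OR rules listed above.

    high_or and low_and are illustrative cut-offs, not values from the runbook.
    """
    if and_count == 0:
        return "gap"
    if and_count < 5:
        return "very weak"
    if or_count >= high_or and and_count < low_and:
        return "loosely related"
    return "covered"
```

The string returned by `build_count_body` can be passed to `curl -d` in place of the hand-written body.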

### 4. Check Current ES Field Distribution

```bash
ssh -i /tmp/shamramain_user.pem azureuser@20.241.4.71 \
  'curl -sk -u elastic:1oYT-qgBqrF3e+ZQ5OhP \
  "https://shamraindex:9200/arabic_research/_search" \
  -H "Content-Type: application/json" \
  -d "{\"size\":0,\"aggs\":{\"fields_en\":{\"terms\":{\"field\":\"english_fields\",\"size\":30}},\"fields_ar\":{\"terms\":{\"field\":\"arabic_fields\",\"size\":30}},\"publishers_ar\":{\"terms\":{\"field\":\"arabic_publisher_name\",\"size\":30}}}}"' \
  | python3 -m json.tool
```

Total doc count:

```bash
ssh -i /tmp/shamramain_user.pem azureuser@20.241.4.71 \
  'curl -sk -u elastic:1oYT-qgBqrF3e+ZQ5OhP \
  "https://shamraindex:9200/arabic_research/_count"'
```

> **Note on field types**: `arabic_fields`, `english_fields`, `arabic_publisher_name`, `english_publisher_name` are `keyword` type (not text). Do NOT append `.keyword` in aggregations.
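
To compare fields side by side, the aggregation response can be flattened into a plain mapping. This is a small sketch that assumes the standard Elasticsearch `terms` aggregation response shape (`aggregations.<name>.buckets[]` with `key` and `doc_count`); the function name is hypothetical.

```python
def bucket_counts(response: dict, agg_name: str) -> dict:
    """Map each bucket key of one terms aggregation to its doc_count.

    Returns an empty dict when the aggregation is missing from the response.
    """
    buckets = (
        response.get("aggregations", {})
        .get(agg_name, {})
        .get("buckets", [])
    )
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

For the request above, `agg_name` would be one of `fields_en`, `fields_ar`, or `publishers_ar`.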

### 5. Cross-Reference: Gaps Analysis

Build a table comparing:

| Field | ES Docs | User Demand (searches) | Coverage Level |
|-------|---------|------------------------|----------------|
| ...   | ...     | ...                    | Strong/Weak/None |

Fields with a **low doc count and high search demand** are the priority targets.
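
One possible way to order the table is by searches per indexed document. The scoring formula below is an assumption chosen for illustration, not a rule from this workflow, and the function name is hypothetical.

```python
def rank_gaps(doc_counts: dict, demand: dict) -> list:
    """Sort fields by searches per indexed document, highest first.

    doc_counts maps field -> ES doc count; demand maps field -> search count.
    Adding 1 to the doc count avoids division by zero for unindexed fields.
    """
    scored = []
    for field, searches in demand.items():
        docs = doc_counts.get(field, 0)
        scored.append((field, docs, searches, searches / (docs + 1)))
    return sorted(scored, key=lambda row: row[3], reverse=True)
```

A field with no documents and any demand at all rises to the top, which matches the intent of treating total coverage gaps as the strongest signal.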

### 6. Find OJS Journals to Fill Gaps

**Where to search:**

- **Iraqi university OJS portals** (most reliable, Arabic-first):
  - `https://*.uobaghdad.edu.iq/index.php/*` — University of Baghdad (largest collection)
  - `https://journals.uokufa.edu.iq/index.php/*` — University of Kufa
  - `https://*.mosuljournals.com/` — University of Mosul
- **Kuwait University**: `https://journals.ku.edu.kw/*/index.php/*`
- **DOAJ search**: `https://doaj.org/search/journals?source={"query":{"query_string":{"query":"arabic {field}"}}}`
- **Google Scholar**: `"مجلة" "ojs" site:*.edu.iq OR site:*.edu.kw`
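
The DOAJ URL above embeds raw JSON, which needs percent-encoding before it can be pasted into a script or passed to `curl`. A minimal sketch, assuming the `source` parameter format shown in the list is still what DOAJ accepts; the function name is hypothetical.

```python
import json
from urllib.parse import quote

DOAJ_SEARCH = "https://doaj.org/search/journals"


def doaj_search_url(field: str) -> str:
    """Build the DOAJ journal search URL for 'arabic {field}'."""
    source = {"query": {"query_string": {"query": f"arabic {field}"}}}
    # safe="" makes quote() encode every reserved character, including "/".
    return f"{DOAJ_SEARCH}?source={quote(json.dumps(source), safe='')}"
```

For example, `doaj_search_url("law")` yields a URL whose `source` value starts with `%7B%22query%22`.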

**Verification checklist for each candidate:**

1. **Is it OJS?** — Look for "Platform and Workflow by OJS/PKP" in footer
2. **Is it reachable?** — `curl -sk -o /dev/null -w "%{http_code}" {URL}`
3. **Is it open access?** — Check for CC license badge
4. **Is it Arabic?** — Check article titles/abstracts language
5. **How many articles?** — Browse issue archive, estimate total
6. **Does it have OAI-PMH?** — Test `{base_url}/oai?verb=Identify`
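
For the OAI-PMH check, a `200` status alone is not conclusive, since a site can return an HTML error page with that code. The sketch below parses the response body and looks for the `Identify` element in the OAI-PMH 2.0 namespace. The function name is hypothetical, and fetching the body is left to whatever HTTP client is already in use.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_identify(xml_text: str):
    """Return (repositoryName, baseURL) from an Identify response.

    Returns None when the body is not XML or has no OAI-PMH Identify element.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        return None
    name = identify.findtext("oai:repositoryName", default="", namespaces=OAI_NS)
    base = identify.findtext("oai:baseURL", default="", namespaces=OAI_NS)
    return name, base
```

A `None` result means the endpoint did not answer as an OAI-PMH repository, even if the HTTP request itself succeeded.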

### 7. Verify Reachability (Batch)

```bash
urls=(
  "https://journals.ku.edu.kw/jol/index.php/jol"
  "https://jols.uobaghdad.edu.iq/index.php/jols"
  # ... add more
)

for url in "${urls[@]}"; do
  code=$(curl -sk -o /dev/null -w "%{http_code}" --connect-timeout 10 "$url")
  echo "$code | $url"
done
```

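- `200` = good
- `301`/`302` = redirect; re-run with `curl -L` to check the final URL
- `403` = blocked/suspended
- `000` = no HTTP response (connection refused, DNS failure, or timeout)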
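
With a long URL list, the `code | url` lines printed by the loop can be grouped automatically. This is an illustrative sketch: the bucket names and the choice to send every code other than `200` and `403` to a retry bucket are assumptions, not part of the workflow.

```python
def triage(lines):
    """Group 'code | url' lines into keep / blocked / retry buckets."""
    buckets = {"keep": [], "blocked": [], "retry": []}
    for line in lines:
        code, sep, url = line.partition(" | ")
        if not sep:
            continue  # skip lines that are not in 'code | url' form
        code, url = code.strip(), url.strip()
        if code == "200":
            buckets["keep"].append(url)
        elif code == "403":
            buckets["blocked"].append(url)
        else:
            buckets["retry"].append(url)  # 000, redirects, 5xx, and so on
    return buckets
```

Saving the loop output to a file and passing its lines to `triage` gives a quick shortlist to carry into the verification checklist.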

---

## ES Field Mapping Reference

| Field Name | Type | Use in Aggregation |
|------------|------|--------------------|
| `arabic_fields` | keyword | `{"terms":{"field":"arabic_fields","size":30}}` |
| `english_fields` | keyword | `{"terms":{"field":"english_fields","size":30}}` |
| `arabic_publisher_name` | keyword | `{"terms":{"field":"arabic_publisher_name","size":30}}` |
| `english_publisher_name` | keyword | `{"terms":{"field":"english_publisher_name","size":30}}` |
| `arabic_full_title` | text with `.keyword` subfield | Search on `arabic_full_title`; aggregate on `arabic_full_title.keyword` |
| `tag` | text with `.keyword` subfield | Aggregate on `tag.keyword` |

---

## Academic Field IDs

| ID | Arabic | English |
|----|--------|---------|
| 101 | علوم الحاسوب | Computer Science |
| 102 | الرياضيات | Mathematics |
| 103 | الفيزياء | Physics |
| 104 | الكيمياء | Chemistry |
| 105 | الأحياء | Biology |
| 106 | الطب | Medicine |
| 107 | الذكاء الاصطناعي | Artificial Intelligence |
| 108 | الهندسة | Engineering |
| 109 | الأدب | Literature |
| 110 | القانون | Law |
| 111 | الاقتصاد | Economics |
| 112 | التربية | Education |
| 113 | الزراعة | Agriculture |
| 114 | الصيدلة | Pharmacy |

---

## Previous Runs

### April 5, 2026

**Index size**: 31,327 docs

**Top gaps identified** (low docs + high search demand):

| Field | ES Docs | Coverage Level |
|-------|---------|-----------|
| Law | 275 | Very Weak |
| Business Mgmt | 135 | Very Weak |
| Agriculture | 498 | Weak |
| Chemistry | 376 | Weak |
| Literature | 781 | Weak |
| Nursing | 0 (under Medicine) | None |

**7 journals recommended** (all verified OJS, reachable, open access):

| # | Journal | URL | Field |
|---|---------|-----|-------|
| 1 | مجلة الحقوق — جامعة الكويت | `https://journals.ku.edu.kw/jol/index.php/jol` | Law (110) |
| 2 | مجلة العلوم القانونية — جامعة بغداد | `https://jols.uobaghdad.edu.iq/index.php/jols` | Law (110) |
| 3 | مجلة العلوم الزراعية العراقية — جامعة بغداد | `https://jcoagri.uobaghdad.edu.iq/index.php/intro` | Agriculture (113) |
| 4 | مجلة كلية الآداب — جامعة بغداد | `https://jcoart.uobaghdad.edu.iq/index.php/jcoart` | Literature (109) |
| 5 | مجلة التمريض العراقية | `https://injns.uobaghdad.edu.iq/index.php/INJNS` | Medicine (106) |
| 6 | مجلة العلوم الاقتصادية والإدارية — جامعة بغداد | `https://jeasiq.uobaghdad.edu.iq/index.php/JEASIQ` | Economics (111) |
| 7 | المجلة العراقية للهندسة الكيميائية وهندسة النفط | `https://ijcpe.uobaghdad.edu.iq/index.php/ijcpe` | Chemistry (104) |

**3 journals rejected** (unreachable):

| URL | Issue |
|-----|-------|
| `https://journals.uokufa.edu.iq/index.php/kjas` | Account suspended (403) |
| `https://aja.journals.ekb.eg/` | Connection refused — not OJS (EKB platform) |
| `https://journals.ju.edu.jo/index.php/JJC` | Connection refused — retry later |

---

## Seed YAML Template

When the OJS crawler is built, use this template for new seeds:

```yaml
seed:
  name: "اسم المجلة بالعربية"
  name_en: "English Journal Name"
  base_url: "https://..."
  language: "ar"
  platform: "ojs"
  default_field_id: 110
  default_category: "article"
  seed_tags: []
  publisher_name_en: "University Name"
  publisher_name_ar: "اسم الجامعة"
  crawl:
    start_from: "2020-01-01"
```
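
Until the crawler exists, seeds can be generated from the recommended-journal table rather than typed by hand. The following is a sketch under stated assumptions: `render_seed` is a hypothetical helper, it emits exactly the keys of the template above, and it relies on JSON string quoting being valid YAML double-quoted syntax for ordinary names and URLs.

```python
import json

# IDs 101-114 from the Academic Field IDs table.
VALID_FIELD_IDS = range(101, 115)


def render_seed(name_ar, name_en, base_url, field_id,
                publisher_en, publisher_ar, start_from="2020-01-01"):
    """Render one seed in the template's layout; reject unknown field IDs."""
    if field_id not in VALID_FIELD_IDS:
        raise ValueError(f"unknown field id: {field_id}")

    def q(text):
        # JSON quoting escapes embedded quotes and keeps Arabic text readable.
        return json.dumps(text, ensure_ascii=False)

    lines = [
        "seed:",
        f"  name: {q(name_ar)}",
        f"  name_en: {q(name_en)}",
        f"  base_url: {q(base_url)}",
        '  language: "ar"',
        '  platform: "ojs"',
        f"  default_field_id: {field_id}",
        '  default_category: "article"',
        "  seed_tags: []",
        f"  publisher_name_en: {q(publisher_en)}",
        f"  publisher_name_ar: {q(publisher_ar)}",
        "  crawl:",
        f"    start_from: {q(start_from)}",
    ]
    return "\n".join(lines) + "\n"
```

Validating the field ID at render time catches typos before a seed file reaches the crawler.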
