# Skill: Topic Enrichment Pipeline

> **Purpose**: Generate LLM-powered Arabic topic packs from search data to improve ES retrieval quality.  
> **When to use**: When user wants to improve search relevance, add topic coverage, or process GSC/search log data.

---

## Session Startup Checklist

1. **Ask the user to upload a GSC export** (Excel/CSV from Google Search Console → Performance → Export)
2. **Check current enrichment stats** on production
3. **Review CTR data** to identify high-impression, low-CTR queries
4. **Run enrichment** for the prioritized queries (dry run first)
5. **Review drafts** and bulk-approve those with `quality_score` ≥ 0.5
6. **Verify the feature flag** so approved enrichments are used in search

---

## Step 1: Request GSC Data

Ask the user:

> "Please upload your Google Search Console export (Excel or CSV). I need the Performance report with queries, impressions, clicks, CTR, and position columns. This helps prioritize which topics to enrich first."

Expected columns:
- `Query` or `Top queries`
- `Impressions`
- `Clicks`
- `CTR`
- `Position`
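
A minimal Python sketch for normalizing an uploaded CSV export into uniform rows. The header names come from the list above. The function names `parse_ctr` and `load_gsc_rows`, and the assumption that CTR arrives either as a percentage string (`1.2%`) or as a plain fraction, are illustrative and not part of the pipeline.

```python
import csv
import io

# Accepted headers for the query column, per the list above.
QUERY_HEADERS = ("Query", "Top queries")


def parse_ctr(value: str) -> float:
    """Convert a CTR cell such as '1.2%' or '0.012' to a fraction."""
    text = value.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def load_gsc_rows(csv_text: str) -> list[dict]:
    """Parse CSV text into dicts with uniform keys and numeric values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    query_col = next((h for h in QUERY_HEADERS if h in fieldnames), None)
    if query_col is None:
        raise ValueError(f"No query column found; expected one of {QUERY_HEADERS}")

    rows = []
    for raw in reader:
        rows.append({
            "query": raw[query_col].strip(),
            "impressions": int(raw["Impressions"]),
            "clicks": int(raw["Clicks"]),
            "ctr": parse_ctr(raw["CTR"]),
            "position": float(raw["Position"]),
        })
    return rows
```

When reading from a file, opening it with `encoding="utf-8-sig"` avoids a stray byte-order mark in the first header. Excel exports would need a separate reader.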

---

## Step 2: Check Current Stats

Run on production:

```bash
ssh -i 'C:\Users\shadisaleh\Documents\linux\shamramain_user.pem' azureuser@20.241.4.71 \
  "sudo -u www-data php /var/www/html/academia_v2/bin/console dbal:run-sql --env=prod \
   'SELECT status, COUNT(id) as cnt FROM topic_enrichment GROUP BY status'"
```

Expected output: one row per status (`draft`, `approved`, `rejected`) with the number of enrichments in that status.

---

## Step 3: Analyze GSC Data for Priority Queries

From uploaded GSC data, identify:

### High Priority (Zero-Click Recovery)
- **High impressions + low CTR** (impressions > 100, CTR < 2%)
- These are queries where we're ranking but users aren't clicking

### Medium Priority (Coverage Gaps)
- **High impressions, position 4-10** (on page 1, but below the top results)
- Enrichment may improve relevance and rankings

### Low Priority (Long Tail)
- **Low impressions but relevant topics**
- Worth enriching for completeness
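
The three tiers can be sketched as a small classifier. The high-priority thresholds are the ones stated above. Reusing the same impression cutoff for the medium tier is an assumption, because the text only says "high impressions". `classify_priority` is a hypothetical helper, and topic relevance for the low tier still needs human judgment.

```python
def classify_priority(impressions: int, ctr: float, position: float,
                      high_impressions: int = 100) -> str:
    """Return 'high', 'medium', or 'low' for one GSC row.

    ctr is a fraction (0.02 means 2%). The medium-tier impression
    cutoff reuses high_impressions, which is an assumption.
    """
    if impressions > high_impressions and ctr < 0.02:
        return "high"      # ranking, but users are not clicking
    if impressions > high_impressions and 4 <= position <= 10:
        return "medium"    # visible, with room to move up
    return "low"           # long tail: enrich only if the topic is relevant
```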

Extract topic names by stripping common patterns:
- Remove: `رسائل ماجستير عن` ("master's theses on"), `رسائل دكتوراه عن` ("doctoral theses on"), `pdf`, `doc`
- Keep: the core topic (e.g., `الطاقة الشمسية` "solar energy", `سلوك المستهلك` "consumer behavior")
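
A possible Python sketch of the stripping rule, using only the four patterns listed above. `extract_topic` is an illustrative name. Matching `pdf` and `doc` only as standalone tokens is a design assumption, made so that words such as `docx` are left intact.

```python
import re

# Query prefixes to drop, copied from the list above.
PREFIXES = ("رسائل ماجستير عن", "رسائل دكتوراه عن")

# File-type tokens, matched only when not embedded in a longer word.
FILE_TOKENS = re.compile(r"(?<!\w)(?:pdf|doc)(?!\w)", re.IGNORECASE)


def extract_topic(query: str) -> str:
    """Strip known prefixes and file-type tokens, then collapse whitespace."""
    topic = query
    for prefix in PREFIXES:
        topic = topic.replace(prefix, " ")
    topic = FILE_TOKENS.sub(" ", topic)
    return " ".join(topic.split())
```

For instance, `extract_topic("رسائل ماجستير عن الطاقة الشمسية pdf")` yields `الطاقة الشمسية`.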

---

## Step 4: Run Enrichment Command

### Dry run first (see what would be processed):

```bash
ssh -i 'C:\Users\shadisaleh\Documents\linux\shamramain_user.pem' azureuser@20.241.4.71 \
  "sudo -u www-data php /var/www/html/academia_v2/bin/console app:enrich-topics \
   --limit=20 --dry-run --env=prod"
```

### Process topics:

```bash
ssh -i 'C:\Users\shadisaleh\Documents\linux\shamramain_user.pem' azureuser@20.241.4.71 \
  "sudo -u www-data php /var/www/html/academia_v2/bin/console app:enrich-topics \
   --limit=50 --env=prod"
```

Command options:
- `--limit=N`: maximum number of queries to process
- `--days=30`: lookback window, in days, for search logs
- `--min-search-count=2`: minimum number of times a query must have been searched
- `--skip-recent-hours=24`: skip queries enriched within the last 24 hours
- `--auto-approve`: save results as `approved` instead of `draft`
- `--dry-run`: preview what would be processed without writing anything
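
One way to keep invocations consistent is to assemble the command from the options above. The Python sketch below defaults to `--dry-run`, so a real run has to be requested explicitly. `build_enrich_command` and its `console` parameter are hypothetical, and the sketch uses only flags listed in this section.

```python
import shlex
from typing import Optional


def build_enrich_command(limit: int, dry_run: bool = True,
                         days: Optional[int] = None,
                         min_search_count: Optional[int] = None,
                         console: str = "bin/console") -> str:
    """Build the app:enrich-topics command line as a shell-safe string."""
    args = ["php", console, "app:enrich-topics", f"--limit={limit}"]
    if days is not None:
        args.append(f"--days={days}")
    if min_search_count is not None:
        args.append(f"--min-search-count={min_search_count}")
    if dry_run:
        args.append("--dry-run")  # preview only, nothing is written
    args.append("--env=prod")
    return shlex.join(args)
```

The returned string can be pasted into the remote part of the `ssh` commands above.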

---

## Step 5: Review Draft Quality

### View sample drafts:

```bash
ssh -i 'C:\Users\shadisaleh\Documents\linux\shamramain_user.pem' azureuser@20.241.4.71 \
  "sudo -u www-data php /var/www/html/academia_v2/bin/console dbal:run-sql --env=prod \
   'SELECT id, source_query, canonical_topic_ar, synonyms_ar, quality_score 
    FROM topic_enrichment WHERE status = \"draft\" LIMIT 10'"
```

### Quality criteria:
- `quality_score >= 0.5`: meets the threshold for bulk approval
- `canonical_topic_ar`: should be a clean Arabic topic name
- `synonyms_ar`: should contain 3-8 relevant Arabic variants
- `query_intents_ar`: should include intents such as `رسائل ماجستير` (master's theses) and `مراجعة أدبيات` (literature review)
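
The criteria above can be turned into a quick pre-check when sampling drafts. This Python sketch assumes each draft has already been loaded into a dict and that `synonyms_ar` and `query_intents_ar` are decoded into lists. The storage format is not specified here, and `review_draft` is an illustrative helper, not part of the codebase.

```python
def review_draft(draft: dict, min_score: float = 0.5) -> list[str]:
    """Return a list of problems; an empty list means the draft passes."""
    problems = []

    score = draft.get("quality_score") or 0.0
    if score < min_score:
        problems.append(f"quality_score {score} is below {min_score}")

    if not str(draft.get("canonical_topic_ar") or "").strip():
        problems.append("canonical_topic_ar is empty")

    synonyms = draft.get("synonyms_ar") or []
    if not 3 <= len(synonyms) <= 8:
        problems.append(f"synonyms_ar has {len(synonyms)} entries, expected 3-8")

    if not (draft.get("query_intents_ar") or []):
        problems.append("query_intents_ar is empty")

    return problems
```

A check like this catches only structural problems. Whether the synonyms are actually relevant still needs a human read.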

### Bulk approve high-quality drafts:

```bash
ssh -i 'C:\Users\shadisaleh\Documents\linux\shamramain_user.pem' azureuser@20.241.4.71 \
  "sudo -u www-data php /var/www/html/academia_v2/bin/console dbal:run-sql --env=prod \
   'UPDATE topic_enrichment SET status = \"approved\" WHERE status = \"draft\" AND quality_score >= 0.5'"
```

---

## Step 6: Verify Feature Flag

Ensure search expansion is enabled:

```bash
grep "topic_enrichment_search_enabled" config/services.yaml
```

Should show: `topic_enrichment_search_enabled: true`

If not enabled, update and deploy:

```yaml
# config/services.yaml
parameters:
    topic_enrichment_search_enabled: true
```

---

## Key Files

| File | Purpose |
|------|---------|
| `src/Entity/Topic.php` | Topic entity (SEO identity) |
| `src/Entity/TopicEnrichment.php` | LLM enrichment data |
| `src/Command/EnrichTopicsCommand.php` | CLI command |
| `src/Service/TopicEnrichmentService.php` | LLM call + persistence |
| `src/Repository/TopicEnrichmentRepository.php` | Query methods |
| `config/services.yaml` | Feature flag |

---

## Integration Points

### ES Search Expansion
- `HomepageController::getApprovedTopicExpansionTerms()` looks up the approved enrichment
- Adds the `synonyms_ar` and `related_concepts_ar` terms to the ES query with a boost of 0.3
- Runs only when an approved enrichment exists; drafts and rejected rows are ignored
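
This document does not show the actual query construction in `HomepageController`, so the following Python sketch only illustrates the general idea. The original query stays mandatory, and each expansion term is added as an optional `should` clause with a boost of 0.3. The field name `title` and the function `build_expanded_query` are hypothetical.

```python
def build_expanded_query(user_query: str, expansion_terms: list[str],
                         field: str = "title", boost: float = 0.3) -> dict:
    """Build a bool query body: required user query plus low-boost expansions."""
    return {
        "query": {
            "bool": {
                # The user's own query must match.
                "must": [{"match": {field: {"query": user_query}}}],
                # Expansion terms only raise the score of matching documents.
                "should": [
                    {"match": {field: {"query": term, "boost": boost}}}
                    for term in expansion_terms
                ],
            }
        }
    }
```

Keeping expansions in `should` with a low boost means they can surface additional relevant documents without outranking direct matches on the user's wording.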

### Future: Topic Pages
- `TopicController` (not yet built) will use enrichment for `/topics/{slug}` pages
- See `futures/04-topic-landing-pages.md` for full plan

---

## Troubleshooting

### "No candidate queries found"
- Search logs may be empty or too sparse within the lookback window
- Try `--days=60` for longer lookback
- Check `search_query_log` table has data

### LLM failures
- Check `var/log/prod.log` for errors
- Verify Azure OpenAI credentials in `.env.local`
- Service: `TopicEnrichmentService::generateEnrichmentData()`

### Enrichments not appearing in search
- Verify `status = 'approved'` (not draft)
- Check feature flag is `true`
- Clear the Symfony cache after changing the flag (e.g. `php bin/console cache:clear --env=prod`)

---

## Session Workflow Summary

```
1. User uploads GSC export
2. Check current stats: how many approved/draft/rejected?
3. Analyze: find high-impression, low-CTR queries
4. Run: app:enrich-topics --limit=20 --dry-run, then app:enrich-topics --limit=50
5. Review: sample 5-10 drafts for quality
6. Approve: bulk approve quality_score >= 0.5
7. Verify: feature flag enabled, test search
```
