# 17. MLT Improvements & Related Searches Feature

> Improve the "More Like This" (MLT) related research feature and add "People Also Searched For" suggestions.

---

## Overview

Two interconnected improvements to search and discovery:

1. **MLT Improvements** - Fix intermittent ES connection issues, add caching, improve resilience
2. **Related Searches** - Show "people also searched for" suggestions based on actual user search data

---

## Part 1: MLT (More Like This) Improvements

### Current State

| Component | Location | Purpose |
|-----------|----------|---------|
| `RelatedResearchService` | `src/syndex/AcademicBundle/Service/RelatedResearchService.php` | Finds related papers via ES MLT query |
| `RecommendationService` | `src/Service/RecommendationService.php` | Personalized recommendations via MLT |
| ES Connection | `src/syndex/AcademicBundle/Service/Elasticsearch/Elasticsearch.php` | Shared ES client |

### Known Issues (March 2026)

1. **"No alive nodes" errors** - 4,000+ errors/day
   - The ES server is reachable, but connections fail intermittently
   - Likely: connection pool exhaustion or stale connections
   - Impact: "Related Research" section fails silently
   - **STATUS: ✅ MITIGATED** - Circuit breaker + retry + caching deployed March 10

2. **"Syntax error" / "Malformed UTF-8"** - ~100/day
   - Bad encoding in some research titles/abstracts
   - MLT query construction fails
   - **STATUS: ✅ IMPROVED** - Better UTF-8 sanitization deployed March 10

3. **No caching** - Every research page view triggers fresh MLT query
   - Same related papers re-computed repeatedly
   - Adds latency + ES load
   - **STATUS: ✅ FIXED** - 6-hour Redis caching deployed March 10

4. **Silent failures** - Users see empty "Related Research" section
   - No indication that feature is temporarily unavailable
   - **STATUS: Unchanged** - Consider adding "temporarily unavailable" message

### Proposed Improvements

#### Phase 1: Connection Resilience (Priority: HIGH)

**1.1 Add connection retry with exponential backoff**

```php
// In RelatedResearchService::runSearch()
private function runSearch(array $body, string $index, int $limit): array
{
    $maxRetries = 2;
    $lastException = null;
    
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        try {
            if ($attempt > 0) {
                usleep(100000 * pow(2, $attempt)); // 200ms, 400ms backoff
            }
            return $this->doSearch($body, $index, $limit);
        } catch (NoNodesAvailableException $e) {
            $lastException = $e;
            $this->logger->warning('ES retry {attempt}/{max}', [
                'attempt' => $attempt + 1,
                'max' => $maxRetries + 1,
            ]);
        }
    }
    
    throw $lastException;
}
```

**1.2 Circuit breaker pattern**

If MLT fails 5+ times in 60 seconds, skip MLT queries for 30 seconds.

```php
// Redis-backed circuit breaker
private function isCircuitOpen(): bool
{
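    // get() returns false (phpredis) or null (Predis) for a missing key; the cast handles both
    $failures = (int) $this->redis->get('mlt:failures:count');
    return $failures >= 5; // open at the 5th failure, matching "5+ times"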
}

private function recordFailure(): void
{
    $this->redis->incr('mlt:failures:count');
    $this->redis->expire('mlt:failures:count', 60);
}

private function recordSuccess(): void
{
    $this->redis->del('mlt:failures:count');
}
```
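
The counter above tracks failures but does not by itself express the 30-second skip described in 1.2. The following is a minimal sketch of one way to separate the 60-second counting window from the 30-second cooldown. It is not the deployed implementation: the class name `MltCircuitBreaker` and its methods are hypothetical, an in-memory array stands in for Redis, and the current time is passed in so the logic can be exercised without waiting.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: failure counter with a separate cooldown.
 * An array stands in for Redis; $now is injected so the logic is testable.
 */
final class MltCircuitBreaker
{
    /** @var int[] Unix timestamps of recent failures */
    private array $failures = [];
    private int $openUntil = 0;

    public function __construct(
        private int $threshold = 5,    // failures needed to open the circuit
        private int $windowSec = 60,   // counting window
        private int $cooldownSec = 30  // how long MLT queries are skipped
    ) {
    }

    public function isOpen(int $now): bool
    {
        return $now < $this->openUntil;
    }

    public function recordFailure(int $now): void
    {
        // Keep only failures inside the counting window, then add this one.
        $this->failures = array_values(array_filter(
            $this->failures,
            fn (int $t): bool => $t > $now - $this->windowSec
        ));
        $this->failures[] = $now;

        if (count($this->failures) >= $this->threshold) {
            $this->openUntil = $now + $this->cooldownSec;
            $this->failures = [];
        }
    }

    public function recordSuccess(): void
    {
        $this->failures = [];
    }
}
```

A caller would check `isOpen(time())` before issuing the MLT query and return an empty result while the circuit is open. In a Redis-backed version the cooldown could be a second key with a 30-second TTL, which is also an assumption rather than something this document specifies.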

#### Phase 2: Caching (Priority: MEDIUM)

**2.1 Cache related research results**

- **Cache key**: `related_research:{researchId}:{locale}`
- **TTL**: 6 hours (research content rarely changes)
- **Storage**: Redis (already configured)
- **Invalidation**: On research content update (optional)

```php
public function findRelated(...): array
{
    $cacheKey = sprintf('related_research:%s:%s', $researchId, $isEnglish ? 'en' : 'ar');
    
    $cached = $this->cache->get($cacheKey);
    if ($cached !== null) {
        return $cached;
    }
    
    $results = $this->doFindRelated(...);
    
    if (!empty($results)) {
        $this->cache->set($cacheKey, $results, 21600); // 6 hours
    }
    
    return $results;
}
```
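
Invalidation is listed as optional above. If it is adopted, one simple approach is to delete every locale variant of a paper's key when the paper is updated. The sketch below assumes the key format from 2.1 and the two locales `en` and `ar`; the function names are hypothetical and a plain array stands in for Redis. The keys observed in production later in this document carry an additional trailing segment, so a real implementation would need to account for that.

```php
<?php
declare(strict_types=1);

/** Builds the cache key described in 2.1 (format assumed from this document). */
function relatedResearchCacheKey(string $researchId, string $locale): string
{
    return sprintf('related_research:%s:%s', $researchId, $locale);
}

/**
 * Removes cached results for every locale of one paper.
 * $cache is a plain array standing in for Redis; returns the number of keys removed.
 */
function invalidateRelatedResearch(array &$cache, string $researchId, array $locales = ['en', 'ar']): int
{
    $removed = 0;
    foreach ($locales as $locale) {
        $key = relatedResearchCacheKey($researchId, $locale);
        if (array_key_exists($key, $cache)) {
            unset($cache[$key]);
            $removed++;
        }
    }

    return $removed;
}
```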

**2.2 Warm cache for popular research**

A cron job pre-computes related research for the 1,000 most-viewed papers.

```bash
# Add to cron
0 4 * * * php bin/console app:warm-related-cache --limit=1000
```

#### Phase 3: Query Robustness (Priority: MEDIUM)

**3.1 Better UTF-8 sanitization**

The current `sanitizeUtf8()` handles only basic cases. A more thorough version:

```php
private function sanitizeUtf8(string $text): string
{
    // Remove null bytes
    $text = str_replace("\0", '', $text);
    
    // Force UTF-8 encoding
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    }
    
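    // Remove control characters (tab, LF and CR are kept here and collapsed below)
    $text = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $text) ?? '';
    
    // Normalize whitespace: any run of whitespace, including newlines, becomes one space
    $text = preg_replace('/\s+/u', ' ', $text) ?? '';
    
    return trim($text);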
}
```

**3.2 Validate query before sending**

```php
private function validateMltQuery(string $likeText): bool
{
    // Min 3 characters of actual content
    $cleaned = preg_replace('/\s+/u', '', $likeText);
    if (mb_strlen($cleaned) < 3) {
        return false;
    }
    
    // Must be valid UTF-8
    if (!mb_check_encoding($likeText, 'UTF-8')) {
        return false;
    }
    
    return true;
}
```
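
Sanitization and validation are meant to run together before the MLT query is built. The sketch below shows that pipeline as standalone functions so it can be tried outside the service; `sanitizeUtf8Text()` and `prepareLikeText()` are hypothetical names, and the 3-character minimum is taken from 3.2. It requires the `mbstring` extension.

```php
<?php
declare(strict_types=1);

/** Standalone variant of the sanitizer from 3.1. */
function sanitizeUtf8Text(string $text): string
{
    // Re-encoding from UTF-8 to UTF-8 replaces invalid byte sequences.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    }

    // preg_replace() returns null on failure, so fall back to an empty string.
    $text = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $text) ?? '';
    $text = preg_replace('/\s+/u', ' ', $text) ?? '';

    return trim($text);
}

/**
 * Returns cleaned text that is safe to send as MLT "like" text,
 * or null when fewer than 3 non-whitespace characters remain.
 */
function prepareLikeText(string $raw): ?string
{
    $clean = sanitizeUtf8Text($raw);
    $content = preg_replace('/\s+/u', '', $clean) ?? '';

    return mb_strlen($content) >= 3 ? $clean : null;
}
```

Returning `null` lets the caller skip the ES request entirely instead of sending a query that is likely to fail.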

### Metrics to Track

| Metric | Target | Before (est.) | After Deploy |
|--------|--------|---------------|---------------|
| MLT query success rate | >99% | ~95% | Monitoring... |
| MLT cache hit rate | >70% | 0% | Not yet measured; 980 cached entries after warmup (50 test + organic) |
| Avg MLT query time | <100ms | ~200ms | <1ms (cache hits) |
| Related Research shown | >95% of pages | ~80% | Improving with cache |
| Circuit breaker trips | Rare | - | Active during ES outages |
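
The cache hit rate in this table needs hit and miss counts, which are not logged yet. A minimal sketch of the bookkeeping, assuming a hypothetical `CacheStats` helper that the service would call on every cache lookup:

```php
<?php
declare(strict_types=1);

/** Hypothetical helper that counts cache lookups and derives the hit rate. */
final class CacheStats
{
    private int $hits = 0;
    private int $misses = 0;

    public function record(bool $hit): void
    {
        if ($hit) {
            $this->hits++;
        } else {
            $this->misses++;
        }
    }

    /** Returns the hit rate between 0 and 1, or null before any lookup. */
    public function hitRate(): ?float
    {
        $total = $this->hits + $this->misses;

        return $total === 0 ? null : $this->hits / $total;
    }
}
```

In production the two counters would more likely live in Redis or in the metrics system so they survive across PHP requests; the in-memory version only illustrates the calculation.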

---

## Part 2: Related Searches Feature

### Concept

Show "People also searched for" / "عمليات البحث ذات الصلة" suggestions below search results.

**Example:**
```
User searches: "machine learning"

People also searched for:
• deep learning
• artificial intelligence  
• neural networks
• natural language processing
```

### Data Sources

Two existing log tables already provide the data needed:

| Table | Records | What it contains |
|-------|---------|------------------|
| `search_query_log` | 100K+ | All search queries with results count |
| `search_click_log` | 50K+ | Which results users clicked |

### Strategy: Query Co-occurrence

Two queries are "related" if:
1. **Same session** - User searched A then B in same session
2. **Same result clicks** - Users who searched A and B clicked same papers
3. **Semantic similarity** - Queries share significant terms

#### Strategy A: Session-Based Co-occurrence (Recommended)

Users who search "X" often also search "Y" in the same session.

```sql
-- Find queries that co-occur with a given query in same session
SELECT 
    sq2.query_text,
    COUNT(DISTINCT sq2.session_id) AS co_occurrence_count  -- count sessions, not row pairs
FROM search_query_log sq1
JOIN search_query_log sq2 
    ON sq1.session_id = sq2.session_id 
    AND sq2.query_text != sq1.query_text  -- never suggest the current query itself
    AND sq2.results_count > 0  -- Only suggest queries that have results
WHERE sq1.query_text = :currentQuery
    AND sq1.created_at >= DATE_SUB(NOW(), INTERVAL 90 DAY)
GROUP BY sq2.query_text
HAVING co_occurrence_count >= 3  -- Minimum threshold
ORDER BY co_occurrence_count DESC
LIMIT 5;
```
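
The same idea can be checked on a small in-memory sample before running it against the log table. This sketch counts, for a given query, how many distinct sessions also contain each other query, and it excludes the current query itself. The function name `coOccurringQueries()` and the row shape are assumptions for illustration; only `session_id` and `query_text` come from the log table described above.

```php
<?php
declare(strict_types=1);

/**
 * @param array<int, array{session_id: string, query_text: string}> $rows
 * @return array<string, int> related query => number of distinct sessions, highest first
 */
function coOccurringQueries(array $rows, string $current, int $minSessions = 3, int $limit = 5): array
{
    // Group the distinct queries seen in each session.
    $bySession = [];
    foreach ($rows as $row) {
        $bySession[$row['session_id']][$row['query_text']] = true;
    }

    $counts = [];
    foreach ($bySession as $queries) {
        if (!isset($queries[$current])) {
            continue; // this session never searched for the current query
        }
        foreach (array_keys($queries) as $other) {
            // Cast because PHP turns numeric string keys into integers.
            if ((string) $other !== $current) {
                $counts[$other] = ($counts[$other] ?? 0) + 1;
            }
        }
    }

    // Apply the minimum threshold, then sort by session count, descending.
    $counts = array_filter($counts, fn (int $n): bool => $n >= $minSessions);
    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}
```

Counting sessions rather than raw rows keeps a single user who repeats a query many times from inflating a pair's score.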

#### Strategy B: Click-Based Similarity

Users who searched different queries but clicked the same paper.

```sql
-- Queries that led to clicks on same research as current query
SELECT 
    sq2.query_text,
    COUNT(DISTINCT sc2.target_id) as shared_clicks
FROM search_query_log sq1
JOIN search_click_log sc1 ON sc1.search_query_id = sq1.id
JOIN search_click_log sc2 ON sc2.target_id = sc1.target_id
JOIN search_query_log sq2 ON sq2.id = sc2.search_query_id
    AND sq2.query_text != sq1.query_text
WHERE sq1.query_text = :currentQuery
    AND sq1.created_at >= DATE_SUB(NOW(), INTERVAL 90 DAY)
GROUP BY sq2.query_text
ORDER BY shared_clicks DESC
LIMIT 5;
```

### Implementation Plan

#### Phase 1: Data Collection (Status: ✅ DONE)

Search queries are already logged in `search_query_log` with a `session_id`.

#### Phase 2: Build Related Queries Table

**2.1 Create materialized/summary table**

```sql
CREATE TABLE related_search_queries (
    id INT AUTO_INCREMENT PRIMARY KEY,
    query_text VARCHAR(500) NOT NULL,
    related_query VARCHAR(500) NOT NULL,
    co_occurrence_count INT DEFAULT 1,
    confidence_score DECIMAL(5,4) DEFAULT 0,
    search_type ENUM('arabic', 'english') NOT NULL,
    last_computed DATETIME NOT NULL,
    
    INDEX idx_query_text (query_text(191)),
    INDEX idx_search_type (search_type),
    UNIQUE KEY unique_pair (query_text(191), related_query(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```
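
The table stores a `confidence_score`, but this document does not define how it is computed. One possible definition, offered purely as an assumption, is the share of sessions containing the query that also contain the related query, rounded to four decimals to fit `DECIMAL(5,4)`. The function name is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Assumed definition: sessions containing both queries divided by
 * sessions containing the base query, clamped to the range 0..1.
 */
function confidenceScore(int $sessionsWithBoth, int $sessionsWithQuery): float
{
    if ($sessionsWithQuery <= 0) {
        return 0.0; // no data for the base query
    }

    $ratio = $sessionsWithBoth / $sessionsWithQuery;
    $ratio = max(0.0, min(1.0, (float) $ratio));

    return round($ratio, 4);
}
```

A ratio like this makes scores comparable between high-volume and low-volume queries, which a raw `co_occurrence_count` does not.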

**2.2 Command to compute related queries**

```php
// src/Command/ComputeRelatedSearchesCommand.php
class ComputeRelatedSearchesCommand extends Command
{
    protected static $defaultName = 'app:compute-related-searches';
    
    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // 1. Get top 1000 queries by volume
        $topQueries = $this->queryLogRepo->getTopQueries(90, 1000);
        
        foreach ($topQueries as $query) {
            // 2. Find co-occurring queries (session-based)
            $related = $this->findRelatedByCooccurrence($query['query_text']);
            
            // 3. Upsert into related_search_queries table
            $this->saveRelatedQueries($query['query_text'], $related);
        }
        
        // 4. Cleanup old entries
        $this->pruneOldEntries(90); // Keep 90 days
        
        return Command::SUCCESS;
    }
}
```

**2.3 Schedule computation**

```bash
# Cron: Run nightly to refresh related queries
0 3 * * * php bin/console app:compute-related-searches --env=prod
```

#### Phase 3: API & UI

**3.1 Service class**

```php
// src/Service/RelatedSearchService.php
class RelatedSearchService
{
    public function getRelatedSearches(string $queryText, string $searchType = 'arabic', int $limit = 5): array
    {
        // First try: exact match from precomputed table
        $related = $this->repository->findBy([
            'queryText' => $queryText,
            'searchType' => $searchType,
        ], ['coOccurrenceCount' => 'DESC'], $limit);
        
        if (count($related) >= 3) {
            return array_map(fn($r) => $r->getRelatedQuery(), $related);
        }
        
        // Fallback: fuzzy match on similar queries
        return $this->findFuzzyRelated($queryText, $searchType, $limit);
    }
    
    private function findFuzzyRelated(string $query, string $type, int $limit): array
    {
        // Use LIKE or FULLTEXT search to find similar query roots
        // e.g., "machine learning algorithms" → related for "machine learning"
        return []; // placeholder until the fuzzy lookup is implemented
    }
}
```
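
The fuzzy fallback is left open above. One simple option, sketched here under stated assumptions, is to drop trailing words from the query until a precomputed entry is found, so "machine learning algorithms" falls back to the suggestions for "machine learning". The function name and the in-memory `$precomputed` map are hypothetical; a real version would query `related_search_queries` instead.

```php
<?php
declare(strict_types=1);

/**
 * @param array<string, string[]> $precomputed query text => related queries
 * @return string[]
 */
function findRelatedByShorterQuery(string $query, array $precomputed, int $limit = 5): array
{
    $words = preg_split('/\s+/u', trim($query), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Drop trailing words one at a time: "a b c" -> "a b" -> "a".
    for ($n = count($words) - 1; $n >= 1; $n--) {
        $candidate = implode(' ', array_slice($words, 0, $n));
        if (isset($precomputed[$candidate])) {
            // Never suggest the query the user just typed.
            $related = array_filter(
                $precomputed[$candidate],
                fn (string $r): bool => $r !== $query
            );

            return array_slice(array_values($related), 0, $limit);
        }
    }

    return [];
}
```

Prefix truncation is cheaper than `LIKE` or `FULLTEXT` matching because each candidate is an exact lookup, at the cost of missing queries that differ at the start rather than the end.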

**3.2 Controller endpoint**

```php
// Add to HomepageController or new SearchController
#[Route('/api/related-searches', name: 'api_related_searches', methods: ['GET'])]
public function getRelatedSearches(Request $request): JsonResponse
{
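    $query = trim((string) $request->query->get('q', ''));
    $type = $request->query->get('type', 'arabic');
    if (!in_array($type, ['arabic', 'english'], true)) {
        $type = 'arabic'; // fall back to the default for unknown types
    }
    
    if (mb_strlen($query) < 2) {
        // Same response shape as the success case
        return $this->json(['query' => $query, 'related_searches' => []]);
    }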
    
    $related = $this->relatedSearchService->getRelatedSearches($query, $type, 5);
    
    return $this->json([
        'query' => $query,
        'related_searches' => $related,
    ]);
}
```

**3.3 UI Component**

```twig
{# templates/components/related_searches.html.twig #}
{% if related_searches is not empty %}
<div class="related-searches">
    <h4>{{ 'search.related_searches'|trans }}</h4>
    <ul class="related-searches__list">
        {% for search in related_searches %}
        <li>
            <a href="{{ path('shamra_academia_filter', {title: search, type: search_type}) }}">
                {{ search }}
            </a>
        </li>
        {% endfor %}
    </ul>
</div>
{% endif %}
```

**3.4 Translations**

```yaml
# translations/messages.ar.yml
search:
    related_searches: "عمليات البحث ذات الصلة"
    people_also_searched: "يبحث الآخرون أيضًا عن"

# translations/messages.en.yml
search:
    related_searches: "Related searches"
    people_also_searched: "People also searched for"
```

### Display Logic

1. **When to show**: Only on search results pages with actual query
2. **Minimum data**: Only show if ≥3 related queries found
3. **Placement**: Below search results, above pagination
4. **Cache**: Cache suggestions for 1 hour per query
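
The first two rules and the one-hour cache can be captured in a couple of small helpers. This is a sketch only: the function names, the constants, and the `related_searches:` key format are assumptions, not part of the existing codebase.

```php
<?php
declare(strict_types=1);

const MIN_RELATED_SUGGESTIONS = 3;      // rule 2: minimum data
const RELATED_SEARCH_CACHE_TTL = 3600;  // rule 4: one hour, in seconds

/** Rules 1 and 2: a real query and at least three suggestions. */
function shouldShowRelatedSearches(string $query, array $related): bool
{
    return trim($query) !== '' && count($related) >= MIN_RELATED_SUGGESTIONS;
}

/** Builds a per-query cache key (hypothetical format). */
function relatedSearchCacheKey(string $query, string $searchType): string
{
    // Hash the normalized query so the key stays short and free of odd characters.
    $normalized = mb_strtolower(trim($query));

    return sprintf('related_searches:%s:%s', $searchType, sha1($normalized));
}
```

Normalizing before hashing means "Machine Learning" and " machine learning " share one cache entry.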

### Privacy Considerations

- No user identification in related searches
- Aggregated data only (≥3 occurrences to show)
- No logging of who sees what suggestions

---

## Database Migrations

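### Migration 1: Related Search Queries Table

This migration differs from the draft schema in Phase 2.1: it adds `shared_click_count` and `idx_confidence`, and stores `search_type` as `VARCHAR(10)` rather than an `ENUM`. The two definitions should be reconciled before the table is created.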

```php
// migrations/Version2026031XXXXXX.php
public function up(Schema $schema): void
{
    $this->addSql('
        CREATE TABLE related_search_queries (
            id INT AUTO_INCREMENT PRIMARY KEY,
            query_text VARCHAR(500) NOT NULL,
            related_query VARCHAR(500) NOT NULL,
            co_occurrence_count INT DEFAULT 1,
            shared_click_count INT DEFAULT 0,
            confidence_score DECIMAL(5,4) DEFAULT 0,
            search_type VARCHAR(10) NOT NULL DEFAULT \'arabic\',
            last_computed DATETIME NOT NULL,
            INDEX idx_query_text (query_text(191)),
            INDEX idx_search_type (search_type),
            INDEX idx_confidence (confidence_score),
            UNIQUE KEY unique_pair (query_text(191), related_query(191))
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
    ');
}
```

---

## Implementation Phases

### Phase 1: MLT Resilience (Week 1) ✅ IMPLEMENTED (March 10, 2026)
- [x] Add retry logic with backoff to `RelatedResearchService`
- [x] Implement circuit breaker (Redis-backed)
- [x] Improve UTF-8 sanitization
- [x] Add Redis caching (6hr TTL) for MLT results

#### Deployment Verification (March 10, 2026 12:20 UTC)

**Features Confirmed Working:**

| Feature | Status | Evidence |
|---------|--------|----------|
| Circuit breaker | ✅ Active | `mlt_circuit_failures` key in Redis; "circuit breaker opened" logs |
| Redis caching | ✅ Working | 82 cached entries (`related_research:*` keys) in first 30 min |
| Retry logic | ✅ Deployed | Code active, triggers on transient ES errors |
| UTF-8 sanitization | ✅ Improved | Arabic encoding detection added |

**Redis Cache Keys:**
```
6Yb9ujB2pq:related_research:1809.10089:en:6
6Yb9ujB2pq:related_research:1803.08603:en:6
6Yb9ujB2pq:mlt_circuit_failures (TTL: 60s)
```

**Circuit Breaker Logs:**
```
[2026-03-10T12:20:40] RelatedResearch circuit breaker opened {"failures":7,"threshold":5}
```

#### How to Verify

```bash
# Count cache entries (SCAN avoids blocking Redis the way KEYS can)
ssh ... "redis-cli --scan --pattern '*related_research*' | wc -l"

# Check circuit breaker state
ssh ... "redis-cli ttl '6Yb9ujB2pq:mlt_circuit_failures'"

# Monitor circuit breaker activity
ssh ... "grep 'circuit breaker' /var/www/html/academia_v2/var/log/prod.log | tail -5"

# Compare error rates (next day)
ssh ... "grep '2026-03-11' .../prod.log | grep -c 'RelatedResearch MLT failed'"
```

### Phase 2: MLT Cache Warmup (Week 2) ✅ IMPLEMENTED (March 10, 2026)
- [x] Create warmup command for popular research
- [x] Add ES-only document support
- [x] Set up daily cron job
- [ ] Add cache hit/miss metrics logging (optional - deferred)

#### Deployment Verification (March 10, 2026 12:35 UTC)

**Command:** `app:warm-related-cache`

```bash
# Dry run (preview)
sudo -u www-data php bin/console app:warm-related-cache --dry-run --limit=10 --env=prod

# Full warmup (500 most clicked research)
sudo -u www-data php bin/console app:warm-related-cache --limit=500 --env=prod
```

**Test Results:**

| Run | Limit | DB Warmed | ES-Only Warmed | Skipped | Not Found |
|-----|-------|-----------|----------------|---------|----------|
| Initial (v1) | 50 | 37 | 0 | 0 | 13 |
| With ES support (v2) | 100 | 61 | 20 | 0 | 0 |

**v2 Update (March 10, 2026 12:50 UTC):** Added ES-only document support. The command now:
1. Checks the MySQL DB first (Arabic tables, then English)
2. Queries the Elasticsearch indices directly if the document is not found there

This closes the gap where popular ES-only documents were not being cached.

**Options:**
- `--limit=N` — Number of popular pages to warm (default: 500)
- `--days=N` — Lookback period for click popularity (default: 30)
- `--dry-run` — Preview without actual cache warming

**Cron Job:** ✅ ACTIVE (added March 10, 2026)
```bash
# Daily at 3:15 AM UTC (www-data crontab)
15 3 * * * cd /var/www/html/academia_v2 && php bin/console app:warm-related-cache --env=prod >> /var/log/shamra/warm-related-cache.log 2>&1
```

### Phase 3: Related Searches Data (Week 3)
- [ ] Create `related_search_queries` table
- [ ] Build `ComputeRelatedSearchesCommand`
- [ ] Schedule nightly computation
- [ ] Verify data quality with top queries

### Phase 4: Related Searches UI (Week 4)
- [ ] Create `RelatedSearchService`
- [ ] Add API endpoint
- [ ] Build Twig component
- [ ] Add translations (ar/en)
- [ ] A/B test placement

---

## Success Metrics

### MLT Improvements
| Metric | Before | Target |
|--------|--------|--------|
| MLT error rate | ~5% | <1% |
| Related Research shown | ~80% | >95% |
| Avg MLT latency | 200ms | <50ms (cached) |

### Related Searches
| Metric | Target |
|--------|--------|
| Coverage | >60% of queries show suggestions |
| Suggestions per query | 3-5 |
| Click-through rate | >5% |
| User engagement lift | +10% pages/session |

---

## Open Questions

1. **Minimum threshold**: How many co-occurrences before showing? (Proposed: 3)
2. **Real-time vs batch**: Start with nightly batch, add real-time later?
3. **Cross-language**: Show Arabic suggestions for English queries?
4. **Trending queries**: Also show "trending searches" section?

---

## Files to Create/Modify

| File | Action |
|------|--------|
| `src/Entity/RelatedSearchQuery.php` | CREATE |
| `src/Repository/RelatedSearchQueryRepository.php` | CREATE |
| `src/Service/RelatedSearchService.php` | CREATE |
| `src/Command/ComputeRelatedSearchesCommand.php` | CREATE |
| `src/syndex/AcademicBundle/Service/RelatedResearchService.php` | MODIFY - add caching + retry |
| `src/Controller/HomepageController.php` | MODIFY - add API endpoint |
| `templates/components/related_searches.html.twig` | CREATE |
| `templates/filter/list.html.twig` | MODIFY - include related searches |
| `translations/messages.ar.yml` | MODIFY |
| `translations/messages.en.yml` | MODIFY |

---

## References

- [Elasticsearch More Like This](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html)
- [Circuit Breaker Pattern](https://martinfowler.com/bliki/CircuitBreaker.html)
- [Google Related Searches](https://support.google.com/websearch/answer/7220196)
