# Prod Stability Follow-up: Homepage tab latency + indexing cron guard

Date: 2026-03-08

## What happened

- Production became intermittently non-responsive.
- Apache hit `MaxRequestWorkers` saturation and queued requests.
- A root cron-launched indexing process (`academiaelastic/main.py --indexArabicResearch`) consumed high CPU at peak time.
- This amplified request backlog and caused external timeouts.

## Immediate mitigations applied

- Restarted Apache to clear worker backlog.
- Stopped active indexing process for now.
- Temporarily disabled the two heavy root cron jobs by commenting them with marker `TEMP_DISABLED_20260308`:
  - `30 */8 * * * ... --indexArabicResearch`
  - `30 */12 * * * ... --indexArabicResearchv1`
- Confirmed no active connections to old ES host (`20.30.123.130:9200`).
- Confirmed active ES traffic points to new host (`20.106.250.185:9200`).

## Why this matters

- Running heavy reindex jobs during active traffic windows can starve Apache workers.
- Current `mpm_prefork` cap (`MaxRequestWorkers 75`) is vulnerable to spikes when expensive requests overlap with CPU-heavy background jobs.
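
A rough way to judge whether a cap of 75 leaves enough headroom is to size it from memory: divide the RAM left after other services by the average resident size of one prefork child. The sketch below shows that arithmetic only. The function name and every figure in it are hypothetical placeholders, not measurements from this host.

```python
def estimate_max_request_workers(total_ram_mb, reserved_mb, avg_worker_rss_mb):
    """Largest worker count whose combined RSS fits in the RAM left after the reservation."""
    if avg_worker_rss_mb <= 0:
        raise ValueError("avg_worker_rss_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    # Floor division: a partial worker does not fit. Never return a negative count.
    return max(0, int(available_mb // avg_worker_rss_mb))


# Hypothetical figures, for illustration only: an 8 GiB host, 2 GiB kept back
# for the OS and background jobs, about 60 MB resident per Apache child.
print(estimate_max_request_workers(8192, 2048, 60))  # -> 102
```

The reservation is the part that matters for this incident: if a CPU- and memory-heavy indexing job shares the host, its footprint belongs in `reserved_mb`.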

## Follow-up tasks

1. Homepage tabs performance investigation (`/recent`, `/interests`, etc.)
   - Map exact request path (frontend tab click → AJAX endpoint → controller → DB/ES query).
   - Add timing instrumentation around tab handlers and query calls.
   - Apply targeted query/index/cache optimizations.
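
The timing instrumentation step could follow a pattern like the one below: wrap the whole handler and each query call in a timer and log the elapsed time under a stable label. This is a language-neutral sketch written in Python, not the controller code. The names `timed` and `handle_tab` and the label format are assumptions made for illustration.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("homepage.timing")


@contextmanager
def timed(label, sink=None):
    """Log how long the wrapped block took; optionally record it in `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.info("%s took %.1f ms", label, elapsed_ms)
        if sink is not None:
            sink.append((label, elapsed_ms))


def handle_tab(tab_type, run_query, sink=None):
    """Hypothetical tab handler: times the query separately from the total."""
    with timed(f"tab.{tab_type}.total", sink):
        with timed(f"tab.{tab_type}.query", sink):
            rows = run_query()
        return rows
```

Timing the query separately from the total makes it possible to tell slow Elasticsearch or DB calls apart from time spent waiting for a worker or rendering.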

## Homepage tabs investigation (2026-03-08)

- Endpoint path confirmed: `/en/homepage/tab/{type}` via route `academia_homepage_tab`.
- Controller path: `HomepageController::homepageTabAction` with handlers:
   - `getInterestsResults()`
   - `getTrendingResults()`
   - `getRecentResults()`

### Observed production timings (local-to-server curl)

- First cold run (same minute):
   - interests: ~10.8s
   - trending: ~20.8s
   - recent: ~28.7s
- Warm run right after:
   - interests: ~2.5s
   - trending: ~1.6s
   - recent: ~0.5s

### Root cause found in code

- `getInterestsResults()` skipped cache whenever `recently_visited_research` session list was non-empty.
- This forced repeated Elasticsearch work for users with browsing history, creating intermittent long waits.

### Fix applied

- Updated `HomepageController` to:
   - Always reuse cached interests results when available.
   - Filter recently visited slugs in memory after cache retrieval.
   - Keep caching of interests results active regardless of visited list state.
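
The shape of the fix can be sketched as follows: the cache lookup no longer depends on the visited list, and the visited slugs are removed only after the cached results are in hand. This is an illustrative Python sketch, not the actual PHP in `HomepageController`. The cache class, the key format, and the `slug` field name are assumptions.

```python
class InterestsCache:
    """Tiny in-memory stand-in for the real cache backend (hypothetical)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def get_interests_results(user_id, visited_slugs, cache, fetch_from_es):
    """Return interests results minus recently visited items, using the cache first."""
    key = f"interests:{user_id}"          # hypothetical cache key format
    results = cache.get(key)
    if results is None:
        # Cache miss: do the expensive search once and store the UNFILTERED list,
        # so the entry stays valid however the visited list changes later.
        results = fetch_from_es(user_id)
        cache.set(key, results)
    visited = set(visited_slugs)
    # Filtering happens in memory after retrieval, so it never forces a new query.
    return [item for item in results if item["slug"] not in visited]
```

Storing the unfiltered list is the key design choice: the visited list changes on every page view, while the underlying interests results do not.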

### Notes

- ES host connectivity to new cluster is healthy from prod (`20.106.250.185:9200`).
- Cold-start latency is still visible on some tab combinations. Next steps: add per-tab execution timing logs and consider an optional cache-warming strategy.

## Browser waterfall findings (user capture)

- Slow requests were not limited to tab endpoints.
- Same timeline showed long waits on background poll APIs:
   - `/api/chat/unread-count`
   - `/api/notifications/count`
   - `/api/gamification/new-badges`
- Some background calls took ~20–55s. Requests that long can overlap with tab requests and hold Apache worker slots for their full duration.

## Client-side stabilization applied (2026-03-08)

- Added in-flight guards to prevent overlapping polling calls in:
   - `templates/header.html.twig` (`unread-count`, `notifications/count`, `notifications/recent`)
   - `templates/base.html.twig` (`new-badges` checker)
- Added client timeouts/abort behavior:
   - Fetch polling: 8–10s abort timeout
   - XHR badge polling: 10s timeout
- Added tab-request cancellation on fast tab switching in:
   - `src/syndex/AcademicBundle/Resources/views/Default/homepage_new.html.twig`

These changes reduce request pile-up during transient backend slowness and lower the risk of worker saturation from overlapping background polls.
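
The in-flight guard and timeout pattern is independent of the client technology. The sketch below shows it with Python's `asyncio` rather than the browser `fetch`/XHR code in the templates: a poll is skipped if the previous one has not finished, and a poll that exceeds the timeout is cancelled. The class name and return values are assumptions for illustration.

```python
import asyncio


class GuardedPoller:
    """Runs one poll at a time and gives up on polls that exceed a timeout."""

    def __init__(self, fetch, timeout_s):
        self._fetch = fetch            # coroutine function that performs the request
        self._timeout_s = timeout_s
        self._in_flight = False

    async def poll(self):
        if self._in_flight:
            return "skipped"           # previous poll still running: do not stack another
        self._in_flight = True
        try:
            return await asyncio.wait_for(self._fetch(), timeout=self._timeout_s)
        except asyncio.TimeoutError:
            return "timed_out"         # wait_for has already cancelled the slow request
        finally:
            self._in_flight = False    # always release the guard, even on errors


async def demo():
    async def slow_fetch():
        await asyncio.sleep(0.2)       # stands in for a slow backend response
        return "ok"

    poller = GuardedPoller(slow_fetch, timeout_s=1.0)
    first = asyncio.create_task(poller.poll())
    await asyncio.sleep(0)             # let the first poll start
    second = await poller.poll()       # arrives while the first is still in flight
    return await first, second


print(asyncio.run(demo()))
```

Releasing the guard in `finally` matters: without it, a single failed or timed-out poll would block all later polls.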

## Follow-up tasks (continued)

2. Cron hardening
   - Re-enable indexing jobs only after:
     - off-peak schedule,
     - `nice`/`ionice` throttling,
     - overlap guard (lockfile/flock),
     - execution timeout.

3. Apache capacity tuning
   - Reassess `MaxRequestWorkers`/`ServerLimit` with memory-based sizing.
   - Keep `KeepAliveTimeout` low (already tuned to 2s).

4. Guardrails
   - Add a lightweight deploy/ops check to detect and alert when:
     - web queue/backlog grows,
     - worker saturation appears in logs,
     - indexing jobs run during blocked windows.
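
For the cron hardening item, the overlap guard, timeout, and CPU throttling can be combined in a small wrapper. The following is a minimal sketch assuming a Linux host and a single-threaded wrapper process. The function name, lock path, niceness value, and the exit code 124 (borrowed from the GNU `timeout` convention) are assumptions. I/O throttling via `ionice` is not shown.

```python
import fcntl
import os
import subprocess


def run_guarded(cmd, lock_path, timeout_s, niceness=10):
    """Run cmd unless another run holds the lock.

    Returns the command's exit code, 124 if it exceeded timeout_s,
    or None if the run was skipped because the lock was already held.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail immediately instead of queueing runs.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        try:
            # Lower the child's CPU priority before it starts executing.
            proc = subprocess.run(
                cmd, timeout=timeout_s, preexec_fn=lambda: os.nice(niceness)
            )
            return proc.returncode
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child at this point.
            return 124
    # Leaving the `with` block closes the file, which releases the lock.


# Hypothetical usage (lock path and timeout are placeholders):
# run_guarded(["python3", "main.py", "--indexArabicResearch"],
#             "/tmp/index_arabic_research.lock", timeout_s=3600)
```

Because the lock is tied to the open file, it is released automatically if the wrapper crashes, so a stale lockfile cannot block future runs.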

## Success criteria

- Homepage tab endpoints consistently respond in <2s at normal load.
- No `AH00161 MaxRequestWorkers` events during peak hours.
- Indexing jobs complete off-peak without impacting user traffic.
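
The `AH00161` criterion can be checked mechanically by scanning the Apache error log for that code within peak hours. The sketch below assumes the default Apache 2.4 error log timestamp format and an English locale for day and month names. The peak window is a hypothetical parameter, since peak hours are not defined here.

```python
import re
from datetime import datetime

# Matches lines that start with a bracketed timestamp and mention AH00161.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\].*\bAH00161\b")


def saturation_events(lines, peak_start_hour, peak_end_hour):
    """Return timestamps of AH00161 events with peak_start_hour <= hour < peak_end_hour."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        # Default Apache 2.4 format, e.g. "Sun Mar 08 14:02:11.123456 2026".
        ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S.%f %Y")
        if peak_start_hour <= ts.hour < peak_end_hour:
            events.append(ts)
    return events


# Illustrative log lines, not taken from the production server.
sample = [
    "[Sun Mar 08 14:02:11.123456 2026] [mpm_prefork:error] [pid 1234] "
    "AH00161: server reached MaxRequestWorkers setting, "
    "consider raising the MaxRequestWorkers setting",
    "[Sun Mar 08 03:10:00.000001 2026] [mpm_prefork:error] [pid 1234] "
    "AH00161: server reached MaxRequestWorkers setting, "
    "consider raising the MaxRequestWorkers setting",
    "[Sun Mar 08 14:05:00.000001 2026] [core:notice] [pid 1234] "
    "AH00094: Command line: '/usr/sbin/apache2'",
]
print(len(saturation_events(sample, peak_start_hour=9, peak_end_hour=18)))  # -> 1
```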
