# Plan: Reference Library (`/myreferences`) — Current State & Future Improvements

## TL;DR
The Reference Library is one of Shamra Academia's highest-value features: users upload PDFs, get automatic OCR + ES indexing + AI summarization + translation, and can ground their AI-assisted writing in their own papers. The pipeline works end-to-end but has major opportunities for improvement in **auto-metadata extraction** (eliminate manual entry), **DOI/URL import** (skip the upload entirely), **collaborative libraries**, **annotation/highlights**, **citation generation in multiple formats**, and **semantic search/Q&A over the user's own corpus**. These improvements would transform the library from a storage + summary tool into a full **personal research knowledge base**.

---

## Current State (March 2026)

### Access & Eligibility
- **URL**: `/myreferences` (Vue.js 3 SPA)
- **Who**: Subscribed users only (PlaygroundSubscription or legacy AcademicSubscription; admins bypass)
- **Storage**: 500 MB per user, 50 MB per file, PDF/DOCX only
- **Header nav**: "My References" / "مراجعي"

### Current Pipeline

```
Upload PDF/DOCX ──→ Save file + metadata ──→ [async] Mistral OCR ──→ [async] ES Chunking ──→ [async] AI Summary
       │                                          │                        │                        │
  BibTeX import                              extractedText            user_reference_chunks     Map-Reduce GPT
  (client-side)                              stored on entity          (ES index)              → HTML summary
                                                                           │
                                                              Used as grounding in
                                                              "Write with AI" feature
```

**Async chain**: `ProcessReferenceOcr` → `IndexReferenceChunks` → `SummarizeReference`  
**On-demand**: `TranslateReferenceSummary` (user-triggered)

### Step-by-step breakdown

| Step | What happens | Service / Handler | Cost |
|------|-------------|-------------------|------|
| 1. **Upload** | User uploads PDF + fills metadata (title, authors, year, type, etc.) or pastes BibTeX | `UserReferenceController::api_upload`, `UserReferenceService::uploadReference()` | Free |
| 2. **OCR** | Mistral Document AI extracts text/markdown from PDF (max 30 pages) | `ProcessReferenceOcrHandler` → `MistralOcrService::extractText()` | 2 credits/page |
| 3. **Chunking** | Extracted text split into semantically meaningful chunks (heading-based, 1500 char target, 100 char overlap) | `ReferenceChunkingService` | Free |
| 4. **ES Indexing** | Chunks indexed in `user_reference_chunks` with Arabic + standard analyzers | `IndexReferenceChunksHandler` → `ReferenceIndexingService::indexReference()` | Free |
| 5. **Summarization** | Map-reduce: GPT-4o-mini summarizes each section, then GPT-4o produces a final ~800-word summary (Markdown rendered to HTML) | `SummarizeReferenceHandler` → `ReferenceSummaryService::summarize()` | Credits |
| 6. **Translation** | Summary stripped of HTML → `AcademicTranslationService::translateLargeText()` → re-wrapped with RTL | `TranslateReferenceSummaryHandler` | Credits |
| 7. **AI Grounding** | When user writes with AI and checks "My References", chunks are searched via ES multi_match and injected as grounding context | `PlaygroundAPIController::searchForGeneration()` | Free |
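
Row 3's chunking strategy (heading-based splits, ~1500-character target, 100-character overlap) can be sketched language-agnostically. The backend is PHP, but a minimal Python version illustrates the approach; `chunk_text` is an illustrative name, not the actual `ReferenceChunkingService` API:

```python
import re

def chunk_text(markdown: str, target: int = 1500, overlap: int = 100) -> list[dict]:
    """Split OCR markdown into chunks, preferring heading boundaries."""
    # Split on markdown headings so each chunk stays within one section.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        heading = section.splitlines()[0] if section.startswith("#") else ""
        start = 0
        while start < len(section):
            chunks.append({
                "section_heading": heading.lstrip("# ").strip(),
                "text": section[start:start + target],
            })
            if start + target >= len(section):
                break
            start += target - overlap  # step back to create the overlap
    return chunks
```

Sliding with `target - overlap` means each chunk's tail reappears at the head of the next one, so a sentence cut at a chunk boundary is still searchable in one piece.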

### Entity: `UserReference` (table `user_reference`)
- 30+ fields: file storage, citation metadata, identifiers (DOI/ISBN/ISSN/ArXiv/PMID), extracted text, AI summary, status flags
- Publication types: journal, book, book_chapter, conference, thesis, report, website, other
- Languages: en, ar, de, fr, es, ru, tr, zh, ja, cs, it
- Status tracking: `isProcessed` (OCR), `isIndexed` (ES), `summaryStatus` (pending/processing/completed/failed/translating)
- Soft delete for external references (ArXiv/Shamra imports), hard delete for user uploads

### ES Index: `user_reference_chunks`
- Fields: `text` (arabic_text analyzer + standard), `title`, `section_heading` (multilingual analyzer), `summary`, `reference_id`, `user_id`, `language`
- Search: `multi_match` on `text^3`, `text.standard^2`, `title`, `section_heading` with `fuzziness: AUTO`, filtered by `user_id`
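
The search described above maps to a query body along these lines (shown as a Python dict; the boosts, fields, fuzziness, and `user_id` filter are taken from this section, while the surrounding `bool` envelope is an assumption about how the filter is applied):

```python
def build_chunk_query(user_id: int, query: str) -> dict:
    """Query body for user_reference_chunks, per the field boosts above."""
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["text^3", "text.standard^2",
                                   "title", "section_heading"],
                        "fuzziness": "AUTO",
                    }
                },
                # Hard filter (not scored) so users only ever see their own chunks.
                "filter": [{"term": {"user_id": user_id}}],
            }
        }
    }
```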

### Frontend Features
- Folder sidebar, search, publication type filter, bulk select
- Detail panel with tabs: "AI Summary" (TinyMCE rich editor) + "Document Preview" (PDF viewer / mammoth.js for DOCX)
- Upload modal with drag-and-drop, BibTeX import (client-side regex parser), full metadata form
- OCR status polling (every 8s), summary generation/polling, translation trigger
- Copy summary, edit summary in rich editor, save, download as .docx
- About/landing page at `/myreferences` for non-subscribers

### Current Limitations
1. **Manual metadata entry** — User must type title, authors, year, etc. (or paste BibTeX). No auto-extraction from PDF.
2. **No DOI auto-lookup** — User must enter all fields even when DOI is available.
3. **No URL/link import** — Cannot add a paper by pasting a URL or DOI link.
4. **Single citation format** — Only APA 7th edition. No MLA, Chicago, IEEE, Vancouver, etc.
5. **No annotations** — Cannot highlight, annotate, or comment on specific passages.
6. **No collaboration** — Libraries are private. No sharing, team libraries, or public collections.
7. **No duplicate detection** — Same paper can be uploaded multiple times.
8. **No auto-tagging** — Tags/folders are manual. No AI-based topic classification.
9. **Summarization is one-shot** — Cannot ask follow-up questions or get specific section summaries.
10. **No related paper discovery** — No "find similar papers" or "papers that cite this" feature.
11. **30-page OCR limit** — Longer documents (books, theses) cannot be fully processed.
12. **No reading progress** — No way to track what you've read or mark papers as read/unread.

---

## Future Improvements

### Feature 1: Auto-Extract Metadata from PDF (High Impact, Medium Effort)

**Problem**: Users must manually fill 10+ metadata fields when uploading. This is the biggest friction point — many users upload and leave half the fields empty.

**Solution**: After OCR completes, run a metadata extraction prompt against the first 2 pages of extracted text.

**Implementation**:
1. Add a new async step after OCR: `ExtractReferenceMetadata` message
2. Send first ~3000 chars of extracted text to GPT-4o-mini with a structured extraction prompt
3. Extract: title, authors, year, DOI, journal/conference name, volume, issue, pages, abstract, keywords, language
4. Auto-fill only empty fields (never overwrite user-entered data)
5. Set a `metadataExtracted` flag on the entity
6. Show a "Review extracted metadata" prompt in the UI so users can verify/correct

**Prompt sketch**:
```
Extract bibliographic metadata from this academic paper text. Return JSON:
{title, authors: [], year, doi, publication_name, volume, issue, pages, abstract, keywords: [], language, publication_type}
Only include fields you can confidently extract. Use null for uncertain fields.
```
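
The never-overwrite rule in step 4 is worth pinning down, since it is easy to get wrong. A sketch of the merge (Python for illustration; the real implementation would live in the `ExtractReferenceMetadata` handler, and the function name is hypothetical):

```python
def merge_extracted_metadata(existing: dict, extracted: dict) -> dict:
    """Fill only fields the user left empty; never overwrite user input."""
    merged = dict(existing)
    for field, value in extracted.items():
        if value is None:
            continue  # model was uncertain; the prompt asks for null there
        if merged.get(field) in (None, "", []):
            merged[field] = value
    return merged
```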

**Pipeline change**: `ProcessReferenceOcr` → `ExtractReferenceMetadata` → `IndexReferenceChunks` → `SummarizeReference`

**Effort**: ~2 days. New message + handler + prompt. Minimal credit cost (single short completion).

---

### Feature 2: Import by DOI / URL / ArXiv ID (High Impact, Low Effort)

**Problem**: Users often already have the DOI, ArXiv link, or URL. They shouldn't need to download the PDF and re-upload it.

**Solution**: Add "Import by identifier" input alongside the upload modal.

**Implementation**:
1. Add identifier input field in upload modal: "Paste DOI, ArXiv ID, PubMed ID, or URL"
2. **DOI** → Call CrossRef API (`https://api.crossref.org/works/{doi}`) → auto-fill all metadata
3. **ArXiv** → Call ArXiv API → auto-fill metadata + download PDF from `arxiv.org/pdf/{id}`
4. **PubMed** → Call NCBI E-utilities API → auto-fill metadata
5. **URL** → Fetch page, extract Open Graph / meta citation tags, offer to save as web reference
6. Duplicate check: if DOI already exists in user's library, warn before creating
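
The DOI path (step 2) is mostly response mapping. A sketch against the CrossRef works endpoint, with the mapping kept as a pure function so it can be unit-tested without network access (function names are illustrative; field names follow CrossRef's documented response shape):

```python
import json
import urllib.request

def fetch_crossref(doi: str) -> dict:
    """GET https://api.crossref.org/works/{doi} (network call)."""
    with urllib.request.urlopen(f"https://api.crossref.org/works/{doi}") as r:
        return json.load(r)["message"]

def crossref_to_metadata(message: dict) -> dict:
    """Map a CrossRef 'message' object to UserReference-style fields."""
    issued = message.get("issued", {}).get("date-parts", [[None]])
    return {
        "title": (message.get("title") or [None])[0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in message.get("author", [])],
        "year": issued[0][0],
        "doi": message.get("DOI"),
        "publication_name": (message.get("container-title") or [None])[0],
    }
```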

**New routes**:
- `POST /myreferences/api/import-doi` — resolve DOI and create reference
- `POST /myreferences/api/import-arxiv` — resolve ArXiv ID and download PDF
- `POST /myreferences/api/lookup` — preview metadata before importing

**Effort**: ~3 days. CrossRef API is free and well-documented. ArXiv already partially implemented via `createFromArxiv()`.

---

### Feature 3: Ask AI About a Paper — Conversational Q&A (High Impact, Medium Effort)

**Problem**: Summarization is one-shot. Users often want to ask specific questions: "What methodology did they use?", "What are the limitations?", "How does this compare to X?"

**Solution**: Add a chat interface per reference, grounded in that paper's chunks.

**Implementation**:
1. Add a "Chat" tab alongside "AI Summary" and "Document Preview" in the detail panel
2. On each question, search `user_reference_chunks` filtered by `reference_id` (not user-wide)
3. Inject top-5 relevant chunks as context into GPT-4o prompt
4. Stream the response back to the UI
5. Maintain chat history in the session (no DB persistence initially)
6. Pre-populate with suggested questions: "What is the main contribution?", "What methodology was used?", "What are the key findings?", "What are the limitations?"
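
Steps 2-3 boil down to per-reference retrieval plus prompt assembly. A minimal sketch of the assembly half (Python for illustration; retrieval would reuse the chunk query with an extra `{"term": {"reference_id": ...}}` filter, and the prompt wording here is an assumption):

```python
def build_qa_prompt(question: str, chunks: list[dict], top_k: int = 5) -> str:
    """Assemble a grounded prompt from the top-k chunks of one reference."""
    context = "\n\n".join(
        f"[{c.get('section_heading', '')}] {c['text']}" for c in chunks[:top_k]
    )
    return (
        "Answer using ONLY the excerpts below from the user's paper. "
        "If the answer is not in the excerpts, say so.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
```

Keeping the "answer only from excerpts" instruction in the prompt is what makes this grounded Q&A rather than open-ended chat.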

**Credit cost**: ~1-3 credits per question (small context, short response).

**Effort**: ~3-4 days. Requires new API route, streaming SSE endpoint, and Vue component.

---

### Feature 4: Multiple Citation Formats (Medium Impact, Low Effort)

**Problem**: Only APA 7th edition is supported. Researchers need Chicago, MLA, IEEE, Vancouver, Harvard, etc.

**Solution**: Implement citation formatting for the 6 most common styles.

**Implementation**:
1. Create `CitationFormatterService` with format methods for: APA 7, MLA 9, Chicago 17, IEEE, Vancouver, Harvard
2. Add citation format dropdown in reference detail panel
3. "Copy citation" button with format selector
4. "Export BibTeX" button to re-export the entry (round-trip: import BibTeX → use → export BibTeX)
5. Batch export: select multiple references → export as `.bib` file

**Entity mapping**: Each format has different rules for author names, punctuation, italics, etc. The `UserReference` entity already has all needed fields.
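
To show why per-style logic is needed, here is a sketch of just the author-list rules for two styles (Python for illustration; real APA 7 and IEEE rules have more cases — multiple initials, "et al." cutoffs — so this is deliberately simplified):

```python
def apa_authors(authors: list[tuple[str, str]]) -> str:
    """APA 7 sketch: 'Last, F.' joined with ', ' and '&' before the final
    name. authors is a list of (given, family); single initial only."""
    names = [f"{family}, {given[0]}." for given, family in authors]
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + ", & " + names[-1]

def ieee_authors(authors: list[tuple[str, str]]) -> str:
    """IEEE sketch: 'F. Last' joined with ', ' and 'and' before the final name."""
    names = [f"{given[0]}. {family}" for given, family in authors]
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]
```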

**Effort**: ~2 days. Pure formatting logic, no external APIs.

---

### Feature 5: Smart Duplicate Detection (Medium Impact, Low Effort)

**Problem**: Users can upload the same paper multiple times with no warning. Wastes storage and credits.

**Solution**: Multi-signal duplicate detection before upload completes.

**Implementation**:
1. **DOI match** (exact): Check `UserReferenceRepository::findByDoi()` — already exists
2. **Title similarity** (fuzzy): Normalize the title (lowercase, strip punctuation) and check that the Levenshtein distance is under 10% of the longer title's length, or use ES `match_phrase` with slop
3. **File hash** (exact): Compute SHA-256 of uploaded file, store in new `fileHash` column, check before saving
4. **On duplicate found**: Show modal: "This looks similar to [existing reference]. Do you want to: (a) Skip upload, (b) Upload anyway, (c) Replace the existing one?"
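
Signals 2 and 3 can be sketched as below (Python for illustration; `difflib.SequenceMatcher` stands in for Levenshtein here to keep the sketch dependency-free — the threshold semantics are the same, just expressed as a similarity ratio):

```python
import hashlib
from difflib import SequenceMatcher

def file_hash(data: bytes) -> str:
    """SHA-256 of the uploaded file, for the exact-duplicate check."""
    return hashlib.sha256(data).hexdigest()

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace before comparing."""
    return " ".join("".join(c for c in title.lower()
                            if c.isalnum() or c.isspace()).split())

def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy title match on normalized strings."""
    return SequenceMatcher(None, normalize_title(a),
                           normalize_title(b)).ratio() >= threshold
```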

**Effort**: ~1 day. Add `fileHash` column + pre-upload check endpoint.

---

### Feature 6: AI Auto-Tagging & Smart Folders (Medium Impact, Medium Effort)

**Problem**: Organizing references into folders and adding tags is entirely manual. Most users don't bother.

**Solution**: After summarization, auto-classify papers into topics and suggest tags.

**Implementation**:
1. After `SummarizeReference` completes, dispatch `ClassifyReference` message
2. Send summary + title + abstract to GPT-4o-mini with prompt: "Classify this paper. Return: {topics: [up to 3], methodology: string, field: string}"
3. Store as `autoTags` (JSON column) on the entity
4. Show auto-tags in the UI with option to accept/reject
5. **Smart Folders**: Auto-generate virtual folders from most common auto-tags (e.g., "Machine Learning (8)", "NLP (5)")
6. Add a tag filter in the sidebar alongside folder filter
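
The Smart Folders step (5) is just an aggregation over stored auto-tags. A sketch (Python for illustration; `autoTags` matches the proposed JSON column name, the rest is hypothetical):

```python
from collections import Counter

def smart_folders(references: list[dict], min_count: int = 2) -> list[tuple[str, int]]:
    """Virtual folders from the most common auto-tags, e.g. ('NLP', 5)."""
    counts = Counter(tag for ref in references
                     for tag in ref.get("autoTags", []))
    # Only surface tags shared by several papers; singletons add noise.
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]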

**Effort**: ~2-3 days. New message, handler, UI tag display, filter logic.

---

### Feature 7: Highlights & Annotations (High Impact, High Effort)

**Problem**: Researchers need to highlight passages, write marginalia, and link notes to specific parts of papers.

**Solution**: Add a PDF annotation layer with persistent highlights and notes.

**Implementation**:
1. Integrate `pdf.js` annotation layer (already using pdf.js for preview)
2. New entity: `ReferenceAnnotation` (reference_id, user_id, page, coordinates, color, text_selection, note, created_at)
3. API: CRUD on annotations per reference
4. Annotations stored in DB, rendered as overlay on PDF viewer
5. **Export**: Include annotations in summary download; XFDF export for use in other PDF readers
6. **Search annotations**: Full-text search across all user's annotations
7. **Link to AI**: "Ask AI about this passage" — select text → opens chat with that passage as context
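
The proposed `ReferenceAnnotation` entity (step 2) can be made concrete as a record sketch (Python dataclass for illustration; the coordinate convention — page-relative rects as used by pdf.js — is an assumption to be confirmed during implementation):

```python
from dataclasses import dataclass, asdict

@dataclass
class ReferenceAnnotation:
    """Mirrors the proposed entity fields from step 2."""
    reference_id: int
    user_id: int
    page: int
    coordinates: list   # e.g. [x1, y1, x2, y2] rect(s) on the page
    color: str
    text_selection: str  # the highlighted text, for search/export
    note: str = ""
    created_at: str = ""
```

Storing `text_selection` alongside coordinates is what makes annotation search (step 6) and "Ask AI about this passage" (step 7) possible without re-parsing the PDF.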

**Effort**: ~2 weeks. Significant frontend work with pdf.js annotation API.

---

### Feature 8: Related Paper Discovery (High Impact, Medium Effort)

**Problem**: No way to discover related papers from within the library. Users must go back to search.

**Solution**: Use ES "More Like This" + Shamra index to suggest related papers.

**Implementation**:
1. Add "Find Similar" button on each reference
2. **Within library**: ES MLT query on `user_reference_chunks` using the reference's extracted text
3. **On Shamra**: ES MLT query on `arabic_research` / `english_research` indices using title + abstract
4. **On ArXiv**: ArXiv API search using extracted keywords
5. Show results in a side panel: "Similar in Your Library (3)" + "Discover on Shamra (10)" + "On ArXiv (5)"
6. One-click "Add to Library" for Shamra/ArXiv results
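
Steps 2-3 share one query shape. A sketch of the `more_like_this` body (Python dict for illustration; `min_term_freq` and `max_query_terms` values are illustrative defaults, not tuned settings):

```python
def more_like_this_query(text: str, index_fields: list[str]) -> dict:
    """ES 'more_like_this' body for related-paper lookup."""
    return {
        "query": {
            "more_like_this": {
                "fields": index_fields,     # e.g. ["title", "abstract"]
                "like": text,               # the reference's extracted text
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        }
    }
```

The same builder serves both cases: within-library (against `user_reference_chunks`, plus a `user_id` filter) and discovery (against the Shamra research indices).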

**Effort**: ~3 days. ES MLT already used in recommendation engine; adapt for reference context.

---

### Feature 9: Shared & Team Libraries (Medium Impact, High Effort)

**Problem**: Libraries are completely private. Research teams need shared reference collections.

**Solution**: Allow users to create shared libraries and invite collaborators.

**Implementation**:
1. New entity: `ReferenceCollection` (id, name, description, owner_id, visibility: private/team/public, created_at)
2. New entity: `ReferenceCollectionMember` (collection_id, user_id, role: owner/editor/viewer)
3. New entity: `ReferenceCollectionItem` (collection_id, reference_id, added_by, added_at, notes)
4. Users can create collections, add their references, invite others by email
5. Shared references: viewers see metadata + summary; only owner's storage is consumed
6. **Public collections**: Optional — curated reading lists visible on user profiles
7. **Activity feed**: "User X added 3 papers to 'NLP Research' collection"
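
The owner/editor/viewer roles (step 2) imply a small permission matrix worth writing down before building the UI. A sketch (Python for illustration; the action names are assumptions about what the collection UI will expose):

```python
ROLE_RANK = {"viewer": 0, "editor": 1, "owner": 2}

def can(role: str, action: str) -> bool:
    """Minimal role check: viewers read, editors manage items,
    owners additionally manage members."""
    required = {
        "view": "viewer",
        "add_item": "editor",
        "remove_item": "editor",
        "manage_members": "owner",
    }
    return ROLE_RANK[role] >= ROLE_RANK[required[action]]
```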

**Effort**: ~2 weeks. New entities, permissions system, invitation flow, collection UI.

---

### Feature 10: Reading Progress & Status Tracking (Low Impact, Low Effort)

**Problem**: No way to track which papers you've read, are reading, or plan to read.

**Solution**: Add simple status tracking per reference.

**Implementation**:
1. Add `readingStatus` enum column: `unread`, `reading`, `read`, `to_read`
2. Add `readingProgress` integer column (0-100, for page tracking)
3. Quick-toggle buttons in reference list: 📖 → 📗 → ✅
4. Filter by reading status in sidebar
5. Dashboard widget: "You've read 12 of 45 papers this month"

**Effort**: ~1 day. Simple column addition + UI toggle.

---

### Feature 11: Bulk BibTeX / Zotero / Mendeley Import (High Impact, Medium Effort)

**Problem**: Researchers with existing libraries in Zotero, Mendeley, or EndNote must re-enter everything manually.

**Solution**: Bulk import from standard bibliography formats.

**Implementation**:
1. **BibTeX bulk import**: Upload `.bib` file → parse all entries → create `UserReference` for each (already have client-side BibTeX parser — extend to handle full files with 100+ entries)
2. **RIS import**: Parse `.ris` files (used by most database exports)
3. **Zotero CSV**: Parse Zotero's CSV export format
4. **Progress UI**: "Importing 47 references... (23/47)" with skip/retry on errors
5. **DOI resolution**: For entries with DOIs, optionally fetch PDFs from Unpaywall API (open access only)
6. **Dedup during import**: Check each entry against existing library before creating
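
RIS (step 2) is a line-oriented format: `TAG  - value` lines, with `ER` closing each record. A minimal parser sketch (Python for illustration; real exports also have continuation lines and tag quirks this skips):

```python
def parse_ris(text: str) -> list[dict]:
    """Minimal RIS parser: 'TAG  - value' lines, records end at 'ER'.
    Repeated AU tags are collected into an authors list."""
    records, current = [], {}
    for line in text.splitlines():
        if len(line) < 6 or line[2:6] != "  - ":
            continue  # skip blank and continuation lines in this sketch
        tag, value = line[:2], line[6:].strip()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        elif tag == "AU":
            current.setdefault("authors", []).append(value)
        else:
            current[tag] = value
    return records
```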

**Effort**: ~3-4 days. BibTeX parser exists; add RIS parser + bulk creation endpoint.

---

### Feature 12: Enhanced Summarization Options (Medium Impact, Low Effort)

**Problem**: One-size-fits-all ~800-word summary. Different use cases need different summaries.

**Solution**: Let users choose summary type and depth.

**Implementation**:
1. Summary type selector before generation:
   - **Executive summary** (current default, ~800 words)
   - **Quick overview** (~200 words, key findings only)
   - **Detailed analysis** (~2000 words, methodology + results + limitations)
   - **Literature review paragraph** (~150 words, suitable for insertion into a literature review)
   - **Bullet points** (structured key takeaways)
2. Different reduce prompts for each type (map step stays the same)
3. Store `summaryType` on entity; allow regenerating with different type
4. **Comparative summary**: Select 2-3 papers → generate a comparison table (methods, findings, limitations)
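
Since only the reduce step varies (step 2), the summary types reduce to a small config table. A sketch (Python for illustration; only `summary_reduce_prompt.md` exists today — the other prompt file names and the bullet-point word target are hypothetical):

```python
# Word targets and reduce prompts per summary type.
SUMMARY_TYPES = {
    "executive":  {"words": 800,  "prompt": "summary_reduce_prompt.md"},
    "quick":      {"words": 200,  "prompt": "reduce_quick.md"},
    "detailed":   {"words": 2000, "prompt": "reduce_detailed.md"},
    "lit_review": {"words": 150,  "prompt": "reduce_lit_review.md"},
    "bullets":    {"words": 400,  "prompt": "reduce_bullets.md"},
}

def reduce_config(summary_type: str) -> dict:
    """Fall back to the current default when the type is unknown."""
    return SUMMARY_TYPES.get(summary_type, SUMMARY_TYPES["executive"])
```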

**Effort**: ~2 days. Mostly prompt engineering + UI dropdown.

---

### Feature 13: PDF Full-Text Search Within a Paper (Low Impact, Low Effort)

**Problem**: Users can search across their library but cannot search within a specific paper's text.

**Solution**: Add in-document search using the already-extracted text.

**Implementation**:
1. Add search box in the "Document Preview" tab
2. Search against `extractedText` (client-side for short docs, ES for longer ones)
3. Highlight matching passages in the extracted text view
4. If using PDF viewer, jump to approximate page (map character offset → page)

**Effort**: ~1 day. Client-side text search + highlight.

---

## Priority Matrix

| # | Feature | Impact | Effort | Priority |
|---|---------|--------|--------|----------|
| 1 | Auto-extract metadata from PDF | 🔴 High | Medium (2d) | **P0 — Do first** |
| 2 | Import by DOI / URL / ArXiv ID | 🔴 High | Low (3d) | **P0 — Do first** |
| 5 | Smart duplicate detection | 🟡 Medium | Low (1d) | **P0 — Do first** |
| 11 | Bulk BibTeX / Zotero import | 🔴 High | Medium (4d) | **P1 — Next sprint** |
| 3 | Ask AI about a paper (Q&A chat) | 🔴 High | Medium (4d) | **P1 — Next sprint** |
| 4 | Multiple citation formats | 🟡 Medium | Low (2d) | **P1 — Next sprint** |
| 12 | Enhanced summarization options | 🟡 Medium | Low (2d) | **P1 — Next sprint** |
| 10 | Reading progress tracking | 🟢 Low | Low (1d) | **P2 — Quick win** |
| 13 | In-document full-text search | 🟢 Low | Low (1d) | **P2 — Quick win** |
| 6 | AI auto-tagging & smart folders | 🟡 Medium | Medium (3d) | **P2 — Next quarter** |
| 8 | Related paper discovery | 🔴 High | Medium (3d) | **P2 — Next quarter** |
| 7 | Highlights & annotations | 🔴 High | High (2w) | **P3 — Long-term** |
| 9 | Shared & team libraries | 🟡 Medium | High (2w) | **P3 — Long-term** |

---

## Recommended Implementation Roadmap

### Sprint 1 (Week 1-2): Remove Upload Friction
- **Feature 1**: Auto-extract metadata after OCR
- **Feature 2**: Import by DOI/ArXiv/URL
- **Feature 5**: Duplicate detection
- **Result**: Users go from "upload PDF + fill 12 fields" to "paste DOI → done"

### Sprint 2 (Week 3-4): Deepen the AI Value
- **Feature 3**: Conversational Q&A per paper
- **Feature 12**: Enhanced summarization options
- **Feature 4**: Multiple citation formats
- **Result**: Library becomes an active research tool, not just storage

### Sprint 3 (Week 5-6): Scale the Library
- **Feature 11**: Bulk import from BibTeX/Zotero/RIS
- **Feature 10**: Reading progress tracking
- **Feature 13**: In-document search
- **Result**: Power users can migrate their full existing libraries

### Sprint 4 (Week 7-8): Discovery & Organization
- **Feature 6**: AI auto-tagging
- **Feature 8**: Related paper discovery
- **Result**: Library becomes self-organizing and surfaces connections

### Future: Collaboration
- **Feature 7**: Highlights & annotations
- **Feature 9**: Shared/team libraries
- **Result**: Multi-user research workflows

---

## Key Files Reference

| Component | File |
|-----------|------|
| Controller | `src/Controller/UserReferenceController.php` |
| Entity | `src/Entity/UserReference.php` |
| Repository | `src/Repository/UserReferenceRepository.php` |
| Upload + OCR service | `src/Service/Playground/UserReferenceService.php` |
| Chunking | `src/Service/Playground/ReferenceChunkingService.php` |
| ES indexing | `src/Service/Playground/ReferenceIndexingService.php` |
| Summarization | `src/Service/Playground/ReferenceSummaryService.php` |
| OCR handler | `src/MessageHandler/ProcessReferenceOcrHandler.php` |
| Indexing handler | `src/MessageHandler/IndexReferenceChunksHandler.php` |
| Summary handler | `src/MessageHandler/SummarizeReferenceHandler.php` |
| Translation handler | `src/MessageHandler/TranslateReferenceSummaryHandler.php` |
| Frontend (Vue SPA) | `templates/references/index.html.twig` |
| Landing page | `templates/references/about.html.twig` |
| ES mapping | `shamra_integration/reference_chunks_mapping.json` |
| Summary prompts | `playground_prompts/summary_map_prompt.md`, `summary_reduce_prompt.md` |
| Translations | `translations/UserBundle.en.yml`, `translations/UserBundle.ar.yml` |
| Agent docs | `agents/user_reference_library.md` |
