# ArXiv Search Algorithm for Write with AI Feature

## Overview

This document describes the algorithm for integrating ArXiv (arxiv.org) as a research source in the "Write with AI" feature. ArXiv is an open-access repository of electronic preprints and postprints covering physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           User Query (Arabic/English)                        │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Language Detection & Translation                      │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │ 1. Detect if query is Arabic (using regex pattern for Arabic characters) ││
│  │ 2. If Arabic: Translate to English using Azure OpenAI                    ││
│  │ 3. Extract key search terms from translated/original query               ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           ArXiv API Search                                    │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │ API Endpoint: http://export.arxiv.org/api/query                         ││
│  │                                                                          ││
│  │ Parameters:                                                              ││
│  │ - search_query: all:{keywords} or ti:{keywords} for title search        ││
│  │ - start: 0 (pagination offset)                                          ││
│  │ - max_results: 10-20 (configurable)                                     ││
│  │ - sortBy: relevance | lastUpdatedDate | submittedDate                   ││
│  │ - sortOrder: descending (default for most recent)                       ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Parse ATOM/XML Response                               │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │ Extract from each <entry>:                                               ││
│  │ - id: ArXiv paper ID (e.g., http://arxiv.org/abs/hep-ex/0307015)        ││
│  │ - title: Paper title                                                     ││
│  │ - summary: Abstract/summary text                                         ││
│  │ - author(s): Author name(s)                                              ││
│  │ - published: Publication date                                            ││
│  │ - updated: Last update date                                              ││
│  │ - arxiv:primary_category: Subject category                               ││
│  │ - link[rel="related"]: PDF link                                          ││
│  │ - arxiv:journal_ref: Journal reference (if available)                    ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    Full Article Content Extraction                            │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │ Option 1: Use Abstract Only (Fast, Always Available)                     ││
│  │ - The <summary> tag contains the full abstract                           ││
│  │ - Sufficient for most grounding purposes                                 ││
│  │                                                                          ││
│  │ Option 2: PDF Extraction (Slower, More Complete)                         ││
│  │ - Download PDF from arxiv.org/pdf/{id}                                   ││
│  │ - Extract text using PDF parser (not recommended due to complexity)      ││
│  │                                                                          ││
│  │ Option 3: ArXiv HTML5 Papers (When Available)                            ││
│  │ - Some papers have HTML versions at arxiv.org/html/{id}                  ││
│  │ - Parse HTML to extract full text                                        ││
│  │                                                                          ││
│  │ RECOMMENDED: Use Abstract (Option 1) as primary content                  ││
│  │ - Abstracts are high-quality summaries written by authors                ││
│  │ - Contains key findings and methodology                                  ││
│  │ - Fast and reliable extraction                                           ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      LLM Relevance Scoring & Extraction                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │ For each ArXiv result, use Azure OpenAI to:                              ││
│  │                                                                          ││
│  │ 1. RELEVANCE SCORING:                                                    ││
│  │    Prompt: "Rate the relevance of this paper to the query '{query}'     ││
│  │             on a scale of 0-100. Paper: {title} - {summary}"            ││
│  │    Output: Numeric score for ranking                                     ││
│  │                                                                          ││
│  │ 2. KEY INFORMATION EXTRACTION:                                           ││
│  │    Prompt: "Extract the most useful information from this paper         ││
│  │             abstract that relates to '{query}'. Focus on:                ││
│  │             - Key findings and conclusions                               ││
│  │             - Methodology insights                                       ││
│  │             - Relevant statistics or data points                         ││
│  │             Keep the extraction concise (2-3 sentences max)."           ││
│  │    Output: Extracted relevant content for grounding                      ││
│  │                                                                          ││
│  │ 3. CITATION FORMATTING:                                                  ││
│  │    Format: {Authors} ({Year}). {Title}. arXiv:{id}. {journal_ref}       ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Return Structured Results                             │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │ [                                                                        ││
│  │   {                                                                      ││
│  │     "id": "arxiv:hep-ex/0307015",                                        ││
│  │     "source": "arxiv",                                                   ││
│  │     "title": "Paper Title",                                              ││
│  │     "authors": ["Author 1", "Author 2"],                                 ││
│  │     "year": 2003,                                                        ││
│  │     "abstract": "Full abstract text...",                                 ││
│  │     "extracted_content": "LLM-extracted relevant content...",            ││
│  │     "relevance_score": 85,                                               ││
│  │     "url": "https://arxiv.org/abs/hep-ex/0307015",                       ││
│  │     "pdf_url": "https://arxiv.org/pdf/hep-ex/0307015",                   ││
│  │     "category": "hep-ex",                                                ││
│  │     "journal_ref": "Eur.Phys.J. C31 (2003) 17-29",                       ││
│  │     "citation": "H1 Collaboration (2003). Multi-Electron..."            ││
│  │   },                                                                     ││
│  │   ...                                                                    ││
│  │ ]                                                                        ││
│  └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
```

## Implementation Details

### 1. Language Detection & Translation

```php
class ArxivSearchService
{
    /**
     * Detect if text contains Arabic characters
     */
    private function isArabic(string $text): bool
    {
        return preg_match('/[\x{0600}-\x{06FF}]/u', $text) === 1;
    }

    /**
     * Translate Arabic query to English using Azure OpenAI
     */
    private function translateToEnglish(string $arabicQuery): string
    {
        $prompt = "Translate the following Arabic academic query to English. 
                   Preserve technical terms and academic vocabulary.
                   Only output the translation, nothing else.
                   
                   Arabic: {$arabicQuery}";
        
        return $this->azureOpenAIService->complete($prompt);
    }
}
```

### 2. ArXiv API Query Construction

```php
/**
 * Build ArXiv API query URL
 * 
 * Search query syntax:
 * - all: Search all fields
 * - ti: Search title only
 * - au: Search author
 * - abs: Search abstract
 * - co: Search comments
 * - jr: Search journal reference
 * - cat: Search category
 * - rn: Search report number
 * 
 * Boolean operators: AND, OR, ANDNOT
 * Grouping: Use parentheses
 * Exact phrases: Use quotes
 */
private function buildArxivQuery(string $keywords, int $maxResults = 10): string
{
    $baseUrl = 'http://export.arxiv.org/api/query';
    
    // http_build_query() URL-encodes every value, so the query is passed in unencoded
    $params = [
        'search_query' => "all:{$keywords}",
        'start' => 0,
        'max_results' => $maxResults,
        'sortBy' => 'relevance',
        'sortOrder' => 'descending'
    ];
    
    return $baseUrl . '?' . http_build_query($params);
}
```
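The field prefixes, boolean operators, and quoted phrases listed in the comment above can be combined into a single `search_query` value. The following is a minimal sketch, assuming a hypothetical helper named `buildFieldedQuery()` that is not part of the original service; it only illustrates how fielded terms might be joined before being handed to `http_build_query()`.

```php
/**
 * Hypothetical helper (not part of the original service): join fielded
 * terms into one ArXiv search_query string with a boolean operator.
 *
 * @param array<string, string> $terms Map of field prefix => search text
 */
function buildFieldedQuery(array $terms, string $operator = 'AND'): string
{
    $allowedFields = ['all', 'ti', 'au', 'abs', 'co', 'jr', 'cat', 'rn'];
    $allowedOperators = ['AND', 'OR', 'ANDNOT'];

    if (!in_array($operator, $allowedOperators, true)) {
        throw new InvalidArgumentException("Unsupported operator: {$operator}");
    }

    $parts = [];
    foreach ($terms as $field => $text) {
        if (!in_array($field, $allowedFields, true)) {
            throw new InvalidArgumentException("Unsupported field: {$field}");
        }
        // Drop embedded double quotes so a phrase cannot be broken open
        $text = trim(str_replace('"', '', $text));
        if ($text === '') {
            continue;
        }
        // Multi-word text is quoted so it is matched as an exact phrase
        $parts[] = strpos($text, ' ') !== false
            ? "{$field}:\"{$text}\""
            : "{$field}:{$text}";
    }

    return implode(" {$operator} ", $parts);
}

$query = buildFieldedQuery(['ti' => 'quantum computing', 'cat' => 'cs.AI']);
// $query is: ti:"quantum computing" AND cat:cs.AI

// http_build_query() percent-encodes the value; no manual urlencode() is needed
$url = 'http://export.arxiv.org/api/query?' . http_build_query([
    'search_query' => $query,
    'start' => 0,
    'max_results' => 10,
]);
```

Restricting the field and operator to fixed lists keeps user-supplied text confined to the search terms themselves.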

### 3. XML Response Parsing

```php
/**
 * Parse ArXiv ATOM XML response
 */
private function parseArxivResponse(string $xmlResponse): array
{
    $results = [];
    
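    // simplexml_load_string() returns false on malformed XML
    $xml = simplexml_load_string($xmlResponse);
    if ($xml === false) {
        return ['total' => 0, 'results' => []];
    }

    // Register namespaces
    $xml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
    $xml->registerXPathNamespace('arxiv', 'http://arxiv.org/schemas/atom');
    $xml->registerXPathNamespace('opensearch', 'http://a9.com/-/spec/opensearch/1.1/');
    
    // Get total results (0 if the element is missing)
    $totalNodes = $xml->xpath('//opensearch:totalResults');
    $totalResults = empty($totalNodes) ? 0 : (int) $totalNodes[0];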
    
    // Parse each entry
    foreach ($xml->entry as $entry) {
        $arxivNs = $entry->children('http://arxiv.org/schemas/atom');
        
        // Extract ArXiv ID from the full URL and drop the version suffix (e.g. "v1")
        $fullId = (string) $entry->id;
        $arxivId = preg_replace('#^https?://arxiv\.org/abs/#', '', $fullId);
        $arxivId = preg_replace('/v\d+$/', '', $arxivId);
        
        // Get PDF link
        $pdfUrl = '';
        foreach ($entry->link as $link) {
            if ((string) $link['title'] === 'pdf') {
                $pdfUrl = (string) $link['href'];
                break;
            }
        }
        
        // Get authors
        $authors = [];
        foreach ($entry->author as $author) {
            $authors[] = (string) $author->name;
        }
        
        // Extract year from published date (ISO 8601, so the year is the first four characters)
        $publishedDate = (string) $entry->published;
        $year = (int) substr($publishedDate, 0, 4);
        
        $results[] = [
            'id' => 'arxiv:' . $arxivId,
            'source' => 'arxiv',
            'title' => trim((string) $entry->title),
            'abstract' => trim((string) $entry->summary),
            'authors' => $authors,
            'year' => $year,
            'published' => $publishedDate,
            'updated' => (string) $entry->updated,
            'url' => $fullId,
            'pdf_url' => $pdfUrl,
            'category' => (string) $arxivNs->primary_category['term'],
            'journal_ref' => (string) $arxivNs->journal_ref,
            'comment' => (string) $arxivNs->comment
        ];
    }
    
    return [
        'total' => $totalResults,
        'results' => $results
    ];
}
```

### 4. LLM-Based Content Extraction

```php
/**
 * Use LLM to extract most relevant content from abstract
 */
private function extractRelevantContent(string $query, array $paper): array
{
    $prompt = <<<PROMPT
You are an academic research assistant. Analyze this paper abstract and extract the most useful information related to the user's query.

USER QUERY: {$query}

PAPER TITLE: {$paper['title']}

PAPER ABSTRACT:
{$paper['abstract']}

INSTRUCTIONS:
1. Rate the relevance of this paper to the query on a scale of 0-100
2. Extract 2-3 key sentences from the abstract that are most relevant to the query
3. Identify the main contribution or finding

Respond in JSON format:
{
    "relevance_score": <0-100>,
    "extracted_content": "<2-3 relevant sentences>",
    "main_contribution": "<one sentence summary>"
}
PROMPT;

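    $response = $this->azureOpenAIService->complete($prompt);
    $decoded = json_decode($response, true);

    // Fall back to the raw abstract if the LLM did not return valid JSON
    if (!is_array($decoded)) {
        return [
            'relevance_score' => 0,
            'extracted_content' => $paper['abstract'],
            'main_contribution' => ''
        ];
    }

    return $decoded;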
}
```

### 5. Citation Formatting

```php
/**
 * Format ArXiv citation in academic style
 */
private function formatCitation(array $paper): string
{
    $authors = implode(', ', array_slice($paper['authors'], 0, 3));
    if (count($paper['authors']) > 3) {
        $authors .= ' et al.';
    }
    
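    // $paper['id'] already carries the "arxiv:" prefix, so strip it to avoid "arXiv:arxiv:..."
    $arxivId = preg_replace('/^arxiv:/i', '', $paper['id']);
    
    $citation = "{$authors} ({$paper['year']}). {$paper['title']}. ";
    $citation .= "arXiv:{$arxivId}. ";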
    
    if (!empty($paper['journal_ref'])) {
        $citation .= "Published in: {$paper['journal_ref']}";
    }
    
    return $citation;
}
```

## API Rate Limiting

The ArXiv API has rate-limiting policies:
- **Recommended**: No more than 1 request per 3 seconds
- **Burst limit**: Short bursts of up to 4 requests are tolerated
- **User-Agent**: Should include contact email for heavy usage

Implementation:
```php
// Add delay between requests
private function rateLimitedRequest(string $url): string
{
    static $lastRequestTime = 0;
    
    $elapsed = microtime(true) - $lastRequestTime;
    if ($elapsed < 3.0) {
        // usleep() expects an integer number of microseconds
        usleep((int) ceil((3.0 - $elapsed) * 1000000));
    }
    
    $context = stream_context_create([
        'http' => [
            'header' => 'User-Agent: ShamraAcademia/1.0 (contact@shamra.net)'
        ]
    ]);
    
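    $response = file_get_contents($url, false, $context);
    $lastRequestTime = microtime(true);
    
    // file_get_contents() returns false on failure; throw so the caller can retry
    if ($response === false) {
        throw new RuntimeException("ArXiv request failed: {$url}");
    }
    
    return $response;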
}
```

## Error Handling

1. **Network Errors**: Retry with exponential backoff (max 3 retries)
2. **Empty Results**: Return empty array with appropriate message
3. **XML Parse Errors**: Log error and skip malformed entries
4. **LLM Errors**: Fall back to using raw abstract without extraction
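
The retry policy in item 1 can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper named `retryWithBackoff()` and assuming that network failures surface as `RuntimeException`; the helper name, the base delay, and the doubling factor are illustrative choices, not taken from the existing code.

```php
/**
 * Hypothetical helper: call $operation and retry on RuntimeException,
 * doubling the delay after every failed attempt.
 */
function retryWithBackoff(callable $operation, int $maxRetries = 3, int $baseDelayMs = 1000)
{
    $attempt = 0;
    while (true) {
        try {
            return $operation();
        } catch (RuntimeException $e) {
            if ($attempt >= $maxRetries) {
                throw $e; // Retries exhausted: surface the last error
            }
            // Delay grows as base, 2x base, 4x base, ...
            usleep($baseDelayMs * (2 ** $attempt) * 1000);
            $attempt++;
        }
    }
}

// Usage: a stand-in operation that fails twice, then succeeds
$calls = 0;
$result = retryWithBackoff(function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('Simulated network error');
    }
    return '<feed/>';
}, 3, 10);
```

With `$maxRetries = 3` the operation runs at most four times: the initial attempt plus three retries.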

## Performance Optimization

1. **Parallel Processing**: Process LLM extraction in batches
2. **Caching**: Cache ArXiv responses for 24 hours (papers don't change frequently)
3. **Limit Results**: Default to 10 results, max 20 for performance
4. **Early Termination**: Stop if enough high-relevance papers found
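
The 24-hour cache in item 2 could look like the sketch below. It assumes a hypothetical `cachedFetch()` helper backed by plain files; the function name, the file naming scheme, and the choice of a file cache over the application's own cache layer are all assumptions made for illustration.

```php
/**
 * Hypothetical file-based cache keyed by request URL (TTL defaults to 24 hours).
 * $fetch is any callable that takes the URL and returns the response body.
 */
function cachedFetch(string $url, callable $fetch, string $cacheDir, int $ttlSeconds = 86400): string
{
    $path = rtrim($cacheDir, '/') . '/arxiv_' . sha1($url) . '.xml';

    // Serve from cache while the file is younger than the TTL
    clearstatcache(true, $path);
    if (is_file($path) && (time() - filemtime($path)) < $ttlSeconds) {
        $cached = file_get_contents($path);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache miss or expired entry: fetch and store the fresh response
    $body = $fetch($url);
    file_put_contents($path, $body, LOCK_EX);

    return $body;
}

// Usage with a stub fetcher that counts how often it is called
$cacheDir = sys_get_temp_dir() . '/arxiv_cache_' . uniqid();
mkdir($cacheDir);

$fetchCount = 0;
$fetch = function (string $url) use (&$fetchCount): string {
    $fetchCount++;
    return '<feed/>';
};

$url = 'http://export.arxiv.org/api/query?search_query=all:electron';
$first = cachedFetch($url, $fetch, $cacheDir);
$second = cachedFetch($url, $fetch, $cacheDir); // Served from the cache file
```

Keying on the full URL means a change to `max_results` or `sortBy` yields a separate cache entry.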

## Integration with Existing System

The ArXiv search will be integrated alongside the existing Shamra search:

```php
// In PlaygroundRAGService
public function searchSources(string $query, array $options): array
{
    $results = [];
    
    // Search Shamra (existing)
    if ($options['searchShamra'] ?? true) {
        $results['shamra'] = $this->searchShamra($query, $options['language']);
    }
    
    // Search ArXiv (new)
    if ($options['searchArxiv'] ?? false) {
        $results['arxiv'] = $this->arxivService->search($query);
    }
    
    // Merge and sort by relevance
    return $this->mergeAndRankResults($results);
}
```
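
`mergeAndRankResults()` is referenced above but not defined in this document. One possible shape is sketched below, assuming each source yields a flat list of results and that every result carries a `relevance_score` on the same 0-100 scale; both assumptions would need to hold for the Shamra results as well.

```php
/**
 * Hypothetical sketch of mergeAndRankResults(): flatten the per-source
 * lists and sort by relevance_score, highest first.
 *
 * @param array<string, array<int, array>> $resultsBySource
 */
function mergeAndRankResults(array $resultsBySource): array
{
    $merged = [];
    foreach ($resultsBySource as $source => $items) {
        foreach ($items as $item) {
            // Tag each result with its source unless it is already set
            $item['source'] = $item['source'] ?? $source;
            $merged[] = $item;
        }
    }

    // Results without a score are treated as score 0
    usort($merged, function (array $a, array $b): int {
        return ($b['relevance_score'] ?? 0) <=> ($a['relevance_score'] ?? 0);
    });

    return $merged;
}

$ranked = mergeAndRankResults([
    'shamra' => [
        ['title' => 'A', 'relevance_score' => 70],
    ],
    'arxiv' => [
        ['title' => 'B', 'relevance_score' => 85, 'source' => 'arxiv'],
        ['title' => 'C', 'relevance_score' => 40],
    ],
]);
// Titles in $ranked, in order: B, A, C
```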

## UI Integration

Add checkboxes to the Write with AI sources panel:
- [x] Search Shamra Academia (default: checked)
- [ ] Search ArXiv (default: unchecked)

When ArXiv is selected:
- Show ArXiv results with distinctive styling (different background color)
- Display category badge (e.g., "cs.AI", "hep-ex")
- Show "View on ArXiv" and "View PDF" links

## Security Considerations

1. **Input Sanitization**: Escape user input before adding to API URL
2. **Response Validation**: Validate XML structure before parsing
3. **Content Filtering**: LLM extraction provides natural content filtering
4. **No Authentication**: ArXiv API is public, no keys needed
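
Item 1 can be illustrated with a small sanitizer applied to the keywords before they are placed in `search_query`. This is a sketch under assumptions: `sanitizeKeywords()` is a hypothetical name, the 200-character cap is arbitrary, and the `mbstring` extension is assumed to be available.

```php
/**
 * Hypothetical helper: neutralize characters that have meaning in the
 * ArXiv query syntax before the keywords are placed in search_query.
 */
function sanitizeKeywords(string $input, int $maxLength = 200): string
{
    // Replace control characters and query-syntax characters (quotes,
    // parentheses, colon) with spaces so they cannot alter the query structure
    $clean = preg_replace('/[\x00-\x1F\x7F"():]/u', ' ', $input);
    if ($clean === null) {
        return ''; // preg_replace() returns null on invalid UTF-8 input
    }

    // Collapse runs of whitespace, then cap the length
    $clean = trim(preg_replace('/\s+/u', ' ', $clean));

    return rtrim(mb_substr($clean, 0, $maxLength));
}

$keywords = sanitizeKeywords("\"quantum\" (computing): \n survey");
// $keywords is: quantum computing survey

// URL encoding is still left to http_build_query()
$queryString = http_build_query(['search_query' => "all:{$keywords}"]);
```

The sanitizer does not touch the words `AND`, `OR`, or `ANDNOT`; whether those should be neutralized in user input is a separate design decision.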