# Web Content Index Integration Prompt

## Context

You are helping integrate a **web crawler's Elasticsearch index** (`web_content`) with an existing **academic research index** (`arabic_research`). The goal is to **transform and insert** crawled content (OJS articles only) directly into the existing research index.

### Integration Strategy: Direct Insertion

```
┌─────────────────┐                    ┌─────────────────┐
│  web_content    │  ──Transform──►    │ arabic_research │
│  (staging)      │     & Insert       │  (production)   │
└─────────────────┘                    └─────────────────┘
```

**Key Constraints:**
- ⚠️ **DO NOT modify existing `arabic_research` schema** (no field type changes, no renaming)
- ✅ **New fields CAN be added** to `arabic_research` (safe, no reindex required)
- ✅ **Transform** `web_content` documents to match `arabic_research` schema exactly
- ✅ **Insert** transformed documents directly into `arabic_research`

---

## Source Index: `web_content` (Crawler Output)

### Connection Details
- **Host:** `http://localhost:9200` (local development)
- **Index Name:** `web_content`
- **Total Documents:** 4,765
- **Index Size:** ~22.4 MB
- **Document Types:** `ojs_article` only

### Current Schema

```json
{
  "web_content": {
    "mappings": {
      "properties": {
        "abstract": {
          "type": "text",
          "analyzer": "content_analyzer"
        },
        "author": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "authors": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "analyzer": "content_analyzer"
        },
        "content_hash": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "crawl_timestamp": { "type": "date" },
        "creation_date": { "type": "keyword" },
        "depth": { "type": "integer" },
        "document_type": { "type": "keyword" },
        "doi": { "type": "keyword" },
        "domain": { "type": "keyword" },
        "external_links_count": { "type": "integer" },
        "file_name": { "type": "keyword" },
        "headings": {
          "type": "nested",
          "properties": {
            "level": { "type": "keyword" },
            "text": { "type": "text" }
          }
        },
        "internal_links_count": { "type": "integer" },
        "journal_name": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "keywords": { "type": "keyword" },
        "meta_description": {
          "type": "text",
          "analyzer": "content_analyzer"
        },
        "meta_keywords": { "type": "keyword" },
        "page_count": { "type": "integer" },
        "pdf_url": { "type": "keyword" },
        "published_date": { "type": "keyword" },
        "section": { "type": "keyword" },
        "source_page_title": {
          "type": "text",
          "analyzer": "content_analyzer"
        },
        "source_page_url": { "type": "keyword" },
        "subject": {
          "type": "text",
          "analyzer": "content_analyzer"
        },
        "title": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } },
          "analyzer": "content_analyzer"
        },
        "url": { "type": "keyword" },
        "word_count": { "type": "integer" }
      }
    }
  }
}
```

### Sample Document (OJS Academic Article)

```json
{
  "_id": "cadeee9899e4c8a4895c",
  "_source": {
    "url": "https://scholarworks.iu.edu/journals/index.php/josotl/article/view/36689",
    "domain": "scholarworks.iu.edu",
    "title": "Community Mapping: A Strategy for Teaching Transition to Pre-Service Teachers",
    "content": "Transition from high school to adulthood can be one of the most challenging periods in a young person's life. To support students in having positive post-school outcomes, teachers must be adequately prepared when it comes to transition. Community mapping is a transition planning tool grounded in research that can help pre-service teachers learn how to match students' transition needs with available community assets. This article discusses the community mapping strategy and provides guidelines for successful implementation with pre-service teachers.",
    "abstract": "Transition from high school to adulthood can be one of the most challenging periods...",
    "authors": "Dr. Mariya T. Davis, Dr. Christina M. Gushanas, Dr. Ingrid K. Cumming",
    "keywords": ["teacher preparation", "transition", "community mapping", "strategy", "evidence-based practice"],
    "doi": "https://doi.org/10.14434/josotl.v25i4.36689",
    "published_date": "Oct 10, 2025",
    "journal_name": "Vol. 25 No. 4 (2025): Journal of the Scholarship of Teaching and Learning",
    "section": "Quick Hits",
    "document_type": "ojs_article",
    "crawl_timestamp": "2026-02-02T00:44:49.147791+00:00",
    "depth": 1,
    "word_count": 79,
    "content_hash": "cadeee9899e4c8a4895c",
    "pdf_url": "",
    "file_name": "",
    "page_count": 0
  }
}
```

---

## Target Index: `arabic_research` (Existing Academic Database)

### Connection Details
- **Host:** `http://username:password@external_IP:9200`
- **Index Name:** `arabic_research`
- **Purpose:** Academic research papers (Arabic and English)

### Current Schema

```json
{
  "arabic_research": {
    "mappings": {
      "properties": {
        "id": { "type": "keyword" },
        "slug": { "type": "keyword" },
        
        "arabic_full_title": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } },
          "analyzer": "arabic"
        },
        "english_full_title": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } }
        },
        
        "arabic_abstract": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } },
          "analyzer": "arabic"
        },
        "english_abstract": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } }
        },
        
        "content": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } }
        },
        
        "tag": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "tag_id": { "type": "keyword" },
        
        "category_id": { "type": "keyword" },
        "arabic_category_name": { "type": "keyword" },
        "english_category_name": { "type": "keyword" },
        
        "fields_id": { "type": "keyword" },
        "arabic_fields": { "type": "keyword" },
        "english_fields": { "type": "keyword" },
        
        "publisher_id": { "type": "keyword" },
        "arabic_publisher_name": { "type": "keyword" },
        "english_publisher_name": { "type": "keyword" },
        "textualPublisher": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        
        "publication_date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        },
        "createdAt": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        },
        "last_updated_at": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        },
        
        "language": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "document_name": { "type": "keyword" },
        
        "creator_id": { "type": "long" },
        "updater_id": { "type": "long" },
        "deleted": { "type": "boolean" },
        
        "hits": { "type": "long" },
        "downloads": { "type": "long" },
        "cites": { "type": "long" },
        "rate": { "type": "float" },
        "raters": { "type": "long" },
        
        "ai_summary": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } }
        },
        "ai_critique": { "type": "text" },
        "ai_keywords": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "chatGPT": { "type": "object", "enabled": false },
        "chatGPTen": { "type": "object", "enabled": false },
        
        "related_researches": {
          "type": "text",
          "index": false,
          "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } }
        },
        "research_references": {
          "type": "text",
          "index": false,
          "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } }
        }
      }
    }
  }
}
```

---

## Task: Transform `web_content` for Direct Insertion into `arabic_research`

### Requirements

1. **Transform documents** from `web_content` schema to `arabic_research` schema
2. **Generate required fields** (`id`, `slug`, `language`, etc.) during transformation
3. **Detect language** (Arabic vs English) and populate appropriate title/abstract fields
4. **Add source tracking fields** to identify crawled content (`is_external`, `source_url`, etc.)
5. **Insert directly** into `arabic_research` index
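
Requirement 3 (language detection) can be sketched with a simple script-based heuristic. This is an illustrative sketch, not existing project code; the function name and the 50% threshold are assumptions:

```python
import re

# Arabic script block (U+0600–U+06FF)
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")

def detect_language(text: str) -> str:
    """Return 'ar' if most letters are Arabic script, else 'en'."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en"  # no letters at all: default to English
    arabic = sum(1 for c in letters if ARABIC_CHARS.match(c))
    return "ar" if arabic / len(letters) > 0.5 else "en"
```

A ratio-based check handles mixed-language titles better than testing only the first character.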

### Constraints

- ❌ **DO NOT** modify any existing field types or analyzers in `arabic_research`
- ❌ **DO NOT** rename any existing fields in `arabic_research`  
- ❌ **DO NOT** require reindexing of existing `arabic_research` data
- ✅ **SAFE** to add new fields (they will only apply to new documents)

### Field Mapping Strategy

| web_content (current) | arabic_research (target) | Action Required |
|-----------------------|--------------------------|-----------------|
| `title` | `english_full_title` / `arabic_full_title` | Split by language detection |
| `abstract` | `english_abstract` / `arabic_abstract` | Split by language detection |
| `content` | `content` | ✅ Compatible |
| `authors` | `textualPublisher` | Rename or map |
| `keywords` | `tag` | Rename or add alias |
| `doi` | ❌ (new field) | Add `doi` to arabic_research |
| `published_date` | `publication_date` | Parse string → date format |
| `journal_name` | `english_publisher_name` | Map appropriately |
| `document_type` | ❌ (not mapped) | Do not add to `arabic_research`; drop during transformation |
| `url` | ❌ (new field) | Add `source_url` to arabic_research |
| `pdf_url` | ❌ (new field) | Add to arabic_research |
| `domain` | ❌ (new field) | Add `source_domain` to arabic_research |
| `crawl_timestamp` | `createdAt` | Map to existing field |
| ❌ (missing) | `id` | Generate unique ID |
| ❌ (missing) | `slug` | Generate using `uniqid()` (see Slug Generation section) |
| ❌ (missing) | `language` | Detect from content |
| ❌ (missing) | `deleted` | Default to `false` |
| ❌ (missing) | `hits`, `downloads`, `cites` | Default to `0` |
| ❌ (missing) | `rate`, `raters` | Default to `0` |
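
The mapping table above can be sketched as a single transformation function. This is a simplified illustration (the slug stand-in and inline language check are assumptions; see the Slug Generation section for the real `uniqid()`-style slug):

```python
import re
import uuid

def transform(doc: dict) -> dict:
    """Map a web_content document onto the arabic_research schema."""
    is_arabic = bool(re.search(r"[\u0600-\u06FF]", doc.get("title", "")))
    out = {
        "id": doc.get("content_hash") or uuid.uuid4().hex[:20],
        "slug": uuid.uuid4().hex[:13],  # stand-in for PHP uniqid()
        "language": "ar" if is_arabic else "en",
        "content": doc.get("content", ""),
        "textualPublisher": doc.get("authors", ""),
        "tag": doc.get("keywords", []),
        "english_publisher_name": doc.get("journal_name", ""),
        "doi": doc.get("doi", ""),
        "source_url": doc.get("url", ""),
        "pdf_url": doc.get("pdf_url", ""),
        "source_domain": doc.get("domain", ""),
        "is_external": True,
        "deleted": False,
        "hits": 0, "downloads": 0, "cites": 0, "rate": 0.0, "raters": 0,
    }
    # Route title/abstract into the language-specific fields
    title_key = "arabic_full_title" if is_arabic else "english_full_title"
    abstract_key = "arabic_abstract" if is_arabic else "english_abstract"
    out[title_key] = doc.get("title", "")
    out[abstract_key] = doc.get("abstract", "")
    return out
```

Date parsing (`published_date` → `publication_date`) is omitted here and handled separately.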

### New Fields to Add to `web_content`

```json
{
  "id": { "type": "keyword" },
  "slug": { "type": "keyword" },
  "language": { "type": "keyword" },
  "english_full_title": {
    "type": "text",
    "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } }
  },
  "arabic_full_title": {
    "type": "text",
    "fields": { "keyword": { "type": "keyword", "ignore_above": 1000 } },
    "analyzer": "arabic"
  },
  "english_abstract": {
    "type": "text",
    "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } }
  },
  "arabic_abstract": {
    "type": "text",
    "fields": { "keyword": { "type": "keyword", "ignore_above": 10000 } },
    "analyzer": "arabic"
  },
  "source_type": { "type": "keyword" },
  "is_external": { "type": "boolean" },
  "hits": { "type": "long" },
  "downloads": { "type": "long" },
  "deleted": { "type": "boolean" }
}
```

### New Fields to Add to `arabic_research` (Safe - No Reindex Required)

These fields can be safely added to `arabic_research`:
- ✅ They apply only to newly inserted documents
- ✅ Existing documents simply return `null` for them (no impact)
- ✅ No reindexing is required

```json
{
  "doi": { "type": "keyword" },
  "source_url": { "type": "keyword" },
  "pdf_url": { "type": "keyword" },
  "source_domain": { "type": "keyword" },
  "is_external": { "type": "boolean" }
}
```

**To add these fields (run once):**
```bash
curl -X PUT "EXTERNAL_ES_HOST/arabic_research/_mapping" \
  -H "Content-Type: application/json" \
  -d '{
    "properties": {
      "doi": { "type": "keyword" },
      "source_url": { "type": "keyword" },
      "pdf_url": { "type": "keyword" },
      "source_domain": { "type": "keyword" },
      "is_external": { "type": "boolean" }
    }
  }'
```

> **Note:** `word_count` and `page_count` fields are intentionally excluded from `arabic_research` as they are not relevant for the existing research schema.

---

## Expected Output

Please provide:

1. **Transformation script** (Python) to:
   - Read documents from `web_content` index
   - Transform each document to match `arabic_research` schema
   - Generate `id`, `slug`, and detect `language`
   - Parse `published_date` string → `publication_date` date format
   - Map `authors` → `textualPublisher`, `keywords` → `tag`, etc.
   - Insert transformed documents into `arabic_research`
2. **Sample transformed document** showing before/after structure
3. **Deduplication logic** using `content_hash` or `doi` to avoid duplicates
4. **Batch processing** approach for efficient bulk insertion
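
The `published_date` parsing in step 1 might look like the sketch below. The format list is an assumption based on the sample document ("Oct 10, 2025"); extend it to whatever the crawler actually emits:

```python
from datetime import datetime

# Candidate input formats, most specific first (illustrative list)
KNOWN_FORMATS = ("%b %d, %Y", "%B %d, %Y", "%Y-%m-%d", "%Y")

def parse_publication_date(raw):
    """Parse a crawler date string into the arabic_research format
    'yyyy-MM-dd HH:mm:ss'; return None if nothing matches."""
    if not raw:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable dates lets the insert step skip the field rather than fail the document.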

---

## Technical Constraints

- Elasticsearch version: 8.x
- **`arabic_research` index must remain unchanged** (no schema modifications to existing fields)
- New fields can be added to `arabic_research` (safe operation)
- Must preserve `content_hash` for deduplication checks
- Language detection should support Arabic and English
- Date format in `arabic_research`: `yyyy-MM-dd HH:mm:ss`
- Bulk insert for performance (500-1000 docs per batch)
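
The bulk constraint can be met with the official client's `helpers.bulk`, or by building `_bulk` NDJSON bodies directly as sketched here (stdlib only; `batched` and `to_bulk_ndjson` are illustrative names, not existing code):

```python
import json

def batched(docs, size=500):
    """Yield lists of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def to_bulk_ndjson(docs, index="arabic_research"):
    """Serialize one batch into an Elasticsearch _bulk request body:
    alternating action and source lines, newline-terminated."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline
```

Using the document `id` (or `content_hash`) as `_id` makes re-runs idempotent: re-inserting the same document overwrites rather than duplicates.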

---

## Document Types in `web_content`

| Type | Description | Expected Fields |
|------|-------------|-----------------|
| `ojs_article` | Open Journal Systems academic articles | title, abstract, authors, doi, journal_name, keywords |

---

## Slug Generation Mechanism

The existing Shamra Academia project uses a specific slug generation pattern for research entities. **All crawled content must follow this same pattern** for compatibility.

### How Slugs are Generated in Shamra Academia

```php
/**
 * Sets the slug of the research.
 *
 * @ORM\PrePersist
 *
 * @return self
 */
public function setSlug() {
    if($this->slug == null)
        $this->slug = \uniqid();
        
    return $this;
}
```

### Key Characteristics

| Property | Value |
|----------|-------|
| **Method** | PHP `uniqid()` function |
| **Format** | 13-character hexadecimal string |
| **Example** | `65c6e8a3b1234` |
| **Uniqueness** | Based on current time in microseconds |

### Implementation for Crawled Content

**Python equivalent:**
```python
import time
import uuid

def generate_slug():
    """Generate a slug compatible with PHP's uniqid():
    8 hex chars of whole seconds + 5 hex chars of microseconds = 13 chars."""
    now = time.time()
    sec = int(now)
    usec = int((now - sec) * 1_000_000)
    return f"{sec:08x}{usec:05x}"

# Alternative: use uuid for guaranteed uniqueness
def generate_slug_uuid():
    return uuid.uuid4().hex[:13]
```

**Bash/curl equivalent:**
```bash
# Generate slug similar to PHP uniqid()
slug=$(python3 -c "import time; print(format(int(time.time()*1000000), 'x')[:13])")

# Or use date-based hex
slug=$(date +%s%N | md5sum | head -c 13)
```

### Slugger Utility (Alternative for Content-Based Slugs)


```php
class Slugger {
    static function slugify($slug) {
        $slug1 = str_replace('/', " ", $slug);
        $slug2 = str_replace('\\', " ", $slug1);
        return \preg_replace('#[ -]+#', '_', \mb_strtolower(\trim(\strip_tags($slug2)), 'UTF-8'));
    }
}
```

This utility is used for Tags (not Research). **For Research entities, always use `uniqid()` style slugs.**

---

## Questions to Address

1. ~~Should we create a new unified index or use index aliases?~~ **Resolved: Direct insertion into `arabic_research`** - No aliases needed, no new index. Transform and insert directly.
2. How to handle documents without abstracts (HTML pages)?
3. ~~Should `textualPublisher` in `arabic_research` be updated to `authors` for consistency?~~ **Resolved: No** - Keep existing field names. Map `authors` → `textualPublisher` during transformation.
4. ~~How to generate stable `id` and `slug` for crawled content?~~ **Resolved: Use `uniqid()` pattern**
5. Should we add `ai_summary` generation for crawled content?

6. **Important — title deduplication:** Do not insert the same title twice. For each document in `web_content`, query `arabic_research` for that title; if a match exists (identical, or differing only in quotation marks or similar minor characters), skip that document.
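
The fuzzy title match in point 6 can be implemented by normalizing both titles before comparison. The normalization rules below are an assumption about what "simple characters" means; adjust them to the data:

```python
import re
import unicodedata

def normalize_title(title):
    """Normalize a title for duplicate detection: unify Unicode forms,
    drop quotation marks, turn other punctuation into spaces,
    collapse whitespace, and lowercase."""
    t = unicodedata.normalize("NFKC", title)
    t = re.sub(r"[\"'\u2018\u2019\u201C\u201D\u00AB\u00BB]", "", t)  # quotes
    t = re.sub(r"[^\w\s]", " ", t)  # other punctuation -> space
    return re.sub(r"\s+", " ", t).strip().lower()
```

Comparing `normalize_title(crawled)` against normalized `arabic_research` titles (e.g. from a `match` query's candidates) catches duplicates that differ only in quoting or punctuation.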

Expected output:
A Python script in a new folder called `shamra_integration`, configured via a `.env` file, that runs the integration.
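
A possible `.env` layout for the script (variable names are illustrative, not an existing convention):

```ini
SOURCE_ES_HOST=http://localhost:9200
SOURCE_INDEX=web_content
TARGET_ES_HOST=http://username:password@external_IP:9200
TARGET_INDEX=arabic_research
BATCH_SIZE=500
```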