Semiconductor AI Agent — Interactive Pipeline

🔄 Weekly Prediction Pipeline

🗃️ Data Sources

📄 Research Report PDFs

📰 weekly_news_raw

💹 stock_prices

📅 earnings_calendar

▼

⚙️ PDF Extraction

report_extractions (GPT-4.1 extraction)

▼

📥 Data Loading (data_loader.py)

load_street_consensus: 90-day lookback, dedup per bank

load_week_reports: current + 3-week context

load_week_news + filter_news (GPT-4.1 filter)

load_week_prices + 4-week trend + 52w range

load_trailing_pe: TTM from last 4Q actuals
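
The trailing-PE anchor can be sketched as follows, assuming "TTM from last 4Q actuals" means price divided by the sum of the last four reported quarterly EPS values. The function name `trailing_pe` and its arguments are hypothetical and are not taken from data_loader.py.

```python
from typing import Optional, Sequence


def trailing_pe(price: float, quarterly_eps: Sequence[float]) -> Optional[float]:
    """Price / TTM EPS, where TTM EPS is the sum of the last four quarterly actuals."""
    if len(quarterly_eps) < 4:
        return None  # not enough reported quarters to build a TTM figure
    ttm_eps = sum(quarterly_eps[-4:])
    if ttm_eps <= 0:
        # Assumption: a PE multiple is treated as undefined for zero or negative earnings.
        return None
    return price / ttm_eps
```

Returning `None` rather than raising keeps the caller free to skip a ticker for the week; how the real loader handles missing quarters is not stated in the source.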

▼

📝 Prompt Assembly

system.md: Valuation framework

+

weekly_delta.md: All data injected

▼

🤖 GPT-4.1 (via LiteLLM)

JSON Judgment: eps_deviation · pe_adjustment · signal

▼

🧮 Code Computes

AI_EPS = Consensus × (1 + dev%)
AI_PE = Trailing × (1 + adj%)
PT = AI_EPS × AI_PE

▼

💾 Output

weekly_analyst_journal

→

weekly_portfolio_snapshot

📊 Comparison Only

bloomberg_consensus: Joined in backtest API for display

⚠️ Bloomberg Clarification

Bloomberg data is NOT used in the weekly prediction pipeline.

The bloomberg_consensus table is only LEFT JOIN'd in weekly_backtest.py L226 for the backtest comparison UI. "Street Consensus" in our prompt = average of our PDF-extracted bank reports.

NOT IN PIPELINE

📄 Consensus Source

load_street_consensus() queries report_extractions with 90-day lookback, deduplicates by keeping latest per bank, averages FY2 EPS and target prices in USD.

Source: report_extractions table, data_loader.py:334
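
The consensus logic described above (90-day lookback, latest report per bank, simple average) can be sketched in plain Python. This is an illustration only: the row keys `bank`, `report_date`, `fy2_eps`, and `target_price_usd` are assumed names, and the real function queries report_extractions rather than taking a list.

```python
from datetime import date, timedelta
from statistics import mean


def street_consensus(rows, as_of, lookback_days=90):
    """Average FY2 EPS and USD target price over the latest report per bank."""
    cutoff = as_of - timedelta(days=lookback_days)
    latest = {}  # bank -> most recent row inside the lookback window
    for row in rows:
        if not (cutoff <= row["report_date"] <= as_of):
            continue
        prev = latest.get(row["bank"])
        if prev is None or row["report_date"] > prev["report_date"]:
            latest[row["bank"]] = row
    if not latest:
        return None  # no bank reported inside the window
    return {
        "n_banks": len(latest),
        "fy2_eps": mean(r["fy2_eps"] for r in latest.values()),
        "target_price_usd": mean(r["target_price_usd"] for r in latest.values()),
    }
```

Deduplicating before averaging keeps a bank that publishes often from dominating the consensus.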

🧮 Anchored Deviation Valuation

Market Anchors

Street Consensus EPS (avg of bank reports)

Trailing PE (Price / TTM EPS)

→

AI Judgment

eps_deviation_pct (±15% cap)

pe_adjustment_pct (±20% cap)

→

Code Computes

AI_EPS = Consensus × (1 + dev%)

AI_PE = Trailing × (1 + adj%)

PT = AI_EPS × AI_PE

Expected Return = (PT / Price − 1) × 100

🎯 Identity Property

When deviations = 0%: PT = Consensus_EPS × (Price/Consensus_EPS) = Price → Expected Return = 0% → Signal = hold. This is the default most weeks.
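
The arithmetic above, including the zero-deviation identity, can be sketched as follows. The function names are hypothetical, and enforcing the ±15% / ±20% caps by clamping in code is an assumption (the caps could equally be enforced in the prompt or the JSON schema).

```python
def clamp(value: float, limit: float) -> float:
    """Restrict value to the range [-limit, +limit]."""
    return max(-limit, min(limit, value))


def anchored_valuation(consensus_eps, trailing_pe, price,
                       eps_deviation_pct, pe_adjustment_pct):
    """Apply the model's percentage judgments to the two market anchors."""
    dev = clamp(eps_deviation_pct, 15.0)   # ±15% cap on the EPS deviation
    adj = clamp(pe_adjustment_pct, 20.0)   # ±20% cap on the PE adjustment
    ai_eps = consensus_eps * (1 + dev / 100)
    ai_pe = trailing_pe * (1 + adj / 100)
    price_target = ai_eps * ai_pe
    expected_return_pct = (price_target / price - 1) * 100
    return {
        "ai_eps": ai_eps,
        "ai_pe": ai_pe,
        "price_target": price_target,
        "expected_return_pct": expected_return_pct,
    }
```

With both judgments at 0% and a PE anchor of Price / EPS anchor, as the identity above writes it, the price target collapses to the current price and the expected return is 0%.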

📐 Error Amplification

PT_error ≈ EPS_error + PE_error + cross-term, where the cross-term is EPS_error × PE_error because PT is the product of the two estimates. INTC cross-term max = 54.2%. PE is 2-4× more volatile than EPS (CoV 10-42% vs 3-18%).
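
Because PT is a product, relative errors in EPS and PE combine as (1 + a)(1 + b) − 1 = a + b + ab. A small sketch of that decomposition, with errors expressed as fractions (the function name is hypothetical):

```python
def pt_relative_error(eps_error: float, pe_error: float):
    """Return (total PT error, cross-term) for relative EPS and PE errors given as fractions."""
    cross_term = eps_error * pe_error
    total = eps_error + pe_error + cross_term
    return total, cross_term


# Example: EPS off by +10% and PE off by +20% gives a PT error of +32%,
# of which 2 points come from the cross-term.
total, cross = pt_relative_error(0.10, 0.20)
```

The cross-term is small when both errors are small and grows quickly when both are large, which is why the more volatile PE leg dominates the price-target error.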

🏗️ Thematic Purity V3 — Revenue Disaggregation + AI Alignment Pipeline

Based on MSCI/Bloomberg methodology + ASC 606/IFRS 15 revenue disaggregation + AI-driven alignment scoring. Data is theme-agnostic (extract once, score for any theme).

84+ Companies

625+ Sub-Segments

Business Types

Pipeline Steps

Themes

~$0.80 Total AI Cost

Phase 1 Data Extraction — Theme-agnostic · Run once · All themes reuse

STEP 1

Segment Extraction

ASC 280 / IFRS 8

Annual report PDF → Gemini 2.5 Flash
Extract L1 reportable segments
Revenue %, amounts, description
Key products & end markets

Output Tables

annual_report_extractions
extracted_segments_v2

🤖 Gemini Flash × N

STEP 2 ⭐

Revenue Disaggregation

ASC 606 / IFRS 15

Same PDF, second Gemini call
End market breakdown
Product type breakdown
Geography breakdown
Cross-check with L1 (±5%)

Output Table

extracted_sub_segments

🤖 Gemini Flash × N

STEP 2.1 🆕

Financial Notes Extraction

ASC 360 / ASC 718 / ASC 350 / IAS 38

Same PDF, third Gemini call
PP&E breakdown — land, buildings, equipment, CIP
SBC allocation — R&D vs Sales vs G&A
Intangible assets — patents, technology, goodwill, customer relationships
Net amounts, useful lives, validation status

Output Tables

extracted_ppe_breakdown
extracted_sbc_allocation
extracted_intangibles_breakdown

🤖 Gemini Flash × N

STEP 3

Business Type Normalization

GICS-like classification

Classify L2 sub-segments (priority)
Fallback to L1 if no L2 data
Standard industry categories
e.g. "AI & HPC Semiconductors"

Output

business_type field

🤖 GPT × 1 call

Phase 2 Theme Scoring — Theme-specific · Per theme · AI-driven alignment

STEP 4 ⭐ CHANGED

AI Per-Segment Alignment Scoring

Gemini 2.5 Flash · Per company · With financial context

NEW: AI scores EACH segment's alignment (0-100%)
Input: segments + business_types + financial notes context (PP&E, SBC, Intangibles)
AI determines: layer (CORE/ENABLER/ADJACENT) + alignment_pct (0-100)
Replaces fixed factors (1.0/0.5/0.25) with granular per-segment scoring
Financial notes provide context for alignment judgment

Output

ai_layer + alignment_pct per segment
stored on extracted_segments_v2 / extracted_sub_segments

🤖 Gemini Flash × N companies

STEP 5 CHANGED

Score Calculation

Mechanical · Uses AI alignment_pct

If L2 exists → use L2 granularity
If no L2 → fallback to L1
NEW formula: Score = Σ(revenue_pct × AI alignment_pct / 100)
OLD formula: Score = Σ(revenue_pct × fixed layer_factor)
Score range: 0~100
MSCI comparison (human_score) for gap analysis

Output Table

thematic_scores (ai_score + human_score + gap)

🟢 Mechanical (post AI alignment)

🧮 Scoring Formula — V3 AI Alignment

Thematic_Score(company, theme) = Σ over segments ( revenue_pct × AI_alignment_pct / 100 )

Where: alignment_pct = AI judgment per segment (0-100), not fixed factors

Example: NVDA: "Data Center" 89.7% × 95/100 + "Auto" 1.1% × 85/100 + "Gaming" 7.4% × 0/100 ≈ 86.2

Score 0 = zero connection, Score 100 = pure-play
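
A minimal sketch of the scoring formula, reproducing the NVDA example above. The dictionary keys and both function names are assumptions for illustration; the L2-then-L1 fallback follows the rule stated in Step 5.

```python
def segments_for_scoring(l1_segments, l2_sub_segments):
    """Use L2 sub-segments when any exist, otherwise fall back to L1 segments."""
    return l2_sub_segments if l2_sub_segments else l1_segments


def thematic_score(segments) -> float:
    """Sum of revenue_pct weighted by the per-segment alignment_pct (both 0-100)."""
    return sum(s["revenue_pct"] * s["alignment_pct"] / 100 for s in segments)


nvda_l1 = [
    {"name": "Data Center", "revenue_pct": 89.7, "alignment_pct": 95},
    {"name": "Auto", "revenue_pct": 1.1, "alignment_pct": 85},
    {"name": "Gaming", "revenue_pct": 7.4, "alignment_pct": 0},
]
score = thematic_score(segments_for_scoring(nvda_l1, []))  # about 86.15
```

The listed segments cover 98.2% of revenue, so the remaining 1.8% contributes nothing to the score in this example.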

🔄 V2 → V3 Change

V2 (old): Fixed factors per layer — CORE=1.0, ENABLER=0.5, ADJACENT=0.25. Same factor for all segments in a layer.
V3 (new): AI scores each segment individually with alignment_pct (0-100%). A CORE segment might get 95% or 70% depending on actual relevance. Financial notes (PP&E, SBC, Intangibles) provide additional context for AI judgment.
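
The difference between the two versions can be seen by scoring the same segments both ways. The segments below are made-up illustration data, not extracted figures, and treating a segment with no recognized layer as factor 0 in V2 is an assumption.

```python
# V2 fixed factors per layer, as listed above.
LAYER_FACTOR_V2 = {"CORE": 1.0, "ENABLER": 0.5, "ADJACENT": 0.25}


def score_v2(segments) -> float:
    """V2: every segment in a layer gets the same fixed factor."""
    return sum(s["revenue_pct"] * LAYER_FACTOR_V2.get(s["layer"], 0.0) for s in segments)


def score_v3(segments) -> float:
    """V3: each segment carries its own alignment_pct (0-100)."""
    return sum(s["revenue_pct"] * s["alignment_pct"] / 100 for s in segments)


# Hypothetical company with two CORE segments of different actual relevance.
example = [
    {"layer": "CORE", "revenue_pct": 60.0, "alignment_pct": 95},
    {"layer": "CORE", "revenue_pct": 40.0, "alignment_pct": 70},
]
v2, v3 = score_v2(example), score_v3(example)  # 100.0 under V2, 85.0 under V3
```

V2 cannot distinguish the two CORE segments, while V3 discounts the weaker one.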

📊 Financial Evidence Integration (Step 2.1)

Financial notes data provides supporting evidence for thematic alignment. Extracted from annual report and displayed as insight cards on the company detail page.

🏭 PP&E — Capital Investment

ASC 360

Land, buildings, equipment, CIP
Gross & net amounts
Auto-insight: manufacturing vs fabless

UI Dimension

Capital Commitment KPI card

🧪 SBC — Talent Allocation

ASC 718

R&D vs Sales vs G&A breakdown
% allocation per department
Auto-insight: innovation vs commercial culture

UI Dimension

R&D Intensity KPI card

💡 Intangibles — IP Moat

ASC 350 / IAS 38

Patents, technology, goodwill
Customer relationships, licenses
Auto-insight: organic IP vs acquisition growth

UI Dimension

IP Portfolio KPI card

💾 Database Schema

annual_report_extractions

security, fiscal_year, report_url,
segment_classification, total_revenue,
extraction_model, extracted_at

extracted_segments_v2

extraction_id → FK,
segment_name, revenue_pct,
revenue_amount, description,
business_type, ai_layer, alignment_pct

extracted_sub_segments

extraction_id → FK, parent_segment,
sub_segment_name, revenue_pct,
breakdown_type, business_type,
ai_layer, alignment_pct

extracted_ppe_breakdown 🆕

extraction_id → FK,
category_name, gross_amount,
net_amount, net_numeric,
currency, source_page

extracted_sbc_allocation 🆕

extraction_id → FK,
department, sbc_amount,
sbc_pct, currency,
source_page

extracted_intangibles_breakdown 🆕

extraction_id → FK,
category_name, gross_amount,
net_amount, net_numeric,
useful_life_years, validation_status

thematic_runs

id, theme_name,
run_mode, scoring_method,
created_at

thematic_scores

security, run_id → FK,
ai_score, human_score,
scoring_granularity (L1/L2),
scoring_rationale, status

🧠 Pipeline Prompts — Flow & Actual Instructions

Each step's prompt is shown below with its data flow.

Step 1: Segment Extraction (step1_unified.py → Gemini Flash)

Annual Report PDF (S3 / Orbit API) → EXTRACTION_PROMPT (+ PDF binary) → Gemini 2.5 Flash (JSON response) → Validate (sum≈100%), store_extraction() → annual_report_extractions, extracted_segments_v2

EXTRACTION_PROMPT

You are a financial data extraction specialist. Extract ALL business segments
from this annual report. DO NOT make any thematic judgments.

## Instructions
1. Determine segment classification: "business" / "geographic" / "single"
2. ONLY extract ASC 280 / IFRS 8 reportable segments.
3. For EACH segment extract:
   - segment_name, segment_name_original (if non-English)
   - revenue_amount (with currency), revenue_pct (% of total)
   - description: 1-2 sentences QUOTED from report
   - source_page, source_quote
   - key_products, end_markets

## Rules
- Extract ONLY what the report states. Do not infer or estimate.
- Include ALL operating segments, even small ones.
- EXCLUDE "Eliminations", "Corporate/Other" — only operating segments.
- Revenue percentages should sum to ~100%.

## Output: JSON { segments: [...], total_revenue_amount, fiscal_year_end }
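
The "sum≈100%" validation in the flow above can be sketched as below. The tolerance is not stated in the source, so the 5-point default is an assumption, as are the function name and the exact spelling of the excluded labels.

```python
# Non-operating rows the prompt says to exclude (matched case-insensitively here).
EXCLUDED_NAMES = {"eliminations", "corporate/other"}


def validate_segments(segments, tolerance_pct: float = 5.0):
    """Check that operating-segment revenue percentages sum to roughly 100%."""
    operating = [
        s for s in segments
        if s["segment_name"].strip().lower() not in EXCLUDED_NAMES
    ]
    total = sum(s["revenue_pct"] for s in operating)
    return {
        "n_segments": len(operating),
        "total_pct": total,
        "ok": abs(total - 100.0) <= tolerance_pct,
    }
```

A failed check is a signal to re-run or manually review the extraction rather than to rescale the percentages, since the prompt forbids inferred values.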

Step 2: Revenue Disaggregation (step2_disaggregation.py → Gemini Flash)

Same PDF (+ L1 segments context) → DISAGGREGATION_PROMPT (L1 JSON injected) → Gemini 2.5 Flash (JSON response) → Cross-validate L2↔L1 (±5% tolerance) → extracted_sub_segments (end_market / product_type / geography)

build_disaggregation_prompt(l1_segments)

You are a financial data extraction specialist focused on revenue disaggregation.

## Context: Known L1 Segments (ASC 280)
{l1_context}  ← injected from Step 1 output

## Task: Search ENTIRE annual report for MORE GRANULAR revenue breakdown tables:
- Notes → "Revenue Disaggregation" / "Disaggregation of Revenue"
- MD&A → "Revenue by End Market" / "Revenue by Product Line"

## Extraction Rules
1. Classify each breakdown: "end_market" / "product_type" / "geography" / "application" / "timing"
2. revenue_pct: ALWAYS as % of TOTAL COMPANY revenue (not % of parent segment)
   Example: "Software" = 50% of DI, DI = 22.5% → pct = 11.25%
3. Record source_page, source_quote

## Cross-Validation
- Complete breakdowns: sub-items ≈ parent total (±5%)
- Partial breakdowns (is_partial=true): sum < parent is expected

## Output: JSON { breakdowns: [...], qualitative_insights, cross_validation }
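
Two rules in this prompt are easy to get wrong: converting a share of the parent segment into a share of total company revenue, and the L2↔L1 cross-check. A sketch of both follows, with hypothetical function names and with the ±5% tolerance read as percentage points of total revenue (the source does not say whether it is absolute or relative).

```python
def to_company_pct(pct_of_parent: float, parent_pct_of_company: float) -> float:
    """Convert '% of parent segment' into '% of total company revenue'."""
    return pct_of_parent * parent_pct_of_company / 100


def cross_validate(parent_pct: float, sub_pcts, is_partial: bool = False,
                   tolerance_pct: float = 5.0) -> bool:
    """Complete breakdowns must sum to the parent; partial ones may fall short."""
    total = sum(sub_pcts)
    if is_partial:
        # A partial breakdown is expected to sum below the parent total.
        return total <= parent_pct + tolerance_pct
    return abs(total - parent_pct) <= tolerance_pct


# The prompt's own example: "Software" is 50% of DI, and DI is 22.5% of the company.
software_pct = to_company_pct(50.0, 22.5)  # 11.25
```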

Step 2.1 🆕: Financial Notes Extraction (step2_1_financial_notes.py → Gemini Flash)

Same PDF (Annual Report) → FINANCIAL_NOTES_PROMPT (3 extraction targets) → Gemini 2.5 Flash (JSON response) → Validate & Parse (Normalize amounts) → 3 tables (ppe / sbc / intangibles)

FINANCIAL_NOTES_PROMPT

Extract the following financial note disclosures from this annual report:

## 1. PP&E Breakdown (ASC 360)
Find the Property, Plant & Equipment schedule. Extract each category:
- category_name (land, buildings, equipment, CIP, etc.)
- gross_amount, accumulated_depreciation, net_amount
- currency, source_page

## 2. SBC Allocation (ASC 718)
Find Stock-Based Compensation by function/department:
- department (R&D, Sales & Marketing, G&A)
- sbc_amount, sbc_pct (% of total SBC)
- currency, source_page

## 3. Intangible Assets (ASC 350 / IAS 38)
Find Intangible Assets schedule (excluding or including Goodwill):
- category_name (technology, patents, customer relationships, etc.)
- gross_amount, accumulated_amortization, net_amount
- useful_life_years, validation_status
- currency, source_page

## Rules
- Extract ONLY what the report states
- Include source_page for every item
- Use original currency from report
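
The "Validate & Parse" stage for these notes is not spelled out in the source. One plausible pair of checks, offered as an assumption rather than the pipeline's actual logic, is that net equals gross minus accumulated depreciation or amortization, and that SBC percentages are derived from the department amounts:

```python
def net_is_consistent(gross: float, accumulated: float, net: float,
                      rel_tol: float = 0.01) -> bool:
    """True when gross - accumulated matches net within rel_tol of gross (hypothetical check)."""
    return abs((gross - accumulated) - net) <= rel_tol * max(abs(gross), 1.0)


def sbc_percentages(amounts_by_department: dict) -> dict:
    """Express each department's SBC as a percentage of total SBC."""
    total = sum(amounts_by_department.values())
    if total <= 0:
        return {}  # nothing to allocate
    return {dept: amount * 100 / total for dept, amount in amounts_by_department.items()}
```

The small relative tolerance allows for rounding in reported figures, which are usually stated in millions.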

Step 3: Business Type Normalization (step3_classify.py → GPT-4o, structured)

All L2 sub-segments (fallback: L1) → CLASSIFY_PROMPT (+ 84 predefined types) → GPT-4o (JSON Schema enforced) → business_type field (on each segment)

CLASSIFY_PROMPT

You are classifying business segments into STANDARD industry categories.

## IMPORTANT RULES
1. You MUST pick from the pre-defined list below. DO NOT invent new types.
2. If no type fits, use the CLOSEST match from the list.
3. Only use "OTHER: <desc>" if truly nothing matches (<5%).
4. Similar businesses across companies MUST get the SAME business_type.
5. You MUST classify EVERY segment — do NOT skip any.

## Pre-defined Business Types (~84 categories)
{type_list}  ← e.g. "AI & HPC Semiconductors", "Semiconductor Foundry", etc.

## Segments to Classify ({segment_count} total)
{segments_json}

Return JSON { classifications: [{ segment_id, business_type, rationale }] }
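
Rules 1, 3, and 5 of this prompt can be verified mechanically after the model responds. The checker below is a sketch under that assumption; the function name and message format are hypothetical, and the source does not say whether step3_classify.py performs such a check.

```python
def check_classifications(classifications, segment_ids, allowed_types):
    """Return a list of problems; an empty list means the response passed."""
    allowed = set(allowed_types)
    problems = []
    seen = set()
    for item in classifications:
        seen.add(item["segment_id"])
        business_type = item["business_type"]
        # Rules 1 and 3: must come from the list, or use the "OTHER: <desc>" escape hatch.
        if business_type not in allowed and not business_type.startswith("OTHER:"):
            problems.append(f"{item['segment_id']}: unknown type {business_type!r}")
    # Rule 5: every segment must be classified.
    for segment_id in segment_ids:
        if segment_id not in seen:
            problems.append(f"{segment_id}: not classified")
    return problems
```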

## Business Types to Evaluate ({type_count} total)
{types_json}

Return JSON { mappings: [{ business_type, layer, category_id, rationale }] }

Signal ↓ \ Conviction →	High	Medium	Low
strong_buy	5	4	3
buy	4	3	2
hold	0	0	0
sell	−2	−1	−1
strong_sell	−3	−2	−1
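
The matrix above is a plain lookup. A sketch as a nested dictionary follows; the lowercase key spellings and the function name are assumptions, and the actual implementation in portfolio_backtest.py may be structured differently.

```python
# Signal x conviction -> composite score, transcribed from the matrix above.
SCORE_MATRIX = {
    "strong_buy":  {"high": 5,  "medium": 4,  "low": 3},
    "buy":         {"high": 4,  "medium": 3,  "low": 2},
    "hold":        {"high": 0,  "medium": 0,  "low": 0},
    "sell":        {"high": -2, "medium": -1, "low": -1},
    "strong_sell": {"high": -3, "medium": -2, "low": -1},
}


def composite_score(signal: str, conviction: str) -> int:
    """Look up the composite score; raises KeyError on an unknown signal or conviction."""
    return SCORE_MATRIX[signal][conviction]
```

Note that a medium-conviction strong_buy and a high-conviction buy both score 4, so the score alone does not preserve the original signal.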

Ticker	Signal	Conviction	Score	Selected (weight)
TSM	strong_buy	med	4	✅ 20%
AVGO	buy	high	4	✅ 20%
MU	buy	med	3	✅ 20%
WDC	buy	med	3	✅ 20%
ADI	buy	med	3	✅ 20%
NVDA	buy	med	3	❌ (rank 6)
ASML	buy	med	3	❌ (rank 7)
INTC…AMD	hold	med	0	❌
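
The selection rule (score > 0, rank, take the top 5, equal weight) can be sketched as below. The tie-break among equal scores is not stated in the source; this sketch keeps input order via a stable sort, which is an assumption, as is equal-weighting fewer than five names when fewer qualify.

```python
def select_top_n(scored, n: int = 5):
    """Return {ticker: weight} for the n highest positive scores, equally weighted."""
    eligible = [(ticker, score) for ticker, score in scored if score > 0]
    # sorted() is stable, so tickers with equal scores keep their input order.
    ranked = sorted(eligible, key=lambda pair: pair[1], reverse=True)
    picks = ranked[:n]
    if not picks:
        return {}  # nothing qualifies; the portfolio holds no positions
    weight = 1 / len(picks)
    return {ticker: weight for ticker, _ in picks}


# Scores from the table above, in table order.
week = [("TSM", 4), ("AVGO", 4), ("MU", 3), ("WDC", 3), ("ADI", 3),
        ("NVDA", 3), ("ASML", 3), ("INTC", 0), ("AMD", 0)]
portfolio = select_top_n(week)
```

With the table's ordering as input, NVDA and ASML fall out at ranks 6 and 7 despite sharing a score of 3 with three selected names, which shows how much the unstated tie-break matters.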

Strategy	Method	Cumulative Return	Sharpe	Max Drawdown	Win Rate
🤖 AI Top 5	Score >0, rank top 5, equal weight	+273.7%	4.20	-15.2%	76%
🔀 Score Weighted	All buy/strong_buy, score-weighted	+207.7%	3.94	-15.2%	69%
🏦 Bank Consensus	Bullish ratio weighted, all stocks	+114.7%	2.47	-22.5%	69%
⚖️ Equal Weight	1/15 each, rebalance weekly	+111.4%	2.46	-22.0%	69%
💰 MktCap Weighted	By market cap, rebalance weekly	+71.6%	1.78	-21.1%	67%
📦 Buy & Hold	1/15 at start, no rebalance	+126.1%	2.63	-21.9%	71%
🏦 BBG Top 5	Top 5 by BBG consensus upside	-22.3%	-0.49	-32.4%	41%
📈 QQQ Benchmark	NASDAQ-100 ETF	+25.7%	1.24	-12.0%	53%
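
The four metrics in this table have standard definitions on a series of weekly returns. The sketch below uses those textbook forms; the annualization factor of 52, the zero risk-free rate, and counting a "win" as a strictly positive week are assumptions, since the source does not state how the backtest computes them.

```python
import math
from statistics import mean, stdev


def cumulative_return(returns) -> float:
    """Compound a sequence of periodic returns (as fractions)."""
    growth = 1.0
    for r in returns:
        growth *= 1 + r
    return growth - 1


def max_drawdown(returns) -> float:
    """Largest peak-to-trough decline of the equity curve, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1)
    return worst


def sharpe_ratio(returns, periods_per_year: int = 52) -> float:
    """Annualized mean / sample std of returns, risk-free rate taken as zero."""
    sd = stdev(returns)  # needs at least two observations
    return mean(returns) / sd * math.sqrt(periods_per_year) if sd > 0 else float("nan")


def win_rate(returns) -> float:
    """Share of periods with a strictly positive return."""
    return sum(r > 0 for r in returns) / len(returns)
```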

📈 Strategy Allocation

🏆 Score Matrix

📊 Risk Metrics

📈 Strategy Cumulative Returns (Real Data: Mar 2025 – Feb 2026)

✅ AI Top 5 Leads

📊 vs Benchmarks

❌ Expected Return Quintile Analysis — Magnitude Not Predictive

❌ Key Finding

⚠️ Expected Return Distribution — Clustered Near Zero

⚠️ Implication

⚠️ EPS vs PE Coefficient of Variation — PE Drives All Error

⚠️ Implication

🤖 vs 🏦 AI Price Target vs Bloomberg Consensus

📊 Per-Stock MAPE — AI Wins 14 of 15

🤖 AI Average MAPE

🏦 Bloomberg Average MAPE

📐 Systematic Bias — AI Near Zero, BBG Persistently Bullish

🔍 Interpretation

📅 Latest Week Snapshot (Mar 8, 2026) — AI vs BBG Upside

🤖 AI Closer: 13/15 stocks

⚠️ Outliers: MU & TSM

🧮 Strategy Methodology — How AI Top 5 Works

Step 1️⃣ — Anchored Deviation Valuation

🔑 Identity Property

Step 2️⃣ — Signal × Conviction = Composite Score

💻 Source: portfolio_backtest.py:39-55

Step 3️⃣ — Top 5 Selection Algorithm

Algorithm (executed every Friday):

⚡ Edge Cases

📋 Live Example — Week 51 (Mar 8, 2026):

Step 4️⃣ — Why Equal Weight? The Quintile Evidence

✅ Ranking Has Power

⚠️ Magnitude Has No Power

Step 5️⃣ — Seven Strategies Head-to-Head

📐 Key Insight: AI Top 5 vs BBG Top 5 is the Killer Comparison

❓ Honest FAQ — Common Meeting Questions

Q1: "AI has lower MAPE than Bloomberg — does that mean AI is better?"

✅ The Comparison IS Fair (Methodologically)

⚠️ Honest Context: AI Has a Structural Advantage

Q2: "Can AI predict price direction?"

⚠️ Direction Accuracy ≈ 50%

✅ But ranking accuracy IS real

Q3: "Is the Top 5 +273.7% return real?"

✅ It's real, with caveats

⚠️ Caveats to acknowledge

Q4: "What is AI's real value?"

🎯 AI is a Ranker, not a Predictor

🎯 AI vs BBG Price Target Accuracy — Per-Stock Scorecard
