🔄 Weekly Prediction Pipeline
PT = EPS × PE
⚠️ Bloomberg Clarification
Bloomberg data is NOT used in the weekly prediction pipeline.
The bloomberg_consensus table is only LEFT JOINed in weekly_backtest.py (L226) for the backtest comparison UI. "Street Consensus" in our prompt means the average of our PDF-extracted bank reports.
📄 Consensus Source
load_street_consensus() queries report_extractions with a 90-day lookback, deduplicates by keeping the latest report per bank, and averages FY2 EPS and target prices in USD.
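The lookback, per-bank deduplication, and averaging described above can be sketched in plain Python. This is a minimal illustration, not the actual load_street_consensus() code: the row fields (bank, report_date, fy2_eps, target_price_usd), the function name street_consensus, and the sample rows are all assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import mean

def street_consensus(rows, as_of, lookback_days=90):
    """Average FY2 EPS and target price over the latest in-window report per bank."""
    cutoff = as_of - timedelta(days=lookback_days)
    latest = {}  # bank -> most recent row inside the lookback window
    for row in rows:
        if not (cutoff <= row["report_date"] <= as_of):
            continue  # outside the lookback window
        prev = latest.get(row["bank"])
        if prev is None or row["report_date"] > prev["report_date"]:
            latest[row["bank"]] = row
    if not latest:
        return None  # no usable reports
    return {
        "n_banks": len(latest),
        "fy2_eps": mean(r["fy2_eps"] for r in latest.values()),
        "target_price_usd": mean(r["target_price_usd"] for r in latest.values()),
    }

# Hypothetical rows; field names are illustrative, not the real table columns.
rows = [
    {"bank": "Bank A", "report_date": date(2025, 1, 10), "fy2_eps": 4.0, "target_price_usd": 100.0},
    {"bank": "Bank A", "report_date": date(2025, 2, 20), "fy2_eps": 5.0, "target_price_usd": 120.0},
    {"bank": "Bank B", "report_date": date(2025, 2, 1), "fy2_eps": 3.0, "target_price_usd": 80.0},
    {"bank": "Bank C", "report_date": date(2024, 6, 1), "fy2_eps": 9.0, "target_price_usd": 300.0},
]
consensus = street_consensus(rows, as_of=date(2025, 3, 1))
```

In this sample, Bank A's older report is superseded by its newer one and Bank C falls outside the window, so only two reports feed the average.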
🧮 Anchored Deviation Valuation
🎯 Identity Property
When deviations = 0%: PT = Consensus_EPS × (Price/Consensus_EPS) = Price → Expected Return = 0% → Signal = hold. This is the default most weeks.
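A minimal sketch of the identity property, assuming the deviations are applied multiplicatively to consensus EPS and to the anchor PE (Price / Consensus_EPS). The multiplicative form and the names price_target, eps_dev, and pe_dev are assumptions for illustration; only PT = EPS × PE and the zero-deviation identity come from the text above.

```python
def price_target(price, consensus_eps, eps_dev=0.0, pe_dev=0.0):
    """PT = EPS x PE, with both factors expressed as deviations from their anchors."""
    anchor_pe = price / consensus_eps        # PE implied by the current price
    eps = consensus_eps * (1.0 + eps_dev)    # assumed: deviation scales consensus EPS
    pe = anchor_pe * (1.0 + pe_dev)          # assumed: deviation scales the anchor PE
    return eps * pe

def expected_return(price, pt):
    return pt / price - 1.0

# Zero deviations: the price target collapses to the current price.
pt_flat = price_target(price=100.0, consensus_eps=5.0)
ret_flat = expected_return(100.0, pt_flat)

# Illustrative non-zero deviations: +10% on EPS, +5% on PE.
pt_up = price_target(100.0, 5.0, eps_dev=0.10, pe_dev=0.05)
```

With zero deviations the expected return is exactly 0%, which corresponds to the hold signal described above.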
📐 Error Amplification
Because PT = EPS × PE, relative errors compound: PT_error ≈ EPS_error + PE_error + cross-term, where the cross-term is EPS_error × PE_error. INTC cross-term max = 54.2%. PE is 2-4× more volatile than EPS (CoV 10-42% vs 3-18%).
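The decomposition follows from expanding (1 + EPS_error)(1 + PE_error) − 1 when errors are measured as relative deviations. The sketch below uses illustrative error sizes only (not the INTC figures), and the function name is hypothetical.

```python
def pt_relative_error(eps_error, pe_error):
    """Split the relative PT error into EPS, PE, and cross-term parts."""
    cross = eps_error * pe_error
    total = (1.0 + eps_error) * (1.0 + pe_error) - 1.0
    return {"eps": eps_error, "pe": pe_error, "cross": cross, "total": total}

# Illustrative inputs: a 10% EPS miss combined with a 30% PE miss.
parts = pt_relative_error(eps_error=0.10, pe_error=0.30)
recomposed = parts["eps"] + parts["pe"] + parts["cross"]
```

The cross-term grows with the product of the two errors, so it is negligible when both are small and material when both are large.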
🏗️ Thematic Purity V3 — Revenue Disaggregation + AI Alignment Pipeline
Based on MSCI/Bloomberg methodology + ASC 606/IFRS 15 revenue disaggregation + AI-driven alignment scoring. Data is theme-agnostic (extract once, score for any theme).
Phase 1 Data Extraction — Theme-agnostic · Run once · All themes reuse
Segment Extraction
- Annual report PDF → Gemini 2.5 Flash
- Extract L1 reportable segments
- Revenue %, amounts, description
- Key products & end markets
extracted_segments_v2
Revenue Disaggregation
- Same PDF, second Gemini call
- End market breakdown
- Product type breakdown
- Geography breakdown
- Cross-check with L1 (±5%)
Financial Notes Extraction
- Same PDF, third Gemini call
- PP&E breakdown — land, buildings, equipment, CIP
- SBC allocation — R&D vs Sales vs G&A
- Intangible assets — patents, technology, goodwill, customer relationships
- Net amounts, useful lives, validation status
extracted_ppe_breakdown
extracted_sbc_allocation
extracted_intangibles_breakdown
Business Type Normalization
- Classify L2 sub-segments (priority)
- Fallback to L1 if no L2 data
- Standard industry categories
- e.g. "AI & HPC Semiconductors"
Phase 2 Theme Scoring — Theme-specific · Per theme · AI-driven alignment
AI Per-Segment Alignment Scoring
- NEW: AI scores EACH segment's alignment (0-100%)
- Input: segments + business_types + financial notes context (PP&E, SBC, Intangibles)
- AI determines: layer (CORE/ENABLER/ADJACENT) + alignment_pct (0-100)
- Replaces fixed factors (1.0/0.5/0.25) with granular per-segment scoring
- Financial notes provide context for alignment judgment
stored on extracted_segments_v2 / extracted_sub_segments
Score Calculation
- If L2 exists → use L2 granularity
- If no L2 → fallback to L1
- NEW formula: Score = Σ(revenue_pct × AI alignment_pct / 100)
- OLD formula: Score = Σ(revenue_pct × fixed layer_factor)
- Score range: 0-100
- MSCI comparison (human_score) for gap analysis
🧮 Scoring Formula — V3 AI Alignment
Example: NVDA — "Data Center" 89.7% × 95/100 + "Auto" 1.1% × 85/100 + "Gaming" 7.4% × 0/100 = 86.2
Score 0 = zero connection, Score 100 = pure-play
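The V3 formula can be checked against the NVDA example with a few lines of Python. The function name thematic_score and the dict layout are assumptions for illustration; the revenue and alignment figures are the ones quoted above.

```python
def thematic_score(segments):
    """Score = sum(revenue_pct * alignment_pct / 100), on a 0-100 scale."""
    return sum(s["revenue_pct"] * s["alignment_pct"] / 100 for s in segments)

nvda = [
    {"segment_name": "Data Center", "revenue_pct": 89.7, "alignment_pct": 95},
    {"segment_name": "Auto", "revenue_pct": 1.1, "alignment_pct": 85},
    {"segment_name": "Gaming", "revenue_pct": 7.4, "alignment_pct": 0},
]
score = thematic_score(nvda)  # 85.215 + 0.935 + 0.0
```

The unrounded sum is 86.15, which the example above reports as 86.2.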
🔄 V2 → V3 Change
V2 (old): Fixed factors per layer — CORE=1.0, ENABLER=0.5, ADJACENT=0.25. Same factor for all segments in a layer.
V3 (new): AI scores each segment individually with alignment_pct (0-100%). A CORE segment might get 95% or 70% depending on actual relevance. Financial notes (PP&E, SBC, Intangibles) provide additional context for AI judgment.
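To make the difference concrete, the sketch below scores the same two segments under both schemes. The segments are hypothetical, and the function names are illustrative; only the layer factors and the two formulas come from this page.

```python
# Fixed V2 factors per layer, as listed above.
LAYER_FACTOR = {"CORE": 1.0, "ENABLER": 0.5, "ADJACENT": 0.25}

def score_v2(segments):
    """Old formula: every segment in a layer gets the same fixed factor."""
    return sum(s["revenue_pct"] * LAYER_FACTOR[s["ai_layer"]] for s in segments)

def score_v3(segments):
    """New formula: each segment carries its own alignment_pct (0-100)."""
    return sum(s["revenue_pct"] * s["alignment_pct"] / 100 for s in segments)

# Hypothetical company: two CORE segments with different actual relevance.
segments = [
    {"segment_name": "Segment A", "revenue_pct": 60.0, "ai_layer": "CORE", "alignment_pct": 95},
    {"segment_name": "Segment B", "revenue_pct": 40.0, "ai_layer": "CORE", "alignment_pct": 70},
]
v2 = score_v2(segments)
v3 = score_v3(segments)
```

Under V2 both segments count fully and the company scores 100; under V3 the weaker alignment of Segment B pulls the score down to 85.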
📊 Financial Evidence Integration (Step 2.1)
Financial notes data provides supporting evidence for thematic alignment. It is extracted from the annual report and displayed as insight cards on the company detail page.
🏭 PP&E — Capital Investment
- Land, buildings, equipment, CIP
- Gross & net amounts
- Auto-insight: manufacturing vs fabless
🧪 SBC — Talent Allocation
- R&D vs Sales vs G&A breakdown
- % allocation per department
- Auto-insight: innovation vs commercial culture
💡 Intangibles — IP Moat
- Patents, technology, goodwill
- Customer relationships, licenses
- Auto-insight: organic IP vs acquisition growth
💾 Database Schema
annual_report_extractions
segment_classification, total_revenue,
extraction_model, extracted_at
extracted_segments_v2
segment_name, revenue_pct,
revenue_amount, description,
business_type, ai_layer, alignment_pct
extracted_sub_segments
sub_segment_name, revenue_pct,
breakdown_type, business_type,
ai_layer, alignment_pct
extracted_ppe_breakdown 🆕
category_name, gross_amount,
net_amount, net_numeric,
currency, source_page
extracted_sbc_allocation 🆕
department, sbc_amount,
sbc_pct, currency,
source_page
extracted_intangibles_breakdown 🆕
category_name, gross_amount,
net_amount, net_numeric,
useful_life_years, validation_status
thematic_runs
run_mode, scoring_method,
created_at
thematic_scores
ai_score, human_score,
scoring_granularity (L1/L2),
scoring_rationale, status
🧠 Pipeline Prompts — Flow & Actual Instructions
Each step's prompt is shown below with its data flow.
Step 1 Segment Extraction step1_unified.py → Gemini Flash
You are a financial data extraction specialist. Extract ALL business segments
from this annual report. DO NOT make any thematic judgments.
## Instructions
1. Determine segment classification: "business" / "geographic" / "single"
2. ONLY extract ASC 280 / IFRS 8 reportable segments.
3. For EACH segment extract:
- segment_name, segment_name_original (if non-English)
- revenue_amount (with currency), revenue_pct (% of total)
- description: 1-2 sentences QUOTED from report
- source_page, source_quote
- key_products, end_markets
## Rules
- Extract ONLY what the report states. Do not infer or estimate.
- Include ALL operating segments, even small ones.
- EXCLUDE "Eliminations", "Corporate/Other" — only operating segments.
- Revenue percentages should sum to ~100%.
## Output: JSON { segments: [...], total_revenue_amount, fiscal_year_end }
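A downstream sanity check on the Step 1 JSON might look like the sketch below. It is an assumption-laden illustration rather than the pipeline's actual validation: the function name, the 5-point tolerance for "~100%", and the sample payload are all hypothetical.

```python
import json

# Names the prompt asks the model to exclude (compared case-insensitively).
EXCLUDED_NAMES = {"eliminations", "corporate/other"}

def check_step1_output(raw_json, tolerance_pts=5.0):
    """Return a list of issues found in a Step 1 extraction payload."""
    data = json.loads(raw_json)
    issues = []
    total_pct = 0.0
    for seg in data.get("segments", []):
        name = str(seg.get("segment_name", "")).strip()
        if name.lower() in EXCLUDED_NAMES:
            issues.append(f"non-operating segment present: {name}")
        total_pct += float(seg.get("revenue_pct") or 0.0)
    # Assumed tolerance: the prompt only says the total should be ~100%.
    if abs(total_pct - 100.0) > tolerance_pts:
        issues.append(f"revenue_pct sums to {total_pct:.1f}, expected ~100")
    return issues

# Hypothetical payload with one excluded row and an incomplete total.
sample = json.dumps({
    "segments": [
        {"segment_name": "Segment A", "revenue_pct": 60.0},
        {"segment_name": "Segment B", "revenue_pct": 30.0},
        {"segment_name": "Eliminations", "revenue_pct": -2.0},
    ],
    "total_revenue_amount": "USD 1,000 million",
    "fiscal_year_end": "2024-12-31",
})
issues = check_step1_output(sample)
```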
Step 2 Revenue Disaggregation step2_disaggregation.py → Gemini Flash
You are a financial data extraction specialist focused on revenue disaggregation.
## Context: Known L1 Segments (ASC 280)
{l1_context} ← injected from Step 1 output
## Task: Search ENTIRE annual report for MORE GRANULAR revenue breakdown tables:
- Notes → "Revenue Disaggregation" / "Disaggregation of Revenue"
- MD&A → "Revenue by End Market" / "Revenue by Product Line"
## Extraction Rules
1. Classify each breakdown: "end_market" / "product_type" / "geography" / "application" / "timing"
2. revenue_pct: ALWAYS as % of TOTAL COMPANY revenue (not % of parent segment)
Example: "Software" = 50% of DI, DI = 22.5% → pct = 11.25%
3. Record source_page, source_quote
## Cross-Validation
- Complete breakdowns: sub-items ≈ parent total (±5%)
- Partial breakdowns (is_partial=true): sum < parent is expected
## Output: JSON { breakdowns: [...], qualitative_insights, cross_validation }
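The two numeric rules in this prompt (percentages relative to total company revenue, and the ±5% cross-validation) can be sketched as follows. Reading ±5% as percentage points is an assumption, as are the function names; the 50% / 22.5% figures are the prompt's own example.

```python
def to_company_pct(pct_of_parent, parent_pct_of_company):
    """Convert a share of a parent segment into a share of total company revenue."""
    return pct_of_parent * parent_pct_of_company / 100.0

def cross_validate(parent_pct, child_pcts, is_partial=False, tolerance_pts=5.0):
    """Check sub-items against their parent (tolerance assumed to be percentage points)."""
    total = sum(child_pcts)
    if is_partial:
        # Partial breakdowns may sum to less than the parent, never meaningfully more.
        return total <= parent_pct + tolerance_pts
    return abs(total - parent_pct) <= tolerance_pts

# Prompt example: "Software" is 50% of DI, and DI is 22.5% of the company.
software_pct = to_company_pct(50.0, 22.5)

complete_ok = cross_validate(22.5, [11.25, 11.25])
partial_ok = cross_validate(22.5, [11.25], is_partial=True)
incomplete_flagged = not cross_validate(22.5, [11.25])
```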
Step 2.1 🆕 Financial Notes Extraction step2_1_financial_notes.py → Gemini Flash
Extract the following financial note disclosures from this annual report:
## 1. PP&E Breakdown (ASC 360)
Find the Property, Plant & Equipment schedule. Extract each category:
- category_name (land, buildings, equipment, CIP, etc.)
- gross_amount, accumulated_depreciation, net_amount
- currency, source_page
## 2. SBC Allocation (ASC 718)
Find Stock-Based Compensation by function/department:
- department (R&D, Sales & Marketing, G&A)
- sbc_amount, sbc_pct (% of total SBC)
- currency, source_page
## 3. Intangible Assets (ASC 350 / IAS 38)
Find Intangible Assets schedule (excluding or including Goodwill):
- category_name (technology, patents, customer relationships, etc.)
- gross_amount, accumulated_amortization, net_amount
- useful_life_years, validation_status
- currency, source_page
## Rules
- Extract ONLY what the report states
- Include source_page for every item
- Use original currency from report
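Since sbc_pct is defined as a percentage of total SBC, extracted values can be recomputed from sbc_amount as a consistency check. The helper below is a hypothetical sketch with made-up amounts, not part of step2_1_financial_notes.py.

```python
def sbc_percentages(amounts):
    """Recompute each department's share of total SBC, in percent."""
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("total SBC must be positive")
    return {dept: 100.0 * amt / total for dept, amt in amounts.items()}

# Hypothetical SBC amounts by department, all in the same currency unit.
sbc = sbc_percentages({"R&D": 600.0, "Sales & Marketing": 250.0, "G&A": 150.0})
```

Comparing these recomputed shares with the model's reported sbc_pct is one way to catch rows where the amount and the percentage were read from different tables.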
Step 3 Business Type Normalization step3_classify.py → GPT-4o (structured)
You are classifying business segments into STANDARD industry categories.
## IMPORTANT RULES
1. You MUST pick from the pre-defined list below. DO NOT invent new types.
2. If no type fits, use the CLOSEST match from the list.
3. Only use "OTHER: <desc>" if truly nothing matches (<5%).
4. Similar businesses across companies MUST get the SAME business_type.
5. You MUST classify EVERY segment — do NOT skip any.
## Pre-defined Business Types (~84 categories)
{type_list} ← e.g. "AI & HPC Semiconductors", "Semiconductor Foundry", etc.
## Segments to Classify ({segment_count} total)
{segments_json}
Return JSON { classifications: [{ segment_id, business_type, rationale }] }
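Rules 1, 3, and 5 of this prompt lend themselves to a mechanical check on the returned JSON. The sketch below assumes the response has already been parsed into a list of dicts; the function name and the sample data are hypothetical.

```python
def validate_classifications(segment_ids, allowed_types, classifications):
    """Flag skipped segments and business types outside the pre-defined list."""
    errors = []
    seen = {c["segment_id"] for c in classifications}
    missing = [sid for sid in segment_ids if sid not in seen]
    if missing:
        errors.append(f"unclassified segments: {missing}")
    for c in classifications:
        business_type = c["business_type"]
        # "OTHER: <desc>" is the only permitted escape hatch from the list.
        if business_type not in allowed_types and not business_type.startswith("OTHER:"):
            errors.append(f"segment {c['segment_id']}: unknown business_type {business_type!r}")
    return errors

# Two types taken from the example in the prompt; the real list has ~84 entries.
allowed = {"AI & HPC Semiconductors", "Semiconductor Foundry"}
classifications = [
    {"segment_id": 1, "business_type": "AI & HPC Semiconductors", "rationale": "..."},
    {"segment_id": 2, "business_type": "Quantum Widgets", "rationale": "..."},
]
errors = validate_classifications([1, 2, 3], allowed, classifications)
```

Here segment 3 is reported as unclassified and segment 2 as carrying an invented type, the two failure modes the prompt's rules are written to prevent.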