Analysis - Statsledge

🔍 Data Quality Audits

Audit

🆔

Player ID Mismatch Audit

player_id_audit_report.md

Comprehensive audit identifying 15+ player ID mismatches between Cricsheet data and IPL 2026 squad files. Resolves duplicate IDs from franchise trades (MI↔LSG) and ensures data integrity across all analytics pipelines.

Why this was done

While generating stat packs, we found that some players appeared twice or had incorrect stats. This audit created the canonical ID mappings used by all downstream outputs.

📄 Internal Document

Audit

🎯

Entry Point Audit

entry_point_audit_report.md

Analysis of where batters typically bat in the order (entry points). Validates our batting position classifications and ensures OPENER, MIDDLE_ORDER, and FINISHER tags align with actual match data.

Why this was done

Andy Flower flagged that some "opener" tags didn't match where players actually batted. This audit provided evidence-based entry point distributions.

📄 Internal Document

📈 EDA & Threshold Analysis

EDA

📊

Threshold EDA (2023+ Data)

threshold_eda_2023.md

Exploratory data analysis that determined optimal thresholds for player tags using only 2023-2025 IPL data. Establishes baselines for SPECIALIST vs VULNERABLE, PP_BEAST vs PP_LIABILITY, and other phase-specific tags.

Why this was done

Original thresholds were based on all-time T20 data, causing stat drift. This EDA recalibrated all thresholds to reflect the current T20 meta (2023+).

📄 Internal Document

Validation

⚖️

Baselines vs Tags Comparison

baselines_vs_tags.md

Systematic comparison of league-wide baselines against our player tag criteria. Validates that SPECIALIST and VULNERABLE thresholds are meaningful relative to overall IPL performance distributions.

Why this was done

Jose Mourinho challenged whether our tag thresholds were statistically defensible. This analysis proved that tags represent meaningful deviations from league averages.

📄 Internal Document

📚 Research & Methodology

Research

🏈

PFF Grading System Research

pff_grading_system_research.md

Deep-dive into Pro Football Focus's play-by-play grading methodology. Explores how context-aware evaluation and position-specific metrics can be adapted for cricket analytics.

Why this was done

PFF revolutionized NFL analytics by going beyond box scores. This research informed our approach to context-aware player evaluation in cricket.

📄 Internal Document

Research

🏀

KenPom Methodology Research

kenpom_methodology_research.md

Analysis of KenPom's college basketball analytics — adjusted efficiency, tempo-free stats, and the "Four Factors." Foundation for our planned CricPom team rating system.

Why this was done

KenPom's adjusted metrics revolutionized basketball analysis. This research explores how to create cricket equivalents for team-level efficiency ratings.

📄 Internal Document

Research

⚽

CricPom Prototype Specification

cricpom_prototype_spec_020926_v1.md

Opponent-adjusted, venue-normalized, phase-aware T20 efficiency rating system. 5-factor tournament weighting (PQI, CI, Recency, Conditions, Confidence) with sigmoid-based sample size scoring. 231 IPL 2026 players rated across 14 T20 tournaments.

Why this was done

KenPom transformed college basketball evaluation. This spec translates that methodology to T20 cricket — adjusting for opponent quality, venue conditions, and sample size to produce honest player efficiency ratings.

📄 Internal Spec 📊 Player Feed (231 players)

Research

⚖

Tournament Composite Weights

tkt187_final_weights.py

Weighted geometric mean formula scoring 14 T20 tournaments across 5 factors: Player Quality Index (PQI), Effective Competition Index (Eff. CI), Recency decay, Conditions Similarity, and Sample Confidence. Produces per-team weighted SR, avg, economy, boundary%, and dot% composites.

Why this was done

A simple average of IPL + BBL + PSL stats would be misleading. Different tournaments have different quality levels. This engine ensures the IPL carries more weight than a T10 league, while still using cross-tournament data for players with limited IPL samples.

📄 Weights Engine

Research

📈

Insight Confidence Framework

sigmoid_confidence()

Sigmoid-based confidence scoring: 1 / (1 + exp(-0.02 × (matches - 100))). With these parameters, confidence is 50% at 100 matches, about 88% at 200, and passes 95% only at roughly 250 matches; under 30 matches it stays below 20% (low confidence). Applied to all 231 CricPom player ratings and pressure performance metrics.

Why this was done

Treating a player with 5 IPL innings the same as one with 150 is dishonest. This framework quantifies statistical reliability so readers know which insights are backed by deep data and which are small-sample estimates.

📄 Implementation
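
A minimal sketch of the formula quoted in this card, for readers who want to check the numbers. The function name comes from the card; the signature and the parameter names `steepness` and `midpoint` are assumptions made for illustration, not the project's actual implementation.

```python
import math


def sigmoid_confidence(matches: float, steepness: float = 0.02, midpoint: float = 100.0) -> float:
    """Confidence in (0, 1) from sample size: 1 / (1 + exp(-steepness * (matches - midpoint))).

    The defaults mirror the constants quoted in the card; the parameter
    names themselves are hypothetical.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (matches - midpoint)))


# Print the curve at a few sample sizes to see how quickly confidence grows.
for n in (5, 30, 100, 200, 250):
    print(f"{n:>3} matches -> {sigmoid_confidence(n):.3f}")
```

With the quoted constants this evaluates to about 0.20 at 30 matches, 0.50 at 100, and 0.88 at 200.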

Research

🔮

Silhouette Score Validation

baseline_comparison.py

Three-metric cluster validation: silhouette score (cohesion vs separation), Davies-Bouldin index (cluster similarity), and Calinski-Harabasz index (between/within variance ratio). Tests K-means against random baseline to confirm clustering is meaningful, not noise.

Why this was done

Clustering algorithms will always produce clusters, even from random data. Without validation metrics, our batter/bowler archetypes could be statistical artifacts. Scoring the real clusters against a random baseline shows whether the structure is genuine or noise.

📄 Validation Code
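
The silhouette part of this check can be sketched in plain Python. This is an illustration under stated assumptions, not the contents of baseline_comparison.py: the data is synthetic, fixed labels stand in for K-means output, and the "random baseline" is taken to mean randomly shuffled labels on the same points. The Davies-Bouldin and Calinski-Harabasz indices are omitted.

```python
import math
import random


def silhouette_score(points, labels):
    """Mean silhouette coefficient, computed directly from its definition.

    Assumes at least two clusters and that points are not all identical.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


rng = random.Random(42)
# Hypothetical two-feature data: two well-separated groups of 20 points each.
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
points += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(20)]

cluster_labels = [0] * 20 + [1] * 20  # stand-in for a clustering result
shuffled_labels = cluster_labels[:]   # baseline: same labels, random assignment
rng.shuffle(shuffled_labels)

structured = silhouette_score(points, cluster_labels)
baseline = silhouette_score(points, shuffled_labels)
print(f"structured={structured:.2f}  shuffled-label baseline={baseline:.2f}")
```

A structured score far above the shuffled-label score is the signal that the clusters reflect real separation in the data.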

Research

🎯

Pressure Sequence Analysis

generate_momentum_data.py

Consecutive dot ball and boundary sequence analysis for bowling pressure and batting resilience profiling. Rates teams Elite/Strong/Average/Weak based on sequence length thresholds. Identifies clutch performers and choke risks via SR delta under pressure.

Why this was done

Traditional bowling stats (economy, SR) miss the ability to build sustained pressure. A bowler who bowls 4 dots then leaks a boundary is different from one who never strings 3 dots together. Sequences reveal pressure capability that averages hide.

📄 Generator
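
The sequence idea can be illustrated with a short run-length helper. This is a sketch rather than the logic in generate_momentum_data.py: it assumes each entry is the runs scored off one legal delivery, ignores extras and wickets, and leaves out the Elite/Strong/Average/Weak thresholds, which this page does not list.

```python
def run_lengths(balls, predicate):
    """Lengths of every maximal run of consecutive balls that satisfy predicate."""
    runs, current = [], 0
    for ball in balls:
        if predicate(ball):
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def is_dot(runs_scored):
    return runs_scored == 0


def is_boundary(runs_scored):
    return runs_scored in (4, 6)


# Hypothetical two-over spell: four dots, then a boundary, and so on.
spell = [0, 0, 0, 0, 4, 1, 0, 0, 6, 0, 0, 0]

dot_runs = run_lengths(spell, is_dot)
boundary_runs = run_lengths(spell, is_boundary)
sequences_of_three_plus = sum(1 for length in dot_runs if length >= 3)

print(dot_runs, max(dot_runs), sequences_of_three_plus, boundary_runs)
```

Here the spell contains dot-ball runs of 4, 2, and 3, so two sequences reach three or more dots, which an economy rate alone would not show.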

Spec

🎨

Player Clustering PRD

player_clustering_prd.md

Product requirements document for our K-means clustering model. Defines the 6 batter archetypes (EXPLOSIVE_OPENER to FINISHER) and 7 bowler archetypes (PACER to PART_TIMER).

Why this was done

Before building the clustering model, we needed to define what archetypes should exist and how they map to cricket roles. This PRD aligned the team.

📄 Internal Document

Spec

🎭

Cluster Archetypes (Creative)

cluster_archetypes_creative.md

Creative descriptions and narrative framing for each player archetype. Makes technical clusters accessible to fans through cricket storytelling and real-world player examples.

Why this was done

Raw cluster labels (Cluster 0, 1, 2...) don't resonate with fans. This doc created the narrative layer that makes analytics accessible.

📄 Internal Document

🧩 Player Pattern Recognition

Analytics

📉

Batter Consistency Index

batter_consistency_index.csv

Rolling consistency analysis across IPL 2023-2025. Tracks coefficient of variation in runs scored, single-digit failure rate, and form trajectory by season. Separates "reliable anchors" from "streaky match-winners."

Why this matters

Raw averages hide volatility. A batter who averages 35 through scores of 0, 70, 0, 70 is very different from one who averages roughly the same through 30, 40, 35, 30. This index reveals who you can depend on.

📊 Internal Data
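
Two of the measures named in this card can be sketched as follows. The function name `consistency_profile` and the choice of population standard deviation are assumptions for illustration; the mean here is runs per innings and ignores not-outs, so it is not a true batting average.

```python
from statistics import mean, pstdev


def consistency_profile(scores):
    """Coefficient of variation and single-digit failure rate for a list of innings scores."""
    avg = mean(scores)
    # Coefficient of variation: spread relative to the mean (lower = more consistent).
    cv = pstdev(scores) / avg if avg else float("inf")
    # Share of innings that ended on fewer than 10 runs.
    failure_rate = sum(1 for s in scores if s < 10) / len(scores)
    return {"mean": avg, "cv": cv, "single_digit_rate": failure_rate}


streaky = consistency_profile([0, 70, 0, 70])
steady = consistency_profile([30, 40, 35, 30])
print(streaky)
print(steady)
```

The two batters have similar means, but the first has a coefficient of variation of 1.0 and fails in half of the innings, while the second has a coefficient of about 0.12 and no single-digit scores.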

Analytics

🤝

Partnership Synergy Scores

partnership_synergy.csv

Measures how batting pairs amplify each other's performance. Synergy index compares partnership run rates against individual averages. Year-wise trends show which combinations are improving or declining.

Why this matters

A predicted XI isn't just about 11 individuals; it's about combinations. Partnership data reveals which batting pairs create more than the sum of their parts.

📊 Internal Data

⚔️ Matchup Intelligence

Matchups

🏏

Batter vs Bowling Type

batter_bowling_type_matchup.csv

How every IPL batter performs against pace, off-spin, leg-spin, and left-arm spin. Identifies PACE_SPECIALIST, SPIN_SPECIALIST, and VULNERABLE_VS_SPIN tags based on 2023-2025 data with statistically significant sample sizes.

Why this matters

A batter averaging 45 overall might average 55 vs pace but 25 vs spin. This matchup data drives bowling attack composition and batting order decisions.

📊 Internal Data

Matchups

🎳

Bowler vs Batting Handedness

bowler_handedness_matchup.csv

How every IPL bowler performs against left-handers vs right-handers. Reveals asymmetric matchups — bowlers who dominate one handedness but struggle against the other. Critical for batting order optimization.

Why this matters

Teams often alternate LHB/RHB in the order to disrupt bowling lines. This data quantifies exactly how much advantage each switch provides.

📊 Internal Data

Venue

🏟️

Team Venue Records

team_venue_records.csv

Win/loss records for every team at every IPL venue, with year-wise breakdown. Identifies home fortress effects, away vulnerabilities, and neutral venue performance patterns across 2023-2025.

Why this matters

Venue is cricket's most underrated variable. Some teams have 80%+ home win rates but drop to 30% away. Park factors directly influence team ratings.

📊 Internal Data

🔥 Pressure & Phase Performance

Pressure

💪

Bowler Pressure Sequences

bowler_pressure_sequences.csv

Tracks bowler performance under pressure — economy and strike rate in death overs, when defending small totals, and in consecutive dot-ball sequences. The cricket equivalent of "clutch" performance.

Why this matters

PFF grades NFL quarterbacks on 4th-quarter performance. Similarly, a bowler's death overs economy under pressure is more predictive than their overall economy.

📊 Internal Data

Phase

📋

Bowler Phase Distribution

bowler_phase_distribution_grouped.csv

How bowlers distribute their overs across powerplay, middle, and death phases. Grouped analysis reveals captaincy patterns — which bowlers are trusted at death, which are powerplay-only, and who bowls through all phases.

Why this matters

A bowler's role in a team is defined by when they bowl. Phase distribution directly feeds into our DEATH_SPECIALIST, NEW_BALL_SPECIALIST, and WORKHORSE tags.

📊 Internal Data
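
A phase distribution of this kind could be computed roughly as below. The boundaries are assumptions: overs 1-6 are the standard T20 powerplay, overs 16-20 match the death overs used elsewhere on this page, and 7-15 is taken as the middle. The function names are hypothetical.

```python
from collections import Counter

PHASES = ("powerplay", "middle", "death")


def phase_of(over: int) -> str:
    """Map a 1-based over number to its phase (assumed boundaries: 1-6, 7-15, 16-20)."""
    if not 1 <= over <= 20:
        raise ValueError(f"over must be between 1 and 20, got {over}")
    if over <= 6:
        return "powerplay"
    if over <= 15:
        return "middle"
    return "death"


def phase_distribution(overs_bowled):
    """Share of a bowler's overs in each phase; all zeros if no overs were bowled."""
    counts = Counter(phase_of(over) for over in overs_bowled)
    total = sum(counts.values())
    if total == 0:
        return {phase: 0.0 for phase in PHASES}
    return {phase: counts.get(phase, 0) / total for phase in PHASES}


# Hypothetical bowler used with the new ball and again at the death.
print(phase_distribution([1, 3, 17, 19]))
```

A bowler with this profile splits evenly between powerplay and death and never bowls in the middle overs, which is the kind of pattern the role tags are built on.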

Pressure

🎯

Batter Pressure Bands

batter_pressure_bands.csv

How every IPL batter performs across low, medium, and high pressure bands (2023-2025). Segments strike rate, boundary percentage, and dot ball frequency by match situation to reveal who thrives and who wilts under pressure.

Why this matters

Overall averages flatten out pressure context. A batter with an overall strike rate of 140 may drop to 110 in high-pressure situations. This data separates genuine clutch performers from flat-track bullies.

📊 Internal Data

Pressure

📊

Pressure Performance Ratings

pressure_deltas.csv

Composite pressure performance ratings for batters and bowlers. Measures the delta between overall and pressure-situation strike rates, boundary percentages, and dot ball rates. Assigns CLUTCH, STEADY, or FADES ratings based on statistical thresholds.

Why this matters

The PFF approach applied to cricket: context-aware grading over raw stats. A bowler who improves by 15% under pressure is more valuable than one who merely maintains. These ratings directly inform Predicted XII selection weights.

📊 Internal Data

Glossary

📖

Pressure Performance Glossary

Reference guide for all pressure metrics and ratings

Pressure Bands (RRR-Based)

Band	RRR	Match situation
COMFORTABLE	< 8	Cruising: run rate is manageable, batters can play normally
BUILDING	8 – 10	Above par: scoring needs to accelerate, risk-taking begins
HIGH	10 – 12	Aggressive required: boundaries needed every 2-3 balls
EXTREME	12 – 15	Six-hitting territory: almost every ball must score
NEAR_IMPOSSIBLE	15+	Miracle needed: requires continuous boundaries to win
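
The bands above can be turned into a lookup with a few lines of code. This is a sketch, not the project's implementation: the table leaves the shared edges (8, 10, 12, 15) ambiguous, so the code assumes each band includes its lower bound, and both function names are hypothetical.

```python
import bisect

# Lower edges of every band after COMFORTABLE, as listed in the table.
BAND_EDGES = [8, 10, 12, 15]
BAND_NAMES = ["COMFORTABLE", "BUILDING", "HIGH", "EXTREME", "NEAR_IMPOSSIBLE"]


def required_run_rate(runs_needed: int, balls_remaining: int) -> float:
    """Required runs per over for the rest of the chase."""
    if balls_remaining <= 0:
        raise ValueError("balls_remaining must be positive")
    return runs_needed * 6 / balls_remaining


def pressure_band(rrr: float) -> str:
    """Band for a required run rate; an edge value falls in the higher band (assumption)."""
    return BAND_NAMES[bisect.bisect_right(BAND_EDGES, rrr)]


# 40 needed off the last 18 balls is a required rate of about 13.3.
rrr = required_run_rate(40, 18)
print(f"{rrr:.2f} -> {pressure_band(rrr)}")
```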

Pressure Ratings (Performance Tags)

Rating	Criteria
CLUTCH	Batters: SR improves 10%+ AND dot% drops in 12+ RRR bands | Bowlers: Economy improves AND dot% rises
PRESSURE_PROOF	Metrics within ±5% of overall across all bands (batters and bowlers alike)
MODERATE	Performance changes between 5-10% under pressure for both roles
PRESSURE_SENSITIVE	Batters: SR drops 10%+ OR dot% rises 10%+ | Bowlers: Economy rises 15%+ OR boundary% conceded rises 10%+
FINISHER	Batter only: SR in 15+ band exceeds 170 with adequate sample
CLOSER	Bowler only: Economy < 8.5 in 15+ band with 5+ overs

Entry Context (Batter Only)

Context	Balls faced	Meaning
FRESH	< 10 balls	Walked in during pressure phase: facing it cold
BUILDING	10 – 25 balls	Getting set when pressure hit: partially established
SET	25 – 40 balls	Well established before pressure phase
DEEP_SET	40+ balls	Long innings before pressure: fully in rhythm

Other Terms

Weighted Score (W.Score)	Composite metric combining SR delta with sample size (log₂ scaling) and death overs bonus (30% weight for overs 16-20 execution)
SR Delta	Percentage change in strike rate between overall performance and pressure situations. Positive = better under pressure.
Death Pressure Ratio	Proportion of pressure balls faced in overs 16-20 vs all pressure balls
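
Two of these terms follow directly from their definitions and can be sketched in code. The function names and input shapes are assumptions. The Weighted Score is left out because the glossary does not give its exact formula.

```python
def sr_delta(overall_sr: float, pressure_sr: float) -> float:
    """Percentage change in strike rate from overall to pressure situations.

    Positive means the batter scores faster under pressure.
    """
    return (pressure_sr - overall_sr) / overall_sr * 100


def death_pressure_ratio(pressure_ball_overs) -> float:
    """Share of pressure balls that were faced in overs 16-20.

    pressure_ball_overs: the 1-based over number of each pressure ball faced
    (a hypothetical input format).
    """
    if not pressure_ball_overs:
        return 0.0
    death_balls = sum(1 for over in pressure_ball_overs if 16 <= over <= 20)
    return death_balls / len(pressure_ball_overs)


# A batter whose strike rate falls from 140 overall to 110 under pressure.
print(f"SR delta: {sr_delta(140, 110):.1f}%")

# Eight pressure balls faced, six of them in the death overs.
print(death_pressure_ratio([12, 14, 17, 17, 18, 19, 20, 20]))
```

A fall from 140 to 110 is an SR delta of about -21%, and six death-over balls out of eight give a ratio of 0.75.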

⚙️ Algorithm Documentation

Algorithm

🎯

SUPER SELECTOR Algorithm v2

predicted_xii_algorithm_v2.md

Complete specification of our Predicted XII algorithm. Covers constraint satisfaction (overseas limits, balance requirements), scoring weights, impact player selection, and tie-breaking rules.

Why this was done

The Predicted XII is our flagship output. This doc ensures the algorithm is reproducible, auditable, and can be improved systematically.

📄 Internal Document

Domain

🌸

Andy Flower Validation v2

andy_flower_v2_validation.md

Domain expert review of our clustering model outputs. Andy Flower's validation of player archetypes and recommendations for threshold adjustments based on cricket expertise.

Why this was done

Data-driven models need domain validation. Andy's cricket expertise ensures our clusters map to real playing styles, not just statistical patterns.

📄 Internal Document

🌍 Tournament Intelligence

PQI (Player Quality)

Eff. CI (Competition)

Recency

Conditions

Sample Size

#	Tournament	Weight	Tier	PQI	Eff. CI	Recency	Conditions	Sample	Matches

Methodology

Tournament weights determine how much non-IPL performance data is trusted when building player profiles. Each tournament is scored on 5 factors: PQI (25%), Effective CI (20%), Recency (20%), Conditions Similarity (15%), and Sample Confidence (20%). The composite weight is the weighted geometric mean of the five factor scores, so a low score on any one factor pulls the composite down more than it would in an arithmetic average. Recency decays with a 4-year half-life. Conditions similarity is benchmarked against IPL 2023-2025 as the baseline. Tier assignments: 1A (0.80+), 1B (0.60-0.79), 1C (0.45-0.59), 2 (<0.45).
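
The methodology above can be sketched as follows. This is an illustration under assumptions, not the code in tkt187_final_weights.py: factor scores are taken to lie in (0, 1], the dictionary keys and function names are hypothetical, the example tournament is invented, and tier cut-offs are treated as continuous thresholds, so a weight of 0.795 falls in 1B.

```python
import math

# Factor weights as stated in the methodology (they sum to 1.0).
FACTOR_WEIGHTS = {
    "pqi": 0.25,
    "eff_ci": 0.20,
    "recency": 0.20,
    "conditions": 0.15,
    "sample": 0.20,
}


def recency_score(years_ago: float, half_life: float = 4.0) -> float:
    """Exponential decay that halves every `half_life` years."""
    return 0.5 ** (years_ago / half_life)


def composite_weight(factors: dict) -> float:
    """Weighted geometric mean: exp(sum of weight * ln(score)). Scores must be > 0."""
    return math.exp(sum(w * math.log(factors[name]) for name, w in FACTOR_WEIGHTS.items()))


def tier(weight: float) -> str:
    """Tier from the composite weight, using the cut-offs in the methodology."""
    if weight >= 0.80:
        return "1A"
    if weight >= 0.60:
        return "1B"
    if weight >= 0.45:
        return "1C"
    return "2"


# Invented factor scores for a hypothetical tournament played one year ago.
example = {
    "pqi": 0.90,
    "eff_ci": 0.80,
    "recency": recency_score(1),
    "conditions": 0.70,
    "sample": 0.95,
}
weight = composite_weight(example)
print(f"composite={weight:.3f} tier={tier(weight)}")
```

Working in logarithms keeps the weighted geometric mean to a single sum, and it makes the requirement explicit that no factor score may be zero.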
