Financial Index Replication
Prescriptive optimization template for selecting a sparse 20-stock replication basket and weights that track an S&P 500-like benchmark.
What this template is for
Index funds and separately managed accounts often need to track a broad benchmark without holding every constituent. This template builds a sparse replication basket: from 50 S&P 500-like stocks, select exactly 20 names and their weights so the portfolio follows the benchmark’s historical returns as closely as possible.
The model uses RelationalAI’s prescriptive reasoner to optimize both selection and sizing in one mixed-integer program. It includes practical portfolio constraints: long-only weights, max position size, sector neutrality, and per-name ADV participation limits.
Why this problem matters
Holding every index constituent can be operationally expensive, especially for smaller accounts, tax-aware portfolios, or products with custody and trading constraints. A sparse replicating basket gives most of the benchmark exposure while reducing the number of positions to trade and maintain.
The hard part is that name selection and weight optimization interact. The best 20 names are not simply the largest constituents or the highest-correlated stocks; they need to work together as a portfolio while respecting sector and per-name trading-capacity rules.
Key design patterns demonstrated
- Cardinality-constrained selection — binary variables choose exactly 20 stocks.
- Linked binary and continuous decisions — a stock can carry weight only if selected.
- Tracking objective — the solver minimizes absolute tracking residuals while the script reports realized RMS tracking error after solving.
- Sector neutrality — replicated sector exposure must stay within a fixed active band around benchmark sector weights.
- ADV participation control — ADV stands for average daily dollar volume; each stock’s buy or sell amount is capped as a fraction of ADV.
- Full-history evaluation — optimize across the entire return history and report realized tracking quality.
- Baseline comparison — compare against equal-weight top-20 stocks by full-history correlation.
Who this is for
- Quantitative analysts building index replication workflows
- Portfolio managers exploring sparse benchmark tracking
- Data scientists learning mixed-integer optimization with financial constraints
- Engineers modeling linked selection and allocation decisions
What you’ll build
- A semantic model for stocks, sectors, benchmark returns, and stock returns
- A mixed-integer optimization model with 20-name cardinality
- Long-only portfolio weights with max position constraints
- Sector-neutrality and ADV participation constraints
- A tracking residual objective over historical returns
- Full-history tracking error reports and a simple baseline comparison
What’s included
financial_index_replication.py— Main script with the semantic model, optimization model, solve, and reportingdata/stocks.csv— 50-stock universe with ticker, sector, benchmark weight, liquidity, and previous weightdata/index_returns.csv— Monthly S&P 500-like benchmark returnsdata/stock_returns.csv— Monthly historical returns by stockpyproject.toml— Python package configuration with dependencies
Template structure
.├─ README.md # this file├─ pyproject.toml # dependencies├─ financial_index_replication.py # main entrypoint: model, solve, report└─ data/ ├─ stocks.csv # 50-stock universe ├─ index_returns.csv # benchmark monthly returns ├─ stock_returns.csv # per-stock monthly returns └─ replica_returns.csv # written by the script after solvingStart here: run python financial_index_replication.py.
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
- RelationalAI Python SDK (
relationalai) == 1.0.14
Quickstart
-
Download ZIP:
Terminal window curl -O https://private.relational.ai/templates/zips/v1/financial_index_replication.zipunzip financial_index_replication.zipcd financial_index_replication -
Create venv:
Terminal window python -m venv .venvsource .venv/bin/activatepython -m pip install --upgrade pip -
Install:
Terminal window python -m pip install . -
Configure:
Terminal window rai init -
Run:
Terminal window python financial_index_replication.py -
Expected output:
======================================================================FINANCIAL INDEX REPLICATION======================================================================Universe: 50 stocksSelected names: exactly 20Max position: 10%Sector active band: +/- 4%Portfolio value: $10,000,000Max ADV participation per name: 5%Status: OPTIMALObjective: total absolute residual = ...=== Selected Replication Basket ===ticker sector weight benchmark_weight previous_weight avg_dollar_volume... ... ... ... ... ...=== Sector Exposure ===sector weight benchmark_weight active_weightConsumer Discretionary ... ... ......=== Tracking Quality ===Annualized tracking error: ...Mean abs monthly residual: ...Implied turnover: ...=== Baseline Comparison ===Baseline: equal-weight top-20 stocks by full-history correlationBaseline annualized tracking error: ...Wrote benchmark-vs-replica returns to: data/replica_returns.csv
How It Works
1. Load the universe
The template loads a compact synthetic dataset:
stocks.csv: identifiers, sectors, benchmark weights, liquidity, and previous holdingsindex_returns.csv: monthly benchmark returnsstock_returns.csv: monthly returns for each stock
The data is synthetic but shaped like an S&P 500 replication problem, so the template is runnable without licensed market data.
The benchmark constituent weights were generated by first assigning broad target sector allocations, then drawing uneven positive stock weights within each sector and scaling those names so each sector sums back to its target. This creates a market-cap-like benchmark shape without using licensed constituent weights. The benchmark return for each month is then calculated from the stock return table as the weighted sum of all constituent returns, with a small random noise term added so the sparse 20-name replication problem is realistic rather than perfectly mechanical.
2. Define selection and weight variables
The model has two main decisions per stock:
Stock.x_selected = model.Property(f"{Stock} selected if {Float:selected}")Stock.x_weight = model.Property(f"{Stock} has replication weight {Float:weight}")x_selected is binary. x_weight is continuous. The linking constraint keeps unselected names at zero weight:
weight <= MAX_WEIGHT * selected3. Match benchmark returns
For each historical month, the model creates positive and negative residual variables:
index_return[t] - sum_i weight[i] * stock_return[i,t] = pos_error[t] - neg_error[t]The objective minimizes total absolute residual:
minimize sum_t pos_error[t] + neg_error[t]This keeps the solver problem linear and mixed-integer. After solving, the script computes the standard RMS tracking error across the full history.
This template uses an L1 tracking objective because absolute residuals keep the model linear with binary selection variables. A classic L2 objective, minimizing squared residuals, is also a natural tracking-error formulation if the selected solver supports the resulting mixed-integer quadratic problem.
4. Add portfolio realism
The template includes constraints practitioners expect:
- exactly 20 selected stocks
- weights sum to 100%
- no shorting
- max 10% per selected stock
- sector weights within +/- 4% of benchmark sector weights
- per-name buy and sell amounts no more than 5% of average daily dollar volume (ADV)
5. Evaluate the portfolio
After solving, the script reports:
- selected names and weights
- sector exposures and active sector weights
- annualized tracking error
- mean absolute residual
- implied turnover
- comparison to a simple top-correlation baseline
data/replica_returns.csvwithdate,index_return, andreplica_return
You can use data/replica_returns.csv to plot the original benchmark return series against the optimized replica return series.
Customize
- Change
N_REPLICATION_NAMESto select more or fewer names. - Change
MAX_WEIGHTto tighten or relax the largest allowed position size. - Tighten
SECTOR_ACTIVE_BANDfor stricter sector neutrality. - Lower
MAX_ADV_PARTICIPATIONfor stricter per-name trading capacity. - Replace the synthetic CSVs with real benchmark and constituent returns if your data license allows it.
Troubleshooting
Why is the solver returning INFEASIBLE?
- The combination of `N_REPLICATION_NAMES`, `SECTOR_ACTIVE_BAND`, and `MAX_ADV_PARTICIPATION` may be over-constrained for the universe. Loosen the sector band first (e.g., 0.04 -> 0.06) and re-run.- `MAX_ADV_PARTICIPATION` interacts with `PORTFOLIO_VALUE` and `previous_weight`. A large portfolio rebalancing into low-ADV names can be infeasible -- raise the ADV cap, lower portfolio value, or expand the universe.- Check that `MAX_WEIGHT * N_REPLICATION_NAMES >= 1.0` so the full-investment constraint is reachable.Why does rai init fail or hang?
- Confirm the RAI Native App is installed in your Snowflake account and your user has access.- Check that your active Snowflake profile points to the right account/role; re-run `rai init` to refresh credentials.- Network proxies and corporate firewalls can block the auth handshake -- try from an unrestricted network.Why are my tracking-error numbers worse than the baseline?
- Verify that all three CSVs cover the same date range. A mismatch causes the script to silently drop months from the join.- Confirm `index_returns.csv` and `stock_returns.csv` use the same return convention (simple vs log) -- mixing them inflates residuals.- Inspect the selected basket's sector exposure: a tight `SECTOR_ACTIVE_BAND` can push the optimizer away from the highest-correlation names.Why did pd.read_csv fail on one of the data files?
- Confirm the file exists under `data/` and matches the expected headers (`ticker`, `sector`, `benchmark_weight`, `avg_dollar_volume`, `previous_weight` for stocks; `date,index_return` for index; `date,ticker,return` for stock returns).- Re-extract the template ZIP if any file looks truncated.- On Windows, ensure files are UTF-8 encoded with no BOM.