{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 4: A Better Standard - Bias Correction\n", "\n", "So far, we have learned how to get an MI estimate. But is that estimate *correct*? Is it *reliable*? For scientific research, a single number is not enough. We need to be sure that our result is not an artifact of our limited data.\n", "\n", "This tutorial tackles the most important concept for producing publishable results: **bias correction**. We will demonstrate two critical statistical pitfalls and show how `NeuralMI`'s flagship `'rigorous'` mode solves them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. The Problems with Naive Estimates\n", "\n", "Using an MI estimator naively with `mode='estimate'` is fast, but it doesn't account for two critical issues:\n", "\n", "1. **Estimator Variance:** Due to the random nature of neural network training (e.g., weight initialization, data shuffling), running the same estimation twice will give slightly different answers.\n", "2. **Finite-Sampling Bias:** This is the bigger problem. With a finite dataset, estimators tend to find spurious correlations in the noise, leading them to **systematically overestimate** the true MI. This bias worsens as the dataset gets smaller.\n", "\n", "Let's demonstrate these problems in code."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import neural_mi as nmi\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_context(\"talk\")\n", "\n", "# --- Generate Data ---\n", "# We use a moderate number of samples to ensure the bias is visible.\n", "n_samples = 2500\n", "x_raw, y_raw = nmi.datasets.generate_nonlinear_from_latent(\n", " n_samples=n_samples, latent_dim=10, observed_dim=100, mi=3.0\n", ")\n", "x_raw_transposed = x_raw.T\n", "y_raw_transposed = y_raw.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 1: Estimator Variance\n", "\n", "Let's run the exact same estimation twice, changing only the `random_seed`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- Demonstrating Estimator Variance ---\n", "2025-10-20 00:03:36 - neural_mi - INFO - Starting parameter sweep sequentially (n_workers=1)...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7aa46cd590754712adf3cfdf81a2dc43", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Sequential Sweep Progress: 0%| | 0/1 [00:00<?, ?it/s]\n", "2025-10-20 00:03:51 - neural_mi - INFO - Starting rigorous analysis with 4 workers...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6fb8155c53584dc3a41ad496eab5c4d7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Rigorous Analysis Progress: 0%| | 0/55 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ax = rigorous_results.plot(show=False)\n", "ax.axhline(y=3.0, color='green', linestyle='-', label=f'True MI ({3.0:.2f} bits)')\n", "ax.legend()\n", "ax.set_ylim(bottom=0)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Fine-Tuning the Rigorous Analysis\n", "\n", "For advanced users, `nmi.run` provides parameters to control the bias-correction procedure:\n", "\n", "- `gamma_range (range)`: Sets the range of data splits to test. The default is `range(1, 11)`. If the resulting estimates remain on a straight line across the range, the linear extrapolation is more reliable.\n", "- `delta_threshold (float)`: A curvature threshold used to identify the 'linear region' of the MI vs. 1/N plot for extrapolation. Lower values are stricter. The default is `0.1`.\n", "- `min_gamma_points (int)`: The minimum number of points required for a reliable linear fit after pruning non-linear points. The default is `5`.\n", "\n", "While the defaults are robust for most cases, these parameters offer finer control for specialized analyses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Conclusion\n", "\n", "You now understand the most important feature of the `NeuralMI` library. Simple MI estimates are unreliable for scientific work, but `mode='rigorous'` provides a principled, automated workflow that corrects for statistical biases and produces a final estimate with a confidence interval.\n", "\n", "> **Recommendation:** For any final, scientific analysis where the precise MI value matters, use `mode='rigorous'`.\n", "\n", "In the next tutorial, we'll explore another advanced analysis for understanding the complexity of a single neural population." ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:base] *", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.13" } }, "nbformat": 4, "nbformat_minor": 4 }