{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining measurements\n", "\n", "When we do a fit, we can have additional knowledge about a parameter from other measurements. This can be taken into account either through a simultaneous fit or by adding a constraint (subsidiary measurement)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding a constraint\n", "\n", "If we know a parameter's value from a different measurement and want to constrain it using its uncertainty, a Gaussian constraint can be added to the likelihood as\n", "\n", "\\begin{align}\n", "\\mathrm{constr}_i = \\mathrm{Gauss}(\\mu_{\\mathrm{measured}}; \\theta_i, \\sigma_{\\mathrm{measured}})\n", "\\end{align}\n", "\n", "In general, additional terms can be added to the likelihood arbitrarily in zfit, be it to incorporate measurements of a different shape or to add penalty terms that confine a fit within boundaries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hepunits as u\n", "import matplotlib.pyplot as plt\n", "import mplhep\n", "import numpy as np\n", "import particle.literals as lp\n", "import tensorflow as tf\n", "import zfit\n", "import zfit.z.numpy\n", "\n", "plt.rcParams['figure.figsize'] = (8, 6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu_true = lp.B_plus.mass * u.MeV\n", "sigma_true = 50 * u.MeV\n", "\n", "# number of signal and background events in the rare channel\n", "n_sig_rare = 120\n", "n_bkg_rare = 700" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create some data\n", "signal_np = np.random.normal(loc=mu_true, scale=sigma_true, size=n_sig_rare)\n", "bkg_np_raw = np.random.exponential(size=20000, scale=700)\n", "bkg_np = bkg_np_raw[bkg_np_raw < 1000][:n_bkg_rare] + 5000 # cut to the right range by hand; zfit could also do the cut" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Firstly, the observable and its range are defined\n", "obs = zfit.Space('Bmass', 5000, 6000) # covers the whole mass range" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load data into zfit and let zfit concatenate the data\n", "signal_data = zfit.Data(signal_np, obs=obs)\n", "bkg_data = zfit.Data(bkg_np, obs=obs)\n", "data = zfit.data.concat([signal_data, bkg_data])\n", "# (we could also do it manually)\n", "# data = zfit.Data(array=np.concatenate([signal_np, bkg_np], axis=0), obs=obs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Parameters are specified as (name (unique), initial, lower, upper), where lower and upper are optional\n", "mu = zfit.Parameter('mu', 5279, 5100, 5400)\n", "sigma = zfit.Parameter('sigma', 20, 1, 200)\n", "sig_yield = zfit.Parameter('sig_yield', n_sig_rare + 30,\n", " step_size=3) # step_size: the default is small; choose one appropriate to the parameter scale\n", "signal = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs, extended=sig_yield)\n", "\n", "lam = zfit.Parameter('lambda', -0.002, -0.1, -0.00001, step_size=0.001) # floating; could also be defined without limits\n", "bkg_yield = zfit.Parameter('bkg_yield', n_bkg_rare - 40, step_size=1)\n", "comb_bkg = zfit.pdf.Exponential(lam, obs=obs, extended=bkg_yield)\n", "\n", "# The final model is the combination of the signal and background PDFs\n", "model = zfit.pdf.SumPDF([comb_bkg, signal])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "constraint = zfit.constraint.GaussianConstraint(mu, 
observation=5275 * u.MeV, sigma=15 * u.MeV)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nll = zfit.loss.ExtendedUnbinnedNLL(model, data, constraints=constraint)\n", "minimizer = zfit.minimize.Minuit(gradient=\"zfit\")\n", "result = minimizer.minimize(nll)\n", "result.hesse();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simultaneous fits\n", "\n", "Sometimes we don't want to fit a single channel but several at once, with one likelihood and parameters shared between the channels. In this example, we will fit the rare decay simultaneously with its resonant control channel.\n", "\n", "A simultaneous likelihood is the product of the individual likelihoods,\n", "\\begin{align}\n", "\\mathcal{L}(\\theta \\mid \\{\\mathrm{data}_1, ..., \\mathrm{data}_n\\}) &= \\prod_{i=1}^{n} \\mathcal{L}_i(\\theta, \\mathrm{data}_i)\n", "\\end{align}\n", "\n", "which becomes a sum in the NLL\n", "\n", "\\begin{align}\n", "\\mathrm{NLL}(\\theta \\mid \\{\\mathrm{data}_1, ..., \\mathrm{data}_n\\}) &= \\sum_{i=1}^{n} \\mathrm{NLL}_i(\\theta, \\mathrm{data}_i)\n", "\\end{align}\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_sig_reso = 40000\n", "n_bkg_reso = 3000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create some data\n", "signal_np_reso = np.random.normal(loc=mu_true, scale=sigma_true * 0.7, size=n_sig_reso)\n", "bkg_np_raw_reso = np.random.exponential(size=20000, scale=900)\n", "bkg_np_reso = bkg_np_raw_reso[bkg_np_raw_reso < 1000][:n_bkg_reso] + 5000\n", "\n", "# load data into zfit\n", "obs_reso = zfit.Space('Bmass_reso', 5000, 6000)\n", "signal_data_reso = zfit.Data(signal_np_reso, obs=obs_reso)\n", "bkg_data_reso = zfit.Data(bkg_np_reso, obs=obs_reso)\n", "data_reso = zfit.data.concat([signal_data_reso, bkg_data_reso])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sharing and composing parameters\n", "\n", "As an example, the same signal shape will be used with the identical `mu` yet a scaled `sigma`. This means that the `sigma` of our control mode corresponds to the `sigma` of our signal times a scaling parameter. Therefore, `sigma_scaled` is a function of two other parameters, `sigma` and `sigma_scaling`, and _cannot_ change its value independently. There are two fundamentally distinct parameter types in zfit to represent this behavior: independent parameters (`Parameter`) and dependent parameters (`ComposedParameter`, `ComplexParameter`, ...).\n", "\n", "#### Independent parameter\n", "\n", "An independent parameter has, as a distinctive method, `set_value`, which _changes_ the value of the parameter. 
In a fit, and in general, these are _the only objects_ that can directly change their value; they do not depend on other objects, while most other objects depend on parameters.\n", "As a consequence, these parameters can have limits (which effectively restrict the possible values a `Parameter` can be assigned) and a `step_size`, a hint to the minimization algorithm about the order of magnitude of a parameter change that noticeably affects the loss.\n", "\n", "Another attribute is the `floating` flag: if set to `False`, the parameter won't be floating in the fit, whether it was given explicitly or inferred implicitly from the dependencies.\n", "\n", "#### Dependent parameter\n", "\n", "These are, effectively, single-valued functions of other objects, usually other parameters. Therefore, a dependent parameter does not have a `set_value` method and also does not possess limits. Limits are preferably set on the `Parameter`s it depends on; however, if a hard limit is required, it can always be enforced in the definition of a `ComposedParameter`.\n", "\n", "The most notable dependent parameter is the `ComposedParameter`, which returns an arbitrary function of its input arguments; the latter can be specified with the `params` argument.\n", "\n", "While these parameters *cannot* change their value explicitly and therefore won't be varied by a minimizer, the zfit minimizers automatically extract the independent parameters that a dependent parameter depends on (if it is given as an argument).\n", "\n", "As a consequence, these parameters also lack a `step_size` attribute. Furthermore, `floating` can't be used, neither set nor retrieved; it is advised to check its dependencies directly instead.\n", "\n", "#### Sharing parameters\n", "\n", "Since every parameter object in zfit is unique, also defined by its name, it is straightforward to know when a parameter is shared in the loss and when it is not: if the same object is used in two places, it is shared. This can be mixed arbitrarily." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Firstly, we create a free scaling parameter\n", "sigma_scaling = zfit.Parameter('sigma_scaling', 0.9, 0.1, 10, step_size=0.1)\n", "\n", "\n", "def sigma_scaled_fn(sigma, sigma_scaling):\n", " return sigma * sigma_scaling # this can be an arbitrary function\n", "\n", "\n", "sigma_scaled = zfit.ComposedParameter('sigma scaled', # name\n", " sigma_scaled_fn, # function\n", " params=[sigma, sigma_scaling], # the objects used inside the function\n", " unpack_params=True # unpack the parameters into the function arguments; alternatively, the function could take a single `params` container\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reso_sig_yield = zfit.Parameter('reso_sig_yield', n_sig_reso - 100, 0, n_sig_reso * 3,\n", " step_size=1)\n", "signal_reso = zfit.pdf.Gauss(mu=mu, # the same as for the rare mode\n", " sigma=sigma_scaled,\n", " obs=obs_reso,\n", " extended=reso_sig_yield)\n", "\n", "lambda_reso = zfit.Parameter('lambda_reso', -0.002, -0.01, 0.0001)\n", "reso_bkg_yield = zfit.Parameter('reso_bkg_yield', n_bkg_reso + 70, 0, 2e5, step_size=1)\n", "comb_bkg_reso = zfit.pdf.Exponential(lambda_reso, obs=obs_reso, extended=reso_bkg_yield)\n", "\n", "\n", "model_reso = zfit.pdf.SumPDF([comb_bkg_reso, signal_reso])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways to implement a simultaneous fit in zfit. 
In contrast to some other frameworks, zfit translates the above equation\n", "\\begin{align}\n", "\\mathrm{NLL}(\\theta \\mid \\{\\mathrm{data}_1, ..., \\mathrm{data}_n\\}) &= \\sum_{i=1}^{n} \\mathrm{NLL}_i(\\theta, \\mathrm{data}_i)\n", "\\end{align}\n", "\n", "directly into code.\n", "\n", "We can either build the two losses and add them directly, or pass lists of models and data, from which one loss per pair is built and summed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nll_rare = zfit.loss.ExtendedUnbinnedNLL(model, data)\n", "nll_reso = zfit.loss.ExtendedUnbinnedNLL(model_reso, data_reso)\n", "nll_simultaneous = nll_rare + nll_reso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "signal_reso.get_yield()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_simultaneous = minimizer.minimize(nll_simultaneous)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_simultaneous.hesse()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result_simultaneous.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting a simultaneous loss\n", "\n", "Since a simultaneous fit is defined as above, it is simple to plot each component separately: either by using the attributes of the loss (`nll_simultaneous.model` and `nll_simultaneous.data`) to access the models and data in a general fashion, or by directly reusing the models and data we created manually above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sets the values of the parameters to the result of the simultaneous fit\n", "# in case they were modified.\n", "zfit.param.set_values(nll_simultaneous.get_params(), result_simultaneous)\n", "\n", "\n", "def plot_fit_projection(model, data, nbins=30, ax=None):\n", " # The function will be reused.\n", " if ax is None:\n", " ax = plt.gca()\n", "\n", " lower, upper = data.data_range.limit1d\n", "\n", " # Creates a histogram of the data and plots it with mplhep.\n", " counts, bin_edges = np.histogram(data.unstack_x(), bins=nbins)\n", " mplhep.histplot(counts, bins=bin_edges, histtype=\"errorbar\", yerr=True,\n", " label=\"Data\", ax=ax, color=\"black\")\n", "\n", " binwidth = np.diff(bin_edges)[0]\n", " x = tf.linspace(lower, upper, num=1000) # or np.linspace\n", "\n", " # Line plots of the total pdf and the sub-pdfs.\n", " y = model.ext_pdf(x) * binwidth\n", " ax.plot(x, y, label=\"total\", color=\"royalblue\")\n", " for m, l, c in zip(model.get_models(), [\"background\", \"signal\"], [\"forestgreen\", \"crimson\"]):\n", " ym = m.ext_pdf(x) * binwidth\n", " ax.plot(x, ym, label=l, color=c)\n", "\n", " ax.set_title(data.data_range.obs[0])\n", " ax.set_xlim(lower, upper)\n", " ax.legend(fontsize=15)\n", "\n", " return ax\n", "\n", "\n", "fig, axs = plt.subplots(1, 2, figsize=(16, 6))\n", "\n", "for mod, dat, ax, nb in zip(nll_simultaneous.model, nll_simultaneous.data, axs, [30, 60]):\n", " plot_fit_projection(mod, dat, nbins=nb, ax=ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discovery test\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observed an excess of our signal:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result_simultaneous.params[sig_yield])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we would 
like to compute the significance of this observation or, in other words, the probability that this observation is the result of a statistical fluctuation. To do so we have to perform a hypothesis test where the null and alternative hypotheses are defined as:\n", "\n", "* $H_{0}$, the null or background-only hypothesis, i.e. $N_{sig} = 0$;\n", "* $H_{1}$, the alternative hypothesis, i.e. $N_{sig} = \\hat{N}_{sig}$, where $\\hat{N}_{sig}$ is the fitted value of $N_{sig}$ printed above.\n", "\n", "In `hepstats`, a hypothesis is formulated with the `POI` (Parameter Of Interest) class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hepstats.hypotests.parameters import POI\n", "\n", "# the null hypothesis\n", "sig_yield_poi = POI(sig_yield, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `POI` class takes as input a `zfit.Parameter` instance and the value it takes under a given hypothesis. Note that we didn't define the alternative hypothesis here: in a discovery test, the POI value for the alternative is set to the best-fit value.\n", "\n", "The test statistic used is the profile likelihood ratio, defined as:\n", "\n", "\\begin{equation}\n", "q_{0} = \\left\\{\n", " \\begin{array}{ll}\n", " -2 \\ln \\frac{\\mathcal{L}(N_{sig}=0, \\; \\hat{\\hat{\\theta}})}{\\mathcal{L}(N_{sig}=\\hat{N}_{sig}, \\; \\hat{\\theta})} & \\mbox{if } \\; \\hat{N}_{sig} \\geq 0 \\\\\n", " 0 & \\mbox{if } \\; \\hat{N}_{sig} < 0\n", " \\end{array}\n", "\\right.\n", "\\end{equation}\n", "\n", "where $\\hat{\\theta}$ are the best-fit values of the nuisance parameters (i.e. background yield, exponential slope, ...), while $\\hat{\\hat{\\theta}}$ are the fitted values of the nuisance parameters when $N_{sig} = 0$ is fixed.\n", "\n", "From the test statistic distribution a p-value can be computed as\n", "\n", "\\begin{equation}\n", "p_{0} = \\int_{q_{0}^{obs}}^{\\infty} f(q_{0} |H_{0}) dq_{0}\n", "\\end{equation}\n", "\n", "where $q_{0}^{obs}$ is the value of the test statistic evaluated on the observed data.\n", "\n", "The construction of the test statistic and the computation of the p-value are done in a `Calculator` object in `hepstats`. In this example we will use the `AsymptoticCalculator`, which assumes that $q_{0}$ follows a $\\chi^2(ndof=1)$ distribution, which simplifies the p-value computation to\n", "\n", "\\begin{equation}\n", "p_{0} = 1 - \\Phi\\bigg({\\sqrt{q_{0}^{obs}}}\\bigg).\n", "\\end{equation}\n", "\n", "The calculator object takes as input the likelihood and a minimizer to profile the likelihood." 
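] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick numerical illustration of the asymptotic formula above (a side calculation, not part of the `hepstats` workflow): for a hypothetical observed value $q_{0}^{obs}$ of the test statistic, the p-value and the corresponding significance can be computed directly with `scipy`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "# hypothetical observed value of the test statistic, for illustration only\n", "q0_obs = 25.0\n", "p0 = 1 - norm.cdf(np.sqrt(q0_obs))  # asymptotic p-value: 1 - Phi(sqrt(q0_obs))\n", "Z = norm.ppf(1 - p0)  # significance in sigma, equal to sqrt(q0_obs)\n", "print(f\"p0 = {p0:.2e}, Z = {Z:.1f} sigma\")"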
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hepstats.hypotests.calculators import AsymptoticCalculator\n", "\n", "# construction of the calculator instance\n", "calculator = AsymptoticCalculator(input=nll_simultaneous, minimizer=minimizer)\n", "calculator.bestfit = result_simultaneous\n", "\n", "# equivalent to above\n", "calculator = AsymptoticCalculator(input=result_simultaneous, minimizer=minimizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is another calculator in `hepstats` called `FrequentistCalculator`, which constructs the test statistic distribution $f(q_{0} |H_{0})$ with pseudo-experiments (toys), but it takes more time.\n", "\n", "The `Discovery` class is a high-level class that takes as input a calculator and a `POI` instance representing the null hypothesis. It asks the calculator to compute the p-value and also computes the significance as\n", "\n", "\\begin{equation}\n", "Z = \\Phi^{-1}(1 - p_0).\n", "\\end{equation}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hepstats.hypotests import Discovery\n", "\n", "discovery = Discovery(calculator=calculator, poinull=sig_yield_poi)\n", "discovery.result()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we get a significance of about $7\\sigma$, which is well above the $5\\sigma$ threshold for discoveries 😃.\n", "\n", "## Upper limit calculation\n", "\n", "Let's try to compute the discovery significance with a lower number of generated signal events." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sets the values of the parameters to the result of the simultaneous fit\n", "zfit.param.set_values(nll_simultaneous.get_params(), result_simultaneous)\n", "sigma_scaling.floating = False\n", "\n", "# Creates a sampler that will draw events from the model\n", "sampler = model.create_sampler()\n", "\n", "# Creates new simultaneous loss\n", "nll_simultaneous_low_sig = zfit.loss.ExtendedUnbinnedNLL(model, sampler) + nll_reso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Samples with sig_yield = 10. Since the model is extended, the number of\n", "# signal events generated is drawn from a Poisson distribution with lambda = 10.\n", "sampler.resample({sig_yield: 10})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "calculator_low_sig = AsymptoticCalculator(input=nll_simultaneous_low_sig, minimizer=minimizer)\n", "\n", "discovery_low_sig = Discovery(calculator=calculator_low_sig, poinull=sig_yield_poi)\n", "discovery_low_sig.result()\n", "print(f\"\\n {calculator_low_sig.bestfit.params} \\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We might consider computing an upper limit on the signal yield instead. 
The test statistic for an upper limit calculation is\n", "\n", "\\begin{equation}\n", "q_{N_{sig}} = \\left\\{\n", " \\begin{array}{ll}\n", " -2 \\ln \\frac{\\mathcal{L}(N_{sig}, \\; \\hat{\\hat{\\theta}})}{\\mathcal{L}(N_{sig}=\\hat{N}_{sig}, \\; \\hat{\\theta})} & \\mbox{if } \\; \\hat{N}_{sig} \\leq N_{sig} \\\\\n", " 0 & \\mbox{if } \\; \\hat{N}_{sig} > N_{sig}.\n", " \\end{array}\n", "\\right.\n", "\\end{equation}\n", "\n", "and the p-value is\n", "\n", "\\begin{equation}\n", "p_{N_{sig}} = \\int_{q_{N_{sig}}^{obs}}^{\\infty} f(q_{N_{sig}} |N_{sig}) dq_{N_{sig}}.\n", "\\end{equation}\n", "\n", "The upper limit on $N_{sig}$, $N_{sig, \\uparrow}$, is the value for which $p_{N_{sig, \\uparrow}} = 1 - \\alpha$, with $\\alpha$ the confidence level (typically $95 \\%$, i.e. a p-value of $0.05$). The upper limit is found by interpolation of the p-values as a function of $N_{sig}$, which is done by the `UpperLimit` class. We have to give the range of $N_{sig}$ values to scan using the `POIarray` class, which, like the `POI` class, takes the parameter as input, but with several values at which to evaluate it instead of one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hepstats.hypotests import UpperLimit\n", "from hepstats.hypotests.parameters import POIarray\n", "\n", "# Background-only hypothesis.\n", "bkg_only = POI(sig_yield, 0)\n", "# Range of Nsig values to scan.\n", "sig_yield_scan = POIarray(sig_yield, np.linspace(0, 70, 10))\n", "\n", "ul = UpperLimit(calculator=calculator_low_sig, poinull=sig_yield_scan, poialt=bkg_only)\n", "ul.upperlimit(alpha=0.05);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from utils import plotlimit\n", "\n", "plotlimit(ul, CLs=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## sPlot\n", "\n", "This is a demonstration of the **sPlot** algorithm, described in [Pivk:2004ty](https://arxiv.org/pdf/physics/0402083.pdf).\n", "\n", "If a data sample is populated by different sources of events, like signal and background, **sPlot** is able to unfold the contributions of the different sources for a given variable.\n", "\n", "Let's construct a dataset with two variables, the invariant mass and lifetime, for the resonant signal defined above and the combinatorial background." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Signal distributions.\n", "nsig_sw = 20000\n", "np_sig_m_sw = signal_reso.sample(nsig_sw)[\"Bmass_reso\"]\n", "np_sig_t_sw = np.random.exponential(size=nsig_sw, scale=1)\n", "\n", "# Background distributions.\n", "nbkg_sw = 150000\n", "np_bkg_m_sw = comb_bkg_reso.sample(nbkg_sw)[\"Bmass_reso\"]\n", "np_bkg_t_sw = np.random.normal(size=nbkg_sw, loc=2.0, scale=2.5)\n", "\n", "# Lifetime cut.\n", "t_cut = np_bkg_t_sw > 0\n", "np_bkg_t_sw = np_bkg_t_sw[t_cut]\n", "np_bkg_m_sw = np_bkg_m_sw[t_cut]\n", "\n", "# Mass distribution\n", "np_m_sw = np.concatenate([np_sig_m_sw, np_bkg_m_sw])\n", "\n", "# Lifetime distribution\n", "np_t_sw = np.concatenate([np_sig_t_sw, np_bkg_t_sw])\n", "\n", "# Plots the mass and lifetime distribution.\n", "fig, axs = plt.subplots(1, 2, figsize=(16, 6))\n", "axs[0].hist([np_bkg_m_sw, np_sig_m_sw], bins=50, stacked=True, label=(\"background\", \"signal\"), alpha=.7)\n", "axs[0].set_xlabel(\"m\")\n", "axs[0].legend(fontsize=15)\n", "axs[1].hist([np_bkg_t_sw, np_sig_t_sw], bins=50, stacked=True, label=(\"background\", \"signal\"), alpha=.7)\n", "axs[1].set_xlabel(\"t\")\n", "axs[1].legend(fontsize=15);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this particular example we want to unfold the signal lifetime distribution. To do so, **sPlot** needs a discriminant variable to determine the yields of the various sources using an extended maximum likelihood fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Builds the loss.\n", "data_sw = zfit.Data(np_m_sw, obs=obs_reso)\n", "nll_sw = zfit.loss.ExtendedUnbinnedNLL(model_reso, data_sw)\n", "\n", "# This parameter was useful in the simultaneous fit but is not needed anymore, so we fix it.\n", "sigma_scaling.floating = False\n", "\n", "# Minimizes the loss.\n", "result_sw = minimizer.minimize(nll_sw)\n", "print(result_sw.params)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualization of the result.\n", "zfit.param.set_values(nll_sw.get_params(), result_sw)\n", "plot_fit_projection(model_reso, data_sw, nbins=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**sPlot** will use the fitted yield of each source to derive the so-called **sWeights** for each data point:\n", "\n", "\\begin{equation}\n", "W_{n}(x) = \\frac{\\sum_{j=1}^{N_S} V_{nj} f_j(x)}{\\sum_{k=1}^{N_S} N_{k}f_k(x)}\n", "\\end{equation}\n", "\n", "with\n", "\n", "\\begin{equation}\n", "V_{nj}^{-1} = \\sum_{e=1}^{N} \\frac{f_n(x_e) f_j(x_e)}{(\\sum_{k=1}^{N_S} N_{k}f_k(x_e))^2}\n", "\\end{equation}\n", "\n", "\n", "where ${N_S}$ is the number of sources in the data sample, here 2. The index $n$ represents the source: for instance, $0$ is the signal and $1$ is the background, so $f_0$ and $N_0$ are the pdf and yield of the signal.\n", "\n", "In `hepstats` the **sWeights** are computed with the `compute_sweights` function, which takes as arguments the fitted extended model and the discriminant data (on which the fit was performed)." 
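] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before calling `compute_sweights`, here is a minimal `numpy` sketch of the formula above, purely for illustration: `pdfs` and `yields` are hypothetical placeholders for the per-source pdf values $f_n(x_e)$ evaluated on the data and the fitted yields $N_n$; the actual computation is done by `compute_sweights` in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch of the sWeight formula above (illustration only).\n", "# `pdfs` has shape (n_sources, n_events) with entries f_n(x_e),\n", "# `yields` has shape (n_sources,) with the fitted yields N_n.\n", "def sweights_sketch(pdfs, yields):\n", "    pdfs = np.asarray(pdfs, dtype=float)\n", "    yields = np.asarray(yields, dtype=float)\n", "    denom = yields @ pdfs  # sum_k N_k f_k(x_e), one value per event\n", "    vinv = (pdfs[:, None, :] * pdfs[None, :, :] / denom**2).sum(axis=-1)  # V^{-1}_{nj}\n", "    v = np.linalg.inv(vinv)\n", "    return (v @ pdfs) / denom  # W_n(x_e), one row of sWeights per source"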
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hepstats.splot import compute_sweights\n", "\n", "weights = compute_sweights(model_reso, data_sw)\n", "\n", "print(weights)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Sum of signal sWeights: \", np.sum(weights[reso_sig_yield]))\n", "print(\"Sum of background sWeights: \", np.sum(weights[reso_bkg_yield]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can apply the signal **sWeights** to the lifetime distribution and retrieve its signal component." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 2, figsize=(16, 6))\n", "nbins = 40\n", "\n", "sorter = np_m_sw.argsort()\n", "\n", "axs[0].plot(np_m_sw[sorter], weights[reso_sig_yield][sorter], label=\"$w_\\\\mathrm{sig}$\")\n", "axs[0].plot(np_m_sw[sorter], weights[reso_bkg_yield][sorter], label=\"$w_\\\\mathrm{bkg}$\")\n", "axs[0].plot(np_m_sw[sorter], weights[reso_sig_yield][sorter] + weights[reso_bkg_yield][sorter],\n", " \"-k\", label=\"$w_\\\\mathrm{sig} + w_\\\\mathrm{bkg}$\")\n", "axs[0].axhline(0, color=\"0.5\")\n", "axs[0].legend(fontsize=15)\n", "axs[0].set_xlim(5000, 5600)\n", "\n", "axs[1].hist(np_t_sw, bins=nbins, range=(0, 6), weights=weights[reso_sig_yield], label=\"weighted histogram\", alpha=.5)\n", "axs[1].hist(np_sig_t_sw, bins=nbins, range=(0, 6), histtype=\"step\", label=\"true histogram\", lw=1.5)\n", "axs[1].set_xlabel(\"t\")\n", "axs[1].legend(fontsize=15);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be careful: the **sPlot** technique works only on variables that are uncorrelated with the discriminant variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Correlation between m and t: {np.corrcoef(np_m_sw, np_t_sw)[0, 1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's apply the signal **sWeights** to the mass distribution to see how poor the **sPlot** result is when it is applied to a variable that is correlated with the discriminant variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(np_m_sw, bins=100, range=(5000, 6000), weights=weights[reso_sig_yield]);" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 4 }