Significance/p-value calculation

This tutorial shows how to calculate the significance (p-value).

For this example we use the location of the blazar TXS 0506+056 as its observed TS in the time-integrated analysis is lower than the one of NGC 1068. And this makes the calculation of the corresponding p-value based on pseudo-experiment less expensive.

For analysis setup details check the Fitting a steady point-source with the public 14-year IceCube track data tutorial.

[1]:
import numpy as np
from matplotlib import pyplot as plt

import skyllh
from skyllh.analyses.i3.publicdata_ps.time_integrated_ps import create_analysis
from skyllh.core.config import Config
from skyllh.core.source_model import PointLikeSource

cfg = Config()

datasets = skyllh.create_datasets('IceTracks-DR2', cfg=cfg)

# Location of the blazatar TXS 0506+056
src_ra = 77.35  # degrees
src_dec = 5.7  # degrees

source = PointLikeSource(ra=np.radians(src_ra), dec=np.radians(src_dec))

ana = create_analysis(cfg=cfg, datasets=datasets, source=source)
100%|██████████| 33/33 [00:00<00:00, 603.49it/s]
100%|██████████| 33/33 [00:00<00:00, 325.38it/s]
100%|██████████| 33/33 [00:00<00:00, 595.61it/s]
100%|██████████| 33/33 [00:00<00:00, 603.72it/s]
100%|██████████| 4/4 [00:11<00:00,  2.88s/it]
100%|██████████| 136/136 [00:00<00:00, 11742.22it/s]

Initializing the analysis instance and unblinding the data.

[2]:
from skyllh.core.random import RandomStateService

rss = RandomStateService(3090)
ts, fitparam_values, status = ana.unblind(rss)
ns_bf = fitparam_values['ns']
gamma_bf = fitparam_values['gamma']
print(f'Unblinded results: TS={ts:.2f}, ns={ns_bf:.2f}, gamma={gamma_bf:.2f}')
print(f'Minimizer status: {status}')
Unblinded results: TS=9.77, ns=8.80, gamma=1.97
Minimizer status: {'grad': array([-3.66069500e-06,  5.61787954e-05]), 'task': 'CONVERGENCE: RELATIVE REDUCTION OF F <= FACTR*EPSMCH', 'funcalls': 16, 'nit': 11, 'warnflag': 0, 'skyllh_minimizer_n_reps': 0, 'n_llhratio_func_calls': 16}

Calculating the significance (local p-value) by generating the test-statistic distribution of background-only pseudo_experiments (trials)

The local significance of the source, i.e. its p-value, can be calculated by generating the test-statistic distribution of background-only pseudo-experiment, i.e. for zero injected signal events. SkyLLH provides the helper function create_trial_data_file to generate these trials and store them in a numpy file for later usage.

[3]:
from skyllh.core.utils.analysis import create_trial_data_file
[4]:
help(create_trial_data_file)
Help on function create_trial_data_file in module skyllh.core.utils.analysis:

create_trial_data_file(
    ana,
    rss,
    n_trials,
    mean_n_sig=0,
    mean_n_sig_null=0,
    mean_n_bkg_list=None,
    minimizer_rss=None,
    bkg_kwargs=None,
    sig_kwargs=None,
    pathfilename=None,
    ncpu=None,
    ppbar=None,
    tl=None
)
    Creates and fills a trial data file with `n_trials` generated trials for
    each mean number of injected signal events specified by `mean_n_sig` for a
    given analysis.

    Parameters
    ----------
    ana : instance of Analysis
        The Analysis instance to use for the trial generation.
    rss : instance of RandomStateService
        The RandomStateService instance to use for generating random
        numbers.
    n_trials : int
        The number of trials to perform for each hypothesis test.
    mean_n_sig : ndarray of float | float | 2- or 3-element sequence of float
        The array of mean number of injected signal events (MNOISEs) for which
        to generate trials. If this argument is not a ndarray, an array of
        MNOISEs is generated based on this argument.
        If a single float is given, only this given MNOISEs are injected.
        If a 2-element sequence of floats is given, it specifies the range of
        MNOISEs with a step size of one.
        If a 3-element sequence of floats is given, it specifies the range plus
        the step size of the MNOISEs.
    mean_n_sig_null : ndarray of float | float | 2- or 3-element sequence of float
        The array of the fixed mean number of signal events (FMNOSEs) for the
        null-hypothesis for which to generate trials. If this argument is not a
        ndarray, an array of FMNOSEs is generated based on this argument.
        If a single float is given, only this given FMNOSEs are used.
        If a 2-element sequence of floats is given, it specifies the range of
        FMNOSEs with a step size of one.
        If a 3-element sequence of floats is given, it specifies the range plus
        the step size of the FMNOSEs.
    mean_n_bkg_list : list of float | None
        The mean number of background events that should be generated for
        each dataset. This parameter is passed to the ``do_trials`` method of
        the ``Analysis`` class. If set to None (the default), the background
        generation method needs to obtain this number itself.
    minimizer_rss : instance of RandomStateService | None
        The instance of RandomStateService to use for generating random
        numbers for the minimizer, e.g. new initial fit parameter values.
        If set to ``None``, a rss with the same seed as ``rss`` will be
        initialized.
    bkg_kwargs : dict | None
        Additional keyword arguments for the `generate_events` method of the
        background generation method class. An usual keyword argument is
        `poisson`.
    sig_kwargs : dict | None
        Additional keyword arguments for the `generate_signal_events` method
        of the `SignalGenerator` class. An usual keyword argument is
        `poisson`.
    pathfilename : string | None
        Trial data file path including the filename.
        If set to None generated trials won't be saved.
    ncpu : int | None
        The number of CPUs to use.
    ppbar : instance of ProgressBar | None
        The optional instance of the parent progress bar.
    tl: instance of TimeLord | None
        The instance of TimeLord that should be used to measure individual
        tasks.

    Returns
    -------
    seed : int
        The seed used to generate the trials.
    mean_n_sig : 1d ndarray
        The array holding the mean number of signal events used to generate the
        trials.
    mean_n_sig_null : 1d ndarray
        The array holding the fixed mean number of signal events for the
        null-hypothesis used to generate the trials.
    trial_data : structured numpy ndarray
        The generated trial data.

[5]:
from skyllh.core.timing import TimeLord

tl = TimeLord()
[6]:
rss = RandomStateService(seed=1)
(_, _, _, trials) = create_trial_data_file(
    ana=ana, rss=rss, n_trials=1e3, mean_n_sig=0, pathfilename='txs_bkg_trails.npy', ncpu=8, tl=tl
)
print(tl)
100%|██████████| 1001/1001 [00:17<00:00, 55.99it/s]
TimeLord: Executed tasks:
[Generating background events for dataset "IC40".     ] 7.8e-04 sec/iter (1000)
[Generating background events for dataset "IC59".     ]   0.002 sec/iter (1000)
[Generating background events for dataset "IC79".     ]   0.001 sec/iter (1000)
[Generating background events for dataset "IC86_I-XI".]   0.009 sec/iter (1000)
[Generating pseudo data.                              ]   0.011 sec/iter (1000)
[Initializing trial.                                  ]   0.018 sec/iter (1000)
[Get sig probability densities and grads.             ] 1.2e-06 sec/iter (88088)
[Get bkg probability densities and grads.             ] 8.7e-07 sec/iter (88088)
[Calculate PDF ratios.                                ] 4.4e-05 sec/iter (88088)
[Calc pdfratio value Ri                               ] 3.2e-04 sec/iter (44044)
[Calc logLambda and grads                             ] 1.1e-04 sec/iter (44044)
[Evaluate llh-ratio function.                         ]   0.002 sec/iter (11011)
[Minimize -llhratio function.                         ]   0.017 sec/iter (1000)
[Maximizing LLH ratio function.                       ]   0.017 sec/iter (1000)
[Calculating test statistic.                          ] 1.4e-05 sec/iter (1000)

After generating the background trials, we can histogram the test-statistic values and plot the TS distribution.

[7]:
fig, ax = plt.subplots()
(h, be) = np.histogram(trials['ts'], bins=np.arange(0, np.max(trials['ts']) + 0.1, 0.1))
ax.plot(0.5 * (be[:-1] + be[1:]), h, drawstyle='steps-mid', label='background')
ax.axvline(ts, label=f'TS(TXS 0506+056)={ts:.2f}')
ax.set_yscale('log')
ax.set_xlabel('TS')
ax.set_ylabel('#trials per bin')
ax.legend()
plt.show()
../_images/tutorials_p_value_method_comparison_12_0.png

One can also use a helper function extend_trial_data_file to extend the trial data if the trial data generated is not enough to estimate the p-value reliably.

[8]:
from skyllh.core.utils.analysis import extend_trial_data_file
[9]:
help(extend_trial_data_file)
Help on function extend_trial_data_file in module skyllh.core.utils.analysis:

extend_trial_data_file(
    ana,
    rss,
    n_trials,
    trial_data,
    mean_n_sig=0,
    mean_n_sig_null=0,
    mean_n_bkg_list=None,
    bkg_kwargs=None,
    sig_kwargs=None,
    pathfilename=None,
    **kwargs
)
    Appends to the trial data file `n_trials` generated trials for each
    mean number of injected signal events specified by `mean_n_sig` for a
    given analysis.

    Parameters
    ----------
    ana : instance of Analysis
        The Analysis instance to use for sensitivity estimation.
    rss : instance of RandomStateService
        The RandomStateService instance to use for generating random
        numbers.
    n_trials : int
        The number of trials the trial data file needs to be extended by.
    trial_data : structured numpy ndarray
        The structured numpy ndarray holding the trials.
    mean_n_sig : ndarray of float | float | 2- or 3-element sequence of float
        The array of mean number of injected signal events (MNOISEs) for which
        to generate trials. If this argument is not a ndarray, an array of
        MNOISEs is generated based on this argument.
        If a single float is given, only this given MNOISEs are injected.
        If a 2-element sequence of floats is given, it specifies the range of
        MNOISEs with a step size of one.
        If a 3-element sequence of floats is given, it specifies the range plus
        the step size of the MNOISEs.
    mean_n_sig_null : ndarray of float | float | 2- or 3-element sequence of float
        The array of the fixed mean number of signal events (FMNOSEs) for the
        null-hypothesis for which to generate trials. If this argument is not a
        ndarray, an array of FMNOSEs is generated based on this argument.
        If a single float is given, only this given FMNOSEs are used.
        If a 2-element sequence of floats is given, it specifies the range of
        FMNOSEs with a step size of one.
        If a 3-element sequence of floats is given, it specifies the range plus
        the step size of the FMNOSEs.
    bkg_kwargs : dict | None
        Additional keyword arguments for the `generate_events` method of the
        background generation method class. An usual keyword argument is
        `poisson`.
    sig_kwargs : dict | None
        Additional keyword arguments for the `generate_signal_events` method
        of the `SignalGenerator` class. An usual keyword argument is
        `poisson`.
    pathfilename : string | None
        Trial data file path including the filename.

    Additional keyword arguments
    ----------------------------
    Additional keyword arguments are passed-on to the ``create_trial_data_file``
    function.

    Returns
    -------
    trial_data :
        Trial data file extended by the required number of trials for each
        mean number of injected signal events..

[10]:
tl = TimeLord()
rss = RandomStateService(seed=2)
trials = extend_trial_data_file(
    ana=ana, rss=rss, n_trials=4e3, trial_data=trials, pathfilename='txs_bkg_trails.npy', ncpu=8, tl=tl
)
100%|██████████| 4001/4001 [01:02<00:00, 64.44it/s]
[11]:
print(tl)
TimeLord: Executed tasks:
[Generating background events for dataset "IC40".     ] 7.8e-04 sec/iter (4000)
[Generating background events for dataset "IC59".     ]   0.002 sec/iter (4000)
[Generating background events for dataset "IC79".     ]   0.002 sec/iter (4000)
[Generating background events for dataset "IC86_I-XI".]   0.009 sec/iter (4000)
[Generating pseudo data.                              ]   0.010 sec/iter (4000)
[Initializing trial.                                  ]   0.016 sec/iter (4000)
[Get sig probability densities and grads.             ] 1.1e-06 sec/iter (357296)
[Get bkg probability densities and grads.             ] 8.5e-07 sec/iter (357296)
[Calculate PDF ratios.                                ] 4.4e-05 sec/iter (357296)
[Calc pdfratio value Ri                               ] 2.9e-04 sec/iter (178648)
[Calc logLambda and grads                             ] 1.1e-04 sec/iter (178648)
[Evaluate llh-ratio function.                         ]   0.001 sec/iter (44662)
[Minimize -llhratio function.                         ]   0.015 sec/iter (4000)
[Maximizing LLH ratio function.                       ]   0.015 sec/iter (4000)
[Calculating test statistic.                          ] 1.3e-05 sec/iter (4000)

Among utility functions saved in the library skyllh.core.utils.analysis, SkyLLH offers a function calculate_pval_from_trials that allows one to calculate the p-value emprically directly from the distribution of background trials. The function takes the created background-only pseudo-experiments and the observed TS value and it returns the corresponding p-value and the related statistical uncertainty.

[12]:
from skyllh.core.utils.analysis import calculate_pval_from_trials
[13]:
help(calculate_pval_from_trials)
Help on function calculate_pval_from_trials in module skyllh.core.utils.analysis:

calculate_pval_from_trials(ts_vals, ts_threshold, comp_operator='greater_equal')
    Calculates the percentage (p-value) of test-statistic trials that are
    above the given test-statistic critical value.
    In addition it calculates the standard deviation of the p-value assuming
    binomial statistics.

    Parameters
    ----------
    ts_vals : (n_trials,)-shaped 1D ndarray of float
        The ndarray holding the test-statistic values of the trials.
    ts_threshold : float
        The critical test-statistic value.
    comp_operator: string, optional
        The comparison operator for p-value calculation. It can be set to one of
        the following options: 'greater' or 'greater_equal'.

    Returns
    -------
    p, p_sigma: tuple(float, float)

[14]:
trials = np.load('txs_bkg_trails.npy')
[15]:
p_val_from_trials, p_val_from_trials_sigma = calculate_pval_from_trials(trials['ts'], ts)
print(f'p_value calculated from trials: {p_val_from_trials:.3e}')
p_value calculated from trials: 2.200e-03

Here below we provide a visualization of the computed background-only pseudo-experiments TS values together with the observed TS for TXS 0506+056

[16]:
fig, ax = plt.subplots()
(h, be) = np.histogram(trials['ts'], bins=np.arange(0, np.max(trials['ts']) + 0.1, 0.1))
ax.plot(0.5 * (be[:-1] + be[1:]), h, drawstyle='steps-mid', label='background')
ax.axvline(ts, label=f'TS(TXS 0506+056)={ts:.3f}')
ax.set_yscale('log')
ax.set_xlabel('TS')
ax.set_ylabel('#trials per bin')
ax.legend()
plt.show()
../_images/tutorials_p_value_method_comparison_24_0.png

Calculating the significance (local p-value) by fitting a truncated gamma function

It can happen (if your lucky or if you’re estimating a \(5\sigma\) discovery potential, see the Sensitivity study notebook) that the observed TS is high enough that computing background trials to calculate the corresponding p-value becomes extremely expensive.

While, the emprical estimation remains the most robust way to compute a significance, a faster alternative is to fit the tail of the generated distribution of background-only pseudo-experiments with a truncated gamma function and use it to extrapolate p-values at higher TS values.

Among the analysis utility functions, SkyLLH offers the fit_truncated_gamma function, which fits the truncated gamma function to the TS distribution.

Additionally, the calculate_pval_from_gammafit_to_trials function provides a wrapper method that runs the fit and returns the p-value estimated from the fit.

..note:: Please note that fitting the truncated gamma function requires the iminuit Python package.

[17]:
from scipy.stats import gamma

from skyllh.core.utils.analysis import (
    calculate_critical_ts_from_gamma,
    calculate_pval_from_gammafit_to_trials,
    fit_truncated_gamma,
)

You need to truncate the distribution at some \(\eta > 0\) because the TS distribution is the sum of a delta peak at 0 and a continuous distribution for TS > 0. Here we choose \(\eta=2.0\).

[18]:
eta = 2.0
pars, norm = fit_truncated_gamma(trials['ts'], eta=eta)

# Plot
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
bins = np.linspace(0, np.max(trials['ts']) + 0.1, 30)
counts, _ = np.histogram(trials['ts'], bins=bins, density=True)
ax.bar(
    0.5 * (bins[1:] + bins[:-1]),
    height=counts,
    width=bins[1:] - bins[:-1],
    edgecolor='C0',
    color='C0',
    alpha=0.5,
    label='Bkg TS distribution',
)
ax.errorbar(
    0.5 * (bins[1:] + bins[:-1]),
    counts,
    yerr=np.sqrt(counts / len(trials['ts']) / (bins[1:] - bins[:-1])),
    fmt='',
    ls='',
    color='C0',
)
bins = bins[bins > eta]
bincenters = 0.5 * (bins[1:] + bins[:-1])
bw = bins[1:] - bins[:-1]
ax.set_yscale('log')

vals = np.linspace(eta, np.max(trials['ts']) + 2, 100)
ax.plot(
    vals, norm * gamma.pdf(vals, a=pars[0], scale=pars[1]), lw=1.5, color='k', zorder=5, label='Fitted truncated gamma'
)

ax.axvline(ts, color='C1', linestyle='--', label='unblinded TS')

ax.set_xlabel('Test statistic')
ax.set_ylabel('Counts')

ax.legend()
plt.show()
../_images/tutorials_p_value_method_comparison_30_0.png

calculate_critical_ts_from_gamma function calculates the TS you would need in your analysis to reach a given significance.

[19]:
critical_ts = calculate_critical_ts_from_gamma(
    ts=trials['ts'],
    h0_ts_quantile=1.3e-3,  # this is ~ 3 sigma
    eta=1.0,
)
print(f'TS value for a 3 sigma detection: {critical_ts:.2f}')
critical_ts = calculate_critical_ts_from_gamma(
    ts=trials['ts'],
    h0_ts_quantile=2.9e-7,  # this is ~ 5 sigma
    eta=1.0,
)
print(f'TS value for a 5 sigma detection: {critical_ts:.2f}')
TS value for a 3 sigma detection: 12.34
TS value for a 5 sigma detection: 29.47

And of course you can also calculate the p-value for the observed TS from the fitted function. This should result in a very similar result to the one estimated empirically from the trials.

[20]:
p_val_from_fit, _ = calculate_pval_from_gammafit_to_trials(trials['ts'], ts, eta=eta)
print(f'p_value calculated from fit: {p_val_from_fit:.3e}')
p_value calculated from fit: 3.939e-03

Practially, one might be interested in flip-flopping from one estimation method to the other depending on how high the TS value of interest is and on how many pseudo-experiments have been computed to populate the background TS distribution.

This feature exists in the calculate_pval_from_trials_mixed method, which takes a switch_at_ts additional argument to set at what TS value one should switch from the empirical estimation to the evaluation of the truncated gamma fit.

[21]:
from skyllh.core.utils.analysis import calculate_pval_from_trials_mixed

help(calculate_pval_from_trials_mixed)
Help on function calculate_pval_from_trials_mixed in module skyllh.core.utils.analysis:

calculate_pval_from_trials_mixed(
    ts_vals,
    ts_threshold,
    switch_at_ts=3.0,
    eta=None,
    n_max=500000,
    comp_operator='greater_equal'
)
    Calculates the probability (p-value) of test-statistic exceeding
    the given test-statistic threshold. This calculation relies on fitting
    a gamma distribution to a list of ts values if ts_threshold is larger than
    switch_at_ts. If ts_threshold is smaller then the pvalue will be taken
    from the trials directly.

    Parameters
    ----------
    ts_vals : (n_trials,)-shaped 1D ndarray of float
        The ndarray holding the test-statistic values of the trials.
    ts_threshold : float
        The critical test-statistic value.
    switch_at_ts : float, optional
        Test-statistic value below which p-value is computed from trials
        directly. For thresholds greater than switch_at_ts the pvalue is
        calculated using a gamma fit.
    eta : float, optional
        Test-statistic value at which the gamma function is truncated
        from below. Default is None.
    n_max : int, optional
        The maximum number of trials that should be used during
        fitting. Default = 500,000
    comp_operator: string, optional
        The comparison operator for p-value calculation. It can be set to one of
        the following options: 'greater' or 'greater_equal'.

    Returns
    -------
    p, p_sigma: tuple(float, float)