Fitting a steady point-source with the public 14-year IceCube track data

This tutorial shows how to fit a neutrino point-like steady source (time-integrated) using the IceCube public 14-year point-source data with SkyLLH.

We assume that you have installed the SkyLLH package in your environment by following the Installation guide. If that’s not the case you’ll need to add the path to the SkyLLH folder to your PYTHONPATH.

[1]:
import numpy as np

Initializing a configuration dictionary

First we have to create a config instance.

[2]:
from skyllh.core.config import Config

cfg = Config()

The Config object stores a dictionary with global settings, including computing, logging, and analysis settings. It is a required argument to several methods.

[3]:
cfg
[3]:
{'multiproc': {'ncpu': None},
 'logging': {'log_level': 'INFO',
  'log_format': '%(asctime)s %(processName)s %(name)s %(levelname)s: %(message)s',
  'enable_tracing': False},
 'project': {'working_directory': '.'},
 'repository': {'base_path': '/Users/chiarabellenghi/.cache/skyllh',
  'download_from_origin': True},
 'units': {'internal': {'angle': Unit("rad"),
   'energy': Unit("GeV"),
   'length': Unit("cm"),
   'time': Unit("s")},
  'defaults': {'fluxes': {'angle': Unit("rad"),
    'energy': Unit("GeV"),
    'length': Unit("cm"),
    'time': Unit("s")}}},
 'datafields': {'run': 4,
  'ra': 4,
  'dec': 4,
  'ang_err': 4,
  'time': 4,
  'log_energy': 4,
  'true_ra': 8,
  'true_dec': 8,
  'true_energy': 8,
  'mcweight': 8},
 'caching': {'pdf': {'MultiDimGridPDF': False}}}

Multiprocessing

The multiprocessing in skyllh can be enabled globally by updating the configuration with a number of CPU cores. It almost linearly speeds up creating an analysis and running trials, with a trade-off of a slightly higher memory usage:

[4]:
cfg.set_ncpu(6)

cfg
[4]:
{'multiproc': {'ncpu': 6},
 'logging': {'log_level': 'INFO',
  'log_format': '%(asctime)s %(processName)s %(name)s %(levelname)s: %(message)s',
  'enable_tracing': False},
 'project': {'working_directory': '.'},
 'repository': {'base_path': '/Users/chiarabellenghi/.cache/skyllh',
  'download_from_origin': True},
 'units': {'internal': {'angle': Unit("rad"),
   'energy': Unit("GeV"),
   'length': Unit("cm"),
   'time': Unit("s")},
  'defaults': {'fluxes': {'angle': Unit("rad"),
    'energy': Unit("GeV"),
    'length': Unit("cm"),
    'time': Unit("s")}}},
 'datafields': {'run': 4,
  'ra': 4,
  'dec': 4,
  'ang_err': 4,
  'time': 4,
  'log_energy': 4,
  'true_ra': 8,
  'true_dec': 8,
  'true_energy': 8,
  'mcweight': 8},
 'caching': {'pdf': {'MultiDimGridPDF': False}}}

Setting up logging

The user sets up the main logger. The logger name, logging format and level, output file, and other options can be set.

Default logging format and level are defined in the Config instance, but they can be overwritten by the arguments of logging.setup_logging.

[5]:
from skyllh.core.logging import setup_logging

logger = setup_logging(cfg=cfg, name='fitting_a_source', log_level='INFO')

Getting the datasets

The datasets are automatically downloaded from dataverse.harvard.edu to a local cache directory (~/.cache/skyllh) on first run. To use a custom dataset location, set cfg['repository']['base_path'] to the desired path.

Note

IceTracks-DR2 is a pretty heavy dataset (~2GB) and it might take a while to initialize the analysis when it has to be downloaded for the first time. Be patient, and the next time it will be loaded from ~/.cache/skyllh :)

To create the correct dataset object, we need to import the 14-yr dataset definition:

[6]:
import skyllh

The collection of datasets is created with skyllh.create_datasets. With the default configuration, the data is downloaded automatically from dataverse.harvard.edu.

[7]:
help(skyllh.create_datasets)
Help on function create_datasets in module skyllh.datasets.datasets:

create_datasets(sample_name, cfg, names=None, base_path=None, sub_path_fmt=None)
    Creates a list of Dataset instances for a named data sample.

    Parameters
    ----------
    sample_name : str
        The name of the data sample. Available samples are the keys of
        ``skyllh.datasets.data_samples``.
    cfg : instance of Config
        The instance of Config holding the local configuration.
    names : sequence of str | None
        The dataset names to return. If None, the module's ``DATASET_NAMES``
        default is used (combined IC86 seasons where applicable).
    base_path : str | None
        The base path of the data files. If None, uses
        ``cfg['repository']['base_path']``.
    sub_path_fmt : str | None
        The sub path format override. If None, uses the module default.

    Returns
    -------
    datasets : list of Dataset

If no names are passed to skyllh.create_datasets, it loads all data sets that are part of the IceTracks-DR2 dataset collection.

A tutorial on how to access specific information about which datasets are included in, e.g., IceTracks-DR2 is provided in Dataset collections tutorial.

[8]:
datasets = skyllh.create_datasets('IceTracks-DR2', cfg=cfg)
2026-05-21 17:16:28,654 MainProcess skyllh.datasets.datasets INFO: Loaded 4 dataset(s) from sample "IceTracks-DR2": IC40, IC59, IC79, IC86_I-XI

By default, this imports all available data sets in IceTracks-DR2 (all 14 years of data).

Creating the analysis object

The analysis for doing a point-source search as in, e.g., https://doi.org/10.1103/PhysRevLett.124.051103 is pre-defined. This analysis instance can be created via the create_analysis method of the time_integrated_ps module of the public data interface.

The function requires the config, the datasets, and a source object. It also takes a number of additional optional arguments for which we refer the reader to the docstring.

[9]:
from skyllh.analyses.i3.publicdata_ps.time_integrated_ps import create_analysis

help(create_analysis)
Help on function create_analysis in module skyllh.analyses.i3.publicdata_ps.time_integrated_ps:

create_analysis(
    cfg,
    datasets,
    source,
    refplflux_Phi0=1,
    refplflux_E0=1000.0,
    refplflux_gamma=2.0,
    refplflux_Ec=inf,
    energy_range=None,
    ns_seed=10.0,
    ns_min=0.0,
    ns_max=1000.0,
    gamma_seed=3.0,
    gamma_min=1.0,
    gamma_max=4.0,
    kde_smoothing=False,
    minimizer_impl='LBFGS',
    minimizer_max_rep=100,
    compress_data=False,
    keep_data_fields=None,
    evt_sel_delta_angle_deg=10,
    construct_sig_generator=True,
    tl=None,
    ppbar=None,
    logger_name=None
)
    Creates the Analysis instance for this particular analysis.

    Parameters
    ----------
    cfg : instance of Config
        The instance of Config holding the local configuration.
    datasets : list of Dataset instances
        The list of Dataset instances, which should be used in the
        analysis.
    source : PointLikeSource instance
        The PointLikeSource instance defining the point source position.
    refplflux_Phi0 : float
        The flux normalization to use for the reference power law flux model.
    refplflux_E0 : float
        The reference energy to use for the reference power law flux model.
    refplflux_gamma : float
        The spectral index to use for the reference power law flux model.
    refplflux_Ec: float,
        The cutoff energy for the cutoff power law flux model.
    energy_range: 2-element tuple of float (low energy, high energy) | None
        The energy range for signal generation. Both low and high energies are
        given in GeV. If set to ``None``, the entire energy range of the
        dataset is used.
    ns_seed : float
        Value to seed the minimizer with for the ns fit.
    ns_min : float
        Lower bound for ns fit.
    ns_max : float
        Upper bound for ns fit.
    gamma_seed : float | None
        Value to seed the minimizer with for the gamma fit. If set to None,
        the refplflux_gamma value will be set as gamma_seed.
    gamma_min : float
        Lower bound for gamma fit.
    gamma_max : float
        Upper bound for gamma fit.
    kde_smoothing : bool
        Deprecated: use of ``kde_smoothing=True`` is deprecated and will be
        removed in a future version.
        Apply a KDE-based smoothing to the data-driven background pdf.
        Default: False.
    minimizer_impl : str
        Minimizer implementation to be used. Supported options are ``"LBFGS"``
        (L-BFG-S minimizer used from the :mod:`scipy.optimize` module),
        or ``"minuit"`` (Minuit minimizer used by the :mod:`iminuit` module).
        Default: "LBFGS".
    minimizer_max_rep : int
        In case the minimization process did not converge at the first time
        this option specifies the maximum number of repetitions with
        different initials. Default is 100.
    compress_data : bool
        Flag if the data should get converted from float64 into float32.
    keep_data_fields : list of str | None
        List of additional data field names that should get kept when loading
        the data.
    evt_sel_delta_angle_deg : float
        The delta angle in degrees for the event selection optimization methods.
    construct_sig_generator : bool
        Flag if the signal generator should be constructed (``True``) or not
        (``False``).
    tl : TimeLord instance | None
        The TimeLord instance to use to time the creation of the analysis.
    ppbar : ProgressBar instance | None
        The instance of ProgressBar for the optional parent progress bar.
    logger_name : str | None
        The name of the logger to be used. If set to ``None``, ``__name__`` will
        be used.

    Returns
    -------
    ana : instance of SingleSourceMultiDatasetLLHRatioAnalysis
        The Analysis instance for this analysis.

Defining the Source: Here we use the location of NGC 1068 as an example source.

Instantiating a PointLikeSource object required the source location (R.A. and Dec.) in radians. Optionally, the source can have a name and a weight. The latter is in principle useful when performing a stacking analysis which is not supported for public data yet.

[10]:
from skyllh.core.source_model import PointLikeSource

help(PointLikeSource)
Help on class PointLikeSource in module skyllh.core.source_model:

class PointLikeSource(SourceModel, IsPointlike)
 |  PointLikeSource(ra, dec, name=None, weight=None, **kwargs)
 |
 |  The PointLikeSource class is a source model for a point-like source
 |  object in the sky at a given location (right-ascention and declination).
 |
 |  Method resolution order:
 |      PointLikeSource
 |      SourceModel
 |      skyllh.core.model.Model
 |      IsPointlike
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, ra, dec, name=None, weight=None, **kwargs)
 |      Creates a new PointLikeSource instance for defining a point-like
 |      source.
 |
 |      Parameters
 |      ----------
 |      ra : float
 |          The right-ascention coordinate of the source in radians.
 |      dec : float
 |          The declination coordinate of the source in radians.
 |      name : str | None
 |          The name of the source.
 |      weight : float | None
 |          The relative weight of the source w.r.t. other sources.
 |          If set to None, unity will be used.
 |
 |  __str__(self)
 |      Pretty string representation.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from SourceModel:
 |
 |  classification
 |      The astronomical classification of the source.
 |
 |  weight
 |      The weight of the source.
 |
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from skyllh.core.model.Model:
 |
 |  id
 |      (read-only) The ID of the model. It's an integer generated with
 |      Python's `id` function. Hence, it's related to the memory address
 |      of the object.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from skyllh.core.model.Model:
 |
 |  __dict__
 |      dictionary for instance variables
 |
 |  __weakref__
 |      list of weak references to the object
 |
 |  name
 |      The name of the model.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from IsPointlike:
 |
 |  dec
 |      The declination coordinate of the point-like source.
 |
 |  ra
 |      The right-ascention coordinate of the point-like source.

[11]:
src_ra = 40.67  # degrees
src_dec = -0.01  # degrees

source = PointLikeSource(ra=np.radians(src_ra), dec=np.radians(src_dec))

If not otherwise specified through the optional arguments, the time_integrated_ps.create_analysis method source assumes a signal hypothesis such that the defined source emits a powerlaw flux with spectral index \(\gamma=2.0\):

\[\phi(E_{\nu}) = \mathrm{refplflux\_Phi0} \cdot \left(\frac{E_{\nu}}{\mathrm{refplflux\_E0}}\right)^{-\mathrm{refplflux\_gamma}}\]
[12]:
ana = create_analysis(cfg=cfg, datasets=datasets, source=source)
2026-05-21 17:16:29,344 MainProcess skyllh.analyses.i3.publicdata_ps.time_integrated_ps INFO: SourceHypoGroupManager
    Source Hypothesis Groups:
        0: SourceHypoGroup:
            sources (1):
                0: PointLikeSource: "4663125600": { ra=40.670 deg, dec=-0.010 deg }
            fluxmodel:
                1.000e+00 * (E / (1000 GeV))^-2 * 1 (GeV cm^2 s)^-1
            detector signal yield builders (1):
                PDSingleParamFluxPointLikeSourceI3DetSigYieldBuilder
            signal generation method:
                NoneType
2026-05-21 17:16:29,346 MainProcess skyllh.analyses.i3.publicdata_ps.time_integrated_ps INFO: ParameterModelMapper: 2 global parameters, 2 models (1 source)
    Parameters:
        ns [floating (0 <= 10 <= 1000)]
            in models:
            - IceCube: ns

        gamma [floating (1 <= 3 <= 4)]
            in models:
            - 4663125600: gamma

100%|██████████| 33/33 [00:00<00:00, 897.20it/s]
100%|██████████| 33/33 [00:00<00:00, 1015.70it/s]
100%|██████████| 33/33 [00:00<00:00, 949.20it/s]
100%|██████████| 33/33 [00:00<00:00, 764.71it/s]
100%|██████████| 4/4 [00:10<00:00,  2.71s/it]
100%|██████████| 136/136 [00:00<00:00, 1250.92it/s]

Understanding how the analysis works

One usually wants to get best-fit values (maximum likelihood estimators) for the source parameters and a test-statistic, using methods like ana.do_trial or ana.unblind (see below). However, here we describe what happens under the hood when the analysis is run either on simulated or real experimental data.

Initializing a trial

After the Analysis instance was created, the point-source analysis can be run. To do so the analysis is initialized with some trial data. For instance we could initialize the analysis with the experimental data but the same applies for simulated data.

The Analysis object provides the method initialize_trial to initialize a trial with data. It takes a list of DataFieldRecordArray instances holding the events. If we want to initialize a trial with the experimental data, we can get that list from the Analysis instance itself:

[13]:
events_list = [data.exp for data in ana.data_list]
ana.initialize_trial(events_list)

Maximizing the log-likelihood ratio function

After initializing a trial, we can maximize the LLH ratio function using the ana.llhratio.maximize method. This method requires a RandomStateService instance in case the minimizer does not succeed and a new set of initial values for the fit parameters need to get generated. The method returns a 3-element tuple. The first element is the value of the LLH ratio function at its maximum. The second element is the set of fit parameters that maximize the LLH ratio. The third element is the status dictionary of the minimizer.

[14]:
from skyllh.core.random import RandomStateService

rss = RandomStateService(seed=1)
[15]:
(log_lambda_max, fitparam_values, status) = ana.llhratio.maximize(rss)
[16]:
print(f'log_lambda_max = {log_lambda_max}')
print(f'fitparam_values = {fitparam_values}')
print(f'status = {status}')
log_lambda_max = 14.576758615502305
fitparam_values = [80.0938193   3.20884803]
status = {'grad': array([-3.62370079e-06, -2.31360859e-04]), 'task': 'CONVERGENCE: RELATIVE REDUCTION OF F <= FACTR*EPSMCH', 'funcalls': 13, 'nit': 10, 'warnflag': 0, 'skyllh_minimizer_n_reps': 0, 'n_llhratio_func_calls': 13}

Calculating the test-statistic

Using the maximum of the LLH ratio function and the fit parameter values at the maximum we can calculate the test-statistic using the ana.calculate_test_statistic method:

[17]:
TS = ana.calculate_test_statistic(log_lambda_max, fitparam_values)
print(f'TS = {TS:.3f}')
TS = 29.154

Unblinding the data

Above, we perform the analysis following pedagogical steps: we initialize the analysis with a trial of the experimental data, maximize the log-likelihood ratio function for all given experimental data events, and calculate the test-statistic value.

However, the same calculations can be performed all together using the ana.unblind method of the analysis instance, which returns the test statistic and the best fit parameters directly.

Note that the ana.unblind method runs the analysis on the experimental data.

[18]:
help(ana.unblind)
Help on method unblind in module skyllh.core.analysis:

unblind(minimizer_rss, tl=None) method of skyllh.core.analysis.SingleSourceMultiDatasetLLHRatioAnalysis instance
    Evaluates the unscrambled data, i.e. unblinds the data.

    Parameters
    ----------
    minimizer_rss : instance of RandomStateService
        The instance of RandomStateService that should be used by the
        minimizer to generate new random initial fit parameter values.
    tl : instance of TimeLord | None
        The optional instance of TimeLord that should be used to time the
        maximization of the LLH ratio function.

    Returns
    -------
    TS : float
        The test-statistic value.
    global_params_dict : dict
        The dictionary holding the global parameter names and their
        best fit values. It includes fixed and floating parameters.
    status : dict
        The status dictionary with information about the performed
        minimization process of the negative of the log-likelihood ratio
        function.

[19]:
from skyllh.core.random import RandomStateService

rss = RandomStateService(seed=1)

ts, fitparam_values, status = ana.unblind(rss)
[20]:
print(f'TS = {ts:.3f}')
print(f'ns = {fitparam_values["ns"]:.2f}')
print(f'gamma = {fitparam_values["gamma"]:.2f}')
print(f'status = {status}')
TS = 29.154
ns = 80.09
gamma = 3.21
status = {'grad': array([-3.62370079e-06, -2.31360859e-04]), 'task': 'CONVERGENCE: RELATIVE REDUCTION OF F <= FACTR*EPSMCH', 'funcalls': 13, 'nit': 10, 'warnflag': 0, 'skyllh_minimizer_n_reps': 0, 'n_llhratio_func_calls': 13}

Converting the fit parameters into a flux normalization

By default the analysis assuming a powerlaw signal spectrum is created with a flux normalization of 1 \(\text{GeV}^{-1}\text{s}^{-1}\text{cm}^{-2}\text{sr}^{-1}\) at 1 TeV (see refplflux_Phi0 and refplflux_E0 arguments of the create_analysis method).

The analysis instance offers the method ana.calculate_fluxmodel_scaling_factor that calculates the scaling factor that, for a given spectral model, converts a mean number of signal events of 1 into a flux normalization refplflux_Phi0 at refplflux_E0.

[21]:
# The default value of `create_analysis()`
refplflux_gamma = 2.0

scaling_factor = ana.calculate_fluxmodel_scaling_factor()
print(f'1 event = {scaling_factor:.3e} (E/1000 GeV)^{{-{refplflux_gamma:.1f}}} 1/(GeV s cm^2 sr)')
1 event = 3.084e-17 (E/1000 GeV)^{-2.0} 1/(GeV s cm^2 sr)

By default the conversion factor is returned for the spectral index with which the analysis

The method optionally takes a fitparam_values argument where a specific spectral index can be specified. Note that this won’t change the internal refplflux_gamma value with which the analysis was initialized.

[22]:
refplflux_gamma = fitparam_values['gamma']

scaling_factor = ana.calculate_fluxmodel_scaling_factor(fitparam_values=[1, refplflux_gamma])
print(
    f'best-fit flux =  {fitparam_values["ns"] * scaling_factor:.3e} (E/1000 GeV)^{{-{refplflux_gamma:.1f}}} 1/(GeV s cm^2 sr)'
)
best-fit flux =  2.869e-14 (E/1000 GeV)^{-3.2} 1/(GeV s cm^2 sr)

Alternatively, the ana.mu2flux and ana.flux2mu methods return:

  • a flux normalization at refplflux_E0 given a certain mean number of signal events passed to it as an argument

  • a mean number of signal events corresponding to a flux normalization at refplflux_E0 passed to it as an argument

respectively.

They also take a different spectral parameters as optional arguments

[23]:
phi0 = ana.mu2flux(10)
print(f'Flux normalization for 10 events = {phi0:.3e} 1/(GeV s cm^2 sr)')
Flux normalization for 10 events = 3.084e-16 1/(GeV s cm^2 sr)
[24]:
mean_ns = ana.flux2mu(5e-15)
print(f'Mean number of events for a flux normalization at 1000 GeV of 5e-15 1/(GeV s cm^2 sr) = {mean_ns:.2f}')
Mean number of events for a flux normalization at 1000 GeV of 5e-15 1/(GeV s cm^2 sr) = 162.14