Dataset Definition

Each analysis requires a dataset definition that specifies a particular dataset. In practice, data usually comes as a data sample, i.e. a collection of individual datasets. For example, the public 10-year IceCube point-source data sample is a collection of individual datasets, one for each partial IceCube detector configuration.

SkyLLH provides the skyllh.core.dataset.Dataset class to create an individual dataset definition. Such a definition specifies the experimental and Monte Carlo data files, and possibly additional information like data binning definitions or auxiliary data files.

Individual datasets can be combined into a dataset collection via the skyllh.core.dataset.DatasetCollection class.

A dataset collection is usually defined within a single Python module providing the function create_dataset_collection(). For instance, the 10-year public point-source data sample is defined in the skyllh.datasets.i3.PublicData_10y_ps module, and its dataset collection can be created via the create_dataset_collection() function. This function requires a configuration. If no data repository base path is set in the configuration, that base path needs to be passed to the function as well.
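As a sketch, creating that dataset collection could look like the following; the local base path is hypothetical and must point to the downloaded copy of the data repository:

[ ]:
from skyllh.core.config import (
    Config,
)
from skyllh.datasets.i3.PublicData_10y_ps import (
    create_dataset_collection,
)

cfg = Config()

# The base path is hypothetical; it must point to the local copy of the
# public data repository.
dsc = create_dataset_collection(
    cfg=cfg,
    base_path='/data/ana/analyses/')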

[1]:
from skyllh.core.config import (
    Config,
)
from skyllh.core.dataset import (
    Dataset,
    DatasetCollection,
)
[2]:
# Create configuration instance.
cfg = Config()
[3]:
# Create individual dataset.
my_dataset = Dataset(
    cfg=cfg,
    name='My Dataset',
    exp_pathfilenames='exp.npy',
    mc_pathfilenames='mc.npy',
    livetime=365,
    version=1,
    verqualifiers={'patch': 0},
    default_sub_path_fmt='my_dataset_v{version:03d}_p{patch:02d}',
    base_path='/data/ana/analyses/',
)

# Create collection of individual datasets.
dsc = DatasetCollection(
    name='My Dataset Collection',
    description='This is my dataset collection containing all my individual '
        'datasets.')
dsc.add_datasets((my_dataset,))
[3]:
<skyllh.core.dataset.DatasetCollection at 0x7fa0a37834f0>

We can print the dataset collection, which will list all its individual datasets.

[4]:
print(dsc)
DatasetCollection "My Dataset Collection"
--------------------------------------------------------------------------------
Description:
This is my dataset collection containing all my individual datasets.
Available datasets:

  Dataset "My Dataset": v001patch00
      { livetime = 365.000 days }
      Experimental data:
          [FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
      MC data:
          [FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy

Individual datasets of the dataset collection can be retrieved via the get_dataset() method:

[5]:
my_dataset = dsc.get_dataset('My Dataset')
print(my_dataset)
Dataset "My Dataset": v001patch00
    { livetime = 365.000 days }
    Experimental data:
        [FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
    MC data:
        [FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy

Auxiliary data files

If a dataset requires auxiliary data files, such files can be defined via the add_aux_data_definition() method:

[6]:
my_dataset.add_aux_data_definition('aux_file_key_1', 'aux_data/aux_file1.dat')
[7]:
print(my_dataset)
Dataset "My Dataset": v001patch00
    { livetime = 365.000 days }
    Experimental data:
        [FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
    MC data:
        [FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
    Auxiliary data:
        aux_file_key_1:
            [FOUND] /data/ana/analyses/my_dataset_v001_p00/aux_data/aux_file1.dat

If the auxiliary data is not present as a file but as actual Python data, such data can be added via the add_aux_data() method:

[8]:
my_dataset.add_aux_data('aux_data_1', [1, 2, 3])
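
The added auxiliary data can be retrieved again by its key. The following sketch assumes a get_aux_data() method of the Dataset class as the counterpart to add_aux_data():

[ ]:
# Assumed accessor: retrieve the auxiliary Python data added above by its key.
aux_data_1 = my_dataset.get_aux_data('aux_data_1')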

Dataset Origin

An individual dataset can have an origin, which specifies where the dataset can be downloaded automatically. SkyLLH provides the skyllh.core.dataset.DatasetOrigin class to define such an origin.

The origin consists of a host (and possibly a port), a base path and a sub path at the origin, and a transfer function that performs the actual data transfer.

SkyLLH provides two dataset transfer methods, wget and rsync.

[9]:
from skyllh.core.dataset import (
    DatasetOrigin,
    WGETDatasetTransfer,
)
[10]:
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='my_dataset',
    transfer_func=WGETDatasetTransfer(protocol='https').transfer)
my_dataset.origin = origin

In the example above we specified that the dataset is available at the URL data.mydomain.com/downloads/data/my_dataset and can be transferred using wget via the https protocol.

Hence, the experimental and Monte Carlo data files exp.npy and mc.npy of the dataset must be available at https://data.mydomain.com/downloads/data/my_dataset/exp.npy and https://data.mydomain.com/downloads/data/my_dataset/mc.npy, respectively.
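An rsync-based origin can be defined analogously. The following is a sketch that assumes the RSYNCDatasetTransfer class of skyllh.core.dataset mirrors the WGETDatasetTransfer interface; host and paths are again hypothetical:

[ ]:
from skyllh.core.dataset import (
    RSYNCDatasetTransfer,
)

# Assumed rsync counterpart to the wget transfer method shown above.
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='my_dataset',
    transfer_func=RSYNCDatasetTransfer().transfer)
my_dataset.origin = origin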

Origin as archive file

The dataset might be available as an archive file, e.g. a zip file on a webserver. In such cases the filename argument of the DatasetOrigin constructor can be used in combination with a post-transfer function, specified via the post_transfer_func argument:

[11]:
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='',
    filename='my_dataset.zip',
    transfer_func=WGETDatasetTransfer(protocol='https').transfer,
    post_transfer_func=WGETDatasetTransfer.post_transfer_unzip)

The example above will transfer the single archive file https://data.mydomain.com/downloads/data/my_dataset.zip and unzip the file on the local host.

Downloading the dataset

If an origin is defined for an individual dataset, that dataset can be downloaded automatically using the skyllh.core.dataset.Dataset.make_data_available() method.
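
For our example dataset this reduces to a single call. Note that this sketch would actually contact the hypothetical host data.mydomain.com defined above:

[ ]:
# Download the dataset files from the defined origin if they are not
# already available locally.
my_dataset.make_data_available()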