Dataset Definition
Each analysis requires a dataset definition for each dataset it uses. In practice, the data usually comes as a data sample, i.e. a collection of individual datasets. For example, the public 10-year IceCube point-source data sample is a collection of individual datasets, one for each partial IceCube detector configuration.
SkyLLH provides the skyllh.core.dataset.Dataset class to create an individual dataset definition. Such a definition specifies the experimental and Monte Carlo data files, and possibly additional information such as data binning definitions or auxiliary data files.
Individual datasets can be combined into a dataset collection via the skyllh.core.dataset.DatasetCollection class.
A dataset collection is usually defined within one Python module providing the function create_dataset_collection. For instance, the 10-year public point-source data sample is defined in the skyllh.datasets.i3.PublicData_10y_ps module, and its dataset collection can be created via the create_dataset_collection() function. This function requires a configuration. If no data repository base path is set in the configuration, that base path needs to be passed to the function as well.
[1]:
from skyllh.core.config import (
    Config,
)
from skyllh.core.dataset import (
    Dataset,
    DatasetCollection,
)
[2]:
# Create configuration instance.
cfg = Config()
[3]:
# Create individual dataset.
my_dataset = Dataset(
    cfg=cfg,
    name='My Dataset',
    exp_pathfilenames='exp.npy',
    mc_pathfilenames='mc.npy',
    livetime=365,
    version=1,
    verqualifiers={'patch': 0},
    default_sub_path_fmt='my_dataset_v{version:03d}_p{patch:02d}',
    base_path='/data/ana/analyses/',
)
# Create collection of individual datasets.
dsc = DatasetCollection(
    name='My Dataset Collection',
    description='This is my dataset collection containing all my individual '
        'datasets.')
dsc.add_datasets((my_dataset,))
[3]:
<skyllh.core.dataset.DatasetCollection at 0x7fa0a37834f0>
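The default_sub_path_fmt is an ordinary Python format string whose placeholders are filled from the dataset's version and version qualifiers. A quick standalone check (plain Python, no SkyLLH required) of how the sub path used above expands:

```python
# The dataset's version (1) and the 'patch' version qualifier (0)
# fill the placeholders of the default sub-path format string.
fmt = 'my_dataset_v{version:03d}_p{patch:02d}'
sub_path = fmt.format(version=1, patch=0)
print(sub_path)  # my_dataset_v001_p00
```

This is why the data files are looked up under /data/ana/analyses/my_dataset_v001_p00/ in the output below.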
We can print the dataset collection, which will list all the individual datasets of this collection.
[4]:
print(dsc)
DatasetCollection "My Dataset Collection"
--------------------------------------------------------------------------------
Description:
This is my dataset collection containing all my individual datasets.
Available datasets:
Dataset "My Dataset": v001patch00
{ livetime = 365.000 days }
Experimental data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
MC data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
Individual datasets of the dataset collection can be retrieved via the get_dataset() method:
[5]:
my_dataset = dsc.get_dataset('My Dataset')
print(my_dataset)
Dataset "My Dataset": v001patch00
{ livetime = 365.000 days }
Experimental data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
MC data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
Auxiliary data files
If a dataset requires auxiliary data files, such files can be defined via the add_aux_data_definition() method:
[6]:
my_dataset.add_aux_data_definition('aux_file_key_1', 'aux_data/aux_file1.dat')
[7]:
print(my_dataset)
Dataset "My Dataset": v001patch00
{ livetime = 365.000 days }
Experimental data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
MC data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
Auxiliary data:
aux_file_key_1:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/aux_data/aux_file1.dat
If the auxiliary data is not present as a file but as actual Python data, such data can be added via the add_aux_data() method:
[8]:
my_dataset.add_aux_data('aux_data_1', [1, 2, 3])
Dataset Origin
An individual dataset can have an origin, which specifies where the dataset can be downloaded from automatically. SkyLLH provides the skyllh.core.dataset.DatasetOrigin class to define such an origin. The origin consists of a host (possibly with a port), a base path and a sub path at the origin, and a transfer function that performs the actual data transfer. SkyLLH provides two dataset transfer methods, wget and rsync.
[9]:
from skyllh.core.dataset import (
    DatasetOrigin,
    WGETDatasetTransfer,
)
[10]:
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='my_dataset',
    transfer_func=WGETDatasetTransfer(protocol='https').transfer)
my_dataset.origin = origin
In the example above we specified that the dataset is available at the URL data.mydomain.com/downloads/data/my_dataset, and can be transferred using wget via the HTTPS protocol. Hence, the experimental and Monte Carlo files exp.npy and mc.npy of the dataset must be available at https://data.mydomain.com/downloads/data/my_dataset/exp.npy and https://data.mydomain.com/downloads/data/my_dataset/mc.npy, respectively.
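The download URL is simply the protocol plus the host, base path, sub path, and file name joined together. A minimal sketch of that assembly (plain Python; the join below is an illustration of the URL layout, not SkyLLH's internal code):

```python
import posixpath

# Origin components as defined in the example above.
host = 'data.mydomain.com'
base_path = '/downloads/data'
sub_path = 'my_dataset'

# Join the path components and prepend the protocol and host.
url = 'https://{}{}'.format(
    host, posixpath.join(base_path, sub_path, 'exp.npy'))
print(url)  # https://data.mydomain.com/downloads/data/my_dataset/exp.npy
```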
Origin as archive file
The dataset might be available as an archive file, e.g. a zip file on a webserver. In such cases the filename argument of the DatasetOrigin class constructor can be used in combination with a post-transfer function specified via the post_transfer_func argument of the constructor:
[11]:
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='',
    filename='my_dataset.zip',
    transfer_func=WGETDatasetTransfer(protocol='https').transfer,
    post_transfer_func=WGETDatasetTransfer.post_transfer_unzip)
The example above will transfer the single archive file https://data.mydomain.com/downloads/data/my_dataset.zip and unzip it on the local host.
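Conceptually, the post-transfer step just extracts the downloaded archive on the local host. A self-contained sketch of that unzip step using only the standard library (it builds a tiny stand-in archive in memory, so it runs without any actual download):

```python
import io
import os
import tempfile
import zipfile

# Build a tiny stand-in for my_dataset.zip in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('my_dataset/exp.npy', b'dummy experimental data')
    zf.writestr('my_dataset/mc.npy', b'dummy monte-carlo data')

# Extract it into a local directory, as the post-transfer function
# would do after the download has finished.
with tempfile.TemporaryDirectory() as root:
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        zf.extractall(root)
    print(sorted(os.listdir(os.path.join(root, 'my_dataset'))))
    # ['exp.npy', 'mc.npy']
```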
Downloading the dataset
If an origin is defined for an individual dataset, that dataset can be downloaded automatically using the skyllh.core.dataset.Dataset.make_data_available() method of the Dataset class.