Dataset Definition
Each analysis requires a dataset definition for each dataset it uses. In practice, the data usually comes as a data sample, i.e. a collection of individual datasets. For example, the public 10-year IceCube point-source data sample is a collection of individual datasets, one for each partial IceCube detector configuration.
SkyLLH provides the skyllh.core.dataset.Dataset class to create an individual dataset definition. Such a definition specifies the experimental and Monte Carlo data files, and possibly additional information such as data binning definitions or auxiliary data files.
Individual datasets can be combined into a dataset collection via the skyllh.core.dataset.DatasetCollection class.
A dataset collection is usually defined within one Python module providing the function create_dataset_collection. For instance, the 10-year public point-source data sample is defined in the skyllh.datasets.i3.PublicData_10y_ps module, and its dataset collection can be created via the create_dataset_collection() function. This function requires a configuration. If no data repository base path is set in the configuration, that base path needs to be passed to the function as well.
[1]:
from skyllh.core.config import (
    Config,
)
from skyllh.core.dataset import (
    Dataset,
    DatasetCollection,
)
[2]:
# Create configuration instance.
cfg = Config()
[3]:
# Create individual dataset.
my_dataset = Dataset(
    cfg=cfg,
    name='My Dataset',
    exp_pathfilenames='exp.npy',
    mc_pathfilenames='mc.npy',
    livetime=365,
    version=1,
    verqualifiers={'patch': 0},
    default_sub_path_fmt='my_dataset_v{version:03d}_p{patch:02d}',
    base_path='/data/ana/analyses/',
)
# Create collection of individual datasets.
dsc = DatasetCollection(
    name='My Dataset Collection',
    description='This is my dataset collection containing all my individual '
        'datasets.')
dsc.add_datasets((my_dataset,))
[3]:
<skyllh.core.dataset.DatasetCollection at 0x7fa0a37834f0>
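The default_sub_path_fmt is an ordinary Python format string whose placeholders are filled from the dataset's version and version qualifiers. A quick standalone check (plain Python, no SkyLLH required) of how the sub path used above expands:

```python
# The dataset's version (1) and the 'patch' version qualifier (0)
# fill the placeholders of the default sub-path format string.
fmt = 'my_dataset_v{version:03d}_p{patch:02d}'
sub_path = fmt.format(version=1, patch=0)
print(sub_path)  # my_dataset_v001_p00
```

This is why the data files are looked up under /data/ana/analyses/my_dataset_v001_p00/ in the output below.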
We can print the dataset collection, which will list all the individual datasets of this collection.
[4]:
print(dsc)
DatasetCollection "My Dataset Collection"
--------------------------------------------------------------------------------
Description:
This is my dataset collection containing all my individual datasets.
Available datasets:
Dataset "My Dataset": v001patch00
{ livetime = 365.000 days }
Experimental data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
MC data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
Individual datasets of the dataset collection can be retrieved via the get_dataset() method:
[5]:
my_dataset = dsc.get_dataset('My Dataset')
print(my_dataset)
Dataset "My Dataset": v001patch00
{ livetime = 365.000 days }
Experimental data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
MC data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
Auxiliary data files
If a dataset requires auxiliary data files, such files can be defined via the add_aux_data_definition() method:
[6]:
my_dataset.add_aux_data_definition('aux_file_key_1', 'aux_data/aux_file1.dat')
[7]:
print(my_dataset)
Dataset "My Dataset": v001patch00
{ livetime = 365.000 days }
Experimental data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/exp.npy
MC data:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/mc.npy
Auxiliary data:
aux_file_key_1:
[FOUND] /data/ana/analyses/my_dataset_v001_p00/aux_data/aux_file1.dat
If the auxiliary data is not present as a file but as actual Python data, such data can be added via the add_aux_data() method:
[8]:
my_dataset.add_aux_data('aux_data_1', [1, 2, 3])
Dataset Origin
An individual dataset can have an origin, which specifies where the dataset can be downloaded from automatically. SkyLLH provides the skyllh.core.dataset.DatasetOrigin class to define such an origin. The origin consists of a host (possibly with a port), a base path and a sub path at the origin, and a transfer function that performs the actual data transfer. SkyLLH provides two dataset transfer methods, wget and rsync.
[9]:
from skyllh.core.dataset import (
    DatasetOrigin,
    WGETDatasetTransfer,
)
[10]:
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='my_dataset',
    transfer_func=WGETDatasetTransfer(protocol='https').transfer)
my_dataset.origin = origin
In the example above we specified that the dataset is available at the URL data.mydomain.com/downloads/data/my_dataset, and can be transferred using wget via the HTTPS protocol. Hence, the experimental and Monte Carlo files exp.npy and mc.npy of the dataset must be available at https://data.mydomain.com/downloads/data/my_dataset/exp.npy and https://data.mydomain.com/downloads/data/my_dataset/mc.npy, respectively.
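The download URL is simply the protocol plus the host, base path, sub path, and file name joined together. A minimal sketch of that assembly (plain Python; the join below is an illustration of the URL layout, not SkyLLH's internal code):

```python
import posixpath

# Origin components as defined in the example above.
host = 'data.mydomain.com'
base_path = '/downloads/data'
sub_path = 'my_dataset'

# Join the path components and prepend the protocol and host.
url = 'https://{}{}'.format(
    host, posixpath.join(base_path, sub_path, 'exp.npy'))
print(url)  # https://data.mydomain.com/downloads/data/my_dataset/exp.npy
```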
Origin as archive file
The dataset might be available as an archive file, e.g. a zip file on a webserver. In such cases the filename argument of the DatasetOrigin class constructor can be used in combination with a post-transfer function specified via the post_transfer_func argument of the constructor:
[11]:
origin = DatasetOrigin(
    host='data.mydomain.com',
    base_path='/downloads/data',
    sub_path='',
    filename='my_dataset.zip',
    transfer_func=WGETDatasetTransfer(protocol='https').transfer,
    post_transfer_func=WGETDatasetTransfer.post_transfer_unzip)
The example above will transfer the single archive file https://data.mydomain.com/downloads/data/my_dataset.zip and unzip it on the local host.
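Conceptually, the post-transfer step just extracts the downloaded archive on the local host. A self-contained sketch of that unzip step using only the standard library (it builds a tiny stand-in archive in memory, so it runs without any actual download):

```python
import io
import os
import tempfile
import zipfile

# Build a tiny stand-in for my_dataset.zip in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('my_dataset/exp.npy', b'dummy experimental data')
    zf.writestr('my_dataset/mc.npy', b'dummy monte-carlo data')

# Extract it into a local directory, as the post-transfer function
# would do after the download has finished.
with tempfile.TemporaryDirectory() as root:
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        zf.extractall(root)
    print(sorted(os.listdir(os.path.join(root, 'my_dataset'))))
    # ['exp.npy', 'mc.npy']
```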
Downloading the dataset
If an origin is defined for an individual dataset, that dataset can be downloaded automatically using the skyllh.core.dataset.Dataset.make_data_available() method of the Dataset class.