{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset Definition" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Each analysis requires a dataset definition that describes a particular dataset.\n", "In practice, a data sample usually exists, which is a collection of individual\n", "datasets. For example, the public 10-year IceCube point-source data sample is a\n", "collection of individual datasets, one for each partial IceCube detector\n", "configuration.\n", "\n", "SkyLLH provides the :py:class:`skyllh.core.dataset.Dataset` class to create an\n", "individual dataset definition. Such a definition specifies the experimental and\n", "Monte Carlo data files and possibly additional information like data binning\n", "definitions or auxiliary data files.\n", "\n", "Individual datasets can be combined into a dataset collection via the \n", ":py:class:`skyllh.core.dataset.DatasetCollection` class.\n", "\n", "A dataset collection is usually defined within one Python module providing the\n", "function ``create_dataset_collection``. For instance, the 10-year public \n", "point-source data sample is defined in the \n", ":py:mod:`skyllh.datasets.i3.PublicData_10y_ps` module, and its dataset \n", "collection can be created via the \n", ":py:func:`~skyllh.datasets.i3.PublicData_10y_ps.create_dataset_collection`\n", "function. This function requires a configuration. If no data repository base \n", "path is set in the configuration, that base path needs to be passed to the \n", "function as well." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from skyllh.core.config import (\n", " Config,\n", ")\n", "from skyllh.core.dataset import (\n", " Dataset,\n", " DatasetCollection,\n", ")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Create configuration instance.\n", "cfg = Config()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create individual dataset.\n", "my_dataset = Dataset(\n", " cfg=cfg,\n", " name='My Dataset',\n", " exp_pathfilenames='exp.npy',\n", " mc_pathfilenames='mc.npy',\n", " livetime=365,\n", " version=1,\n", " verqualifiers={'patch': 0},\n", " default_sub_path_fmt='my_dataset_v{version:03d}_p{patch:02d}',\n", " base_path='/data/ana/analyses/',\n", ")\n", "\n", "# Create collection of individual datasets.\n", "dsc = DatasetCollection(\n", " name='My Dataset Collection',\n", " description='This is my dataset collection containing all my individual '\n", " 'datasets.')\n", "dsc.add_datasets((my_dataset,))" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "We can print the dataset collection, which will list all the individual datasets\n", "of this collection." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DatasetCollection \"My Dataset Collection\"\n", "--------------------------------------------------------------------------------\n", "Description:\n", "This is my dataset collection containing all my individual datasets.\n", "Available datasets:\n", "\n", " Dataset \"My Dataset\": v001patch00\n", " { livetime = 365.000 days }\n", " Experimental data:\n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/exp.npy\n", " MC data:\n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/mc.npy\n", " \n" ] } ], "source": [ "print(dsc)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Individual datasets of the dataset collection can be retrieved via the\n", ":py:meth:`~skyllh.core.dataset.DatasetCollection.get_dataset` method:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset \"My Dataset\": v001patch00\n", " { livetime = 365.000 days }\n", " Experimental data:\n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/exp.npy\n", " MC data:\n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/mc.npy\n", " \n" ] } ], "source": [ "my_dataset = dsc.get_dataset('My Dataset')\n", "print(my_dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Auxiliary data files" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "If a dataset requires auxiliary data files, such files can be defined via the\n", ":py:meth:`~skyllh.core.dataset.Dataset.add_aux_data_definition` method:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "my_dataset.add_aux_data_definition('aux_file_key_1', 'aux_data/aux_file1.dat')" ] }, { "cell_type": "code", 
"execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset \"My Dataset\": v001patch00\n", " { livetime = 365.000 days }\n", " Experimental data:\n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/exp.npy\n", " MC data:\n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/mc.npy\n", " Auxiliary data:\n", " aux_file_key_1: \n", " [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/aux_data/aux_file1.dat\n" ] } ], "source": [ "print(my_dataset)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "If the auxiliary data is not present as a file but as actual Python data, such\n", "data can be added via the :py:meth:`~skyllh.core.dataset.Dataset.add_aux_data`\n", "method: " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "my_dataset.add_aux_data('aux_data_1', [1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Origin" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "An individual dataset can have an origin, which specifies where the\n", "dataset can be downloaded automatically. SkyLLH provides the\n", ":py:class:`skyllh.core.dataset.DatasetOrigin` class to define such an origin.\n", "\n", "The origin consists of a host (possibly also a port), a base path and a sub path\n", "at the origin, and a transfer function which will be used to perform the actual\n", "data transfer.\n", "\n", "SkyLLH provides two dataset transfer methods, ``wget`` and ``rsync``. 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from skyllh.core.dataset import (\n", " DatasetOrigin,\n", " WGETDatasetTransfer,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "origin = DatasetOrigin(\n", " host='data.mydomain.com',\n", " base_path='/downloads/data',\n", " sub_path='my_dataset',\n", " transfer_func=WGETDatasetTransfer(protocol='https').transfer)\n", "my_dataset.origin = origin" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "In the example above we specified that the dataset is available at the URL\n", "``data.mydomain.com/downloads/data/my_dataset``, which can be transferred\n", "using ``wget`` via the https protocol.\n", "\n", "Hence, the experimental and Monte Carlo files ``exp.npy`` and ``mc.npy`` \n", "of the dataset must be available at \n", "``https://data.mydomain.com/downloads/data/my_dataset/exp.npy`` and\n", "``https://data.mydomain.com/downloads/data/my_dataset/mc.npy``, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Origin as archive file" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The dataset might be available as an archive file, e.g. a zip file on a \n", "webserver. 
In such cases the ``filename`` argument of the \n", ":py:class:`~skyllh.core.dataset.DatasetOrigin` class constructor can be used in\n", "combination with a post transfer function specified via the \n", "``post_transfer_func`` argument of the constructor:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "origin = DatasetOrigin(\n", " host='data.mydomain.com',\n", " base_path='/downloads/data',\n", " sub_path='',\n", " filename='my_dataset.zip',\n", " transfer_func=WGETDatasetTransfer(protocol='https').transfer,\n", " post_transfer_func=WGETDatasetTransfer.post_transfer_unzip)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The example above will transfer the single archive file \n", "``https://data.mydomain.com/downloads/data/my_dataset.zip`` and unzip the file\n", "on the local host." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading the dataset" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "If an origin is defined for an individual dataset, that dataset can be \n", "downloaded automatically using the \n", ":py:meth:`skyllh.core.dataset.Dataset.make_data_available` method of the\n", ":py:class:`~skyllh.core.dataset.Dataset` class." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }