{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset Definition"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Each analysis requires a dataset definition that defines a particular dataset.\n",
    "In practice usually a data sample exists, which is a collection of individual\n",
    "datasets. For example the public 10-year IceCube point-source data sample is a\n",
    "collection of individual datasets, one for each partial IceCube detector\n",
    "configuration.\n",
    "\n",
    "SkyLLh provides the :py:class:`skyllh.core.dataset.Dataset` class to create an\n",
    "individual dataset definition. Such a definition defines the experimental and\n",
    "monte-carlo data files and possibly additional information like data binning\n",
    "definitions or auxilary data files.\n",
    "\n",
    "Individual datasets can be combined into a dataset collection via the \n",
    ":py:class:`skyllh.core.dataset.DatasetCollection` class.\n",
    "\n",
    "A dataset collection is usually defined within one Python module providing the\n",
    "function ``create_dataset_collection``. For instance the 10-year public \n",
    "point-source data sample is defined in the \n",
    ":py:mod:`skyllh.datasets.i3.PublicData_10y_ps` module, and the its dataset \n",
    "collection can be created via the \n",
    ":py:func:`~skyllh.datasets.i3.PublicData_10y_ps.create_dataset_collection`\n",
    "function. This function requires a configuration. If no data repository base \n",
    "path is set in the configuration, that base path needs to be passed to the \n",
    "function as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skyllh.core.config import (\n",
    "    Config,\n",
    ")\n",
    "from skyllh.core.dataset import (\n",
    "    Dataset,\n",
    "    DatasetCollection,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create configuration instance.\n",
    "cfg = Config()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<skyllh.core.dataset.DatasetCollection at 0x7fa0a37834f0>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create individual dataset.\n",
    "my_dataset = Dataset(\n",
    "    cfg=cfg,\n",
    "    name='My Dataset',\n",
    "    exp_pathfilenames='exp.npy',\n",
    "    mc_pathfilenames='mc.npy',\n",
    "    livetime=365,\n",
    "    version=1,\n",
    "    verqualifiers={'patch': 0},\n",
    "    default_sub_path_fmt='my_dataset_v{version:03d}_p{patch:02d}',\n",
    "    base_path='/data/ana/analyses/',\n",
    ")\n",
    "\n",
    "# Create collection of individual datasets.\n",
    "dsc = DatasetCollection(\n",
    "    name='My Dataset Collection',\n",
    "    description='This is my dataset collection containing all my individual '\n",
    "        'datasets.')\n",
    "dsc.add_datasets((my_dataset,))"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "We can print the dataset collection, which will list all the individual datasets\n",
    "of this collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetCollection \"My Dataset Collection\"\n",
      "--------------------------------------------------------------------------------\n",
      "Description:\n",
      "This is my dataset collection containing all my individual datasets.\n",
      "Available datasets:\n",
      "\n",
      "  Dataset \"My Dataset\": v001patch00\n",
      "      { livetime = 365.000 days }\n",
      "      Experimental data:\n",
      "          [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/exp.npy\n",
      "      MC data:\n",
      "          [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/mc.npy\n",
      "      \n"
     ]
    }
   ],
   "source": [
    "print(dsc)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Individual datasets of the dataset collection can be retrieved via the\n",
    ":py:meth:`~skyllh.core.dataset.DatasetCollection.get_dataset` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset \"My Dataset\": v001patch00\n",
      "    { livetime = 365.000 days }\n",
      "    Experimental data:\n",
      "        [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/exp.npy\n",
      "    MC data:\n",
      "        [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/mc.npy\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "my_dataset = dsc.get_dataset('My Dataset')\n",
    "print(my_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Auxiliary data files"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "If a dataset requires auxiliary data files, such files can be defined via the\n",
    ":py:meth:`~skyllh.core.dataset.Dataset.add_aux_data_definition` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_dataset.add_aux_data_definition('aux_file_key_1', 'aux_data/aux_file1.dat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset \"My Dataset\": v001patch00\n",
      "    { livetime = 365.000 days }\n",
      "    Experimental data:\n",
      "        [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/exp.npy\n",
      "    MC data:\n",
      "        [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/mc.npy\n",
      "    Auxiliary data:\n",
      "        aux_file_key_1:    \n",
      "            [\u001b[92mFOUND\u001b[0m] /data/ana/analyses/my_dataset_v001_p00/aux_data/aux_file1.dat\n"
     ]
    }
   ],
   "source": [
    "print(my_dataset)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "If the auxiliary data is not present as a file but as actual Python data, such\n",
    "data can be added via the :py:meth:`~skyllh.core.dataset.Dataset.add_aux_data`\n",
    "method: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_dataset.add_aux_data('aux_data_1', [1, 2, 3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset Origin"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "An individual dataset can have an origin, which specifies where the\n",
    "dataset can be downloaded automatically. SkyLLH provides the\n",
    ":py:class:`skyllh.core.dataset.DatasetOrigin` class to define such an origin.\n",
    "\n",
    "The origin consists of a host (possibly also a port), a base path and a sub path\n",
    "at the origin, and a transfer function which will be used to perform the actual\n",
    "data transfer.\n",
    "\n",
    "SkyLLH provides two dataset transfer methods, ``wget`` and ``rsync``. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skyllh.core.dataset import (\n",
    "    DatasetOrigin,\n",
    "    WGETDatasetTransfer,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "origin = DatasetOrigin(\n",
    "    host='data.mydomain.com',\n",
    "    base_path='/downloads/data',\n",
    "    sub_path='my_dataset',\n",
    "    transfer_func=WGETDatasetTransfer(protocol='https').transfer)\n",
    "my_dataset.origin = origin"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "In the example above we specified that the dataset is available at the URL\n",
    "``data.mydomain.com/downloads/data/my_dataset``, which can be transfered\n",
    "using ``wget`` via the https protocol.\n",
    "\n",
    "Hence, the experimental and monte-carlo files ``exp.npy`` and ``mc.npy`` \n",
    "of the dataset must be available at \n",
    "``https://data.mydomain.com/downloads/data/my_dataset/exp.npy`` and\n",
    "``https://data.mydomain.com/downloads/data/my_dataset/mc.npy``, respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Origin as archive file"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The dataset might be available as an archive file, e.g. a zip file on a \n",
    "webserver. In such cases the ``filename`` argument of the \n",
    ":py:class:`~skyllh.core.dataset.DatasetOrigin` class constructor can be used in\n",
    "combination with a post transfer function specified via the \n",
    "``post_transfer_func`` argument of the constructor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "origin = DatasetOrigin(\n",
    "    host='data.mydomain.com',\n",
    "    base_path='/downloads/data',\n",
    "    sub_path='',\n",
    "    filename='my_dataset.zip',\n",
    "    transfer_func=WGETDatasetTransfer(protocol='https').transfer,\n",
    "    post_transfer_func=WGETDatasetTransfer.post_transfer_unzip)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The example above will transfer the single archive file \n",
    "``https://data.mydomain.com/downloads/data/my_dataset.zip`` and unzip the file\n",
    "on the local host."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading the dataset"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "If an origin is defined for an individual dataset, that dataset can be \n",
    "downloaded automatically using the \n",
    ":py:meth:`skyllh.core.dataset.Dataset.make_data_available` method of the\n",
    ":py:class:`~skyllh.core.dataset.Dataset` class."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}