{ "info": { "author": "Maksym Surzhynskyi", "author_email": "maksym.surzhynskyi@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "\n# What is it?\n\n`Spltr` is a simple PyTorch-based data loader and splitter.\nIt may be used to load i) arrays, ii) matrices, iii) Pandas \nDataFrames and iv) CSV files containing numerical data, and\nthen to split the data into train, test (validation) subsets in\nthe form of PyTorch DataLoader objects. Special emphasis was \ngiven to ease of use and automation of many data-handling procedures.\n\nIt was originally developed to speed up the data preparation stage\nfor a number of trivial ML tasks. Hopefully it will be useful for you as well.\n\n# Main Features\n\n`Spltr.process_x|y` : converts loaded Input/Target data into PyTorch Tensors \n with the ability to i) preview, ii) define the tensor dtype, iii) set the desired \n device of the returned tensor (CPU/GPU), iv) use selected rows/columns from \n Input/Target data sources or process a single data table (for CSV and \n Pandas DataFrame only).\n\n`Spltr.reshape_xy` : reshapes subsets.\n\n`Spltr.split_data` : splits data subsets into train, test (validation) \n PyTorch DataLoader objects.\n\n`Spltr.clean_space` : optimizes memory by deleting unnecessary variables.\n\n# Installation\n\n```shell\npip install spltr\n```\n\n# License\n\nOSI Approved :: MIT License\n\n# Documentation\n\nhttps://github.com/maksymsur/Spltr\n\n# Dependencies\n\n+ torch >= 1.1.0\n\n+ numpy >= 1.16.4\n\n+ pandas >= 0.24.2\n\n# Example of usage\n\nBelow we'll build a simple neural network and describe how `Spltr` may be used in the process.\n\nLoading and reading the Iris dataset to be used as an example. 
The dataset may be found at: https://github.com/maksymsur/Spltr/blob/master/dataset/iris_num.csv\n\n\n```python\nimport pandas as pd\n\ndb = pd.read_csv('/your_path_here/iris_num.csv')\nprint(db.info())\n```\n\n \n RangeIndex: 150 entries, 0 to 149\n Data columns (total 5 columns):\n sepal.length 150 non-null float64\n sepal.width 150 non-null float64\n petal.length 150 non-null float64\n petal.width 150 non-null float64\n variety 150 non-null int64\n dtypes: float64(4), int64(1)\n memory usage: 5.9 KB\n None\n\n\nThe network we build will try to predict the 'variety' column. This column identifies the Iris species: Setosa = 0, Versicolor = 1 and Virginica = 2. Let's verify that it contains only these 3 unique values.\n\n\n```python\nprint(db.variety.unique())\n```\n\n [0 1 2]\n\n\nMaking the **`X`** (Input) dataset by excluding the 'variety' column.\n\n\n```python\nbase_db = db.iloc[:,:-1]\nprint(base_db.head())\n```\n\n sepal.length sepal.width petal.length petal.width\n 0 5.1 3.5 1.4 0.2\n 1 4.9 3.0 1.4 0.2\n 2 4.7 3.2 1.3 0.2\n 3 4.6 3.1 1.5 0.2\n 4 5.0 3.6 1.4 0.2\n\n\nMaking the **`y`** (Target) dataset, which includes i) the 'variety' column and ii) another column: 'petal.width'. The latter will be excluded by `Spltr` later, just to demonstrate how easily loaded datasets can be modified.\n\n\n```python\ntarget_db = db.iloc[:,-2:]\nprint(target_db.head())\n```\n\n petal.width variety\n 0 0.2 0\n 1 0.2 0\n 2 0.2 0\n 3 0.2 0\n 4 0.2 0\n\n\nLoading the necessary packages and initializing `Spltr` with the **`X,y`** datasets.\n\nNote that for `Spltr` to work properly, `torch`, `numpy` and `pandas` must be installed and loaded.\n\n\n```python\nimport torch\nimport numpy as np\nfrom spltr import Spltr\n\nsplt = Spltr(base_db,target_db)\n```\n\nPreprocessing **`X`** (Input data) by converting it into PyTorch Tensors. Also, by passing `nrows={0:150}` we show how easily rows may be selected for reading. 
We won't apply any data normalization procedures, just to keep this example simple.\n\n\n```python\nsplt.process_x(preview=True, nrows={0:150})\n```\n\n [X]: The following DataFrame was loaded:\n\n sepal.length sepal.width petal.length petal.width\n 0 5.1 3.5 1.4 0.2\n 1 4.9 3.0 1.4 0.2\n\n [X]: Tensor of shape torch.Size([150, 4]) was created\n\n\nPreprocessing **`y`** (Target dataset) by converting it into PyTorch Tensors as well. \n\nAt this stage we also indicate that only the 2nd column of the `y` dataset will be used. By specifying `usecols = 1` (the index of 'variety') we exclude the 'petal.width' column, which has index 0. Note that `usecols` also accepts strings, so `usecols = 'variety'` works as well.\n\n\n```python\nsplt.process_y(preview=True, usecols=1)\n```\n\n [y]: The following DataFrame was loaded:\n\n 0 0\n 1 0\n Name: variety, dtype: int64\n\n [y]: Tensor of shape torch.Size([150]) was created\n\n\nSplitting the dataset into train (30%) and test (20%), which leaves 50% for the validation set (calculated automatically: val = 100% - train - test), and permuting the data using `perm_seed=8`.\n\n\n```python\nsplt.split_data(splits=(0.3,0.2), perm_seed=8)\n```\n\n [X,y]: The Data is splited into 3 datasets of length: Train 45, Test 30, Validation 75.\n DataLoaders with Batch Size 1 and Shufle False are created\n\n\nNow let's delete unnecessary variables held in memory. This step may be especially useful if you are dealing with huge datasets and don't want the `X,y tensors` to occupy memory alongside the `X,y DataLoader objects`.\n\n\n```python\nsplt.clean_space()\n```\n\n All variables are deleted. Only Train-Test(Validation) data is left\n\n\nSetting up a very simple neural network. Please note that the network architecture is chosen only to demonstrate how `Spltr` may be used. 
It is not an optimal way to solve the Iris classification problem.\n\n\n```python\nfrom torch import nn, optim\nimport torch.nn.functional as F\n\nclass Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n\n        self.input = nn.Linear(4,8)\n        self.output = nn.Linear(8,3)\n\n    def forward(self,x):\n\n        x = F.softsign(self.input(x))\n        x = self.output(x)\n        return x\n\nmodel = Net()\ncriterion = nn.CrossEntropyLoss(reduction='none')\noptimizer = optim.RMSprop(model.parameters())\n\nfor n in range(10):\n    train_loss = 0.0\n\n    # Training on the train subset\n\n    for train_data, train_target in splt.xy_train:\n        model.zero_grad()\n        train_result = model(train_data.float())\n        loss = criterion(train_result, train_target)\n        train_loss += loss.item()\n        loss.backward()\n        optimizer.step()\n\n    test_correct = 0\n    with torch.no_grad():\n\n        # Evaluating on the test subset\n\n        for test_data, test_target in splt.xy_test:\n            test_result = model(test_data.float())\n            _, predicted = torch.max(test_result, 1)\n            test_correct += (predicted == test_target).sum().item()\n\n    print(f'Epoch {n+1}. Train loss: ', round(train_loss/len(splt.xy_train), 5))\n    print(f'Testing: Correctly identified samples {test_correct} out of {len(splt.xy_test)}', end='\\n\\n')\n\nval_correct = 0\n\nwith torch.no_grad():\n\n    # Evaluating on the validation subset\n\n    for val_data, val_target in splt.xy_val:\n        val_result = model(val_data.float())\n        _, val_predicted = torch.max(val_result, 1)\n        val_correct += (val_predicted == val_target).sum().item()\n\nprint(f'VALIDATION: Correctly identified samples {val_correct} out of {len(splt.xy_val)}', end='\\n\\n')\n```\n\n Epoch 1. Train loss: 1.01279\n Testing: Correctly identified samples 14 out of 30\n\n Epoch 2. Train loss: 0.57477\n Testing: Correctly identified samples 14 out of 30\n\n Epoch 3. Train loss: 0.48212\n Testing: Correctly identified samples 14 out of 30\n\n Epoch 4. Train loss: 0.44074\n Testing: Correctly identified samples 16 out of 30\n\n Epoch 5. 
Train loss: 0.41144\n Testing: Correctly identified samples 24 out of 30\n\n Epoch 6. Train loss: 0.38251\n Testing: Correctly identified samples 29 out of 30\n\n Epoch 7. Train loss: 0.34853\n Testing: Correctly identified samples 29 out of 30\n\n Epoch 8. Train loss: 0.31831\n Testing: Correctly identified samples 29 out of 30\n\n Epoch 9. Train loss: 0.29182\n Testing: Correctly identified samples 29 out of 30\n\n Epoch 10. Train loss: 0.27387\n Testing: Correctly identified samples 30 out of 30\n\n VALIDATION: Correctly identified samples 74 out of 75\n\n\n\nAs you may have noticed, we obtained fairly good results even though very few samples (45) were used to train the network.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/maksymsur/Spltr", "keywords": "PyTorch,Data loader,Data splitter,DataLoader,random_split,train test,train test validation split", "license": "", "maintainer": "", "maintainer_email": "", "name": "spltr", "package_url": "https://pypi.org/project/spltr/", "platform": "", "project_url": "https://pypi.org/project/spltr/", "project_urls": { "Homepage": "https://github.com/maksymsur/Spltr" }, "release_url": "https://pypi.org/project/spltr/0.3.1/", "requires_dist": null, "requires_python": ">=3.6", "summary": "A simple PyTorch-based data loader and splitter.", "version": "0.3.1" }, "last_serial": 5566191, "releases": { "0.3": [ { "comment_text": "", "digests": { "md5": "696f939cd967a87d8bd6f50fe2306248", "sha256": "e1f9ea54fe28156e82a690132fcff61f1ccd165f8fc7e615b7c32220a7af9e07" }, "downloads": -1, "filename": "spltr-0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "696f939cd967a87d8bd6f50fe2306248", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 7784, "upload_time": "2019-07-20T12:52:30", "url": 
"https://files.pythonhosted.org/packages/39/da/a6c168353228bcc20e5cca9a06c1256a558d673c96fec01decd65df5b25e/spltr-0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ca8432b0f9d01b308013b9a5d06db889", "sha256": "09f9acca98f92794d6af42eee96ae231b2427af97c20395b28ea96ef6382bec6" }, "downloads": -1, "filename": "spltr-0.3.tar.gz", "has_sig": false, "md5_digest": "ca8432b0f9d01b308013b9a5d06db889", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 10006, "upload_time": "2019-07-20T12:52:32", "url": "https://files.pythonhosted.org/packages/fc/5d/3c05f642acb5f14e3ef9dd0ccf7e470b9ffc683ad8e6f8e1524555f53830/spltr-0.3.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "36bf2207a1bd846ab18caa2e62588bc3", "sha256": "2aa734141df19dea4bad9f75e07f9e3ebdddece0d17a726da80ea28e6c8a838d" }, "downloads": -1, "filename": "spltr-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "36bf2207a1bd846ab18caa2e62588bc3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 8343, "upload_time": "2019-07-22T08:20:13", "url": "https://files.pythonhosted.org/packages/fd/4c/5d5fe24ee814971f1d40a1a25104139723d547481c8570fb33a21e3684fa/spltr-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1df1075582adfcd96247a4e23434ee6d", "sha256": "91381e392c6b6bccfd0cdac8bead217e451d0bb54b0b573b14982300fd6a0397" }, "downloads": -1, "filename": "spltr-0.3.1.tar.gz", "has_sig": false, "md5_digest": "1df1075582adfcd96247a4e23434ee6d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 11081, "upload_time": "2019-07-22T08:20:15", "url": "https://files.pythonhosted.org/packages/db/fe/d52dab2ecf843b75c6544a794817dac67d3f91f4f221ce949741be614a94/spltr-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "36bf2207a1bd846ab18caa2e62588bc3", "sha256": "2aa734141df19dea4bad9f75e07f9e3ebdddece0d17a726da80ea28e6c8a838d" }, "downloads": 
-1, "filename": "spltr-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "36bf2207a1bd846ab18caa2e62588bc3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 8343, "upload_time": "2019-07-22T08:20:13", "url": "https://files.pythonhosted.org/packages/fd/4c/5d5fe24ee814971f1d40a1a25104139723d547481c8570fb33a21e3684fa/spltr-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1df1075582adfcd96247a4e23434ee6d", "sha256": "91381e392c6b6bccfd0cdac8bead217e451d0bb54b0b573b14982300fd6a0397" }, "downloads": -1, "filename": "spltr-0.3.1.tar.gz", "has_sig": false, "md5_digest": "1df1075582adfcd96247a4e23434ee6d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 11081, "upload_time": "2019-07-22T08:20:15", "url": "https://files.pythonhosted.org/packages/db/fe/d52dab2ecf843b75c6544a794817dac67d3f91f4f221ce949741be614a94/spltr-0.3.1.tar.gz" } ] }