{ "info": { "author": "Matthew Burke", "author_email": "matthew.wesley.burke@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# XGBoost Python Deployment\n\nThis package allows users to take their XGBoost models they developed in python, and package them in a way that they can deploy the model in production using only pure python.\n\n## Installation\n\nYou can install this from pip using `pip install xgboost-deploy-python`. The current version is 0.0.2.\n\nIt was tested on python 3.6 and and XGBoost 0.8.1.\n\n\n## Deployment Process\n\nThe typical way of moving models to production is by using the `pickle` library to export the model file, and then loading it into the container or server that will make the predictions using `pickle` again. Both the training and deployment environments require the same package and versions of those packages instsalled for this process to work correctly.\n\nThis package follows a similar process for deployment, except that it relies only on a single JSON file, and the pure python code that makes predictions, removing the dependencies of `XGBoost` and `pickle` in deployment.\n\nBelow we will step through how to use this package to take an existing model and generate the file(s) needed for deploying your model. For a more detailed walk-through, please see `example.py`, which you can run locally to create your own files and view some potential steps for validating the consistency of your trained model vs the new production version.\n\n### `fmap` Creation\n\nIn this step we use the `fmap` submodule to create a feature map file that the module uses to accurately generate the JSON files.\n\nA `fmap` file format contains one line for each feature according to the following format: ` `\n\nSome notes about the full file format:\n\n* `line_number` values start at zero and increase incrementally.\n* `feature_name` values **cannot** contain spaces in them or the file will truncate the feature names after the first space in the JSON model file\n* `feature_type` options are as follows:\n - `q` for quantitative (or cintuous) variables\n - `int` for integer valued variables\n - `i` for binary variables, please note that this field does **not** allow null values and always expects either a 0 or a 1. Your predictor function may error and fail.\n\nHere is an `fmap` file generated using the `example.py` file:\n\n```\n0 mean_radius q\n1 mean_texture q\n2 mean_perimeter q\n3 mean_area q\n4 mean_smoothness q\n5 mean_compactness q\n6 mean_concavity q\n7 mean_concave_points q\n8 mean_symmetry q\n9 mean_fractal_dimension q\n10 radius_error q\n11 texture_error q\n12 perimeter_error q\n13 area_error q\n14 smoothness_error q\n15 compactness_error q\n16 concavity_error q\n17 concave_points_error q\n18 symmetry_error q\n19 fractal_dimension_error q\n20 worst_radius int\n21 worst_texture int\n22 worst_perimeter int\n23 worst_area int\n24 worst_smoothness i\n25 worst_compactness i\n26 worst_concavity i\n27 worst_concave_points q\n28 worst_symmetry q\n29 worst_fractal_dimension q\n30 target i\n```\n\nYou can either generate your `fmap` file using lists of feature names and types using the `generate_fmap` function or automatically from a `pandas` DataFrame using the `generate_fmap_from_pandas` which extracts column names and infers feature types.\n\n### JSON Model Creation\n\nThe `XGBoost` package already contains a method to generate text representations of trained models in either text or JSON formats.\n\nOnce you have the `fmap` file created successfully and your model trained, you can generate the JSON model file directly using the following command:\n\n```python\nmodel.dump_model(fout='xgb_model.json', fmap='fmap_pandas.txt', dump_format='json')\n```\n\nThis should generate a JSON file in your `fout` location specified that should have a list of JSON objects, each representing a tree in your model. Here's an example of a subset of one such file:\n\n```json\n[\n { \"nodeid\": 0, \"depth\": 0, \"split\": \"worst_perimeter\", \"split_condition\": 110, \"yes\": 1, \"no\": 2, \"missing\": 1, \"children\": [\n { \"nodeid\": 1, \"depth\": 1, \"split\": \"worst_concave_points\", \"split_condition\": 0.160299987, \"yes\": 3, \"no\": 4, \"missing\": 3, \"children\": [\n { \"nodeid\": 3, \"depth\": 2, \"split\": \"worst_concave_points\", \"split_condition\": 0.135049999, \"yes\": 7, \"no\": 8, \"missing\": 7, \"children\": [\n { \"nodeid\": 7, \"leaf\": 0.150075898 },\n { \"nodeid\": 8, \"leaf\": 0.0300908741 }\n ]},\n { \"nodeid\": 4, \"depth\": 2, \"split\": \"mean_texture\", \"split_condition\": 18.7449989, \"yes\": 9, \"no\": 10, \"missing\": 9, \"children\": [\n { \"nodeid\": 9, \"leaf\": -0.0510330871 },\n { \"nodeid\": 10, \"leaf\": -0.172740772 }\n ]}\n ]},\n { \"nodeid\": 2, \"depth\": 1, \"split\": \"worst_texture\", \"split_condition\": 20, \"yes\": 5, \"no\": 6, \"missing\": 5, \"children\": [\n { \"nodeid\": 5, \"depth\": 2, \"split\": \"mean_concave_points\", \"split_condition\": 0.0712649971, \"yes\": 11, \"no\": 12, \"missing\": 11, \"children\": [\n { \"nodeid\": 11, \"leaf\": 0.099997662 },\n { \"nodeid\": 12, \"leaf\": -0.142965034 }\n ]},\n { \"nodeid\": 6, \"depth\": 2, \"split\": \"mean_concave_points\", \"split_condition\": 0.0284200013, \"yes\": 13, \"no\": 14, \"missing\": 13, \"children\": [\n { \"nodeid\": 13, \"leaf\": -0.0510330871 },\n { \"nodeid\": 14, \"leaf\": -0.251898795 }\n ]}\n ]}\n ]},\n {...}\n]\n```\n\n### Model Predictions\n\nOnce the JSON file has been created, you need to perform three more things in order to make predictions:\n\n1. Store the base score value used in training your XGBoost model\n2. Note whether your problem is a classification or regression problem.\n - Right now, this package has only been tested for the `reg:linear` and `binary:logistic` objectives which represent regression and classification respectively.\n - When building a classification model, you **must** use the default base_score value of 0.5 (which ends up not adding an intercept bias to the results). If you use any other value, the production model will produce predictions that **do not match** the predictions from the original model.\n3. Load the JSON model file into a python list of dictionaries representing each model tree.\n\nOnce you have done that, you can create your production estimator:\n\n```python\nwith open('xgb_model.json', 'r') as f:\n model_data = json.load(f)\n\npred_type = 'regression'\nbase_score = 0.5 # default value\n\n\nestimator = ProdEstimator(model_data=model_data,\n pred_type=pred_type,\n base_score=base_score)\n```\n\nAfter that, all you have to do is format your input data into python dicts and pass them in individually to the estimator's `predict` function.\n\nIf you want more detailed info for validation purposes, there is a `get_leaf_values` function that can return either the leaf values or final nodes selected for each tree in the model for a given input.\n\n## Performance\n\nAs mentioned at the top, the initial test results indicate that for regression problems, there is a 100% match in predictions between original and production versions of the models up to a 0.001 error tolerance.\n\nAs for speed, for 10 trees, the prediction for input with 30 features is 0.00005 seconds with 0.000008 seconds standard deviation.\n\nObviously as the number of trees grows, the speed should decrease linearly, but it should be simple to modify this to add in parallelized tree predictions if that becomes an issue.\n\nIf you are really looking for optimized deployment tools, I would check out the following compiler for ensemble decision tree models: https://github.com/dmlc/treelite\n\n## Initial Test Results\n\nHere is the printed output for initial testing in the `example.py` file if you want to see it without running it yourself:\n\n```\nBenchmark regression modeling\n=============================\n\nActual vs Prod Estimator Comparison\n188 out of 188 predictions match\nMean difference between predictions: -1.270028868389972e-08\nStd dev of difference between predictions: 9.327025899795536e-09\n\nActual Estimator Evaluation Metrics\nAUROC Score 0.9780560216858138\nAccuracy Score 0.9521276595744681\nF1 Score 0.9649805447470817\n\nProd Estimator Evaluation Metrics:\nAUROC Score 0.9780560216858138\nAccuracy Score 0.9521276595744681\nF1 Score 0.9649805447470817\n\n\nTime Benchmarks for 1 records with 30 features using 10 trees\nAverage 3.938e-05 seconds with standard deviation 2.114e-06 per 1 predictions\n\n\nBenchmark classification modeling\n=================================\n\nActual vs Prod Estimator Comparison\n188 out of 188 predictions match\nMean difference between predictions: -1.7196643532927356e-08\nStd dev of difference between predictions: 2.7826259523143417e-08\n\nActual Estimator Evaluation Metrics\nAUROC Score 0.9777333161223698\nAccuracy Score 0.9468085106382979\nF1 Score 0.9609375\n\nProd Estimator Evaluation Metrics:\nAUROC Score 0.9777333161223698\nAccuracy Score 0.9468085106382979\nF1 Score 0.9609375\n\n\nTime Benchmarks for 1 records with 30 features using 10 trees\nAverage 3.812e-05 seconds with standard deviation 1.381e-06 per 1 predictions\n```\n\n## License\n\nLicensed under an MIT license.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/mwburke/xgboost-python-deploy", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "xgboost-deploy", "package_url": "https://pypi.org/project/xgboost-deploy/", "platform": "", "project_url": "https://pypi.org/project/xgboost-deploy/", "project_urls": { "Homepage": "https://github.com/mwburke/xgboost-python-deploy" }, "release_url": "https://pypi.org/project/xgboost-deploy/0.0.2/", "requires_dist": null, "requires_python": "", "summary": "Deploy XGBoost models in pure python", "version": "0.0.2" }, "last_serial": 4851201, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "743f3d23ca000b96ba4099d6e7bee1b2", "sha256": "0c90362512450440d8747d40d39f8c3d322350402a577deda0a5a3518acb8552" }, "downloads": -1, "filename": "xgboost_deploy-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "743f3d23ca000b96ba4099d6e7bee1b2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9095, "upload_time": "2019-02-01T00:27:23", "url": "https://files.pythonhosted.org/packages/2c/f3/acbc2d79f7a5542e12aa4ddc551aa14968b75d660126581c7c89bea72728/xgboost_deploy-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9564c0688ab31239648ef350626e27cf", "sha256": "2b0702c042b72cb848c193034cb10028b8fd1cdd2988fe7c646a4cf3f0815849" }, "downloads": -1, "filename": "xgboost-deploy-0.0.1.tar.gz", "has_sig": false, "md5_digest": "9564c0688ab31239648ef350626e27cf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9150, "upload_time": "2019-02-01T00:27:27", "url": "https://files.pythonhosted.org/packages/d7/cc/7489aabf4b9ac1e424341ba63b0bac3dd7991bb8e55eb3ac435b09ebe5ba/xgboost-deploy-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "9f6108eebda87f23114c4753a10ae684", "sha256": "9e004877158ac3bf8b0b77cf083ae744008882166e699a7b03da7bc36dfc28a1" }, "downloads": -1, "filename": "xgboost_deploy-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "9f6108eebda87f23114c4753a10ae684", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9472, "upload_time": "2019-02-21T18:00:49", "url": "https://files.pythonhosted.org/packages/ea/85/eb5adead854d783b7aa90b4628fe100f4eaee45bc242afa0b2e3c3d6dc7a/xgboost_deploy-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c37e61bfd2381f1c08feec0168f02743", "sha256": "d76da0b88397e5b4e9c94045ce5c6ae9345c9e1b425a6521355b47835f995450" }, "downloads": -1, "filename": "xgboost-deploy-0.0.2.tar.gz", "has_sig": false, "md5_digest": "c37e61bfd2381f1c08feec0168f02743", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9103, "upload_time": "2019-02-21T18:00:52", "url": "https://files.pythonhosted.org/packages/bb/9a/bd15bbf4be53e6ce4cb0373dfdfc20f0b8c4ae9e95f32a59b25012279967/xgboost-deploy-0.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9f6108eebda87f23114c4753a10ae684", "sha256": "9e004877158ac3bf8b0b77cf083ae744008882166e699a7b03da7bc36dfc28a1" }, "downloads": -1, "filename": "xgboost_deploy-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "9f6108eebda87f23114c4753a10ae684", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9472, "upload_time": "2019-02-21T18:00:49", "url": "https://files.pythonhosted.org/packages/ea/85/eb5adead854d783b7aa90b4628fe100f4eaee45bc242afa0b2e3c3d6dc7a/xgboost_deploy-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c37e61bfd2381f1c08feec0168f02743", "sha256": "d76da0b88397e5b4e9c94045ce5c6ae9345c9e1b425a6521355b47835f995450" }, "downloads": -1, "filename": "xgboost-deploy-0.0.2.tar.gz", "has_sig": false, "md5_digest": "c37e61bfd2381f1c08feec0168f02743", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9103, "upload_time": "2019-02-21T18:00:52", "url": "https://files.pythonhosted.org/packages/bb/9a/bd15bbf4be53e6ce4cb0373dfdfc20f0b8c4ae9e95f32a59b25012279967/xgboost-deploy-0.0.2.tar.gz" } ] }