{ "info": { "author": "H4dr1en", "author_email": "h4dr1en@pm.me", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6" ], "description": "# learning-curves\n\nLearning-curves is a Python module that extends [sklearn's learning curve feature](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html). It will help you visualize the learning curve of your models.\n\nLearning curves give an opportunity to diagnose bias and variance in supervised learning models, but also to visualize how training set size influences the performance of the models (more information [here](https://www.dataquest.io/blog/learning-curves-machine-learning/)).\n\nSuch plots help you answer the following questions:\n - Do I have enough data?\n - What would be the best accuracy I would have if I had more data?\n - Can I train my model with less data?\n - Is my training set biased?\n\nLearning-curves will also help you fit the learning curve to extrapolate and find the saturation value of the curve.\n\n### Installation\n\n```\n$ pip install learning-curves\n```\n\nTo create learning curve plots, first import the module with `import learning_curves`.\n\n### Usage\n\nIt is as simple as:\n\n```\nlc = LearningCurve()\nlc.get_lc(estimator, X, Y)\n```\nWhere `estimator` implements `fit(X,Y)` and `predict(X)`.\n\nOutput:\n\n![alt text](https://github.com/H4dr1en/learning-curves/blob/master/images/learning_curve_no_fit.png)\n\nIn this example the green curve suggests that adding more data to the training set is likely to slightly improve the model accuracy. The green curve also shows a saturation near 0.96. 
We can easily fit a function to this curve:\n\n```\nlc.plot(predictor=\"best\")\n```\nOutput:\n\n![alt text](https://github.com/H4dr1en/learning-curves/blob/master/images/learning_curve_simple.png)\n\nHere we used a predefined function, `pow`, to fit the green curve. The R2 score is very close to 1, meaning that the fit is very good. We can therefore use this curve to extrapolate the evolution of the accuracy with the training set size.\n\nThis also tells us how much data we should use to train our model to maximize performance and accuracy.\n\n### Add custom functions to fit the learning curve\nSuch functions are called `Predictor`s. You can create a `Predictor` like this:\n```\npredictor = Predictor(\"myPredictor\", lambda x,a,b : a*x + b, [1,0])\n```\nHere we created a `Predictor` called \"myPredictor\" with the function `y(x) = a*x + b`.\nBecause SciPy's `optimize.curve_fit` is called internally, a first guess of the parameters `a` and `b` is required. Here we gave them the values 1 and 0, respectively.\nYou can then add the `Predictor` to the `LearningCurve` object in two different ways:\n- Pass the `Predictor` to the `LearningCurve` constructor:\n```\nlc = LearningCurve([predictor])\n```\n- Register the `Predictor` inside the predictors of the `LearningCurve` object:\n```\nlc.predictors.append(predictor)\n```\n\nBy default, 4 `Predictor`s are instantiated:\n```\nself.predictors = [\n Predictor(\"pow\", lambda x, a, b, c, d : a - (b*x+d)**c, [1, 1.7, -.5, 1e-3]),\n Predictor(\"pow_log\", lambda x, a, b, c, m, n : a - b*x**c + m*np.log(x**n), [1, 1.7, -.5, 1e-3, 1e-3], True),\n Predictor(\"pow_log_2\", lambda x, a, b, c : a / (1 + (x/np.exp(b))**c), [1, 1.7, -.5]),\n Predictor(\"inv_log\", lambda x, a, b : a - b/np.log(x), [1, 1.6])\n]\n```\nSome predictors perform better (R2 score is closer to 1) than others, depending on the dataset, the model and the value to be predicted. 
\n\n### Find the best Predictor\n\nTo find the `Predictor` that best fits your learning curve, call the `get_predictor` function:\n```\nlc.get_predictor(\"best\")\n```\nOutput:\n```\n(pow [params:[ 0.9588563 11.74747659 -0.36232639 -236.46115903]][score:0.9997458683912492])\n```\n\n### Plot the Predictors\n\nYou can plot the fitted function of any `Predictor` with the `plot` function:\n```\nlc.plot(predictor=\"all\")\n```\nOutput:\n\n![alt text](https://github.com/H4dr1en/learning-curves/blob/master/images/learning_curve_all.png)\n\n### Save and load LearningCurve instances\n\nBecause `Predictor` contains lambda functions, you cannot simply save a `LearningCurve` instance. One possibility is to save only the data points of the curve inside `lc.recorder[\"data\"]` and retrieve them later on. But then the custom predictors are not saved. Therefore it is recommended to use the `save` and `load` methods:\n```\nlc.save(\"path/to/save.pkl\")\nlc = LearningCurve.load(\"path/to/save.pkl\")\n```\nThis internally uses the `dill` library to save the `LearningCurve` instance with all the `Predictor`s.\n\n### Find the best training set size\n\n`learning-curves` will help you find the best training set size by extrapolating the best fitted curve:\n```\nlc.plot(predictor=\"all\", saturation=\"best\")\n```\nOutput:\n\n![alt text](https://github.com/H4dr1en/learning-curves/blob/master/images/learning_curve_fit_sat_all.png)\n\nThe horizontal red line shows the saturation of the curve. The intersection of the two blue lines shows the best accuracy we can get, given a certain `threshold` (see below).\n\nTo retrieve the value of the best training set size:\n```\nlc.threshold(predictor=\"best\", saturation=\"best\")\n```\nOutput:\n```\n(0.9589, 31668, 0.9493)\n```\nThis tells us that the saturation value (the maximum accuracy we can get from this model without changing any other parameter) is `0.9589`. This value corresponds to an infinite number of samples in our training set! 
But with a threshold of `0.99` (this parameter can be changed with `threshold=x`), we can reach an accuracy of `0.9493` (99% of the saturation value) if our training set contains `31668` samples.\n\nNote: The saturation value is always the _second argument_ of the function, right after `x` (it is called `a` in the predefined `Predictor`s). Therefore, if you create your own `Predictor`, place the saturation parameter in that position. If the function of your custom `Predictor` is diverging, then no saturation value can be retrieved. In that case, pass `diverging=True` to the constructor of the `Predictor`. The saturation value will then be calculated considering the `max_scaling` parameter of the `threshold_cust` function (see documentation for details). You should set this parameter to the maximum number of samples you can add to your training set.\n\n## Documentation\n\nSome functions have their `function_name_cust` equivalent. Calling the function without the `_cust` suffix will internally call the function with the `_cust` suffix with default parameters (such as the data points of the learning curves). Thanks to `kwargs`, you can pass exactly the same parameters to both functions.\n\n| Function/Class | Parameters | Type | Default | Description |\n|------------------------|-------------------|--------------------------------------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| Predictor.\\_\\_init\\_\\_ | | | | Instantiate a `Predictor` object. |\n| | name | str | _Required_ | Name of the `Predictor` |\n| | func | Lambda | _Required_ | Lambda function used for fitting of the learning curve |\n| | guess | List | _Required_ | Starting parameters used for fitting the curve |\n| | diverging | Bool | False | If the function is diverging, set diverging to True. 
If the function is converging, then the first fitted parameter of the function (the argument right after `x`) has to be the convergence value. |\n| LC.\\_\\_init\\_\\_ | | | | Instantiate a `LearningCurve` object. |\n| | predictors | List | empty | Predictors to add to the `LearningCurve` object |\n| | scoring | Callable | r2_score | Scoring function used to evaluate the fits of the learning curve |\n| LC.get_lc | | | | Compute and plot the learning curve |\n| | estimator | Object | _Required_ | Model (any object implementing `fit(X,Y)` and `predict(X)` methods) |\n| | X | array | _Required_ | X numpy array used for prediction |\n| | Y | array | _Required_ | Y numpy array used for prediction |\n| LC.train | | | | Compute the learning curve of an estimator over a dataset. Returns an object that can then be passed to plot_lc function |\n| | X | array | _Required_ | X numpy array used for prediction |\n| | Y | array | _Required_ | Y numpy array used for prediction |\n| | train_sizes | List | Predefined | List of training sizes used for calculating the learning curve. Can be a list of floats between 0 and 1 (assumed to be percentages) or a list of integers (assumed to be number of values) |\n| | test_size | int/float | 0.2 | Percentage / value of the test set size |\n| | n_splits | int | 3 | Number of splits used for cross validation |\n| | verbose | int | 1 | The higher, the more verbose |\n| | n_jobs | int | -1 | Number of workers. -1 sets to maximum possible. See sklearn. |\n| LC.get_predictor | | | | Get the first predictor with matching {name}. Returns None if no predictor matches. |\n| | pred | str, List(str), Predictor, List(Predictor) | _Required_ | Name of the predictor(s). Can be \"all\" or \"best\" or even Predictor(s). |\n| LC.fit_all | | | | Fit a curve with all the Predictors and retrieve score if y_pred is finite. Returns an array of predictors with the updated params and score. 
|\n| | x | Array | _Required_ | 1D array (list) representing the training sizes |\n| | y | Array | _Required_ | 1D array (list) representing the scores |\n| LC.fit_all_cust | | | | Same as `fit_all` |\n| | x,y | Array | _Required_ | See `fit_all` |\n| | predictors | List(Predictors) | _Required_ | The predictors to use for the fitting. |\n| LC.fit | | | | Fit a curve with a predictor and retrieve score (default: R2) if y_pred is finite. Returns the predictor with the updated params and score. |\n| | predictor | Predictor | _Required_ | The predictor to use for fitting the learning curve |\n| | x | Array | _Required_ | 1D array (list) representing the training sizes |\n| | y | Array | _Required_ | 1D array (list) representing the scores |\n| LC.threshold | | | | Find the training set size providing the highest accuracy up to a predefined threshold. P(x) = y and for x -> inf, y -> saturation value. This method approximates x_thresh such that P(x_thresh) = threshold * saturation value. Returns (saturation value, x_thresh, y_thresh) |\n| | P | str, List(str), Predictor, List(Predictor) | \"best\" | The predictor to use for the calculation of the saturation value. |\n| | kwargs | dict | Empty | See `LC.threshold_cust` for optional parameters. |\n| LC.threshold_cust | | | | See `threshold` |\n| | P | str, List(str), Predictor, List(Predictor) | \"best\" | The predictor to use for the calculation of the saturation value. |\n| | x | array | _Required_ | X values (training set sizes) |\n| | threshold | float [0.0, 1.0] | 0.99 | Fraction of the saturation value to use for the calculation of the best training set size. |\n| | max_scaling | int | 3 | Order of magnitude added to the order of magnitude of the maximum train set size. If `Predictor` is diverging, the total order of magnitude is used for the calculation of the saturation value. Generally, a value of `3` is enough. A value bigger than `5` may lead to a `MemoryError`. 
|\n| | force | Bool | False | Set to `True` not to raise a `ValueError` if `max_scaling` is > 5 |\n| LC.get_scale | | | | Returns the order of magnitude of the mean of an array |\n| | val | array | _Required_ | |\n| LC.best_predictor | | | | Returns the best predictor of the `LearningCurve` data for the test score learning curve |\n| | kwargs | dict | Empty | See `LC.best_predictor_cust` for optional parameters. |\n| LC.best_predictor_cust | | | | See `best_predictor` |\n| | predictors | List(Predictors) | _Required_ | `Predictor`s to consider |\n| | x | Array | _Required_ | 1D array (list) representing the training sizes |\n| | y | Array | _Required_ | 1D array (list) representing the scores |\n| LC.best_predictor_cust | | | | Find the best predictor for a custom learning curve |\n| | x | Array | _Required_ | 1D array (list) representing the training sizes |\n| | y | Array | _Required_ | 1D array (list) representing the scores |\n| | fit | Bool | True | Perform a fit of the `Predictor`s before classifying |\n| LC.plot | | | | Plot the training and test learning curve of the `LearningCurve` data, and optionally a fitted function |\n| | predictor | str, List(str), Predictor, List(Predictor) | None | `Predictor`s to use for plotting the fitted curve. Can also be \"all\" and \"best\". |\n| | kwargs | dict | None | See `LC.plot_cust` for optional parameters |\n| LC.plot_cust | | | | Plot any training and test learning curve, and optionally a fitted function. |\n| | train_sizes | array | _Required_ | Data points of the learning curve. 
The output of `LC.train` can be used as parameters of this function |\n| | train_scores_mean | array | _Required_ | See `train_sizes` parameter |\n| | train_scores_std | array | _Required_ | See `train_sizes` parameter |\n| | test_scores_mean | array | _Required_ | See `train_sizes` parameter |\n| | test_scores_std | array | _Required_ | See `train_sizes` parameter |\n| | predictor | array | _Required_ | See `LC.plot` |\n| | ylim | tuple | None | Limits of the y axis of the plot |\n| | figsize | tuple | None | Size of the figure |\n| | title | str | None | Title of the plot |\n| | saturation | str, List(str), Predictor, List(Predictor) | None | `Predictor`s to consider for displaying the saturation on the plot. |\n| | kwargs | dict | Empty | See `plot_saturation` for optional parameters |\n| LC.plot_fitted_curve | | | | Add a fitted curve to a matplotlib figure |\n| | ax | matplotlib ax | _Required_ | Axes on which the curve will be drawn |\n| | P | Predictor | _Required_ | `Predictor` to use to compute the curve |\n| | x | array | _Required_ | 1D array (list) representing the training sizes |\n| | scores | Bool | True | Show the score of the `Predictor`s |\n| LC.save | | | | Save the `LearningCurve` object to disk using `dill` |\n| | path | Path/str | lc_data.pkl | Path to the file where the save will be done |\n| LC.load | | | | Load a `LearningCurve` object from disk. |\n| | path | Path/str | lc_data.pkl | Path to the file where the save is located |\n| LC.plot_saturation | | | | Add saturation lines to a plot. |\n| | ax | matplotlib ax | _Required_ | Axes to use |\n| | P | Predictor | _Required_ | `Predictor` to consider |\n| | alpha | float | 1 | alpha applied to the lines |\n| | lw | float | 1.3 | matplotlib lw parameter applied to the lines. |\n| LC.get_unique_list | | | | Return a list of unique predictors. |\n| | predictors | List(Predictor) | _Required_ | List of `Predictor`s to consider. 
|", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/H4dr1en/learning-curves/archive/0.2.2.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/H4dr1en/learning-curves", "keywords": "Learning,curve,machine,learning,saturation,accuracy", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "learning-curves", "package_url": "https://pypi.org/project/learning-curves/", "platform": "", "project_url": "https://pypi.org/project/learning-curves/", "project_urls": { "Download": "https://github.com/H4dr1en/learning-curves/archive/0.2.2.tar.gz", "Homepage": "https://github.com/H4dr1en/learning-curves" }, "release_url": "https://pypi.org/project/learning-curves/0.2.2/", "requires_dist": null, "requires_python": "", "summary": "Python module allowing to easily calculate and plot the learning curve of a machine learning model and find the maximum expected accuracy", "version": "0.2.2" }, "last_serial": 5128237, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "cba8a3fe9ddaa3a64c0385e68eff83bd", "sha256": "3636211c75105f2268ab392538895d05a64e179d89014832395faa391a9bcf4e" }, "downloads": -1, "filename": "learning_curves-0.1.0-py2-none-any.whl", "has_sig": false, "md5_digest": "cba8a3fe9ddaa3a64c0385e68eff83bd", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 10292, "upload_time": "2019-04-11T07:07:16", "url": "https://files.pythonhosted.org/packages/85/e1/13d4952129b9c45d5e606d01110c99d54e9e794a815a9389c98f38c8cd9b/learning_curves-0.1.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e29aeff36728801f5bf370bad7b14290", "sha256": "f7e85ee1630b60f252f16faad81a2e96021ca44c254af5931ccbf08fbb2b879f" }, "downloads": -1, "filename": "learning-curves-0.1.0.tar.gz", "has_sig": false, "md5_digest": "e29aeff36728801f5bf370bad7b14290", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 5643, "upload_time": "2019-04-09T12:57:38", "url": "https://files.pythonhosted.org/packages/33/73/85af2c542cc46431ec44e6a5957d0c10d804aa3f025887558a758e84740a/learning-curves-0.1.0.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "bc768eccec92e3aafe53696b0944a1f7", "sha256": "9c96fe7f6aa7fb04343285dfec46862d3ba79fd9f6412584d0c655589f087b9f" }, "downloads": -1, "filename": "learning-curves-0.2.2.tar.gz", "has_sig": false, "md5_digest": "bc768eccec92e3aafe53696b0944a1f7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19739, "upload_time": "2019-04-11T08:26:18", "url": "https://files.pythonhosted.org/packages/a7/9e/4121a07a9360f9bdb93971af9df8b078c0c03755e4a154822d89855b19a2/learning-curves-0.2.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "bc768eccec92e3aafe53696b0944a1f7", "sha256": "9c96fe7f6aa7fb04343285dfec46862d3ba79fd9f6412584d0c655589f087b9f" }, "downloads": -1, "filename": "learning-curves-0.2.2.tar.gz", "has_sig": false, "md5_digest": "bc768eccec92e3aafe53696b0944a1f7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19739, "upload_time": "2019-04-11T08:26:18", "url": "https://files.pythonhosted.org/packages/a7/9e/4121a07a9360f9bdb93971af9df8b078c0c03755e4a154822d89855b19a2/learning-curves-0.2.2.tar.gz" } ] }