{ "info": { "author": "Konstantin Lopuhin", "author_email": "kostia.lopuhin@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Internet :: WWW/HTTP :: Indexing/Search" ], "description": "soft404: a classifier for detecting soft 404 pages\n==================================================\n\n.. image:: https://img.shields.io/pypi/v/soft404.svg\n :target: https://pypi.python.org/pypi/soft404\n :alt: PyPI Version\n\n.. image:: https://img.shields.io/travis/TeamHG-Memex/soft404/master.svg\n :target: http://travis-ci.org/TeamHG-Memex/soft404\n :alt: Build Status\n\n.. image:: http://codecov.io/github/TeamHG-Memex/soft404/coverage.svg?branch=master\n :target: http://codecov.io/github/TeamHG-Memex/soft404?branch=master\n :alt: Code Coverage\n\nA \"soft\" 404 page is a page that is served with 200 status,\nbut is really a page that says that content is not available.\n\n.. contents::\n\n\nInstallation\n------------\n\n::\n\n pip install soft404\n\n\nUsage\n-----\n\nThe easiest way is to use the ``soft404.probability`` function::\n\n >>> import soft404\n >>> soft404.probability('

Page not found

')\n 0.9736860086882132\n\nYou can also create a classifier explicitly::\n\n >>> from soft404 import Soft404Classifier\n >>> clf = Soft404Classifier()\n >>> clf.predict('

Page not found

')\n 0.9736860086882132\n\n\nDevelopment\n-----------\n\nClassifier is trained on 120k pages from 25k domains, with 404 page ratio of about 1/3.\nWith 10-fold cross-validation, PR AUC (average precision) is 0.990 \u00b1 0.003,\nand ROC AUC is 0.995 \u00b1 0.002.\n\n\nGetting data for training\n+++++++++++++++++++++++++\n\nInstall dev requirements::\n\n pip install -r requirements_dev.txt\n\nRun the crawler for a while (results will appear in ``pages.jl.gz`` file)::\n\n cd crawler\n scrapy crawl spider -o gzip:pages.jl -s JOBDIR=job\n\n\nTraining\n++++++++\n\nFirst, extract text and structure from html::\n\n ./soft404/convert_to_text.py pages.jl.gz items\n\nThis will produce two files, ``items.meta.jl.gz`` and ``items.items.jl.gz``.\nNext, train the classifier::\n\n ./soft404/train.py items\n\nVectorizer takes a while to run, but it's result is cached (the filename\nwhere it is cached will be printed on the next run).\nIf you are happy with results, save the classifier::\n\n ./soft404/train.py items --save soft404/clf.joblib\n\n\nLicense\n-------\n\nLicense is MIT.\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/TeamHG-Memex/soft404", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "soft404", "package_url": "https://pypi.org/project/soft404/", "platform": "", "project_url": "https://pypi.org/project/soft404/", "project_urls": { "Homepage": "https://github.com/TeamHG-Memex/soft404" }, "release_url": "https://pypi.org/project/soft404/0.2.1/", "requires_dist": null, "requires_python": "", "summary": "A classifier for detecting soft 404 pages", "version": "0.2.1" }, "last_serial": 4308943, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "191fc20b37ad3091136744f8e0e98d07", "sha256": "dcbe8e9ffc19bc2efde5f002dfba06427d15d7bdce8a998b3b0b77234c7b9087" }, "downloads": -1, "filename": "soft404-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "191fc20b37ad3091136744f8e0e98d07", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 74700, "upload_time": "2016-09-19T11:40:26", "url": "https://files.pythonhosted.org/packages/20/0f/663cbdc592a9edb9f47df67e6446e48af192d5af6620b0be93123799d062/soft404-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b2b0cb0b9cbbdd9784dfe627f8a256da", "sha256": "22b3aa24d8cca606f31ba5169060a79ff9e61e571b19260eaef0c4947228ae2a" }, "downloads": -1, "filename": "soft404-0.1.0.tar.gz", "has_sig": false, "md5_digest": "b2b0cb0b9cbbdd9784dfe627f8a256da", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 73310, "upload_time": "2016-09-19T11:40:23", "url": "https://files.pythonhosted.org/packages/e2/82/8200ef3f98c2558ccf869f971abc07306dd0b73dab0c12995cf5c0a1a2f5/soft404-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "7d031e0628941a4026ffc7d7323a0fa7", "sha256": "35c23aafbd3d3d8863af8afb4695976625f3bc32126c8de65964f5db0f9c2971" }, "downloads": -1, "filename": "soft404-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "7d031e0628941a4026ffc7d7323a0fa7", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 75345, "upload_time": "2016-09-20T13:39:49", "url": "https://files.pythonhosted.org/packages/3f/5a/a90f34bbc6d08252a7a43c3161e8bb3f309f50b1dc2e414a9affaafd2107/soft404-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb958264cddb57bb7efd790e4ad0ea9d", "sha256": "dbc58f7cb405bb2f2a02e3598372462c9357c52fa6f73c5d238c1d95050193ea" }, "downloads": -1, "filename": "soft404-0.1.1.tar.gz", "has_sig": false, "md5_digest": "eb958264cddb57bb7efd790e4ad0ea9d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 74181, "upload_time": "2016-09-20T13:39:46", "url": "https://files.pythonhosted.org/packages/6c/a9/32e4519cc2d60b23f0e67cdc1fd8e4f5c68caeb7d706e914b6ee5324640c/soft404-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "9193592cfaf484245ba5fb590fbbb734", "sha256": "9cfa1f5e4ead27cc2dd7c6074ce3fe8f7be9aaf1965433ee119813870996b162" }, "downloads": -1, "filename": "soft404-0.1.2.tar.gz", "has_sig": false, "md5_digest": "9193592cfaf484245ba5fb590fbbb734", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 74450, "upload_time": "2016-12-26T12:44:14", "url": "https://files.pythonhosted.org/packages/dc/e2/359ef14ae180bcb51a647a0b30f8f6501faa4a3c8ab93095d782f4e23916/soft404-0.1.2.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "661ccb41bc16c586f67891ae4022bb02", "sha256": "d791894697c852b662725ef282cf10219623297f413b656a1ee178fa1d75c046" }, "downloads": -1, "filename": "soft404-0.2.0.tar.gz", "has_sig": false, "md5_digest": "661ccb41bc16c586f67891ae4022bb02", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30251, "upload_time": "2017-01-19T14:19:55", "url": "https://files.pythonhosted.org/packages/05/59/d62e728891fd3e56cfdb655e2cc6f6ddaae462e4c3c139b94ad93c00eea6/soft404-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "cdeef15dd9456a109f3354bce5376410", "sha256": "09d2cf8bd6264d542f3f8613e3d8693e229eca149b203b30822506e4e0805c4f" }, "downloads": -1, "filename": "soft404-0.2.1.tar.gz", "has_sig": false, "md5_digest": "cdeef15dd9456a109f3354bce5376410", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30332, "upload_time": "2017-05-22T20:51:01", "url": "https://files.pythonhosted.org/packages/de/16/1691f87f56a6a8ef0bbd164b389b1ab97a3fe82556921a1d619dcb5ba6a7/soft404-0.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cdeef15dd9456a109f3354bce5376410", "sha256": "09d2cf8bd6264d542f3f8613e3d8693e229eca149b203b30822506e4e0805c4f" }, "downloads": -1, "filename": "soft404-0.2.1.tar.gz", "has_sig": false, "md5_digest": "cdeef15dd9456a109f3354bce5376410", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30332, "upload_time": "2017-05-22T20:51:01", "url": "https://files.pythonhosted.org/packages/de/16/1691f87f56a6a8ef0bbd164b389b1ab97a3fe82556921a1d619dcb5ba6a7/soft404-0.2.1.tar.gz" } ] }