{ "info": { "author": "Timothy Duffy", "author_email": "tim@timduffy.me", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Natural Language :: English", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3 :: Only" ], "description": "# iddt\nInternet Document Discovery Tool\n\n## What is it\n\nhttps://github.com/thequbit/iddt\n\nThere are three parts of `iddt`\n\n- Worker\n- Dispatcher\n- MongoDB\n\nThe worker is what does all of the hard lifting with the internet, and \nthe dispatcher keep everyone in line. You can have any many workers as\nyou're system will allow mongdb connections. MongoDB is used as the\ncentral cache to limit the amount of bandwidth needed to scrape target\nURLs.\n\n##How to use it\n\n###Requirements\n\n`iddt` uses MongoDB as a central cache while it is working. You'll need to\ninstall MongoDB to use `iddt`.\n\n- Ubuntu\n\n $ sudo apt-get install mongodb\n\n###Worker\n\nYou will probably want to run the worker ( or many workers ) as daemons.\nThis functionality is built into `iddt`. use the following code as a \nstarting point:\n\n import sys\n from iddt import Worker\n\n import logging\n\n logging.basicConfig(level=logging.INFO)\n logger = logging.getLogger(\"iddt.worker_test\")\n\n class MyWorker(Worker):\n\n def __init__(self, *args, **kwargs):\n super(MyWorker, self).__init__()\n logging.info(\"MyWorker __init__() complete.\")\n\n def new_doc(self, document):\n # do something with the document\n pass\n \n if __name__ == '__main__':\n pidfile_path = '/tmp/worker.pid'\n if len(sys.argv) == 3:\n pidfile_path = sys.argv[2]\n worker = MyWorker(pidfile=pidfile_path)\n worker.register_callback(worker.new_doc)\n if len(sys.argv) >= 2:\n #logger.info('{} {}'.format(sys.argv[0], sys.argv[1]))\n if 'start' == sys.argv[1]:\n worker.start()\n elif 'stop' == sys.argv[1]:\n worker.stop()\n elif 'restart' == sys.argv[1]:\n worker.restart()\n elif 'status' == sys.argv[1]:\n worker.status()\n else:\n print(\"Unknown command\")\n sys.exit(2)\n sys.exit(0)\n else:\n #logger.warning('show cmd deamon usage')\n print(\"Usage: {} start|stop|restart\".format(sys.argv[0]))\n sys.exit(2)\n\n\nThis will allow you to start, stop, and restart a worker daemon at the\ncommand prompt. If you are interested in using the worker NOT as a \ndaemon, you can execute the same functionality ( note this function\nis fully blocking ) by using the .run() function.\n\n from iddt import Worker\n \n def new_doc(document):\n # do something with the document\n pass\n \n worker = MyWorker()\n worker.register_callback(new_doc)\n worker.run()\n\nYou're on your own to gracefully exit the `run()` function. If you set\n`worker._running` to `False` it *should* gracefully exit after a short while.\n\n##Dispatcher\n\nThe dispatcher tells the workers what to work on. You use it something like\nthis:\n\n from iddt.dispatcher import Dispatcher\n\n d = Dispatcher()\n d.dispatch({\n 'target_url': 'http://example.com/',\n 'link_level': 1,\n 'allowed_domains': [],\n })\n\n # this is how you query the results based on mime type\n some_docs = dispatcher.get_documents(['application/pdf'])\n\n # this is how you get ALL of the documents\n all_docs = dispatcher.get_documents()\n\nNote that the `dispatcher.dispatch()` function requires a dict with the \nfollowing fields:\n\n- `target_url`\n - This is the URL that the Workers (scrapers) should be working on\n- `link_level`\n - This is the number of links to follow. Be careful with numbers above 3\n- `allowed_domains`\n - The `iddt` Worker won't follow links away from the TLD of the \n `target_url`. If you would like it to, you can supply the list of\n allowed domains here.\n\n## Caution\n\nThis is a really powerful tool. Please be curtious with it.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "UNKNOWN", "keywords": "documents\nscraper\ndiscovery", "license": "GPLv3+", "maintainer": null, "maintainer_email": null, "name": "iddt", "package_url": "https://pypi.org/project/iddt/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/iddt/", "project_urls": { "Download": "UNKNOWN", "Homepage": "UNKNOWN" }, "release_url": "https://pypi.org/project/iddt/0.1.13/", "requires_dist": null, "requires_python": null, "summary": "Internet Document Discovery Tool", "version": "0.1.13" }, "last_serial": 1882328, "releases": { "0.1.0": [], "0.1.11": [ { "comment_text": "", "digests": { "md5": "128a8450cb6ac5aa79e181ed092318a4", "sha256": "e9723149093fed37befad527e792db6eab35e84cefb0fe0970308e565f35aa1c" }, "downloads": -1, "filename": "iddt-0.1.11.tar.gz", "has_sig": false, "md5_digest": "128a8450cb6ac5aa79e181ed092318a4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22787, "upload_time": "2015-09-07T02:21:22", "url": "https://files.pythonhosted.org/packages/01/58/32190564d7fbd9a630c99b53f398f9222e21f260f5551216cf5055f2d7eb/iddt-0.1.11.tar.gz" } ], "0.1.12": [ { "comment_text": "", "digests": { "md5": "465500c1b0f6f0918135a18e1c494a0d", "sha256": "a90bfdc4a767bc942625612cb44e5400f974accaec66f32578ee4b5b3b96c2ba" }, "downloads": -1, "filename": "iddt-0.1.12.tar.gz", "has_sig": false, "md5_digest": "465500c1b0f6f0918135a18e1c494a0d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22108, "upload_time": "2015-09-07T23:32:38", "url": "https://files.pythonhosted.org/packages/99/b7/532803bdf63e207a2e26cf8e29204daccf21e8c5852d2df3ce5abc8b2300/iddt-0.1.12.tar.gz" } ], "0.1.13": [ { "comment_text": "", "digests": { "md5": "7fd222ee24387843f31e3e90ee718dc0", "sha256": "b3f9f9268d5342b26088f644952c59c7a352af7979a444cd4fa0afa135ca8e05" }, "downloads": -1, "filename": "iddt-0.1.13.tar.gz", "has_sig": false, "md5_digest": "7fd222ee24387843f31e3e90ee718dc0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22515, "upload_time": "2015-12-30T13:24:42", "url": "https://files.pythonhosted.org/packages/3e/e2/a4eb84dac9918d1fc9772dc0edc2421017015ffc1d2b33f3cf57cb4c828f/iddt-0.1.13.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "24dae684fd37fc3780aa8abe812cff79", "sha256": "e11a6acdec4f74e9314501012f4eb498481a6b66ca7db3598ca8a93fbabbc120" }, "downloads": -1, "filename": "iddt-0.1.2.tar.gz", "has_sig": false, "md5_digest": "24dae684fd37fc3780aa8abe812cff79", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20710, "upload_time": "2015-09-03T18:13:38", "url": "https://files.pythonhosted.org/packages/7e/50/d8b1c1b6c10ffe87481aa012d8d6e905079112850da1e1557e9f9330b54a/iddt-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "4f4c20107f344713a208f16967fcb78f", "sha256": "84de0026b85df2caadf5ab350315293cad3dcc577ce45235bdfcdf54a1db04bb" }, "downloads": -1, "filename": "iddt-0.1.3.tar.gz", "has_sig": false, "md5_digest": "4f4c20107f344713a208f16967fcb78f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21943, "upload_time": "2015-09-03T18:36:48", "url": "https://files.pythonhosted.org/packages/00/66/2dab4130cc125abe4f96e32d9cbb985508415803c26e990144073ec1e7c4/iddt-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "076ce17f81d462f15bdd49b0749d1888", "sha256": "55843468490f07218465f059e8c8b210e4706ec2d98b8b7be567483cfb3ed74f" }, "downloads": -1, "filename": "iddt-0.1.4.tar.gz", "has_sig": false, "md5_digest": "076ce17f81d462f15bdd49b0749d1888", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22421, "upload_time": "2015-09-04T14:13:46", "url": "https://files.pythonhosted.org/packages/77/b1/b7512439058780cc96335168d31f6df16860847415792d776f5c4c282e8f/iddt-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "ed4efbed50591ce431e65b59da6a0995", "sha256": "9bfb50317df7deb6758a62870f17bb6001c30d17f5ac8fc379111f80a0bfd463" }, "downloads": -1, "filename": "iddt-0.1.5.tar.gz", "has_sig": false, "md5_digest": "ed4efbed50591ce431e65b59da6a0995", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22570, "upload_time": "2015-09-04T15:25:02", "url": "https://files.pythonhosted.org/packages/39/33/2d452186108e586d3eaa8e62c57a4d62c3823fed385bcb465e0691b07c2d/iddt-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "8bd223476e9e9430d27f1f7cfafc42e3", "sha256": "8eb0334df89b9a645bd01d639ae071efeb52b9a5887bb00dfc6556caba7ff357" }, "downloads": -1, "filename": "iddt-0.1.6.tar.gz", "has_sig": false, "md5_digest": "8bd223476e9e9430d27f1f7cfafc42e3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22584, "upload_time": "2015-09-04T18:31:20", "url": "https://files.pythonhosted.org/packages/3d/d1/70c0495380b4b90fe7a667a9c26c0b0c71429cb607fde488bb5e0e8003ad/iddt-0.1.6.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "da791b51d87592d1556d39a0043190cf", "sha256": "fe4a5b158b5b857a168f6eb3fea5a5e823837d408e699e2c9f1cdefec09c103c" }, "downloads": -1, "filename": "iddt-0.1.7.tar.gz", "has_sig": false, "md5_digest": "da791b51d87592d1556d39a0043190cf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22714, "upload_time": "2015-09-06T22:47:37", "url": "https://files.pythonhosted.org/packages/a9/fd/a867abfeade685cff7b865b40a585617886167bfb320eb8be4459bb790e7/iddt-0.1.7.tar.gz" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "7754e69794f3087fb8577b28ec8da6a9", "sha256": "829e1944715d58bf5f44db6d59631df8cebd87af46c6199ff38a299ce25d8d5c" }, "downloads": -1, "filename": "iddt-0.1.8.tar.gz", "has_sig": false, "md5_digest": "7754e69794f3087fb8577b28ec8da6a9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22742, "upload_time": "2015-09-07T02:07:29", "url": "https://files.pythonhosted.org/packages/40/6d/87b61d35f22a396aad35e3af62d7087cf8c6bacb0f72ae04168848eff2ec/iddt-0.1.8.tar.gz" } ], "0.1.9": [ { "comment_text": "", "digests": { "md5": "08900b003110ef9c8627f3636f78d4ad", "sha256": "c679685bba9aeca8528180091401f9234dd3623e4a34d156b79170aaf9590991" }, "downloads": -1, "filename": "iddt-0.1.9.tar.gz", "has_sig": false, "md5_digest": "08900b003110ef9c8627f3636f78d4ad", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22763, "upload_time": "2015-09-07T02:14:51", "url": "https://files.pythonhosted.org/packages/51/b3/60d117b5808229cce68366b4810674eeb6c429ff17d008ca74f063730506/iddt-0.1.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7fd222ee24387843f31e3e90ee718dc0", "sha256": "b3f9f9268d5342b26088f644952c59c7a352af7979a444cd4fa0afa135ca8e05" }, "downloads": -1, "filename": "iddt-0.1.13.tar.gz", "has_sig": false, "md5_digest": "7fd222ee24387843f31e3e90ee718dc0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22515, "upload_time": "2015-12-30T13:24:42", "url": "https://files.pythonhosted.org/packages/3e/e2/a4eb84dac9918d1fc9772dc0edc2421017015ffc1d2b33f3cf57cb4c828f/iddt-0.1.13.tar.gz" } ] }