{ "info": { "author": "Dominik Mach\u00e1\u010dek", "author_email": "gldkslfmsd@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Science/Research", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Text Processing :: Linguistic" ], "description": "ConcordanceCrawler\n==================\n\n|Build Status|\n\nConcordanceCrawler is a tool for automatic concordance extraction from\nthe Internet. A concordance is a sentence containing a given word.\nGive it an English word, and ConcordanceCrawler can download hundreds\nof thousands of sentences containing that word.\n\nConcordanceCrawler is a console application whose main purpose is\ncrawling English verbs or other words, but you can also use its API,\nwhich allows you to adapt ConcordanceCrawler to any other language.\n\nTry web demo!\n-------------\n\nWill be launched soon.\n\nInstallation\n------------\n\nI recommend installing ConcordanceCrawler in a Python virtual\nenvironment (see https://virtualenv.pypa.io/ ). All dependencies\nare installed automatically except one. The only nontrivial\ndependency is ``lxml``, which you must install globally with root\npermissions; it is needed for fast parsing. Install\n``lxml`` with the following command (if you use Python 3) or follow\nthe instructions on its `homepage `__:\n\n::\n\n sudo apt-get install python3-lxml\n\nIf you have a working ``lxml`` library, you can create a new Python\nvirtual environment with access to the global ``lxml`` installation, and\nthen install ConcordanceCrawler there. 
For Python3:\n\n::\n\n virtualenv -p python3 p3 --system-site-packages\n source p3/bin/activate\n pip install ConcordanceCrawler\n\n(For Python2, just replace 3 with 2.)\n\nIf you also want automatic conjugation of English verbs (or\ninflection of nouns or comparison of adjectives), install the\n``textblob`` library as well. It is not required; ConcordanceCrawler\nworks well without it. It is quite large, because it\nuses ``nltk``. To install it, just type ``pip install textblob`` and\n``python -m textblob.download_corpora``.\n\nNow you can simply run ``ConcordanceCrawler -h`` to see its\noptions.\n\nPython versions\n---------------\n\nConcordanceCrawler works with Python2 >= 2.7 and with Python3 >= 3.2.\n\nUnfortunately you cannot use ``textblob`` together with Python3.2, so\nyou cannot use automatic morphological analysis with this Python\nversion. However, all other features work well.\n\nUsage\n-----\n\nYou can use ConcordanceCrawler as a command-line application or as a\nPython library. This is the part of ConcordanceCrawler's help message\nthat summarizes all your options:\n\n::\n\n usage: ConcordanceCrawler [-h] [-n NUMBER_OF_CONCORDANCES] [-m MAX_PER_PAGE]\n [--disable-english-filter] [-o OUTPUT]\n [-b {RANDOM,WIKI_ARTICLES,WIKI_TITLES,NUMBERS}]\n [-f {json,xml}] [-v {0,1,2,3}] [-p PART_OF_SPEECH]\n [-e ENCODING] [--backup-off]\n [--backup-file BACKUP_FILE]\n [--extend-corpus EXTEND_CORPUS]\n [--continue-from-backup CONTINUE_FROM_BACKUP]\n [word [word ...]]\n\nBrand-new crawling job\n~~~~~~~~~~~~~~~~~~~~~~\n\nConcordanceCrawler allows you to create a brand-new crawling job or\ncontinue an old job that was unexpectedly interrupted\nbefore finishing.\n\nWhen creating a new job you may want to use the following options:\n\n``word [word ...]`` is the target word you are interested in.\nYou can specify several of its forms; then all sentences containing at\nleast one of these words will be crawled. 
The first word is treated as the\ncanonical form, and ConcordanceCrawler uses it when searching for links on a\nsearch engine.\n\n``-p REGEX``, ``--part-of-speech REGEX`` is a regex for the target word's\npart-of-speech tag; tag values are taken from the Penn Treebank II tagset. See\nhttp://www.clips.ua.ac.be/pages/mbsp-tags for a detailed description.\n\nThe default value for this option is ``.*``, which matches any\npart-of-speech. If you select another regex (e.g. ``V.*`` for verbs,\n``N.*`` for nouns, ``J.*`` for adjectives), the ``textblob`` library\nwill be used. The size of this library is not negligible, because it uses\n``nltk``, so it is not an integral part of ConcordanceCrawler; you\nmust install it manually if you want it. Alternatively, you can omit this\noption, inflect your target word manually and pass all its forms as\nadditional values of the ``word`` argument.\n\n*Example usage:* assume that the target word is ``fly`` and the given regex is\n``V.*``, meaning we want to crawl only verbs. Then the word ``flies``\n(tagged ``VBZ``) matches, as does ``flew``, whose tag is ``VBD``. On\nthe other hand, the insect ``fly`` with tag ``NN`` doesn't match, so\nsentences with this word will be ignored.\n\n``-n N``, ``--number-of-concordances N`` is the number of concordances\nthat you want the program to crawl. By default it's 10.\n\n``-m M``, ``--max-per-page M`` is the maximum number of concordances that\nwill be crawled from one page. They are taken from top to bottom. If\nthis option is omitted, the number is not limited.\n\n``--disable-english-filter`` disables filtering of non-English\nsentences from concordances. By default the filter is enabled. This option\naffects the quality of the resulting corpus.\n\n``-e ENCODING, --encoding ENCODING``. If not given, documents are\ncrawled regardless of their encoding. If you select ASCII as the\nencoding, all concordances containing any non-ASCII character will be\nignored. If you select any other charset, e.g. 
utf-8, all documents\nthat do not declare this charset in the HTTP header or in an HTML meta tag\nwill be ignored (as well as documents whose charset values in the\nHTTP header and in the meta tag differ). This option affects the quality of the\ncorpus and the speed of crawling.\n\n``-b {RANDOM,WIKI_ARTICLES,WIKI_TITLES,NUMBERS}, --bazword-generator``\nselects how ConcordanceCrawler generates or obtains bazwords. It uses\nthem to increase the number of links that can be found at Bing.com.\n\n- ``RANDOM`` -- the bazword is a random four-letter word. This option\n is the default.\n\n- ``WIKI_ARTICLES`` -- a random Wikipedia article is downloaded from\n https://en.wikipedia.org/wiki/Special:Random and its words are used\n as bazwords, one word per search request.\n\n- ``WIKI_TITLES`` -- same as the previous option, but only words from\n titles are used as bazwords.\n\n- ``NUMBERS`` -- bazwords will be 0, 1, 2, 3, ...\n\n``-o OUTPUT``, ``--output OUTPUT`` is the name of the output \"file\". It can\nbe e.g. ``concordances.xml``, ``concordances.json`` or any other \"file\"\nsuch as ``/dev/null``. If the file exists, it is overwritten; otherwise a\nnew one is created. The default ``OUTPUT`` is standard output.\n\n``-f {json, xml}, --format`` is the output format; the default is json.\n\n``-v {0,1,2,3}, --verbosity`` is the verbosity level; see the rest of the help\nmessage (``ConcordanceCrawler -h``) for more info.\n\n``--buffer-size BUFFER_SIZE`` -- sets the maximum number of items in the memory\nbuffers that are used to prevent repeated visits to the same URL and\nrepeated crawling of the same concordance. The default value is 1000000.\nChoosing too large a number can lead to an out-of-memory error (but this\nshould happen only after a very long time). 
Choosing too small a number\ncan lead to repeated visits to the same URL or repeated crawling of the same\nconcordance, because the buffers work like queues: when they are full,\nthey delete the oldest values to make room for new ones.\n\n``--backup-file BACKUP_FILE`` -- name of the backup file. This backup file\nallows you to continue an aborted crawling job and extend the corpus. The default\nname is ``ConcordanceCrawler.backup``, and the file is created in the working\ndirectory.\n\n``--backup-off`` -- don't create a backup file\n\nRestarting of crawling job\n~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIf you want to restart a crawling job that was interrupted earlier, you\nmust start ConcordanceCrawler with these two options:\n\n``--extend-corpus EXTEND_CORPUS`` -- the output file created by the previous\ncrawling job, which will now be extended.\n\n``--continue-from-backup CONTINUE_FROM_BACKUP`` -- the backup file\n\nExamples\n--------\n\n``ConcordanceCrawler hello -n 1 -v 3`` will crawl one sentence\ncontaining the word \"hello\".\n\n::\n\n [\n {\n \"start\": 0,\n \"keyword\": \"hello\",\n \"end\": 5,\n \"url\": \"http://grokbase.com/t/mysql/maxdb/0469p5zksn/error-in-complete-backup\",\n \"date\": \"2016-04-30 14:45:01.220790\",\n \"concordance\": \"Hello Marco,you can find the output and error output of TSM's adint2 in the files dbm.ebp and dbm.ebl.\",\n \"id\": 1\n }\n ]\n\nHere you can see some information about the progress of crawling. 
We\nordered one sentence with a verb *think*.\n\n::\n\n ConcordanceCrawler think -p 'V.*' -n 1\n\n::\n\n 2016-04-30 14:52:53,220 STATUS: ConcordanceCrawler version 1.0.0 started, press Ctrl+C for interrupt\n 2016-04-30 14:52:53,811 DETAILS: crawled SERP, parsed 50 links\n 2016-04-30 14:52:53,812 STATUS: Crawling status \n serp 1 (0 errors) \n links crawled 50 (0 filtered because of format suffix, 0 crawled repeatedly)\n pages visited 0 (0 filtered by encoding filter, 0 filtered by language filter, 0 errors)\n concordances 0 (0 crawled repeatedly)\n 2016-04-30 14:53:04,387 ERROR: 'HTTPConnectionPool(host='spknclothing.com', port=80): Read timed out. (read timeout=10)' occured during getting http://spknclothing.com/about/\n 2016-04-30 14:53:09,411 DETAILS: page http://spknclothing.com/ visited, 1 concordances found\n 2016-04-30 14:53:09,411 STATUS: Crawling status \n serp 1 (0 errors) \n links crawled 50 (0 filtered because of format suffix, 0 crawled repeatedly)\n pages visited 1 (0 filtered by encoding filter, 0 filtered by language filter, 1 errors)\n concordances 0 (0 crawled repeatedly)\n 2016-04-30 14:53:09,411 STATUS: Crawling status \n serp 1 (0 errors) \n links crawled 50 (0 filtered because of format suffix, 0 crawled repeatedly)\n pages visited 1 (0 filtered by encoding filter, 0 filtered by language filter, 1 errors)\n concordances 1 (0 crawled repeatedly)\n [\n {\n \"date\": \"2016-04-30 14:53:09.411287\",\n \"concordance\": \"not everyone thinks about BMX when they think of Maine but these guys have a great scene\u2026\",\n \"url\": \"http://spknclothing.com/\",\n \"id\": 1,\n \"keyword\": \"think\",\n \"end\": 19,\n \"start\": 13\n }\n ]\n\nWithout ``textblob`` installation we could get the same input by this\ncommand:\n\n::\n\n ConcordanceCrawler think thinks thinking thought -n 1\n\nCustomize ConcordanceCrawler!\n-----------------------------\n\nYou can also use ConcordanceCrawler as a library in your own project.\nThen you can for example adapt it 
for your own language, improve or\ncustomize its functions or do anything else.\n\nCheck this\n`documentation `__\nand\n`example `__.\n\nHow does ConcordanceCrawler work?\n---------------------------------\n\nConcordanceCrawler finds links on the Bing.com search engine, visits them\nand extracts sentences containing the target word from the pages.\n\nThere's a small problem: Bing.com returns at most the first 1000\nlinks for any keyword, and that's too few. Therefore\nConcordanceCrawler searches for queries such as \"sdtn look\",\n\"naxe look\", \"jzmw look\" and similar combinations of a bazword and the target\nword. With this approach it gets a sufficient number of different links to\ncrawl concordances.\n\nContact me!\n-----------\n\nI'll be pleased if you contact me. You can send me anything (except\nspam :), such as a review or a request or idea for another feature; you can report\nan issue, fix a bug, and of course ask me a question about anything.\n\nYou can contact me via `GitHub `__ or\nemail: gldkslfmsd-at-gmail.com.\n\n.. 
|Build Status| image:: https://travis-ci.org/Gldkslfmsd/concordance-crawler.svg?branch=master\n :target: https://travis-ci.org/Gldkslfmsd/concordance-crawler\n", "description_content_type": null, "docs_url": null, "download_url": "http://pypi.python.org/pypi/ConcordanceCrawler", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Gldkslfmsd/concordance-crawler", "keywords": "concordance,crawling,corpus,Internet", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "ConcordanceCrawler", "package_url": "https://pypi.org/project/ConcordanceCrawler/", "platform": "any", "project_url": "https://pypi.org/project/ConcordanceCrawler/", "project_urls": { "Download": "http://pypi.python.org/pypi/ConcordanceCrawler", "Homepage": "https://github.com/Gldkslfmsd/concordance-crawler" }, "release_url": "https://pypi.org/project/ConcordanceCrawler/1.0.2/", "requires_dist": null, "requires_python": null, "summary": "A module for automatic concordance extraction from the Internet", "version": "1.0.2" }, "last_serial": 2119807, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "85fb382d6ff56d7f367db698be29f959", "sha256": "6fbe7f83e1f518e235cd3505c6b128e4ec74be85769843687da201cc7bfd9233" }, "downloads": -1, "filename": "ConcordanceCrawler-0.1.0.tar.gz", "has_sig": false, "md5_digest": "85fb382d6ff56d7f367db698be29f959", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9453, "upload_time": "2015-10-24T13:59:47", "url": "https://files.pythonhosted.org/packages/c8/d7/d89bdfb2bd978e6a65a3967b83ea4643f9cd9ad0929ccdb1525a7598d801/ConcordanceCrawler-0.1.0.tar.gz" } ], "0.1.21": [ { "comment_text": "", "digests": { "md5": "336c6028f78af1a975595d18723520ec", "sha256": "e19abe9936aaec547b618874f6990ed7440b9dc30371659c713920ae1072da56" }, "downloads": -1, "filename": "ConcordanceCrawler-0.1.21.tar.gz", "has_sig": false, "md5_digest": "336c6028f78af1a975595d18723520ec", 
"packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20052, "upload_time": "2015-11-24T17:50:02", "url": "https://files.pythonhosted.org/packages/3f/0d/6225d90c2466f2a524cb31eb58af962b71b055943d9ebcd25c5173e49642/ConcordanceCrawler-0.1.21.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "c8de0c4c6f6f987a2d98342c067326a5", "sha256": "43258aba88996106854349f2da8d40c02d05aef6bf9dd537f6a848aa35a51717" }, "downloads": -1, "filename": "ConcordanceCrawler-1.0.1.tar.gz", "has_sig": false, "md5_digest": "c8de0c4c6f6f987a2d98342c067326a5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1743014, "upload_time": "2016-05-13T16:31:16", "url": "https://files.pythonhosted.org/packages/fe/82/6b733a161390c47746f490fead6afc19e8b029e0bdd1a6f9c0b65a9042a0/ConcordanceCrawler-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "f345ae471728c0bc45699d8a7752df1c", "sha256": "0ed73b9d8344effdec6164af2bc0b27f5032cfcaed14f8db0942088955a3af17" }, "downloads": -1, "filename": "ConcordanceCrawler-1.0.2.tar.gz", "has_sig": false, "md5_digest": "f345ae471728c0bc45699d8a7752df1c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 947146, "upload_time": "2016-05-17T13:24:15", "url": "https://files.pythonhosted.org/packages/38/26/490782efe17deec2c3e8fe5b0bb9666db19d17cc799206a76a46a41ed150/ConcordanceCrawler-1.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f345ae471728c0bc45699d8a7752df1c", "sha256": "0ed73b9d8344effdec6164af2bc0b27f5032cfcaed14f8db0942088955a3af17" }, "downloads": -1, "filename": "ConcordanceCrawler-1.0.2.tar.gz", "has_sig": false, "md5_digest": "f345ae471728c0bc45699d8a7752df1c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 947146, "upload_time": "2016-05-17T13:24:15", "url": 
"https://files.pythonhosted.org/packages/38/26/490782efe17deec2c3e8fe5b0bb9666db19d17cc799206a76a46a41ed150/ConcordanceCrawler-1.0.2.tar.gz" } ] }