{ "info": { "author": "Richard Smith", "author_email": "randkego@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Topican - topic analyzer\n\nfrom the command line:\n```python3\ntopican_by_nouns_on_csv \n```\nIdentify topics by assuming topics can be identified from Nouns and a \"context\" word: \n- [spaCy](https://spacy.io/) is used to identify Nouns (including Proper nouns) in the text \n- nltk WordNet and spaCy are used to group similar nouns together (WordNet \"hyponyms\" are checked first; spaCy similarity is used if a hyponym is not found) \n- the top context words are then found for each noun \n- Output is a list of noun groups and associated context words, in order of frequency \n- The output also indicates the nouns that were grouped together\n\nFor example, the text \"I like python\", \"I love Python\", and \"I like C\" would be analysed as having 2 topic groups \"_python\" and \"_C\":\n```python3\n '_python', 2: [('like', 1), ('love', 1),] {('python', 2), }\n '_C', 1: [('like', 1), ] {('C', 1), }\n```\n\n## Meta\nRichard Smith \u2013 randkego@gmail.com\n\nDistributed under the MIT license. 
See ``LICENSE`` for more information.\n\n[https://github.com/randkego/topican](https://github.com/randkego/topican)\n\n## Installation\n\nPre-requisites (Linux and Windows):\n\n```sh\npip3 install topican\n\n# Install spaCy's large English language model\n# ** Warning: this requires approx 1GB of disk space\npython3 -m spacy download en_core_web_lg\n\n```\n\nNotes: Additional pre-requisites for Windows: \n- ```install spacy``` will fail if Microsoft Visual C++ is not already installed \n([https://visualstudio.microsoft.com/visual-cpp-build-tools/](https://visualstudio.microsoft.com/visual-cpp-build-tools/) may help in this case) \n- ```spaCy download en_core_web_lg``` may be unable to create a symbolic link. This can be manually created if required\n\n\n## Usage\nfrom the command line:\n```python3\nusage: topican_by_nouns_on_csv [-h]\n filepath text_col exclude_words\n top_n_noun_groups top_n_words max_hyponyms\n max_hyponym_depth sim_threshold\n\npositional arguments:\n filepath path of CSV file\n text_col name of text column in CSV file\n exclude_words words to exclude: list of words | True to just ignore\n NLTK stop-words | False | None\n top_n_noun_groups number of noun groups to find (0 to find all\n noun/'synonym' groups)\n top_n_words number of associated words to print for each noun group\n (0 to print all words)\n max_hyponyms maximum number of hyponyms a word may have before it is\n ignored - use this to exclude very general words that may\n not convey useful information (0 to have no limit on the\n number of hyponyms a word may have)\n max_hyponym_depth level of hyponym to extract (0 to extract all hyponyms)\n sim_threshold spaCy similarity level that words must reach to qualify\n as being similar\n\noptional arguments:\n -h, --help show this help message and exit\n```\n\nas a function:\n```python3\ntopican.print_words_associated_with_common_noun_groups(\n nlp, name, free_text_Series, exclude_words, top_n_noun_groups, top_n_words, max_hyponyms, 
max_hyponym_depth, sim_threshold)\n```\n- nlp: spaCy nlp object - this must be initialised with a language model that includes the word vectors\n- name: descriptive name for free_text_Series\n- free_text_Series: pandas Series of text in which to find the noun groups and associated words\n- exclude_words: to ignore certain words, e.g. not so useful 'stop words' or artificial words. \n This should take one of the following values: \n - True: to ignore NLTK stop-words and their capitalizations \n - A list of words to exclude \n - False or None otherwise\n- top_n_noun_groups: number of noun groups to find (specify 'None' to find all noun/'synonym' groups)\n- top_n_words: number of words that are associated with each noun group (specify 'None' for all words)\n- max_hyponyms: the maximum number of hyponyms a word may have before it is ignored (this is used to\n exclude very general words that may not convey useful information: specify 'None' for no restriction)\n- max_hyponym_depth: the level of hyponym to extract (specify 'None' to find all levels)\n- sim_threshold: the spaCy similarity level that words must reach to qualify as being a similar word\n\n\n## Usage examples\nfrom the command line:\n```python3\ntopican_by_nouns_on_csv test.csv text_col None 10 0 100 1 0.7\n```\n\nfunction:\n```python3\n# Some text to test\nimport pandas as pd\ntest_df = pd.DataFrame({'Text_col' : [\"I love Python\", \"I really love python\", \"I like python.\", \"python\", \"I like C but I prefer Python\", \"I don't like C any more\", \"I don't like python\", \"I really don't like C\"]})\n\n# Download NLTK stop-words if you want them in exclude_words\nimport nltk\nnltk.download('stopwords')\n\n# Load spaCy's large English language model (the large model is required to be able to use similarity)\n# ** Warning: this requires approx 1.8GB of RAM\nimport spacy\nnlp = spacy.load('en_core_web_lg')\n\nimport topican\ntopican.print_words_associated_with_common_noun_groups(nlp, \"test\", 
test_df['Text_col'], False, 10, None, 100, 1, 0.7)\n```\n![alt text](images/readme_usage_output.png \"topican usage example\")\n\n## Release History\n\n* 0.0.17\n * First release to GitHub\n* 0.0.18\n * Updates to README.md to note Windows install pre-requisites and the need to download wordnet\n* 0.0.19\n * Add script topican_by_nouns_on_csv to apply print_words_associated_with_common_noun_groups to a text column of a CSV file\n * function get_top_word_groups_by_synset_then_similarity: allow max_hyponyms and n_word_groups to be None to indicate no restriction on them\n * function print_words_associated_with_common_noun_groups: do not list words that will be excluded\n* 0.0.20\n * Update setup.py to add a topican_by_nouns_on_csv as an entry_point to console_scripts to be able to call that script directly\n* 0.0.21\n * Update setup.py to add the packages required for installation\n* 0.0.22\n * topican_by_nouns_on_csv.py: fix main signature and add param to parser.parse_args so that topican_by_nouns_on_csv can be called from the command line; remove nargs='+' type for exclude_words\n* 0.0.23\n * topican_by_nouns_on_csv.py: if exclude_words is True, nltk.download('stopwords')\n* 0.0.24\n * README.md: in the usage example for the function, download 'stopwords' not 'wordnet'\n\n## Contributing\n\n1. Fork it ()\n2. Create your feature branch (`git checkout -b feature/fooBar`)\n3. Commit your changes (`git commit -am 'Add some fooBar'`)\n4. Push to the branch (`git push origin feature/fooBar`)\n5. 
Create a new Pull Request\n\n\n[wiki]: https://github.com/randkego/topican/wiki\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/randkego/topican", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "topican", "package_url": "https://pypi.org/project/topican/", "platform": "", "project_url": "https://pypi.org/project/topican/", "project_urls": { "Homepage": "https://github.com/randkego/topican" }, "release_url": "https://pypi.org/project/topican/0.0.24/", "requires_dist": [ "pandas", "nltk", "spacy" ], "requires_python": "", "summary": "Topic analyser", "version": "0.0.24" }, "last_serial": 4257099, "releases": { "0.0.24": [ { "comment_text": "", "digests": { "md5": "cecbbba66398569ab715caaee91d8e0c", "sha256": "dcba442e7535d8458d2dda633be55b6b2f250bee07ca71cafa806622c5301d42" }, "downloads": -1, "filename": "topican-0.0.24-py3-none-any.whl", "has_sig": false, "md5_digest": "cecbbba66398569ab715caaee91d8e0c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 89967, "upload_time": "2018-09-10T10:02:28", "url": "https://files.pythonhosted.org/packages/9e/94/360fd74e6ea53e0782d9913019a341faa8e20c4408e60ad9c25ed3494c90/topican-0.0.24-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "762dfe6f6b845d703f39422dd8560937", "sha256": "e90ee4acef4ae2c44ab54df2cec30e86bd5c33dfd5f0e492dec569707a0cbc07" }, "downloads": -1, "filename": "topican-0.0.24.tar.gz", "has_sig": false, "md5_digest": "762dfe6f6b845d703f39422dd8560937", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 87301, "upload_time": "2018-09-10T10:02:30", "url": "https://files.pythonhosted.org/packages/11/80/cd1c7e65c69ed429cdccff1715959ed8f6c62206f076195f09170a1d31f3/topican-0.0.24.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": 
"cecbbba66398569ab715caaee91d8e0c", "sha256": "dcba442e7535d8458d2dda633be55b6b2f250bee07ca71cafa806622c5301d42" }, "downloads": -1, "filename": "topican-0.0.24-py3-none-any.whl", "has_sig": false, "md5_digest": "cecbbba66398569ab715caaee91d8e0c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 89967, "upload_time": "2018-09-10T10:02:28", "url": "https://files.pythonhosted.org/packages/9e/94/360fd74e6ea53e0782d9913019a341faa8e20c4408e60ad9c25ed3494c90/topican-0.0.24-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "762dfe6f6b845d703f39422dd8560937", "sha256": "e90ee4acef4ae2c44ab54df2cec30e86bd5c33dfd5f0e492dec569707a0cbc07" }, "downloads": -1, "filename": "topican-0.0.24.tar.gz", "has_sig": false, "md5_digest": "762dfe6f6b845d703f39422dd8560937", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 87301, "upload_time": "2018-09-10T10:02:30", "url": "https://files.pythonhosted.org/packages/11/80/cd1c7e65c69ed429cdccff1715959ed8f6c62206f076195f09170a1d31f3/topican-0.0.24.tar.gz" } ] }