{
    "info": {
        "author": "Richard Smith",
        "author_email": "randkego@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 2 - Pre-Alpha",
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3"
        ],
        "description": "# Topican - topic analyzer\n\nfrom  the command line:\n```python3\ntopican_by_nouns_on_csv  \n```\nIdentify topics by assuming topics can be identified from Nouns and a \"context\" word:  \n- [spaCy](https://spacy.io/) is used to identify Nouns (including Proper nouns) in the text  \n- nltk WordNet and spaCy are used to group similar nouns together (WordNet \"hyponyms\" are checked first; spaCy similarity is used if a hyponym is not found)  \n- the top context words are then found for each noun  \n- Output is a list of noun groups and associated context words, in order of frequency  \n- The output also indicates the nouns that were grouped together\n\nFor example, the text \"I like python\", \"I love Python\", and \"I like C\" would be analysed as having 2 topic groups \"_python\" and \"_C\":\n```python3\n    '_python', 2: [('like', 1), ('love', 1),]    {('python', 2), }\n    '_C', 1: [('like', 1), ]    {('C', 1), }\n```\n\n## Meta\nRichard Smith \u2013 randkego@gmail.com\n\nDistributed under the MIT license. See ``LICENSE`` for more information.\n\n[https://github.com/randkego/topican](https://github.com/randkego/topican)\n\n## Installation\n\nPre-requisites (Linux and Windows):\n\n```sh\npip3 install topican\n\n# Install spaCy's large English language model\n# ** Warning: this requires approx 1GB of disk space\npython3 -m spacy download en_core_web_lg\n\n```\n\nNotes: Additional pre-requisites for Windows:  \n- ```install spacy``` will fail if Microsoft Visual C++ is not already installed \n([https://visualstudio.microsoft.com/visual-cpp-build-tools/](https://visualstudio.microsoft.com/visual-cpp-build-tools/) may help in this case)  \n- ```spaCy download en_core_web_lg``` may be unable to create a symbolic link. This can be manually created if required\n\n\n## Usage\nfrom the command line:\n```python3\nusage: topican_by_nouns_on_csv [-h]\n                               filepath text_col exclude_words\n                               top_n_noun_groups top_n_words max_hyponyms\n                               max_hyponym_depth sim_threshold\n\npositional arguments:\n  filepath           path of CSV file\n  text_col           name of text column in CSV file\n  exclude_words      words to exclude: list of words | True to just ignore\n                     NLTK stop-words | False | None\n  top_n_noun_groups  number of noun groups to find (0 to find all\n                     noun/'synonym' groups)\n  top_n_words        number of associated words to print for each noun group\n                     (0 to print all words)\n  max_hyponyms       maximum number of hyponyms a word may have before it is\n                     ignored - use this to exclude very general words that may\n                     not convey useful information (0 to have no limit on the\n                     number of hyponyms a word may have)\n  max_hyponym_depth  level of hyponym to extract (0 to extract all hyponyms)\n  sim_threshold      spaCy similarity level that words must reach to qualify\n                     as being similar\n\noptional arguments:\n  -h, --help         show this help message and exit\n```\n\nas a function:\n```python3\ntopican.print_words_associated_with_common_noun_groups(\n    nlp, name, free_text_Series, exclude_words, top_n_noun_groups, top_n_words, max_hyponyms, max_hyponym_depth, sim_threshold)\n```\n- nlp: spaCy nlp object - this must be initialised with a language model that includes the word vectors\n- name: descriptive name for free_text_Series\n- free_text_Series: pandas Series of text in which to find the noun groups and associated words\n- exclude_words: to ignore certain words, e.g. not so useful 'stop words' or artificial words.  \n  This should take one of the following values:  \n  <nbsp>- True: to ignore NTLK stop-words and their capitalizations  \n  <nbsp>- A list of words to exclude  \n  <nbsp>- False or None otherwise\n- top_n_noun_groups: number of noun groups to find (specify 'None' to find all noun/'synonym' groups)\n- top_n_words: number of words that are associated with each noun group (specify 'None' for all words)\n- max_hyponyms: the maximum number of hyponyms a word may have before it is ignored (this is used to\n  exclude very general words that may not convey useful information: specify 'None' for no restriction)\n- max_hyponym_depth: the level of hyponym to extract (specify 'None' to find all levels)\n- sim_threshold: the spaCy similarity level that words must reach to qualify as being a similar word\n\n\n## Usage examples\nfrom the command line:\n```python3\ntopican_by_nouns_on_csv test.csv text_col None 10 0 100 1 0.7\n```\n\nfunction:\n```python3\n# Some text to test\nimport pandas as pd\ntest_df = pd.DataFrame({'Text_col' : [\"I love Python\", \"I really love python\", \"I like python.\", \"python\", \"I like C but I prefer Python\", \"I don't like C any more\", \"I don't like python\", \"I really don't like C\"]})\n\n# Download NLTK stop-words if you want them in exclude_words\nimport nltk\nnltk.download('stopwords')\n\n# Load spaCy's large English language model (the large model is required to be able to use similarity)\n# ** Warning: this requires approx 1.8GB of RAM\nimport spacy\nnlp = spacy.load('en_core_web_lg')\n\nimport topican\ntopican.print_words_associated_with_common_noun_groups(nlp, \"test\", test_df['Text_col'], False, 10, None, 100, 1, 0.7)\n```\n![alt text](images/readme_usage_output.png \"topican usage example\")\n\n## Release History\n\n* 0.0.17\n    * First release to GitHub\n* 0.0.18\n    * Updates to README.md to note Windows install pre-requisites and the need to download wordnet\n* 0.0.19\n    * Add script topican_by_nouns_on_csv to apply print_words_associated_with_common_noun_groups to a text column of a CSV file\n    * function get_top_word_groups_by_synset_then_similarity: allow max_hyponyms and n_word_groups to be None to indicate no restriction on them\n    * function print_words_associated_with_common_noun_groups: do not list words that will be excluded\n* 0.0.20\n    * Update setup.py to add a topican_by_nouns_on_csv as an entry_point to console_scripts to be able to call that scipt directly\n* 0.0.21\n    * Update setup.py to add the packages required for installation\n* 0.0.22\n    * topican_by_nouns_on_csv.py: fix main signature and add param to parser.parse_args so that topican_by_nouns_on_csv can be called from the command line; remove nargs='+' type for exclude_words\n* 0.0.23\n    * topican_by_nouns_on_csv.py: if exclude_words is True, nltk.download('stopwords')\n* 0.0.24\n    * README.md: in the usage example for the function, download 'stopwords' not 'wordnet'\n\n## Contributing\n\n1. Fork it (<https://github.com/randkego/topican/fork>)\n2. Create your feature branch (`git checkout -b feature/fooBar`)\n3. Commit your changes (`git commit -am 'Add some fooBar'`)\n4. Push to the branch (`git push origin feature/fooBar`)\n5. Create a new Pull Request\n\n<!-- Markdown link & img dfn's -->\n[wiki]: https://github.com/randkego/topican/wiki\n\n\n",
        "description_content_type": "",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/randkego/topican",
        "keywords": "",
        "license": "MIT",
        "maintainer": "",
        "maintainer_email": "",
        "name": "topican",
        "package_url": "https://pypi.org/project/topican/",
        "platform": "",
        "project_url": "https://pypi.org/project/topican/",
        "project_urls": {
            "Homepage": "https://github.com/randkego/topican"
        },
        "release_url": "https://pypi.org/project/topican/0.0.24/",
        "requires_dist": [
            "pandas",
            "nltk",
            "spacy"
        ],
        "requires_python": "",
        "summary": "Topic analyser",
        "version": "0.0.24"
    },
    "last_serial": 4257099,
    "releases": {
        "0.0.24": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "cecbbba66398569ab715caaee91d8e0c",
                    "sha256": "dcba442e7535d8458d2dda633be55b6b2f250bee07ca71cafa806622c5301d42"
                },
                "downloads": -1,
                "filename": "topican-0.0.24-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "cecbbba66398569ab715caaee91d8e0c",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 89967,
                "upload_time": "2018-09-10T10:02:28",
                "url": "https://files.pythonhosted.org/packages/9e/94/360fd74e6ea53e0782d9913019a341faa8e20c4408e60ad9c25ed3494c90/topican-0.0.24-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "762dfe6f6b845d703f39422dd8560937",
                    "sha256": "e90ee4acef4ae2c44ab54df2cec30e86bd5c33dfd5f0e492dec569707a0cbc07"
                },
                "downloads": -1,
                "filename": "topican-0.0.24.tar.gz",
                "has_sig": false,
                "md5_digest": "762dfe6f6b845d703f39422dd8560937",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 87301,
                "upload_time": "2018-09-10T10:02:30",
                "url": "https://files.pythonhosted.org/packages/11/80/cd1c7e65c69ed429cdccff1715959ed8f6c62206f076195f09170a1d31f3/topican-0.0.24.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "cecbbba66398569ab715caaee91d8e0c",
                "sha256": "dcba442e7535d8458d2dda633be55b6b2f250bee07ca71cafa806622c5301d42"
            },
            "downloads": -1,
            "filename": "topican-0.0.24-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cecbbba66398569ab715caaee91d8e0c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 89967,
            "upload_time": "2018-09-10T10:02:28",
            "url": "https://files.pythonhosted.org/packages/9e/94/360fd74e6ea53e0782d9913019a341faa8e20c4408e60ad9c25ed3494c90/topican-0.0.24-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "762dfe6f6b845d703f39422dd8560937",
                "sha256": "e90ee4acef4ae2c44ab54df2cec30e86bd5c33dfd5f0e492dec569707a0cbc07"
            },
            "downloads": -1,
            "filename": "topican-0.0.24.tar.gz",
            "has_sig": false,
            "md5_digest": "762dfe6f6b845d703f39422dd8560937",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 87301,
            "upload_time": "2018-09-10T10:02:30",
            "url": "https://files.pythonhosted.org/packages/11/80/cd1c7e65c69ed429cdccff1715959ed8f6c62206f076195f09170a1d31f3/topican-0.0.24.tar.gz"
        }
    ]
}