{ "info": { "author": "Shane Stephenson / metriczulu", "author_email": "stephenson.shane.a@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Word Vector Embedding Package\n\n**wordvecpy** is a library for processing text data, tokenizing it, and building word vector dictionaries and whole word vector embeddings from the corpus text.\n\n**TextProcessor** takes a corpus of unprocessed text and processes it for use with word vectors. Punctuation, stopwords, substitutions, contractions, and lemmatization can all be customized. TextProcessor can store a processed (or unprocessed) text corpus (with or without it's associated labels) in a specially designed text format that can interact with other classes in the library. This gives extra functionality when dealing with large text corpii which take up a lot of local memory that is needed for other processes, like building the vector embedded forms of the corpus. The **LoadCorpus** class allows pulling only selected chunks of the dataset into memory for processing at one time, so large files can be processed in small chunks that fit in your computer's memory. **Chunkify** works with this and with other datasets completely stored in memory to split a corpus into chunks and generate vector embeddings one chunk at a time (vector embeddings for multiple docs can quickly take up all of a computer's memory if you don't manage the memory well).\n\n**VectorDictionary** loads pretrained word embeddings from .txt files so they can be used with other classes. Every class in this package that requires a vector dictionary can take a pymagnitude vector or a VectorDictionary object.\n\n**Vectokenizer** and **FastVectokenizer** both convert processed text corpus into integer embeddings and create vector dictionaries for those associated integer embeddings. Both classes do the exact same thing but FastVectokenizer requires Keras to create integer embeddings and Vectokenizer does not. Both classes allow you to easily and quickly generate necessary integer embeddings and word vector dictionaries to import directly into a Keras embedding layer. I've considered removing **Vectokenizer** but I have future plans for it's functionality that cannot be easily done using keras to tokenize the corpus.\n\n**VectorEmbedder** is where the magic happens and a text corpus (or chunk of a text corpus) is embedded with word vectors. This can be done with both static vector dictionaries like Stanford's various GLoVe embeddings or with dynamic embeddings that are context dependent like Google's ELMO embeddings.\n\n## Current Version is 1.0\n\nThe current version 1.0 is significantly different from the previous version. Multiple classes from the previous version were removed and redesigned or recombined with new classes in a more organic manner. I've done significant testing on this version to make sure everything works properly but it's obviously very time consuming to test every possible combination of interactions. If you come across a use case where wordvecpy fails, *please please please* let me know.\n\n## Installation\n\nUse the package manager [pip](https://pip.pypa.io/en/stable/) to install wordvecpy.\n\n```bash\npip install wordvecpy\n```\n\n## Usage\n\nSome use cases can be found in the various python notebooks in my exploration of the Rotten Tomatoes dataset found here: https://github.com/metriczulu/rotten-tomato-reviews\n\n## Future Plans\n\nCurrently working on functionality to reduce the size of integer embeddings by clustering words based on their vector representations. This will hopefully allow smaller vector dictionaries to be used while maintaining good functionality.\n\nAlso working on functionality to build multi-channel (or concatenated, your choice) embeddings by combining multiple embedding frameworks together.\n\nCurrently, it is possible to use word vector embeddings with PyTorch by converting all documents to embeddings with **VectorEmbedder**. There is a better, more organic way to work with PyTorch than this and I plan on implementing it once I'm satisfied with the overall functionality of the package. Currently, only Keras (and TF with it) are directly supported as it was the easiest library to get the package functioning and out the door quickest.\n\n**Vectokenizer** currently has a memory intensive ad-hoc design (which is why **FastVectokenizer** is currently much better). I will be reworking it so that it is in-line with **FastVectokenizer**'s speed while providing the additional capability of clustering words by vector to reduce dictionary size mentioned above.\n\nI also need to get the comments and documentation up to par to make it easier for others to get it working.\n\n## License\n\nNo license but I'm just throwing this out there: if this library fails miserably and turns your computer into a black hole which swallows your life's work, I'd like to say first and foremost, I'm sorry. However, use at your own risk.\n\n# wordvecpy", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/metriczulu/wordvecpy/archive/v1.1.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/metriczulu/wordvecpy", "keywords": "", "license": "All yours bro", "maintainer": "", "maintainer_email": "", "name": "wordvecpy", "package_url": "https://pypi.org/project/wordvecpy/", "platform": "", "project_url": "https://pypi.org/project/wordvecpy/", "project_urls": { "Download": "https://github.com/metriczulu/wordvecpy/archive/v1.1.tar.gz", "Homepage": "https://github.com/metriczulu/wordvecpy" }, "release_url": "https://pypi.org/project/wordvecpy/1.1/", "requires_dist": null, "requires_python": "", "summary": "Package for working with word vector embeddings", "version": "1.1" }, "last_serial": 5423555, "releases": { "0.71": [ { "comment_text": "", "digests": { "md5": "377d71c89cf7b0b092332478a0fc8e12", "sha256": "47ed0cdfc64c534ed5de1f9fbab53ef81d334858be821157cfa7feb621f6db18" }, "downloads": -1, "filename": "wordvecpy-0.71.tar.gz", "has_sig": false, "md5_digest": "377d71c89cf7b0b092332478a0fc8e12", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11845, "upload_time": "2019-06-14T01:36:34", "url": "https://files.pythonhosted.org/packages/42/78/f00bfe99f62fff9fb2dc4c4a73869a7fbaa9f008e8a04301b18a4630745d/wordvecpy-0.71.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "546ec1907642de6c05012907ba964e2a", "sha256": "9c0c1a3631526dac533ebdc14d40a36ef6d3fd20a4ae77a8d50050ccc7c492aa" }, "downloads": -1, "filename": "wordvecpy-1.1.tar.gz", "has_sig": false, "md5_digest": "546ec1907642de6c05012907ba964e2a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16360, "upload_time": "2019-06-20T02:45:14", "url": "https://files.pythonhosted.org/packages/07/68/36582ef2c56347b613c3ab77b9526de8014f2e329de4e1f3c7cc18067912/wordvecpy-1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "546ec1907642de6c05012907ba964e2a", "sha256": "9c0c1a3631526dac533ebdc14d40a36ef6d3fd20a4ae77a8d50050ccc7c492aa" }, "downloads": -1, "filename": "wordvecpy-1.1.tar.gz", "has_sig": false, "md5_digest": "546ec1907642de6c05012907ba964e2a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16360, "upload_time": "2019-06-20T02:45:14", "url": "https://files.pythonhosted.org/packages/07/68/36582ef2c56347b613c3ab77b9526de8014f2e329de4e1f3c7cc18067912/wordvecpy-1.1.tar.gz" } ] }