{ "info": { "author": "sally14", "author_email": "sally14.doe@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "\n# Embeddings\n\n\nEmbedding generation with text preprocessing.\n\n\n## Preprocessor\n\nFor Word2Vec, we want a soft yet important preprocessing. We want to denoise the text while keeping as much variety and information as possible.\n\nPreprocesses the text/set of text in the following way :\n1. Detects and replaces numbers/float by a generic token 'FLOAT', 'INT'\n\n2. Add spaces in between punctuation so that tokenisation avoids adding 'word.' to the vocabulary instead of 'word', '.'\n\n3. Lowers words\n\n4. Recursive word phrases detection : with a simple probabilistic rule, gathers the tokens 'new', york' to a single token 'new_york'.\n\n5. Frequency Subsampling : discards unfrequent words with a probability depending on their frequency.\n\n Outputs a vocabulary file and the modified files.\n\nUsage example :\n\n\n```python\nfrom preprocessing.preprocessor import PreprocessorConfig, Preprocessor\nconfig = PreprocessorConfig('/tmp/logdir')\nconfig.set_config(writing_dir='/tmp/outputs')\nconfig.save_config()\n\n\nprep = Preprocessor('/tmp/logdir')\nprep.fit('~/mydata/')\nprep.filter()\nprep.transform('~/mydata')\n```\n\n\n ## Word2Vec\n\nFor the Word2Vec, we just wrote a simple cli wrapper that takes the\npreprocessed files as an input, trains a Word2Vec model with gensim and writes the vocab, embeddings .tsv files that can be visualized with tensorflow projector (http://projector.tensorflow.org/)\n\n\nUsage example:\n\n```bash\npython training_word2vec.py file_dir writing_dir\n```\n\n## Documentation \n\nDocumentation is available here : https://sally14.github.io/embeddings/\n\nTODO :\n- [ ] Clean code for CLI wrapper\n\n- [X] Also write a python Word2Vec model class so that user doesn't have to switch from python to cli\n\n- [ ] Also write a cli wrapper for preprocessing\n\n- [X] Memory leak in preprocessor.transform\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/sally14/embeddings", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "embeddings-prep", "package_url": "https://pypi.org/project/embeddings-prep/", "platform": "", "project_url": "https://pypi.org/project/embeddings-prep/", "project_urls": { "Homepage": "https://github.com/sally14/embeddings" }, "release_url": "https://pypi.org/project/embeddings-prep/0.1.0/", "requires_dist": [ "docopt (>=0.6.2)", "gensim (>=3.8.1)", "nltk (>=3.4.5)" ], "requires_python": ">=3.6", "summary": "A word2vec preprocessing and training package", "version": "0.1.0", "yanked": false, "yanked_reason": null }, "last_serial": 6029824, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "5ef26b5874b126769d3e3bfb5541297d", "sha256": "bf51e2f3ebcfe7b6039fd251f0d59df3beb097ad6e8267188f502cf8d6039c39" }, "downloads": -1, "filename": "embeddings_prep-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "5ef26b5874b126769d3e3bfb5541297d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 2120, "upload_time": "2019-10-25T14:07:08", "upload_time_iso_8601": "2019-10-25T14:07:08.872715Z", "url": 
"https://files.pythonhosted.org/packages/f7/a6/7b3af2603f78a6d240a996cb7b62b59a327ca3f3a4903750e0c50c46a2ae/embeddings_prep-0.1.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "3a351554c1e56e307fa31a7e32a80eb6", "sha256": "b6e9c1cca853dd8564817e26947f78779947827922bbe4cf41dd358c4a5ebc36" }, "downloads": -1, "filename": "embeddings-prep-0.1.0.tar.gz", "has_sig": false, "md5_digest": "3a351554c1e56e307fa31a7e32a80eb6", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1673, "upload_time": "2019-10-25T14:07:11", "upload_time_iso_8601": "2019-10-25T14:07:11.074859Z", "url": "https://files.pythonhosted.org/packages/6a/76/117d950bae3c6b0a371d0d6b1498c860e69e3112bae0b008090e0033ae9b/embeddings-prep-0.1.0.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5ef26b5874b126769d3e3bfb5541297d", "sha256": "bf51e2f3ebcfe7b6039fd251f0d59df3beb097ad6e8267188f502cf8d6039c39" }, "downloads": -1, "filename": "embeddings_prep-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "5ef26b5874b126769d3e3bfb5541297d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 2120, "upload_time": "2019-10-25T14:07:08", "upload_time_iso_8601": "2019-10-25T14:07:08.872715Z", "url": "https://files.pythonhosted.org/packages/f7/a6/7b3af2603f78a6d240a996cb7b62b59a327ca3f3a4903750e0c50c46a2ae/embeddings_prep-0.1.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "3a351554c1e56e307fa31a7e32a80eb6", "sha256": "b6e9c1cca853dd8564817e26947f78779947827922bbe4cf41dd358c4a5ebc36" }, "downloads": -1, "filename": "embeddings-prep-0.1.0.tar.gz", "has_sig": false, "md5_digest": "3a351554c1e56e307fa31a7e32a80eb6", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1673, "upload_time": "2019-10-25T14:07:11", "upload_time_iso_8601": "2019-10-25T14:07:11.074859Z", "url": "https://files.pythonhosted.org/packages/6a/76/117d950bae3c6b0a371d0d6b1498c860e69e3112bae0b008090e0033ae9b/embeddings-prep-0.1.0.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }