{ "info": { "author": "Daniel McDonald", "author_email": "mcddjx@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "[](https://travis-ci.org/interrogator/buzz)\n[](https://codecov.io/gh/interrogator/buzz)\n[](https://buzz.readthedocs.io/en/latest/)\n[](https://badge.fury.io/py/buzz)\n[](https://github.com/python/black)\n\n# buzz: python corpus linguistics\n\n> Version 3.1.0\n\n> *buzz* is a linguistics tool for parsing and then exploring plain or metadata-rich text. This README provides an overview of functionality. Visit the [full documentation](https://buzz.readthedocs.io/en/latest/) for a more complete user guide.\n\n## Install\n\n```bash\npip install buzz\n# or, from source:\ngit clone https://github.com/interrogator/buzz\ncd buzz\npip install .\n```\n\n## Frontend: *buzzword*\n\n*buzz* has an optional frontend, *buzzword*, for exploring parsed corpora. To use it, install:\n\n```bash\npip install buzz[word]\n```\n\nThen, generate a workspace, `cd` into it, and start:\n\n```bash\npython -m buzzword.create workspace\ncd workspace\npython -m buzzword\n```\n\nA URL will be printed, which can be used to access the app in your browser.\n\nMore complete documentation is available [here](https://buzz.readthedocs.io/en/latest/buzzword/), as well as from the main page of the app itself.\n\n## Creating corpora\n\n*buzz* models plain text or [CONLL-U formatted](https://universaldependencies.org/format.html) files. The remainder of this guide assumes that you have plain text data and want to process and analyse it using *buzz*.\n\nFirst, you need to make sure that your corpus is in a format and structure that *buzz* can work with. This simply means putting all your text files into a folder, optionally organised into subfolders (representing subcorpora).\n\nText files should be plain text, with a `.txt` extension. Importantly, though, they can be augmented with metadata, which can be stored in two ways. 
First, speaker names can be added by using capital letters and a colon, much like in a script. Second, you can use XML-style metadata markup. Here is an example file, `sopranos/s1/e01.txt`:\n\n```html\nMELFI: My understanding from Dr. Cusamano, your family physician, is you collapsed? Possibly a panic attack? \nTONY: They said it was a panic attack \nMELFI: You don't agree that you had a panic attack? \n...\n```\n\nIf you add a `meta` element at the start of the text file, it will be understood as file-level metadata. For sentence-specific metadata, the element should follow the sentence, ideally at the end of a line. Span- and token-level metadata should wrap the tokens you want to annotate. All metadata will be searchable later, so the more you can add, the more you can do with your corpus.\n\nTo load corpora as *buzz* objects:\n\n```python\nfrom buzz import Corpus\n\ncorpus = Corpus(\"sopranos\")\n```\n\nYou can also make virtual corpora from strings, optionally saving the corpus to disk:\n\n```python\ncorpus = Corpus.from_string(\"Some sentences here.\", save_as=\"corpusname\")\n```\n\n## Parsing\n\n*buzz* uses [`spaCy`](https://spacy.io/) to parse your text, saving the results as CONLL-U files to your hard drive. 
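To give a sense of the output format, each parsed sentence is stored as a block of CONLL-U lines: comment lines carry sentence-level metadata, and each token row holds the standard ten fields (ID, form, lemma, UPOS, XPOS, feats, head, deprel, deps, misc), tab-separated in the real files. A simplified sketch of one sentence (field values illustrative, block truncated):

```
# sent_id = 1
# text = My understanding from Dr. Cusamano, your family physician, is you collapsed?
# speaker = MELFI
1   My             -PRON-         DET    PRP$   _   2    poss        _   _
2   understanding  understanding  NOUN   NN     _   13   nsubjpass   _   _
3   from           from           ADP    IN     _   2    prep        _   _
...
```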
Parsing a corpus is very simple:\n\n```python\nparsed = corpus.parse()\n# if you don't need constituency parses, you can speed things up a lot with:\nparsed = corpus.parse(cons_parser=None)\n```\n\nYou can also parse text strings, optionally passing in a name under which to save the corpus:\n\n```python\nfrom buzz import Parser\n\nparser = Parser(cons_parser=\"benepar\")\nfor text in list_of_texts:\n    dataset = parser.run(text, save_as=False)\n```\n\nThe main advantages of parsing with *buzz* are that:\n\n* Parse results are stored as valid CONLL-U 2.0\n* Metadata is respected, and transferred into the output files\n* You can do constituency and dependency parsing at the same time (with parse trees being stored as CONLL-U metadata)\n\nThe `parse()` method returns another `Corpus` object, representing the newly created files. We can explore this corpus via commands like:\n\n```python\nparsed.subcorpora.s1.files.e01\nparsed.files[0]\nparsed.subcorpora.s1[:5]\nparsed.subcorpora[\"s1\"]\n```\n\n### Parse command\n\nYou can also parse corpora without entering a Python session by using the `parse` command:\n\n```bash\nparse --language en --cons-parser=benepar|bllip|none path/to/conll/files\n# or\npython -m buzz.parse path/to/conll/files\n```\n\nBoth commands will create `path/to/conll/files-parsed`, a folder containing CONLL-U files.\n\n### Loading corpora into memory\n\nYou can use the `load()` method to load a whole or partial corpus into memory as a `Dataset` object, which extends the [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):\n\n```python\nloaded = parsed.load()\n```\n\nYou don't need to load corpora into memory to work on them, but it's great for small corpora. As a rule of thumb, datasets under a million words should be easily loadable on a personal computer.\n\nThe loaded corpus is a `Dataset` object, which is based on the pandas DataFrame. 
So you can use pandas methods on it:\n\n```python\nloaded.head()\n```\n
| file | s | i | w | l | x | p | g | f | e | aired | emph | ent_id | ent_iob | ent_type | exposition | interrogative_type | move | question | sent_id | sent_len | speaker | text | _n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text | 1 | 1 | My | -PRON- | DET | PRP$ | 2 | poss | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 0 |
|  |  | 2 | understanding | understanding | NOUN | NN | 13 | nsubjpass | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 1 |
|  |  | 3 | from | from | ADP | IN | 2 | prep | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 2 |
|  |  | 4 | Dr. | Dr. | PROPN | NNP | 5 | compound | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 3 |
|  |  | 5 | Cusamano | Cusamano | PROPN | NNP | 3 | pobj | _ | 10.01.1999 | _ | 3 | B | PERSON | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 4 |
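Because the loaded corpus behaves like a DataFrame, ordinary pandas operations work on it directly. A minimal sketch of the idea, using a hand-built toy frame with a few of the columns shown above (this stands in for a real loaded `Dataset`; it does not use the *buzz* API):

```python
import pandas as pd

# Toy stand-in for a loaded corpus: one row per token, with
# word (w), part-of-speech (x) and speaker columns.
tokens = pd.DataFrame(
    {
        "w": ["My", "understanding", "from", "Dr.", "Cusamano"],
        "x": ["DET", "NOUN", "ADP", "PROPN", "PROPN"],
        "speaker": ["MELFI"] * 5,
    }
)

# Ordinary pandas filtering: proper nouns spoken by MELFI
propn = tokens[(tokens["x"] == "PROPN") & (tokens["speaker"] == "MELFI")]
print(propn["w"].tolist())  # ['Dr.', 'Cusamano']
```

The same boolean-mask style carries over to any of the token or metadata columns in a real loaded corpus.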