corpy/__init__.py

corpy/morphodita/README.rst

================
corpy.morphodita
================

.. _overview:

Overview
========

A friendlier, if probably somewhat less efficient, wrapper around the default
SWIG-generated Python bindings for the `MorphoDiTa
<https://ufal.mff.cuni.cz/morphodita>`_ morphological tagging and lemmatization
framework. The target audiences are:

- beginner programmers interested in NLP
- seasoned programmers who want to use MorphoDiTa through a more Pythonic
  interface, without having to dig into the `API reference
  <https://ufal.mff.cuni.cz/morphodita/api-reference>`_ and the `examples`_,
  and who are not too worried about a possible performance hit compared with
  full manual control

Pre-trained tagging models for use with MorphoDiTa can be found `here`_.
Currently, Czech and English models are available. **Please respect their
CC BY-NC-SA 3.0 license!**

At the moment, only a subset of the functionality offered by the MorphoDiTa
API is available through ``corpy.morphodita`` (tokenization, tagging).

Usage
=====

If stuck, check out the docstrings of the modules and objects in the package
for more details. Or go directly to the code: these are just straightforward
wrappers, not rocket science :)

Tokenization
------------

In addition to tokenization, the MorphoDiTa tokenizers perform sentence
splitting at the same time.

The easiest way to get started is to import one of the following
pre-instantiated tokenizers from ``corpy.morphodita.tokenizer``: ``vertical``,
``czech``, ``english`` or ``generic``, and use it like so:

.. code:: python

   >>> from corpy.morphodita.tokenizer import generic
   >>> for sentence in generic.tokenize("foo bar baz"):
   ...     print(sentence)
   ...
   ['foo', 'bar', 'baz']

Tagging
-------

**NOTE**: Unlike tokenization, tagging in MorphoDiTa requires you to supply
your own pre-trained tagging models (see :ref:`overview` above).

Initialize a new tagger:

.. code:: python

   >>> from corpy.morphodita import Tagger
   >>> t = Tagger("path/to/czech-morfflex-pdt-160310.tagger")

Sentence-split, tokenize, tag and lemmatize a text represented as a string:

.. code:: python

   >>> list(t.tag("Je zima. Bude sněžit."))
   [Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
    Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
    Token(word='.', lemma='.', tag='Z:-------------'),
    Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
    Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
    Token(word='.', lemma='.', tag='Z:-------------')]
   >>> list(t.tag("Je zima. Bude sněžit.", sents=True))
   [[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
     Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
     Token(word='.', lemma='.', tag='Z:-------------')],
    [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
     Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
     Token(word='.', lemma='.', tag='Z:-------------')]]

Tag and lemmatize an already sentence-split and tokenized piece of text,
represented as an iterable of iterables of strings:

.. code:: python

   >>> list(t.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
   [Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
    Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
    Token(word='.', lemma='.', tag='Z:-------------'),
    Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
    Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
    Token(word='.', lemma='.', tag='Z:-------------')]
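The ``Tagger.tag*()`` methods also accept a ``guesser`` flag (use the
morphological guesser bundled with the tagger, if any) and a ``convert``
argument, which applies one of MorphoDiTa's tagset converters
("pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id") to the lemmas
and/or tags before they are yielded. A minimal sketch; the exact output depends
on the tagging model, so it is only hinted at in the comment:

.. code:: python

   >>> # with "strip_lemma_id", lemmas like 'zima-1' should come back without
   >>> # the numeric ID, i.e. as plain 'zima'
   >>> list(t.tag("Je zima. Bude sněžit.", guesser=True, convert="strip_lemma_id"))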
corpy/morphodita/__init__.pyimport logging logging.basicConfig(level=logging.INFO) log = logging.getLogger(__package__) log.setLevel(logging.INFO) from .tagger import Tagger # noqa: E402, F401 PK!5ocorpy/morphodita/tagger.pyfrom . import log from collections import namedtuple from collections.abc import Iterable from lazy import lazy from functools import lru_cache import ufal.morphodita as ufal Token = namedtuple("Token", "word lemma tag") class Tagger: """A MorphoDiTa morphological tagger and lemmatizer associated with particular set of tagging models. """ _NO_TOKENIZER = ( "No tokenizer defined for tagger {!r}! Please provide " "pre-tokenized and sentence-split input." ) _TEXT_REQS = ( "Please provide a string or an iterable of iterables (not " "strings!) of strings as the ``text`` parameter." ) def __init__(self, tagger): """Create a ``Tagger`` object. :param tagger: Path to the pre-compiled tagging models. :type tagger: str """ self._tpath = tagger log.info("Loading tagger.") self._tagger = ufal.Tagger.load(tagger) if self._tagger is None: raise RuntimeError("Unable to load tagger from {!r}!".format(tagger)) self._morpho = self._tagger.getMorpho() self._forms = ufal.Forms() self._lemmas = ufal.TaggedLemmas() self._tokens = ufal.TokenRanges() self._tokenizer = self._tagger.newTokenizer() if self._tokenizer is None: log.warn(self._NO_TOKENIZER.format(tagger)) @lazy def _vtokenizer(self): return ufal.Tokenizer_newVerticalTokenizer() @lazy def _pdt_to_conll2009_converter(self): return ufal.TagsetConverter_newPdtToConll2009Converter() @lazy def _strip_lemma_comment_converter(self): return ufal.TagsetConverter_newStripLemmaCommentConverter(self._morpho) @lazy def _strip_lemma_id_converter(self): return ufal.TagsetConverter_newStripLemmaIdConverter(self._morpho) @lru_cache(maxsize=16) def _get_converter(self, convert): try: converter = ( getattr(self, "_" + convert + "_converter") if convert is not None else None ) except AttributeError as e: converters = [ a[1:-10] for a in dir(self) if "converter" in a and a != "_get_converter" ] raise ValueError( "Unknown converter {!r}. Available converters: " "{!r}.".format(convert, converters) ) from e return converter def tag(self, text, sents=False, guesser=False, convert=None): """Perform morphological tagging and lemmatization on text. If ``text`` is a string, sentence-split, tokenize and tag that string. If it's an iterable of iterables (typically a list of lists), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization. :param text: Input text. :type text: Either str (tokenization is left to the tagger) or iterable of iterables (of str), representing individual sentences. :param sents: Whether to signal sentence boundaries by outputting a sequence of lists (sentences). :type sents: bool :param guesser: Whether to use the morphological guesser provided with the tagger (if available). :type guesser: bool :param convert: Conversion strategy to apply to lemmas and / or tags before outputting them. :type convert: str, one of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required >>> list(t.tag("Je zima. 
Bude sněžit.")) [Token(word='Je', lemma='být', tag='VB-S---3P-AA---'), Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'), Token(word='.', lemma='.', tag='Z:-------------'), Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'), Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'), Token(word='.', lemma='.', tag='Z:-------------')] >>> list(t.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']])) [Token(word='Je', lemma='být', tag='VB-S---3P-AA---'), Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'), Token(word='.', lemma='.', tag='Z:-------------'), Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'), Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'), Token(word='.', lemma='.', tag='Z:-------------')] >>> list(t.tag("Je zima. Bude sněžit.", sents=True)) [[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'), Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'), Token(word='.', lemma='.', tag='Z:-------------')], [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'), Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'), Token(word='.', lemma='.', tag='Z:-------------')]] """ if isinstance(text, str): yield from self.tag_untokenized(text, sents, guesser, convert) # The other accepted type of input is an iterable of iterables of # strings, but we only do a partial check whether the top-level object # is an Iterable, because it would have to be consumed in order to # inspect its first item. A second check which signals the frequent # mistake of passing an iterable of strings (which results in tagging # each character separately) occurs in ``Tagger.tag_tokenized()``. elif isinstance(text, Iterable): yield from self.tag_tokenized(text, sents, guesser, convert) else: raise TypeError(self._TEXT_REQS) def tag_untokenized(self, text, sents=False, guesser=False, convert=None): """This is the method ``Tagger.tag()`` delegates to when ``text`` is a string. See docstring for ``Tagger.tag()`` for details about parameters. """ converter = self._get_converter(convert) if self._tokenizer is None: raise RuntimeError(self._NO_TOKENIZER.format(self._tpath)) yield from self._tag(text, self._tokenizer, sents, guesser, converter) def tag_tokenized(self, text, sents=False, guesser=False, convert=None): """This is the method ``Tagger.tag()`` delegates to when ``text`` is an iterable of iterables of strings. See docstring for ``Tagger.tag()`` for details about parameters. """ converter = self._get_converter(convert) for sent in text: # refuse to process if sent is a string or not an iterable, because # that would result in tagging each character separately, which is # nonsensical if isinstance(sent, str) or not isinstance(sent, Iterable): raise TypeError(self._TEXT_REQS) yield from self._tag( "\n".join(sent), self._vtokenizer, sents, guesser, converter ) def _tag(self, text, tokenizer, sents, guesser, converter): tagger, forms, lemmas, tokens = ( self._tagger, self._forms, self._lemmas, self._tokens, ) tokenizer.setText(text) while tokenizer.nextSentence(forms, tokens): tagger.tag(forms, lemmas, guesser) s = [] for i in range(len(lemmas)): lemma = lemmas[i] t = tokens[i] word = text[t.start : t.start + t.length] # noqa: E203 if converter is not None: converter.convert(lemma) token = Token(word, lemma.lemma, lemma.tag) if sents: s.append(token) else: yield token if sents: yield s PK!&yjcorpy/morphodita/tokenizer.py"""An interface to MorphoDiTa tokenizers. In addition to tokenization, the MorphoDiTa tokenizers perform sentence splitting at the same time. 
The easiest way to get started is to import one of the following pre-instantiated tokenizers: ``vertical``, ``czech``, ``english`` or ``generic``, and use it like so: >>> from corpy.morphodita.tokenizer import generic >>> for sentence in generic.tokenize("foo bar baz"): ... print(sentence) ... ['foo', 'bar', 'baz'] If you want more flexibility, e.g. for tokenizing several in texts in parallel with the same type of tokenizer, then create your own objects (each tokenizer can only be tokenizing one text at a time!): >>> from corpy.morphodita.tokenizer import Tokenizer >>> my_tokenizer1 = Tokenizer("generic") >>> my_tokenizer2 = Tokenizer("generic") """ import ufal.morphodita as ufal class Tokenizer: """A wrapper API around the tokenizers offered by MorphoDiTa. Usage: >>> t = Tokenizer("generic") >>> for sentence in t.tokenize("foo bar baz"): ... print(sentence) ... ['foo', 'bar', 'baz'] Available tokenizers (specified by the first parameter to the ``Tokenizer()`` constructor): "vertical", "czech", "english", "generic". See the ``new*`` static methods on the MorphoDiTa ``tokenizer`` class described at https://ufal.mff.cuni.cz/morphodita/api-reference#tokenizer for details. """ def __init__(self, tokenizer_type): """Create a new tokenizer instance. :param tokenizer_type: Type of the requested tokenizer, depends on the tokenizer constructors made available on the ``tokenizer`` class in MorphoDiTa. Typically one of "vertical", "czech", "english" and "generic". :type tokenizer_type: str """ constructor = "new" + tokenizer_type.capitalize() + "Tokenizer" self._tokenizer = getattr(ufal.Tokenizer, constructor)() self._forms = ufal.Forms() self._tokens = ufal.TokenRanges() def tokenize(self, text): """Tokenize ``text``. :param text: Text to tokenize. :type text: str The method returns a generator of sentences as lists of strings. The underlying tokenizer object is shared by all such generators, which means this probably doesn't do what you want it to: >>> t = Tokenizer("generic") >>> toks1 = t.tokenize("Foo bar baz. Bar baz qux.") >>> toks2 = t.tokenize("A b c. D e f. G h i.") >>> for s1, s2 in zip(toks1, toks2): ... for t1, t2 in zip(s1, s2): ... print(t1, t2) Foo A bar b baz c . . D G e h f i . . What happens in the ``zip()`` call is that the underlying tokenizer's text is first set to ``"Foo bar baz. Bar baz qux."``, and the sentence ``['Foo', 'bar', 'baz', '.']`` is yielded by ``toks1``. Then it is set to ``"A b c. D e f."`` and ``['A', 'b', 'c', '.']`` is yielded by ``toks2``. These two values are zipped and bound to ``(s1, s2)`` in the first iteration of the outer for-loop. From this point on, the text doesn't change anymore (we're in the loop yielding individual sentences), so **toks1** (now using the same text as ``toks2``) yields ``['D', 'e', 'f', '.']`` and **toks2** the last ``['G', 'h', 'i', '.']``. These become ``(s1, s2)`` in the second and final iteration of the for-loop, because after this, both ``toks1`` and ``toks2`` (since they ended up with the same text) are exhausted. For the use case above, either create multiple tokenizers: >>> t1 = Tokenizer("generic") >>> t2 = Tokenizer("generic") >>> toks1 = t1.tokenize("Foo bar baz. Bar baz qux.") >>> toks2 = t2.tokenize("A b c. D e f. G h i.") >>> for s1, s2 in zip(toks1, toks2): ... for t1, t2 in zip(s1, s2): ... print(t1, t2) Foo A bar b baz c . . Bar D baz e qux f . . Or exhaust the generators and zip the resulting lists: >>> t = Tokenizer("generic") >>> toks1 = list(t.tokenize("Foo bar baz. 
Bar baz qux.")) >>> toks2 = list(t.tokenize("A b c. D e f. G h i.")) >>> for s1, s2 in zip(toks1, toks2): ... for t1, t2 in zip(s1, s2): ... print(t1, t2) Foo A bar b baz c . . Bar D baz e qux f . . """ # this is more elegant than just segfaulting in the MorphoDiTa C++ library if None is # passed... if not isinstance(text, str): raise TypeError( "``text`` should be a str, you passed in {}.".format(type(text)) ) self._tokenizer.setText(text) while self._tokenizer.nextSentence(self._forms, self._tokens): yield list(self._forms) vertical = Tokenizer("vertical") czech = Tokenizer("czech") english = Tokenizer("english") generic = Tokenizer("generic") PK!rl corpy/phonetics/README.rst=============== corpy.phonetics =============== Overview ======== Doing phonetics with Python. Czech rule-based grapheme-to-phoneme conversion =============================================== In addition to rules, an exception system is also implemented which makes it possible to capture less regular pronunciation patterns. Usage ----- The simplest public interface is the ``transcribe`` function. See its docstring for more information on the types of accepted input as well as on output options and other available customizations. Here are a few usage examples -- default output is SAMPA: .. code:: python >>> from corpy.phonetics import cs >>> cs.transcribe("máš hlad") [('m', 'a:', 'Z'), ('h\\', 'l', 'a', 't')] But other options including IPA are available: .. code:: python >>> cs.transcribe("máš hlad", alphabet="IPA") [('m', 'aː', 'ʒ'), ('ɦ', 'l', 'a', 't')] Hyphens can be used to prevent interactions between neighboring phones, e.g. assimilation of voicing: .. code:: python >>> cs.transcribe("máš -hlad") [('m', 'a:', 'S'), ('h\\', 'l', 'a', 't')] As you can see, these special hyphens get deleted in the process of transcription, so if you want a literal hyphen, it must be inside a token with either no alphabetic characters, or at least one other non-alphabetic character: .. code:: python >>> cs.transcribe("- --- -.- -hlad?") ['-', '---', '-.-', '-hlad?'] In general, tokens containing non-alphabetic characters (modulo the special treatment of hyphens described above) are passed through as is: .. code:: python >>> cs.transcribe("máš ? hlad") [('m', 'a:', 'Z'), '?', ('h\\', 'l', 'a', 't')] And you can even configure some of them to constitute a blocking boundary for interactions between phones (notice that unlike in the previous example, "máš" ends with a /S/ → assimilation of voicing wasn't allowed to spread past the ".."): .. code:: python >>> cs.transcribe("máš .. hlad", prosodic_boundary_symbols={".."}) [('m', 'a:', 'S'), '..', ('h\\', 'l', 'a', 't')] Finally, when the input is a single string, it's simply split on whitespace, but you can also provide your own tokenization. E.g. if your input string contains unspaced square brackets to mark overlapping speech, this is probably not the output you want: .. code:: python >>> cs.transcribe("[máš] hlad") ['[máš]', ('h\\', 'l', 'a', 't')] But if you pretokenize the input yourself according to rules that make sense in your situation, you're good to go: .. code:: python >>> cs.transcribe(["[", "máš", "]", "hlad"]) ['[', ('m', 'a:', 'Z'), ']', ('h\\', 'l', 'a', 't')] Acknowledgments =============== The choice of (X-)SAMPA and IPA transcription symbols follows the `guidelines `_ published by the Institute of Phonetics, Faculty of Arts, Charles University, Prague, which are hereby gratefully acknowledged. 
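Two further options of ``transcribe`` are worth a mention, both documented in
its docstring: the ``alphabet`` argument also accepts ``"CS"`` and ``"CNC"``
(Czech-style transcription alphabets defined in ``phones.tsv``), and
``hiatus=True`` inserts a transient /j/ between a high front vowel and a
following vowel. A hedged sketch; since the exact symbols depend on the bundled
phone tables, the expected effect is only described in the comments:

.. code:: python

   >>> # Czech transcription symbols instead of SAMPA, e.g. 'á' and 'ž'
   >>> cs.transcribe("máš hlad", alphabet="CS")
   >>> # a /j/ should be inserted between the /i/ and the following /a/
   >>> cs.transcribe("fiala", hiatus=True)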
PK!corpy/phonetics/__init__.pyPK!/ߺ"7"7corpy/phonetics/cs.py"""Perform rule-based phonetic transcription of Czech. Some frequent exceptions to the otherwise fairly regular orthography-to-phonetics mapping are overridden using a pronunciation lexicon. """ import unicodedata as ud from functools import lru_cache from operator import itemgetter from pathlib import Path from typing import Dict, Iterable, List, Optional, Set, Tuple, Union # noqa: F401 import regex as re # ------------------------------ Utils ------------------------------ def _filter_comments(lines): for line in lines: if not line.strip().startswith("#"): yield line def _load_phones(tsv: str) -> Dict[str, Dict[str, str]]: ans: Dict[str, Dict[str, str]] = {} lines = tsv.splitlines() header_, lines = lines[0], lines[1:] header = [h.lower() for h in header_.split("\t")] header.pop(0) for line_ in _filter_comments(lines): line = line_.split("\t") key = line.pop(0) val = ans.setdefault(key, {}) for k, v in zip(header, line): val[k] = v return ans def _load_substr2phones(tsv: str, allowed: Dict) -> Dict[str, List[str]]: ans: Dict[str, List[str]] = {} lines = tsv.splitlines() lines.pop(0) for line in _filter_comments(lines): substr, phones = line.split("\t") phones_for_substr = ans.setdefault(substr, []) for ph in phones.split(): assert ph in allowed, f"Unexpected phone {ph!r}" phones_for_substr.append(ph) return ans def _load_voicing_pairs( tsv: str, allowed: Dict ) -> Tuple[Dict[str, str], Dict[str, str], Set[str], Set[str]]: devoiced2voiced, voiced2devoiced = {}, {} lines = tsv.splitlines() lines.pop(0) for line in _filter_comments(lines): devoiced, voiced = line.split("\t") assert devoiced in allowed, f"Unexpected phone {devoiced!r}" assert voiced in allowed, f"Unexpected phone {voiced!r}" devoiced2voiced[devoiced] = voiced voiced2devoiced[voiced] = devoiced trigger_voicing = set(voiced2devoiced.keys()) trigger_voicing.remove("v") trigger_voicing.remove("P\\") trigger_devoicing = set(devoiced2voiced.keys()) trigger_devoicing.remove("Q\\") return devoiced2voiced, voiced2devoiced, trigger_voicing, trigger_devoicing def _create_substr_re(substr_list: Iterable[str]) -> re.Regex: substr_list = sorted(substr_list, key=len, reverse=True) + ["."] return re.compile("|".join(substr_list)) class _ExceptionRewriter: def __init__(self, tsv: str) -> None: lines = tsv.splitlines() lines.pop(0) rules = [] for line in _filter_comments(lines): match, rewrite = line.split("\t") match = f"(?P{match})" if "(" not in match else match orig = re.search(r"\(\?P(.*?)\)", match).group(1) rules.append((match, orig, rewrite)) # reverse sort by length of substring matched, so that longest match applies rules.sort(key=itemgetter(1), reverse=True) re_str = "(" + "|".join(match for (match, _, _) in rules) + ")" self._re = re.compile(re_str) self._orig2rewrite: Dict[str, str] = { orig: rewrite for (_, orig, rewrite) in rules } @lru_cache() def sub(self, string: str) -> str: self._at = 0 return self._re.sub(self._rewrite, string) def _rewrite(self, match) -> str: # entire match matched = match.group() # multiple rewrites are allowed, but they must be contiguous and start # at the beginning of the string; otherwise, the match is returned # unchanged if match.start() == self._at: self._at += len(matched) else: return matched # the part of the match we want to replace orig = match.group("x") # what we want to replace it with rewrite = self._orig2rewrite[orig] return matched.replace(orig, rewrite) # ------------------------------ Load config 
------------------------------ DIR = Path(__file__) PHONES = _load_phones( DIR.with_name("phones.tsv").read_text(encoding="utf-8") # pylint: disable=E1101 ) SUBSTR2PHONES = _load_substr2phones( DIR.with_name("substr2phones.tsv").read_text( encoding="utf-8" ), # pylint: disable=E1101 PHONES, ) DEVOICED2VOICED, VOICED2DEVOICED, TRIGGER_VOICING, TRIGGER_DEVOICING = _load_voicing_pairs( DIR.with_name("voicing_pairs.tsv").read_text( encoding="utf-8" ), # pylint: disable=E1101 PHONES, ) SUBSTR_RE = _create_substr_re(SUBSTR2PHONES.keys()) REWRITER = _ExceptionRewriter( DIR.with_name("exceptions.tsv").read_text(encoding="utf-8") # pylint: disable=E1101 ) # ------------------------------ Public API ------------------------------ class Phone: def __init__(self, value: str, *, word_boundary: bool = False) -> None: self.value: str = value self.word_boundary = word_boundary def __repr__(self): return f"/{self.value}/" EMPTY_PHONE = Phone("") class ProsodicUnit: """A prosodic unit which should be transcribed as a whole. This means that various connected speech processes are emulated at word boundaries within the unit as well as within words. """ def __init__(self, orthographic: List[str]) -> None: """Initialize ProsodicUnit with orthographic transcript.""" self.orthographic = orthographic self._phonetic: Optional[List[Phone]] = None def phonetic( self, *, alphabet: str = "SAMPA", hiatus=False ) -> List[Tuple[str, ...]]: """Phonetic transcription of ProsodicUnit.""" if self._phonetic is None: t = self._str2phones(self.orthographic) # CSPs are implemented in one reverse pass (assimilation of voicing # can propagate) and one forward pass t = self._voicing_assim(t) t = self._other_csps(t, hiatus=hiatus) self._phonetic = t return self._split_words_and_translate(self._phonetic, alphabet) @staticmethod def _str2phones(input_: List[str]) -> List[Phone]: """Convert string to phones. Use pronunciation from dictionary if available, fall back to generic rewriting rules. """ output: List[Phone] = [] for word in input_: word = word.lower() # rewrite exceptions word = REWRITER.sub(word) # force hiatus in sequences; is there because the # exceptions above can insert it in place of to prevent # palatalization word = re.sub(r"([iy])([ií])", r"\1j\2", word) # remove duplicate graphemes (except for short vowels, cf. ) # cf. no gemination below for the phonetic counterpart of this rule word = re.sub(r"([^aeoiuy])\1", r"\1", word) for match in SUBSTR_RE.finditer(word.lower()): substr = match.group() try: phones = SUBSTR2PHONES[substr] except KeyError as e: raise ValueError( f"Unexpected substring in input: {substr!r}" ) from e output.extend(Phone(ph) for ph in phones) output[-1].word_boundary = True return output @staticmethod def _voicing_assim(input_: List[Phone]) -> List[Phone]: r"""Perform assimilation of voicing. Usually regressive, but P\ assimilates progressively as well. 
""" output = [] previous_phone = EMPTY_PHONE for ph in reversed(input_): if previous_phone.value in TRIGGER_VOICING: ph.value = DEVOICED2VOICED.get(ph.value, ph.value) elif ph.word_boundary or previous_phone.value in TRIGGER_DEVOICING: ph.value = VOICED2DEVOICED.get(ph.value, ph.value) # for P\, the assimilation works the other way round too elif previous_phone.value == "P\\" and ph.value in TRIGGER_DEVOICING: previous_phone.value = "Q\\" output.append(ph) previous_phone = ph output.reverse() return output @staticmethod def _other_csps(input_: List[Phone], *, hiatus=False) -> List[Phone]: """Perform other connected speech processes.""" output = [] for i, ph in enumerate(input_): try: next_ph = input_[i + 1] except IndexError: next_ph = EMPTY_PHONE # assimilation of place for nasals if ph.value == "n" and next_ph.value in ("k", "g"): ph.value = "N" elif ph.value == "m" and next_ph.value in ("f", "v"): ph.value = "F" # no gemination (except across word boundaries and for short # vowels); cf. remove duplicate graphemes above for the # orthographic counterpart of this rule elif ( ph.value == next_ph.value and ph.value not in "aEIou" and not ph.word_boundary ): continue # drop CSP-blocking pseudophones (they've done their job by now) elif ph.value == "-": continue output.append(ph) # optionally add transient /j/ between high front vowel and subsequent vowel if ( hiatus and re.match("[Ii]", ph.value) and re.match("[aEIoui]", next_ph.value) ): output.append(Phone("j")) return output @staticmethod def _split_words_and_translate( input_: List[Phone], alphabet ) -> List[Tuple[str, ...]]: output = [] word = [] alphabet = alphabet.lower() for ph in input_: word.append(PHONES.get(ph.value, {}).get(alphabet, ph.value)) if ph.word_boundary: output.append(tuple(word)) word = [] return output def _separate_tokens( tokens: List[str], prosodic_boundary_symbols: Set[str] ) -> Tuple[List[Optional[str]], List[str]]: """Separate tokens for transcription from those that will be left as is. Returns two lists: the first one is a matrix for the result containing non-alphabetic tokens and gaps for the alphabetic ones, the second one contains just the alphabetic ones. """ matrix: List[Optional[str]] = [] to_transcribe = [] for token in tokens: if re.fullmatch(r"[\p{Alphabetic}\-]*\p{Alphabetic}[\p{Alphabetic}\-]*", token): # instead of simply checking for a final hyphen in the outer # condition and silently shoving an otherwise transcribable token # into matrix, it's better to fail and alert the user they probably # meant something else if token.endswith("-"): raise ValueError( f"Can't transcribe token ending with hyphen ({token!r}), place hyphen at " "beginning of next token instead" ) to_transcribe.append(token) matrix.append(None) elif token in prosodic_boundary_symbols: to_transcribe.append("-") matrix.append(token) else: matrix.append(token) return matrix, to_transcribe def transcribe( phrase: Union[str, Iterable[str]], *, alphabet="sampa", hiatus=False, prosodic_boundary_symbols=set(), ) -> List[Union[str, Tuple[str, ...]]]: """Phonetically transcribe ``phrase``. ``phrase`` is either a string (in which case it is split on whitespace) or an iterable of strings (in which case it's considered as already tokenized by the user). Transcription is attempted for tokens which consist purely of alphabetical characters and possibly hyphens (``-``). Other tokens are passed through unchanged. Hyphens have a special role: they prevent interactions between graphemes or phones from taking place, which means you can e.g. 
cancel assimilation of voicing in a cluster like "tb" by inserting a hyphen between the graphemes: "t-b". They are removed from the final output. If you want a **literal hyphen**, it must be inside a token with either no alphabetic characters, or at least one other non-alphabetic character (e.g. "-", "---", "-hlad?", etc.). Returns a list where **transcribed tokens** are represented as **tuples of strings** (phones) and **non-transcribed tokens** (which were just passed through as-is) as plain **strings**. ``alphabet`` is one of SAMPA, IPA, CS or CNC (case insensitive) and determines the symbol alphabet used in the phonetic transcript. When ``hiatus == True``, a /j/ phone is added between a high front vowel and a subsequent vowel. Various connected speech processes such as assimilation of voicing are emulated even across word boundaries. By default, this happens **irrespective of intervening non-transcribed tokens**. If you want some types of non-transcribed tokens to constitute an obstacle to interactions between phones, pass them as a set via the ``prosodic_boundary_symbols`` argument. E.g. ``prosodic_boundary_symbols={"?", ".."}`` will prevent CSPs from being emulated across ``?`` and ``..`` tokens. """ try: if isinstance(phrase, str): tokens = ud.normalize("NFC", phrase.strip()).split() else: tokens = [ud.normalize("NFC", t) for t in phrase] except TypeError as e: raise TypeError( f"Expected str or Iterable[str] as phrase argument, got {type(phrase)} instead" ) from e matrix, to_transcribe = _separate_tokens(tokens, prosodic_boundary_symbols) transcribed = ProsodicUnit(to_transcribe).phonetic(alphabet=alphabet, hiatus=hiatus) return [m if m is not None else transcribed.pop(0) for m in matrix] # type: ignore PK!nL''corpy/phonetics/exceptions.tsvMATCH REWRITE # HOWTO: if only a substring of the MATCH regex should be rewritten, # wrap it in a capturing group named x: (?P...); also useful if # you need to add anchors and other special characters around the # substring that should be rewritten akusti akusty augustin augustyn besti besty charakteristi charakterysty (?Pcyklisti). cyklisty destin destyn drasti drasty dynasti dynasty (?Pfantasti). fantasty festival festyval instinkt instynkt instit instyt investi investy justi justy logisti logisty mysti mysty (?Pnacionalisti). nacionalisty (?Pnacisti). nacisty prestiž prestyž palestin palestyn plasti plasty prostitu prostytu (?Prealisti). realisty (?Psarkasti). sarkasty sebastian sebastyan (?Pslávisti). slávisty (?Psocialisti). socialisty (?Pstatisti). statysty stimul stymul sugesti sugesty (?Pteroristi). teroristy textil textyl (?Pturisti). turisty vestibul vestybul alternativ alternatyv aromati aromaty (?Pautomati). automaty autoritativ autoritatyv battist battyst charitativ charitatyv (?Pdemokrati). 
demokraty dramati dramaty gramati gramaty informati informaty inovativ inovatyv inspirativ inspiratyv klimati klimaty kompatibil kompatybil konzervativ konzervatyv kreativ kreatyv kvalitativ kvalitatyv legislativ legislatyv lukrativ lukratyv matemati matematy matik matyk narativ naratyv negativ negatyv normativ normatyv operativ operatyv pneumatik pneumatyk privatiz privatyz problemati problematy provokativ provokatyv relativ relatyv reprezentativ reprezentatyv (?Pstati)[kcč] staty sympati sympaty systemati systematy temati tematy témati tématy vatikán vatykán dillí dylí dia dya die dye diferenc dyferenc digi dygi dikt dykt dim dym dip dyp diri dyri disci dysci disk dysk disp dysp distr dystr dividend dyvidend diviz dyviz diář dyář aktiv aktyv arkti arkty antarkti antarkty atrakti atrakty destrukti destrukty detekti detekty efektiv efektyv fakti fakty fiktiv fiktyv interaktiv interaktyv kolektiv kolektyv konstruktiv konstruktyv objektiv objektyv perspektiv perspektyv prakti prakty produktiv produktyv respektiv respektyv subjektiv subjektyv takti takty analyti analyty graffiti grafity kriti krity legiti legity pervitin pervityn politi polity pozitiv pozityv primitiv primityv (?Patleti). atlety elektromagneti elektromagnety energeti energety esteti estety eti ety geneti genety kosmeti kosmety (?Pmagneti). magnety marketing marketyng (?Ppeti). pety (?Ppoeti). poety synteti syntety (?Pteoreti). teorety daniel danyel anim anym anit anyt botani botany humani humany manifest manyfest manipul manypul mechani mechany organi organy panik panyk # TODO: if variants are implemented, allow homophone (virgin) panice panyce reorgani reorgany sanit sanyt zorgani zorgany monik monyk veronik veronyk antonio antonyo architektoni architektony ceremoni ceremony chroni chrony (?Pelektroni). elektrony filharmoni filharmony harmoni harmony ironi irony (?Pkoloni). kolony kroni krony mattoni matony monitor monytor symfoni symfony (?Ptelefoni). telefony anti anty argenti argenty (?Pcenti). centy entit entyt (?Pgiganti). giganty identi identy intim intym kontin kontyn mantinel mantynel preventiv preventyv romanti romanty valentine valentajn ventil ventyl imunit imunyt junio junyo komuni komuny (?Pkomunisti). komunysty muni muny unie unye unij unyj unii unyi unií unyí unifor unyfor unikát unykát unikum unykum unipetrol unypetrol univerzit unyverzit univerzál unyverzál (?Pindi). indy kandid kandyd kondi kondy skandin skandyn dominik dominyk aerolini aeroliny defini definy exmini exminy klini kliny lini liny mini miny poliklini polikliny edi edy benedik benedyk encyklopedi encyklopedy expedi expedy ingredi ingredy komedi komedy kredit kredyt medic medyc medik medyk (?Pmedit). medyt mediá medyá profimedia profimédya využ vy-už vyuč vy-uč zadostiuči zadosti-uči tradi trady radik radyk radio rádyo (?Prádi). rády radiá radyá sporadi sporady stadio stadyo stadió stadyó denis denys dennis denys geniál genyál hygieni hygieny penis penys provenien provenyen senio senyo suvereni suvereny tenis tenys emotiv emotyv (?Peroti). eroty flotil flotyl goti goty lokomotiv lokomotyv motiv motyv buddhism budhizm charism charizm fašism fašizm kapitalism kapitalizm katolicism katolicizm liberalism liberalizm metabolism metabolizm nacionalism nacionalizm nacism nacizm rasism rasizm realism realizm socialism socializm terorism terorizm techni techny # TODO: possibly allow for zh- if variants are added in the future...? 
shod schod sho scho shora zhora shá schá shrn schrn shro schro shled schled tibet tybet tip typ # NOTE: this no-op rule is a way to bypass the rewrite rule above, # which is triggered by a shorter match tipec tipec titul tytul deskriptiv deskriptyv opti opty skepti skepty subtil subtyl exi egzi ordi ordy ferdinand ferdynand kardi kardy koordi koordy verdi verdy pudin pudyn studi study certifi certyfi partici partyci partie partye partii partyi partií partyí partiích partyích sortiment sortyment vertik vertyk dc c komodit komodyt melodi melody metodi metody modifi modyfi parodi parody detail detajl medail medajl trailer trajler fénix fényx géni gény trénin trényn viet vjet (?Pmédi). médy (?Ptragédi). tragédy email ímejl kaliforni kaliforny moderni moderny verni verny nissan nysan nil nyl británi britány albáni albány administrati adminystraty iniciati inyciaty vicky viky patrick patrik frederick frederik rick rik mick mik rock rok exe egze audi audy claudi klaudy (?Podd). od-d (?Podt). ot-t třia tři-a čtyřia čtyři-a štyřia štyři-a štyrya štyry-a th t výu vý-u vyu vy-u vyo vy-o john džon ^(?Psoftware)$ softvér softwar softvér ^(?Phardware) hárdvér hardwar hárdvér filosof filozof (?Pdiplomati). dyplomaty carlos karlos café kafé carmen karmen canon kanon srdc src tesc tesk oscar oskar idio idyo prezidi prezídy prezídi prezídy jidiš jidyš scott skot server servr mozaik mozajk josef jozef klause klauze klausovi klauzovi krause krauze krausovi krauzovi leasing lízing stadium stádyum stadiu stádyu stadia stádya (?Pstádi). stády přesvědč přesvěč svědč svěč asij azij fantasy fantazy etni etny telecom telekom econom ekonom jack džek jazz džez optimisti optymisty protiú proti-ú vyú vy-ú lai la-i mail mejl antibiotik antybiotyk komunism komunyzm chinaski činaski chilli čili stipendi stypendy definiti definyty party párty technoparty technopárty inje iňe pódi pódy slavii slávii slavie slávie # TODO: what about syllabic consonants? → write some rules to label them # optionally washington vošinktn charles čárls potter potr volkswagen folksvágn fiction fikšn design dyzajn asie ázie asii ázii afghánistán afgánystán vitamin vitamín minus mínus celsia celzia impulsy impulzy fischer fišer ophél ofél summit samit toyot tojot optimism optymizm resort rezort iont jont mítink mítynk exo egzo (?Pexoti). egzoty thriller triler přese přeze kreseb krezeb kasin kasín wales vejls hollywood holyvúd phil fil kokain kokajn intuiti intuity indonés indonéz piano piáno orgasm orgazm podtitul podtytul laser lejzr tchajwan tajvan combi kombi copy kopy protein protejn facto fakto pizz pic whisk visk shop šop puzzle pucle diet dyet tequi teki gay gej lady lejdy group grúp bosch boš bush buš brown braun sezona sezóna playoff plejof czech ček grace grejs business biznys pojďme pojťme googl gúgl ^(?Pgoogle)$ gúgl university junyversity country kántry schulz šulc alexander alexandr alois alojz blues blús time tajm black blek instinktiv instynktyv office ofis media médya charlott šarlot charlotte šarlot aaron áron einstein ajnštajn interview intervjú jamese džejmse jimmy džimi holding holdyng # TODO: maybe smuggle a /ɣ/ in there? 
těchh těh neumann nojman dick dyk churchill čerčil money many boom búm jerry džery green grín beatles bítls peugeot pežot holocaust holokaust credit kredyt tsunami cunami brandy brendy kvantitati kvantytaty schmeling šmeling nicole nykol nicol nykol ^(?Pnikol)$ nykol nikolaj nykolaj nikola nykola (?Pnikol)[kcč] nykol octav oktáv franka frenka czechtrade čektrejd czechtradu čektrejdu debut debit stockholm stokholm people pípl roy roj multimedi multymedy opel opl team tým revue reví rath rát christopher kristofr fair fér berger bergr ^(?Pout)$ aut hadamczik hadamčik fabi fábi flora flóra portfol portfól arthur artur telefónic telefónyk arlene árlín life lajf iq íkvé juli júli franz franc charlie čárlí schumacher šumachr karoser karosér abraham ejbrehem free frí justin džastyn uveďme uveťme biodiverz biodyverz horváth horvát makeup mejkap british brityš rutin rutyn schwarz švarc handicap hendykep ghett get radioaktiv radyoaktyv victor viktor představme předstafme (?Pfarmaceuti). farmaceuty dublin dablin konkurs konkurz cash keš electric elektrik zombie zombí diabetik dyabetyk classic klasik kognitiv kognytyv bismarck bizmark ruin rujn gang geng malajs malajz panasonic panasonyk haag hág ralph ralf seifert sajfrt annette anet museum mjúzíum khaki kaki sherry šeri konsorci konzorci marcus markus sherlock šerlok buďme buťme malcolm malkolm alfred alfréd hacke hek focus fokus fritz fric playboy plejboj piknik piknyk poesi poezi olympic olympik reggae rege menzel mencl (?Pdidakti). dydakty chevrolet ševrolet dalajlam dalajlám ^(?Pfranco)$ franko isaac ajzek tbili t-bili poker pokr genius džínyjus ulti ulty picass pikas diesel dýzl coca koka cola kola cocacola kokakola akreditiv akredytyv isabel izabel innsbruck inzbruk semitism semityzm marco marko použ po-už zentiv zentyv mechanism mechanyzm organism organyzm atlanti atlanty autenti autenty benjam beňam benj bendž dekorativ dekoratyv dinosaur dynosaur hegel hégl jseš seš jsi si jsme sme ^(?Pjsou)$ sou jste ste (?Pkyberneti). kybernety sentiment sentyment telekomuni telekomuny # NOTE: you can also add prefixes that don't contain any rewrites # but that are allowed to occur in front of rewritten substrings ne ne mikro mikro nej nej PK!corpy/phonetics/phones.tsvSAMPA IPA CS CNC # - is just a technical symbol used in exceptions.tsv to prevent # CSPs like voicing assimilation, hiatus insertion, merging of # two vowels into a diphthong etc. - - - - i: iː í í I ɪ i i E ɛ e e E: ɛː é é a a a a a: aː á á o o o o o: oː ó ó u u u u u: uː ú ú o_u o͡u ou̯ ou a_u a͡u au̯ au E_u ɛ͡u eu̯ eu p p p p b b b b t t t t d d d d c c ť ť J\ ɟ ď ď k k k k g ɡ g g f f f f v v v v s s s s z z z z S ʃ š š Z ʒ ž ž x x ch ch h\ ɦ h h t_s ʦ c c t_S ʧ č č m m m m n n n n J ɲ ň ň r r r r l l l l j j j j P\ r̝ ř ř G ɣ ɣ ɣ d_z ʣ dz ʒ d_Z ʤ dž ʒ̆ Q\ r̝̊ ř̭ ř N ŋ ŋ ŋ F ɱ ɱ ɱ r= r̩ r̥ r l= l̩ l̥ l m= m̩ m̥ m ? ʔ ʔ @ ə ə ə PK!.zee!corpy/phonetics/substr2phones.tsvSUBSTR PHONE ch x au a_u eu E_u ou o_u dě J\ E ně J E mě m J E tě c E di J\ I ni J I ti c I dí J\ i: ní J i: tí c i: dz d_z dž d_Z qu k v a a b b c t_s d d e E f f g g h h\ i I j j k k l l m m n n o o p p q k v r r s s t t u u v v w v x k s y I z z á a: ä E ą a m â a é E: ě j E ë E í i: ó o: ö E: ő E: ô o ú u: ů u: ü I ű I ý i: č t_S ć t_S ď J\ đ d ł l ĺ l ľ l ň J ń J ř P\ ŕ r š S ś S ş S ť c ž Z ż Z # - is just a technical symbol used in exceptions.tsv to prevent # CSPs like voicing assimilation, hiatus insertion, merging of # two vowels into a diphthong etc. 
- - PK!LL!corpy/phonetics/voicing_pairs.tsvDEVOICED VOICED p b t d c J\ k g s z S Z t_s d_z t_S d_Z x G x h\ f v Q\ P\ PK!corpy/scripts/__init__.pyPK!x. corpy/scripts/xc.pyimport os.path as osp import click as cli import logging as log import unicodedata as ud from collections import Counter import regex as re from lxml import etree NAME = osp.splitext(osp.basename(__file__))[0] LOG = log.getLogger(NAME) LOGLEVELS = [ s for f, s in sorted( (v, k) for k, v in vars(log).items() if k.isupper() and isinstance(v, int) ) ] NORM_FORMS = ("NFC", "NFD", "NFKC", "NFKD") def count_extended_grapheme_clusters(text): return Counter(m.group() for m in re.finditer(r"\X", text)) def check_normalization(fdist, expected_form="NFC"): LOG.info("Checking normalization of identified extended grapheme clusters.") for extended_grapheme_cluster in fdist.keys(): normalized = ud.normalize(expected_form, extended_grapheme_cluster) if extended_grapheme_cluster != normalized: LOG.warn( f"Expected {normalized!r} according to {expected_form}, got " f"{extended_grapheme_cluster!r} instead!" ) def parse(file, xml=False): if xml: LOG.info(f"Parsing {file.name!r} as XML.") tree = etree.parse(file) for elem in tree.iter(): yield from elem.attrib.values() yield elem.text yield elem.tail else: yield from file def print_fdist(fdist): for extended_grapheme_cluster, count in fdist.most_common(): names, codepoints = [], [] for codepoint in extended_grapheme_cluster: name = ud.name(codepoint, None) # control characters have no names, and for them, we want to print their repr instead codepoints.append(repr(codepoint) if name is None else codepoint) names.append("__NO_NAME__" if name is None else name) print(count, "".join(codepoints), "+".join(names), sep="\t") @cli.command() @cli.option( "--expected-normalization", help="Warn if identified extended grapheme clusters do not " "match expected normalization form.", type=cli.Choice(NORM_FORMS), ) @cli.option("--lower", help="Convert to lowercase before processing.", is_flag=True) @cli.option( "--xml", help="Parse input as XML and process only text nodes and attribute values.", is_flag=True, ) @cli.option( "lvl", "--log", help="Set logging level.", type=cli.Choice(LOGLEVELS), default="WARN", ) @cli.option("--verbose", "-v", help="(Repeatedly) increase logging level.", count=True) @cli.option("--quiet", "-q", help="(Repeatedly) decrease logging level.", count=True) @cli.argument("files", type=cli.File("rt", encoding="utf-8"), nargs=-1) def main(expected_normalization, lower, xml, lvl, verbose, quiet, files): """`wc -c` on steroids. Count extended grapheme clusters, print their frequency distribution. FILES are the files to process. Leave empty or - for STDIN. 
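
    Hypothetical invocations (the console-script name ``xc`` is an assumption
    based on the module name, not something stated in this file):

        xc --expected-normalization NFC --lower file1.txt file2.txt
        xc --xml corpus.xml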
""" lvl = getattr(log, lvl) - 10 * verbose + 10 * quiet log.basicConfig( level=lvl, format="[%(asctime)s {}:%(levelname)s] %(message)s".format(NAME) ) files = files if files else (cli.File("rt", encoding="utf-8")("-"),) fdist = Counter() LOG.info("Aggregating counts of extended grapheme clusters in input.") for file in files: for fragment in parse(file, xml): if fragment is not None: fragment = fragment.lower() if lower else fragment fdist.update(count_extended_grapheme_clusters(fragment)) if expected_normalization: check_normalization(fdist, expected_normalization) print_fdist(fdist) PK!ڒcorpy/scripts/zip_verticals.pyimport os.path as osp import click as cli import logging as log NAME = osp.splitext(osp.basename(__file__))[0] LOG = log.getLogger(NAME) LOGLEVELS = [ s for f, s in sorted( (v, k) for k, v in vars(log).items() if k.isupper() and isinstance(v, int) ) ] def print_position(lines, line_no): lines = [l.strip(" \n").split("\t") for l in lines] word = lines[0][0] position = [word] for i, line in enumerate(lines): assert line[0] == word, ( f"Expected first attribute {word} but got {line[0]} in vertical " f"#{i+1} at line #{line_no+1}. Are you sure the verticals " "represent the same corpus?" ) position.extend(line[1:]) print("\t".join(position)) @cli.command() @cli.option( "lvl", "--log", help="Set logging level.", type=cli.Choice(LOGLEVELS), default="WARN", ) @cli.option("--verbose", "-v", help="(Repeatedly) increase logging level.", count=True) @cli.option("--quiet", "-q", help="(Repeatedly) decrease logging level.", count=True) @cli.argument("files", type=cli.File("rt", encoding="utf-8"), nargs=-1) def main(lvl, verbose, quiet, files): """Zip verticals together. Intended for "zipping" together verticals of the same corpus. At least one of them must have multiple positional attributes. Structures and the first positional attribute (which is included only once) are taken from the first vertical provided. FILES are the files to process. Leave empty or - for STDIN. """ lvl = getattr(log, lvl) - 10 * verbose + 10 * quiet log.basicConfig( level=lvl, format="[%(asctime)s {}:%(levelname)s] %(message)s".format(NAME) ) files = files if files else (cli.File("rt", encoding="utf-8")("-"),) LOG.info(f"Zipping the following vertical files: {files}") for line_no, lines in enumerate(zip(*files)): if any("\t" in l for l in lines): print_position(lines, line_no) else: print(lines[0].strip(" \n")) LOG.info("Done.") PK!]^}}corpy/util/__init__.pyfrom pprint import pprint def _head_gen(items, n): for idx, item in enumerate(items): if n == idx: break yield item def head(collection, n=None): """Inspect collection, truncated if too long. If n is None, an appropriate value is determined based on the type of the collection. """ type_ = type(collection) if n is None: if type_ in (str, bytes): n = 100 else: n = 10 if len(collection) <= n: pprint(collection) return if type_ == str: constructor = "".join elif type_ == bytes: constructor = b"".join else: constructor = type_ items = collection.items() if hasattr(collection, "items") else collection pprint(constructor(_head_gen(items, n))) def cmp(lhs, rhs, test="__eq__"): """Wrap assert statement to automatically raise an informative error.""" msg = f"{head(lhs)} {test} {head(rhs)} is not True!" 
ans = getattr(lhs, test)(rhs) # operators automatically fall back to identity comparison if the # comparison is not implemented for the given types, magic methods don't → # if comparison is not implemented, we must fall back to identity # comparison manually, because NotImplemented is truthy and makes the # assert succeed if ans is NotImplemented: ans = lhs is rhs assert ans, msg PK!Ƥmcorpy/vertical/README.rst============== corpy.vertical ============== Overview ======== Tools for parsing corpora in the vertical format devised originally for `CWB `_, used also by `(No)SketchEngine `_. It would have been nice if verticals were just standards compliant XML, but they appeared before XML, so they're not. Hence this. Usage ===== Iterating over positions in a vertical file ------------------------------------------- This allows you to iterate over all positions while keeping track of the structural attributes of the structures they're contained within, without risking errors from hand-coding this logic every time you need it. .. code:: python >>> from corpy.vertical import Syn2015Vertical >>> from pprint import pprint >>> v = Syn2015Vertical("path/to/syn2015.gz") >>> for i, position in enumerate(v.positions()): ... if i % 100 == 0: ... # structural attributes of position ... pprint(v.sattrs) ... print() ... # position itself ... pprint(position) ... print() ... elif i > 100: ... break ... {'doc': {'audience': 'GEN: obecné publikum', 'author': 'Typlt, Jaromír', 'authsex': 'M: muž', 'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská ' 'imaginace.', 'first_published': '1993', 'genre': 'X: neuvedeno', 'genre_group': 'X: neuvedeno', 'id': 'pi291', 'isbnissn': '80-7110-132-X', 'issue': '', 'medium': 'B: kniha', 'periodicity': 'NP: neperiodická publikace', 'publisher': 'Pražská imaginace', 'pubplace': 'Praha', 'pubyear': '1993', 'srclang': 'cs: čeština', 'subtitle': 'Groteskní mýtus', 'title': 'Zápas s rodokmenem', 'translator': 'X', 'transsex': 'X: neuvedeno', 'txtype': 'NOV: próza', 'txtype_group': 'FIC: beletrie'}, 'p': {'id': 'pi291:1:1', 'type': 'normal'}, 's': {'id': 'pi291:1:1:1'}, 'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}} Position(word='ZÁPAS', lemma='zápas', tag=UtklTag(pos='N', sub='N', gen='I', num='S', case='1', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='T', afun='ExD', parent='0', eparent='0', prep='', p_lemma='', p_tag='', p_afun='', ep_lemma='', ep_tag='', ep_afun='') {'doc': {'audience': 'GEN: obecné publikum', 'author': 'Typlt, Jaromír', 'authsex': 'M: muž', 'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. 
Praha: Pražská ' 'imaginace.', 'first_published': '1993', 'genre': 'X: neuvedeno', 'genre_group': 'X: neuvedeno', 'id': 'pi291', 'isbnissn': '80-7110-132-X', 'issue': '', 'medium': 'B: kniha', 'periodicity': 'NP: neperiodická publikace', 'publisher': 'Pražská imaginace', 'pubplace': 'Praha', 'pubyear': '1993', 'srclang': 'cs: čeština', 'subtitle': 'Groteskní mýtus', 'title': 'Zápas s rodokmenem', 'translator': 'X', 'transsex': 'X: neuvedeno', 'txtype': 'NOV: próza', 'txtype_group': 'FIC: beletrie'}, 'p': {'id': 'pi291:1:3', 'type': 'normal'}, 's': {'id': 'pi291:1:3:2'}, 'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}} Position(word='chvil', lemma='chvíle', tag=UtklTag(pos='N', sub='N', gen='F', num='P', case='2', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='M', afun='Atr', parent='-1', eparent='-1', prep='', p_lemma='několik', p_tag='Ca--4-----------', p_afun='Adv', ep_lemma='několik', ep_tag='Ca--4-----------', ep_afun='Adv') Performing frequency distribution queries ----------------------------------------- This can be done elegantly and fairly quickly with :meth:`Vertical.search`. All you have to do is provide a match function, which identifies positions which the query should match, and a count function, which specifies what should be counted for each match. The return value is an index of occurrences and the total size of the corpus. The index is a dictionary of numpy array of position indices within the corpus, which can be further processed e.g. using :func:`ipm` or :func:`arf` to compute different types of frequencies. .. code:: python >>> from corpy.vertical import Syn2015Vertical, ipm, arf >>> v = Syn2015Vertical("path/to/syn2015.gz") >>> # log progress every 50M positions >>> v.report = 50_000_000 >>> def match(posattrs, sattrs): ... # match all nouns within txtype_group "FIC: beletrie" ... return sattrs["doc"]["txtype_group"] == "FIC: beletrie" and posattrs.tag.pos == "N" ... >>> def count(posattrs, sattrs): ... # at each matched position, record the txtype and lemma ... return sattrs["doc"]["txtype"], posattrs.lemma ... >>> index, N = v.search(match, count) Processed 0 lines in 0:00:00.007382. Processed 50,000,000 lines in 0:05:58.185566. Processed 100,000,000 lines in 0:11:35.394294. **NOTE:** this was run on a desktop workstation, with the data being stored on a networked filesystem. If the performance of any future versions on a similar task becomes significantly worse than this ballpark, it should be considered a bug. .. code:: python >>> # absolute frequency >>> len(index[("NOV: próza", "plíseň")]) 211 >>> # relative frequency (instances per million) >>> ipm(index[("NOV: próza", "plíseň")], N) 1.747430618598555 >>> # average reduced frequency (takes into account dispersion) >>> arf(index[("NOV: próza", "plíseň")], N) 54.220727998809153 Subclass :class:`Vertical` for your custom corpus ------------------------------------------------- If you have a corpus with a different structure, you can easily adapt the tools by subclassing :class:`Vertical`. See its docstring for further info, or the implementation of :class:`Syn2015Vertical` for a practical example. PK!_1]]corpy/vertical/__init__.py# TODO: Put positions in a buffer (queue). Yield the middle position and give a handle on the # context to match and count functions. 
Gotchas: sattrs will have to be reimplemented if they're to # be available on the context; corpora shorter than the queue size; start and end corner cases # (before the queue fills up / as it's emptying out). import sys import gzip import os.path as osp from typing import List import re import datetime as dt from collections import namedtuple, defaultdict import numpy as np __all__ = ["Vertical", "Syn2015Vertical", "ipm", "arf"] Structure = namedtuple("Structure", "name attrs") UtklTag = namedtuple( "UtklTag", "pos sub gen num case pgen pnum pers tense grad neg act p13 p14 var asp" ) class Vertical: """Base class for a corpus in the vertical format. Create subclasses for specific corpora by at least specifying a list of :attr:`struct_names` and :attr:`posattrs`. """ struct_names: List[str] = [] posattrs: List[str] = [] def __init__(self, path): if not (self.struct_names and self.posattrs): raise Exception( f"The class attributes `struct_names` and `posattrs` must be specified. You " f"probably want to subclass {self.__class__.__name__!r}." ) if not osp.isfile(path): raise Exception(f"File {path!r} does not exist!") self.path = path self._struct_re = re.compile( r"<\s*?(/?)\s*?({})(?:\s*?(/?)\s*?| (.*?))>".format( "|".join(self.struct_names) ) ) self.position_template = namedtuple("Position", self.posattrs) # if an integer > 0, then modulo for reporting progress; if falsey, then turns off reporting self.report = None def open(self): return open(self.path, "rt") def parse_position(self, position): return self.position_template(*position.split("\t")) def positions(self, parse_sattrs=True, ignore_fn=None, hook_fn=None): self.sattrs = {} # ignore_fn specifies which positions to completely ignore, based on pos and struct attrs # hook_fn is a function to be called at each position (receives pos and struct attrs) start = dt.datetime.now() with self.open() as file: for i, line in enumerate(file): line = line.strip(" \n\r") m = self._struct_re.fullmatch(line) if m: close, tag, self_close, attrs = m.groups() if close: self.sattrs.pop(tag) elif self_close: pass else: # TODO: figure out a way to allow nested tags...? if tag in self.sattrs: raise Exception( f"{tag!r} already in `sattrs`; nested tags?" 
) if parse_sattrs: attrs = { m.group(1): m.group(2) for m in re.finditer( r'\s*?(\S+?)="([^"]*?)"', "" if attrs is None else attrs, ) } self.sattrs[tag] = attrs else: position = self.parse_position(line) if hook_fn: hook_fn(position, self.sattrs) if not (ignore_fn and ignore_fn(position, self.sattrs)): yield position if self.report and i % self.report == 0: time = dt.datetime.now() - start print(f"Processed {i:,} lines in {time}.", file=sys.stderr) def search(self, match_fn, count_fn=None, **kwargs): # match_fn specifies what to match, based on pos and struct attrs # count_fn specifies what to count, based on pos and struct attrs; if it returns a list, # it's understood as a list of things to count if count_fn is None: count_fn = match_fn index = defaultdict(list) for i, position in enumerate(self.positions(**kwargs)): if match_fn(position, self.sattrs): count = count_fn(position, self.sattrs) if isinstance(count, list): for c in count: index[c].append(i) else: index[count].append(i) index = {k: np.array(v) for k, v in index.items()} return index, i class Syn2015Vertical(Vertical): struct_names = ["doc", "text", "p", "s", "hi", "lb"] posattrs = [ "word", "lemma", "tag", "proc", "afun", "parent", "eparent", "prep", "p_lemma", "p_tag", "p_afun", "ep_lemma", "ep_tag", "ep_afun", ] def open(self): return gzip.open(self.path, "rt") def parse_position(self, position): position = position.split("\t") position[2] = UtklTag(*position[2]) return self.position_template(*position) class ShuffledSyn2015Vertical(Syn2015Vertical): struct_names = ["block"] + Syn2015Vertical.struct_names def ipm(occurrences, N): return 1e6 * len(occurrences) / N def arf(occurrences, N): freq = len(occurrences) if freq == 0: return 0 shifted = np.roll(occurrences, 1) distances = (occurrences - shifted) % N avg_dist = N / freq return sum(min(d, avg_dist) for d in distances) / avg_dist PK!)~corpy/vis/README.rst================ corpy.vis ================ Overview ======== Wrappers for quick visualizations of linguistic data. .. code:: python >>> from corpy.vis import wordcloud >>> import os >>> wc = wordcloud(os.__doc__) >>> wc.to_image().show() In Jupyter, just inspect the ``wc`` variable to display the wordcloud. For further details, see the docstring of the ``wordcloud()`` function. PK!AEcorpy/vis/__init__.pyfrom collections import Counter from collections.abc import Mapping, Iterable import numpy as np from wordcloud import WordCloud CM_PER_IN = 2.54 def size_in_pixels(width, height, unit="in", ppi=300): """Convert size in inches/cm to pixels. width, height: dimensions, as measured by unit unit: "in" for inches, "cm" for centimeters ppi: pixels per inch Sample values for ppi: - for displays: you can detect your monitor's DPI using the following website: ; a typical value is 96 (of course, double that for HiDPI) - for print output: 300 at least, 600 is high quality """ allowed_units = ("in", "cm") if unit not in allowed_units: raise ValueError(f"`unit` must be one of {allowed_units}.") if unit == "cm": width = round(width * CM_PER_IN) height = round(height * CM_PER_IN) return width * ppi, height * ppi def _optimize_dimensions(size, fast, fast_limit): width, height = size # NOTE: Reasonable numbers for width and height are in the hundreds # to low thousands of pixels. If the requested size is large, for # faster results, we shrink the canvas during wordcloud # computation, and only scale it back up during rendering. 
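# A worked example of the shrink-and-rescale optimization implemented just
# below (the requested size is purely illustrative): with the default
# fast_limit of 800, a requested 3000 x 2000 px canvas has 6,000,000 pixels,
# which exceeds 800 ** 2 = 640,000, so scale becomes 3000 / 800 = 3.75 and the
# cloud is laid out on a ~800 x 533 canvas, then scaled back up by that factor
# at rendering time.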
if fast and width * height > fast_limit ** 2: scale = max(size) / fast_limit width = round(width / scale) height = round(height / scale) else: scale = 1 return width, height, scale def _elliptical_mask(width, height): x_center = half_width = round(width / 2) y_center = half_height = round(height / 2) x = np.arange(0, width) y = np.arange(0, height)[:, None] mat = ((x - x_center) / half_width) ** 2 + ((y - y_center) / half_height) ** 2 return (mat >= 1) * 255 def wordcloud( data, size=(400, 400), *, rounded=False, fast=True, fast_limit=800, **kwargs ): """Generate a wordcloud. If `data` is a string, the wordcloud is generated using the method `.generate_from_text()`, which automatically ignores stopwords (customizable with the `stopwords` argument) and includes "collocations" (i.e. bigrams). If `data` is a sequence or a mapping, the wordcloud is generated using the method `.generate_from_frequencies()` and these preprocessing responsibilities fall to the user. data: input data -- either one long string of text, or an iterable of tokens, or a mapping of word types to their frequencies; use the second or third option if you want full control over the output size: size in pixels, as a tuple of integers, (width, height); if you want to specify the size in inches or cm, use the `size_in_pixels()` function to generate this tuple rounded: whether or not to enclose the wordcloud in an ellipse; incompatible with the `mask` keyword argument fast: when True, optimizes large wordclouds for speed of generation rather than precision of word placement fast_limit: speed optimizations for "large" wordclouds are applied when the requested canvas size is larger than `fast_limit**2` kwargs: remaining keyword arguments are passed on to the `wordcloud.WordCloud` initializer """ if rounded and kwargs.get("mask"): raise ValueError("Can't specify `rounded` and `mask` at the same time.") # tweak defaults kwargs.setdefault("background_color", "white") # if Jupyter gets better at rendering transparent images, then # maybe these would be better defaults: # kwargs.setdefault("mode", "RGBA") # kwargs.setdefault("background_color", None) width, height, scale = _optimize_dimensions(size, fast, fast_limit) if rounded: kwargs["mask"] = _elliptical_mask(width, height) wc = WordCloud(width=width, height=height, scale=scale, **kwargs) # raw text if isinstance(data, str): return wc.generate_from_text(data) # frequency counts elif isinstance(data, Mapping): return wc.generate_from_frequencies(data) # tokenized text # NOTE: the second condition is there because of nltk.text, which # behaves like an Iterable / Collection / Sequence for all # practical intents and purposes, but the corresponding abstract # base classes don't pick up on it (maybe because it only has a # __getitem__ magic method?) elif isinstance(data, Iterable) or hasattr(data, "__getitem__"): return wc.generate_from_frequencies(Counter(data)) else: raise ValueError( "`data` must be a string, a mapping from words to frequencies, or an iterable of words." 
        )


def _wordcloud_png(wc):
    from IPython.display import display

    return display(wc.to_image())


try:
    from IPython import get_ipython

    ip = get_ipython()
    if ip is not None:
        png_formatter = ip.display_formatter.formatters["image/png"]
        png_formatter.for_type(WordCloud, _wordcloud_png)
except ImportError:
    pass
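To round off ``corpy.vis``, here is a usage sketch tying ``size_in_pixels()``
to the ``wordcloud()`` options defined above. The input text is a made-up
placeholder; only the function names and parameters shown in this module are
assumed.

.. code:: python

   >>> # hypothetical input; any longish string or iterable of tokens will do
   >>> text = "one fish two fish red fish blue fish " * 50
   >>> from corpy.vis import wordcloud, size_in_pixels
   >>> # a rounded (elliptical) cloud sized for print: 4 x 3 inches at 300 ppi
   >>> wc = wordcloud(text, size=size_in_pixels(4, 3), rounded=True)
   >>> wc.to_image().show()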