{ "info": { "author": "Gavin Falconer", "author_email": "gfalconer@expressivelogic.co.uk", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3 :: Only", "Topic :: Text Processing :: Linguistic" ], "description": ".. title:: sfm-utils README\n\n..\n ###################\n Roles\n ###################\n\n.. role:: cmd(code)\n\n\n..\n ###################\n Replacement Strings\n ###################\n\n.. |sfm-sniffer| replace::\n :cmd:`sfm-sniffer`\n.. |sfm-struct-sniffer| replace::\n :cmd:`sfm-struct-sniffer`\n\n\n..\n ###################\n Links\n ###################\n\n.. _python v3:\n https://www.python.org/\n.. _installed via pip:\n https://packaging.python.org/tutorials/installing-packages/\n.. _online jupyter notebook:\n https://mybinder.org/\n.. _SIL FLEx:\n.. _SIL Fieldworks Language Explorer (FLEx):\n https://software.sil.org/fieldworks/\n.. _SOLID:\n https://software.sil.org/solid/\n.. _Making Dictionaries\\: A guide to lexicography and the Multi-Dictionary Formatter:\n https://downloads.sil.org/legacy/shoebox/MDF_2000.pdf\n.. _Technical Notes on SFM Database Import:\n https://software.sil.org/fieldworks/wp-content/\n uploads/sites/38/2016/10/Technical-Notes-on-SFM-Database-Import.pdf\n\n.. _example walkthrough:\n.. _sfm-sniffer-walkthrough:\n docs/sfm_sniffer_walkthrough.md\n\n\n=========\nSFM Utils\n=========\n\n`sfm_utils` is a collection of python utilities to quickly and easily\nsummarise content and identify inconsistencies in lexicography data\nencoded using Standard Format Markers (SFM data files).\nPrimarily these utilities are intended to provide assistance when\ncleaning SFM data before converting to another format\nor importing into a tool such as\n`SIL Fieldworks Language Explorer (FLEx)`_.\n\nSFM files contain lexicographical data\nstructured using tags (backslash codes). For example::\n\n \\lx de\u0301la\u0301me\n \\ps n\n \\gn petite calebasse\n \\ps v\n \\gn sorte de verre\n \\ge drinking bowl\n \\gr \u0253i loonde\n\n`sfm_utils` scripts do not attribute meaning to the tags and are\ntherefore independent of the set of tags used in an SFM\ndata file. The intent of `sfm_utils` is to ensure that tags are\nused consistently throughout the data file.\n\nAuthor: Gavin Falconer (gfalconer@expressivelogic.co.uk)\n\n\nInstallation\n============\n\n`sfm_utils` is distributed as a python package, so can be\n`installed via pip`_ (or your package manager of choice).\nRequires `python v3`_ or above::\n\n > pip install sfm_utils\n\n.. admonition:: Future Suggestion\n\n Use a hosted version of sfm_utils within an `online jupyter notebook`_\n See, for example: https://jvns.ca/blog/2017/11/12/binder--an-awesome-tool-for-hosting-jupyter-notebooks/\n\n\nIntroduction\n============\n\nUse |sfm-sniffer| to quickly get an insight into the content of any\nSFM file. |sfm-sniffer| lists the tags used in the file, giving the number\nof occurrences of each tag. It also deduces a type for each tag, and shows\nthe number of 'exceptions', where the tag value did not match the expected type. ::\n\n > sfm-sniffer --summary my_lexicon.sfm\n \\gn : gloss (national) : occurrences=2480 : type=text : exceptions=26\n \\lx : lexeme : occurrences=2474 : type=word : exceptions=7\n \\sn : sense number : occurrences=2456 : type=enumeration : exceptions=28\n \\ps : part of speech : occurrences=2450 : type=enumeration : exceptions=79\n \\ge : gloss (english) : occurrences= 511 : type=optional word : exceptions=12\n \\gr : gloss (regional) : occurrences= 500 : type=optional phrase : exceptions=11\n \\glo: ??? : occurrences= 354 : type=text : exceptions=0\n\nRunning |sfm-sniffer| in full mode gives line references to pinpoint\nexceptions::\n\n > sfm-sniffer my_lexicon.sfm\n glo: gloss (other) : occurrences= 354: type=text : exceptions=0\n ===================================\n \\lx : lexeme : occurrences=2474: type=word\n 7 exceptions for \\lx of type 'word':\n line 1: \\lx \n line 2335: \\lx eptsa\u0301 - v. int. fatsa\n line 2470: \\lx e\u0301kse\u0301\u0253e\u0301, e\u0301sse\u0301\u0253a\u0301\n line 2474: \\lx e\u0301ksla\u0301, ala\u0301\n line 2712: \\lx fa\u0301 we\u0301...\n line 4025: \\lx ica\u0301 - v.int. \u0257atsa\n line 11051: \\lx \u014ba\u0301 (v.int. \u014b\u025b\u014ba)\n ====================================\n \\ps : part of speech : occurrences=2451: type=enumeration\n Example values:\n adj,adj adv,adj num,adj poss,adj poss.,adj?,adv,adv inter,adv tm,...\n 79 exceptions for \\ps of type 'enumeration':\n line 855: \\ps v. int\n line 1875: \\ps v. int.\n line 1879: \\ps \n line 1947: \\ps \n ...\n\nThe results indicate\nthe consistency of usage (or otherwise) for each tag. See the\n`example walkthrough`_ for more details.\n\nTag Type Deduction\n------------------\n\nTag type deduction works by examining the set of values used for each\ntag. If the majority of values conform to a known type then the tag\nis deduced to be of that type. (The threshold applied to determine\nan acceptable majority can be varied by selecing a 'strictness' option.)\n\nThe types are checked in order, with more specific types being\nchecked first. Therefore a tag will be deduced to be of the most\nspecific type that can be applied to the set of values used for that tag.\n\nTag types may be one of the following (ordered from most specific to\nleast specific):\n\n.. csv-table::\n :header: \"Order\", \"Type\", \"Description\"\n :widths: 3, 12, 40\n\n 1, ``NULL type``, \"Tag never has a value.\"\n 2, ``number``, \"Numeric value, e.g. 1, 2, 3. The tag must have a value.\"\n 3, ``optional number``, \"Numeric value, or may be empty.\"\n 4, ``enumeration``, \"A single word or phrase drawn from a limited\n set of possible values. A typical example could be \\\\ps (part of speech)\n accepting one of: noun, verb, adjective, adverb,... The tag must have a value.\"\n 5, ``optional enumeration``, \"As above, or may be empty.\"\n 6, ``word``, \"A single-word value. A word may include\n non-alphanumeric characters, but must include at least one alphanumeric\n character. It may not include any whitespace, period, comma or semicolon\n within the value. A trailing period, comma or semicolon is acceptable.\n The following are all valid words: ``e\u0301sse\u0301\u0253a\u0301``, ``up!``, ``abbrev.``.\n The tag must have a value.\"\n 7, ``optional word``, \"As above, or may be empty.\"\n 8, ``phrase``, \"A single-phrase value. Like ``word`` but may\n contain whitespace. May not contain a period, comma or semicolon except\n as a trailing character. ``up and away!`` is a valid phrase.\n ``up; away!`` is not (it is assumed to be a list value).\n The tag must have a value.\"\n 9, ``optional phrase``, \"As above, or may be empty.\"\n 10, ``enumeration list``, \"A list of words or phrases (separated by\n commas or semicolons) where each word or phrase is drawn from a\n limited set of possible values. The tag must have a value.\"\n 11, ``text``, \"Any combination of characters, words or\n phrases. The tag must have a value.\"\n 12, ``optional text``, \"Any combination of characters, words or\n phrases, or may be empty. The ``optional text`` type is generic, and\n indicates that no consistent pattern of usage could be deduced for the\n tag.\"\n\n\nComing Soon...\n--------------\n\nUse |sfm-struct-sniffer| to analyse the tree structure of the SFM\nfile and generate a proposed schema::\n\n > sfm-struct-sniffer my_lexicon.sfm > my_lexicon.schema\n\nThen use |sfm-struct-sniffer| to verify the integrity of the SFM\ndata against the schema::\n\n > sfm-struct-sniffer --verify --schema=my_lexicon.schema my_lexicon.sfm\n ...\n\nThe generated schema is a simple text file so can easily be modified::\n\n \\lx\n \\ps\n \\ge\n \\go?\n \\sn?\n \\ge\n \\go?\n\nWhen it becomes necessary to edit or correct the SFM file by hand, the\ndata can be formatted by |sfm-struct-sniffer| to apply indentation\nthat shows the tree structure::\n\n > sfm-struct-sniffer --format -schema=my_lexicon.schema my_lexicon.sfm\n \\lx de\u0301la\u0301me\n \\ps n\n \\gn petite calebasse\n \\ps v\n \\gn sorte de verre\n \\ge drinking bowl\n \\gr \u0253i loonde\n \\lx deremke\n \\ps num\n \\gn cent\n \\ge one hundred\n \\gr temerre\n\nThis also makes it easier to reason about the outcomes of importing\nthe data into `SIL Fieldworks Language Explorer (FLEx)`_\n\n.. admonition:: Future Suggestion\n\n |sfm-struct-sniffer| could embed comments in the file to\n highlight exceptions or ambiguous tree elements, e.g::\n\n \\lx de\u0301la\u0301me\n \\ps n\n # >>> unexpected \\sn\n \\sn 1\n # <<<\n\n\nFeatures\n========\n\n* Works with any SFM file. Inferred types are the result of statistical\n analysis on the SFM file contents. No semantics are assumed, no\n a priori knowledge is ncessary.\n\n\nUsage\n=====\n\nUsage information for |sfm-sniffer| can be shown by using the --help\noption. See also the `example walkthrough`_.\n\nUsage::\n\n sfm-sniffer [--tags=] [--summary] [--normal|--stricter|--strictest] \n sfm-sniffer --dumptags\n sfm-sniffer (-h | --help)\n sfm-sniffer --version\n\nOptions:\n -t, --tags=file Read a dictionary file that maps tags to labels.\n If unspecified, the default `MDF`_ tag labels will be used.\n `[1] `_\n -s, --summary Output a summary report only.\n -1, --normal Apply normal type deduction rules.\n -2, --stricter Apply stricter type deduction rules.\n -3, --strictest Apply strictest type deduction rules.\n -d, --dumptags Print the default SFM tag dictionary in the format\n used by --tags\n -h, --help Show this screen.\n --version Show version.\n\nApplying stricter type deduction rules will generate a report that\nprefers more specific types (such as ``number`` or ``word``) over more\ngeneral types (such as ``optional text``). However, stricter type\ndeduction rules are more likely to generate a large number of exceptions.\n\nSimilarily, for |sfm-struct-sniffer|:\n\nUsage::\n\n sfm-struct-sniffer [--tags=] \n sfm-struct-sniffer --dumptags\n sfm-struct-sniffer (-h | --help)\n sfm-struct-sniffer --version\n\nOptions:\n -t, --tags=file Read a dictionary file that maps tags to labels.\n If unspecified, the default `MDF`_ tag labels will be used.\n `[1] `_\n -d, --dumptags Print the default SFM tag dictionary in the format\n used by --tags\n -h, --help Show this screen.\n --version Show version.\n\n\nRepository contents\n===================\n\nTODO\n\n..\n See [doc/index.md](doc/index.md) for more explanation. See\n [doc/impl.md](doc/impl.md) for a brief overview of the implementation.\n\n\nSee Also\n========\n\n* `SOLID`_ is an existing graphical utility provided by SIL\n to check, clean up, and convert SDF files.\n\n\nReferences\n==========\n\n.. _`ref1`:\n\n#. `Making Dictionaries: A guide to lexicography and the Multi-Dictionary Formatter`_\n (Coward & Grimes, 2000): a description of the _`MDF` (Multi-Dictionary Formatter) and the\n defined set of SFM backslash codes that are commonly recognised.\n\n .. _`ref2`:\n\n#. `Technical Notes on SFM Database Import`_ (Ken Zook, 2010):\n provides further information on issues that are likely to be encountered\n when working with SFM files.\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://gitlab.com/bbsg/sfm_utils", "keywords": "SFM MDF SIL lexicon lexicography expressivelogic", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "sfm-utils", "package_url": "https://pypi.org/project/sfm-utils/", "platform": "", "project_url": "https://pypi.org/project/sfm-utils/", "project_urls": { "Homepage": "https://gitlab.com/bbsg/sfm_utils" }, "release_url": "https://pypi.org/project/sfm-utils/0.1.0rc1.post1/", "requires_dist": [ "docopt", "six" ], "requires_python": ">=3", "summary": "utilities for working with lexicography data encoded using Standard Format Markers (SFM data files).", "version": "0.1.0rc1.post1" }, "last_serial": 3941069, "releases": { "0.1.0rc1.post1": [ { "comment_text": "", "digests": { "md5": "fc9c90bb2c6eb07e103f98d990f08265", "sha256": "f4201d7b211e720164f7e3bf499c5c0c8d820d1262b1ef8946a17b6f3cd2d637" }, "downloads": -1, "filename": "sfm_utils-0.1.0rc1.post1-py3-none-any.whl", "has_sig": false, "md5_digest": "fc9c90bb2c6eb07e103f98d990f08265", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 15683, "upload_time": "2018-06-07T20:54:19", "url": "https://files.pythonhosted.org/packages/38/02/0b412bb1fc95c6ff1bc88f9c4d1a08e06dd781b46dc02737f3c5f0776a86/sfm_utils-0.1.0rc1.post1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9e20a3dc6f9ecae3e7873ba5bb18366a", "sha256": "159e55c9232f77a99127fc618faabd8ae36adc1a5667bb84e88d9710709163d1" }, "downloads": -1, "filename": "sfm-utils-0.1.0rc1.post1.tar.gz", "has_sig": false, "md5_digest": "9e20a3dc6f9ecae3e7873ba5bb18366a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 17593, "upload_time": "2018-06-07T20:54:21", "url": "https://files.pythonhosted.org/packages/e2/8c/95051f6391b98205c2153260a2b3a5a45509ebf9c61b9df90aa4722ae52b/sfm-utils-0.1.0rc1.post1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fc9c90bb2c6eb07e103f98d990f08265", "sha256": "f4201d7b211e720164f7e3bf499c5c0c8d820d1262b1ef8946a17b6f3cd2d637" }, "downloads": -1, "filename": "sfm_utils-0.1.0rc1.post1-py3-none-any.whl", "has_sig": false, "md5_digest": "fc9c90bb2c6eb07e103f98d990f08265", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 15683, "upload_time": "2018-06-07T20:54:19", "url": "https://files.pythonhosted.org/packages/38/02/0b412bb1fc95c6ff1bc88f9c4d1a08e06dd781b46dc02737f3c5f0776a86/sfm_utils-0.1.0rc1.post1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9e20a3dc6f9ecae3e7873ba5bb18366a", "sha256": "159e55c9232f77a99127fc618faabd8ae36adc1a5667bb84e88d9710709163d1" }, "downloads": -1, "filename": "sfm-utils-0.1.0rc1.post1.tar.gz", "has_sig": false, "md5_digest": "9e20a3dc6f9ecae3e7873ba5bb18366a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 17593, "upload_time": "2018-06-07T20:54:21", "url": "https://files.pythonhosted.org/packages/e2/8c/95051f6391b98205c2153260a2b3a5a45509ebf9c61b9df90aa4722ae52b/sfm-utils-0.1.0rc1.post1.tar.gz" } ] }