{ "info": { "author": "Egor Urvanov", "author_email": "hedgehogues@bk.ru", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python :: 3.7" ], "description": "# Ho Chi Minh\n\n## Overview\n\nHo Chi Minh is designed to extract textual information from tables presented in PDF, pictures or other formats.\nIf the table is strongly tilted, the data will not be extracted correctly. A PDF is first converted into pictures (page by page),\nand then each image is processed separately.\n\nThe utility ignores the text layer and treats any PDF as a picture.\n\nIt is assumed that each page contains exactly one table, which does not continue onto the next page. If\na page contains several tables, the recognition will most likely be incorrect.\n\nHo Chi Minh cannot handle complex tables in which several cells are merged:\n\n![Complex table](data/README/clever_table.png)\n\nSimple tables, by contrast, are recognized correctly in most cases:\n\n![Simple table](data/README/simple_table.png)\n\nIf the table occupies only a small part of the image or does not have\nobvious boundaries, it probably will not be recognized.\n\nHo Chi Minh returns the frame of the table as a list of cells, each of which is represented by four points\n(the cell corners). 
Cell is a class:\n\n    class Cell:\n        def __init__(self, x_min=0, x_max=0, y_min=0, y_max=0, prob=0.0, text=''):\n            # Coordinates of the cell in the picture\n            self.x_min = x_min\n            self.x_max = x_max\n            self.y_min = y_min\n            self.y_max = y_max\n\n            # Probability of correct extraction of cell coordinates\n            self.prob = prob\n            # Text inside the cell\n            self.text = text\n\nTo extract the text, you can use the following code:\n\n    templator = TableRecognizer(\n        reader=ImageFromPDFReader(\n            PDFConverter(in_path=path + 'pdf/', out_path=path + 'pdf/images/', resolution=130)\n        ),\n        hough_transformer=SobelDirector(),\n        connected_components=ConnectedComponents(),\n        cross_detector=CrossDetector(max_steps=15, detected_steps=8, line_width=8),\n        ocr=TesseractWrapper(),\n        binarization=210\n    )\n    table = []\n    for i in range(4):\n        table.append(templator.next_points())\n\n# How to install\n\n    pip install opencv-python==3.4.0.12 scikit-learn==0.19.1 numpy==1.14.0 scikit-image==0.13.1 pytesseract==0.2.0 scipy==1.0.0\n    pip install pdf2image==0.1.8 pillow==5.0.0 xlrd==1.1.0\n    apt-get install tesseract-ocr\n    apt-get install tesseract-ocr-rus\n    apt-get install poppler-utils\n\n# NOTICE\n\nIf the build fails, the problem is probably with the dependency versions, and you should adjust them yourself.\n\nIf you have a problem with OpenCV, see the Dockerfile draft, which describes how to install the\ndependencies for OpenCV as well as OpenCV itself.\n\n## Other objects\n\nHoChiMinh consists of several parts (objects) with which it must be initialized. More details about most of these parts can be found in the **Algorithm** section. There are other objects that are not directly related to the algorithm but are still important. One example is the ImagePDFReader class, which loads images from PDF. PDFConverter is one of its components.\n\nPDFConverter processes all files in a given directory. 
ImagePDFReader takes one file at a time from the PDFConverter.\n\n## Algorithm\n\nThe algorithm follows the scheme below:\n\n![Flowchart](data/README/block_scheme.png)\n\n* Reader. The reader takes the next picture from the pool.\n* Detector of horizontal and vertical lines. The most prominent and longest horizontal and vertical lines in the image are detected.\n\n![Source image](data/README/etalon_image.png)\n![Vertical edges](data/README/horisontal_edge.png)\n![Horizontal edges](data/README/vertical_edge.png)\n\nFor each line, its coordinate is stored ($y$ for horizontal lines, $x$ for vertical ones). Then all\npossible pairs $(x, y)$ are constructed. Each pair is a candidate for a corner of a table cell.\n\n![Candidate points](data/README/detected_points.png)\n\n* Detector of the largest connected component. We select the largest connected component (measured by the area of the rectangle\nthat contains it). All candidate points that belong to it are kept; the rest are discarded.\n\n* Detect crosses of the table. Some of the points selected in the table turn out to be superfluous: they lie not at an intersection\nof sides but at an arbitrary place along a side of the table. All such extra points are deleted.\n\n![Selected points](data/README/connected_components_points.png)\n\n* Unique points. After the previous step, only the points that lie at the nodes of the table\nremain. However, some nodes contain more than one point. 
This stage reduces them to a single point per\nnode.\n\n* Building table. The table is constructed. Because some of the node points $(x, y)$ may not have been\nrecognized, we construct the table as the Cartesian product $\\{x\\}\\times\\{y\\}$ of the detected coordinates. As a result, if some cells\nof the table were merged, the recognition is erroneous.\n\n![Table](data/README/table.png)\n\n* OCR. Each cell is processed separately, and its text is recognized.\n\n## Links\n\n[1] http://www.machinelearning.ru/wiki/index.php?title=\u041b\u043e\u0433\u0438\u0441\u0442\u0438\u0447\u0435\u0441\u043a\u0430\u044f_\u0444\u0443\u043d\u043a\u0446\u0438\u044f\n\n[2] http://www.sibran.ru/upload/iblock/0e1/0e15d76aff003e3879db51f21ae7f694.pdf\n\n[3] https://www.osapublishing.org/DirectPDFAccess/0CF96649-F614-A5CD-5F118AF53F660BAB_383046/copp-2-1-69.pdf?da=1&id=383046&seq=0&mobile=no\n\n[4] https://www.hindawi.com/journals/cmmm/2015/607407/\n\n[5] http://research.ijcaonline.org/volume55/number4/pxc3882634.pdf\n\n[6] http://article.sapub.org/pdf/10.5923.j.ajis.20120206.01.pdf\n\n[7] https://pdfs.semanticscholar.org/f07f/ec524e5a6dd08163e9f56599ea712cbd0a31.pdf\n\n[8] https://arxiv.org/abs/1505.04597\n\n[9] https://tyvik.ru/posts/pillow-almighty/\n\n[10] https://www.youtube.com/watch?v=jzZ3WVhgi5w\n\n[11] http://lab.alexkuk.ru/ocr/\n\n[12] https://github.com/ulikoehler/OTR\n\n[13] https://github.com/inuyasha2012/answer-sheet-scan\n\n[14] https://github.com/pavitrakumar78/Python-Custom-Digit-Recognition\n\n[15] https://github.com/enigma-ml/textify\n\n[17] https://www.ph4.ru/fonts_fonts.php?fn=all&l=rus&id=2\n\n[18] https://arxiv.org/pdf/1706.07422.pdf\n\n[19] https://github.com/Belval/pdf2image\n\n[20] https://habrahabr.ru/post/351266/\n\n[21] https://habrahabr.ru/post/181580/\n\n# Other projects\n\n[1] http://tabula.technology/\n\n[2] https://github.com/mpasternak/pdf-table-extractor\n\n[3] https://github.com/DustBytes/PdfTableExtraction\n\n[4] 
https://github.com/ferhatelmas/docker-pdf-table-extract\n\n[5] https://github.com/jsvine/pdfplumber\n\n[6] https://github.com/tfmorris/pdf2table\n\n[7] https://github.com/ashima/pdf-table-extract\n\n[8] https://github.com/BIDS/agri_tables\n\n[9] https://github.com/BIDS/agri_tables\n\n[10] https://github.com/nsi-iff/pypdf2table\n\n[11] https://github.com/mgedmin/pdf2html\n\n[12] https://github.com/maxbelyanin/pdf-tbl-extract\n\n# \u0425\u043e\u0448\u0438\u043c\u0438\u043d\n\n## \u041e\u0431\u0437\u043e\u0440\n\n\u0425\u043e\u0448\u0438\u043c\u0438\u043d \u043f\u0440\u0435\u0434\u043d\u0430\u0437\u043d\u0430\u0447\u0435\u043d \u0434\u043b\u044f \u0438\u0437\u0432\u043b\u0435\u0447\u0435\u043d\u0438\u044f \u0442\u0435\u043a\u0441\u0442\u043e\u0432\u043e\u0439 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438 \u0438\u0437 \u0442\u0430\u0431\u043b\u0438\u0446, \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u0435\u043d\u043d\u044b\u0445 \u0432 PDF, \u043a\u0430\u0440\u0442\u0438\u043d\u043a\u0430\u0445 \u0438\u043b\u0438 \u0438\u043d\u043e\u043c \u0444\u043e\u0440\u043c\u0430\u0442\u0435.\n\u0415\u0441\u043b\u0438 \u0442\u0430\u0431\u043b\u0438\u0446\u0430 \u0441\u0438\u043b\u044c\u043d\u043e \u043d\u0430\u043a\u043b\u043e\u043d\u0435\u043d\u0430, \u0442\u043e \u0434\u0430\u043d\u043d\u044b\u0435 \u0438\u0437\u0432\u043b\u0435\u043a\u0443\u0442\u0441\u044f \u043d\u0435\u043a\u043e\u0440\u0440\u0435\u043a\u0442\u043d\u043e. 
PDF \u043f\u0440\u0435\u0434\u0432\u0430\u0440\u0438\u0442\u0435\u043b\u044c\u043d\u043e \u043f\u0440\u0435\u043e\u0431\u0440\u0430\u0437\u0443\u0435\u0442\u0441\u044f \u0432 \u043a\u0430\u0440\u0442\u0438\u043d\u043a\u0438 (\u043f\u043e\u0441\u0442\u0440\u0430\u043d\u0438\u0447\u043d\u043e),\n\u0430 \u0437\u0430\u0442\u0435\u043c \u043a\u0430\u0436\u0434\u043e\u0435 \u0438\u0437\u043e\u0431\u0440\u0430\u0436\u0435\u043d\u0438\u0435 \u043e\u0431\u0440\u0430\u0431\u0430\u0442\u044b\u0432\u0430\u0435\u0442\u0441\u044f \u043e\u0442\u0434\u0435\u043b\u044c\u043d\u043e.\n\n\u0423\u0442\u0438\u043b\u0438\u0442\u0430 \u0438\u0433\u043d\u043e\u0440\u0438\u0440\u0443\u0435\u0442 \u0442\u0435\u043a\u0441\u0442\u043e\u0432\u044b\u0439 \u0441\u043b\u043e\u0439 \u0438 \u0440\u0430\u0441\u0441\u043c\u0430\u0442\u0440\u0438\u0432\u0430\u0435\u0442 \u043b\u044e\u0431\u043e\u0439 pdf \u043a\u0430\u043a \u043a\u0430\u0440\u0442\u0438\u043d\u043a\u0443.\n\n\u041f\u0440\u0435\u0434\u043f\u043e\u043b\u0430\u0433\u0430\u0435\u0442\u0441\u044f, \u0447\u0442\u043e \u043d\u0430 \u043e\u0434\u043d\u043e\u0439 \u0441\u0442\u0440\u0430\u043d\u0438\u0446\u0435 \u0441\u0443\u0449\u0435\u0441\u0442\u0432\u0443\u0435\u0442 \u0440\u043e\u0432\u043d\u043e \u043e\u0434\u043d\u0430 \u0442\u0430\u0431\u043b\u0438\u0446\u0430, \u043a\u043e\u0442\u043e\u0440\u0430\u044f \u043d\u0435 \u043f\u0435\u0440\u0435\u043d\u043e\u0441\u0438\u0442\u0441\u044f \u043d\u0430 \u0441\u043b\u0435\u0434\u0443\u044e\u0449\u0443\u044e \u0441\u0442\u0440\u0430\u043d\u0438\u0446\u0443. 
\u0415\u0441\u043b\u0438\n\u043d\u0430 \u0441\u0442\u0440\u0430\u043d\u0438\u0446\u0435 \u0441\u0443\u0449\u0435\u0441\u0442\u0432\u0443\u0435\u0442 \u043d\u0435\u0441\u043a\u043e\u043b\u044c\u043a\u043e \u0442\u0430\u0431\u043b\u0438\u0446, \u0442\u043e \u0432\u0435\u0440\u043e\u044f\u0442\u043d\u0435\u0435 \u0432\u0441\u0435\u0433\u043e \u0440\u0430\u0441\u043f\u043e\u0437\u043d\u043e\u0432\u0430\u043d\u0438\u0435 \u043f\u0440\u043e\u0438\u0437\u043e\u0439\u0434\u0451\u0442 \u043d\u0435\u043a\u043e\u0440\u0440\u0435\u043a\u0442\u043d\u043e.\n\n\u0425\u043e\u0448\u0438\u043c\u0438\u043d \u043d\u0435 \u0443\u043c\u0435\u0435\u0442 \u0440\u0430\u0431\u043e\u0442\u0430\u0442\u044c \u0441\u043e \u0441\u043b\u043e\u0436\u043d\u044b\u043c\u0438 \u0442\u0430\u0431\u043b\u0438\u0446\u0430\u043c\u0438, \u0432 \u043a\u043e\u0442\u043e\u0440\u044b\u0445 \u043d\u0435\u0441\u043a\u043e\u043b\u044c\u043a\u043e \u044f\u0447\u0435\u0435\u043a \u043e\u0431\u044a\u0435\u0434\u0435\u043d\u0435\u043d\u044b \u0432\u043e\u0435\u0434\u0438\u043d\u043e:\n\n![\u0421\u043b\u043e\u0436\u043d\u0430\u044f \u0442\u0430\u0431\u043b\u0438\u0446\u0430](data/README/clever_table.png)\n\n\u041f\u0440\u0438 \u044d\u0442\u043e\u043c, \u0432 \u0431\u043e\u043b\u044c\u0448\u0438\u043d\u0441\u0442\u0432\u0435 \u0441\u043b\u0443\u0447\u0430\u0435 \u043a\u043e\u0440\u0440\u0435\u043a\u0442\u043d\u043e \u0440\u0430\u0441\u043f\u043e\u0437\u043d\u0430\u044e\u0442\u0441\u044f \u043f\u0440\u043e\u0441\u0442\u044b\u0435 \u0442\u0430\u0431\u043b\u0438\u0446\u044b:\n\n![\u041f\u0440\u043e\u0441\u0442\u0430\u044f \u0442\u0430\u0431\u043b\u0438\u0446\u0430](data/README/simple_table.png)\n\n\u0415\u0441\u043b\u0438 \u0442\u0430\u0431\u043b\u0438\u0446\u0430 \u0437\u0430\u043d\u0438\u043c\u0430\u0435\u0442 \u043b\u0438\u0448\u044c \u043d\u0435\u0431\u043e\u043b\u044c\u0448\u0443\u044e \u0447\u0430\u0441\u0442\u044c \u0438\u0437\u043e\u0431\u0440\u0430\u0436\u0435\u043d\u0438\u044f \u0438\u043b\u0438 \u0436\u0435 
\u043d\u0435 \u0438\u043c\u0435\u0435\u0442 \u044f\u0432\u043d\u044b\u0445 \u0433\u0440\u0430\u043d\u0438\u0446, \u0442\u043e, \u0432\u0435\u0440\u043e\u044f\u0442\u043d\u043e, \u043e\u043d\u0430 \u043d\u0435 \u0431\u0443\u0434\u0435\u0442\n\u0440\u0430\u0441\u043f\u043e\u0437\u043d\u0430\u043d\u0430.\n\n\u0425\u043e\u0448\u0438\u043c\u0438\u043d \u043f\u043e\u0437\u0432\u043e\u043b\u044f\u0435\u0442 \u043f\u043e\u043b\u0443\u0447\u0438\u0442\u044c \u043a\u0430\u0440\u043a\u0430\u0441 \u0442\u0430\u0431\u043b\u0438\u0446\u044b \u0432 \u0432\u0438\u0434\u0435 \u0441\u043f\u0438\u0441\u043a\u0430 \u044f\u0447\u0435\u0435\u043a, \u043a\u0430\u0436\u0434\u0430\u044f \u0438\u0437 \u043a\u043e\u0442\u043e\u0440\u044b\u0445 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u0435\u043d\u0430 \u0447\u0435\u0442\u044b\u0440\u044c\u043c\u044f \u0442\u043e\u0447\u0435\u043a\u0430\u043c\u0438\n(\u0443\u0433\u043b\u044b \u044f\u0447\u0435\u0439\u043a\u0438). \u042f\u0447\u0435\u0439\u043a\u0430 (Cell) \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u044f\u0435\u0442 \u0441\u043e\u0431\u043e\u0439 \u043a\u043b\u0430\u0441\u0441:\n\n class Cell:\n def __init__(self, x_min=0, x_max=0, y_min=0, y_max=0, prob=0.0, text=''):\n # \u041a\u043e\u043e\u0440\u0434\u0438\u043d\u0430\u0442\u044b \u044f\u0447\u0435\u0439\u043a\u0438 \u043d\u0430 \u043a\u0430\u0440\u0442\u0438\u043d\u043a\u0435\n self.x_min = x_min\n self.x_max = x_max\n self.y_min = y_min\n self.y_max = y_max\n\n # \u0412\u0435\u0440\u043e\u044f\u0442\u043d\u043e\u0441\u0442\u044c \u043a\u043e\u0440\u0440\u0435\u043a\u0442\u043d\u043e\u0433\u043e \u0438\u0437\u0432\u043b\u0435\u0447\u0435\u043d\u0438\u044f \u043a\u043e\u043e\u0440\u0434\u0438\u043d\u0430\u0442 \u044f\u0447\u0435\u0439\u043a\u0438\n self.prob = prob\n # \u0422\u0435\u043a\u0441\u0442 \u0432\u043d\u0443\u0442\u0440\u0438 \u044f\u0447\u0435\u0439\u043a\u0438\n self.text = text\n\n\u0414\u043b\u044f 
\u0438\u0437\u0432\u043b\u0435\u0447\u0435\u043d\u0438\u044f \u0442\u0435\u043a\u0441\u0442\u043e\u0432\u043e\u0439 \u043c\u043e\u0436\u043d\u043e \u0432\u043e\u0441\u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u044c\u0441\u044f \u0441\u043b\u0435\u0434\u0443\u044e\u0449\u0438\u043c \u043a\u043e\u0434\u043e\u043c:\n\n    templator = TableRecognizer(\n        reader=ImageFromPDFReader(\n            PDFConverter(in_path=path + 'pdf/', out_path=path + 'pdf/images/', resolution=130)\n        ),\n        hough_transformer=SobelDirector(),\n        connected_components=ConnectedComponents(),\n        cross_detector=CrossDetector(max_steps=15, detected_steps=8, line_width=8),\n        ocr=TesseractWrapper(),\n        binarization=210\n    )\n    table = []\n    for i in range(4):\n        table.append(templator.next_points())\n\n# How to install without pip\n\n    apt-get install tesseract-ocr\n    apt-get install tesseract-ocr-rus\n    apt-get install poppler-utils\n    pip install opencv-python==3.4.0.12 scikit-learn==0.19.1 numpy==1.14.0 scikit-image==0.13.1 pytesseract==0.2.0 scipy==1.0.0\n    pip install pdf2image==0.1.8 pillow==5.0.0 xlrd==1.1.0\n\n# How to install WITH pip\n\n    apt-get install tesseract-ocr\n    apt-get install tesseract-ocr-rus\n    apt-get install poppler-utils\n    pip install HoChiMinh\n\n## \u0417\u0430\u043c\u0435\u0447\u0435\u043d\u0438\u0435\n\n\u0415\u0441\u043b\u0438 \u0441\u0431\u043e\u0440\u043a\u0430 \u0437\u0430\u0432\u0435\u0440\u0448\u0430\u0435\u0442\u0441\u044f \u0441 \u043e\u0448\u0438\u0431\u043a\u043e\u0439, \u0442\u043e, \u0432\u0435\u0440\u043e\u044f\u0442\u043d\u043e, \u043f\u0440\u043e\u0431\u043b\u0435\u043c\u0430 \u0441 \u0432\u0435\u0440\u0441\u0438\u044f\u043c\u0438 \u0437\u0430\u0432\u0438\u0441\u0438\u043c\u043e\u0441\u0442\u0435\u0439 \u0438 \u0412\u0430\u043c \u0441\u0442\u043e\u0438\u0442 \u0438\u0445 \u0441\u0430\u043c\u043e\u0441\u0442\u043e\u044f\u0442\u0435\u043b\u044c\u043d\u043e \n\u043f\u043e\u043f\u0440\u0430\u0432\u0438\u0442\u044c.\n\n\u0415\u0441\u043b\u0438 \u0443 \u0412\u0430\u0441 
\u0432\u043e\u0437\u043d\u0438\u043a\u0430\u0435\u0442 \u043f\u0440\u043e\u0431\u043b\u0435\u043c\u0430 \u0441 openCV, \u0412\u044b \u043c\u043e\u0436\u0435\u0442\u0435 \u0438\u0437\u0443\u0447\u0438\u0442\u044c \u0437\u0430\u0433\u043e\u0442\u043e\u0432\u043a\u0443 \u0434\u043b\u044f \u0434\u043e\u043a\u0435\u0440\u0444\u0430\u0439\u043b\u0430, \u0433\u0434\u0435 \u043e\u043f\u0438\u0441\u0430\u043d \u0441\u043f\u043e\u0441\u043e\u0431 \u0443\u0441\u0442\u0430\u043d\u043e\u0432\u043a\u0438 \n\u0437\u0430\u0432\u0438\u0441\u0438\u043c\u043e\u0441\u0442\u0435\u0439 \u0434\u043b\u044f openCV, \u0430 \u0442\u0430\u043a\u0436\u0435 \u0441\u0430\u043c\u043e\u0433\u043e openCV.\n\n## \u0421\u043e\u043f\u043e\u0442\u0441\u0442\u0432\u0443\u044e\u0449\u0438\u0435 \u0443\u0442\u0438\u043b\u0438\u0442\u044b\n\n\u0425\u043e\u0448\u0438\u043c\u0438\u043d \u0441\u043e\u0441\u0442\u043e\u0438\u0442 \u0438\u0437 \u043d\u0435\u0441\u043a\u043e\u043b\u044c\u043a\u0438\u0445 \u0447\u0430\u0441\u0442\u0435\u0439 (\u043e\u0431\u044a\u0435\u043a\u0442\u043e\u0432), \u043a\u043e\u0442\u043e\u0440\u044b\u043c\u0438 \u0435\u0433\u043e \u043d\u0435\u043e\u0431\u0445\u043e\u0434\u0438\u043c\u043e \u0438\u043d\u0438\u0446\u0438\u0430\u043b\u0438\u0437\u0438\u0440\u043e\u0432\u0430\u0442\u044c. \n\u041f\u043e\u0434\u0440\u043e\u0431\u043d\u0435\u0435, \u043e \u0431\u043e\u043b\u044c\u0448\u0438\u043d\u0441\u0442\u0432\u0435 \u0438\u0437 \u044d\u0442\u0438\u0445 \u0447\u0430\u0441\u0442\u0435\u0439, \u043c\u043e\u0436\u043d\u043e \u043f\u0440\u043e\u0447\u0438\u0442\u0430\u0442\u044c \u0432 \u0440\u0430\u0437\u0434\u0435\u043b\u0435 **\u0410\u043b\u0433\u043e\u0440\u0438\u0442\u043c**. 
\n\u0415\u0441\u0442\u044c \u0438 \u0434\u0440\u0443\u0433\u0438\u0435 \u043e\u0431\u044a\u0435\u043a\u0442\u044b, \u043a\u043e\u0442\u043e\u0440\u044b\u0435 \u043d\u0435 \u0438\u043c\u0435\u044e\u0442 \u043d\u0435\u043f\u043e\u0441\u0440\u0435\u0434\u0441\u0442\u0432\u0435\u043d\u043d\u043e\u0433\u043e \u043e\u0442\u043d\u043e\u0448\u0435\u043d\u0438\u044f \u043a \u0430\u043b\u0433\u043e\u0440\u0438\u0442\u043c\u0443, \u043d\u043e, \u043f\u0440\u0438 \n\u044d\u0442\u043e\u043c \u044f\u0432\u043b\u044f\u044e\u0442\u0441\u044f \u0434\u043e\u0441\u0442\u0430\u0442\u043e\u0447\u043d\u043e \u0432\u0430\u0436\u043d\u044b\u043c\u0438 \u0447\u0430\u0441\u0442\u044f\u043c\u0438. \u042d\u0442\u043e, \u043d\u0430\u043f\u0440\u0438\u043c\u0435\u0440, \u043a\u043b\u0430\u0441\u0441 ImagePDFReader, \u043a\u043e\u0442\u043e\u0440\u044b\u0439\n\u043e\u0431\u0435\u0441\u043f\u0435\u0447\u0438\u0432\u0430\u0435\u0442 \u0437\u0430\u0433\u0440\u0443\u0437\u043a\u0443 \u0438\u0437\u043e\u0431\u0440\u0430\u0436\u0435\u043d\u0438\u0439 \u0438\u0437 PDF. \u0415\u0433\u043e \u0441\u043e\u0441\u0442\u0430\u0432\u043d\u043e\u0439 \u0447\u0430\u0441\u0442\u044c\u044e \u044f\u0432\u043b\u044f\u0435\u0442\u0441\u044f PDFConverter.\n\nPDFConverter \u043e\u0431\u0440\u0430\u0431\u0430\u0442\u044b\u0432\u0430\u0435\u0442 \u0432\u0441\u0435 \u0444\u0430\u0439\u043b\u044b, \u043a\u043e\u0442\u043e\u0440\u044b\u0439 \u043b\u0435\u0436\u0430\u0442 \u0432 \u043e\u043f\u0440\u0435\u0434\u0435\u043b\u0451\u043d\u043d\u043e\u0439 \u0434\u0438\u0440\u0435\u043a\u0442\u043e\u0440\u0438\u0438. 
ImagePDFReader\n\u0431\u0435\u0440\u0451\u0442 \u043f\u043e \u043e\u0434\u043d\u043e\u043c\u0443 \u0444\u0430\u0439\u043b\u0443 \u043e\u0442 PDFConverter.\n\n## \u0410\u043b\u0433\u043e\u0440\u0438\u0442\u043c\n\n\u041f\u0440\u0438\u043d\u0446\u0438\u043f \u0440\u0430\u0431\u043e\u0442\u044b \u0430\u043b\u0433\u043e\u0440\u0438\u0442\u043c\u0430 \u043c\u043e\u0436\u043d\u043e \u043e\u043f\u0438\u0441\u0430\u0442\u044c \u043f\u0440\u0438 \u043f\u043e\u043c\u043e\u0449\u0438 \u0441\u043b\u0435\u0434\u0443\u044e\u0449\u0435\u0439 \u0441\u0445\u0435\u043c\u044b:\n\n![\u0411\u043b\u043e\u043a-\u0441\u0445\u0435\u043c\u0430](data/README/block_scheme.png)\n\n* Reader. \u041f\u0440\u043e\u0447\u0442\u0435\u043d\u0438\u0435 \u043e\u0447\u0435\u0440\u0435\u0434\u043d\u043e\u0439 \u043a\u0430\u0440\u0442\u0438\u043d\u043a\u0438 \u0438\u0437 \u043f\u0443\u043b\u0430 \u043f\u0440\u0438 \u043f\u043e\u043c\u043e\u0449\u0438 \u0440\u0438\u0434\u0435\u0440\u0430.\n\n* Detector of Horisontal and vertical lines. 
\u041d\u0430 \u0438\u0437\u043e\u0431\u0440\u0430\u0436\u0435\u043d\u0438\u0438 \u0432\u044b\u0434\u0435\u043b\u044f\u044e\u0442\u0441\u044f \u043d\u0430\u0438\u0431\u043e\u043b\u0435\u0435 \u044f\u0432\u043d\u044b\u0435 \u0438 \u043f\u0440\u043e\u0442\u044f\u0436\u0451\u043d\u043d\u044b\u0435 \u0433\u043e\u0440\u0438\u0437\u043e\u043d\u0442\u0430\u043b\u044c\u043d\u044b\u0435 \u0438\n\u0432\u0435\u0440\u0442\u0438\u043a\u0430\u043b\u044c\u043d\u044b\u0435 \u043b\u0438\u043d\u0438\u0438.\n\n![\u0418\u0441\u0445\u043e\u0434\u043d\u043e\u0435 \u0438\u0437\u043e\u0431\u0440\u0430\u0436\u0435\u043d\u0438\u0435](data/README/etalon_image.png)\n![\u0412\u0435\u0440\u0442\u0438\u043a\u0430\u043b\u044c\u043d\u044b\u0435 \u0433\u0440\u0430\u043d\u0438\u0446\u044b](data/README/horisontal_edge.png)\n![\u0413\u043e\u0440\u0438\u0437\u043e\u043d\u0442\u0430\u043b\u044c\u043d\u044b\u0435 \u0433\u0440\u0430\u043d\u0438\u0446\u044b](data/README/vertical_edge.png)\n\n\u0414\u043b\u044f \u043a\u0430\u0436\u0434\u043e\u0439 \u043b\u0438\u043d\u0438\u0438 \u0437\u0430\u043f\u043e\u043c\u0438\u043d\u0430\u0435\u0442\u0441\u044f \u0435\u0451 \u043a\u043e\u043e\u0440\u0434\u0438\u043d\u0430\u0442\u0430 (\u0434\u043b\u044f \u0433\u043e\u0440\u0438\u0437\u043e\u043d\u0442\u0430\u043b\u044c\u043d\u044b\u0445 - $y$, \u0434\u043b\u044f \u0432\u0435\u0440\u0442\u0438\u043a\u0430\u043b\u044c\u043d\u044b\u0445 - $x$). \u0417\u0430\u0442\u0435\u043c,\n\u0441\u0442\u0440\u043e\u044f\u0442\u0441\u044f \u0432\u0441\u0435 \u0432\u043e\u0437\u043c\u043e\u0436\u043d\u044b\u0435 \u043f\u0430\u0440\u044b \u0442\u043e\u0447\u0435\u043a $(x, y)$. 
\u0412\u0441\u0435 \u043e\u043d\u0438 \u044f\u0432\u043b\u044f\u044e\u0442\u0441\u044f \u043a\u0430\u043d\u0434\u0438\u0434\u0430\u0442\u0430\u043c\u0438 \u0434\u043b\u044f \u0443\u0433\u043b\u043e\u0432 \u044f\u0447\u0435\u0435\u043a \u0442\u0430\u0431\u043b\u0438\u0446\u044b.\n\n![\u0422\u043e\u0447\u043a\u0438-\u043a\u0430\u043d\u0434\u0438\u0434\u0430\u0442\u044b](data/README/detected_points.png)\n\n* Detector the largest connected components. \u0412\u044b\u0434\u0435\u043b\u044f\u0435\u043c \u0441\u0430\u043c\u0443\u044e \u0431\u043e\u043b\u044c\u0448\u0443\u044e (\u0441\u0447\u0438\u0442\u0430\u0435\u043c \u043f\u043b\u043e\u0449\u0430\u0434\u044c \u043f\u0440\u044f\u043c\u043e\u0443\u0433\u043e\u043b\u044c\u043d\u0438\u043a\u0430, \u0432 \u043a\u043e\u0442\u043e\u0440\u043e\u043c \u043e\u043d\u0430\n\u043d\u0430\u0445\u043e\u0434\u0438\u0442\u0441\u044f) \u0441\u0432\u044f\u0437\u043d\u0443\u044e \u043a\u043e\u043c\u043f\u043e\u043d\u0435\u043d\u0442\u0443. \u0412\u0441\u0435 \u0442\u043e\u0447\u043a\u0438-\u043a\u0430\u043d\u0434\u0438\u0434\u0430\u0442\u044b, \u043f\u0440\u0435\u043d\u0430\u0434\u043b\u0435\u0436\u0430\u0449\u0438\u0435 \u0435\u0439, \u043e\u0441\u0442\u0430\u0432\u043b\u044f\u0435\u043c, \u043f\u0440\u043e \u043e\u0441\u0442\u0430\u043b\u044c\u043d\u044b\u0435 \u0437\u0430\u0431\u044b\u0432\u0430\u0435\u043c.\n\n* Detect cross of table. \u0427\u0430\u0441\u0442\u044c \u0442\u043e\u0447\u0435\u043a, \u0432\u044b\u0434\u0435\u043b\u0435\u043d\u043d\u044b\u0445 \u0432 \u0442\u0430\u0431\u043b\u0438\u0446\u0435 \u043e\u043a\u0430\u0437\u044b\u0432\u0430\u0435\u0442 \u0438\u0437\u043b\u0438\u0448\u043d\u0438\u043c\u0438. 
\u041d\u0435\u043a\u043e\u0442\u043e\u0440\u044b\u0435 \u0438\u0437 \u043d\u0438\u0445 \u043b\u0435\u0436\u0430\u0442 \u043d\u0435 \u043d\u0430 \u043f\u0435\u0440\u0435\u0441\u0435\u0447\u0435\u043d\u0438\u0438\n\u0441\u0442\u043e\u0440\u043e\u043d, \u0430 \u0432 \u043f\u0440\u043e\u0438\u0437\u0432\u043e\u043b\u044c\u043d\u043e\u043c \u043c\u0435\u0441\u0442\u0435 \u043d\u0430 \u0441\u0442\u043e\u0440\u043e\u043d\u0435 \u0442\u0430\u0431\u043b\u0438\u0446\u044b. \u041d\u0430 \u0432\u0441\u0435 \u043b\u0438\u0448\u043d\u0438\u0435 \u0442\u043e\u0447\u043a\u0438 \u0443\u0434\u0430\u043b\u044f\u044e\u0442\u0441\u044f.\n\n![\u041e\u0442\u043e\u0431\u0440\u0430\u043d\u043d\u044b\u0435 \u0442\u043e\u0447\u043a\u0438](data/README/connected_components_points.png)\n\n* Uniques points. \u0412 \u0440\u0435\u0437\u0443\u043b\u044c\u0442\u0430\u0442\u0435 \u043f\u0440\u043e\u0432\u0435\u0434\u0451\u043d\u043d\u043e\u0439 \u043e\u043f\u0435\u0440\u0430\u0446\u0438\u0438 \u043d\u0430 \u043f\u0440\u0435\u0434\u044b\u0434\u0443\u0449\u0435\u043c \u0448\u0430\u0433\u0435, \u043e\u0441\u0442\u0430\u044e\u0442\u0441\u044f \u043b\u0438\u0448\u044c \u0442\u0435 \u0442\u043e\u0447\u043a\u0438, \u043a\u043e\u0442\u043e\u0440\u044b\u0435 \u043b\u0435\u0436\u0430\u0442 \u0432 \u0443\u0437\u043b\u0430\u0445\n\u0442\u0430\u0431\u043b\u0438\u0446\u044b. \u041f\u0440\u0438 \u044d\u0442\u043e\u043c, \u0432 \u043d\u0435\u043a\u043e\u0442\u043e\u0440\u044b\u0445 \u0443\u0437\u043b\u0430\u0445 \u043b\u0435\u0436\u0438\u0442 \u0431\u043e\u043b\u0435\u0435 1 \u0442\u043e\u0447\u043a\u0438. \u0414\u0430\u043d\u043d\u044b\u0439 \u044d\u0442\u0430\u043f \u043d\u0430\u043f\u0440\u0430\u0432\u043b\u0435\u043d \u043d\u0430 \u043f\u043e\u043b\u0443\u0447\u0435\u043d\u0438\u0435 \u0435\u0434\u0438\u043d\u0441\u0442\u0432\u0435\u043d\u043d\u043e\u0439 \u0442\u043e\u0447\u043a\u0438 \u0432 \u043a\u0430\u0436\u0434\u043e\u043c\n\u0443\u0437\u043b\u0435.\n\n* Building table. 
\u041f\u0440\u043e\u0438\u0437\u0432\u043e\u0434\u0438\u0442\u0441\u044f \u043f\u043e\u0441\u0442\u0440\u043e\u0435\u043d\u0438\u0435 \u0442\u0430\u0431\u043b\u0438\u0446\u044b. \u0412 \u0441\u0432\u044f\u0437\u0438 \u0441 \u0442\u0435\u043c, \u0447\u0442\u043e \u0447\u0430\u0441\u0442\u044c \u0443\u0437\u043b\u043e\u0432\u044b\u0445 \u0442\u043e\u0447\u0435\u043a $(x, y)$ \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043d\u0435\n\u0440\u0430\u0441\u043f\u043e\u0437\u043d\u0430\u043d\u043e, \u043f\u043e\u0441\u0442\u0440\u043e\u0438\u043c \u0442\u0430\u0431\u043b\u0438\u0446\u0443, \u043a\u0430\u043a \u0434\u0435\u043a\u0430\u0440\u0442\u043e\u0432\u043e \u043f\u0440\u043e\u0438\u0437\u0432\u0435\u0434\u0435\u043d\u0438\u0435 \u043c\u043d\u043e\u0436\u0435\u0441\u0442\u0432\u0430 $\\{x\\} \\times \\{y\\}$. \u041f\u0440\u0438 \u044d\u0442\u043e\u043c, \u0435\u0441\u043b\u0438 \u043d\u0435\u043a\u043e\u0442\u043e\u0440\u044b\u0435 \u044f\u0447\u0435\u0439\u043a\u0438\n\u0442\u0430\u0431\u043b\u0438\u0446\u044b \u0431\u044b\u043b\u0438 \u043e\u0431\u044a\u0435\u0434\u0435\u043d\u0435\u043d\u044b, \u0442\u043e \u0432\u043e\u0437\u043d\u0438\u043a\u0430\u0435\u0442 \u043e\u0448\u0438\u0431\u043e\u0447\u043d\u043e\u0435 \u0440\u0430\u0441\u043f\u043e\u0437\u043d\u043e\u0432\u0430\u043d\u0438\u0435.\n\n![\u0422\u0430\u0431\u043b\u0438\u0446\u0430](data/README/table.png)\n\n* OCR. \u0420\u0430\u0441\u0441\u043c\u0430\u0442\u0440\u0438\u0432\u0430\u0435\u0442 \u043a\u0430\u0436\u0434\u0430\u044f, \u043e\u0442\u0434\u0435\u043b\u044c\u043d\u043e \u0432\u0437\u044f\u0442\u0430\u044f \u044f\u0447\u0435\u0439\u043a\u0430. 
\u041f\u0440\u043e\u0438\u0437\u0432\u043e\u0434\u0438\u0442\u0441\u044f \u0440\u0430\u0441\u043f\u043e\u0437\u043d\u043e\u0432\u0430\u043d\u0438\u0435.\n\n## \u0421\u0441\u044b\u043b\u043a\u0438\n[1] http://www.machinelearning.ru/wiki/index.php?title=\u041b\u043e\u0433\u0438\u0441\u0442\u0438\u0447\u0435\u0441\u043a\u0430\u044f_\u0444\u0443\u043d\u043a\u0446\u0438\u044f\n\n[2] http://www.sibran.ru/upload/iblock/0e1/0e15d76aff003e3879db51f21ae7f694.pdf\n\n[3] https://www.osapublishing.org/DirectPDFAccess/0CF96649-F614-A5CD-5F118AF53F660BAB_383046/copp-2-1-69.pdf?da=1&id=383046&seq=0&mobile=no\n\n[4] https://www.hindawi.com/journals/cmmm/2015/607407/\n\n[5] http://research.ijcaonline.org/volume55/number4/pxc3882634.pdf\n\n[6] http://article.sapub.org/pdf/10.5923.j.ajis.20120206.01.pdf\n\n[7] https://pdfs.semanticscholar.org/f07f/ec524e5a6dd08163e9f56599ea712cbd0a31.pdf\n\n[8] https://arxiv.org/abs/1505.04597\n\n[9] https://tyvik.ru/posts/pillow-almighty/\n\n[10] https://www.youtube.com/watch?v=jzZ3WVhgi5w\n\n[11] http://lab.alexkuk.ru/ocr/\n\n[12] https://github.com/ulikoehler/OTR\n\n[13] https://github.com/inuyasha2012/answer-sheet-scan\n\n[14] https://github.com/pavitrakumar78/Python-Custom-Digit-Recognition\n\n[15] https://github.com/enigma-ml/textify\n\n[17] https://www.ph4.ru/fonts_fonts.php?fn=all&l=rus&id=2\n\n[18] https://arxiv.org/pdf/1706.07422.pdf\n\n[19] https://github.com/Belval/pdf2image\n\n[20] https://habrahabr.ru/post/351266/\n\n[21] https://habrahabr.ru/post/181580/\n\n# \u0414\u0440\u0443\u0433\u0438\u0435 \u043f\u0440\u043e\u0435\u043a\u0442\u044b\n\n[1] http://tabula.technology/\n\n[2] https://github.com/mpasternak/pdf-table-extractor\n\n[3] https://github.com/DustBytes/PdfTableExtraction\n\n[4] https://github.com/ferhatelmas/docker-pdf-table-extract\n\n[5] https://github.com/jsvine/pdfplumber\n\n[6] https://github.com/tfmorris/pdf2table\n\n[7] https://github.com/ashima/pdf-table-extract\n\n[8] https://github.com/BIDS/agri_tables\n\n[9] 
https://github.com/BIDS/agri_tables\n\n[10] https://github.com/nsi-iff/pypdf2table\n\n[11] https://github.com/mgedmin/pdf2html\n\n[12] https://github.com/maxbelyanin/pdf-tbl-extract\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Hedgehogues/HoChiMinh", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "HoChiMinh", "package_url": "https://pypi.org/project/HoChiMinh/", "platform": "", "project_url": "https://pypi.org/project/HoChiMinh/", "project_urls": { "Homepage": "https://github.com/Hedgehogues/HoChiMinh" }, "release_url": "https://pypi.org/project/HoChiMinh/1.0.0/", "requires_dist": [ "opencv-python (==3.4.2.17)", "scikit-learn (==0.21.0)", "scikit-image (==0.16.2)", "pytesseract (==0.2.0)", "scipy (==1.3.1)", "pdf2image (==0.1.9)", "xlrd (==1.1.0)", "pillow (==5.2.0)" ], "requires_python": "", "summary": "Ho Chi Minh is designed to extract textual information from tables presented in PDF, pictures or other format. 
\u0425\u043e\u0448\u0438\u043c\u0438\u043d \u043f\u0440\u0435\u0434\u043d\u0430\u0437\u043d\u0430\u0447\u0435\u043d \u0434\u043b\u044f \u0438\u0437\u0432\u043b\u0435\u0447\u0435\u043d\u0438\u044f \u0442\u0435\u043a\u0441\u0442\u043e\u0432\u043e\u0439 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438 \u0438\u0437 \u0442\u0430\u0431\u043b\u0438\u0446, \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u0435\u043d\u043d\u044b\u0445 \u0432 PDF, \u043a\u0430\u0440\u0442\u0438\u043d\u043a\u0430\u0445 \u0438\u043b\u0438 \u0438\u043d\u043e\u043c \u0444\u043e\u0440\u043c\u0430\u0442\u0435.", "version": "1.0.0", "yanked": false, "yanked_reason": null }, "last_serial": 6028750, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "8b44a8017ee443a6ece4e837b14c36a1", "sha256": "e439551f759f8657a3c8a9f15157a0683bdd194a687f23df1f811b00856fa878" }, "downloads": -1, "filename": "HoChiMinh-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "8b44a8017ee443a6ece4e837b14c36a1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 29381, "upload_time": "2019-10-25T10:42:19", "upload_time_iso_8601": "2019-10-25T10:42:19.961329Z", "url": "https://files.pythonhosted.org/packages/ed/30/4b249c1888e39f42dd7af8251897b497abef91f4266df92dd90bf72f4c58/HoChiMinh-1.0.0-py3-none-any.whl", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8b44a8017ee443a6ece4e837b14c36a1", "sha256": "e439551f759f8657a3c8a9f15157a0683bdd194a687f23df1f811b00856fa878" }, "downloads": -1, "filename": "HoChiMinh-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "8b44a8017ee443a6ece4e837b14c36a1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 29381, "upload_time": "2019-10-25T10:42:19", "upload_time_iso_8601": "2019-10-25T10:42:19.961329Z", "url": 
"https://files.pythonhosted.org/packages/ed/30/4b249c1888e39f42dd7af8251897b497abef91f4266df92dd90bf72f4c58/HoChiMinh-1.0.0-py3-none-any.whl", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }