{"id":22843424,"url":"https://github.com/milahu/reverse-template-engine","last_synced_at":"2025-04-14T06:53:04.559Z","repository":{"id":63145145,"uuid":"565525841","full_name":"milahu/reverse-template-engine","owner":"milahu","description":"find a template of many similar html files","archived":false,"fork":false,"pushed_at":"2022-11-26T17:08:01.000Z","size":163,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T20:41:06.653Z","etag":null,"topics":["data-extraction","grammar-generation","grammar-generator","parser-generator","reverse-template","reverse-template-engine","schema-generation","schema-generator","structured-data-extraction","structured-text","template-generator","template-induction","tree-automata-induction"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/milahu.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"license.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-13T17:26:20.000Z","updated_at":"2025-03-03T15:26:03.000Z","dependencies_parsed_at":"2022-11-13T21:33:47.785Z","dependency_job_id":null,"html_url":"https://github.com/milahu/reverse-template-engine","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milahu%2Freverse-template-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milahu%2Freverse-template-engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milahu%2Freverse-template-engine/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milahu%2Freverse-template-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/milahu","download_url":"https://codeload.github.com/milahu/reverse-template-engine/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248837278,"owners_count":21169373,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","grammar-generation","grammar-generator","parser-generator","reverse-template","reverse-template-engine","schema-generation","schema-generator","structured-data-extraction","structured-text","template-generator","template-induction","tree-automata-induction"],"created_at":"2024-12-13T02:14:41.199Z","updated_at":"2025-04-14T06:53:04.520Z","avatar_url":"https://github.com/milahu.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# reverse template engine\n\nreverse-engineer a JSX template of many similar HTML files\n\n## similar projects\n\n- https://github.com/sng2c/tmplrev\n- https://github.com/wimpyprogrammer/strings-to-regex\n\n### commercial\n\nmost commercial scraping services offer this feature\n\n## status\n\n- [x] it can compare two files\n- [ ] it can compare multiple files\n- [ ] it can find arrays of similar items\n- [ ] it can find trees of similar items\n- [ ] it can find optional blocks (conditional blocks)\n- [ ] the resulting parser is aware of html syntax\n\n## what\n\na template engine\n\n- takes one template (page.html.tpl) and an array of data (page.json)\n- 
returns an array of texts (page.html)\n\n```\n               ┌────────────┐\nTemplate  ────►│            │\n               │  Template  │\n               │            │\n               │   Engine   │\n  Data[]  ────►│            ├────►  Text[]\n               └────────────┘\n```\n\na **reverse** template engine does the opposite\n\n- takes an array of texts (page.html)\n- returns one template (page.html.tpl) and an array of data (page.json)\n\n```\n               ┌────────────┐\n               │            ├────►  Template\n               │  Reverse   │\n               │  Template  │\n               │   Engine   │\n  Text[]  ────►│            ├────►  Data[]\n               └────────────┘\n```\n\n## based on\n\n### common substrings\n\n- https://github.com/hanwencheng/CommonSubstrings\n- https://stackoverflow.com/questions/2892931/longest-common-substring-from-more-than-two-strings\n\n#### suffix tree\n\n- https://github.com/maclandrol/SuffixTreeJS\n- https://github.com/jayrbolton/node-suffix-tree\n- https://github.com/nyxtom/text-tree\n\n### common pattern\n\n- https://stackoverflow.com/questions/72591638/how-to-find-common-patterns-in-thousands-of-strings\n  - https://en.wikipedia.org/wiki/Phrasal_template\n  - https://en.wikipedia.org/wiki/Collocation_extraction\n  - https://en.wikipedia.org/wiki/Text_mining - parsing, deriving patterns, evaluation\n- https://datascience.stackexchange.com/questions/111739/how-to-find-common-patterns-in-thousands-of-strings\n  - sequence alignment algorithms, usually found in bioinformatics\n    - https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm\n    - https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm\n- https://datascience.stackexchange.com/questions/27058/finding-repeating-string-patterns-in-thousands-of-files\n- Algorithms for Finding Patterns in Strings https://sci-hub.ru/https://doi.org/10.1016/B978-0-444-88071-0.50010-2\n\n### learning 
algorithms\n\nhttp://libalf.informatik.rwth-aachen.de/\n\n\u003cblockquote\u003e\n\nAlgorithm | offline | online | target model\n-- | -- | -- | --\nAngluin's L* |   | X | DFA\nL* (adding counter-examples to columns) |   | X | DFA\nKearns / Vazirani |   | X | DFA\nRivest / Schapire |   | X | DFA\nNL* |   | X | NFA\nRegular positive negative inference (RPNI) | X |   | DFA\nDeLeTe2 | X |   | NFA\nBiermann \u0026 Feldman's algorithm | X |   | NFA\nBiermann \u0026 Feldman's algorithm (using SAT-solving) | X |   | DFA\n\nGlossary\n\nDFA: a deterministic finite-state automaton\n\nNFA: a nondeterministic finite-state automaton\n\nOffline learning: offline learning algorithms are learning algorithms that passively receive a set of classified data. Their goal then is to generalize this set of positive and negative words to some kind of explanation H (e.g. a DFA) which is in conformance with the input. aka non-supervised learning, passive learning.\n\nOnline learning: In contrast to offline learning algorithms, online learning algorithms are capable of actively asking certain kinds of queries to some teacher who is able to classify these queries. This ability lets them infer explanations for the underlying set of already classified words. 
aka supervised learning, active learning.\n\n\u003c/blockquote\u003e\n\n\n### grammar inference\n\n- grammar learning, grammar inference, grammatical inference, grammar induction\n- https://en.wikipedia.org/wiki/Grammar_induction\n- https://datascience.stackexchange.com/questions/78377/learn-common-grammar-pattern-from-set-of-sample-strings\n- https://stackoverflow.com/questions/15512918/grammatical-inference-of-regular-expressions-for-given-finite-list-of-representa\n  - DFA Learning algorithm\n    - DFA = deterministic finite automaton\n  - libalf - automata learning algorithm framework in C++ \n    - http://libalf.informatik.rwth-aachen.de/\n    - https://github.com/libalf/libalf\n- https://github.com/topics/grammatical-inference\n- https://github.com/topics/grammar-induction-algorithms\n  - https://github.com/asaparov/parser - 20 stars - C++ - Semantic parser induction using a generative model of grammar.\n- https://github.com/topics/inductive-learning\n- https://github.com/topics/semantic-parser - NLP, natural language processing, human language processing\n- https://libgen.rs/search.php?req=grammatical+inference\n  - Wojciech Wieczorek - Grammatical Inference. Algorithms Routines and Applications (2017)\n  - Colin de la Higuera - Grammatical Inference. 
Learning Automata and Grammars (2010)\n- https://libgen.rs/search.php?req=grammar+inference\n\n\u003cblockquote\u003e\n  \n1.1 The Problem and Its Various Formulations\n\nLet us start with the presentation of how many variants of a grammatical inference\nproblem we may be faced with.\n\nInformally, we are given a sequence of words and\nthe task is to find a rule that lies behind it.\n\nDifferent models and goals are given\nby response to the following questions:\n\n- Is the sequence finite or infinite?\n- Does the sequence contain only examples (positive words)\n  or also counter-examples (negative words)?\n- Is the sequence of the form: all positive and negative words up to a certain length n?\n- What is meant by the rule: are we satisfied with\n  - regular acceptor\n  - context-free grammar\n  - context-sensitive grammar\n  - other tool?\n- Among all the rules that match the input, should the obtained one be of a minimum size?\n\n\u0026mdash; Wojciech Wieczorek - Grammatical Inference. Algorithms Routines and Applications (2017)\n\n\u003c/blockquote\u003e\n\n\u003cblockquote\u003e\n\nGrammatical Inference\n\nThe problem of inducing, learning or inferring grammars has been studied for decades,\nbut only in recent years has grammatical inference emerged as an independent field with\nconnections to many scientific disciplines, including bio-informatics,\ncomputational linguistics and pattern recognition.\n\nThis book meets the need for a comprehensive and unified\nsummary of the basic techniques and results, suitable for researchers working in these\nvarious areas.\n\nIn Part I, the objects of use for grammatical inference are studied in detail: strings\nand their topology, automata and grammars, whether probabilistic or not.\n\nPart II carefully\nexplores the main questions in the field: what does learning mean? 
How can we associate\ncomplexity theory with learning?\n\nIn Part III the author describes a number of techniques\nand algorithms that allow us to learn from text, from an informant, or through interaction\nwith the environment. These concern automata, grammars, rewriting systems, pattern\nlanguages and transducers.\n\n\u0026mdash; Colin de la Higuera - Grammatical Inference. Learning Automata and Grammars (2010)\n\n\u003c/blockquote\u003e\n\n#### Information Extraction\n\n- https://github.com/topics/information-extraction\n  - https://github.com/tstanislawek/awesome-document-understanding - OCR, computer vision, image analysis\n  - https://github.com/gkiril/oie-resources - NLP\n- https://github.com/topics/data-extraction\n- https://github.com/hi-primus/optimus - 1K stars - Python - Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark. Optimus is an opinionated python library to easily load, process, plot and create ML models that run over pandas, Dask, cuDF, dask-cuDF, Vaex or Spark.\n- https://github.com/hermit-crab/ScrapeMate - 100 stars - JavaScript - Scraping assistant tool. Editing and maintaining CSS/XPath selectors across webpages. 
Available as a Chrome/Chromium and a Firefox extensions.\n- https://github.com/scopashq/typestream - 50 stars - JavaScript - data transformation framework for TypeScript\n\n\u003cblockquote\u003e\n\n2.4.2 Information extraction: automatic wrapper generation\n\nThe quantity of structured data available today due to the exponential growth of the World\nWide Web introduces a number of challenges to scientists interested in grammatical inference.\nHTML and XML data appear as text,\nbut the text is well bracketed through a number of tags\nthat are either syntactic (HTML) and give indications as to how the file should be represented,\nor semantic (XML).\n\nHere is a small piece from an XML file:\n\n```xml\n\u003cbook\u003e\n  \u003cchapter\u003e\n    \u003cname\u003eIntroduction\u003c/name\u003e\n    \u003clength\u003e25 pages\u003c/length\u003e\n    \u003cdescription\u003e\n      Motivations about the book\n    \u003c/description\u003e\n    \u003cexercises\u003e0\u003c/exercises\u003e\n  \u003c/chapter\u003e\n  \u003cchapter\u003e\n    \u003cname\u003eThe Data\u003c/name\u003e\n    \u003clength\u003e18 pages\u003c/length\u003e\n    \u003cdescription\u003e\n      Describe some cases where the data is made of strings\n    \u003c/description\u003e\n    \u003cexercises\u003e0\u003c/exercises\u003e\n  \u003c/chapter\u003e\n  \u003cchapter\u003e\n    \u003cname\u003eStrings and Languages\u003c/name\u003e\n    \u003clength\u003e35 pages\u003c/length\u003e\n    \u003cdescription\u003e\n      Definitions of strings and stringology\n    \u003c/description\u003e\n    \u003cexercises\u003e23\u003c/exercises\u003e\n  \u003c/chapter\u003e\n\u003c/book\u003e\n```\n\nBetween the many problems when working with these files,\none can aim to **find the grammar corresponding to a set of XML files**.\n\nOne very nice application in which grammatical inference has been helpful is that of\nbuilding a wrapper automatically (or semi-automatically).\nA wrapper is supposed to take a web page and 
extract from it the information for which it has been designed.\n\nFor instance, if we need to build a mailing list, the wrapper would find in a web page the information that is needed.\nObviously, the wrapper will work on the code of the web page: the HTML or XML file.\nTherefore, **grammatical inference of tree automata** is an obvious candidate.\n\nAnother feature of the task is that labelling the examples is cumbersome and can be noisy.\nThe proposal is to do this on the fly, through an interaction between the system and the user.\nThis will justify in part the rising interest in **active learning** methods.\n\n\u0026mdash; Colin de la Higuera - Grammatical Inference. Learning Automata and Grammars (2010)\n\n\u003c/blockquote\u003e\n\n##### Information Extraction from Structured Text\n\nhttps://www.google.com/search?q=xml+information+extraction+automatic+wrapper+generation+grammatical+inference+of+tree+automata\n\n- [A Survey of Web Information Extraction Systems. CH Chang, M Kayed, R Girgis, KF Shaalan, IEEE Transactions 2006](http://scholar.cu.edu.eg/sites/default/files/shaalan/files/iesurvey2006.pdf) - 1200 citations\n- [Roadrunner: Towards automatic data extraction from large web sites. V Crescenzi, G Mecca, P Merialdo, VLDB 2001](https://www.vldb.org/conf/2001/P109.pdf) - 1500 citations\n- [Information Extraction in Structured Documents using Tree Automata](https://alpha.uhasselt.be/jan.vandenbussche/ta_paper.pdf)\n- [Semantic Wrappers for Semi-Structured Data Extraction](http://atc1.aut.uah.es/~mdolores/Docs/2008/cole07_WebMantic.pdf) - obtain the semantic generators for a particular Web site\n- [Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference. 
R Kosala, M Bruynooghe, J Van den Bussche 2003](https://www.ijcai.org/Proceedings/03/Papers/060.pdf) - 50 citations\n- [Information Extraction in Structured Documents using Tree Automata Induction](https://users.dcc.uchile.cl/~cgallegu/ie/kosala02information.pdf)\n- [Automatic information extraction from large websites. V Crescenzi, G Mecca 2004](http://www.inf.uniroma3.it/wp-content/uploads/2015/03/2003-76.pdf) - 250 citations\n- [Information extraction from structured documents using k-testable tree automaton inference. R Kosala, H Blockeel, M Bruynooghe 2006](https://www.academia.edu/download/49968352/gltestable.pdf) - 50 citations\n- [Information extraction in structured documents using tree automata induction. R Kosala, JV Bussche, M Bruynooghe. 2002](https://sci-hub.ru/https://doi.org/10.1007/3-540-45681-3_25) - 60 citations\n\n##### Information extraction in structured documents using tree automata induction\n\n[Information extraction in structured documents using tree automata induction. R Kosala, JV Bussche, M Bruynooghe. 2002](https://sci-hub.ru/https://doi.org/10.1007/3-540-45681-3_25) - 60 citations\n\n\u003cblockquote\u003e\n\nA problem, however, in directly applying tree automata to tree-structured\ndocuments such as HTML or XML documents, is that the latter trees are “unranked”:\nthe number of children of a node is not fixed by the label, but is varying.\n\nThere are two approaches to deal with this situation:\n\n1. The first approach is to use a generalized notion of tree automata towards\nunranked tree formalisms (e.g., [17,23]). In such formalisms, the transition\nrules are of the form δ(v, e) → q, where e is a regular expression over Q that\ndescribes a sequence of states.\n2. The second approach is to encode unranked trees into ranked trees, specifically, binary trees, and to use existing tree automata inference algorithms\nfor inducing the tree automaton.\n\nIn this paper we follow the second approach, because it seems less complicated. 
An advantage is that we can use existing learning methods that work\non ranked trees. A disadvantage is that we have to preprocess the trees before\napplying the algorithm.\n\n17:\nC. Pair and A. Quere. Définition et étude des bilangages réguliers. Information\nand Control, 13(6):565–593, 1968.\n\n23:\nM. Takahashi. Generalizations of regular sets and their application to a study of\ncontext-free languages. Information and Control, 27:1–36, 1975.\n\n\u003c/blockquote\u003e\n\n#### Relation extraction\n\nhttps://medium.com/@andreasherman/different-ways-of-doing-relation-extraction-from-text-7362b4c3169e\n\n\u003cblockquote\u003e\n\nDifferent ways of doing Relation Extraction from text\n\nRelation Extraction (RE) is the task of extracting semantic relationships from text, which usually occur between two or more entities. These relations can be of different types. E.g “Paris is in France” states a “is in” relationship from Paris to France. This can be denoted using triples, (Paris, is in, France).\n\nInformation Extraction (IE) is the field of extracting structured information from natural language text. This field is used for various NLP tasks, such as creating Knowledge Graphs, Question-Answering System, Text Summarization, etc. Relation extraction is in itself a subfield of IE.\n\n\u003c/blockquote\u003e\n\nhttps://github.com/ekalgolas/Relation-extraction-using-Semantic-Web\n\n\u003cblockquote\u003e\n\nWe will process unstructured data from web (obtained by crawling some sample websites) by maybe: having a Apache SolR installation locally and manually feeding it web pages. We can use Stanford NLP API to extract semantics from the unstructured text. 
After we extract some semantics, we can construct a structured data format, probably RDF/XML/OWL and also have a visual representation of the graph data using Gruff.\n\n\u003c/blockquote\u003e\n\nhttps://www.slideshare.net/butest/information-extraction-from-html-general-machine-learning\n\n#### search engines\n\n- apache lucene?\n- solr?\n\n\n\n###### A Survey of Web Information Extraction Systems\n\n[A Survey of Web Information Extraction Systems. CH Chang, M Kayed, R Girgis, KF Shaalan, IEEE Transactions 2006](http://scholar.cu.edu.eg/sites/default/files/shaalan/files/iesurvey2006.pdf) - 1200 citations\n\n\u003cblockquote\u003e\n\n4.2 Supervised WI systems\n\nAs shown in the left-bottom of Figure 5, supervised WI systems take a set of web pages labeled with examples of the\ndata to be extracted and output a wrapper. The user provides an initial set of labeled examples and the system\n(with a GUI) may suggest additional pages for the user to\nlabel. For such systems, general users instead of programmers can be trained to use the labeling GUI, thus reducing\nthe cost of wrapper generation. Such systems are SRV, RAPIER, WHISK, WIEN, STALKER, SoftMealy, NoDoSE, DEByE.\n\nSRV is a top-down relational algorithm that generates single-slot extraction rules [8].\nIt regards IE as a kind of classification problem. The input documents are tokenized and\nall substrings of continuous tokens (i.e. text fragments) are\nlabeled as either extraction target (positive examples) or not\n(negative examples). The rules generated by SRV are logic\nrules that rely on a set of token-oriented features (or predicates).\nThese features have two basic varieties: simple and\nrelational. A simple feature is a function that maps a token\ninto some discrete value such as length, character type (e.g.,\nnumeric), orthography (e.g., capitalized) and part of speech\n(e.g., verb). A relational feature maps a token to another\ntoken, e.g. the contextual (previous or next) tokens of the\ninput tokens. 
The learning algorithm proceeds as FOIL,\nstarting with entire set of examples and adds predicates\ngreedily to cover as many positive examples and as few\nnegative examples as possible. For example, to extract the\nrating score for our running example, SRV might return\nrule like Figure 9(a), which says rating is a single numeric\nword and occurs within a HTML list tag.\n\nhttps://www.cs.utexas.edu/~ml/rapier.html - ftp://ftp.cs.utexas.edu/pub/mooney/rapier - RAPIER is a bottom-up inductive learning system for learning information extract rules. It has been tested on several domains and performs comparably to or slightly better than other recent learning system for this task.\n\nRAPIER also focuses on field-level extraction but uses bottom-up (compression-based) relational learning algorithm\n[7], i.e. it begins with the most specific rules and then replacing them with more general rules.\nRAPIER learns single slot extraction patterns that make use of syntactic and\nsemantic information including part-of-speech tagger or a\nlexicon (WordNet). The extraction rules consist of three distinct patterns. The first one is the pre-filler pattern that\nmatches text immediately preceding the filler, the second\none is the pattern that match the actual slot filler, finally the\nlast one is the post-filler pattern that match the text immediately following the filler. As an example, Figure 9(b)\nshows the extraction rule for the book title, which is immediately preceded by words “Book”, “Name”, and “\u003c/b\u003e”,\nand immediately followed by the word “\u003cb\u003e”. The “Filler\npattern” specifies that the title consists of at most two\nwords that were labeled as “nn” or “nns” by the POS tagger\n(i.e., one or two singular or plural common nouns).\n\nWIEN: Kushmerick identified a family of six wrapper\nclasses, LR, HLRT, OCLR, HOCLRT, N-LR and N-HLRT for\nsemi-structured Web data extraction [9]. WIEN focuses on\nextractor architectures. 
The first four wrappers are used for\nsemi-structured documents, while the remaining two\nwrappers are used for hierarchically nested documents. The\nLR wrapper is a vector of 2K delimiters for a site containing\nK attributes. For example, the vector (‘Reviewer name\n\u003c/b\u003e’, ‘\u003cb\u003e’, ‘Rating \u003c/b\u003e’, ‘\u003cb\u003e’, ‘Text \u003c/b\u003e’, ‘\u003c/li\u003e’) can\nbe used to extract 3-slot book reviews for our running example. The HLRT class uses two additional delimiters to\nskip over potentially-confusing text in either the head or\ntail of the page. The OCLR class uses two additional delimiters to identify an entire tuple in the document, and then\nuses the LR strategy to extract each attribute in turn. The\nHOCLRT wrapper combines the two classes OCLR and\nHLRT. The two wrappers N-LR and N-HLRT are extension\nof LR and HLRT and designed specifically for nested data\nextraction. Note that, since WIEN assumes ordered attributes in a data record, missing attributes and permutation of\nattributes can not be handled. \n\nWHISK uses a covering learning algorithm to generate\nmulti-slot extraction rules for a wide variety of documents\nranging from structured to free text [6]. When applying to\nfree text, WHISK works best with input that has been annotated by a syntactic analyzer and a semantic tagger. WHISK\nrules are based on a form of regular expression patterns\nthat identify the context of relevant phrases and the exact\ndelimiters of those phrases. It takes a set of hand-tagged\ntraining instances to guide the creation of rules and to test\nthe performance of the proposed rules. WHISK induces\nrules top-down, starting from the most general rule that\ncovers all instances, and then extending the rule by adding\nterms one at a time. For example, to generate 3-slot book\nreviews, it start with empty rule “*(*)*(*)*(*)*”, where each\nparenthesis indicates a phrase to be extracted. 
The phrase\nwithin the first set of parentheses is bound to the first variable $1, and the second to $2, and forth. Thus, the rule in\nFigure 10 can be used to extract our 3-slot book reviews for\nour running example. If part of the input remains after the\nrule has succeeded, the rule is re-applied to the rest of the\ninput. Thus, the extraction logic is similar to the LR wrapper for WIEN.\n\nhttps://dl.acm.org/doi/abs/10.1145/276305.276330 - NoDoSE — a tool for semi-automatically extracting structured and semistructured data from text documents. B. Adelberg, SIGMOD 1998\n\nNoDoSE: Opposed to WIEN, where training examples are\nobtained from some oracles that can identify interesting\ntypes of fields within a document, NoDoSE provides an\ninteractive tool for users to hierarchically decompose semistructured documents (including plain text or HTML pages)\n[23]. Thus, NoDoSE is able to handle nested objects. The\nsystem attempts to infer the format/grammar of the input\ndocuments by two heuristic-based mining components: one\nthat mines text files and the other parses HTML code. Similar to WIEN, the mining algorithms try to find common\nprefix and suffix as delimiters for various attributes. Although it does not assume the order of attributes within a\nrecord to be fixed, it seeks to find a totally consistent ordering for various attributes in a record. The result of this task\nis a tree that describes the structure of the document. For\nexample, to generate a wrapper for the running example,\nthe user can interact with the NoDoSE GUI to decompose\nthe document as a record with two fields: a book title (an\nattribute of type string) and a list of Reviewer, which is in\nturn a record of the three fields RName (string), Rate (integer), and Text (string). 
Next, NoDoSE then automatically\nparses them and generates the extraction rules.\n\nSoftMealy: In order to handle missing attributes and attribute permutations in input, Hsu and Dung introduce the\nidea of finite-state transducer (FST) to allow more variation\non extractor structures [10]. A FST consists of two different\nparts: the body transducer, which extract the part of the page\nthat contains the tuples (similar to HLRT in WIEN), and the\ntuple transducer which iteratively extracts the tuples from\nthe body. The tuple transducer accepts a tuple and returns\nits attributes. Each distinct attribute permutation in the\npage can be encoded as a successful path from start state to\nthe end state of the tuple transducer; and the state transitions are determined by matching contextual rules that describe the context delimiting two adjacent attributes. Contextual rules consist of individual separators that represent\ninvisible borderlines between adjacent tokens; and an inductive generalization algorithm is used to induce these\nrules from training examples. Figure 11 shows an example\nof FST that can be used to extract the attributes of the book\nreviews: the reviewer name (N), the rating (R), and the\ncomment (T). In addition to the begin and end states, each\nattribute, A, is followed by a dummy state, Ā. Each arc is\nlabeled with the contextual rule that enables the transition\nand the tokens to output. For example, when the state transition reaches to the R state, the transducer will extract the\nattribute R until it matches the contextual rules s\u003cR, R̄\u003e\n(which is composed of s\u003cR, R̄\u003e^L and s\u003cR, R̄\u003e^R). The state\nR and the end state are connected if we assume no comment can occur.\n\nSTALKER is a WI system that performs hierarchical data\nextraction [11]. It introduces the concept of embedded catalog (EC) formalism to describe the structure of a wide range\nof semi-structured documents. 
The EC description of a page\nis a tree-like structure in which the leaves are the attributes\nto be extracted and the internal nodes are lists of tuples. For\neach node in the tree, the wrapper needs a rule to extract\nthis node from its parent. Additionally, for each list node,\nthe wrapper requires a list iteration rule that decomposes\nthe list into individual tuples. Therefore, STALKER turns\nthe difficult problem of extracting data from an arbitrary\ncomplex document into a series of easier extraction tasks\nfrom higher level to lower level. Moreover, the extractor\nuses multi-pass scans to handle missing attributes and multiple permutations. The extraction rules are generated by\nusing of a sequential covering algorithm, which starts from\nlinear landmark automata to cover as many positive examples as possible, and then tries to generate new automata\nfor the remaining examples. A Stalker EC tree that describes\nthe data structure of the running example is shown in Figure 12(a), where some of the extraction rules are shown in\nFigure 12(b). For example, the reviewer ratings can be extracted by first applying the List(Reviewer) extraction rule\n(which begins with “\u003col\u003e” and ends with “\u003c/ol\u003e”) to the\nwhole document, and then the Rating extraction rule to\neach individual reviewer, which is obtained by applying the\niteration rule for List(Reviewer). In a way, STALKER is\nequivalent to multi-pass Softmealy [30]. However, the extraction patterns for each attribute can be sequential as opposed to the continuous patterns used by Softmealy. \n\nDEByE (Data Extraction By Example): Like NoDoSE, DEByE provides an interactive GUI for wrapper generation\n[24], [25]. The difference is that in DEByE the user marks\nonly atomic (attribute) values to assemble nested tables,\nwhile in NoDoSE the user decomposes the whole document\nin a top-down fashion. In addition, DEByE adopts a bottom-up extraction strategy which is different from other\napproaches. 
The main feature of this strategy is that it extracts atomic components first and then assembles them\ninto (nested) objects. The extraction rules, called attribute-value pair patterns (AVPs), for atomic components are identified by context analysis: starting with context length 1, if\nthe number of matches exceeds the estimated number of\noccurrences provided by the user, it adds additional terms\nto the pattern until the number of matches is less than the\nestimated one. For example, DEByE generates AVP patterns, “Name\u003c/b\u003e* \u003cb\u003eReviews”, “Name\u003c/b\u003e*\u003cb\u003e Rating”, “Rating\u003c/b\u003e*\u003cb\u003eText” and “\u003c/b\u003e*\u003cli\u003e” for book\nname, reviewer name, rating and comment respectively (*\ndenotes the data to be extracted). The resulting AVPs are\nthen used to compose an object extraction pattern (OEPs).\nOEPs are trees containing information on the structure of\nthe document. The sub-trees of an OEP are themselves\nOEPs, modeling the structure of component objects. At the\nbottom of the hierarchy lie the AVPs that used to identify\natomic components. The assemble of atomic values into\nlists or tuples is based on the assumption that various occurrences of objects do not overlap each other. For non-homogeneous objects, the user can specify more than one\nexample object, thus creating a distinct OEP for each example.\n\n4.3 Semi-Supervised IE systems\n\nThe systems that we categorize as semi-supervised IE systems include IEPAD, OLERA and Thresher. As opposed to\nsupervised approach, OLERA and Thresher accept a rough\n(instead of a complete and exact) example from users for\nextraction rule generation, therefore they are called semi-supervised. IEPAD, although requires no labeled training\npages, post-effort from the user is required to choose the\ntarget pattern and indicate the data to be extracted. 
All\nthese systems are targeted for record-level extraction tasks.\nSince no extraction targets are specified for such systems, a\nGUI is required for users to specify the extraction targets\nafter the learning phase. Thus, users’ supervision is involved. \n\nIEPAD is one of the first IE systems that generalize extraction patterns from unlabeled Web pages [31]. This method\nexploits the fact that if a Web page contains multiple (homogeneous) data records to be extracted, they are often\nrendered regularly using the same template for good visualization. Thus, repetitive patterns can be discovered if the\npage is well encoded. Therefore, learning wrappers can be\nsolved by discovering repetitive patterns. IEPAD uses a\ndata structure called PAT trees which is a binary suffix tree\nto discover repetitive patterns in a Web page. Since such a\ndata structure only records the exact match for suffixes,\nIEPAD further applies center star algorithm to align multiple strings which start from each occurrence of a repeat and\nend before the start of next occurrence. Finally, a signature\nrepresentation is used to denote the template to comprehend all data records. For our running example, only page\npe2 can be used as input to IEPAD. By encoding each tag as\nan individual token and any text between two adjacent tags\nas a special token “T”, IEPAD discover the pattern\n“\u003cli\u003e\u003cb\u003eT\u003c/b\u003eT\u003cb\u003eT\u003c/b\u003eT \u003cb\u003eT\u003c/b\u003eT\u003c/li\u003e” with two\noccurrences. The user then has to specify, for example, the\n2nd, 4th and 6th “T” tokens, as the relevant data (denoting\nreviewer name, rating and comment, respectively).\n\nOLERA is a semi-supervised IE system that acquires a\nrough example from the user for extraction rule generation\n[32]. OLERA can learn extraction rules for pages containing\nsingle data records, a situation where IEPAD fails. OLERA\nconsists of 3 main operations. 
(1) Enclosing an information\nblock of interest: where the user marks an information block\ncontaining a record to be extracted for OLERA to discover\nother similar blocks (using approximate matching technique) and generalize them to an extraction pattern (using\nmultiple string alignment technique). (2) Drilling-down/rolling-up an information slot: drilling-down allows the user to\nnavigate from a text fragment to more detailed components, whereas rolling-up combines several slots to form a\nmeaningful information unit. (3) Designating relevant information slots for schema specification as in IEPAD.\n\nThresher [33] is also a semi-supervised approach that is\nsimilar to OLERA. The GUI for Thresher is built in the Haystack\nbrowser which allows users to specify examples of\nsemantic contents by highlighting them and describing\ntheir meaning (labeling them). However, it uses tree edit\ndistance (instead of string edit distance as in OLERA) between the DOM subtrees of these examples to create a\nwrapper. Then it allows the user to bind the semantic web\nlanguage RDF (Resource Description Framework) classes\nand predicates to the nodes of these wrappers.\n\nDEPTA (Data Extraction based on Partial Tree Alignment):\nLike IEPAD and DeLa, DEPTA can be only applicable to\nWeb pages that contain two or more data records in a data\nregion. However, instead of discovering repeat substring\nbased on suffix trees, which compares all suffixes of the\nHTML tag strings (as the encoded token string described in\nIEPAD), it compares only adjacent substrings with starting\ntags having the same parent in the HTML tag tree (similar\nto HTML DOM tree but only tags are considered). The insight is that data records of the same data region are reflected in the tag tree of a Web page under the same parent\nnode. Thus, irrelevant substrings do not need to be compared together as that in suffix-based approaches. 
Furthermore, the substring comparison can be computed by string\nedit distance instead of exact string match when using suffix trees where only completely similar substrings are identified. The described algorithm, called MDR [38], works in\nthree steps. First, it builds an HTML tag tree for the Web\npage as shown in Figure 14 where text strings are disregarded.\nSecond, it compares substrings for all children under the same parent. For example, we need to make two\nstring comparison, (b1, b2) and (b2, ol), under parent node\n\u003cbody\u003e, where the tag string node \u003col\u003e is represented by\n“\u003cli\u003e\u003cb\u003e\u003cb\u003e\u003cb\u003e\u003cli\u003e\u003cb\u003e\u003cb\u003e\u003cb\u003e”. If the similarity is\ngreater than a predefined threshold (as shown in the\nshaded nodes in Figure 14), the nodes are recorded as data\nregions. The third step is designed to handle situations\nwhen a data record is not rendered contiguously as assumed in previous works. Finally, the recognition of data\nitems or attributes in a record is accomplished by partial\ntree alignment [39]. Tree alignment is better than string\nalignment for it considers tree structure, thus, reducing the\nnumber of possible alignments. The algorithm first chooses\nthe record tree with the largest number of data items as\ncenter and then matches other record trees to the center\ntree. However, DEPTA only adds tag nodes to the center\ntree when the positions of the tag nodes can be uniquely\ndetermined in the center tree. For remained nodes, they are\nprocessed in the next iteration after all tag trees are processed. Note that DEPTA assumes that non-tag tokens are\ndata items to be extracted, thus, it extracts not only the reviewer name, rating and comments, but also the labels\n“Reviewer Name”, “Rating”, and “Text” for page pe2\n in our\nrunning example. Further, DEPTA is limited to handle\nnested data records. 
So, a new algorithm, NET, is developed to handle such data records by performing a postorder traversal of the visual-based tag tree of a Web page\nand matching subtrees in the process using a tree edit distance method and visual cues [40]. \n\nOf the unsupervised WI approaches, one important issue\nis to differentiate the role of each token: either a data token\nor template token. Some assume that every HTML tag is\ngenerated by the template and other tokens are data items\nto simplify the issue (as in DeLa and DEPTA). However, the\nassumption does not hold for many collections of pages\n(therefore, IEPAD and OLERA simply leave the issue to\ndistinguish between data and template tokens to the users).\nRoadRunner also assumes that every HTML tag is generated by the template, but other matched string tokens are\nalso considered as part of the template. In comparison,\nEXALG has the most detailed tokenization method while\nmore flexible assumption where each token can be a template token if there are enough tokens to form frequently\noccurring equivalence class.\n\nOn the other hand, DEPTA conducts the mining process\nfrom single Web pages, while RoadRunner and EXALG do\nthe analysis from multiple Web pages (While DeLa takes\nadvantages of multiple input pages for data-rich section\nextraction and generalized pattern construction, it discovers\nC-repeat patterns from single Web pages.). The later, in our\nviewpoint, is the key point that is used to differentiate the\nrole of each token. Thus, multiple pages of the same class is\nalso used to discover data rich section (as in DeLa) or\neliminate noisy information (as in [41]). Meanwhile, the\nadaptation of tree matching in DEPTA (as well as Thresher)\nalso provides better result than string matching techniques\nused in IEPAD and RoadRunner. EXALG similarly does not\nmake full use of the tree structure although the DOM tree\npath information is used for differentiating token roles. 
Finally, since information extraction is only a part of a\nwrapper program or information integration systems, additional\ntasks like page fetching, label assignment, and mapping\nwith other web data sources are remained to be processed.\n\nDue to space limitation, we are not able to compare all\nresearches here. For example, ViNTs [42] is a record-level\nwrapper generation system which exploits visual information to find separators between data regions from search\nresult pages. However, the algorithm can be only applicable\nto pages that contain at least four data records. Another\nrelated approach that has been applied on Web sites for\nextracting information from tables is [43]. The technique\nrelies on the use of additional links to a detail page containing additional information about that item. In parallel to the\nefforts to detect Web tables, other researchers have worked\nin detecting tables in plain text documents (such as government statistical reports) and segmenting them into records [44]. Since these approaches do not address the problem of distinguish data tokens from template tokens, we\nconsider them as semi-supervised approaches.\n\n\u003c/blockquote\u003e\n\n#### An XML-enabled data extraction toolkit for web sources\n\nAn XML-enabled data extraction toolkit for web sources. Ling Liu, C. Pu, Wei Han, 2001\n\nhttp://maaz.ihmc.us/rid=1228288297429_1842851864_16163/An%20XML%20enabled%20data%20extraction%20toolkit.pdf\n\n### Template Induction\n\n#### Discovering Textual Structures: Generative Grammar Induction using Template Trees\n\nDiscovering Textual Structures: Generative Grammar Induction using Template Trees. Thomas Winters, L. D. Raedt, 2020\n\nhttps://arxiv.org/abs/2009.04530\n\n#### Latent Template Induction with Gumbel-CRFs\n\nLatent Template Induction with Gumbel-CRFs. Yao Fu, Chuanqi Tan, Alexander M. 
Rush, 2020\n\nhttps://arxiv.org/abs/2011.14244\n\n#### Html Tag Based Web Data Extraction and Tree Merging From Template Page\n\nHtml Tag Based Web Data Extraction and Tree Merging From Template Page. A. Chandrasekhar, P. V. S. Readdy, 2014\n\nhttps://www.academia.edu/download/34193222/V2I3-0067.pdf\n\n#### Tree Automata\n\nhttps://github.com/topics/tree-automata\n\n\u003cblockquote\u003e\n\nWhat about trees and graphs?\n\nThe original goal was to cover extensively the field of grammatical inference.\n\nThis of\ncourse meant discussing in detail **tree automata** and grammars, giving the main adaptation\nof classical string algorithms to the case of trees, and even dealing with those works specific\nto trees. As work progressed it became clear that **learning tree automata** and grammars was\ngoing to involve at least as much material as with strings.\n\nThe conclusion was reached to\nonly sketch the specificities here, leaving the matter largely untouched, with everything to\nbe written. This of course is not justified by the importance of the question, but only by the\neditorial difficulty and the necessity to stop somewhere. Of course, after trees will come\nthe question of graphs...\n\n...\n\nExtensions of the above mechanisms (automata, grammars) to deal with trees and\ngraphs have been proposed. For the case of **tree automata** a general survey is (Comon\net al., 1997) and for graph grammars there are a number of possible sources (Courcelle,\n1991).\n\n...\n\nAlgorithm RPNI has been successfully adapted to **tree automata** (García \u0026 Oncina,\n1993), and infinitary languages (de la Higuera \u0026 Janodet, 2004).\n\n...\n\nIn the field of computational linguistics, efforts have been made to learn context-free\ngrammars from more informative data, such as trees (Charniak, 1996),\nfollowing theoretical results by Yasubumi Sakakibara (Sakakibara, 1992). 
Learning from structured data has\nbeen a line followed by many: learning tree automata (Fernau, 2002, Habrard, Bernard \u0026\nJacquenet, 2002, Knuutila \u0026 Steinby, 1994), or context-free grammars from bracketed data\n(Sakakibara, 1990) allows to obtain better results, either with queries (Sakakibara, 1992),\nregular distributions (Carrasco, Oncina \u0026 Calera-Rubio, 2001, Kremer, 1997, Rico-Juan,\nCalera-Rubio \u0026 Carrasco, 2002), or negative information (García \u0026 Oncina, 1993). This\nhas also led to different studies concerning the probability estimation of such grammars\n(Calera-Rubio \u0026 Carrasco, 1998, Lari \u0026 Young, 1990).\n\n...\n\n19.3 About trees and graphs and more structure\n\nWe have left untouched (or nearly untouched) the question of learning from data that would\nbe more structured than strings. There are many researchers working on learning tree grammars and tree automata.\nIn some cases the work consists of adapting a string language\ninference algorithm to suit the tree case, but in many others the problems are new and\nnovel algorithms are needed. Furthermore, in practice, in many cases the tree structures\nallow us to model the data in a much more accurate fashion.\n\n...\n\n19.5 About learning itself\n\nA view defended by some is that learning is about compressing; a compression with loss,\nwhere the loss itself corresponds to the gain in learning.\n\nThroughout the book we have viewed algorithms whose chief goal was to get hold of\nenormous amounts of data and somehow digest this into a simple set of rules which in turn\nallowed us to somehow replace the data by the grammar. In other words, the feeling we\nhave reached is that learning is all about forgetting.\n\n\u0026mdash; Colin de la Higuera - Grammatical Inference. 
Learning Automata and Grammars (2010)\n\n\u003c/blockquote\u003e\n\n### automata learning\n\n- https://github.com/topics/automata-learning\n- https://github.com/topics/dfa-learning\n- https://automata.cs.ru.nl/Tools - Automata Wiki\n- https://github.com/DES-Lab/AALpy - 100 stars - Python - Active Automata Learning Library\n- https://github.com/steynvl/inferrer - 30 stars - Python - automata learning library\n- https://gitlab.lis-lab.fr/dev/scikit-splearn/ - Python scikit toolbox for spectral learning algorithms. These algorithms aim at learning Weighted Automata (WA) using what is named a Hankel matrix.\n- https://pypi.org/project/pylstar/ - Python implementation of the LSTAR Grammatical inference algorithm\n- https://github.com/LearnLib/learnlib - 150 stars - Java - Library for Automata Learning and Experimentation.\n  - https://bitbucket.org/learnlib/ralib/ - active learning algorithms for register automata\n  - https://learnlib.de/\n  - https://github.com/Learnlib/learnlib/wiki\n- https://github.com/lorisdanto/symbolicautomata - 50 stars - Java - Library for symbolic automata and symbolic visibly pushdown automata\n- http://www.italia.cs.ru.nl/tomte/ - tool for learning register automata. The tool uses counterexample guided abstraction refinement to automatically construct abstractions, and uses a Mealy machine learner (such as LearnLib) as a back-end.\n- https://github.com/mvcisback/dfa-identify - identifying (learning) minimal DFAs from labeled examples by reduction to SAT\n- https://github.com/ctlab/DFA-Inductor-py - passive inference via reduction to SAT\n- https://gitlab.science.ru.nl/rick/z3gi - Satisfiability Modulo Theories (SMT) backed passive learning algorithm. Z3GI is a Python tool and library that uses the [Z3 SMT solver](https://github.com/Z3Prover/z3) for learning minimal consistent state machine models from labeled strings or input/output traces. 
The ideas the tool is based on and the experiments conducted are described in the publication [Model Learning as a Satisfiability Modulo Theories Problem](https://gitlab.science.ru.nl/rick/z3gi/-/blob/master/extended.pdf) due to appear at the LATA 2018 conference.\n- https://pypi.org/project/lstar/ - Active learning algorithm based L* derivative\n- https://wcventure.github.io/Active-Automata-Learning/ - A Quick Survey of Active Automata Learning\n- https://blog.csdn.net/wcventure/article/details/79144074 - Angluin L* algorithm\n- https://arxiv.org/pdf/2209.14031.pdf Active vs. Passive: A Comparison of Automata Learning Paradigms for Network Protocols\n\n### Substring-Based Algorithms\n\n#### Alignment-Based Learning\n\nauthor: van Zaanen 2000\n\nvan Zaanen M (2000) ABL: alignment-based learning. In: Proceedings of the 18th international\nconference on computational linguistics (COLING), association for computational linguistics,\npp 961–967\n\nhttps://ilk.uvt.nl/menno/research/software/abl\n\n\u003e ABL learns structure from plain sequences (for example natural language sentences) by comparing them. Based on the parts of the sequences that are the same and parts that are not the same in two sequences, structure is inserted in the sequences.\n\n#### Grammatical Inference\n\n##### Error-Correcting Grammatical Inference\n\nauthor: Rulot and Vidal 1987\n\nRulot H, Vidal E (1987) Modelling (sub)string-length based constraints through a grammatical\ninference method. In: Kittler J, Devijver P (eds) Proceedings of the NATO advanced study institute\non pattern recognition theory and applications. Springer, pp 451–459\n\n##### ADIOS\n\nauthor: Solan et al. 2005\n\nSolan Z, Horn D, Ruppin E, Edelman S (2005) Unsupervised learning of natural languages. Proc\nNat Acad Sci USA 102(33):11,629–11,634\n\n##### Data-Oriented Parsing\n\nauthor: Bod 2006\n\nBod R (2006) An all-subtrees approach to unsupervised parsing. 
In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the ACL, association\nfor computational linguistics, pp 865–872\n\n### tree diff\n\n- https://github.com/search?l=JavaScript\u0026p=2\u0026q=tree+diff\u0026type=Repositories\n- https://github.com/Matt-Esch/virtual-dom\n- https://github.com/syntax-tree/unist-diff - part of unifiedjs\n- https://github.com/Tchanders/treeDiffer.js\n\n### data transformers\n\n- https://github.com/scopashq/typestream\n\n### template-based parsing\n\ndifferent from \"template finding\"\n\naka: parser-generators\n\n#### javascript\n\n- https://github.com/blerik/razor.js\n- https://github.com/mbrevoort/reverse-string-template\n- https://github.com/akushnikov/template-parser\n- https://github.com/lezer-parser\n- https://github.com/coderaiser/putout\n\n#### python\n\n- https://github.com/arshaw/scrapemark\n- https://pypi.org/project/fado/ - manipulation of automata, manipulation of regular languages, high-level programming, prototyping of algorithms\n\n#### java\n\n- https://github.com/juzraai/reverse-template-engine\n\n#### C++\n\n- https://github.com/html-extract/hext - Domain-specific language for extracting structured data from HTML documents\n\n## see also\n\n- https://stackoverflow.com/questions/18168707/javascript-templating-language-in-reverse\n- https://softwareengineering.stackexchange.com/questions/241485/do-input-template-languages-exist\n- https://reverseengineering.stackexchange.com/questions/1331/automated-tools-for-file-format-reverse-engineering\n- https://stackoverflow.com/questions/42727092/find-similar-branches-in-multiple-trees\n- Rajkovic, M., Stankovic, M., \u0026 Marković, I. (2011). A Template Engine for Parsing Objects from Textual Representations. 
(https://doi.org/10.1063/1.3636860) ([semanticscholar.org](https://www.semanticscholar.org/paper/A-Template-Engine-for-Parsing-Objects-from-Textual-Rajkovic-Stankovic/4c5ffbe2fae274e1cdacb8b39d564c6a2e91cf5c)) ([sci-hub.ru](https://sci-hub.ru/https://doi.org/10.1063/1.3636860)) - nothing new there. captain obvious stuff\n\n## keywords\n\n- tree compression of many similar trees\n- reverse-engineer the template of many similar html files\n- find template from sample data\n- template-finding\n- template-detection\n- template-generation\n- template-generator\n- grammar generator\n- generate schema from data\n- approximate schema from data\n- find common structure of many similar html files\n- generate parser of many similar input files\n- reverse-engineering a JSX template from many rendered pages\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmilahu%2Freverse-template-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmilahu%2Freverse-template-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmilahu%2Freverse-template-engine/lists"}