{"id":13707958,"url":"https://github.com/petersonjr/MetadataCrawler","last_synced_at":"2025-05-06T07:31:15.470Z","repository":{"id":40998462,"uuid":"201326745","full_name":"petersonjr/MetadataCrawler","owner":"petersonjr","description":"A simple tool to extract metadata from relational databases","archived":false,"fork":false,"pushed_at":"2023-06-14T22:33:05.000Z","size":80,"stargazers_count":7,"open_issues_count":3,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-13T17:45:22.553Z","etag":null,"topics":["avro","crawler","database-schemas","java","jdbc","metadata","rdms","relational-databases"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/petersonjr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-08T19:47:16.000Z","updated_at":"2024-06-25T22:20:58.000Z","dependencies_parsed_at":"2024-11-16T04:02:28.897Z","dependency_job_id":null,"html_url":"https://github.com/petersonjr/MetadataCrawler","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersonjr%2FMetadataCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersonjr%2FMetadataCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersonjr%2FMetadataCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersonjr%2FMetadataCrawler/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/petersonjr","download_url":"https://codeload.github.com/petersonjr/MetadataCrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252639965,"owners_count":21780848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","crawler","database-schemas","java","jdbc","metadata","rdms","relational-databases"],"created_at":"2024-08-02T22:01:50.686Z","updated_at":"2025-05-06T07:31:15.205Z","avatar_url":"https://github.com/petersonjr.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"# MetadataCrawler\r\nA simple tool to extract metadata from relational databases\r\n\r\n## General Info\r\n\r\nMetadataCrawler is yet another tool to extract metadata from relational database management systems. It connects to a RDMS and extracts metadata via JDBC, making use of [Apache Metamodel](https://metamodel.apache.org/). MetadataCrawler collects information about all catalogs, schemas, tables and columns of a RDMS and produces output in Avro or JSON files.\r\n\r\n## Why?\r\n\r\nThere are great tools that do the same thing (and much more) as MC, such as [SchemaCrawler](https://www.schemacrawler.com/) or [SchemaSpy](http://schemaspy.org/). In fact, I recommend taking a look at them before trying MetadataCrawler. The motivations to build MC were:\r\n\r\n- Crawl all catalogs from a RDMS at once. Most available tools only get information from one catalog at a time.\r\n- Produce output that is easy to process. 
I found the Avro format a good option for this. Avro forces you to define a schema, which serves not only as documentation, but also as an input to [generate code to process data in several languages](https://avro.apache.org/docs/1.8.2/index.html).\r\n\r\nIn short, MetadataCrawler has two main goals: to easily collect metadata from all catalogs of a RDMS and produce information in a simple unified schema that is easy to process.\r\n\r\nAs a bonus, check how you can use [Hive to process data generated with MetadataCrawler](hive.md).\r\n\r\n## How to use\r\n\r\nMetadataCrawler is run as a standalone Java application. Just download the [latest release](https://github.com/petersonjr/MetadataCrawler/releases/latest) and run:\r\n\r\n```\r\njava -jar simple-metadata-crawler.jar --url  \"jdbc:mysql://localhost:33306\" -u root -p \"metadata\" --format json -o mysql.json\r\n```\r\n\r\n**Make sure that the database user has read (SELECT) permissions on the catalogs you want to get metadata from. See more [here](https://github.com/schemacrawler/SchemaCrawler/issues/256#issuecomment-509652994).**\r\n\r\nYou may also filter catalogs using a regular expression:\r\n\r\n```\r\njava -jar simple-metadata-crawler.jar --url jdbc:sqlserver://localhost:1433 --catalogs \"EMPLOYEES|SALES\" -u root -p \"pass\"  --format json -o sqlserver_f.json\r\n```\r\n\r\nTo see the full list of options just run:\r\n\r\n```\r\njava -jar simple-metadata-crawler-0.1.0.jar --help\r\n```\r\n\r\nThe tool is bundled with some JDBC drivers for convenience, namely SQL Server, MySQL and PostgreSQL drivers. **If you want to run it with other JDBC drivers such as Oracle or Informix, you have to generate a jar with those bundled dependencies. 
Just follow the [Build Instructions](#build-instructions).**\r\n\r\n\r\n### Password param\r\n\r\nYou can provide the user password on the command line (as in the previous example), interactively, or via an environment variable:\r\n\r\nInteractively:\r\n\r\n```\r\njava -jar simple-metadata-crawler.jar --url  \"jdbc:mysql://localhost:33306\" -u root -p --format json -o mysql.json\r\n```\r\n \r\nVia environment variable (which is [safer](https://picocli.info/#_interactive_password_options)):\r\n\r\n```\r\nexport JDBC_PASSWORD=\"pass\" \u0026\u0026 java -jar simple-metadata-crawler.jar \"jdbc:mysql://localhost:33306\" -u root --format json -o mysql.json\r\n```\r\n\r\nSpecifying which environment variable to read:\r\n\r\n```\r\nexport MYSQLPASS=\"pass\" \u0026\u0026 java -jar simple-metadata-crawler.jar --url \"jdbc:mysql://localhost:33306\" -u root --password:env MYSQLPASS --format avro -o mysql.avro\r\n```\r\n\r\n### SQL Server execution with Integrated Authentication On Windows\r\n\r\nIt is possible to connect to SQL Server with integrated authentication on Windows. Follow the [official instructions](https://docs.microsoft.com/pt-br/sql/connect/jdbc/building-the-connection-url?view=sql-server-2017):\r\n\r\n* Copy the sqljdbc_auth.dll file to a directory on your system, or find where it is located.\r\n* Run MetadataCrawler with the java.library.path property set to the directory where the dll file is located. 
Also make sure to use IntegratedSecurity=true in the JDBC URL:\r\n\r\n```\r\njava -Djava.library.path=\"C:\\Microsoft JDBC Driver 6.4 for SQL Server\\sqljdbc_\u003cversion\u003e\\enu\\auth\\x86\" -jar simple-metadata-crawler.jar --url  \"jdbc:sqlserver://server:port;IntegratedSecurity=true\" -u root -p \"pass\" --format json -o sqlserver.json\r\n```\r\n\r\n## Batch run\r\n\r\nTo easily get started on running a crawl for several servers in batch, you can use the **mcrawl_batch.sh** file, which defines a folder structure and facilitates the execution of MetadataCrawler for more than one source.\r\n\r\nJust follow the steps:\r\n\r\n* Edit the following line of the **mcrawl_batch.sh** file to match your environment:\r\n\r\n```\r\nCMD=\"java -jar target/simple-metadata-crawler-0.1.0.jar\"\r\n```\r\n\r\n* Create a **run_crawl.sh** file, as in the example below:\r\n\r\n```\r\n#!/bin/bash\r\n\r\n\r\nDATADIR=\"./data\"\r\nsource mcrawl_batch.sh\r\n\r\n# The params for crawl function are: jdbc_url, user, password, output_file_name, catalogs_filter\r\n\r\n# MySQL\r\ncrawl \"jdbc:mysql://localhost:3306\" \"user\" \"password\" \"mysql_localhost\"\r\n\r\n# SQL Server\r\ncrawl \"jdbc:sqlserver://localhost:1433\" \"root\" \"passwd\" \"sqlserver_localhost\" \"EMPLOYEES|EVENTS\"\r\n```\r\nThis file just separates the params of crawl executions from the crawling code. Run it:\r\n\r\n```\r\n./run_crawl.sh\r\n```\r\nCheck the output folder:\r\n```\r\ndata/\r\n  2019-08-02/\r\n    mysql_localhost.avro\r\n    mysql_localhost.log\r\n    sqlserver_localhost.avro\r\n    sqlserver_localhost.log\r\n```\r\n\r\n## The output schema\r\n\r\nTake a look at the Avro [schema](src/main/resources/avro/JdbcDatasource.avsc) used by MetadataCrawler. In short, the schema was inspired by Apache Metamodel and JDBC definitions. 
A high-level description of the schema is:\r\n\r\n```\r\nDataSource\r\n  Catalog\r\n    Schema\r\n      Tables\r\n        Columns\r\n      Relationships\r\n```\r\n\r\nA DataSource is defined by a JDBC URL, and has properties such as url, server_name, user, db_product_name, etc. Each DataSource has one or more catalogs, a catalog has one or more schemas, and so on. Notice that [for each database vendor, there are different meanings of JDBC definitions and how to identify an object in a database](https://stackoverflow.com/questions/7942520/relationship-between-catalog-schema-user-and-database-instance). The next section explains what Catalog and Schema mean to MetadataCrawler.\r\n\r\n### Catalogs and Schemas\r\n\r\nThere are [different meanings for what catalog and schema mean for each database vendor](https://stackoverflow.com/questions/7942520/relationship-between-catalog-schema-user-and-database-instance). For instance, see the differences between Oracle and SQL Server:\r\n\r\n**In Oracle:**\r\n\r\n* server instance == database == catalog == all data managed by same execution engine\r\n* schema == namespace within database, identical to user account\r\n* user == schema owner == named account, identical to schema, who can connect to database, who owns the schema and use objects possibly in other schemas\r\n* to identify any object in running server, you need (schema name + object name)\r\n\r\n**In Microsoft SQL Server:**\r\n\r\n\r\n* server instance == set of managed databases\r\n* database == namespace qualifier within the server, rarely referred to as catalog\r\n* schema == owner == namespace within the database, tied to database roles, by default only dbo is used\r\n* user == named account, who can connect to server and use (but can not own - schema works as owner) objects in one or more databases\r\n* to identify any object in running server, you need (database name + owner + object name)\r\n\r\nTherefore, the following decisions were made:\r\n\r\n* For RDMS 
like Oracle or MySQL, in which an object may be identified by (schema name + object name), the objects will be output **as part of a schema and of a catalog with the same name**. For instance, a table EVENT of schema COMPETITIONS will be output as COMPETITIONS(catalog) -\u003e COMPETITIONS(schema) -\u003e EVENT(table).\r\n* For RDMS like SQL Server or PostgreSQL, in which an object may be identified by (catalog name + schema name + object name), the objects will be output **as part of a catalog and schema with the respective names**. For instance, a table EVENT of schema DBO and catalog COMPETITIONS will be output as COMPETITIONS(catalog) -\u003e DBO(schema) -\u003e EVENT(table).\r\n* An object's qualified name **respects the qualified name defined by the database vendor**. For instance, a column's qualified name in SQL Server will be (database name + schema + table name + column name), whereas in Oracle it will be (database name + table name + column name).\r\n\r\n## Hive analysis\r\n\r\nYou can use Hive to [quickly process data obtained with MetadataCrawler](hive.md).\r\n\r\n## Build Instructions\r\n\r\nMake sure you have Apache Maven. 
There are two profiles in the Maven project:\r\n\r\n* open-drivers: profile with just the open source JDBC driver dependencies (mysql, postgresql, sqlserver)\r\n* all-drivers: profile with all JDBC driver dependencies (+ oracle, informix)\r\n\r\nTo build, just choose the profile you want:\r\n```\r\nmvn package -P open-drivers\r\n```\r\n\r\nor\r\n\r\n```\r\nmvn package -P all-drivers\r\n```\r\n\r\nTo build with all drivers, you need to install the Oracle JDBC driver as a Maven dependency: [How to add Oracle JDBC driver in your Maven local repository](https://www.mkyong.com/maven/how-to-add-oracle-jdbc-driver-in-your-maven-local-repository/).\r\n\r\n``` \r\nmvn install:install-file -Dfile=path/to/your/ojdbc7.jar -DgroupId=com.oracle -DartifactId=ojdbc7 -Dversion=12.2.0.1 -Dpackaging=jar\r\n```\r\n\r\n## Development\r\n\r\n### Eclipse project\r\n\r\nTo create an Eclipse project:\r\n\r\n```\r\nmvn eclipse:clean\r\nmvn eclipse:eclipse -P all-drivers -DdownloadSources=true -DdownloadJavadocs=true\r\n```\r\n\r\nMake sure to set [UTF-8 encoding](https://stackoverflow.com/questions/9180981/how-to-support-utf-8-encoding-in-eclipse) [in Eclipse](https://stackoverflow.com/questions/4043634/define-eclipse-project-encoding-as-utf-8-from-maven).\r\n\r\n### Avro schema\r\n\r\nThe Avro schema file is generated with the help of the [AvroSchemaBuilder](src/main/java/io/github/petersonjr/metadatacrawler/model/AvroSchemaBuilder.java) class. 
No one deserves to manually write a json schema.\r\n\r\n### Generate avro classes\r\n\r\nTo generate [Java classes](src/main/java/io/github/petersonjr/metadatacrawler/model) based on the schema, just use maven:\r\n\r\n```\r\nmvn clean compile\r\n``` \r\n\r\n## License\r\n\r\nMetadataCrawler is distributed under [GNU General Public License v3](http://www.gnu.org/licenses/gpl-3.0.en.html).\r\n\r\nMetadataCrawler\r\nCopyright (C) 2019  Péterson Júnior \u003cpeterson.junior@gmail.com\u003e\r\n\r\nThis program is free software: you can redistribute it and/or modify\r\nit under the terms of the GNU General Public License as published by\r\nthe Free Software Foundation, either version 3 of the License, or\r\n(at your option) any later version.\r\n\r\nThis program is distributed in the hope that it will be useful,\r\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\r\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\nGNU General Public License for more details.\r\n\r\nYou should have received a copy of the GNU General Public License\r\nalong with this program.  If not, see \u003chttp://www.gnu.org/licenses/\u003e.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetersonjr%2FMetadataCrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpetersonjr%2FMetadataCrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetersonjr%2FMetadataCrawler/lists"}