{"id":49341691,"url":"https://github.com/redborder/camus-sync","last_synced_at":"2026-04-27T04:04:49.010Z","repository":{"id":33605788,"uuid":"37257959","full_name":"redBorder/camus-sync","owner":"redBorder","description":"Synchronices hdfs data from camus between hdfs clusters.","archived":false,"fork":false,"pushed_at":"2016-10-19T14:48:40.000Z","size":72,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":11,"default_branch":"master","last_synced_at":"2023-03-21T05:45:05.421Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/redBorder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-11T11:47:46.000Z","updated_at":"2016-03-21T09:53:40.000Z","dependencies_parsed_at":"2022-09-12T22:23:53.179Z","dependency_job_id":null,"html_url":"https://github.com/redBorder/camus-sync","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"purl":"pkg:github/redBorder/camus-sync","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redBorder%2Fcamus-sync","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redBorder%2Fcamus-sync/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redBorder%2Fcamus-sync/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redBorder%2Fcamus-sync/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/redBorder","download_url":"https://codeload.github.com/redBorder/camus-sync/tar.gz/refs/heads/master","sbom_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redBorder%2Fcamus-sync/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32321945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T23:26:28.701Z","status":"online","status_checked_at":"2026-04-27T02:00:06.769Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-27T04:04:48.797Z","updated_at":"2026-04-27T04:04:48.994Z","avatar_url":"https://github.com/redBorder.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Camus Sync\n[![Codacy Badge](https://api.codacy.com/project/badge/grade/968693b22b9f489ebfbe6dd15a509af9)](https://www.codacy.com/app/redBorder/camus-sync)\nSynchronizes HDFS data from camus between HDFS clusters.\n\n## Usage\n\n```\n$ java -cp \"camus-sync-VERSION-selfcontained.jar:$(hadoop classpath)\" net.redborder.camus.CamusSync -h\nusage: java -cp CLASSPATH net.redborder.camus.CamusSync OPTIONS\n -c,--configFile \u003carg\u003e   path to a YAML config file\n -f,--offset \u003carg\u003e       offset\n -h,--help               print this help\n -m,--mode \u003carg\u003e         task to execute (synchronize, deduplicate)\n -n,--namenodes \u003carg\u003e    comma separated list of namenodes\n -d,--dryRun             do nothing\n -p,--camusPath \u003carg\u003e    HDFS path where camus saves its data\n -t,--topics \u003carg\u003e       
comma separated list of topics\n -w,--window \u003carg\u003e       window hours\n```\n\nAll options can also be set in the config file given with the -c option. Options given on the\ncommand line override those from the config file.\n\nIf the \"topics\" option is not specified on the command line, the topic list is taken from the keys of the\n\"topics\" property in the config file.\n\n## Assumptions / Notes\n\n* HDFS contains data in gzip'd files in [camus](https://github.com/linkedin/camus)-style [folders](https://github.com/liquidm/druid-dumbo/blob/master/lib/dumbo/firehose/hdfs.rb#L65)\n* The config file is a YAML file with a map, where each key is the long name of an option (e.g. \"camusPath\")\n\n## Deduplicate mode\n\nThis mode runs a Pig job that loads the data from each specified namenode and merges all the data, deleting\nduplicated rows along the way. We expect the data to be flat JSON messages (no nested objects).\n\nTo identify duplicated rows, you must specify a set of dimensions (properties) to compare. If two or more\nmessages have the same value for each of these dimensions, all but one of those messages are deleted. Therefore,\nyou should use dimensions that can identify a message uniquely. We often include the timestamp of the message in this set\nof dimensions.\n\nYou can specify the dimensions used for each topic in the config file, under the key \"topics\".\nThe value of \"topics\" should be a map, where the key is the topic name, and the value is an array of strings, where\neach string is the name of a dimension (property) in the JSON message.\n\n## Synchronize mode\n\nThis mode does the following for each hour:\n1. Finds out which namenode has the largest number of events for that hour.\n2. Deletes all the data for that hour from every other namenode.\n3. Copies the data from (1) to every other namenode.\n\nIt uses the Hadoop DistCp job to copy the data directly between the clusters.\n\n## Contributing\n\n1. 
[Fork it](https://github.com/redborder/camus-sync/fork)\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. Create a new Pull Request\n\n## Credit\n\nBased on [liquidm/druid-dumbo](https://github.com/redborder/druid-dumbo).\nDumbo lets you index camus data from HDFS into a [Druid](http://www.druid.io) cluster.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredborder%2Fcamus-sync","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fredborder%2Fcamus-sync","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredborder%2Fcamus-sync/lists"}