{"id":27834571,"url":"https://github.com/linkedin/datafu","last_synced_at":"2025-05-02T13:01:09.749Z","repository":{"id":1675370,"uuid":"2402401","full_name":"LinkedInAttic/datafu","owner":"LinkedInAttic","description":"Hadoop library for large-scale data processing, now an Apache Incubator project","archived":false,"fork":false,"pushed_at":"2014-07-08T17:00:26.000Z","size":32892,"stargazers_count":583,"open_issues_count":4,"forks_count":133,"subscribers_count":74,"default_branch":"master","last_synced_at":"2025-04-26T05:09:52.353Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://datafu.incubator.apache.org/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LinkedInAttic.png","metadata":{"files":{"readme":"README.md","changelog":"changes.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-09-16T22:32:31.000Z","updated_at":"2025-04-12T21:11:06.000Z","dependencies_parsed_at":"2022-09-07T12:13:24.874Z","dependency_job_id":null,"html_url":"https://github.com/LinkedInAttic/datafu","commit_stats":null,"previous_names":["linkedin/datafu"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LinkedInAttic%2Fdatafu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LinkedInAttic%2Fdatafu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LinkedInAttic%2Fdatafu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LinkedInAttic%2Fdatafu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LinkedInAttic","download_url":"https://codeload.github.com/LinkedInAttic/datafu/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252043563,"owners_count":21685464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-02T13:00:46.393Z","updated_at":"2025-05-02T13:01:09.736Z","avatar_url":"https://github.com/LinkedInAttic.png","language":"Java","readme":"# Apache DataFu\n\n\u003ca href=\"https://twitter.com/apachedatafu\" class=\"twitter-follow-button\" data-show-count=\"false\" data-size=\"large\"\u003eFollow @apachedatafu\u003c/a\u003e\n\n[Apache DataFu](http://datafu.incubator.apache.org) is a collection of libraries for working with large-scale data in Hadoop.\nThe project was inspired by the need for stable, well-tested libraries for data mining and statistics.\n\nIt consists of two libraries:\n\n* **Apache DataFu Pig**: a collection of user-defined functions for [Apache Pig](http://pig.apache.org/)\n* **Apache DataFu Hourglass**: an incremental processing framework for [Apache Hadoop](http://hadoop.apache.org/) in MapReduce\n\nDataFu is currently undergoing incubation with Apache.  A mirror of the official git repository can be found on GitHub at [https://github.com/apache/incubator-datafu](https://github.com/apache/incubator-datafu).\n\nFor more information please visit the website:\n\n* [http://datafu.incubator.apache.org/](http://datafu.incubator.apache.org/)\n\nIf you'd like to jump in and get started, check out the corresponding guides for each library:\n\n* [Apache DataFu Pig - Getting Started](http://datafu.incubator.apache.org/docs/datafu/getting-started.html)\n* [Apache DataFu Hourglass - Getting Started](http://datafu.incubator.apache.org/docs/hourglass/getting-started.html)\n\n## Blog Posts\n\n* [Introducing DataFu](http://datafu.incubator.apache.org/blog/2012/01/10/introducing-datafu.html)\n* [DataFu: The WD-40 of Big Data](http://datafu.incubator.apache.org/blog/2013/01/24/datafu-the-wd-40-of-big-data.html)\n* [DataFu 1.0](http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html)\n* [DataFu's Hourglass: Incremental Data Processing in Hadoop](http://datafu.incubator.apache.org/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html)\n\n## Presentations\n\n* [A Brief Tour of DataFu](http://www.slideshare.net/matthewterencehayes/datafu)\n* [Building Data Products at LinkedIn with DataFu](http://www.slideshare.net/matthewterencehayes/building-data-products-at-linkedin-with-datafu)\n* [Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)](http://www.slideshare.net/matthewterencehayes/hourglass-a-library-for-incremental-processing-on-hadoop)\n* [DataFu @ ApacheCon 2014](http://www.slideshare.net/williamgvaughan/datafu-apachecon-33420740)\n\n## Videos\n\n* [Introduction to Apache DataFu @ ApacheCon 2014](http://www.youtube.com/watch?v=JWI9tVsQ1cY)\n\n## Other Resources\n\nAn interesting example of using Quantile from DataFu can be found in the [Hadoop Real-World Solutions Cookbook](http://packtlib.packtpub.com/library/hadoop-real-world-solutions-cookbook/ch06lvl1sec62).\n\n## From Around the Web\n\n* [DataFu Enters Incubation Status at Apache](http://www.infoq.com/news/2014/02/datafu-asf)\n* [DataFu: Open Source Apache Pig UDFs by LinkedIn](http://nosql.mypopescu.com/post/15734212877/datafu-open-source-apache-pig-udfs-by-linkedin)\n* [LinkedIn Opens DataFu: A Library for Working with Hadoop and Pig](http://readwrite.com/2012/01/12/linkedin-opens-datafu-a-librar)\n\n## Papers\n\n* [Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)](http://www.slideshare.net/matthewterencehayes/hourglass-27038297)\n\n## Getting Help\n\nPlease visit the website:\n\n* [http://datafu.incubator.apache.org/](http://datafu.incubator.apache.org/)\n","funding_links":[],"categories":["Java","II. Databases, search engines, big data and machine learning"],"sub_categories":["7. Big data"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fdatafu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinkedin%2Fdatafu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fdatafu/lists"}