{"id":16599196,"url":"https://github.com/collabh/reasearch-bigdata","last_synced_at":"2025-03-21T13:32:31.211Z","repository":{"id":37108254,"uuid":"239325567","full_name":"collabH/reasearch-bigdata","owner":"collabH","description":"Notes from reading books, reading source code, and watching third-party tutorial videos","archived":false,"fork":false,"pushed_at":"2022-12-14T20:43:35.000Z","size":518,"stargazers_count":12,"open_issues_count":19,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-01T06:51:12.320Z","etag":null,"topics":["flink","hadoop","hive","spark"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/collabH.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-09T15:19:16.000Z","updated_at":"2022-11-11T02:32:38.000Z","dependencies_parsed_at":"2023-01-29T01:00:46.202Z","dependency_job_id":null,"html_url":"https://github.com/collabH/reasearch-bigdata","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/collabH%2Freasearch-bigdata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/collabH%2Freasearch-bigdata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/collabH%2Freasearch-bigdata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/collabH%2Freasearch-bigdata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/collabH","download_url":"https://codeload.github.com/collabH/reasearch-bigdata/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com",
"kind":"github","repositories_count":244141580,"owners_count":20404835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flink","hadoop","hive","spark"],"created_at":"2024-10-12T00:10:39.237Z","updated_at":"2025-03-21T13:32:30.825Z","avatar_url":"https://github.com/collabH.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 大数据成长之路\n\n## Hadoop\n### 历史之路\n[Hadoop十年解读](https://www.infoq.cn/article/hadoop-ten-years-interpretation-and-development-forecast)\n\n### HDFS JavaAPI\n#### 副本因子的坑\n```text\n如果通过hdfs shell上传的文件那么他的副本因子是根据 hdfs-site.xml中的配置,\n如果是通过Java API方式那么他会使用副本因子为3的配置\n\n```\n### 项目实践\n#### 用户行为日志分析\n\n**日志数据内容**\n* 访问的系统属性:操作系统、浏览器等等\n* 访问特征:点击的url、从哪个url跳转过的(referer)、页面停留时间等\n* 访问信息:session_id、访问ip\n\n**数据处理流程**\n* 数据采集 Flume:Web日志写入HDFS中\n* 数据清洗 脏数据清理:Spark、Hive、MapReduce\n* 数据处理 按照需求进行相应业务的统计和分析\n* 数据处理结果入库   结果可以存放到RDBMS、NoSQL等\n* 数据的可视化  通过图形化展示的方式展现出来:饼图、柱状图、地图等\n\n### HDFS文档\n* [官方文档](https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)\n* [石墨笔记](https://shimo.im/docs/RjGgVxDJ8KT96xr8/)\n* [HDFS文件读取写入流程](https://www.processon.com/view/link/5e40b7e4e4b085b5f21a193d)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcollabh%2Freasearch-bigdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcollabh%2Freasearch-bigdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcollabh%2Freasearch-bigdata/lists"}