{"id":16388976,"url":"https://github.com/luckyzxl2016/spark-example","last_synced_at":"2025-08-10T22:08:57.836Z","repository":{"id":105601680,"uuid":"116407291","full_name":"LuckyZXL2016/Spark-Example","owner":"LuckyZXL2016","description":"Spark1.6和spark2.2的示例，包含kafka,flume,structuredstreaming,jedis,elasticsearch,mysql,dataframe","archived":false,"fork":false,"pushed_at":"2018-01-28T15:44:35.000Z","size":2159,"stargazers_count":15,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-04T12:35:53.486Z","etag":null,"topics":["dataframe","elasticsearch","jedis","kafka","mysql","spark","spark-example","spark-sql","spark-streaming","spark-structured-streaming"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LuckyZXL2016.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-05T16:56:03.000Z","updated_at":"2020-10-28T08:49:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"cef011e9-0588-4970-9a3b-fd1402147c93","html_url":"https://github.com/LuckyZXL2016/Spark-Example","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LuckyZXL2016/Spark-Example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LuckyZXL2016%2FSpark-Example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LuckyZXL2016%2FSpark-Example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LuckyZXL2016%2FSpark-Examp
# Spark-Example

Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and the DataFrame/Dataset APIs.

## com.zxl.spark2_2.kafka

- **StreamingKafka8**: Spark Streaming reads data from Kafka 0.8 using the direct approach.
- **StreamingKafka10**: Spark Streaming reads data from Kafka 0.10 using the direct approach.

## com.zxl.spark2_2.streaming

- **StreamingToMysql**: Spark Streaming reads data and writes it to MySQL.
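The direct approach used by StreamingKafka10 can be sketched with the standard spark-streaming-kafka-0-10 API. This is a minimal sketch, not the repository's actual code; the broker list, group id, topic name, and batch interval are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingKafka10Sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Consumer settings; brokers, group id and offsets policy are placeholder values.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node1:9092,node2:9092,node3:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "g1",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct connection: each Spark partition consumes one Kafka partition,
    // with no receiver and no write-ahead log.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("test"), kafkaParams))

    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```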
## com.zxl.spark2_2.structured

- **JDBCSink**: writes data from Structured Streaming to MySQL.
- **MySqlPool**: obtains connections from a MySQL connection pool.
- **StructuredStreamingKafka**: Structured Streaming reads data from Kafka and stores it in MySQL. Structured Streaming currently requires Kafka 0.10 or later.

## com.zxl.spark2_2.dataset

- **createDataSet**: the various ways to create a Dataset.
- **basicAction**: basic Dataset operations.
- **actions**: Dataset operations:
  1. map and flatMap
  2. filter and where
  3. deduplication
  4. set addition/subtraction
  5. select
  6. sorting
  7. splitting and sampling
  8. column operations
  9. join
  10. grouping and aggregation

## com.zxl.spark1_6.dataframe

- **SQLDemo**: reads data from HDFS, converts it to a DataFrame, and performs simple operations.

## com.zxl.spark1_6.elastic

- **ElasticSpark**: Elasticsearch is a real-time distributed search and analytics engine based on Lucene. Designed for the cloud, it provides real-time search and is stable, reliable, fast, and easy to install and use.

## com.zxl.spark1_6.flume

- **FlumePushWordCount**: Flume pushes data to Spark. Add three jars:
  - commons-lang3-3.3.2.jar
  - scala-library-2.10.5.jar
  - spark-streaming-flume-sink_2.10-1.6.1.jar

  Package the project as a jar, upload it to the cluster, and run:

      bin/spark-submit --master spark://node1:7077 --class com.zxl.spark1_6.flume.FlumePushWordCount /jar/____.jar 192.168.13.131 8888

## com.zxl.spark1_6.jedis

- **JedisConnectionPool**: obtains a Jedis connection and performs simple operations.

## com.zxl.spark1_6.kafka

- **DirectKafkaWordCount**: Spark Streaming maintains the offset information itself, achieving zero data loss and no duplicate consumption. One drawback of the direct approach is that offsets are no longer updated in ZooKeeper. So when consuming Kafka data with the direct approach, the general idea is: first read the offsets saved in ZooKeeper, create the stream from those offsets, and after consuming a batch write the current offsets back to ZooKeeper. In versions before 2.0 the KafkaManager class is private, so it has to be copied into the project under org.apache.spark.streaming.kafka.
- **KafkaWordCount**: reads data from Kafka on the cluster. Runtime arguments:
      node1:2181,node2:2181,node3:2181 g1 test 2

  Here g1 is the consumer group name (any value works) and test is the topic name, which must match the topic in Kafka.

  Cluster commands (the services must already be started):

  1. Start Kafka:

         bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

  2. Create the topic:

         bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic test

  3. Produce data to the topic:

         bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

## com.zxl.spark1_6.my_partitioner

- **UrlCountPartition**: a custom partitioner. The data format is (timestamp, URL), e.g.:

      20160321101954	http://net.zxl.cn/net/video.shtml

  The records are turned into (k, v) pairs and a custom Partitioner is applied to them.

## com.zxl.spark1_6.my_sort

- **CustomSort**: custom sorting.

## com.zxl.spark1_6.mysql

- **JdbcRDDDemo**: a simple database access example.

## com.zxl.spark1_6.simple

- **AdvUrlCount**: reads text content and, for each given subject, extracts the top three entries by click count. The text holds ad-link click records in the format (timestamp, subject URL), e.g.:

      20160321101957	http://net.zxl.cn/net/course.shtml

- **IpDemo**: data format:

      1.0.1.0|1.0.3.255|16777472|16778239|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302

  Converts an IP address to a number and looks up its details in the dataset; binary search is used to speed up the lookup.

- **UserLocation**: from the logs, finds for each user the top 2 base stations where they spent the most time. Each log record is (phone number, timestamp, base station, event type); event type 1 means entering the station, 0 means leaving it.

  1. Using "phone_station" as a unique key, compute the duration of one enter/leave pair, yielding (phone_station, interval).
  2. With "phone_station" as the key, sum the durations per station: (phone_station, total time).
  3. (phone_station, total time) -> (phone, station, total time).
  4. (phone, station, total time) -> groupBy().mapValues(sort by time, take top 2) -> (phone -> ((m,s,t),(m,s,t))).

- **WordCount**: a simple WordCount implementation. Example of running it on the cluster with explicit configuration:
      bin/spark-submit --master spark://node1:7077 --class com.zxl.spark1_6.simple.WordCount --executor-memory 512m --total-executor-cores 2 /opt/soft/jar/hello-spark-1.0.jar hdfs://node1:9000/wc hdfs://node1:9000/out

## com.zxl.spark1_6.streaming

- **LoggerLevels**: sets the log level for printed output.
- **StateFulWordCount**: stateful aggregation in Spark Streaming (updateStateByKey).
- **StreamingWordCount**: a simple WordCount built with Spark Streaming.
- **WindowOpts**: Spark Streaming window operations.

## org.apache.spark.streaming.kafka

- **KafkaManager**: Spark Streaming connects directly to Kafka and manages the offsets itself; for Spark versions before 2.0.
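The updateStateByKey pattern behind StateFulWordCount in com.zxl.spark1_6.streaming can be sketched as follows. This is a minimal sketch rather than the repository's code; the socket host/port, checkpoint path, and batch interval are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCountSketch {
  // Merge the current batch's counts for a word into its running total.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (newCounts, runningTotal) => Some(newCounts.sum + runningTotal.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StateFulWordCountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("./checkpoint") // updateStateByKey requires a checkpoint directory

    val lines = ssc.socketTextStream("node1", 8888) // host and port are placeholders
    val counts = lines.flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey(updateFunc)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```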
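The JDBCSink idea from com.zxl.spark2_2.structured — writing Structured Streaming output to MySQL row by row — is commonly implemented with Spark's ForeachWriter. A minimal sketch under that assumption (the JDBC URL, table, columns, and credentials are placeholders, not the repository's values):

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// One connection per partition/epoch (open), one INSERT per row (process),
// cleanup in close. Production code would draw from a pool such as MySqlPool.
class JdbcSinkSketch(url: String, user: String, pass: String) extends ForeachWriter[Row] {
  var conn: Connection = _
  var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url, user, pass)
    stmt = conn.prepareStatement("INSERT INTO wordcount(word, count) VALUES (?, ?)")
    true // returning true means this partition should be processed
  }

  override def process(row: Row): Unit = {
    stmt.setString(1, row.getString(0))
    stmt.setLong(2, row.getLong(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}
```

A streaming query would attach it with `df.writeStream.foreach(new JdbcSinkSketch(url, user, pass)).start()`.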
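The custom-partitioner idea from com.zxl.spark1_6.my_partitioner can be sketched by extending Spark's Partitioner. This is an illustrative sketch, not the repository's UrlCountPartition; the host list is a placeholder (the original derives it from the data):

```scala
import org.apache.spark.Partitioner

// Route records for each URL host to a dedicated partition.
class HostPartitioner(hosts: Array[String]) extends Partitioner {
  private val hostToPartition: Map[String, Int] = hosts.zipWithIndex.toMap

  // One partition per known host, plus a final catch-all partition.
  override def numPartitions: Int = hosts.length + 1

  override def getPartition(key: Any): Int =
    hostToPartition.getOrElse(key.toString, hosts.length)
}
```

Applied to a pair RDD keyed by host, e.g. `rdd.partitionBy(new HostPartitioner(Array("net.zxl.cn", "java.zxl.cn")))`, so each subject's URLs land in their own partition.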