Enterprise Spark Case Study: Taxi Trajectory Analysis (Python)

Level 1: Spark SQL Data Cleaning

# -*- coding: UTF-8 -*-
from pyspark.sql import SparkSession
if __name__ =='__main__':
    spark = SparkSession.builder.appName("demo").master("local").getOrCreate()
    #**********begin**********#
    # Read the tab-delimited file, using the first row as column names
    df = spark.read.option("header",True).option("delimiter","\t").csv("/root/data.csv")
    df.createTempView("data")
    # Strip every run of non-word characters (\W+) from each column.
    # The raw string keeps both backslashes; Spark SQL unescapes '\\W+' to the regex \W+.
    spark.sql(r"""
    select regexp_replace(TRIP_ID,'\\W+','') as TRIP_ID ,
        regexp_replace(CALL_TYPE,'\\W+','') as CALL_TYPE ,
        regexp_replace(ORIGIN_CALL,'\\W+','') as ORIGIN_CALL ,
        regexp_replace(TAXI_ID,'\\W+','') as TAXI_ID ,
        regexp_replace(ORIGIN_STAND,'\\W+','') as ORIGIN_STAND ,
        regexp_replace(TIMESTAMP,'\\W+','') as TIMESTAMP ,
        regexp_replace(POLYLINE,'\\W+','') as POLYLINE
    from data
    """).show()
    #**********end**********#
    spark.stop()
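
To see what the `\W+` pattern in the query above removes, the cleaning step can be approximated in plain Python. This is a minimal sketch, not the platform's reference solution: `strip_non_word` is a hypothetical helper, the sample values are made up, and `re.ASCII` is used on the assumption that Spark's `regexp_replace` follows Java's default ASCII-only meaning of `\W`.

```python
import re

# Hypothetical helper mirroring regexp_replace(col, '\\W+', '') from the query.
# re.ASCII restricts \w to [a-zA-Z0-9_]; this is assumed to match Java's default.
_NON_WORD = re.compile(r"\W+", re.ASCII)


def strip_non_word(value):
    """Remove every run of non-word characters; pass None through like SQL NULL."""
    if value is None:
        return None
    return _NON_WORD.sub("", value)


if __name__ == "__main__":
    # Made-up sample values, not taken from /root/data.csv
    for raw in ['"1372636858620000589"', "A$%", "[-8.61,41.14]"]:
        print(repr(raw), "->", repr(strip_non_word(raw)))
```

The last sample shows that the pattern also deletes dots, commas, minus signs, and brackets, so coordinate text loses its structure once it has been cleaned this way.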

Level 2: Spark SQL Data Analysis

# -*- coding: UTF-8 -*-
from pyspark.sql import SparkSession
import json

if __name__ == '__main__' :
    spark = SparkSession.builder.master("local").appName("demo").getOrCreate()
    #**********begin**********#
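    # Read the tab-delimited file, using the first row as column names
    df = spark.read.option("header",True).option("delimiter","\t").csv("/root/data2.csv")
    df.createTempView("data")
    # Convert the Unix timestamp to a yyyy-MM-dd date string
    spark.sql("select TRIP_ID,CALL_TYPE,ORIGIN_CALL, TAXI_ID, ORIGIN_STAND, from_unixtime(TIMESTAMP,'yyyy-MM-dd') as TIME ,POLYLINE from data").show()
    # The lambda bodies use parentheses, not braces: braces would build a set
    # literal and each UDF would return a one-element set instead of a value.
    # Trip duration: (number of points - 1) * 15; an empty polyline yields 8
    spark.udf.register("timeLen", lambda x: (
        (len(json.loads(x)) - 1) * 15 if len(json.loads(x)) > 0 else 8
    ))
    # First point of the polyline, or "" when it is empty
    spark.udf.register("startLocation", lambda x: (
        str(json.loads(x)[0]) if len(json.loads(x)) > 0 else ""
    ))
    # Last point of the polyline, or "" when it is empty
    spark.udf.register("endLocation", lambda x: (
        str(json.loads(x)[-1]) if len(json.loads(x)) > 0 else ""
    ))
    df.createTempView("data2")
    # Add the duration, start point, and end point columns
    res=spark.sql("select TRIP_ID,CALL_TYPE,ORIGIN_CALL,TAXI_ID,ORIGIN_STAND,from_unixtime(TIMESTAMP,'yyyy-MM-dd') as TIME, POLYLINE, timeLen(POLYLINE) as TIMELEN, startLocation(POLYLINE) as STARTLOCATION, endLocation(POLYLINE) as ENDLOCATION from data2")
    res.createTempView("data3")
    res.show()
    # Count trips per call type and day
    spark.sql("select CALL_TYPE,TIME,count(1) as NUM from data3 group by TIME,CALL_TYPE order by CALL_TYPE,TIME").show()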
    #**********end**********#
    spark.stop()
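
The three UDFs registered above can also be sketched as ordinary Python functions, which makes the logic easy to check without a Spark session. This is an illustrative sketch under assumptions: the function and constant names are hypothetical, POLYLINE is assumed to be a JSON array of coordinate pairs, the sample string is made up, and the 15 multiplier and the fallback value 8 are simply carried over from the code above.

```python
import json

SAMPLE_INTERVAL = 15     # carried over from the `* 15` in the timeLen UDF
EMPTY_TRIP_FALLBACK = 8  # carried over from the `else 8` in the timeLen UDF


def time_len(polyline):
    """Trip duration: one interval between each pair of consecutive points."""
    points = json.loads(polyline)
    if len(points) > 0:
        return (len(points) - 1) * SAMPLE_INTERVAL
    return EMPTY_TRIP_FALLBACK


def start_location(polyline):
    """First point of the polyline as a string, or "" when it is empty."""
    points = json.loads(polyline)
    return str(points[0]) if points else ""


def end_location(polyline):
    """Last point of the polyline as a string, or "" when it is empty."""
    points = json.loads(polyline)
    return str(points[-1]) if points else ""


if __name__ == "__main__":
    # Made-up sample, not taken from /root/data2.csv
    sample = "[[-8.61,41.14],[-8.62,41.15],[-8.63,41.16]]"
    print(time_len(sample), start_location(sample), end_location(sample))
```

Functions like these could be passed to `spark.udf.register` in place of the lambdas. One Python detail to keep in mind when writing such lambdas: braces around the body, as in `lambda x: {x}`, build a one-element set rather than grouping the expression.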