I have data in a dataframe as follows:
datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9
I need to find max(datetime) grouped by userId and memberId. When I try the following:
df2 = df.groupBy('userId','memberId').max('datetime')
I get this error:
org.apache.spark.sql.AnalysisException: "datetime" is not a numeric column. Aggregation function can only be applied on a numeric column.;
The output I want is as follows:
userId | memberId | datetime
1234   | 111      | 2016-04-06 17:50:...
1234   | 222      | 2016-04-06 18:36:...
4567   | 111      | 2016-04-06 18:50:...
4567   | 222      | 2016-04-06 19:36:...
Can someone help me get the maximum datetime per group for the given data using PySpark dataframes?
Solution
For non-numeric but orderable types, you can use agg with max directly:
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

# Cast the string column to timestamp so max orders it chronologically
(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
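If you also want the result column to be named datetime (to match the desired output above), or you need to keep the whole row belonging to the latest datetime in each group, here is a minimal sketch of two variations. It assumes Spark 2.x+ with an active SparkSession called spark; the sample rows are taken from the question's data.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, max as max_, row_number

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2016-04-06 16:36", 1234, 111, 1),
     ("2016-04-06 17:50", 1234, 111, 8),
     ("2016-04-06 17:35", 1234, 222, 5),
     ("2016-04-06 18:36", 1234, 222, 9)],
    ["datetime", "userId", "memberId", "value"],
).withColumn("datetime", col("datetime").cast("timestamp"))

# Variation 1: alias the aggregated column so the schema matches the desired output
result = (df.groupBy("userId", "memberId")
            .agg(max_("datetime").alias("datetime")))
result.show()

# Variation 2: keep the entire row with the latest datetime per (userId, memberId)
w = Window.partitionBy("userId", "memberId").orderBy(col("datetime").desc())
latest_rows = (df.withColumn("rn", row_number().over(w))
                 .filter(col("rn") == 1)
                 .drop("rn"))
latest_rows.show()

The window-function variation is useful when you also need the other columns (such as value) from the row that holds the maximum datetime, which a plain groupBy/agg would drop.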