RHadoop 實作推薦系統

June 21, 2013

Outline

1. 簡介
2. 工具
3. 演算法與實作
4. Data preprocess
5. Result
6. Reference

簡介

課程： 101-2 NTU cloud computing special project
Proposal title: Large-scale big data mining in cloud system
Progress Report
source code: github.com/evenchange4/CCSP
datasets: movielen_dataset.csv (123Mb, preprocessed)
使用整合 R 與 Hadoop 的 package RHadoop (rmr2) 來實作 item-base 的推薦系統。

工具

RHadoop rmr2
演算法: Collaborative Filtering 協同過濾 (item-based)
安裝 Haddop 參考: Hadoop 第一次安裝就上手 CDH3
使用 Rhadoop 參考: R and Hadoop 整合初體驗

演算法與實作

本演算法參考粉丝日志 RHadoop实践系列來實作，特別感謝有這種中文的教學文章。
推薦結果 = 伴隨矩陣 (co-occurrence matrix) * 評分矩陣（score matrix）
map-reduce 實作步驟：
1. 建立 item’s co-occurrence matrix，然後算出 frequence

train.mr<-mapreduce(
	train.hdfs, 
	map = function(k, v) {
		keyval(k,v$item)
	},
	reduce = function(k,v){
		m<-merge(v,v)
		keyval(m$x,m$y)
	}
)
step2.mr<-mapreduce(
	train.mr,
	map = function(k, v) {
		d<-data.frame(k,v)
		d2<-ddply(d,.(k,v),count)
		key<-d2$k
		val<-d2
		keyval(key,val)
	}
)

2. 建立 user's 評分矩陣

train2.mr<-mapreduce(
	train.hdfs, 
	map = function(k, v) {
		#df<-v[which(v$user==3),]
		df<-v
		key<-df$item
		val<-data.frame(item=df$item,user=df$user,pref=df$pref)
		keyval(key,val)
	}
)

3. equijoin co-occurrence matrix and score matrix

eq.hdfs<-equijoin(
	left.input=step2.mr, 
	right.input=train2.mr,
	map.left=function(k,v){
		keyval(k,v)
	},
	map.right=function(k,v){
		keyval(k,v)
	},
	outer = c("left")
)

4. 計算推薦的結果

cal.mr<-mapreduce(
	input=eq.hdfs,
	map=function(k,v){
		val<-v
		na<-is.na(v$user.r)
		if(length(which(na))>0) val<-v[-which(is.na(v$user.r)),]
		keyval(val$k.l,val)
	},
	reduce=function(k,v){
		val<-ddply(v,.(k.l,v.l,user.r),summarize,v=freq.l*pref.r)
		keyval(val$k.l,val)
	}
)

5. output list and score

result.mr<-mapreduce(
	input=cal.mr,
	map=function(k,v){
		keyval(v$user.r,v)
	},
	reduce=function(k,v){
		val<-ddply(v,.(user.r,v.l),summarize,v=sum(v))
		val2<-val[order(val$v,decreasing=TRUE),]
		names(val2)<-c("user","item","pref")
		keyval(val2$user,val2)
	}
)

6. result

1	from.dfs(result.mr)

Data preprocess

input: csv file (user,item,rating ex: 1,101,5.0)
實際資料：MovieLens Data Sets
- GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.umn.edu). The data sets were collected over various periods of time, depending on the size of the set.
- origin data sets MovieLens 10M - Consists of 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users.
- MovieID index file

1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
2::Jumanji (1995)::Adventure|Children|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama|Romance
...

format convert:
- 原始格式：UserID::MovieID::Rating::Timestamp

1::122::5::838985046
1::185::5::838983525
1::231::5::838983392
1::292::5::838983421
...

- 轉換格式：`UserID,MovieID,Rating` save as `movielen_dataset.csv`

f = File.open("movielen_dataset.csv", "w")
File.open("ratings.dat").each do |l|
	temp = l.split("::")
	userID  = temp[0]
	movieID = temp[1]
	rating  = temp[2]
	f << "#{userID},#{movieID},#{rating}\n"
end
f.close

1,122,5
1,185,5
1,231,5
1,292,5
...

Result

…待補，vm localhost hdfs 跑好久

failed Reduce Tasks exceeded allowed limit

Reference

Read on Comments

Michael Hsu.tw

Try to Keep everything in Mind

RHadoop 實作推薦系統

簡介

工具

演算法與實作

Data preprocess

Result

Reference