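Introduction
Tools
Algorithm and Implementation
This algorithm is implemented with reference to the 粉丝日志 RHadoop实践系列 (RHadoop practice series); special thanks to the author for providing such a tutorial in Chinese.
Recommendation result = co-occurrence matrix * score matrix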
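To make the formula concrete, here is a minimal in-memory sketch in Ruby, separate from the RHadoop implementation. The matrix `COOC`, the vector `USER_PREFS`, and the helper `recommend` are made-up names and data for illustration only; the sketch assumes unrated items contribute a score of 0.

```ruby
# Hypothetical co-occurrence matrix for three items (made-up counts):
# COOC[a][b] = number of users who rated both item a and item b.
COOC = {
  101 => { 101 => 2, 102 => 1, 103 => 1 },
  102 => { 101 => 1, 102 => 2, 103 => 1 },
  103 => { 101 => 1, 102 => 1, 103 => 2 }
}

# One user's score vector (item => preference); unrated items count as 0.
USER_PREFS = { 101 => 5.0, 102 => 3.0 }

# Multiply the co-occurrence matrix by the score vector:
# score[item] = sum over other items of cooc[item][other] * prefs[other]
def recommend(cooc, prefs)
  scores = cooc.map do |item, row|
    [item, row.sum { |other, freq| freq * prefs.fetch(other, 0.0) }]
  end
  scores.sort_by { |_item, score| -score } # highest score first
end

recommend(COOC, USER_PREFS).each { |item, score| puts "#{item}: #{score}" }
```

Item 103, which this user has not rated, still receives a score because it co-occurs with items the user did rate.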
MapReduce implementation steps:
1. Build the item co-occurrence matrix, then compute the frequency
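library(rmr2)  # mapreduce(), keyval(), equijoin()
library(plyr)  # ddply(), count()

# Step 1a: emit each key's items, then every item pair within that key
# (assumes train.hdfs is keyed by user ID)
train.mr <- mapreduce(
  train.hdfs,
  map = function(k, v) {
    keyval(k, v$item)
  },
  reduce = function(k, v) {
    m <- merge(v, v)  # cross join: all (x, y) item pairs
    keyval(m$x, m$y)
  }
)

# Step 1b: count how often each item pair occurs (the co-occurrence frequency)
step2.mr <- mapreduce(
  train.mr,
  map = function(k, v) {
    d <- data.frame(k, v)
    d2 <- ddply(d, .(k, v), count)
    key <- d2$k
    val <- d2
    keyval(key, val)
  }
)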
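The two jobs above can be pictured with a small in-memory sketch in Ruby. It is not the rmr2 code; `RATINGS`, `item_pairs`, and `cooccurrence_rows` are hypothetical names and made-up data, and the sketch assumes the first job groups items by user.

```ruby
# Made-up (user, item) ratings, standing in for train.hdfs.
RATINGS = [[1, 101], [1, 102], [2, 101], [2, 103], [3, 102], [3, 103]]

# Phase 1: group items by user, then emit every ordered item pair (x, y),
# including (x, x), which is what a cross join of the list with itself gives.
def item_pairs(ratings)
  ratings.group_by(&:first).flat_map do |_user, rows|
    items = rows.map(&:last)
    items.product(items)
  end
end

# Phase 2: count identical pairs, giving rows of (k, v, freq).
def cooccurrence_rows(pairs)
  counts = pairs.each_with_object(Hash.new(0)) { |pair, acc| acc[pair] += 1 }
  counts.map { |(k, v), freq| { "k" => k, "v" => v, "freq" => freq } }
end

cooccurrence_rows(item_pairs(RATINGS)).each { |row| puts row.inspect }
```

Each output row corresponds to one cell of the co-occurrence matrix: `k` and `v` are the two items and `freq` is the number of users who rated both.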
2. Build the user rating (score) matrix
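# Step 2: re-key each rating row by item so it can be joined with the co-occurrence rows
train2.mr <- mapreduce(
  train.hdfs,
  map = function(k, v) {
    df <- v
    key <- df$item
    val <- data.frame(item = df$item, user = df$user, pref = df$pref)
    keyval(key, val)
  }
)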
3. Equijoin the co-occurrence matrix and the score matrix
# Step 3: left outer join of the co-occurrence rows and the ratings on the item key
eq.hdfs <- equijoin(
  left.input = step2.mr,
  right.input = train2.mr,
  map.left = function(k, v) {
    keyval(k, v)
  },
  map.right = function(k, v) {
    keyval(k, v)
  },
  outer = c("left")
)
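The join semantics can be sketched in plain Ruby as follows. This is an illustration under assumptions, not the rmr2 internals: `left_outer_join` is a hypothetical helper, the rows are made-up data, and the `.l` / `.r` column suffixes are chosen to match the column names (`k.l`, `freq.l`, `user.r`, `pref.r`) used in steps 4 and 5.

```ruby
# Left outer join of two lists of row hashes on a key column.
# Left columns get a ".l" suffix and right columns a ".r" suffix.
# Left rows with no match are kept, with the right-hand columns absent.
def left_outer_join(left, right, left_key:, right_key:)
  index = right.group_by { |row| row[right_key] }
  left.flat_map do |l|
    l_cols = l.transform_keys { |c| "#{c}.l" }
    matches = index.fetch(l[left_key], [nil])
    matches.map do |r|
      r_cols = r ? r.transform_keys { |c| "#{c}.r" } : {}
      l_cols.merge(r_cols)
    end
  end
end

# Made-up co-occurrence rows (left) and rating rows (right).
left = [
  { "k" => 101, "v" => 102, "freq" => 1 },
  { "k" => 104, "v" => 101, "freq" => 1 } # no rating exists for item 104
]
right = [{ "item" => 101, "user" => 1, "pref" => 5.0 }]

joined = left_outer_join(left, right, left_key: "k", right_key: "item")
joined.each { |row| puts row.inspect }
```

The second joined row has no `user.r` value, which is the case the NA filter in step 4 has to handle.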
4. Compute the recommendation results
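# Step 4: drop unmatched rows, then multiply co-occurrence frequency by preference
cal.mr <- mapreduce(
  input = eq.hdfs,
  map = function(k, v) {
    val <- v
    na <- is.na(v$user.r)
    # the left outer join leaves user.r as NA where no rating matched
    if (length(which(na)) > 0) val <- v[-which(na), ]
    keyval(val$k.l, val)
  },
  reduce = function(k, v) {
    val <- ddply(v, .(k.l, v.l, user.r), summarize, v = freq.l * pref.r)
    keyval(val$k.l, val)
  }
)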
5. Output the recommendation list and scores
# Step 5: sum the scores per (user, item) and sort each user's list by score
result.mr <- mapreduce(
  input = cal.mr,
  map = function(k, v) {
    keyval(v$user.r, v)
  },
  reduce = function(k, v) {
    val <- ddply(v, .(user.r, v.l), summarize, v = sum(v))
    val2 <- val[order(val$v, decreasing = TRUE), ]
    names(val2) <- c("user", "item", "pref")
    keyval(val2$user, val2)
  }
)
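Steps 4 and 5 together amount to "multiply, sum per (user, item), sort". The Ruby sketch below shows that logic in memory; `JOINED` is made-up data shaped like the joined rows, and `recommendations` is a hypothetical helper, not part of rmr2 or plyr.

```ruby
# Made-up joined rows; the last one has no matching rating.
JOINED = [
  { "k.l" => 101, "v.l" => 101, "freq.l" => 2, "user.r" => 1, "pref.r" => 5.0 },
  { "k.l" => 101, "v.l" => 102, "freq.l" => 1, "user.r" => 1, "pref.r" => 5.0 },
  { "k.l" => 102, "v.l" => 101, "freq.l" => 1, "user.r" => 1, "pref.r" => 3.0 },
  { "k.l" => 102, "v.l" => 102, "freq.l" => 2, "user.r" => 1, "pref.r" => 3.0 },
  { "k.l" => 104, "v.l" => 101, "freq.l" => 1, "user.r" => nil, "pref.r" => nil }
]

def recommendations(joined)
  # Step 4 (map): drop rows where the join found no rating.
  matched = joined.reject { |row| row["user.r"].nil? }
  # Steps 4 and 5 (reduce): sum freq * pref for each (user, item) pair.
  grouped = matched.group_by { |row| [row["user.r"], row["v.l"]] }
  scored = grouped.map do |(user, item), rows|
    { "user" => user, "item" => item,
      "pref" => rows.sum { |r| r["freq.l"] * r["pref.r"] } }
  end
  # Step 5: order by user, then by descending score.
  scored.sort_by { |row| [row["user"], -row["pref"]] }
end

recommendations(JOINED).each { |row| puts row.inspect }
```

For user 1, item 101 scores 2 * 5.0 + 1 * 3.0 = 13.0 and item 102 scores 1 * 5.0 + 2 * 3.0 = 11.0.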
6. Result
Data preprocessing
Input: a CSV file whose lines have the form user,item,rating (e.g. 1,101,5.0)
Actual data: MovieLens Data Sets
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.umn.edu). The data sets were collected over various periods of time, depending on the size of the set.
Original data set: MovieLens 10M, which consists of 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users.
MovieID index file
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
2::Jumanji (1995)::Adventure|Children|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama|Romance
...
Format conversion:
Original format: UserID::MovieID::Rating::Timestamp
1::122::5::838985046
1::185::5::838983525
1::231::5::838983392
1::292::5::838983421
...
- Converted format: `UserID,MovieID,Rating`, saved as `movielen_dataset.csv`
# Convert MovieLens ratings (UserID::MovieID::Rating::Timestamp) to CSV,
# dropping the timestamp column.
File.open("movielen_dataset.csv", "w") do |f|
  File.foreach("ratings.dat") do |line|
    user_id, movie_id, rating = line.chomp.split("::")
    f << "#{user_id},#{movie_id},#{rating}\n"
  end
end
1,122,5
1,185,5
1,231,5
1,292,5
...
Result
…To be added; the job takes a very long time to run on HDFS in a local VM.
failed Reduce Tasks exceeded allowed limit
Reference