Implementing a Recommender System with RHadoop


Outline
  1. Introduction
  2. Tools
  3. Algorithm and Implementation
  4. Data preprocess
  5. Result
  6. Reference

Introduction

Tools

Algorithm and Implementation

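  • This implementation follows the RHadoop实践系列 (RHadoop in Practice series) on the 粉丝日志 blog. Many thanks to its author for providing a tutorial like this in Chinese.
  • Recommendation result = co-occurrence matrix * score matrix
  • The algorithm concept comes from Mahout in Action.
  • MapReduce implementation steps:
    1. Build the item co-occurrence matrix, then compute the frequency of each item pair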
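library(rmr2)  # mapreduce(), keyval(), equijoin(), from.dfs()
library(plyr)  # ddply(), count()

# train.hdfs is assumed to hold the ratings keyed by user,
# with a value data frame of columns user, item, pref.

# Group items by user and emit every item pair rated by the same user.
train.mr <- mapreduce(
  train.hdfs,
  map = function(k, v) {
    keyval(k, v$item)
  },
  reduce = function(k, v) {
    m <- merge(v, v)  # cross join of the user's items: columns x and y
    keyval(m$x, m$y)
  }
)

# Count how often each item pair (k, v) occurs.
step2.mr <- mapreduce(
  train.mr,
  map = function(k, v) {
    d <- data.frame(k, v)
    d2 <- ddply(d, .(k, v), count)  # adds a freq column
    key <- d2$k
    val <- d2
    keyval(key, val)
  }
)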
    2. Build the users' score matrix
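# Re-key every rating by item so it can be joined with the co-occurrence counts.
train2.mr <- mapreduce(
  train.hdfs,
  map = function(k, v) {
    # df <- v[which(v$user == 3), ]  # uncomment to restrict to a single user
    df <- v
    key <- df$item
    val <- data.frame(item = df$item, user = df$user, pref = df$pref)
    keyval(key, val)
  }
)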
    3. Equijoin the co-occurrence matrix and the score matrix
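# Left outer join on item: co-occurrence rows (left) with ratings (right).
# Joined columns carry the suffixes .l and .r.
eq.hdfs <- equijoin(
  left.input = step2.mr,
  right.input = train2.mr,
  map.left = function(k, v) {
    keyval(k, v)
  },
  map.right = function(k, v) {
    keyval(k, v)
  },
  outer = c("left")
)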
    4. Compute the recommendation scores
cal.mr <- mapreduce(
  input = eq.hdfs,
  map = function(k, v) {
    # Drop rows that found no matching rating in the left outer join.
    val <- v[!is.na(v$user.r), ]
    keyval(val$k.l, val)
  },
  reduce = function(k, v) {
    # Partial score: co-occurrence frequency times the user's rating.
    val <- ddply(v, .(k.l, v.l, user.r), summarize, v = freq.l * pref.r)
    keyval(val$k.l, val)
  }
)
    5. Output the recommendation list and scores
result.mr <- mapreduce(
  input = cal.mr,
  map = function(k, v) {
    keyval(v$user.r, v)
  },
  reduce = function(k, v) {
    # Sum the partial scores per (user, item) and sort by score, highest first.
    val <- ddply(v, .(user.r, v.l), summarize, v = sum(v))
    val2 <- val[order(val$v, decreasing = TRUE), ]
    names(val2) <- c("user", "item", "pref")
    keyval(val2$user, val2)
  }
)
    6. Result
from.dfs(result.mr)
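
To make the arithmetic behind the steps above easier to follow, here is a minimal in-memory sketch in Ruby (the language of the conversion script in the next section). It is not the RHadoop job itself: the ratings are made-up toy values, and the names `RATINGS`, `cooccurrence`, and `recommend` are hypothetical helpers introduced only for illustration. Like the MapReduce steps, it pairs each item with itself and does not filter out items the user has already rated.

```ruby
# Toy ratings as [user, item, pref]; these values are made up for illustration.
RATINGS = [
  [1, 101, 5.0], [1, 102, 3.0],
  [2, 101, 2.0], [2, 102, 2.5], [2, 103, 5.0],
  [3, 101, 2.5], [3, 103, 4.0]
].freeze

# Step 1: count how often each ordered item pair is rated by the same user
# (an item paired with itself is included, as merge(v, v) does in the R code).
def cooccurrence(ratings)
  counts = Hash.new(0)
  ratings.group_by { |user, _item, _pref| user }.each_value do |rows|
    items = rows.map { |_user, item, _pref| item }
    items.product(items).each { |pair| counts[pair] += 1 }
  end
  counts
end

# Steps 2-5: for every user, sum freq(candidate, rated) * pref(user, rated)
# over the items the user rated, then sort by score, highest first.
def recommend(ratings)
  freq = cooccurrence(ratings)
  items = ratings.map { |_user, item, _pref| item }.uniq
  ratings.group_by { |user, _item, _pref| user }.transform_values do |rows|
    scores = items.map do |candidate|
      total = rows.sum { |_user, rated, pref| freq[[candidate, rated]] * pref }
      [candidate, total]
    end
    scores.sort_by { |_item, score| -score }
  end
end

result = recommend(RATINGS)
result.each { |user, scores| puts "user #{user}: #{scores.inspect}" }
```

With these toy values, items 101 and 102 co-occur for two users, so user 1's score for item 103 is 2 * 5.0 + 1 * 3.0 = 13.0: item 103 co-occurs twice with 101 and once with 102.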

Data preprocess

  • Input: a CSV file with one `user,item,rating` record per line, e.g. `1,101,5.0`
  • Actual data: MovieLens Data Sets
    • GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.umn.edu). The data sets were collected over various periods of time, depending on the size of the set.
    • Original data set: MovieLens 10M, which consists of 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users.
    • MovieID index file:
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
2::Jumanji (1995)::Adventure|Children|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama|Romance
...
  • Format conversion:
    • Original format: UserID::MovieID::Rating::Timestamp
1::122::5::838985046
1::185::5::838983525
1::231::5::838983392
1::292::5::838983421
...
    • Converted format: `UserID,MovieID,Rating`, saved as `movielen_dataset.csv`
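# Convert "UserID::MovieID::Rating::Timestamp" lines into "UserID,MovieID,Rating" CSV rows.
# The block form of File.open closes the output file even if an error occurs.
File.open("movielen_dataset.csv", "w") do |out|
  File.foreach("ratings.dat") do |line|
    user_id, movie_id, rating = line.chomp.split("::")  # the timestamp is dropped
    out << "#{user_id},#{movie_id},#{rating}\n"
  end
end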
1,122,5
1,185,5
1,231,5
1,292,5
...
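
The recommender only outputs item IDs, so the MovieID index file shown above is what turns a result back into a movie title. The following is a minimal sketch of that lookup, assuming every line follows the `MovieID::Title::Genres` layout of the sample. `parse_movie_index` and `SAMPLE` are hypothetical names used only here, and the sketch reads from an inline string because the post does not give the index file's name.

```ruby
# Parse "MovieID::Title::Genres" lines into a { movie_id => title } hash.
def parse_movie_index(lines)
  lines.each_with_object({}) do |line, index|
    movie_id, title, _genres = line.chomp.split("::", 3)
    next if movie_id.nil? || title.nil?  # skip blank or malformed lines
    index[Integer(movie_id)] = title
  end
end

# The four sample lines shown in the index file excerpt above.
SAMPLE = <<~DATA
  1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
  2::Jumanji (1995)::Adventure|Children|Fantasy
  3::Grumpier Old Men (1995)::Comedy|Romance
  4::Waiting to Exhale (1995)::Comedy|Drama|Romance
DATA

titles = parse_movie_index(SAMPLE.each_line)
puts titles[2]
```

For the real index file, the same function could be fed with `File.foreach` on that file instead of the inline sample.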

Result

…To be added; the job takes a very long time to run on HDFS in a localhost VM.

failed Reduce Tasks exceeded allowed limit

Reference