
Natalie insight

During my fellowship at Insight Data Science, I created dj-cloud.us, a music album recommendation engine based on over 6 million album ratings. I first merged album data scraped from Amazon, retrieved through the Google, Bing, and YouTube APIs, and obtained from the Stanford SNAP lab. I then used Hive on top of Hadoop on AWS to preprocess the data and applied collaborative filtering with MapReduce, using MRJob to calculate pairwise Pearson and cosine similarity between albums. Next, I calculated the top-ranked albums, filtered duplicate titles with NLTK, and stored the results in MySQL. Lastly, I deployed the app with Flask on AWS and designed the front end using HTML, Bootstrap CSS, and JavaScript.
Published on: Mar 3, 2016
Published in: Data & Analytics      
Source: www.slideshare.net


Transcripts - Natalie insight

  • 1. Discover popular music albums with cloud computing. Natalie Han
  • 2. Individual songs: PANDORA. Albums by the same artist: Hotel California, Eagles
  • 3. Discover popular music albums. Start by typing an album name or clicking on an album image on […]
  • 4. Raw Data: Amazon album ratings. Ratings: 6,396,350; Albums: 565,743; Users: 1,173,354. Algorithm: Collaborative filtering with Pearson similarity and Cosine similarity
  • 5. [Diagram; legible labels: web services, Data Storage, Data Cleaning, Similarity Calculation]
  • 6. [Image; legible label: YouTube]
  • 7. Front End: HTML […] S3
  • 8. Natalie Han
  • 9. [Screenshot: Elastic MapReduce console, Cluster List > Cluster Details] Cluster: musicsimilaritynatalie […]. Creation date: 2014-01-22. Elapsed time: 20 minutes. The Steps panel lists the job's steps, with one shown as Running.
  • 10. [Screenshot: ip-10-29-217-5 Hadoop Map/Reduce Administration] State: RUNNING. Started: Wed Jan 22 22:43:43 UTC 2014. Version: 1.0.3. Compiled: Wed Oct 2 12:17:06 PDT 2013 by Elastic MapReduce. Identifier: 201401222243. Cluster Summary (Heap Size is 25.12 MB/556.81 MB). The Running Jobs and Completed Jobs tables list Hadoop streaming jobs at NORMAL priority […]
  • 11. Map Reduce Steps with MRJob
    Step 1:
    def group_by_user(key, line):
        user, music, rating = line.split()
        yield user, (music, rating)
    def count_users_rating(key, values):
        # for each user, aggregate all ratings
        yield user, (music, ratings)
    Step 2:
    def pair_wise(user, values):
        # find all music pairs with
        # at least one common rating
        yield (musicA, musicB), (ratingA, ratingB)
    def calculate_similarity(key, values):
        # for each pair, calculate
        # Pearson, Cosine similarity
        yield (musicA, musicB), (pearson, cosine, num)
  • 12. [Figure; legible labels: Training Set, Test Set]
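
The two MapReduce steps on slide 11 can be illustrated with a minimal sketch in plain Python. This is not the author's MRJob code: it runs in a single process, uses dictionaries in place of the shuffle phase, and assumes each input line is a whitespace-separated "user music rating" triple with at most one rating per user and album. The function names follow the slide; the sample data and the zero-variance fallback (similarity 0.0) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def group_by_user(lines):
    """Step 1: parse 'user music rating' lines and collect ratings per user."""
    ratings_by_user = defaultdict(list)
    for line in lines:
        user, music, rating = line.split()
        ratings_by_user[user].append((music, float(rating)))
    return ratings_by_user


def pair_wise(ratings_by_user):
    """Step 2a: for every pair of albums rated by the same user, collect the rating pair."""
    co_ratings = defaultdict(list)
    for items in ratings_by_user.values():
        # Sorting by album name gives each pair one canonical key, e.g. ("A", "B").
        for (music_a, rating_a), (music_b, rating_b) in combinations(sorted(items), 2):
            co_ratings[(music_a, music_b)].append((rating_a, rating_b))
    return co_ratings


def calculate_similarity(pairs):
    """Step 2b: return (pearson, cosine, number of common raters) for one album pair."""
    n = len(pairs)
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]

    # Cosine similarity of the two co-rating vectors.
    dot = sum(a * b for a, b in pairs)
    norm_x = sqrt(sum(a * a for a in xs))
    norm_y = sqrt(sum(b * b for b in ys))
    cosine = dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

    # Pearson correlation: cosine similarity of the mean-centered vectors.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in xs))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in ys))
    # Assumption: with zero variance (e.g. a single common rater) report 0.0.
    pearson = cov / (dev_x * dev_y) if dev_x and dev_y else 0.0

    return pearson, cosine, n


if __name__ == "__main__":
    # Hypothetical sample ratings, not taken from the Amazon data set.
    sample = ["u1 A 4", "u1 B 5", "u2 A 2", "u2 B 3", "u3 A 1", "u3 B 2", "u1 C 3"]
    for pair, values in sorted(pair_wise(group_by_user(sample)).items()):
        print(pair, calculate_similarity(values))
```

Returning the number of common raters alongside the two scores, as the slide's final yield does, makes it possible to discard pairs supported by too few users before ranking.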
