
Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development: get tickets from the Django Trac ticket tracking system and try to predict how long it will take to close each ticket. The finding: developers aren't putting the RIGHT information into their tracking systems :)
Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Naive application of Machine Learning to Software Development

  • 2. Naive application of Machine Learning to Software Development
        or... what developers don't tell :)
  • 5. What and why
        42 Coffee Cups: completely distributed development team
        Hard facts about how software is done
        LOTS OF THEM
  • 8. What and why
        Facts ??? Profit
  • 10. What and why
        ???
        Toy problem: get a ticket and predict how long it will take to close it
        Bonus: learn scikit-learn :)
  • 15. Install scikit-learn
        ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
        ● pip install -U scikit-learn
  • 16. Data: closed tickets
        import urllib2
        url = ('https://code.djangoproject.com/query?format=csv'
               + '&col=id&col=time&col=changetime&col=reporter'
               + '&col=summary&col=status&col=owner&col=type'
               + '&col=component&order=priority')
        tickets = urllib2.urlopen(url).read()
        open('2012-10-09.csv', 'w').write(tickets)
  • 17. Data: closed tickets
        id,time,changetime,reporter,summary,status,owner,type,component
        1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
        2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
  • 18. Data: closed date and description
        def get_data(ticket):
            url = 'https://code.djangoproject.com/ticket/%s' % ticket
            ticket_html = urllib2.urlopen(url)
            bs = BeautifulSoup(ticket_html)
  • 19. Data: closed date and description
        # get closing date
        d = bs.find_all('div', 'date')[0]
        p = list(d.children)[3]
        href = p.find('a')['href']
        close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
        close_time = datetime.datetime.strptime(close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
        # ... more black magic, see code
  • 20. Data: closed date and description
        def get_data(ticket):
            [...]
            # get description and return
            de = bs.find_all('div', 'description')[0]
            return close_time, de.text
  • 21. Data: closed date and description
        tickets_file = csv.reader(open('2012-10-09.csv'))
        output = csv.writer(open('2012-10-09.close.csv', 'w'))
        for id, time, changetime, reporter, summary, status, owner, type, component in tickets_file:
            closetime, descr = get_data(id)
            row = [id, time, changetime, closetime, reporter, summary,
                   status, owner, type, component, descr.encode('utf-8')]
            output.writerow(row)
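
The CSV export shown on the "Data: closed tickets" slide can be read with the standard library alone. The following is a minimal Python 3 sketch (the slides themselves use Python 2). The names `SAMPLE_CSV`, `TIME_FORMAT` and `read_tickets` are hypothetical, and the timestamp format is an assumption inferred from the two sample rows. Note that `changetime` is the last modification time, which is why the slides scrape the real closing date separately.

```python
import csv
import io
from datetime import datetime

# Two sample rows copied from the "Data: closed tickets" slide.
SAMPLE_CSV = """id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
"""

# Assumption: timestamps follow the format seen in the sample rows.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def read_tickets(csv_text):
    """Parse Trac CSV export text into a list of dicts with parsed dates."""
    tickets = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["id"] = int(row["id"])
        row["time"] = datetime.strptime(row["time"], TIME_FORMAT)
        row["changetime"] = datetime.strptime(row["changetime"], TIME_FORMAT)
        tickets.append(row)
    return tickets


tickets = read_tickets(SAMPLE_CSV)
print(tickets[0]["reporter"], tickets[0]["time"].year)
```

Using `csv.DictReader` consumes the header row automatically, so it never reaches the per-ticket processing.
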
  • 22. Scoring: Train/Test set split
        cross_validation.train_test_split
        (tickets_train, tickets_test, times_train, times_test) = cross_validation.train_test_split(
            tickets, times, test_size=0.2, random_state=0)
  • 23. Scoring: Mean squared error
        sklearn.metrics.mean_squared_error
        train_error = metrics.mean_squared_error(times_train, times_train_predict)
        test_error = metrics.mean_squared_error(times_test, times_test_predict)
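
The two scoring slides use the scikit-learn API as it was in 2012. As a rough sketch of the same two steps with a current scikit-learn, where `train_test_split` lives in `sklearn.model_selection` rather than `cross_validation`, something like the following should work. The toy `tickets` and `times` lists are made up for illustration and are not the talk's data.

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature rows and the time-to-fix targets.
tickets = [[float(n)] for n in range(10)]
times = [10.0 * n for n in range(10)]

tickets_train, tickets_test, times_train, times_test = train_test_split(
    tickets, times, test_size=0.2, random_state=0)

# With 10 samples and test_size=0.2, 8 rows go to training and 2 to testing.
print(len(tickets_train), len(tickets_test))

# Mean squared error of two predictions that are off by 0 and by 2.
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

Fixing `random_state` keeps the split reproducible, so errors from different feature sets are compared on the same held-out tickets.
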
  • 24. Fun #1: just ticket number?
        for number, created, ... in tickets_file:
            row = []
            created = dt.datetime.strptime(created, time_format)
            closetime = dt.datetime.strptime(closetime, time_format)
            time_to_fix = closetime - created
            row.append(float(number))
            tickets.append(row)
            times.append(total_seconds(time_to_fix))
  • 25. Fun #1: just ticket number?
        import numpy as np
        from sklearn import preprocessing
        scaler = preprocessing.Scaler().fit(np.array(tickets))
        tickets = scaler.transform(tickets)
  • 26. Fun #1: just ticket number?
        clf = SVR()
        clf.fit(tickets_train, times_train)
        times_train_predict = clf.predict(tickets_train)
        times_test_predict = clf.predict(tickets_test)
  • 27. Fun #1: just ticket number?
        train_error = metrics.mean_squared_error(times_train, times_train_predict)
        test_error = metrics.mean_squared_error(times_test, times_test_predict)
        print 'Train error: %.1f\n Test error: %.2f' % (
            math.sqrt(train_error)/(24*3600),
            math.sqrt(test_error)/(24*3600))  # .. in days
  • 28. Fun #1: just ticket number?
        Train error: 363.4
        Test error: 361.41
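
The errors reported on these slides are root mean squared errors converted from seconds to days, following the `math.sqrt(...)/(24*3600)` step on slide 27. A small worked example of that arithmetic, using only the standard library, might look like this. The helper names `time_to_fix_seconds` and `rmse_days` and the dates are hypothetical, chosen so the result is easy to check by hand.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 3600


def time_to_fix_seconds(created, closed):
    """Time between creation and closing of a ticket, in seconds."""
    return (closed - created).total_seconds()


def rmse_days(actual_seconds, predicted_seconds):
    """Root mean squared error of second-valued predictions, in days."""
    squared = [(a - p) ** 2 for a, p in zip(actual_seconds, predicted_seconds)]
    mse = sum(squared) / len(squared)
    return math.sqrt(mse) / SECONDS_PER_DAY


created = datetime(2012, 1, 1)
actual = [
    time_to_fix_seconds(created, datetime(2012, 1, 3)),  # closed after 2 days
    time_to_fix_seconds(created, datetime(2012, 1, 5)),  # closed after 4 days
]
# A model that always predicts 3 days is off by exactly one day each time.
predicted = [3 * SECONDS_PER_DAY, 3 * SECONDS_PER_DAY]
print(rmse_days(actual, predicted))
```

Read this way, a train error of 363.4 means the predictions miss the real closing time by roughly a year on average.
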
  • 29. Finding best parameters
        SVM C controls regularization: larger C leads to
        ● closer fit to the train data
        ● with the risk of overfitting
  • 30. Finding best parameters
        Cs = np.logspace(-1, 10, 10)
        for c in Cs:
            learn(c)
  • 31. Finding best parameters
        C                Train error (days)   Test error (days)
        0.1              363.4                361.41
        1.71             363.4                361.41
        27.8             363.4                361.39
        464.2            363.2                361.17
        7742.6           362.5                360.41
        129155.0         362.1                360.00
        2154434.7        362.0                359.82
        35938136.6       361.7                359.60
        599484250.3      361.5                359.36
        10000000000.0    361.1                358.91
  • 33. Finding best parameters
        sklearn.grid_search.GridSearchCV
        bonus: it can run in parallel
        clf = GridSearchCV(estimator=SVR(),
                           param_grid=dict(C=np.logspace(-1, 10, 10)),
                           n_jobs=-1)
        clf.fit(tickets_train, times_train)
        Train error: 361.1
        Test error: 358.91
        Best C: 1.0e+10
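
For readers on a current scikit-learn, `GridSearchCV` now lives in `sklearn.model_selection`. The sketch below shows the same search over `C` on a synthetic sine curve, which is an assumption made purely for illustration and not the ticket data. The grid is narrower than the slide's `np.logspace(-1, 10, 10)` so that it finishes quickly, and `n_jobs=1` is used here. Passing `n_jobs=-1`, as on the slide, would use all available cores.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data (illustration only): a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(60)

# A smaller grid than on the slide, to keep the example fast.
param_grid = {"C": np.logspace(-1, 3, 5)}

# For regressors the default score is R^2, evaluated by cross-validation.
search = GridSearchCV(estimator=SVR(), param_grid=param_grid, cv=3, n_jobs=1)
search.fit(X, y)

print(search.best_params_["C"])
```

After fitting, `search.best_estimator_` is refit on the whole training set with the winning `C` and can be used directly for prediction.
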
  • 34. Fun #2: creation date?
        row = []
        row.append(float(number))
        row.append(float(time.mktime(created.timetuple())))
        tickets.append(row)
  • 35. Fun #2: creation date?
        Train error: 360.6
        Test error: 358.39
        Best C: 1.0e+10
  • 36. String vectorizer and Tfidf transform
        from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
  • 37. String vectorizer and Tfidf transform
        reporters = []
        for number, ... in tickets_file:
            [...]
            reporters.append(reporter)
  • 38. String vectorizer and Tfidf transform
        CountVectorizer().fit_transform(reporters)
        -> TfidfTransformer().fit_transform( … )
        -> hstack((tickets, …))
        note: TF-IDF matrix is sparse!
  • 39. String vectorizer and Tfidf transform
        import scipy.sparse as sp
        tickets = sp.hstack((
            tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer().fit_transform(reporters))))
        # remember to re-scale!
        scaler = preprocessing.Scaler(with_mean=False).fit(tickets)
        tickets = scaler.transform(tickets)
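
The count, TF-IDF, stack and re-scale pipeline from these slides can be sketched with a current scikit-learn as follows. `StandardScaler` is the present-day name of the `preprocessing.Scaler` used in the talk. The three-ticket dataset is invented for illustration, and converting the numeric column to a sparse matrix before stacking is a choice made in this sketch, not something taken from the slides.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler

# Invented toy data: one numeric feature (ticket number) and one text feature.
numbers = np.array([[1.0], [2.0], [3.0]])
reporters = ["adrian", "anonymous", "adrian"]

# Text -> token counts -> TF-IDF weights (both results are sparse matrices).
counts = CountVectorizer().fit_transform(reporters)
tfidf = TfidfTransformer().fit_transform(counts)

# Stack the numeric column next to the TF-IDF columns, keeping it sparse.
features = sp.hstack((sp.csr_matrix(numbers), tfidf)).tocsr()

# with_mean=False scales by variance only: centering would destroy sparsity.
scaled = StandardScaler(with_mean=False).fit_transform(features)

print(scaled.shape)
```

The resulting matrix has one column for the ticket number plus one column per distinct reporter token, which is why the feature count grows quickly once free-text fields are added.
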
  • 40. Fun #3: reporter
        Train error: 338.7
        Test error: 353.38
        Best C: 1.8e+07
  • 41. Fun #4: subject
        subjects = []
        for number, created, ... in tickets_file:
            [...]
            subjects.append(summary)
        [...]
        tickets = sp.hstack((tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer(ngram_range=(1,3)).fit_transform(subjects))))
  • 43. Fun #4: subject
        Train error: 21.0
        Test error: 331.79
        Best C: 1.0e+10
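
The subject slide passes `ngram_range=(1,3)` to `CountVectorizer`, which adds two-word and three-word phrases to the vocabulary on top of single words. A tiny illustration with a made-up subject line shows how fast the vocabulary grows. This is a plausible contributor to the very low train error, although the slides do not state a cause.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up ticket subject, used only to show the vocabulary growth.
subjects = ["Calendar popup closes"]

unigrams = CountVectorizer().fit(subjects)
ngrams = CountVectorizer(ngram_range=(1, 3)).fit(subjects)

# Single words only, versus words plus 2-word and 3-word phrases.
print(len(unigrams.vocabulary_), len(ngrams.vocabulary_))
print(sorted(ngrams.vocabulary_))
```

With thousands of distinct phrases as features, a high-`C` model has enough freedom to memorize individual training tickets.
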
  • 44. Different SVM kernels
        def learn(kernel='rbf', param_grid=None, verbose=False):
            [...]
            clf = GridSearchCV(
                estimator=SVR(kernel=kernel, verbose=verbose),
                param_grid=param_grid, n_jobs=-1)
            [...]
  • 45. Different SVM kernels
        Kernel   Train error (days)   Test error (days)   Best C
        RBF      21.0                 331.79              1.0e+10
        Linear   343.1                355.56              1.0e+02
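
To get a feel for why the two kernels can behave so differently on the training set, here is a small sketch that fits `SVR` with each kernel to a synthetic sine curve. The data, the fixed `C=10.0` and the helper `train_rmse` are all assumptions made for illustration; the talk tunes `C` by grid search on the real ticket features.

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic, clearly non-linear data: one full period of a sine wave.
X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel()


def train_rmse(kernel):
    """Fit an SVR with the given kernel and return its training RMSE."""
    model = SVR(kernel=kernel, C=10.0).fit(X, y)
    return math.sqrt(mean_squared_error(y, model.predict(X)))


errors = {kernel: train_rmse(kernel) for kernel in ("rbf", "linear")}
print(errors)
```

The RBF kernel can bend to follow the curve while the linear kernel can only draw a straight line, so the RBF training error comes out lower here. A low training error alone says nothing about the test error.
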
  • 46. Fun #5: account for the Component
        components = []
        for number, .. component, ... in tickets_file:
            [...]
            components.append(component)
        [...]
        tickets = sp.hstack((tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer().fit_transform(components))))
  • 47. Fun #5: account for the Component
        Kernel   Train error (days)   Test error (days)   Best C
        RBF      18.9                 327.79              1.0e+10
        Linear   342.2                354.89              1.0e+02
  • 48. Fun #6: ticket Description
        descriptions = []
        for number, ... description in tickets_file:
            [...]
            descriptions.append(description)
        [...]
        tickets = sp.hstack((tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer(ngram_range=(1,3)).fit_transform(descriptions))))
  • 49. Fun #6: ticket Description
        Kernel   Train error (days)   Test error (days)   Best C
        RBF      10.8                 328.44              1.0e+10
        Linear   14.0                 331.52              3.2e+03
  • 52. Conclusions
        ● All steps of a simple machine learning algo
        ● scikit-learn
        ● data explicitly available in tickets is NOT ENOUGH to predict the closing date
  • 53. Developers, what are you hiding? :)
  • 54. Questions?
        Source code and dataset available at https://github.com/42/django-trac-learning.git
        Contacts:
        ● @akhavr
        ● http://42coffeecups.com/
