Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

Paper: Expanding Identifiers to Normalize Source Code Vocabulary
Authors: Dave Binkley and Dawn Lawrie
Session: Research Track 4: Natural Language Analysis
Published on: Mar 3, 2016
Published in: Technology, Education
Source: www.slideshare.net


Transcripts - Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

  • 1. EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY. PRESENTED BY DAWN LAWRIE, LOYOLA UNIVERSITY MARYLAND, IN COLLABORATION WITH DAVE BINKLEY. Friday, October 7, 11
  • 2. VOCABULARY MISMATCH: DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS. EXAMPLE: REQUIREMENT - “FEATURE LOCATION”; SOURCE CODE - “FEATURELOCATION” OR WORSE “FLOC”
  • 3. PURPOSE OF NORMALIZE: COPE WITH VOCABULARY MISMATCH BETWEEN SOURCE CODE AND OTHER SOFTWARE DOCUMENTS
  • 4. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURELOCATION, FLOC
  • 5. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM), FLOC
  • 6. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM), F LOC (SPLITTING PROBLEM)
  • 7. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM), FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)
  • 8. WHY NORMALIZE? MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES. UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS
  • 9. NORMALIZE PROBLEM STATEMENT: FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS. FLOC → FEATURE LOCATION
  • 10. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN; SOFT-WORD - WHITE-HOUSE_LAWN
  • 11. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN
  • 12. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN (3)
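The terminology slides count two hard words in whitehouse_lawn and three soft words in white-house_lawn. A minimal sketch of the hard-word step, assuming hard words are delimited by underscores and camel-case boundaries (the function name `hard_words` and the regular expression are illustrative, not taken from the talk):

```python
import re

def hard_words(identifier):
    """Split an identifier at explicit markers: underscores and camel case."""
    parts = []
    for chunk in identifier.split("_"):
        # Lowercase or Capitalized runs, acronyms, and digit runs.
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
    return parts

if __name__ == "__main__":
    print(hard_words("whitehouse_lawn"))   # two hard words
    print(hard_words("featureLocation"))
```

Going from hard words to soft words (white-house) needs a dictionary-driven splitter, which this sketch does not attempt.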
  • 13. NORMALIZE ALGORITHM
  • 14. NORMALIZE ALGORITHM: STRLEN → STRING LENGTH
  • 15. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA
  • 16. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. EL: THE; PAPA: FATHER / POTATO / POPE; VISITA: VISITS / VISITOR / HIT; LA: THE; IGLESIA: CHURCH
  • 17. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. EL: THE; PAPA: FATHER / POTATO / POPE; VISITA: VISITS / VISITOR / HIT; LA: THE; IGLESIA: CHURCH
  • 18. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. EL: THE; PAPA: FATHER / POTATO / POPE; VISITA: VISITS / VISITOR / HIT; LA: THE; IGLESIA: CHURCH. STRONG COHESION
  • 19. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. EL: THE; PAPA: FATHER / POTATO / POPE; VISITA: VISITS / VISITOR / HIT; LA: THE; IGLESIA: CHURCH. STRONG COHESION
  • 20. NORMALIZE ALGORITHM
  • 21. NORMALIZE ALGORITHM: STRLEN
  • 22. NORMALIZE ALGORITHM: STRLEN. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 23. NORMALIZE ALGORITHM: STRLEN. E(RLEN) = {RIFLEMEN}. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 24. NORMALIZE ALGORITHM: STRLEN. E(RLEN) = {RIFLEMEN}; WILDCARD EXPANSION R*L*E*N*. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 25. NORMALIZE ALGORITHM: STRLEN. E(ST) = {SET, STOP, STRING}; E(RLEN) = {RIFLEMEN}; E(STR) = {STEER, STRING}; E(LEN) = {LENDER, LENGTH}. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
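The split enumeration and the wildcard expansion (R*L*E*N*) on these slides can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the function names, the seven-word vocabulary, and the reading of the wildcard as "the soft word's letters, in order, each followed by any characters" are all hypothetical choices made for the example.

```python
import re
from itertools import combinations

def all_splits(word):
    """Every way to cut `word` into contiguous pieces: 2**(len(word)-1) splits."""
    n = len(word)
    splits = []
    for k in range(n):                              # number of cut points
        for cuts in combinations(range(1, n), k):   # where to cut
            bounds = (0, *cuts, n)
            splits.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

def wildcard_expand(soft_word, vocabulary):
    """Vocabulary words matching e.g. r*l*e*n* for the soft word 'rlen'."""
    pattern = re.compile(".*".join(map(re.escape, soft_word)) + ".*")
    return [w for w in vocabulary if pattern.fullmatch(w)]

if __name__ == "__main__":
    # Toy vocabulary (assumption); a real run would use a large word list.
    vocab = ["set", "stop", "string", "steer", "lender", "length", "riflemen"]
    print(len(all_splits("strlen")))
    for soft_word in ("str", "len", "rlen"):
        print(soft_word, wildcard_expand(soft_word, vocab))
```

A six-letter identifier has five possible cut points, so the enumeration yields 32 candidate splits, each of whose soft words then gets its own expansion set.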
  • 26. NORMALIZE ALGORITHM PART I: STR: STRING VS STEER
  • 27. NORMALIZE ALGORITHM PART I: STR: STRING WITH {LENDER, LENGTH} VS STEER WITH {LENDER, LENGTH}
  • 28. NORMALIZE ALGORITHM PART I: STR: STRING WITH {LENDER, LENGTH} VS STEER WITH {LENDER, LENGTH}. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  • 29. NORMALIZE ALGORITHM PART I: STR: STRING + LENDER + LENGTH (COHESION A) VS STEER + LENDER + LENGTH (COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  • 30. NORMALIZE ALGORITHM PART I: STR: STRING + LENDER + LENGTH (COHESION A) VS STEER + LENDER + LENGTH (COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  • 31. NORMALIZE ALGORITHM PART I: STR: STRING + LENDER + LENGTH (COHESION A) VS STEER + LENDER + LENGTH (COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  • 32. NORMALIZE ALGORITHM PART I: STR: STRING + LENDER + LENGTH (COHESION A) VS STEER + LENDER + LENGTH (COHESION B). SELECTED: STRING. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
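The two Part I steps (sum the log probabilities of word pairs, then keep the expansion with the highest cohesion) can be sketched like this. The co-occurrence probabilities are invented for the illustration, and the `floor` value used for unseen pairs is an assumption; the talk does not say how missing pairs are handled.

```python
import math

def cohesion(candidate, other_words, pair_prob, floor=1e-9):
    """Sum of log co-occurrence probabilities of `candidate` with each other word."""
    return sum(
        math.log(pair_prob.get(frozenset((candidate, other)), floor))
        for other in other_words
    )

def best_expansion(candidates, other_words, pair_prob):
    """Step 2: select the candidate expansion that maximizes cohesion."""
    return max(candidates, key=lambda c: cohesion(c, other_words, pair_prob))

if __name__ == "__main__":
    # Made-up probabilities (assumption), keyed by unordered word pair.
    pair_prob = {
        frozenset(("string", "length")): 1e-3,
        frozenset(("string", "lender")): 1e-8,
        frozenset(("steer", "length")): 1e-7,
        frozenset(("steer", "lender")): 1e-8,
    }
    # Expand the soft word "str" against the candidate expansions of "len".
    print(best_expansion(["string", "steer"], ["lender", "length"], pair_prob))
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow.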
  • 33. NORMALIZE ALGORITHM PART II: STR-LEN VS ST-RLEN
  • 34. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN)
  • 35. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS
  • 36. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  • 37. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  • 38. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). SELECTED: STRING LENGTH. 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
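Part II compares whole splits by the cohesion of their chosen expansions. A rough sketch, assuming each split has already been expanded (as in Part I) and that cohesion over an expansion is the sum of log probabilities over all of its word pairs; the probabilities and the `floor` for unseen pairs are invented for the example:

```python
import math
from itertools import combinations

def expansion_cohesion(words, pair_prob, floor=1e-9):
    """Step 1: cohesion over an expansion, summed over all of its word pairs."""
    return sum(
        math.log(pair_prob.get(frozenset(pair), floor))
        for pair in combinations(words, 2)
    )

def best_split(expanded_splits, pair_prob):
    """Step 2: select the split whose expansion maximizes cohesion."""
    return max(
        expanded_splits,
        key=lambda split: expansion_cohesion(expanded_splits[split], pair_prob),
    )

if __name__ == "__main__":
    pair_prob = {                       # made-up values (assumption)
        frozenset(("string", "length")): 1e-3,
        frozenset(("stop", "riflemen")): 1e-8,
    }
    expanded = {
        "str-len": ["string", "length"],
        "st-rlen": ["stop", "riflemen"],
    }
    winner = best_split(expanded, pair_prob)
    print(winner, expanded[winner])
```

Because the score is a sum over pairs, splits with more soft words accumulate more negative terms; the sketch makes no attempt to correct for that.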
  • 39. ADDING CONTEXT
  • 40. ADDING CONTEXT: DIR
  • 41. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}
  • 42. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}; CONTEXT = {FORWARD, BACKWARD}
  • 43. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}; CONTEXT = {FORWARD, BACKWARD}. FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS. USED IN BOTH PART 1 AND PART 2
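Context matters most for an identifier such as dir that has no sibling soft words to disambiguate it. The sketch below scores each candidate against the context words together with the expansions of any other soft words. The probabilities, the `floor` for unseen pairs, and the function names are assumptions made for the illustration.

```python
import math

def cohesion_with_context(candidate, other_expansions, context, pair_prob, floor=1e-9):
    """Log-probability cohesion against other expansions plus context words."""
    neighbours = list(other_expansions) + list(context)
    return sum(
        math.log(pair_prob.get(frozenset((candidate, word)), floor))
        for word in neighbours
    )

def pick_with_context(candidates, other_expansions, context, pair_prob):
    return max(
        candidates,
        key=lambda c: cohesion_with_context(c, other_expansions, context, pair_prob),
    )

if __name__ == "__main__":
    pair_prob = {                       # made-up values (assumption)
        frozenset(("direction", "forward")): 1e-4,
        frozenset(("direction", "backward")): 1e-4,
        frozenset(("directory", "forward")): 1e-7,
    }
    # "dir" is the only soft word, so there are no other expansions to use.
    print(pick_with_context(["direction", "directory"], [], ["forward", "backward"], pair_prob))
```

With these toy numbers the surrounding words forward and backward tip the choice toward direction.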
  • 44. NORMALIZE IMPLEMENTATION: USES GenTest TO SPLIT IDENTIFIERS; RETURNS MULTIPLE SPLITS; GOOGLE 5-GRAM DATASET
  • 45. EVALUATION
        Program      LoC      SLoC     Unique Ids
        which-2.20   3,670    2,293    487
        a2ps-4.14    62,347   38,436   4,393

        Program      Selected Ids   Hard Words   Soft Words
        which-2.20   487            903          1,214
        a2ps-4.14    211            459          618
  • 46. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  • 47. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  • 48. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
        Program      Filtered Ids   Reported Ids
        which-2.20   152            335
        a2ps-4.14    46             166
  • 49. EXAMPLE EXPANSIONS
        id         Top 10 Expansion    Top Expansion
        nextchar   next_character      next_character
        indfound   index_found_need    index_found
        optarg     option_are_g        optarg
        itemno     i_them_not          itemno
  • 50. RESEARCH QUESTIONS: WHAT IS THE OVERALL ACCURACY OF NORMALIZE? DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY? CAN THE EXPANDER INFORM THE SPLITTER? CAN THE SPLITTER INFORM THE EXPANDER?
  • 51. ACCURACY ON DOMAIN IDS
  • 52. SOURCE OF EXPANSION WORDS: SOURCE CODE; INTERNAL DOCUMENTATION; MANUAL
  • 53. BEST VOCABULARY SOURCE?
  • 54. FUTURE WORK: EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA; EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES; EXAMINING NORMALIZATION IN CONTEXT OF AN INFORMATION RETRIEVAL TASK
  • 55. SUMMARY: IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS; DEGRADES PERFORMANCE OF IR TECHNIQUES; NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY
  • 56. QUESTIONS? Need an identifier split? GenTest Splitter available at splitit.cs.loyola.edu