# Learning: the hows and whys of machine learning

Liam Wiltshire

https://liam-wiltshire.github.io/talks/?talk=machinelearning&conference=phpuk
https://joind.in/event/php-uk-conference-2019/learning-the-hows-and-whys-of-machine-learning

## Overview

Worked example: chargebacks

## Supervised learning

- Training data
- Learning functions
- Categorisation / classification
- Regression - where do we sit on a line?

## Naive Bayes classifier

Standardise words:

- Un-pluralise
- Un-gender
- Un-tense
- etc.

More data == better

## Tokenisation

Unique tokens for each unique context

## Imbalanced data

- One category has far more data - 99% of the data is not a chargeback
- Just being accurate is not very helpful - the model started by flagging 100% of transactions as fine
- Need to collect more data, change methods, or resample the data

## Understand data - context

- Common data vs specific data
- Continuous vs discrete data

## KNN

k-Nearest Neighbours
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

- Classify by distance to labelled points
- Less sensitive to imbalance
- Keep K odd (no draws)

## Handling nominal data

Encode as binary flags:

- Increases the number of dimensions
- Normalisation required - equal scales

## Contextless data is meaningless

Is it normal?

## Next to try

- Weighting
- Different dimensions
- Change K value (was 3NN)
- Remove outliers
- Different distance function
- Weighted distance

# Useful links

- https://en.wikipedia.org/wiki/Benford%27s_law
- https://php-ml.readthedocs.io
- https://liam-wiltshire.github.io/talks/?talk=machinelearning&conference=phpuk
- https://joind.in/event/php-uk-conference-2019/learning-the-hows-and-whys-of-machine-learning
- https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
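The imbalanced-data pitfall from the notes (99% of transactions are not chargebacks, so a model that flags everything as fine looks 99% "accurate") can be sketched numerically; the 1,000-transaction dataset below is an illustrative stand-in, not data from the talk:

```python
# Sketch: why raw accuracy misleads on imbalanced data.
# Illustrative dataset: 1000 transactions, 1% are chargebacks.
labels = [1] * 10 + [0] * 990          # 1 = chargeback, 0 = fine

# A "classifier" that flags every transaction as fine (always predicts 0).
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the class we actually care about: chargebacks caught.
caught = sum(1 for p, y in zip(predictions, labels) if y == 1 and p == 1)
recall = caught / sum(labels)

print(accuracy)  # 0.99 - looks great on paper
print(recall)    # 0.0  - catches no chargebacks at all
```

This is why the notes suggest collecting more data, changing methods, or resampling rather than chasing accuracy alone.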
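The Naive Bayes steps in the notes (standardise words, tokenise, count per category) can be sketched with a toy word-count classifier. The training phrases are invented, the `standardise` rule is a crude stand-in for a real stemmer, and class priors are omitted since the toy classes are balanced:

```python
import math
from collections import Counter, defaultdict

def standardise(word):
    """Crude stand-in for stemming: un-pluralise by stripping a trailing
    's'. A real pipeline would also un-gender and un-tense words."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def tokenise(text):
    return [standardise(w) for w in text.split()]

def train(samples):
    """samples: list of (text, label) pairs -> per-label token counts."""
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(tokenise(text))
    return counts

def classify(counts, text):
    """Pick the label maximising the log naive-Bayes likelihood,
    with add-one smoothing so unseen tokens don't zero a class out.
    Class priors are omitted (equal class sizes in this toy example)."""
    vocab = {t for c in counts.values() for t in c}
    best, best_score = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        score = sum(
            math.log((c[t] + 1) / (total + len(vocab)))
            for t in tokenise(text)
        )
        if score > best_score:
            best, best_score = label, score
    return best

# Illustrative training data - not from the talk.
model = train([
    ("refunds disputed cards", "chargeback"),
    ("disputed payments", "chargeback"),
    ("normal orders", "fine"),
    ("regular payments shipped", "fine"),
])
print(classify(model, "disputed card"))   # chargeback
```

Note how "cards" and "card" land on the same token after standardisation, which is exactly why the notes say more (and normalised) data is better.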
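The k-NN points in the notes (distance-based voting, equal scales via normalisation, odd K to avoid draws) can be sketched as below; the two-feature transaction data is invented for illustration:

```python
import math
from collections import Counter

def normalise(rows):
    """Min-max scale each feature column to [0, 1] so every dimension
    contributes on an equal scale, as the notes require."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    spans = [(max(c) - m) or 1.0 for c, m in zip(cols, mins)]
    return [[(v - m) / s for v, m, s in zip(row, mins, spans)]
            for row in rows]

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points.
    Keep k odd so a two-class vote cannot draw."""
    dists = sorted((math.dist(x, query), y)
                   for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Illustrative features, e.g. (order value, account age in days).
train = [[10, 1000], [12, 900], [90, 50], [95, 40]]
labels = ["fine", "fine", "chargeback", "chargeback"]

# Normalise training points and query together so they share one scale.
rows = normalise(train + [[88, 60]])
prediction = knn_predict(rows[:-1], labels, rows[-1], k=3)
print(prediction)  # chargeback
```

Without the normalisation step, the second feature (spanning 40-1000) would dominate the distance and drown out the first, which is the "equal scales" problem the notes flag.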