MSc Substantial Coursework
Multi-label Classification of Products:
Given a set of product descriptions and three category labels, exploratory
analysis is performed to inform data cleaning steps. Customized text processing is applied to produce
term-frequency inverse document-frequency (TF-IDF) vectors. These are used to train a cascading series of
naïve Bayes classifiers that target the three levels of product category successively.
Keywords: NB Classification, NLP, multi-label, TF-IDF.
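A minimal sketch of one way such a cascade could be arranged, assuming scikit-learn and a toy dataset invented
for illustration. The function names (fit_cascade, predict_cascade), the one-model-per-parent-path layout, and
the default TF-IDF settings are assumptions, not the coursework's actual design or its customized text processing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def fit_cascade(texts, labels):
    """Fit one TF-IDF + naive Bayes model per parent category path.

    labels is a list of tuples such as (level1, level2, level3).
    The returned dict maps a parent path (a tuple) to either a fitted
    pipeline or, when only one child label exists, that label itself.
    """
    models = {}
    depth = len(labels[0])
    for level in range(depth):
        # Group documents by the categories already decided above this level.
        groups = {}
        for text, path in zip(texts, labels):
            groups.setdefault(path[:level], []).append((text, path[level]))
        for parent, rows in groups.items():
            xs = [text for text, _ in rows]
            ys = [label for _, label in rows]
            if len(set(ys)) == 1:
                models[parent] = ys[0]  # single child: nothing to classify
            else:
                pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
                models[parent] = pipeline.fit(xs, ys)
    return models


def predict_cascade(models, text, depth):
    """Predict each level in turn, conditioning on the levels above it."""
    path = ()
    for _ in range(depth):
        model = models[path]
        label = model if isinstance(model, str) else model.predict([text])[0]
        path += (str(label),)
    return path


# Toy data, invented for illustration only.
texts = [
    "red cotton t-shirt for men",
    "blue denim jeans for men",
    "usb-c charging cable",
    "wireless bluetooth headphones",
    "cotton summer dress for women",
    "noise cancelling over-ear headphones",
]
labels = [
    ("clothing", "mens", "tops"),
    ("clothing", "mens", "trousers"),
    ("electronics", "accessories", "cables"),
    ("electronics", "audio", "headphones"),
    ("clothing", "womens", "dresses"),
    ("electronics", "audio", "headphones"),
]

models = fit_cascade(texts, labels)
print(predict_cascade(models, "wireless bluetooth headphones", depth=3))
```

Because each lower-level model is trained only on documents under its parent path, a predicted path is always a
category combination seen in training; the trade-off is that an error at the first level cannot be recovered later.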
Distributed Insights for LMS Logs Analysis:
Hypotheses about the significance of platform features and user demographics in relation to consistent
engagement are explored in LMS logs using PySpark SQL and MLlib. Analysis relies on significant filtering and
binning based on aggregations of timestamps, joins to demographic data, and cross-tabulation, as well as a
parameterized grid search with random forest classifiers and feature importance inspection by average
impurity decrease.
Keywords: Spark, lazy-execution, feature importance.
Deep Neural Network Architecture Search:
A dataset of sentences extracted from reviews of Amazon products, tagged for usefulness in purchase decision
making, is used to define a regression problem. An initial feed-forward model is developed to operate in a
low-information environment using only one-hot encoded syntactic token presence, and is optimized to
outperform a naïve baseline. Using learned embeddings, a gradient descent regressor and a single-layer
recurrent neural net are defined as baselines. A functional hypermodel is then defined that facilitates
parameter tuning and comparison of multi-layer and multi-input neural nets that optionally take product
titles as a second input using pre-trained embeddings. Models are tuned using a Bayesian optimizer. Models
are then evaluated on a custom ranking task scored by Kendall's tau, simulating real-world use and
differentiating between models where standard regression metrics do not.
Keywords: Recurrent neural networks, hyperparameter optimization, NLP.
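A small illustration of why a rank-based score can separate models that a standard regression metric cannot.
This is a sketch in plain Python using the tau-a form of Kendall's tau (no tie correction); the scores are
invented for illustration and the project's actual ranking task is not reproduced here. In practice a library
routine such as scipy.stats.kendalltau could be used instead.

```python
from itertools import combinations


def kendall_tau(true_scores, predicted_scores):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(true_scores)), 2):
        sign = (true_scores[i] - true_scores[j]) * (
            predicted_scores[i] - predicted_scores[j]
        )
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(true_scores) * (len(true_scores) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_squared_error(true_scores, predicted_scores):
    """Plain MSE, used here as the 'standard' regression metric."""
    return sum(
        (t - p) ** 2 for t, p in zip(true_scores, predicted_scores)
    ) / len(true_scores)


# Hypothetical usefulness scores for three sentences.
true_scores = [0.4, 0.5, 0.6]
model_a = [0.5, 0.6, 0.7]  # every score off by 0.1, order preserved
model_b = [0.5, 0.4, 0.7]  # every score off by 0.1, first two swapped

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(
        name,
        "MSE:", round(mean_squared_error(true_scores, preds), 4),
        "tau:", round(kendall_tau(true_scores, preds), 4),
    )
```

Both hypothetical models have the same mean squared error, but only model_a ranks the sentences in the true
order, so only the rank-based score tells them apart.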
Parallel Proof-of-Work:
Implementation of a blockchain mining node. Includes transaction and block verification,
as well as a parallelized mining function using a stratified random nonce search.
Mining is completed as part of coursework requirements on multiple platforms using between 2 and 36
processing tasks. A 51% attack is forensically analyzed and a potential mitigation strategy
using anomaly detection in nonce search strategy is proposed.
Keywords: Parallelization, search strategy, blockchain, cryptography.
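A minimal sketch of one reading of a stratified random nonce search: the nonce space is split into equal
contiguous strata, one per worker, and each worker samples random nonces inside its own stratum. The header
format, the 32-bit nonce, the SHA-256 leading-zero-bits target, and all function names are assumptions made
for illustration, not the coursework's specification.

```python
import hashlib
import random
from concurrent.futures import ProcessPoolExecutor, as_completed

NONCE_SPACE = 2 ** 32  # assumed 32-bit nonce


def is_valid(block_header: bytes, nonce: int, difficulty_bits: int) -> bool:
    """True if SHA-256(header + nonce) starts with difficulty_bits zero bits."""
    digest = hashlib.sha256(block_header + nonce.to_bytes(4, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def search_stratum(block_header, difficulty_bits, stratum, n_strata, max_tries, seed):
    """Sample random nonces from one stratum; return a valid nonce or None."""
    rng = random.Random(seed)
    width = NONCE_SPACE // n_strata
    low = stratum * width
    for _ in range(max_tries):
        nonce = low + rng.randrange(width)
        if is_valid(block_header, nonce, difficulty_bits):
            return nonce
    return None


def mine(block_header, difficulty_bits, n_workers=4, max_tries=200_000):
    """Search every stratum in a separate process; return the first nonce found.

    The remaining workers are not interrupted in this sketch: leaving the
    `with` block waits for them to finish their own try budget.
    """
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(
                search_stratum, block_header, difficulty_bits,
                stratum, n_workers, max_tries, stratum,
            )
            for stratum in range(n_workers)
        ]
        for future in as_completed(futures):
            nonce = future.result()
            if nonce is not None:
                return nonce
    return None


if __name__ == "__main__":
    header = b"example block header"  # placeholder, not a real block format
    print(mine(header, difficulty_bits=12))
```

Stratifying keeps workers from duplicating each other's nonces without any coordination between them, while
random sampling inside each stratum avoids the predictable sweep of a purely sequential search.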
MSc Capstone Project:
Abstract: Transformer neural network architectures are increasingly used in state-of-the-art neural network
models for time-series forecasting. An important component of modern transformers is an attention mechanism
that provides a means for a model to learn and encode the relative dependencies of elements in a sequence.
This study will explore the potential of probabilistic Signal Diffusion Mapping (SDM) as an attention
mechanism specifically for forecasting financial data. This implementation of the SDM algorithm promises,
with linear asymptotic complexity, to combat the challenges of lag-length distortion in financial data by
including the ability to operationalize general propositions about scedasticity while calculating
time-varying optimal lag-length relationships.
Keywords: Transformer, Forecasting, Lag-length distortion, Deep neural networks.
Code samples: here