Previous Work
A look at publicly available code I've written, talks I've given, and projects I've led
Every company deals with fraud, whether it’s signup fraud, transaction fraud, employee fraud, or a million other variants. However, there’s very little published literature or prior work outside of closed-door working groups.
During this standing-room-only talk, I walk through common approaches to building machine learning infrastructure for capturing and blocking fraud.
Spoilers are a complicated concept (see this guideline), but avoiding movie and TV show spoilers is a common goal.
With this in mind, I built a model that can determine whether message board posts contain spoilers, using data I pulled from Reddit. This model robustly handles edge cases and new concepts (such as speculation and previously unseen characters), while generalizing well.
There are many areas of applied Machine Learning which require models optimized for rare occurrences (i.e. class imbalance), as well as users actively attempting to subvert the system (i.e. adversaries).
The approaches discussed include ensemble models, deep learning, genetic algorithms, outlier detection via dimensionality reduction (PCA and neural network auto-encoders), time-decay weighting, and the Synthetic Minority Over-sampling Technique (SMOTE).
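To make one of these techniques concrete, here is a minimal sketch of outlier detection via dimensionality reduction: fit PCA (via SVD) on mostly clean data, then score points by their reconstruction error, i.e. their distance from the learned subspace. The data and threshold here are illustrative assumptions, not from any production system.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 "normal" points lying on a random 2-D plane embedded in 5-D
X_train = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
outlier = np.full(5, 10.0)  # a point far off that plane
X_all = np.vstack([X_train, outlier])

# Fit PCA via SVD on the clean training data; keep the top-2 components
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
V = Vt[:2].T  # principal directions, shape (5, 2)

# Outlier score = reconstruction error (distance from the learned subspace)
Xc = X_all - mu
errors = np.linalg.norm(Xc - (Xc @ V) @ V.T, axis=1)

# The appended outlier gets by far the largest reconstruction error
print(int(np.argmax(errors)))  # → 200
```

Auto-encoder-based detection follows the same recipe, with the linear projection replaced by a learned nonlinear encoder/decoder.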
There aren't many great batteries-included examples for modeling text with deep learning, so I've built out this repo to contain starter code for:
Text processing: Processing text to be utilized with keras (text pre-processing, converting to indices, padding)
Pre-trained embedding: Using a pre-trained text embedding (GoogleNews 300) with keras (translating words to a point in \mathbb{R}^{300})
Convolutional architecture: Modeling text with a convolutional architecture (functionally similar to Ngrams)
RNN architecture: Modeling text with a Recurrent Neural Net (RNN) architecture (functionally similar to a rolling window)
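As a flavor of the first step, here is a minimal pure-Python stand-in for what keras's `Tokenizer` and `pad_sequences` utilities do: build a word index, convert texts to integer sequences, and pre-pad them to a fixed length. The toy texts and `maxlen` are illustrative assumptions, not the repo's actual code.

```python
# Toy corpus to fit the vocabulary on
texts = ["the movie was great", "the ending was a twist"]

# Build a word index; index 0 is reserved for padding, as in keras
vocab = {}
for text in texts:
    for word in text.split():
        vocab.setdefault(word, len(vocab) + 1)

def to_padded_indices(text, maxlen=6):
    """Map words to integer indices (0 for unknown words), pre-pad/truncate to maxlen."""
    seq = [vocab.get(w, 0) for w in text.split()]
    return ([0] * (maxlen - len(seq)) + seq)[-maxlen:]

print(to_padded_indices("the movie was great"))  # → [0, 0, 1, 2, 3, 4]
```

The resulting integer sequences are what get fed into an embedding layer, which then maps each index to a dense vector before the convolutional or recurrent layers.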
Capital One receives thousands of legal requests every year, often as physical mail. During this talk, we'll dive into how the Center for Machine Learning at Capital One has built a self-contained platform for summarizing, filtering, and triaging these legal documents, utilizing open source projects.
A utility that makes handling many resumes easier by automatically pulling contact information, required skills, and custom text fields. These results are then surfaced in a convenient summary CSV.
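The contact-extraction step can be sketched with a couple of regular expressions. This is a hypothetical simplification, not the project's actual code: `extract_contact` and both patterns are illustrative, and real resumes need more robust handling.

```python
import re

# Hypothetical patterns for pulling contact info out of raw resume text
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def extract_contact(text):
    """Return the first email address and phone number found, or None."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {"email": email.group() if email else None,
            "phone": phone.group() if phone else None}

resume = "Jane Doe\njane.doe@example.com\n(555) 123-4567\nSkills: Python, SQL"
print(extract_contact(resume))
```

Per-resume dicts like this one can then be collected into rows and written out with `csv.DictWriter` to produce the summary CSV.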
This started as a side project in grad school, but has become a community project used at companies across the globe.