Toxicity Classification (Kaggle)
Bronze medal (Top 10%)
Jigsaw Unintended Bias in Toxicity Classification (Kaggle)
Challenge:
The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), builds technology to protect voices in conversation. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion. This competition aimed to identify toxicity across diverse online conversations by building an NLP model that recognizes toxic comments while minimizing unintended bias with respect to mentions of identities.
Solution:
- Cleaned and normalized the data for model construction, using BERT and LSTM deep models for joint learning and prediction (see the cleaning sketch after this list).
- Preprocessed the natural-language data by leveraging word2vec and GloVe to transform comments into text embeddings (see the embedding sketch below).
- Set up and experimented with advanced NLP models, including Bi-LSTM, BERT, GPT-2, and XLNet; based on the experiment results, we chose a blend of Bi-LSTM training and BERT fine-tuning as our final method (see the blending sketch below).
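
A minimal sketch of the cleaning step, assuming the competition's train.csv layout with a `comment_text` column and a `target` toxicity score; the regular expressions here are illustrative normalization rules, not necessarily the exact ones used.

```python
import re

import pandas as pd

def clean_text(text: str) -> str:
    """Normalize a raw comment before tokenization."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"[^a-z0-9'!?.,\s]", " ", text)  # keep letters, digits, basic punctuation
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

train = pd.read_csv("train.csv")  # Kaggle training file
train["comment_text"] = train["comment_text"].astype(str).map(clean_text)
labels = (train["target"] >= 0.5).astype(int)  # binarize the toxicity score
```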
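
For the embedding step, a sketch of loading pretrained GloVe vectors into a lookup matrix for the Bi-LSTM, continuing from the cleaning sketch above; the vocabulary cap, sequence length, and file path are assumed hyperparameters, and word2vec vectors can be substituted the same way.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS = 100_000  # vocabulary cap (assumed)
MAX_LEN = 220        # padded sequence length (assumed)
EMBED_DIM = 300      # dimensionality of glove.840B.300d vectors

texts = train["comment_text"]  # cleaned comments from the sketch above

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

# Map each in-vocabulary word to its pretrained vector; unseen words stay zero.
embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM), dtype="float32")
with open("glove.840B.300d.txt", encoding="utf-8") as f:  # assumed local path
    for line in f:
        word, *vec = line.rstrip().split(" ")
        if len(vec) != EMBED_DIM:  # skip malformed multi-token entries
            continue
        idx = tokenizer.word_index.get(word)
        if idx is not None and idx < MAX_WORDS:
            embedding_matrix[idx] = np.asarray(vec, dtype="float32")
```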
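
And for the final blend, a sketch of a Bi-LSTM over the frozen GloVe embeddings whose sigmoid outputs are averaged with probabilities from a separately fine-tuned BERT classifier; the layer sizes and the 0.7/0.3 blend weight are assumptions that would in practice be tuned on a validation split.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_bilstm(embedding_matrix: np.ndarray, max_len: int = 220) -> models.Model:
    """Bidirectional LSTM over frozen pretrained embeddings."""
    inp = layers.Input(shape=(max_len,))
    x = layers.Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=False,  # keep GloVe vectors fixed
    )(inp)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.GlobalMaxPooling1D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

bilstm = build_bilstm(embedding_matrix, MAX_LEN)
bilstm.fit(sequences, labels, batch_size=512, epochs=2)

# Blend with probabilities from the fine-tuned BERT classifier
# (bert_probs is assumed to come from that separate pipeline).
bilstm_probs = bilstm.predict(sequences).ravel()
W_BERT = 0.7  # assumed blend weight, tuned on held-out data
blended = W_BERT * bert_probs + (1 - W_BERT) * bilstm_probs
```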