A comprehensive guide to Word2Vec, a prediction-based word embedding technique developed by Tomas Mikolov's team at Google. The explanation begins with the drawbacks of earlier word representations, such as one-hot vectors and count-based embeddings. Word vectors produced by prediction-based embeddings have interesting properties: they can capture the semantic meaning of words. We therefore take a deep dive into Word2Vec, covering its architecture, training optimization methods, and hyperparameter tuning. We also build Word2Vec embeddings from scraped Wikipedia articles with the help of gensim and visualize them using elang, a Python package developed by Samuel Chan and me. In addition, we present some non-NLP use cases and further developments of Word2Vec. At the end of this post, there are five multiple-choice questions to test your understanding.
Apr 24, 2020
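
As a preview of the hands-on part, the sketch below trains a toy Word2Vec model with gensim and queries similar words. The tiny in-line corpus and the hyperparameter values are illustrative assumptions, not the post's actual Wikipedia data, and the keyword names follow the gensim 4.x API.

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; in the post, these would come from
# scraped Wikipedia articles rather than this toy corpus.
sentences = [
    ["word2vec", "learns", "dense", "word", "vectors"],
    ["dense", "vectors", "capture", "semantic", "meaning"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors (older gensim versions use `size`)
    window=5,         # context window size
    min_count=1,      # ignore words rarer than this (set low only for the toy corpus)
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

# Query the most similar words to a given token.
print(model.wv.most_similar("word2vec", topn=3))
```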