Feature Engineering
One-hot encoding is one approach to feature encoding. It converts a categorical variable that takes multiple values into a set of binary variables (0 or 1), one for each category. A single column therefore becomes several new columns.
As an example, let's assume a column of your dataset contains a feature called colour. Some of the values of this feature are blue, green and red. One-hot encoding creates a new binary variable for each colour, i.e. colour_blue (0 or 1), colour_green (0 or 1) and colour_red (0 or 1). For any given row, only the variable matching that row's colour is set to 1.
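To make the colour example concrete, here is a minimal sketch in plain Python, with no libraries. The list of colours is made up for illustration, and this is only one way to express the idea, not the method used in the rest of this walkthrough.

```python
# Hypothetical values of the 'colour' feature (illustrative only).
colours = ["blue", "green", "red", "green"]

# One new binary variable per distinct category.
categories = sorted(set(colours))

# For each row, set the matching category to 1 and all others to 0.
rows = [
    {f"colour_{cat}": int(value == cat) for cat in categories}
    for value in colours
]

for value, row in zip(colours, rows):
    print(value, row)
```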
Let's create a small dataset with just five features.
We want to encode only the string categorical columns, so we first filter for them.
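The original dataset is not shown here, so the following is a stand-in sketch. The column names (colour, size, city, price, quantity) and all values are assumptions chosen for illustration: three string categoricals and two numeric features.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are illustrative only.
df = pd.DataFrame({
    "colour": ["blue", "green", "red", "green", "blue"],
    "size": ["S", "M", "L", "M", "S"],
    "city": ["Lagos", "Accra", "Lagos", "Nairobi", "Accra"],
    "price": [10.5, 20.0, 15.25, 8.0, 12.0],
    "quantity": [3, 1, 4, 2, 5],
})

print(df)
```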
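One possible way to do the filtering is to keep every column that is not numeric. The sketch below assumes a hypothetical toy DataFrame (its column names and values are made up); the list name one_hot_encode_cols follows the name used later in this walkthrough.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are illustrative only.
df = pd.DataFrame({
    "colour": ["blue", "green", "red", "green", "blue"],
    "size": ["S", "M", "L", "M", "S"],
    "city": ["Lagos", "Accra", "Lagos", "Nairobi", "Accra"],
    "price": [10.5, 20.0, 15.25, 8.0, 12.0],
    "quantity": [3, 1, 4, 2, 5],
})

# Keep the columns that are not numeric, i.e. the string categoricals.
one_hot_encode_cols = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]

print(one_hot_encode_cols)
```

Checking for non-numeric columns, rather than for a specific string dtype, is a choice made here so the sketch does not depend on how a given pandas version stores strings.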
Pandas comes with a built-in 'get_dummies' function which is easy to use for this type of encoding.
Notice that a new variable has been created for every category in each of the string categorical columns, and each has been assigned a binary value (0 or 1).
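A minimal sketch of the get_dummies approach, again on a hypothetical toy DataFrame with made-up column names and values. The dtype=int argument is an addition made here so the output shows 0 and 1; recent pandas versions otherwise return True and False.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are illustrative only.
df = pd.DataFrame({
    "colour": ["blue", "green", "red", "green", "blue"],
    "size": ["S", "M", "L", "M", "S"],
    "city": ["Lagos", "Accra", "Lagos", "Nairobi", "Accra"],
    "price": [10.5, 20.0, 15.25, 8.0, 12.0],
    "quantity": [3, 1, 4, 2, 5],
})
one_hot_encode_cols = ["colour", "size", "city"]

# Replace each listed column with one binary column per category.
encoded = pd.get_dummies(df, columns=one_hot_encode_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The numeric columns are left untouched, and each new column is named after its source column and category, for example colour_blue.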
Alternatively, we can use the OneHotEncoder and LabelEncoder classes from scikit-learn.
Write a for loop that performs the encoding on each of the columns in one_hot_encode_cols. Start the loop by 'integer encoding' the string categories using the label encoder.
Since the encoding generates new columns, we don't need the original column anymore. Drop the original column (still inside the loop).
Next, perform one-hot encoding on the integer encoded data.
Create new column names for the generated binary variables.
Construct a new DataFrame from the sparse array.
Finally, attach the new DataFrame to the dataset.
Your final code block should look like this:
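The original code block is not reproduced here, so what follows is a sketch that follows the steps above rather than the author's exact code. The toy DataFrame, its column names, and the variable names inside the loop are assumptions. Note that recent scikit-learn versions of OneHotEncoder also accept string columns directly, so the LabelEncoder step is kept only to mirror the steps described.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy dataset; names and values are illustrative only.
df = pd.DataFrame({
    "colour": ["blue", "green", "red", "green", "blue"],
    "size": ["S", "M", "L", "M", "S"],
    "city": ["Lagos", "Accra", "Lagos", "Nairobi", "Accra"],
    "price": [10.5, 20.0, 15.25, 8.0, 12.0],
    "quantity": [3, 1, 4, 2, 5],
})
one_hot_encode_cols = ["colour", "size", "city"]

le = LabelEncoder()
ohe = OneHotEncoder()

for col in one_hot_encode_cols:
    # Integer-encode the string categories (0, 1, 2, ...).
    int_encoded = le.fit_transform(df[col])

    # The original column is no longer needed.
    df = df.drop(columns=col)

    # One-hot encode the integers; the encoder expects a 2-D input
    # and returns a sparse matrix by default.
    sparse = ohe.fit_transform(int_encoded.reshape(-1, 1))

    # Name each binary column after its source column and category.
    new_cols = [f"{col}_{cat}" for cat in le.classes_]

    # Build a DataFrame from the sparse matrix.
    new_df = pd.DataFrame(
        sparse.toarray().astype(int), columns=new_cols, index=df.index
    )

    # Attach the new columns to the dataset.
    df = pd.concat([df, new_df], axis=1)
```

With this toy data, df ends up with the two numeric columns followed by nine binary columns, three for each encoded feature.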
Print the result of your encoding.
On the plus side, one-hot encoding is fast and easy to implement, as this short demo shows.
On the downside, it is only practical for a limited number of distinct values. Each category adds a new column, so when a feature has a large number of unique values, other encoding alternatives are usually considered.
Since the pros far outweigh the cons, it is still widely used today.
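To illustrate how the number of columns grows with the number of distinct values, here is a small sketch. The user_id column and its 1000 unique values are invented for the example.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1000 rows, all values distinct.
ids = pd.DataFrame({"user_id": [f"user_{i}" for i in range(1000)]})

# One new column is created per distinct value.
encoded = pd.get_dummies(ids, columns=["user_id"])

print(ids.shape)      # (1000, 1)
print(encoded.shape)  # (1000, 1000)
```

A single column has turned into a thousand, almost all of them zeros, which is why features like IDs or postcodes are usually handled with a different encoding.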