Categorical vs continuous feature selection/engineering
I'm working with a dataset that has a number of potential predictors, such as:
Age: continuous
Number of children: discrete and numerical
Marital Situation: categorical (Married/Single/Divorced, ...)
Id_User: categorical (the ID of the user who conducted the first interview with this person)
There are more predictors, but for the sake of brevity these four are enough to ask my question.
Question: Continuous features are easy to deal with: normalize them and feed them to the model. But what about categorical features whose categories are independent of one another (unordered)?
Note: I understand that categorical features that follow a certain pattern can be encoded as integers and fed to the model. But what if the integers carry no meaning (1 for single, 2 for married, 3 for divorced)? For a model that treats the feature as a quantitative predictor, such an encoding doesn't make sense.
Are there ways to deal with these different types of features?
machine-learning feature-selection feature-engineering
asked Apr 12 at 10:17 by Blenzus (edited Apr 12 at 10:57)
5 Answers
There are a number of ways of handling categorical data, but the approach I have seen so far is to map the categories to integers and then one-hot encode those integers before feeding them into a neural network.
If you are working with Keras, you can use the to_categorical function to transform your mappings accordingly.
```
>>> from keras.utils import to_categorical
>>> y = [0, 1, 0, 1, 1]
>>> oh_y = to_categorical(y, num_classes=2)
>>> print(oh_y)
[[1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]
```
Thanks for the answer. No, I'm actually using "regular" classification algorithms, but yes, I've used the to_categorical method while testing an ANN.
– Blenzus yesterday
What you are looking for are called dummy variables. They convert your categorical data into a matrix of indicator columns, one per category, where a column is 1 if the person belongs to that category and 0 otherwise.
The Id_User variable should not be converted, because you don't want your model to overfit to the ID data (that is, you don't want the model to memorize the result for every ID; you want it to generalize).
import pandas as pd
# Assumes `dataset` is a pandas DataFrame; its string/categorical columns become 0/1 indicator columns
dataset2 = pd.get_dummies(dataset)
answered Apr 12 at 12:19 by Juan Esteban de la Calle (edited Apr 13 at 13:12 by Stephen Rauch)
For encoding categorical features, there are two common approaches:
Ordinal encoder
This is the approach you described as 'encoded as integers'. Each category is assigned an integer, starting from 0. The problem is that this imposes an arbitrary order on the categories. So when there is no inherent order among the categories, the encoding is meaningless, as you mentioned. It only works when assigning a larger integer to some categories is meaningful.
One-hot encoder
This method builds a feature vector (a one-hot vector) for each categorical feature, with as many components as there are categories. Each component corresponds to one category. For each data sample, the component whose category is present in the sample is set to 1 and all other components are set to 0. The benefit is that, unlike the ordinal encoder, it does not impose any order on the categories.
So in your case, I highly recommend that you use a one-hot encoder.
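To make the contrast concrete, here is a minimal pure-Python sketch of both encodings. The category list, the sample values and the helper name `one_hot` are assumptions chosen for illustration; libraries such as scikit-learn provide `OrdinalEncoder` and `OneHotEncoder` for real use.

```python
# Categories from the question; the order here is arbitrary.
categories = ["Married", "Single", "Divorced"]

# Ordinal encoding: one integer per category, starting from 0.
ordinal = {name: index for index, name in enumerate(categories)}

def one_hot(value):
    """Return a vector with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if name == value else 0 for name in categories]

# Hypothetical samples to encode.
samples = ["Single", "Divorced", "Married"]
ordinal_encoded = [ordinal[s] for s in samples]
one_hot_encoded = [one_hot(s) for s in samples]
```

With the ordinal codes a model may read Divorced (2) as "larger" than Single (1), whereas the one-hot vectors carry no such ordering.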
answered Apr 12 at 11:29 by pythinker
The number of columns one-hot encoding adds to the dataset doesn't affect the outcome of my model in any way, right?
– Blenzus Apr 12 at 13:46
Are you afraid of overfitting?
– pythinker Apr 12 at 13:51
Aren't we all?! I'm under the impression that, on the contrary, this method doesn't 'encourage' overfitting.
– Blenzus Apr 12 at 13:53
Yes, you are right, we are all afraid of overfitting. What I meant is that when we increase the number of inputs, the model has to learn more weights to map these inputs to outputs. So I should say that it does somewhat affect the outcome of your model, but it's not a serious concern.
– pythinker Apr 12 at 13:59
I believe that in the context of machine learning, "dummy variable" is more commonly used for what you are referring to as "one-hot".
– Acccumulation Apr 12 at 15:21
One possibility for dealing with categorical inputs is to introduce the category input vector $\boldsymbol{t}$. The category input vector of the $n^\text{th}$ observation is given by
$\boldsymbol{t}_n=[t_{1n}, t_{2n},\ldots,t_{Kn}],$ in which $K$ is the number of categories. If the continuous input vector $\boldsymbol{x}_n$ belongs to category $k$, then $t_{in}=1$ for $i=k$ and $t_{in}=0$ for $i\neq k$.
This type of encoding is called one-hot encoding for classification.
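As a small worked instance of the definition above, assume (purely for illustration) $K=3$ marital-situation categories ordered Married, Single, Divorced, and an $n^\text{th}$ observation that is Single, so $k=2$:

```latex
\boldsymbol{t}_n = [t_{1n},\, t_{2n},\, t_{3n}] = [0,\, 1,\, 0]
```

Only the component with $i=k$ is 1, and the components always sum to 1.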
answered Apr 12 at 11:02 by MachineLearner
I have a lot of possible values (100+) in, let's say, Id_User. Wouldn't that add 100 additional columns to my dataset?
– Blenzus Apr 12 at 11:08
@Blenzus: Yes, you are right, but the columns are sparse. Keep in mind that having so many categories is only feasible if you have enough data for your data set to be representative.
– MachineLearner Apr 12 at 12:49
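The sparsity mentioned in the comment above can be illustrated with a rough sketch: each one-hot row has a single non-zero entry, so it is enough to store that entry's column index and expand it only when needed. The ID values and the helper `to_dense` are hypothetical, and this is not how any particular library stores sparse data.

```python
# Hypothetical interviewer IDs; in the question Id_User has 100+ distinct values.
id_users = ["u017", "u003", "u101", "u003"]

# Map each distinct ID to a column index (sorted for a deterministic layout).
column_of = {uid: col for col, uid in enumerate(sorted(set(id_users)))}

# Sparse form: one integer per row instead of one column per category.
sparse_rows = [column_of[uid] for uid in id_users]

def to_dense(col_index, n_columns):
    """Expand a stored column index into a full one-hot row."""
    return [1 if col == col_index else 0 for col in range(n_columns)]

dense_rows = [to_dense(c, len(column_of)) for c in sparse_rows]
```

The storage cost of the sparse form grows with the number of rows, not with the number of distinct IDs.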
1
$begingroup$
I have a lot of possible values 100+ in let's say Id_User, wouldn't that add 100 additional columns to my dataset?
$endgroup$
– Blenzus
Apr 12 at 11:08
1
$begingroup$
@Blenzus: Yes you are right, but the columns are sparse. You have to remember that having so many categories is only feasible if you have a lot of data such that your data set is representative.
$endgroup$
– MachineLearner
Apr 12 at 12:49
1
1
$begingroup$
I have a lot of possible values 100+ in let's say Id_User, wouldn't that add 100 additional columns to my dataset?
$endgroup$
– Blenzus
Apr 12 at 11:08
$begingroup$
I have a lot of possible values 100+ in let's say Id_User, wouldn't that add 100 additional columns to my dataset?
$endgroup$
– Blenzus
Apr 12 at 11:08
1
1
$begingroup$
@Blenzus: Yes you are right, but the columns are sparse. You have to remember that having so many categories is only feasible if you have a lot of data such that your data set is representative.
$endgroup$
– MachineLearner
Apr 12 at 12:49
$begingroup$
@Blenzus: Yes you are right, but the columns are sparse. You have to remember that having so many categories is only feasible if you have a lot of data such that your data set is representative.
$endgroup$
– MachineLearner
Apr 12 at 12:49
add a comment |
$begingroup$
As others have said, dummy variables is one method. Another method is to take quantitative statistics from the populations having that property. For instance, you can create a "marital situation average" column, and populate it with the average value of the target variable among people with the same marital situation as that subject.
If you are using a tree method, simply assigning integers to each category will approximate dummy variables, especially if there are only a few categories. For instance, if the only categories for marital situation are Married, Single, Divorced, and Widowed, and you assign them 0, 1, 2, 3 respectively, then the only possible splits are Married vs. everything else, Widowed vs. everything else, or Married/Single vs. Divorced/Widowed. So two thirds of the splits are effectively dummy variables, and the remaining one turns into a dummy variable as soon as the tree splits on that variable again.
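The split counting above can be checked with a short enumeration. This is only a sketch, assuming the tree splits a numeric feature with a rule of the form "code <= threshold"; the names `codes` and `splits` are illustrative.

```python
# Integer codes from the example above.
codes = {"Married": 0, "Single": 1, "Divorced": 2, "Widowed": 3}

# Assumption: the tree can only cut the ordered codes at a threshold,
# so every split separates a prefix of the ordering from the rest.
ordered = sorted(codes, key=codes.get)
splits = []
for i in range(1, len(ordered)):
    left, right = ordered[:i], ordered[i:]
    # A split behaves like a dummy variable when one side is a single category.
    one_vs_rest = len(left) == 1 or len(right) == 1
    splits.append((left, right, one_vs_rest))

for left, right, one_vs_rest in splits:
    print(left, "vs", right, "-> one-vs-rest" if one_vs_rest else "-> grouped")
```

With four categories there are three candidate thresholds, and two of them isolate a single category.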
$endgroup$
answered Apr 12 at 15:36
Acccumulation
$begingroup$
There are a number of ways to handle categorical data, but the one I have seen most often is to map each category to an integer and then one-hot encode those integers before feeding them into the neural network.
If you are working with Keras, you can use the to_categorical function to one-hot encode the mapped integers.
```
>>> from keras.utils import to_categorical
>>> y = [0, 1, 0, 1, 1]
>>> oh_y = to_categorical(y, num_classes=2)
>>> print(oh_y)
[[1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]
```
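The snippet above starts from integers, so the mapping step itself is not shown. A plain-Python sketch of both steps, without Keras, could look like the following; the labels and the names `labels`, `mapping`, and `one_hot` are hypothetical and chosen only for illustration.

```python
# Hypothetical raw labels; the values are illustrative only.
labels = ["Single", "Married", "Single", "Married", "Married"]

# Step 1: numeric mapping, one integer per distinct category
# (sorted so the assignment is deterministic).
mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
y = [mapping[label] for label in labels]

# Step 2: one-hot encode each integer as a row containing a single 1.
num_classes = len(mapping)
one_hot = [
    [1 if column == code else 0 for column in range(num_classes)]
    for code in y
]

print(mapping)
print(y)
print(one_hot)
```

The list `y` produced in step 1 has the same shape of input that to_categorical expects, so step 2 could be swapped for the Keras call.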
$endgroup$
answered Apr 13 at 9:27
thanatoz
$begingroup$
Thanks for the answer. No, actually I'm using "regular" classification algorithms, but yes, I've used the to_categorical method while testing an ANN.
$endgroup$
– Blenzus
yesterday