Imbalanced dataset binary classification

I am new to ML and DS, and for an assignment I have a binary classification dataset with a 9:1 class imbalance. Could you please guide me on how to approach it? Also, which classifier is best for imbalanced binary classification?

Regards.







machine-learning classification binary-data unbalanced-classes






asked Apr 8 at 10:31 by Sid_Mirza




  • Related: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
    – Stephan Kolassa
    Apr 8 at 19:10
















1 Answer
You got off on the wrong foot by conceptualizing this as a classification problem. The fact that $Y$ is binary has nothing to do with trying to make classifications. And when the balance of $Y$ is far from 1:1 you need to think about modeling tendencies for $Y$, not modeling $Y$. In other words, the appropriate task is to estimate $P(Y=1 | X)$ using a model such as the binary logistic regression model. The logistic model is a direct probability estimator. Details may be found here and here.
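To make "estimate $P(Y=1 | X)$" concrete, here is a minimal sketch of a one-predictor logistic model fitted by plain gradient descent. Everything in it is an assumption made for illustration: the synthetic data, the true coefficients (-2.5 and 1.0, chosen so that roughly one case in ten is positive), and the hand-written fitting loop, which stands in for whatever library routine you would normally use.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=1.0, epochs=500):
    """Fit intercept b and slope w by gradient descent on the mean log-loss."""
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_b = 0.0
        grad_w = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b + w * x) - y  # derivative of the log-loss w.r.t. the linear predictor
            grad_b += err
            grad_w += err * x
        b -= lr * grad_b / n
        w -= lr * grad_w / n
    return b, w


# Hypothetical synthetic data: true P(Y=1 | x) = sigmoid(-2.5 + 1.0 * x).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ys = [1 if rng.random() < sigmoid(-2.5 + 1.0 * x) else 0 for x in xs]

b_hat, w_hat = fit_logistic(xs, ys)
prevalence = sum(ys) / len(ys)
mean_pred = sum(sigmoid(b_hat + w_hat * x) for x in xs) / len(xs)

print(f"prevalence = {prevalence:.3f}, mean predicted probability = {mean_pred:.3f}")
print(f"estimated P(Y=1 | x=2) = {sigmoid(b_hat + w_hat * 2.0):.3f}")
```

Note that no resampling or reweighting is applied: at the maximum-likelihood fit of a logistic model with an intercept, the average predicted probability equals the observed prevalence, so the imbalance is carried by the intercept rather than treated as a defect in the data.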



Once you have a validated probability model and a utility/cost/loss function you can generate optimum decisions. The probabilities help to trade off the consequences of wrong decisions.
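As a sketch of how a probability and a cost function combine into a decision, the snippet below compares the expected cost of acting (treating the case as $Y=1$) with the expected cost of not acting. The function names and the example costs (1 for a false positive, 9 for a false negative) are hypothetical, and correct decisions are assumed to cost nothing.

```python
def expected_costs(p, cost_fp, cost_fn):
    """Return (expected cost of acting, expected cost of not acting) for p = P(Y=1 | X)."""
    cost_if_act = (1.0 - p) * cost_fp   # acting is wrong only when Y is actually 0
    cost_if_pass = p * cost_fn          # not acting is wrong only when Y is actually 1
    return cost_if_act, cost_if_pass


def decide(p, cost_fp, cost_fn):
    """Act exactly when acting has the lower expected cost."""
    cost_if_act, cost_if_pass = expected_costs(p, cost_fp, cost_fn)
    return cost_if_act < cost_if_pass


def threshold(cost_fp, cost_fn):
    """Probability above which acting is cheaper: cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical costs: a missed positive is nine times as costly as a false alarm.
for p in (0.05, 0.2, 0.5):
    print(p, decide(p, cost_fp=1.0, cost_fn=9.0))
print("threshold:", threshold(1.0, 9.0))
```

The cutoff comes from the costs, not from the class balance and not from a default of 0.5; with different costs the same probability model yields different decisions without being refitted.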







answered Apr 8 at 11:59 by Frank Harrell











  • Thanks, Frank Harrell. The features are floating-point values, but the target is binary, the 'Y' you mention. I applied linear regression, random forests, a decision tree, and some ensemble methods; linear regression gave an AUC of 78.2%, whereas random forests and LightGBM performed better. Now I want to increase the AUC. Here is the list of parameters I used for LightGBM:
    – Sid_Mirza
    Apr 8 at 17:18











  • params = {"objective": "binary", "metric": "auc", "boosting": "gbdt", "max_depth": -1, "num_leaves": 13, "learning_rate": 0.01, "bagging_freq": 5, "bagging_fraction": 0.4, "feature_fraction": 0.05, "min_data_in_leaf": 80, "min_sum_hessian_in_leaf": 10, "tree_learner": "serial", "boost_from_average": "false", "bagging_seed": random_state, "verbosity": 1, "seed": random_state}
    – Sid_Mirza
    Apr 8 at 17:21










  • Is the binary target derived from a continuous (floating-point) outcome variable? If so, you will need to go back to that variable rather than use an information-losing dichotomization.
    – Frank Harrell
    2 days ago










  • Yes, all 100+ attributes have continuous values, and on that basis we have to classify the target in binary form, either yes or no.
    – Sid_Mirza
    2 days ago










  • I assume by that you mean that the target originated as binary in its rawest form. You are still trying to cast the problem inappropriately as classification. You cannot do anything but estimate tendencies, nor should you. Once you have probability estimates you can make optimum decisions given the loss function.
    – Frank Harrell
    yesterday
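The AUC quoted in the comments above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way; the scores and labels are made-up illustrative values, and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and observed outcomes.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.80]
labels = [0, 0, 0, 1, 0, 1]

print("AUC:", auc(scores, labels))
# Squaring the scores preserves their order, so the AUC does not change.
print("AUC after squaring scores:", auc([s * s for s in scores], labels))
```

Because only the ordering of the scores enters the calculation, a high AUC by itself does not show that the predicted probabilities are accurate enough for the cost-based decisions described in the answer; that is an observation about this sketch, not a claim made in the thread.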















