How much of data wrangling is a data scientist's job?
I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.
Data pipelines are non-existent within the company; the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".
Initial data analysis for even the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time is spent trying to parse these files and cross data sources just to build viable datasets. This is not a matter of simply handling missing data or preprocessing it; it's the work it takes to build data that can be handled in the first place (something a DBA or data engineer could solve, rather than a data scientist?).
1) Feels like most of the work is not related to data science at all. Is this accurate?
2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?
3) Is this type of setup common for a company with serious data science needs?
Tags: data-wrangling
Did you specify which format you want the information in? And did you give them instructions on how they can do this with their ERP?
– jonnor
Apr 3 at 19:57
@jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years, though.
– Victor Valente
Apr 3 at 20:12
Start tracking your hours and convert that into a cost figure for how much of your time they waste by making you convert the TXT dumps back into a usable format. I'll bet that once they have a dollar figure, they'll get it done.
– Nelson
2 days ago
If it is a burden on your time you could outsource it.
– Sarcoma
2 days ago
I find it confusing that a company would hire a data scientist and still be resistant to change. You should show them the amount of wasted time and the danger of keeping data in long TXT files without real security around it.
– Pedro Henrique Monforte
11 hours ago
9 Answers
Feels like most of the work is not related to data science at all. Is this accurate?
Yes
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
You're not wrong, but such are the realities of real life.
Is this type of setup common for a company with serious data science needs?
Yes
From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another at reading certain data. E.g. R's readxl is orders of magnitude faster than Python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (Parquet, SQL, etc.). I know you're not working with xlsx files and I have no idea if you use Python - it was just an example.
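To make the "import once, then cache in a friendly format" idea concrete, here is a minimal Python/pandas sketch (my illustration, not part of the original answer); the file pattern, delimiter, and column names are made-up examples, and writing Parquet assumes pyarrow or fastparquet is installed:

```python
import glob
import pandas as pd

# Hypothetical layout: pipe-delimited TXT dumps of transactions.
# Parse them once, then cache the result as Parquet so later analyses
# start from a fast, typed format instead of re-parsing gigabytes of text.
frames = []
for path in sorted(glob.glob("dumps/transactions_*.txt")):
    frames.append(
        pd.read_csv(
            path,
            sep="|",                      # whatever delimiter the dumps actually use
            dtype={"store_id": str},      # pin fragile columns explicitly
            parse_dates=["timestamp"],
            na_values=["", "-", "NULL"],
        )
    )

transactions = pd.concat(frames, ignore_index=True)
transactions.to_parquet("cache/transactions.parquet", index=False)

# Subsequent runs load the cached copy in seconds:
# transactions = pd.read_parquet("cache/transactions.parquet")
```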
From a practical standpoint, two things:
- First of all, understand what is technically possible. In many cases, the people telling you no are IT-illiterate people who worry about business or compliance considerations, but have no concept of what is and isn't feasible from an IT standpoint. Try to speak to the DBAs or to whoever manages the data infrastructure. Understand what is technically possible. THEN, only then, try to find a compromise. E.g. they won't give you access to their system, but I presume there is a database behind it? Maybe they can extract the data to some other formats? Maybe they can extract the SQL statements that define the data types, etc.?
- Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...
Excellent point about finding / building an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or to something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
– Jason
2 days ago
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one-off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
– PythonGuest
2 days ago
This is a situation that many blogs, companies, and papers acknowledge as very common.
In the paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it:
data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data.
Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights.
Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file on which you can start data exploration after just a little bit of cleaning; you find the data in a format that is not suitable for your needs.
What you can do is make use of the old data as much as you can, and try to adapt how new data is stored into a process that will be easier for you (or a future colleague) to work with.
Forbes article claiming the same 80% figure.
– Jesse Amano
Apr 3 at 19:08
Forbes should nowhere be mentioned together with the words "data science".
– gented
Apr 3 at 22:52
50-80% based on (quote) "interviews and expert estimates"
– oW_
Apr 3 at 23:33
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
– Keeta
2 days ago
Feels like most of the work is not related to data science at all. Is this accurate?
This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
The results of the paper reflect my experience as well: the vast majority of time is spent acquiring, cleaning, and processing data.
As another recent starter in data science, I can only add that I don't think your experience is unique: my team of about 10 apparently hasn't done any DS in over a year (save one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team has been working on, which still just isn't quite delivering the data. Apparently retention has been fairly poor in the past, and there's a continuous promise of a holy-grail MS Azure environment for future DS projects.
So to answer:
1) Yes, totally accurate.
2) No, you're correct, but it's an uphill battle to get access to the data you want (if it even exists).
3) I'm sure there are companies out there that are better than others. If you can't stand it at your current company, 2 years is a decent length of time, so start looking for brighter things (be careful how you phrase your desire to leave your current job; something like "looking to work with a more dynamic team" sounds better than "my old company won't give me data").
Feels like most of the work is not related to data science at all. Is this accurate?
Wrangling data is most definitely in the data scientist job description. At some level you have to understand the data-generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster and more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has its data in better order.
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).
Is this type of setup common for a company with serious data science needs?
Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.
If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.
You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.
Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job -- is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?
Perhaps to put it simply:
- When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?
- When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?
You need to work with and understand your data - which includes everything from simple stuff like fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from being collected to being displayed. Processing it requires knowing the same pieces of information, so it is partially work you would have had to do anyway.
Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when designing your wrangling code is also a good idea - keeping an intermediate dataset of processed data would, I think, be useful if you're allowed to (and can't do it in MySQL).
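As a rough sketch of that idea (my addition, not the original answer's), here is what normalising the usual inconsistencies and keeping an intermediate, queryable copy could look like in Python; SQLite stands in for the suggested MySQL instance, and all file and column names are hypothetical:

```python
import sqlite3
import pandas as pd

# Read everything as text first; the raw dump's types are not trustworthy.
raw = pd.read_csv("dumps/sales_2019.txt", sep="|", dtype=str)

# Fix the usual null look-alikes before doing any analysis.
raw = raw.replace({"": None, "-": None, "NULL": None, "N/A": None})

# Cast columns only after the junk values are gone.
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
raw["sold_at"] = pd.to_datetime(raw["sold_at"], errors="coerce")

# Keep an intermediate, queryable copy so the parsing work is done only once.
with sqlite3.connect("intermediate.db") as conn:
    raw.to_sql("sales_clean", conn, if_exists="replace", index=False)

# Later analyses can start from SQL instead of the raw TXT:
# pd.read_sql("SELECT * FROM sales_clean WHERE amount > 100", sqlite3.connect("intermediate.db"))
```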
But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.
1) Feels like most of the work is not related to data science at all. Is this accurate?
In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of data wrangling a data scientist should be expected to do. It depends on the organization's bandwidth and on the person's interest in doing such work. In my 15 to 16 years of experience as a DS, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis. So take your call.
2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
Again, it depends on the organization's security policies. They cannot leave everything to you, and they have their own security concerns about revealing the data to someone who is a temporary employee (sorry to use those words :-().
3) Is this type of setup common for a company with serious data science needs?
I feel these kinds of companies need the most attention from data scientists, to make them see that data-driven modeling is the future of sustaining their business. :-)
I have given my input from a business perspective rather than a technical standpoint. :-)
Hope I am clear in my choice of words.
In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)
He says that there are a number of open problems in this area:
- Ingest
- Transform (e.g. euro/dollar)
- Clean (e.g. -99/Null)
- Schema mapping (e.g. wages/salary)
- Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)
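To make a few of those steps concrete, here is a minimal pandas sketch (my illustration, not Stonebraker's) of Clean, Transform, Schema mapping, and a very crude Entity consolidation; the column names, exchange rate, and spelling variants are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "wages": [50000, -99, 61000],      # -99 is used as a missing-value sentinel
    "currency": ["EUR", "USD", "EUR"],
    "author": ["Mike Stonebraker", "Michael Stonebreaker", "M. Stonebraker"],
})

# Clean: turn sentinel values into proper nulls.
df["wages"] = df["wages"].mask(df["wages"] == -99)

# Transform: convert euro amounts to dollars with an assumed exchange rate.
EUR_TO_USD = 1.12
eur = df["currency"] == "EUR"
df.loc[eur, "wages"] = df.loc[eur, "wages"] * EUR_TO_USD
df["currency"] = "USD"

# Schema mapping: this source says "wages", the target schema says "salary".
df = df.rename(columns={"wages": "salary"})

# Entity consolidation (very crude): map known spelling variants to one canonical name.
canonical = {
    "Mike Stonebraker": "Michael Stonebraker",
    "Michael Stonebreaker": "Michael Stonebraker",
    "M. Stonebraker": "Michael Stonebraker",
}
df["author"] = df["author"].replace(canonical)

print(df)
```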
There are a number of companies/products working to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata, and Google Refine.
Until this area matures, a lot of the data scientist job will indeed be data wrangling.
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Victor Valente is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48531%2fhow-much-of-data-wrangling-is-a-data-scientists-job%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
9 Answers
9
active
oldest
votes
9 Answers
9
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
Yes
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
You're not wrong, but such are the realities of real life.
Is this type of setup common for a company with serious data science needs?
Yes
From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.
From a practical standpoint, two things:
- first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc? - business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...
New contributor
$endgroup$
1
$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
2 days ago
2
$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
2 days ago
add a comment |
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
Yes
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
You're not wrong, but such are the realities of real life.
Is this type of setup common for a company with serious data science needs?
Yes
From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.
From a practical standpoint, two things:
- first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc? - business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...
New contributor
$endgroup$
1
$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
2 days ago
2
$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
2 days ago
add a comment |
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
Yes
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
You're not wrong, but such are the realities of real life.
Is this type of setup common for a company with serious data science needs?
Yes
From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.
From a practical standpoint, two things:
- first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc? - business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...
New contributor
$endgroup$
Feels like most of the work is not related to data science at all. Is this accurate?
Yes
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
You're not wrong, but such are the realities of real life.
Is this type of setup common for a company with serious data science needs?
Yes
From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another to read certain data. E.g. R's readxl is orders of mangnitudes faster than python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.
From a practical standpoint, two things:
- first of all, understand what is technically possible. In many cases,
the people telling you no are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc? - business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...
New contributor
edited 2 days ago
Stephen Rauch
1,52551330
1,52551330
New contributor
answered 2 days ago
PythonGuestPythonGuest
2062
2062
New contributor
New contributor
1
$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
2 days ago
2
$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
2 days ago
add a comment |
1
$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
2 days ago
2
$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
2 days ago
1
1
$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
2 days ago
$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
2 days ago
2
2
$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
2 days ago
$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
2 days ago
add a comment |
$begingroup$
This is a situation that many blogs, companies and papers acknowledge as something real in many cases.
In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it
data scientists spend from 50 percent to 80 percent of their time
collecting and preparing unruly digital data.
Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.
What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.
$endgroup$
$begingroup$
Forbes article claiming the same 80% figure.
$endgroup$
– Jesse Amano
Apr 3 at 19:08
4
$begingroup$
Forbes should nowhere be mentioned together with the words "data science".
$endgroup$
– gented
Apr 3 at 22:52
$begingroup$
50-80% based on (quote) "interviews and expert estimates"
$endgroup$
– oW_
Apr 3 at 23:33
3
$begingroup$
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
$endgroup$
– Keeta
2 days ago
add a comment |
$begingroup$
This is a situation that many blogs, companies and papers acknowledge as something real in many cases.
In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it
data scientists spend from 50 percent to 80 percent of their time
collecting and preparing unruly digital data.
Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.
What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.
$endgroup$
$begingroup$
Forbes article claiming the same 80% figure.
$endgroup$
– Jesse Amano
Apr 3 at 19:08
4
$begingroup$
Forbes should nowhere be mentioned together with the words "data science".
$endgroup$
– gented
Apr 3 at 22:52
$begingroup$
50-80% based on (quote) "interviews and expert estimates"
$endgroup$
– oW_
Apr 3 at 23:33
3
$begingroup$
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
$endgroup$
– Keeta
2 days ago
add a comment |
$begingroup$
This is a situation that many blogs, companies and papers acknowledge as something real in many cases.
In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it
data scientists spend from 50 percent to 80 percent of their time
collecting and preparing unruly digital data.
Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.
What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.
$endgroup$
This is a situation that many blogs, companies and papers acknowledge as something real in many cases.
In this paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it
data scientists spend from 50 percent to 80 percent of their time
collecting and preparing unruly digital data.
Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file that you can just start the Data Exploration with a little bit of cleaning. You need to find the data in a format that is not suitable for your needs.
What you can do is make use of the old data as much as you can and try to adapt the storing of new data in a process that will be easier for you (or a future colleague) to work with.
edited 2 days ago
Stephen Rauch
1,52551330
1,52551330
answered Apr 3 at 16:35
TasosTasos
1,45611038
1,45611038
$begingroup$
Forbes article claiming the same 80% figure.
$endgroup$
– Jesse Amano
Apr 3 at 19:08
4
$begingroup$
Forbes should nowhere be mentioned together with the words "data science".
$endgroup$
– gented
Apr 3 at 22:52
$begingroup$
50-80% based on (quote) "interviews and expert estimates"
$endgroup$
– oW_
Apr 3 at 23:33
3
$begingroup$
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
$endgroup$
– Keeta
2 days ago
add a comment |
$begingroup$
Forbes article claiming the same 80% figure.
$endgroup$
– Jesse Amano
Apr 3 at 19:08
4
$begingroup$
Forbes should nowhere be mentioned together with the words "data science".
$endgroup$
– gented
Apr 3 at 22:52
$begingroup$
50-80% based on (quote) "interviews and expert estimates"
$endgroup$
– oW_
Apr 3 at 23:33
3
$begingroup$
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
$endgroup$
– Keeta
2 days ago
$begingroup$
Forbes article claiming the same 80% figure.
$endgroup$
– Jesse Amano
Apr 3 at 19:08
$begingroup$
Forbes article claiming the same 80% figure.
$endgroup$
– Jesse Amano
Apr 3 at 19:08
4
4
$begingroup$
Forbes should nowhere be mentioned together with the words "data science".
$endgroup$
– gented
Apr 3 at 22:52
$begingroup$
Forbes should nowhere be mentioned together with the words "data science".
$endgroup$
– gented
Apr 3 at 22:52
$begingroup$
50-80% based on (quote) "interviews and expert estimates"
$endgroup$
– oW_
Apr 3 at 23:33
$begingroup$
50-80% based on (quote) "interviews and expert estimates"
$endgroup$
– oW_
Apr 3 at 23:33
3
3
$begingroup$
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
$endgroup$
– Keeta
2 days ago
$begingroup$
@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
$endgroup$
– Keeta
2 days ago
add a comment |
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.
$endgroup$
add a comment |
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.
$endgroup$
add a comment |
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.
$endgroup$
Feels like most of the work is not related to data science at all. Is this accurate?
This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.
answered Apr 3 at 16:47
Shamit VermaShamit Verma
1,4291214
1,4291214
add a comment |
add a comment |
$begingroup$
As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.
So to answer:
1) Yes totally accurate
2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).
3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").
New contributor
$endgroup$
add a comment |
$begingroup$
As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.
So to answer:
1) Yes totally accurate
2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).
3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").
New contributor
$endgroup$
add a comment |
$begingroup$
As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.
So to answer:
1) Yes totally accurate
2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).
3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").
New contributor
$endgroup$
As another recent starter in Data Science, I can only add that I don't think you're experience is unique, my team of about 10 apparently hasn't done any DS in over a year (one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team's been working on, but still just isn't quite delivering the data. Apparently retention has been fairly poor in the past and there's continuous promise of a holy-grail MS Azure environment for future DS projects.
So to answer:
1) Yes totally accurate
2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).
3) I'm sure there's companies out there who are better than others. If you can't stand it at your current company, 2 years is a decent length of time, start looking for brighter things (be careful how you phrase your desire to leave your current job, something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").
New contributor
New contributor
answered Apr 3 at 23:03
Oliver HoustonOliver Houston
1511
1511
New contributor
New contributor
add a comment |
add a comment |
$begingroup$
Feels like most of the work is not related to data science at all. Is this accurate?
Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster/more efficient, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.
I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).
Is this type of setup common for a company with serious data science needs?
Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.
answered 2 days ago
Underminer
If you look at this from the perspective of "this isn't my job, so why should I do it", then that's a fairly common, general problem, not one specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this, and usually they can be persuaded, or at least they will give you a sincere explanation of why it has to be that way. As far as appealing to authority goes, there is no official definition of "data science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.
You could also look at it from another perspective: is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but are having to do something else (which you call "data wrangling"). Job descriptions and personal feelings are somewhat beside the point here, because there is something more pertinent: the company presumably pays you a good amount of money to do something that only you can do (the data science), but it is having you do other things instead, which could be done by other people who are some combination of more capable, more motivated, or less expensive. If the data wrangling could be done by someone making half your salary, it makes no sense to pay you twice as much to do the same thing; if it could be done faster by someone paid the same salary, the same logic applies. Either way, it is a waste of resources (especially money) for the company to assign this task to you. Coming at it from this angle, you might find it much easier to make your superiors see your side of things.
Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of getting it done, the best person for the job, is you. In that case, you're somewhat out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?
answered 2 days ago
Whelibeiren
Perhaps to put it simply:
- When creating variables and binning numerics, would you be doing that blindly, or after analysing your data? (See the short binning sketch after this list.)
- When peers review your findings and have questions about particular bits of data, would it embarrass you not to know the answers?
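A minimal sketch of the binning point, with an invented column and invented bin edges (not from the answer): look at the distribution first, then choose the bins.

    import pandas as pd

    # Hypothetical retail column: basket_value for a handful of orders.
    orders = pd.DataFrame({"basket_value": [3.5, 12.0, 47.9, 150.0, 8.2]})

    # Analyse first ...
    print(orders["basket_value"].describe())

    # ... then bin with edges informed by that analysis, rather than blindly.
    orders["basket_band"] = pd.cut(
        orders["basket_value"],
        bins=[0, 10, 50, float("inf")],
        labels=["small", "medium", "large"],
    )
    print(orders)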
You need to work with and understand your data, which ranges from simple things like fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from being collected to being displayed. Processing it requires knowing the same pieces of information, so it is partly work you would have had to do anyway.
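A minimal sketch of that kind of inconsistency fixing, assuming pandas and an invented tab-separated file:

    import numpy as np
    import pandas as pd

    # Hypothetical dump that uses several ad-hoc markers for "missing".
    # keep_default_na=False keeps them as literal strings so we can normalise them ourselves.
    df = pd.read_csv("raw_dump.txt", sep="\t", dtype=str, keep_default_na=False)

    # Map the different flavours of "missing" to a single NaN.
    df = df.replace({"": np.nan, "-": np.nan, "NULL": np.nan, "N/A": np.nan})

    # Strip stray whitespace from the remaining string columns.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()

    print(df.isna().sum())  # quick view of how much was actually missing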
Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Being flexible when you design your wrangling code is also a good idea: keeping an intermediate dataset of processed data would, I think, be useful if you're allowed to (and can't do it in MySQL).
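And a hedged sketch of the intermediate-dataset idea for when no MySQL instance is available, writing the processed data to a Parquet file instead (this assumes pyarrow or fastparquet is installed; the names are invented):

    import pandas as pd

    # Stand-in for the frame produced by the wrangling step.
    clean = pd.DataFrame({"order_id": [1, 2], "amount": [19.90, 4.50]})

    # Persist the processed data once, so later analyses don't redo the wrangling.
    clean.to_parquet("transactions_clean.parquet", index=False)

    # Downstream work loads the intermediate dataset instead of re-parsing raw TXT dumps.
    df = pd.read_parquet("transactions_clean.parquet")
    print(df.head())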
But of course you're still setting things up from scratch. This is not an easy process, but the "learning experience" is at least good to put on your CV.
answered 2 days ago
David M
1) Feels like most of the work is not related to data science at all. Is this accurate?
In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of data wrangling a data scientist should be expected to do. It depends on the organization's bandwidth and the person's interest in doing such work. In my 15 to 16 years as a DS, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis, so take your call.
2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
Again, it depends on the organization's security policies. They cannot hand everything over to you, and they have their own security concerns about revealing the data to someone they see as a temporary employee (sorry to use those words :-().
3) Is this type of setup common for a company with serious data science needs?
I feel these kinds of companies need the most attention from data scientists, to help them see that data-driven modeling is the future of sustaining their business. :-)
I have given my input from a business perspective rather than a technical standpoint. :-)
I hope I am clear in my choice of words.
answered yesterday
user70920
In his talk "Big Data is four different problems", Turing Award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides).
He says that there are a number of open problems in this area: ingest; transform (e.g. euro/dollar); clean (e.g. -99/NULL); schema mapping (e.g. wages/salary); and entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker).
There are a number of companies/products working to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata, and Google Refine.
Until this area matures, a lot of the data scientist job will indeed be data wrangling.
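As a toy-scale, hedged illustration of the transform, clean, schema-mapping, and entity-consolidation steps listed above (the column names, exchange rate, and matching threshold are all invented, and the dedicated tools above do far more than this):

    import difflib

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "wages_eur": [50000, -99, 61000],
        "author": ["Mike Stonebraker", "Michael Stonebreaker", "Ada Lovelace"],
    })

    # Clean: -99 is a sentinel for "missing".
    df["wages_eur"] = df["wages_eur"].replace(-99, np.nan)

    # Transform: euros to dollars (made-up fixed rate).
    df["salary_usd"] = df["wages_eur"] * 1.08

    # Schema mapping: the target schema calls this field "salary", not "wages".
    df = df.rename(columns={"wages_eur": "salary_eur"})

    # Entity consolidation: collapse near-duplicate spellings onto a canonical name.
    canonical = ["Michael Stonebraker", "Ada Lovelace"]
    df["author"] = df["author"].apply(
        lambda name: (difflib.get_close_matches(name, canonical, n=1, cutoff=0.8) or [name])[0]
    )
    print(df)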
answered 14 hours ago
hojusaram
Did you specify which format you want the information in? And did you give them instructions on how they can do this with their ERP?
– jonnor
Apr 3 at 19:57
@jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
– Victor Valente
Apr 3 at 20:12
Start tracking your hours and convert the time wasted turning TXT dumps back into a usable format into a cost. I'll bet that once they have a dollar figure, they can get it done.
– Nelson
2 days ago
If it is a burden on your time, you could outsource it.
– Sarcoma
2 days ago
I find it confusing that a company would hire a data scientist and still be resistant to change. You should show them the amount of wasted time and the danger of keeping data in long TXT files without any real security around it.
– Pedro Henrique Monforte
11 hours ago