CHAPTER ONE:
WHAT IS DATA SCIENCE?
Different Types of Data Science
So you have made the
decision to become a data scientist. Great, you are on your way. But now you
have another choice, which is: what kind of data scientist do you want
to become? Because – it is important to acknowledge – while data
science as a profession has been recognised for a number of years now, there
still isn’t a commonly accepted definition of what it actually is.
In reality, the term
‘data scientist’ is regarded as a broad job title and so it comes in many
forms, with the specific demands dependent on the industry, the business, and
the purpose/output of the role in question. As a result, certain skillsets suit
certain positions better than others, and this is why the path to data science
is not uniform and can be via a diverse range of fields such as statistics,
computer science and other scientific disciplines.
The purpose is the
biggest factor that dictates what form data science takes, and this is related
to the Type A-Type B classification that has emerged (see here: What is Data
Science?). Broadly speaking, the categorization can
be summarized as:
- Data science for
people (Type A), i.e. analytics to support evidence-based decision making
- Data science for
software (Type B), for example: recommender systems as we see in Netflix
and Spotify
We may see further
evolution of these definitions as the field matures, but for now, we will
continue this exploration with a look at the ‘science’ in data science.
Owning Up To The Title
All scientists work
with data, so in a sense all scientists are data scientists. But to take what
is generally considered to be data science in industry, what makes it a
science? What a good question! The answer should be: ‘the scientific method’.
Given the multi-disciplinary nature of science, the scientific method is the
one thing that binds the fields together. If you got this right, full marks to
you.
However, job titles
tend to be applied very loosely in industry and so not all data scientists are
true scientists. Ask yourself though: can you justify calling yourself a
scientist if your role does not involve actual science? Personally, I do not
see what is wrong with alternatives like ‘analyst’, or whatever best fits the
position in question. But maybe this is just me, and perhaps I would be better
off calling myself a recruitment scientist.
For an excellent
discussion on this, I thoroughly recommend this post by Sean McClure: Data
Scientist: Owning Up To The Title (yes,
I admit it – I plagiarised the title).
With that out of the way,
we will now delve further into data science by considering what areas of
expertise you will need to master (if you haven’t already).
1. Problem Solving
If this is not top of your
list, amend that list. Immediately. At the core of all scientific disciplines
is problem solving: a great data scientist is a great problem solver; it is as
simple as that. Need further proof? How about this: every single person I met for
this project, irrespective of background or current working situation,
mentioned this as THE most important factor in data science.
Clearly, you need to
possess the tools to solve the problems, but they are just that: tools.
In this sense, even the statistical/machine learning techniques can be thought
of as the tools by which you solve problems. New techniques arise, technology
evolves; the
one constant is problem solving.
To an extent, your
ability as a problem solver is dictated by your nature, but at the same time,
there is only one way to improve: experience, experience,
experience. We will revisit this in
Chapter Three, so at this point, just remember this important lesson: you can
only master something through doing.
Before we move on, I
would like to direct you to another great post from Sean McClure: The Only
Skill You Should Be Concerned With (just
to be clear, I am not receiving any payment for these pointers, but I am
totally open to it. Sean – if you are reading this, you can send me money
anytime).
2. Statistics / Machine Learning
Ok, having read the
above, it might seem like I have trivialized statistics and machine learning.
But we are not talking about a power tool here; these are complex – and to an
extent – esoteric fields, and if you do not possess expert knowledge, you will
not be solving data science problems any time soon.
To provide some
much-needed clarification on these terms, machine learning can be viewed as a
multi-disciplinary field that grew out of both artificial
intelligence/computer science and statistics. It is
often seen as a subfield of AI, and while this is true, it is important to
recognise that there is no machine learning without statistics (ML is heavily
dependent on statistical algorithms in order to work). For a long time
statisticians were unconvinced by machine learning, with collaboration between
the two fields being a relatively recent development (see statistical learning
theory), and it is interesting to note that high dimensional statistical
learning only happened when statisticians embraced ML results (thanks
to Bhavani Rascutti, Advanced Analytics Domain Lead at Teradata for this input).
For the technical
readers who are interested in a more detailed account, check out this classic
paper published in 2001 by Leo Breiman: Statistical
Modeling: The Two Cultures.
3. Computing
a. Programming
We only need to
briefly touch on programming because it should be obvious: this is an absolute
must. How can you apply the theory if you cannot code a unique algorithm
or build a statistical model?
b. Distributed Computing
Not all businesses have
massive datasets but considering the modern world, it is advisable to develop
the ability to work with BIG DATA (!). In short: the main memory of a single
computer is not going to cut it, and if you want to simultaneously train models
across hundreds of virtual machines, you need to get to grips with distributed
computation and parallel algorithms.
Why the exclamation mark? Personally, I find the misnomer that is “big
data” farcical. The term is continually confused and often used as an umbrella
term for all analytics. Furthermore, massive data volumes (and the technologies
to store and manage these quantities) are no longer the novelty they once were, so it is
only a matter of time before the term expires from our lexicon. For an expanded
discussion on this, there is yet another sensible post from Sean McClure: Data Science and Big Data: Two
Very Different Beasts (this is getting ridiculous now – I swear I have never even talked
to the guy).
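To make the idea of splitting work across many workers a little more concrete, here is a minimal sketch in Python. It is only an illustration under some loud assumptions: it runs on a single machine, the threads merely stand in for separate workers (they will not speed up this particular calculation), and the names partial_sum and parallel_mean are my own inventions rather than part of any framework. What it does show is the pattern that parallel algorithms tend to share: split the data, summarise each piece independently, then combine the summaries.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Summarise one slice of the data independently (the 'map' step)."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_chunks=4):
    """Split the data, summarise each chunk concurrently, then combine."""
    if not values:
        raise ValueError("values must not be empty")
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Threads are a stand-in for workers here; a real cluster would be
    # driven by a distributed computing framework instead.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # The 'reduce' step: combine the per-chunk summaries into one answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
print(parallel_mean(data))
```

The point of returning a (sum, count) pair from each chunk, rather than a mean, is that sums and counts can be combined exactly, whereas averaging the chunk means would give the wrong answer whenever the chunks differ in size.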
c. Software Engineering
For Type A data
science, let me be clear: engineering is a separate discipline. So if this is
the type of data scientist you want to become, you do not need to be an
engineer. However, if you want to put machine learning algorithms into
production (i.e. Type B), you will need a strong foundation in software
engineering.
4. Data Wrangling
Data
cleaning/preparation is a crucial and intrinsic part of data science. And
this will take up the majority of your time. If you fail to remove
the noise from your dataset (e.g. wrong/missing values, non-standardised
categories, etc.), then the accuracy of the model will be affected and will
ultimately lead to incorrect conclusions. Therefore, if you are not prepared to
spend the time and attention on this step, it renders your technical know-how
irrelevant.
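As a rough illustration of what this cleaning involves, here is a small Python sketch. The records, the column names (city, age) and the plausible age range of 0 to 120 are all invented for the example, and clean_row is a hypothetical helper, so treat it as a flavour of the work rather than a recipe. It standardises category labels, turns missing values into None, and treats impossible values as missing.

```python
# Hypothetical raw records: column names and values are invented for illustration.
raw_rows = [
    {"city": " Sydney", "age": "34"},
    {"city": "sydney ", "age": ""},
    {"city": "MELBOURNE", "age": "-5"},
    {"city": "Melbourne", "age": "41"},
]


def clean_row(row):
    """Return a cleaned copy of one record."""
    # Standardise category labels: trim whitespace and unify capitalisation.
    city = row["city"].strip().title()
    # Missing or unparseable ages become None instead of crashing later.
    try:
        age = int(row["age"])
    except ValueError:
        age = None
    # Assumed plausibility check: ages outside 0-120 are treated as missing.
    if age is not None and not 0 <= age <= 120:
        age = None
    return {"city": city, "age": age}


cleaned = [clean_row(r) for r in raw_rows]
known_ages = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
print(cleaned)
print(mean_age)
```

Without the cleaning step, the four rows would appear to contain four different cities, and the average age would either fail on the empty string or be dragged down by the negative value.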
It is also important to
note that data quality is a persistent issue in commercial organisations and
many businesses have complicated infrastructures when it comes to data storage.
So if you are not prepared for this environment and you want to work with nice
clean datasets, unfortunately commercial data science is not for you.
5. Tools and Technology
As you should have
realized by now, developing your ability as a problem-solving data scientist
should take precedence over everything else: technologies constantly change and
can ultimately be learnt in a relatively short timeframe. But we shouldn’t
ignore them altogether, so it is useful to be aware of the most widespread
tools in use today.
Starting with
programming languages, R and Python are the most common; so if you have a
choice, perhaps use one of these when you are experimenting.
Particularly in Type A
data science, having the ability to visualize data in intuitive dashboards is
very powerful for communicating with non-technical business stakeholders. You
might have the best model and the best insights, but if you cannot
present/explain the findings effectively, what use is it? It really doesn’t
matter what tool you use for visualization – it could be R, or Tableau (which
seems to be the most prevalent at the moment), but honestly – the tool is
unimportant.
Finally, a lot of
businesses put emphasis into SQL ability. SQL is the most common language used
to interact with databases in industry, whether we are talking about relational
databases or derivatives of SQL used with big data technologies. And it is the
bread and butter of data wrangling – at least when working at larger scales
(i.e. not in memory). As a result, it is worth investing some of your time to
pick this up.
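If you have never seen SQL, the following sketch gives a feel for it. It uses Python's built-in sqlite3 module with a throwaway in-memory database; the orders table and its contents are made up purely for illustration. The query filters out missing amounts and then aggregates per customer, which is typical of the wrangling you would push down to the database rather than do in memory.

```python
import sqlite3

# Throwaway in-memory database; the table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5), ("bob", None)],
)

# Drop rows with a missing amount, then summarise spending per customer.
query = """
    SELECT customer, COUNT(amount) AS n_orders, SUM(amount) AS total
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

The same SELECT, WHERE and GROUP BY pattern carries over to most relational databases, although each system has its own dialect and quirks.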
6. Communication / Business Acumen
This should not be
understated. Unless you are going into something very specific, perhaps pure
research (although let’s face it, there aren’t many of these positions around
in industry), the vast majority of data science positions involve business
interaction, often with individuals who are not analytically literate.
Having the ability to
conceptualize business problems and the environment in which they occur is
critical. And translating statistical insights into recommended actions and
implications to a lay audience is absolutely crucial, particularly for Type A
data science. I was chatting to Yanir Seroussi who is Head of Data Science at
Car Next Door (a start-up enabling car sharing), and this is how he put it:
“I find it weird how
some technical people don't pay attention to how non-technical people's eyes
glaze over when they start using jargon. It's really important to put yourself
in the listener's/reader's shoes”.
As a quick aside, if
you have some time, check out Yanir’s website; he is a regular and eloquent
writer on a variety of topics around data science.
Rock Stars
In case it isn’t
clear: I have used this title ironically. No – data scientists are not rock stars,
ninjas, unicorns or any other mythical creature. If you are planning on
referring to yourself like this, perhaps take a long look in the mirror.
Anyway, I digress. The point I want to make here is this: there are some data
scientists who possess expert level ability in all of the above, and perhaps
more. They are rare and extremely valuable. If you have the natural ability and
desire to become one of these, then great – you are going to be hot property.
But if not, remember: you can specialize in certain areas of data science, and
quite often, good teams are composed of data scientists with different
specialities. Deciding what to focus on goes back to your interests and
capability, and this leads us nicely to the next chapter in our journey.
CHAPTER TWO:
LOOKING INWARDS
Now we are making
progress! Having successfully digested the information in Chapter One, you are
nearly ready to begin formulating your personal goals and objectives. But first
– some introspection is required – so grab a coffee, sit yourself down in a
quiet place, and have a deep think about:
- Why do you want to be a data scientist?
- What type of data science interests you?
- What natural capabilities or relevant skills do
you already possess?
Why is this important?
Simply put: data science is an expert field, so unless you have already
mastered a lot of what we covered in Chapter One, it is not an easy (or quick)
journey. There is an important message here, which addresses questions one and
two: you need to have the right reasons for going down this path, otherwise – chances are – you will give up
when the going gets tough (and it will).
To elaborate on this
message, enter Dylan Hogg (remember these names, as we will be
returning to them). Dylan was previously a software engineer and is
now Head of Data Science at The Search Party, a start-up that has built a
platform that utilizes machine learning (NLP) to link employers with relevant
candidates (the
future of recruitment!). Considering he has made the
transition from software engineering to data science (a journey he is still
on), we discussed what it takes, and he said:
“Regardless of
education or experience, there’s something more fundamental, which is your nature
of curiosity, determination and tenacity. There are so many times when you hit
a problem: perhaps the algorithm isn’t performing in the way it needs to, or
perhaps the technology is being a pain. Either way, you can study machine
learning algorithms or software engineering best practice, but if you’re not
really determined, you are going to give up and not get through it”.
There you go: you won’t
just face problems when you are learning; you will face them continually in
your working life, so you better make sure you are motivated for the right
reasons, and not just because you think having ‘scientist’ in your title is
cool.
But what about question
three? Why do your relevant skills matter? Well, where you are starting from
affects what type of data science you are most suited to, and what you need to
learn for the area that interests you. And so we will now explore the typical
paths to data science, starting with the wider scientific field.
Note: There are many quantitative disciplines where you will find people
with the ability to transition into data science. I won’t cover them all here,
but the point is this: if you take the time to really understand the different
nuances of data science, you should be able to figure out how relevant your
current skillset is, whatever your background.
Other Scientific Disciplines
This is not the most
common route to data science; statistics and computer science are, as we will
see. But with scientists from many fields having highly relevant skillsets
(especially in the world of physics), many have made this jump.
For an explanation on
why, allow me to introduce Will Hanninger, a Data Scientist with Commonwealth
Bank of Australia. In a previous life, Will was a particle physicist with CERN
where he worked on the discovery of the Higgs boson (very cool), and this is what
he had to say:
“In physics, you
naturally learn a lot of what you need in data science: programming,
manipulating data, getting the raw data and transforming it in a useful way.
You learn statistics, which is important. And crucially: you learn how to solve
problems. These are the basic skills needed for a data scientist”.
So the skillset is
highly transferable, with the main box ticked: problem solving. The differences
tend to arise in the tools and techniques; for example, while machine learning
is synonymous with data science, it is less common in wider science. In any
case, we are talking about very smart people here; they have the ability to
learn tools and techniques in a short timeframe.
I also met Sean Farrell
for this project; Sean’s background is in astrophysics and he moved into
commercial data science with Teradata Australia, where he wrote an excellent
blog post on this topic: Why Science’s
Loss is a Gain for Data Science.
The following passage is particularly pertinent:
“Until recently there
haven’t been any formal training pathways to become a Data Scientist. Most Data
Scientists come from backgrounds in statistics or computer science. However,
while these other career paths develop some of the skills listed above, they
typically don’t cover all of them. Statisticians are very strong on the maths
and stats side, but generally have weaker programming skills. Computer
scientists are very strong in the programming arena, but typically don’t have
as strong a comprehension of statistics. Both have good (yet different) data
analysis skill sets but can struggle with creative problem solving, which is
arguably the hardest skill to teach”.
To avoid
misunderstanding, remember the context here. Sean isn’t saying that all data
scientists from statistics or computer science lack creative problem solving;
the argument he is making is that science filters extremely effectively for
problem solving, arguably more so than statistics/computer science.
Statistics
With science covered,
it is statistics’ turn to be cross-examined. In recent times, many statistical
positions have been re-branded as data science (of the Type A variety), so in a
sense, we are getting into semantics. But as before, I hold the opinion that
the scientific method should be applied for it to be deemed a science: does it
involve setting hypotheses, designing robust experiments, etc.? If not, perhaps
a title like ‘statistician’ or ‘modelling analyst’ is a better fit.
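For readers wondering what setting a hypothesis looks like in practice, here is one small Python sketch of an exact permutation test. The measurements are invented, and mean_gap is a hypothetical helper; this is just one of many ways to test a hypothesis, not a prescribed method. The null hypothesis is that the treatment made no difference, in which case the group labels are arbitrary and every way of relabelling the pooled values is equally likely.

```python
from itertools import combinations

# Invented measurements for illustration only.
control = [12, 15, 11, 14]
treatment = [18, 17, 20, 16]

pooled = control + treatment
n_treat = len(treatment)
total = sum(pooled)


def mean_gap(treat_sum):
    """Mean of the 'treatment' group minus mean of the remaining values."""
    return treat_sum / n_treat - (total - treat_sum) / (len(pooled) - n_treat)


observed = mean_gap(sum(treatment))

# Enumerate every possible relabelling of the pooled values into two groups.
gaps = [mean_gap(sum(combo)) for combo in combinations(pooled, n_treat)]

# One-sided p-value: share of relabellings with a gap at least as large
# as the one actually observed.
p_value = sum(g >= observed for g in gaps) / len(gaps)
print(observed, len(gaps), p_value)
```

With only eight values the test can enumerate every relabelling; with larger samples you would typically draw random relabellings instead, which is an approximation of the same idea.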
That aside, if you are
a statistician/analyst in industry or just coming out of higher-level education
in statistics, there is a chance you already possess everything you need to
obtain a role as a data scientist. It depends on a few factors:
- Firstly, do you
have experience in machine learning techniques? As we saw in Chapter One,
statistical modelling and machine learning are related, but the latter
possesses significant advantages when applied to massive datasets. And
with the adoption of machine learning continuing to rise in all areas of
industry, it really is synonymous with all types of data science
- Secondly, at the
risk of repeating myself, what area of data science interests you? Clearly
a statistics background is better suited to Type A positions, so if your
goal is Type B work, you will have some learning to do
- Finally, do you
have practical experience working with data? Data wrangling is often a
comparative weakness of those coming from statistics, and as we learnt in
Chapter One, it is a crucial component of commercial data science
Computer Science / Software Engineering
If you have studied
artificial intelligence/computer science to a high level, then it is likely you
are already in a good position for Type B data science. But there is the other
well-trodden path to consider: the experienced software engineer who wants to
move into data science.
A software engineer
might or might not have experience in machine learning – it depends. But
either way, this background is clearly more suited to Type B data science,
which requires a solid grounding in software engineering principles. I
discussed this with James Petterson who is a Senior Data Scientist at
Commonwealth Bank of Australia (and previously a software engineer), and here
is what he said on the matter:
“A lot of data science
work is software engineering. Not always in the sense of designing robust
systems, but simply writing software. A lot of tasks you can automate and if
you want to run experiments, you have to write code, and if you can do it fast,
it makes a huge difference. When I did my PhD, I had to run tens of thousands
of experiments every day, and at this scale, it wasn’t possible to do them
manually. Having an engineering background meant I could do this with speed,
whereas a lot of the students from other backgrounds struggled with basic
software issues: they were really good at mathematics but implementing their
ideas would take a long time”.
And Dylan added:
“Good software
engineering practices are so valuable when you want to create a robust
implementation of a machine learning algorithm in a production environment.
It’s all sorts of things – like maintainable code, a shared code base so
multiple people can work on it, things like logging, being able to debug
problems in production, scalability – to know that once things ramp up, you’ve
architected it in such a way so that you can parallelize it, or add more CPU, if
needed. So if you’re looking for the type of roles where you need to get these
things into a platform, as opposed to doing exploratory research or answering
ad-hoc business questions, software engineering is so valuable”.
I think that says it
all, but to summarise: if you are a software engineer with a good disposition
for mathematics, you are in a great position to become a (Type B) data
scientist, providing you are prepared to put in the work to learn
statistics/machine learning.
Mathematics
To make an obvious
statement: mathematics underpins all areas of data science. Therefore, it seems
reasonable to assume that many mathematicians are now plying their trade as
data scientists. However, there are relatively few coming directly from
mathematics, and this peculiarity piqued my interest. One explanation is that
there are fewer graduates from mathematics (both pure and applied) compared to
the other relevant fields of study, but this fails to tell the whole story.
To dig deeper, I turned
to Boris Savkovic, Lead Data Scientist at BuildingIQ (a start-up that uses
advanced algorithms to optimise energy use in commercial buildings). Boris has
a background in Electrical Engineering and Applied Mathematics and having worked
with many mathematicians in his time, he provided the following insights:
“Many mathematicians
have a love of theoretical problems, beautiful equations and seeing deep
meaning in theorems, whereas commercial data science is empirical, messy and
dirty. While some mathematicians love this, many hate it. The real world is
complex, you cannot sandbox everything, you have to prioritise, appreciate the
incentives of others, compromise the math and technology for short-term vs.
medium-term vs. long-term, worry about diminishing returns (80/20 rule) and
deal with both deep theory and deep practice, and everything in-between. In
short: you have to be flexible and adaptable to deal with the real world. And
this is ultimately what commercial data science is about: finding faster and
better practical solutions that make money. For those with heavy
mathematical/theory backgrounds who want to understand everything to the last
degree, this can be very difficult, and I have seen a number of mathematics
PhDs struggle badly when transitioning from research/academia to commercial
data science”.
It is important to note
that Boris was referring more to pure mathematicians, and he added that he has
also worked with many excellent applied mathematicians in his career. This
seems logical because pure mathematics is likely to attract those with a love
for the theory, as opposed to real world problems. And theoretical work won’t
involve much interaction with data, which is – you know – quite important for
data science.
There are exceptions of
course and it ultimately comes down to individual character, not purely what
someone has studied. And clearly: a lot of what mathematics graduates learn is
highly transferable, so picking up the specific statistical/machine learning
techniques shouldn’t be too difficult (if not already known).
In terms of
suitability, most mathematicians are probably best equipped to learn the tools
and theory for Type A data science. However, there are mathematicians who study
computer science (theoretical computer science is essentially a branch of
mathematics) and so people with this background may be more suited to Type B data
science.
There is an important
lesson to take from all this, and it comes down to understanding the reality of
what commercial data science involves. If you truly understand the challenges
and that is what you are seeking, then go for it. But if you have a love for
the theory more than the practical application, you might want to reassess your
thinking.
The Blank Canvas
If you are just
starting out, perhaps you are in school, you enjoy maths, science and
computing, and you like the sound of this thing called data science, well good
news: you can choose your path without being constrained by a pre-existing
background. And there are now a number of specific data science related
courses, which cover both computer science and mathematics/statistics. Just be
prepared for the long haul; you will not become a data scientist overnight, as
we will see in Part Two, where we will be examining: how to learn.
- See more at:
https://www.experfy.com/blog/how-to-become-a-data-scientist-part-1-3#sthash.6NMFyfTV.dpuf