Thursday, 2 June 2016

How to Become a Data Scientist (Part 1/3)


Different Types of Data Science

So you have made the decision to become a data scientist. Great, you are on your way. But now you have another choice, which is: what kind of data scientist do you want to become? Because – it is important to acknowledge – while data science as a profession has been recognised for a number of years now, there still isn’t a commonly accepted definition of what it actually is.
In reality, the term ‘data scientist’ is regarded as a broad job title and so it comes in many forms, with the specific demands dependent on the industry, the business, and the purpose/output of the role in question. As a result, certain skillsets suit certain positions better than others, and this is why the path to data science is not uniform and can be via a diverse range of fields such as statistics, computer science and other scientific disciplines.
The purpose is the biggest factor that dictates what form data science takes, and this is related to the Type A-Type B classification that has emerged (see here: What is Data Science?). Broadly speaking, the categorization can be summarized as:
  • Data science for people (Type A), i.e. analytics to support evidence-based decision making
  • Data science for software (Type B), for example: recommender systems as we see in Netflix and Spotify
We may see further evolution of these definitions as the field matures, but for now, we will continue this exploration with a look at the ‘science’ in data science.
Owning Up To The Title
All scientists work with data, so in a sense all scientists are data scientists. But to take what is generally considered to be data science in industry, what makes it a science? What a good question! The answer should be: ‘the scientific method’. Given the multi-disciplinary nature of science, the scientific method is the one thing that binds the fields together. If you got this right, full marks to you.
However, job titles tend to be applied very loosely in industry and so not all data scientists are true scientists. Ask yourself though: can you justify calling yourself a scientist if your role does not involve actual science? Personally, I do not see what is wrong with alternatives like ‘analyst’, or whatever best fits the position in question. But maybe this is just me, and perhaps I would be better off calling myself a recruitment scientist.
For an excellent discussion on this, I thoroughly recommend this post by Sean McClure: Data Scientist: Owning Up To The Title (yes, I admit it – I plagiarised the title).
With that out the way, we will now delve further into data science by considering what areas of expertise you will need to master (if you haven’t already).

1.     Problem Solving

If this is not top of your list, amend that list. Immediately. At the core of all scientific disciplines is problem solving: a great data scientist is a great problem solver; it is as simple as that. Need further proof? How about every single person I met for this project, irrespective of background or current working situation, mentioned this as THE most important factor in data science.
Clearly, you need to possess the tools to solve the problems, but they are just that: tools. In this sense, even the statistical/machine learning techniques can be thought of as the tools by which you solve problems. New techniques arise, technology evolves; the one constant is problem solving.
To an extent, your ability as a problem solver is dictated by your nature, but at the same time, there is only one way to improve: experience, experience, experience. We will revisit this in Chapter Three, so at this point, just remember this important lesson: you can only master something through doing.
Before we move on, I would like to direct you to another great post from Sean McClure: The Only Skill You Should Be Concerned With (just to be clear, I am not receiving any payment for these pointers, but I am totally open to it. Sean – if you are reading this, you can send me money anytime).

2.     Statistics / Machine Learning

Ok, having read the above, it might seem like I have trivialized statistics and machine learning. But we are not talking about a power tool here; these are complex – and to an extent – esoteric fields, and if you do not possess expert knowledge, you will not be solving data science problems any time soon.
To provide some much-needed clarification on these terms, machine learning can be viewed as a multi-disciplinary field that grew out of both artificial intelligence/computer science and statistics. It is often seen as a subfield of AI, and while this is true, it is important to recognise that there is no machine learning without statistics (ML is heavily dependent on statistical algorithms in order to work). For a long time statisticians were unconvinced by machine learning, with collaboration between the two fields being a relatively recent development (see statistical learning theory), and it is interesting to note that high dimensional statistical learning only happened when statisticians embraced ML results (thanks to Bhavani Rascutti, Advanced Analytics Domain Lead at Teradata for this input).
For the technical readers who are interested in a more detailed account, check out this classic paper published in 2001 by Leo Breiman: Statistical Modelling: The Two Cultures.

3.     Computing

a.     Programming

We only need to briefly touch on programming because it should be obvious: this is an absolute must. How can you apply the theory if you cannot code a unique algorithm or build a statistical model?

b.     Distributed Computing

Not all businesses have massive datasets but considering the modern world, it is advisable to develop the ability to work with BIG DATA (!). In short: the main memory of a single computer is not going to cut it, and if you want to simultaneously train models across hundreds of virtual machines, you need to get to grips with distributed computation and parallel algorithms.
Why the exclamation mark? Personally, I find the misnomer that is “big data” farcical. The term is continually confused and often used as an umbrella term for all analytics. Furthermore, massive data volumes (and the technologies to store and manage these quantities) are not new like they once were, so it is only a matter of time before it expires from our lexicon. For an expanded discussion on this, there is yet another sensible post from Sean McClure: Data Science and Big Data: Two Very Different Beasts (this is getting ridiculous now – I swear I have never even talked to the guy).

c.     Software Engineering

For Type A data science, let me be clear: engineering is a separate discipline. So if this is the type of data scientist you want to become, you do not need to be an engineer. However, if you want to put machine learning algorithms into production (i.e. Type B), you will need a strong foundation in software engineering.

4.     Data Wrangling

Data cleaning/preparation is a crucial and intrinsic part of data science. And this will take up the majority of your time. If you fail to remove the noise from your dataset (e.g. wrong/missing values, non-standardised categories, etc.), then the accuracy of the model will be affected and will ultimately lead to incorrect conclusions. Therefore, if you are not prepared to spend the time and attention on this step, it renders your technical know-how irrelevant.
It is also important to note that data quality is a persistent issue in commercial organisations and many businesses have complicated infrastructures when it comes to data storage. So if you are not prepared for this environment and you want to work with nice clean datasets, unfortunately commercial data science is not for you.
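To make the kind of noise described above a little more concrete, here is a minimal sketch in plain Python. Everything in it is invented for illustration: the records, the field names, the `STATE_ALIASES` mapping and the rule that a negative spend counts as a wrong value are all assumptions, not something taken from a real dataset. It simply standardises a category column and drops rows with missing or implausible values.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"customer": "A", "state": "NSW", "spend": "120.50"},
    {"customer": "B", "state": "new south wales", "spend": ""},  # missing value
    {"customer": "C", "state": " Vic ", "spend": "-15"},         # implausible value
    {"customer": "D", "state": "VIC", "spend": "80"},
]

# Assumed mapping from the spellings seen in the data to one standard label.
STATE_ALIASES = {
    "nsw": "NSW",
    "new south wales": "NSW",
    "vic": "VIC",
    "victoria": "VIC",
}


def clean_record(record):
    """Return a cleaned copy of the record, or None if it cannot be repaired."""
    # Standardise the category: trim whitespace, ignore case, map aliases.
    state = STATE_ALIASES.get(record["state"].strip().lower())
    if state is None:
        return None  # unknown category

    # Convert the numeric field; an empty or non-numeric string is rejected.
    try:
        spend = float(record["spend"])
    except ValueError:
        return None

    # Assumption for this sketch: a negative spend is a data-entry error.
    if spend < 0:
        return None

    return {"customer": record["customer"], "state": state, "spend": spend}


def clean_dataset(records):
    """Clean every record and keep only those that survived."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


cleaned_records = clean_dataset(raw_records)
```

Dropping bad rows is only one option; in practice you might impute missing values or send them back to whoever owns the data. Which is right depends on the problem, and that brings us back to problem solving.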

5.     Tools and Technology

As you should have realized by now, developing your ability as a problem solving data scientist should take precedence over everything else: technologies constantly change and can ultimately be learnt in a relatively short timeframe. But we shouldn’t ignore them altogether, so it is useful to be aware of the most widespread tools in use today.
Starting with programming languages, R and Python are the most common; so if you have a choice, perhaps use one of these when you are experimenting.
Particularly in Type A data science, having the ability to visualize data in intuitive dashboards is very powerful for communicating with non-technical business stakeholders. You might have the best model and the best insights, but if you cannot present/explain the findings effectively, what use is it? It really doesn’t matter what tool you use for visualization – it could be R, or Tableau (which seems to be the most prevalent at the moment), but honestly – the tool is unimportant.
Finally, a lot of businesses put emphasis into SQL ability. SQL is the most common language used to interact with databases in industry, whether we are talking about relational databases or derivatives of SQL used with big data technologies. And it is the bread and butter of data wrangling – at least when working at larger scales (i.e. not in memory). As a result, it is worth investing some of your time to pick this up.  
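If you have never seen SQL, the sketch below gives a flavour of it, run from Python using the standard library's `sqlite3` module with an in-memory database. The `orders` table, its columns and its rows are hypothetical, chosen only to show a typical filter-and-aggregate query; a real commercial database will be far larger and messier.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for this example.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("bob", None)],
)

# Filter out missing amounts, then count and total the orders per customer.
rows = conn.execute(
    """
    SELECT customer, COUNT(amount) AS n_orders, SUM(amount) AS total
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()

conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

The same SELECT / WHERE / GROUP BY pattern carries over, with minor dialect differences, to most relational databases and to the SQL-like layers that sit on top of big data technologies.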

6.     Communication / Business Acumen

This should not be understated. Unless you are going into something very specific, perhaps pure research (although let’s face it, there aren’t many of these positions around in industry), the vast majority of data science positions involve business interaction, often with individuals who are not analytically literate.
Having the ability to conceptualize business problems and the environment in which they occur is critical. And translating statistical insights into recommended actions and implications to a lay audience is absolutely crucial, particularly for Type A data science. I was chatting to Yanir Seroussi who is Head of Data Science at Car Next Door (a start-up enabling car sharing), and this is how he put it:
“I find it weird how some technical people don't pay attention to how non-technical people's eyes glaze over when they start using jargon. It's really important to put yourself in the listener's/reader's shoes”.
As a quick aside, if you have some time, check out Yanir’s website; he is a regular and eloquent writer on a variety of topics around data science. 

Rock Stars

In case it isn’t clear: I have used this title ironically. No – data scientists are not rock stars, ninjas, unicorns or any other mythical creature. If you are planning on referring to yourself like this, perhaps take a long look in the mirror. Anyway, I digress. The point I want to make here is this: there are some data scientists who possess expert level ability in all of the above, and perhaps more. They are rare and extremely valuable. If you have the natural ability and desire to become one of these, then great – you are going to be hot property. But if not, remember: you can specialize in certain areas of data science, and quite often, good teams are comprised of data scientists with different specialities. Deciding what to focus on goes back to your interests and capability, and this leads us nicely to the next chapter in our journey.


Now we are making progress! Having successfully digested the information in Chapter One, you are nearly ready to begin formulating your personal goals and objectives. But first – some introspection is required – so grab a coffee, sit yourself down in a quiet place, and have a deep think about:
  1. Why do you want to be a data scientist?
  2. What type of data science interests you?
  3. What natural capabilities or relevant skills do you already possess?
Why is this important? Simply put: data science is an expert field, so unless you have already mastered a lot of what we covered in Chapter One, it is not an easy (or quick) journey. There is an important message here, which addresses questions one and two: you need to have the right reasons for going down this path, otherwise – chances are – you will give up when the going gets tough (and it will).
To elaborate on this message, enter Dylan Hogg (remember these names, as we will be returning to them). Dylan was previously a software engineer and is now Head of Data Science at The Search Party, a start-up that has built a platform that utilizes machine learning (NLP) to link employers with relevant candidates (the future of recruitment!). Considering he has made the transition from software engineering to data science (a journey he is still on), we discussed what it takes, and he said:
“Regardless of education or experience, there’s something more fundamental, which is your nature of curiosity, determination and tenacity. There are so many times when you hit a problem: perhaps the algorithm isn’t performing in the way it needs to, or perhaps the technology is being a pain. Either way, you can study machine learning algorithms or software engineering best practice, but if you’re not really determined, you are going to give up and not get through it”.
There you go: you won’t just face problems when you are learning; you will face them continually in your working life, so you better make sure you are motivated for the right reasons, and not just because you think having ‘scientist’ in your title is cool.
But what about question three? Why do your relevant skills matter? Well, where you are starting from affects what type of data science you are most suited to, and what you need to learn for the area that interests you. And so we will now explore the typical paths to data science, starting with the wider scientific field.
Note: There are many quantitative disciplines where you will find people with the ability to transition into data science. I won’t cover them all here, but the point is this: if you take the time to really understand the different nuances of data science, you should be able to figure out how relevant your current skillset is, whatever your background.

Other Scientific Disciplines

This is not the most common route to data science; statistics and computer science are, as we will see. But with scientists from many fields having highly relevant skillsets (especially in the world of physics), many have made this jump.
For an explanation on why, allow me to introduce Will Hanninger, a Data Scientist with Commonwealth Bank of Australia. In a previous life, Will was a particle physicist with CERN where he worked on the discovery of the Higgs boson (very cool), and this is what he had to say:
“In physics, you naturally learn a lot of what you need in data science: programming, manipulating data, getting the raw data and transforming it in a useful way. You learn statistics, which is important. And crucially: you learn how to solve problems. These are the basic skills needed for a data scientist”.
So the skillset is highly transferable, with the main box ticked: problem solving. The differences tend to arise in the tools and techniques; for example, while machine learning is synonymous with data science, it is less common in wider science. In any case, we are talking about very smart people here; they have the ability to learn tools and techniques in a short timeframe.
I also met Sean Farrell for this project; Sean’s background is in astrophysics and he moved into commercial data science with Teradata Australia, where he wrote an excellent blog post on this topic: Why Science’s Loss is a Gain for Data Science. The following passage is particularly pertinent:
“Until recently there haven’t been any formal training pathways to become a Data Scientist. Most Data Scientists come from backgrounds in statistics or computer science. However, while these other career paths develop some of the skills listed above, they typically don’t cover all of them. Statisticians are very strong on the maths and stats side, but generally have weaker programming skills. Computer scientists are very strong in the programming arena, but typically don’t have as strong a comprehension of statistics. Both have good (yet different) data analysis skill sets but can struggle with creative problem solving, which is arguably the hardest skill to teach”.
To avoid misunderstanding, remember the context here. Sean isn’t saying that all data scientists from statistics or computer science lack creative problem solving; the argument he is making is that science filters extremely effectively for problem solving, arguably more so than statistics/computer science.
With science covered, it is statistics’ turn to be cross-examined. In recent times, many statistical positions have been re-branded as data science (of the Type A variety), so in a sense, we are getting into semantics. But as before, I hold the opinion that the scientific method should be applied for it to be deemed a science: does it involve setting hypotheses, designing robust experiments, etc.? If not, perhaps a title like ‘statistician’ or ‘modelling analyst’ is a better fit.
That aside, if you are a statistician/analyst in industry or just coming out of higher-level education in statistics, there is a chance you already possess everything you need to obtain a role as a data scientist. It depends on a few factors:
  • Firstly, do you have experience in machine learning techniques? As we saw in Chapter One, statistical modelling and machine learning are related, but the latter possesses significant advantages when applied to massive datasets. And with the adoption of machine learning continuing to rise in all areas of industry, it really is synonymous with all types of data science
  • Secondly, at the risk of repeating myself, what area of data science interests you? Clearly a statistics background is better suited to Type A positions, so if your goal is Type B work, you will have some learning to do
  • Finally, do you have practical experience working with data? Data wrangling is often a comparative weakness of those coming from statistics, and as we learnt in Chapter One, it is a crucial component of commercial data science

Computer Science / Software Engineering

If you have studied artificial intelligence/computer science to a high level, then it is likely you are already in a good position for Type B data science. But there is the other well-trodden path to consider: the experienced software engineer who wants to move into data science.
A software engineer might or might not have experience in machine learning – it depends. But either way, this background is clearly more suited to Type B data science, which requires a solid grounding in software engineering principles. I discussed this with James Petterson who is a Senior Data Scientist at Commonwealth Bank of Australia (and previously a software engineer), and here is what he said on the matter:
“A lot of data science work is software engineering. Not always in the sense of designing robust systems, but simply writing software. A lot of tasks you can automate and if you want to run experiments, you have to write code, and if you can do it fast, it makes a huge difference. When I did my PhD, I had to run tens of thousands of experiments every day, and at this scale, it wasn’t possible to do them manually. Having an engineering background meant I could do this with speed, whereas a lot of the students from other backgrounds struggled with basic software issues: they were really good at mathematics but implementing their ideas would take a long time”.
And Dylan added:
“Good software engineering practices are so valuable when you want to create a robust implementation of a machine learning algorithm in a production environment. It’s all sorts of things – like maintainable code, a shared code base so multiple people can work on it, things like logging, being able to debug problems in production, scalability – to know that once things ramp up, you’ve architected it in such a way so that you can parallelize it, or add more CPU, if needed. So if you’re looking for the type of roles where you need to get these things into a platform, as opposed to doing exploratory research or answering ad-hoc business questions, software engineering is so valuable”.
I think that says it all, but to summarise: if you are a software engineer with a good disposition for mathematics, you are in a great position to become a (Type B) data scientist, providing you are prepared to put in the work to learn statistics/machine learning.


To make an obvious statement: mathematics underpins all areas of data science. Therefore, it seems reasonable to assume that many mathematicians are now plying their trade as data scientists. However, there are relatively few coming directly from mathematics, and this peculiarity piqued my interest. One explanation is that there are fewer graduates from mathematics (both pure and applied) compared to the other relevant fields of study, but this fails to tell the whole story.
To dig deeper, I turned to Boris Savkovic, Lead Data Scientist at BuildingIQ (a start-up that uses advanced algorithms to optimise energy use in commercial buildings). Boris has a background in Electrical Engineering and Applied Mathematics and having worked with many mathematicians in his time, he provided the following insights:
“Many mathematicians have a love of theoretical problems, beautiful equations and seeing deep meaning in theorems, whereas commercial data science is empirical, messy and dirty. While some mathematicians love this, many hate it. The real world is complex, you cannot sandbox everything, you have to prioritise, appreciate the incentives of others, compromise the math and technology for short-term vs. medium-term vs. long-term, worry about diminishing returns (80/20 rule) and deal with both deep theory and deep practice, and everything in-between. In short: you have to be flexible and adaptable to deal with the real world. And this is ultimately what commercial data science is about: finding faster and better practical solutions that make money. For those with heavy mathematical/theory backgrounds who want to understand everything to the last degree, this can be very difficult, and I have seen a number of mathematics PhDs struggle badly when transitioning from research/academia to commercial data science”.
It is important to note that Boris was referring more to pure mathematicians, and he added that he has also worked with many excellent applied mathematicians in his career. This seems logical because pure mathematics is likely to attract those with a love for the theory, as opposed to real world problems. And theoretical work won’t involve much interaction with data, which is – you know – quite important for data science.
There are exceptions of course and it ultimately comes down to individual character, not purely what someone has studied. And clearly: a lot of what mathematics graduates learn is highly transferable, so picking up the specific statistical/machine learning techniques shouldn’t be too difficult (if not already known).
In terms of suitability, most mathematicians are probably best equipped to learn the tools and theory for Type A data science. However, there are mathematicians who study computer science (theoretical computer science is essentially a branch of mathematics) and so people with this background may be more suited to Type B data science.
There is an important lesson to take from all this, and it comes down to understanding the reality of what commercial data science involves. If you truly understand the challenges and that is what you are seeking, then go for it. But if you have a love for the theory more than the practical application, you might want to reassess your thinking.

The Blank Canvas

If you are just starting out, perhaps you are in school, you enjoy maths, science and computing, and you like the sound of this thing called data science, well good news: you can choose your path without being constrained by a pre-existing background. And there are now a number of specific data science related courses, which cover both computer science and mathematics/statistics. Just be prepared for the long haul; you will not become a data scientist overnight, as we will see in Part Two, where we will be examining: how to learn.