Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.
In this chapter, we will explore the options at your disposal – but first – we will begin proceedings by discussing an important notion that concerns data science and learning.
Continual Learning
Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) is evolving so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).
Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Conferences and networking with the data science community are effective ways of keeping on top of the latest developments. And regularly reading books and papers is very important (on this: if you do not have a research background, it is worth learning how to read academic papers properly).
Play. Build. Experiment.
Going back to the message we touched on in Chapter One, there is only one-way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the points I wanted to make, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.
This is what should you take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:
“Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn”
The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.
Kaggle / Open-Source / Freelancing
If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:
“If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD”
There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.
Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.
Finally, if you can get freelancing work, this is a great way to build a track record and demonstrates that you can operate in a commercial environment. And rather conveniently, you could even utilize the Experfy platform for that purpose.
To PhD or not to PhD
Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted:
“The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again”
And apart from anything, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.
But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:
“If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?”
Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.
There is also the employability aspect to consider: are you more employable as a PhD graduate vs. spending the same time on self-study? I do not have sufficient evidence to comment, but either way, it is more important whether you have truly spent the time building up expert capability (and how you can evidence this). PhD’s are certainly valuable but there are great data scientists with PhD’s and great ones without.
Other University Degrees
So a PhD is not for you – perhaps it is the cost, or perhaps you have not yet developed the expertise necessary for research of this nature. Whatever the reason – there is no need to panic – because many universities are now offering Bachelors, Masters and Diplomas specifically designed for data science, where both computer science and mathematics/statistics are on the curriculum (the attentive reader will remember this from Chapter Two).
Courses like these will certainly take you in the right direction, but take note: they won't be enough to convert you into a ready-made data scientist, because as we know – that takes experience.
Online Courses
In a similar sense – even if you come from another quantitative field – a few online courses will not make you an expert, and remember: this is an expert field. But even if an online course was enough to master a chosen subject, you will still face competition, who – in all likeliness – will have far more practical and commercial experience in these areas. This is really important to be conscious of, and so we will return to this in Chapter Four.
All this being said, online courses are incredibly useful tools to help kick-start your journey, or begin learning a new area (like deep learning, for example). The most popular courses are found via Coursera, Udacity and edX, with Dylan Hogg describing Andrew Ng’s Machine Learning on Coursera as “an absolute pre-requisite for anyone who does not have a research background”.
The following is by no means a complete list, or mandatory for that matter, but these also stood out to Dylan and some of the other data scientists we have met so far:
- Machine Learning: Intro to Machine Learning (Udacity)
- Deep Learning: Deep Learning (Udacity), Neural Networks for Machine Learning (Coursera)
- Spark: Big Data Analysis with Spark (edX), Distributed Machine Learning with Spark (edX)
During my interactions with the Experfy team, I've found out that they were also launching a training platform. You can see a preview here.
Books
Needless to say: good books are an invaluable resource and our favourite data scientists advocated the following:
- Pattern Recognition and Machine Learning by Christopher Bishop
- Machine Learning: a Probabilistic Perspective by Kevin P. Murphy
- Why: A Guide to Finding and Using Causes by Samantha Kleinberg (if you want to know why this is important, take a look at Yanir Seroussi’s blog post on: Why You Should Stop Worrying About Deep Learning and Deepen Your Understanding of Causality Instead)
- An Introduction to Statistical Learning by James, Witton, Hastie and Tibshirani, which, according to Dylan: “is a great introduction to statistical learning and is an accessible version of the more advanced classic”: Elements of Statistical Learning
- And for a different suggestion, Will Hanninger recommended The Pyramid Principle by Barbara Minto. It does not cover data science specifically, but is valuable for problem solving and presenting
Presenting / Communicating
If you do not have a natural disposition for communicating – especially with non-technical people – this is something you will need to work on (see Chapter One for why). Gaining practice and obtaining feedback is the best way to improve your soft skills, although Yanir also recommended the classic book by Dale Carnegie: How to Win Friends and Influence People.
Ingen kommentarer:
Legg inn en kommentar