Monday, 15 August 2016

The Data Science Process 1/3

Congratulations! You’ve just been hired for your first job as a data scientist at Hotshot Inc., a startup in San Francisco that is the toast of Silicon Valley. It’s your first day at work. You’re excited to go and crunch some data and wow everyone around you with the insights you discover. But where do you start?
Over the (deliciously catered) lunch, you run into the VP of Sales at Hotshot Inc., introduce yourself and ask her, “What kinds of data challenges do you think I should be working on?”
The VP of Sales thinks carefully. You’re on the edge of your seat, waiting for her answer, the answer that will tell you exactly how you’re going to have this massive impact on the company of your dreams.
And she says, “Can you help us optimize our sales funnel and improve our conversion rates?”
The first thought that comes to your mind is: What? Is that a data science problem? She didn’t even mention the word ‘data’. What do I need to analyze? What does this mean?
Fortunately, the data scientists who mentored you have already warned you: this initial ambiguity is something data scientists in industry encounter regularly. All you have to do is systematically apply the data science process to figure out exactly what you need to do.
The data science process: a quick outline
When a non-technical supervisor asks you to solve a data problem, the description of your task can be quite ambiguous at first. It is up to you, as the data scientist, to translate the task into a concrete problem, figure out how to solve it and present the solution back to all of your stakeholders. We call the steps involved in this workflow the “Data Science Process.” This process involves several important steps:
  • Frame the problem: Who is your client? What exactly is the client asking you to solve? How can you translate their ambiguous request into a concrete, well-defined problem?
  • Collect the raw data needed to solve the problem: Is this data already available? If so, what parts of the data are useful? If not, what more data do you need? What kind of resources (time, money, infrastructure) would it take to collect this data in a usable form?
  • Process the data (data wrangling): Real, raw data is rarely usable out of the box. There are errors in data collection, corrupt records, missing values and many other challenges you will have to manage. You will first need to clean the data to convert it to a form that you can further analyze.
  • Explore the data: Once you have cleaned the data, you have to understand the information contained within at a high level. What kinds of obvious trends or correlations do you see in the data? What are the high-level characteristics and are any of them more significant than others?
  • Perform in-depth analysis (machine learning, statistical models, algorithms): This step is usually the meat of your project, where you apply all the cutting-edge machinery of data analysis to unearth high-value insights and predictions.
  • Communicate results of the analysis: All the analysis and technical results that you come up with are of little value unless you can explain to your stakeholders what they mean, in a way that’s comprehensible and compelling. Data storytelling is a critical and underrated skill that you will build and use here.
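To make the wrangling and exploration steps a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the event records, the field names (`visitor_id`, `stage`) and the three funnel stages are hypothetical and do not come from Hotshot Inc. or from any real dataset.

```python
from collections import Counter

# Hypothetical raw event records; every name and value here is made up.
RAW_EVENTS = [
    {"visitor_id": "a1", "stage": "visit"},
    {"visitor_id": "a1", "stage": "signup"},
    {"visitor_id": "b2", "stage": "visit"},
    {"visitor_id": "", "stage": "visit"},      # missing visitor id
    {"visitor_id": "c3", "stage": None},       # missing stage
    {"visitor_id": "c3", "stage": "visit"},
    {"visitor_id": "a1", "stage": "purchase"},
    {"visitor_id": "b2", "stage": "visit"},    # duplicate record
]

# Assumed funnel order, earliest stage first.
FUNNEL_STAGES = ["visit", "signup", "purchase"]


def clean(events):
    """Process step: drop incomplete or unknown records and duplicates."""
    seen = set()
    cleaned = []
    for event in events:
        visitor = event.get("visitor_id")
        stage = event.get("stage")
        if not visitor or stage not in FUNNEL_STAGES:
            continue  # missing value or unrecognized stage
        key = (visitor, stage)
        if key in seen:
            continue  # same visitor already counted at this stage
        seen.add(key)
        cleaned.append({"visitor_id": visitor, "stage": stage})
    return cleaned


def stage_counts(events):
    """Explore step: how many distinct visitors reached each stage."""
    counts = Counter(event["stage"] for event in events)
    return {stage: counts.get(stage, 0) for stage in FUNNEL_STAGES}


def conversion_rates(counts):
    """Explore step: share of visitors moving from one stage to the next."""
    rates = {}
    for prev, nxt in zip(FUNNEL_STAGES, FUNNEL_STAGES[1:]):
        # None marks a rate that cannot be computed (no visitors upstream).
        rates[f"{prev}->{nxt}"] = counts[nxt] / counts[prev] if counts[prev] else None
    return rates


if __name__ == "__main__":
    cleaned = clean(RAW_EVENTS)
    counts = stage_counts(cleaned)
    print(counts)
    print(conversion_rates(counts))
```

Even a toy like this shows why the steps come in order: the conversion rates are only meaningful once the incomplete and duplicate records have been removed.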
So how can you help the VP of Sales at Hotshot Inc.? In the next few emails, we will walk you through each step in the data science process, showing you how it plays out in practice. Stay tuned!
