
Sunday, August 7, 2016

The 7 Steps of a Data Project


Well, building your first data project is actually not that hard. And yes, Dataiku DSS helps, but what will really help you is understanding the data science process. Becoming data-driven is about just that: knowing the basic steps and following them to go from raw data to a working machine learning model.
The steps to complete a data project were conceptualized a while ago as the KDD process (for Knowledge Discovery in Databases), and made popular with lots of vintage-looking graphs.
This is our take on the steps of a data project in this awesome age of big data!

STEP 1: UNDERSTAND THE BUSINESS

Understanding the business is the key to ensuring the success of your data project. To motivate the different actors needed to take your project from design to production, your project must answer a clear business need. So before you even think about the data, go out and talk to the people who might use data to make their processes or their business better. Then sit down and define a timeline and concrete indicators to measure. I know, processes and politics seem boring, but in the end, they turn out to be quite useful!

If you’re working on a personal project, playing around with a dataset or an API, this may seem irrelevant. It’s not. Just downloading a cool open data set is not enough. I can’t tell you how many cool datasets I downloaded and never did anything with… So settle on a question to answer, or a product to build!

STEP 2: GET YOUR DATA

Once you’ve gotten your goal figured out, it’s time to start looking for your data. Mixing and merging data from as many data sources as possible is what makes a data project great, so look as far as possible.

Here are a few ways to get yourself some data:
  • Connect to a database: ask your data and IT teams for the data that’s available, or open up your private database, and start digging through it to understand what information your company has been collecting.
  • Use APIs: think of the APIs to all the tools your company’s been using, and the data these guys have been collecting. You have to work on getting these all set up so you can use those email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support ticket somebody submitted, etc. If you’re not an expert coder, plugins in DSS give you lots of possibilities to bring in external data!
  • Look for open data: the Internet is full of datasets to enrich what you have with extra information; census data will help you add the average revenue for the district where your user lives, and OpenStreetMap can show you how many coffee shops are on their street. A lot of countries have open data platforms (like data.gov in the US). If you’re working on a fun project outside of work, these open data sets are also an incredible resource! Check out Kaggle, or this GitHub repo with lots of datasets, for example.
  • Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like Twitter or Facebook, to analyze your followers and friends. It’s extremely easy to set up these connections with tools like IFTTT. For example, I have a bunch of recipes that collect the music I listen to, the places I visit, my steps and the kilometers I run, the contacts I add, etc. And this can be useful for businesses as well! You can analyze very interesting trends on Twitter, or even monitor the competition.

STEP 3: EXPLORE AND CLEAN YOUR DATA

(AKA the dreaded preprocessing step that typically takes up 80% of the time dedicated to a data project)
Once you’ve gotten your data, it’s time to get to work on it! Start digging to see what you’ve got and how you can link everything together to answer your original goal. Start taking notes on your first analyses, and ask the business people or the IT guys questions to understand what all your variables mean! Because not everyone will get that c06xx is a product category referring to something awesome.

Once you understand your data, it’s time to clean it! You’ve probably noticed that even though you have a country feature, for instance, you’ve got different spellings, or even missing data. It’s time to look at every one of your columns to make sure your data is homogeneous and clean.
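For example, a couple of typical clean-up operations in pandas (a minimal sketch; the DataFrame df and its columns are hypothetical):

    import pandas as pd

    # Homogenize spellings of the same country, then fill missing ages
    df["country"] = df["country"].str.strip().str.upper()
    df["country"] = df["country"].replace({"U.S.": "USA", "UNITED STATES": "USA"})
    df["age"] = df["age"].fillna(df["age"].median())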
Warning! This is probably the longest, most annoying step of your data project. Data scientists report data cleaning is about 80% of the time spent on a project. So it’s going to suck a little bit. Luckily, tools like Dataiku DSS can make this much faster!

STEP 4: ENRICH YOUR DATASET

Now that you’ve got clean data, it’s time to manipulate it to get the most value out of it. This is the time to join all your different sources and aggregate logs to get your data down to the essential features.

You’ll then start manipulating the data to extract valuable features: for example, getting a country or even a town out of a visitor’s IP address, or extracting the time of day or week of year from your dates to get something more meaningful.
The possibilities are pretty much endless, and you’ll get a pretty good idea of the operations you can execute by scrolling through Dataiku DSS’s processors in the Lab.
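As a small illustration, here’s what the date extraction could look like in pandas (a sketch; df and its visit_ts column are hypothetical):

    import pandas as pd

    # Turn a raw timestamp into more meaningful features
    df["visit_ts"] = pd.to_datetime(df["visit_ts"])
    df["hour"] = df["visit_ts"].dt.hour                       # time of day
    df["week_of_year"] = df["visit_ts"].dt.isocalendar().week # week of year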

STEP 5: BUILD VISUALISATIONS

You now have a nice dataset (or maybe several), so this is a good time to start exploring it by building graphs. When you’re dealing with large volumes of data, they’re the best way to explore and communicate your findings.

You’ll find lots of tools that make this step fun to prepare and pleasant to consume. The tricky part is always being able to dig into your graphs to answer any question somebody might have about an insight. That’s when the data preparation comes in handy: you’re the one who did the dirty work, so you know the data like the back of your hand!
If this is the final step of your project, it’s important to use APIs and plugins so you can push those insights to where your end users want to have them. So get integrated with their tools!
Your graphs don’t have to be the end of your project though. They’re a way to uncover more trends that you want to explain. They’re also a way to develop more interesting features. For example, by putting your data points on a map you could perhaps notice that specific geographic zones are more telling than specific countries or cities.

STEP 6: GET PREDICTIVE


By working with clustering algorithms (a.k.a. unsupervised learning), you can build models that uncover trends in the data that weren’t distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express which features are decisive in these results. Tools like Dataiku DSS help beginners run basic open source algorithms easily through clickable interfaces.
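A minimal clustering sketch with scikit-learn (assuming X is a cleaned, numeric feature matrix):

    from sklearn.cluster import KMeans

    # Group similar rows into 5 clusters; inspect each cluster afterwards
    km = KMeans(n_clusters=5, random_state=42)
    clusters = km.fit_predict(X)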
More advanced data scientists can then go even further and predict future trends with supervised algorithms. By analyzing past data, they find features that have impacted past trends, and use them to build predictions. More than just gaining knowledge, this final step can lead to building whole new products and processes. To get these into production, though, you’ll need the intervention of data scientists and engineers, but it’s important to understand the process so that all the parties involved (business users and analysts as well) will be able to understand what comes out in the end.

STEP 7: ITERATE

The main goal in any business project is to prove its effectiveness as fast as possible to justify, well, your job. Data projects are the same. By saving time on data cleaning and enriching, you can reach the end of the project fast and get your first results. These first insights are a great starting point to uncover more necessary cleaning and to develop more features, continuously improving your results and model outputs.

Now that you’ve got the skills, get started right now by building projects in Dataiku DSS!

Saturday, August 6, 2016

Training in Critical Thinking & Descriptive Intelligence Analysis

Level 1 Intelligence Analyst Certification

About This Course

Course Description

The views expressed in this course are the instructor's alone and do not reflect the official position of the U.S. Government, the Intelligence Community, or the Department of Defense.

Although anyone can claim the title of “intelligence analyst,” there are currently few commonly understood, standardized certifications available to confirm analytic skill and proficiency. Some may argue that each analytic assessment should be judged on its content and not on the certification or reputation of the author. However, an analytic product can often read well even though its analytic underpinnings are flawed. Also, it would be beneficial to have some objective measure of an analyst’s skill before selecting him for a task, rather than to discover afterwards that the analyst was unable to complete it. Having addressed why certifications are needed, and assuming certifications would provide a worthwhile benefit, the discussion then turns to how and in what areas one should attain certification. Through an analysis of the concept of analysis, the author proposes that three basic divisions should be created to train and certify one as either a descriptive, explanative, or predictive analyst. This course provides level 1 certification as a descriptive intelligence analyst.


What are the requirements?
No prior preparation is necessary; however, a strong academic background, understanding of the scientific method, and an open mind will help the student perform well in this course.

What am I going to get from this course?

Apply critical thinking skills throughout the analytic process
Identify and mitigate biases to reveal unstated assumptions
Refine and clarify intelligence questions
Conduct research to identify existing data and gather new evidence
Select and apply appropriate analytic techniques
Reevaluate and revalidate previous analytic conclusions

What is the target audience?

This course is intended for the new intelligence analyst who has little to no prior experience. This course will provide the basic analytic skills necessary to produce basic, logically sound, descriptive intelligence analysis.
More experienced intelligence analysts will also find this course provides great "back to basics" refresher training.


Curriculum

Section 1: Introduction
Lecture 1
Welcome and overview 05:02
Lecture 2
The need for intelligence analyst certifications Article
Lecture 3
Course administration and supplemental material 02:45
Section 2: Critical Thinking and Avoiding Bias
Lecture 4
Thinking about thinking I; critical thinking 16:31
Quiz 1
Critical thinking quiz 5 questions
Lecture 5
Thinking about thinking II; logical, probable, and plausible reasoning 06:42
Lecture 6
Analytic pitfalls 14:19
Lecture 7
Insights into problem solving 15:34
Quiz 2
Section review quiz 5 questions
Section 3: Getting the Question Right
Lecture 8
Problem restatement 06:23
Quiz 3
Section review quiz 5 questions
Section 4: Intelligence Research and Collection
Lecture 9
Gathering the evidence 08:42
Lecture 10
Evaluating the evidence 08:40
Quiz 4
Section review quiz 5 questions
Section 5: Intelligence Analysis
Lecture 11
Selecting the right technique 03:03
Lecture 12
Realizing the power of analytics: arming the human mind Article
Lecture 13
Sorting, chronologies, and timelines 05:43
Lecture 14
The matrix 06:25
Lecture 15
Decision/event trees 07:57
Lecture 16
Link analysis 03:23
Lecture 17
Analysis of competing hypotheses (ACH) 16:27
Section 6: Conclusion
Lecture 18
Argument evaluation and reevaluation 10:30
Quiz 5
Final certification exam 25 questions
Lecture 19
Bonus Lecture Article

Tuesday, July 26, 2016

Approaching (Almost) Any Machine Learning Problem

An average data scientist deals with loads of data daily. Some say over 60-70% of that time is spent in data cleaning, munging, and bringing data to a format suitable for applying machine learning models. This post focuses on the second part, i.e., applying machine learning models, including the pre-processing steps. The pipelines discussed in this post are the result of over a hundred machine learning competitions that I’ve taken part in. The discussion here is quite general yet very useful; far more complicated methods also exist and are practiced by professionals.

We will be using Python!


Data

Before applying the machine learning models, the data must be converted to a tabular form. This whole conversion is the most time-consuming and difficult part of the process.





The machine learning models are then applied to the tabular data. Tabular data is the most common way of representing data in machine learning or data mining: we have a data table with rows containing different samples of the data, X, and labels, y. The labels can be single-column or multi-column, depending on the type of problem. We will denote the data by X and the labels by y.

Types of labels

The labels define the problem and can be of different types, such as:
  • Single column, binary values (classification problem, one sample belongs to one class only and there are only two classes)
  • Single column, real values (regression problem, prediction of only one value)
  • Multiple columns, binary values (classification problem, one sample belongs to one class, but there are more than two classes)
  • Multiple columns, real values (regression problem, prediction of multiple values)
  • Multi-label (classification problem, one sample can belong to several classes)
Evaluation Metrics

For any kind of machine learning problem, we must know how we are going to evaluate our results, i.e., what the evaluation metric or objective is. For example, in the case of a skewed binary classification problem we generally choose the area under the receiver operating characteristic curve (ROC AUC or simply AUC). In the case of multi-label or multi-class classification problems, we generally choose categorical cross-entropy or multiclass log loss, and mean squared error in the case of regression problems.

I won’t go into details of the different evaluation metrics as we can have many different types, depending on the problem.
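Still, as a quick sketch, a few of the metrics mentioned above in scikit-learn (y_true, y_proba, and y_pred are hypothetical arrays):

    from sklearn.metrics import roc_auc_score, log_loss, mean_squared_error

    auc = roc_auc_score(y_true, y_proba)      # skewed binary classification
    ll = log_loss(y_true, y_proba)            # multi-class: pass class probabilities
    mse = mean_squared_error(y_true, y_pred)  # regression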
The Libraries

To start with the machine learning libraries, install the basic and most important ones first, for example, numpy and scipy.
To see and do operations on data: pandas (http://pandas.pydata.org/)
For all kinds of machine learning models: scikit-learn (http://scikit-learn.org/stable/)
The best gradient boosting library: xgboost (https://github.com/dmlc/xgboost)
For neural networks: keras (http://keras.io/)
For plotting data: matplotlib (http://matplotlib.org/)
To monitor progress: tqdm (https://pypi.python.org/pypi/tqdm)

I don’t use Anaconda (https://www.continuum.io/downloads). It’s easy and does everything for you, but I want more freedom. The choice is yours.
The Machine Learning Framework

In 2015, I came up with a framework for automatic machine learning which is still under development and will be released soon. For this post, the same framework will be the basis. The framework is shown in the figure below:


A FRAMEWORK FOR MACHINE LEARNING COMPETITIONS, AUTOML WORKSHOP, INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2015.

In the framework shown above, the pink lines represent the most common paths followed. After we have extracted and reduced the data to a tabular format, we can go ahead with building machine learning models.

The very first step is identification of the problem. This can be done by looking at the labels. One must know whether the problem is a binary classification, a multi-class or multi-label classification, or a regression problem. After we have identified the problem, we split the data into two different parts: a training set and a validation set.



The splitting of data into training and validation sets “must” be done according to labels. In case of any kind of classification problem, use stratified splitting. In python, you can do this using scikit-learn very easily.
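A minimal sketch (in current scikit-learn this lives in model_selection; X, y, and the 10% split are illustrative):

    from sklearn.model_selection import train_test_split

    # Stratified split: class proportions are preserved in both sets
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=42)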



In the case of a regression task, a simple K-Fold split should suffice. There are, however, some complex methods which tend to keep the distribution of labels the same for both the training and validation sets; this is left as an exercise for the reader.
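A minimal K-Fold sketch (assuming numpy arrays; with 10 splits, each validation fold is 10% of the data, matching the eval_size below):

    from sklearn.model_selection import KFold

    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    for train_idx, valid_idx in kf.split(X):
        X_train, X_valid = X[train_idx], X[valid_idx]
        y_train, y_valid = y[train_idx], y[valid_idx]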



I have chosen eval_size, the size of the validation set, as 10% of the full data in the examples above, but you can choose this value according to the size of the data you have.

After the splitting of the data is done, leave the validation data out and don’t touch it. Any operations applied on the training set must be saved and then applied to the validation set. The validation set should never be joined with the training set: doing so will produce very good evaluation scores and make the user happy, while actually building a useless model with very high overfitting.

The next step is identification of the different variables in the data. There are usually three types of variables we deal with: numerical variables, categorical variables, and variables with text inside them. Let’s take the example of the popular Titanic dataset (https://www.kaggle.com/c/titanic/data).



Here, survival is the label. We have already separated the labels from the training data in the previous step. Then we have pclass, sex, and embarked; these variables have different levels and thus are categorical variables. Variables like age, sibsp, parch, etc. are numerical variables. Name is a variable with text data, but I don’t think it’s a useful variable for predicting survival.

Separate out the numerical variables first. These variables don’t need any kind of processing, and thus we can start applying normalization and machine learning models to them.

There are two ways in which we can handle categorical data:
  • Convert the categorical data to labels
  • Convert the labels to binary variables (one-hot encoding)

Please remember to convert categories to numbers first using LabelEncoder before applying OneHotEncoder on it.
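A sketch of both options on a hypothetical embarked column:

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # 1) categories -> integer labels
    lbl = LabelEncoder()
    df["embarked_lbl"] = lbl.fit_transform(df["embarked"].astype(str))

    # 2) integer labels -> binary (one-hot) columns
    ohe = OneHotEncoder()
    embarked_ohe = ohe.fit_transform(df[["embarked_lbl"]])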

Since the Titanic data doesn’t have a good example of text variables, let’s formulate a general rule for handling text variables: combine all the text variables into one, then apply some algorithm which works on text data to convert it to numbers.

The text variables can be joined as follows:
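A minimal sketch, assuming the data sits in a pandas DataFrame with hypothetical text columns name and ticket:

    # Concatenate all text columns into a single text field
    df["text_all"] = df["name"].astype(str) + " " + df["ticket"].astype(str)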



We can then use CountVectorizer or TfidfVectorizer on it.
The TfidfVectorizer performs better than the counts most of the time and I have seen that the following parameters for TfidfVectorizer work almost all the time.
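A sketch of such a setup (these exact parameter values are illustrative, not necessarily the author’s original settings; CountVectorizer is a drop-in alternative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfv = TfidfVectorizer(min_df=3,              # ignore very rare tokens
                          analyzer="word",
                          ngram_range=(1, 2),    # unigrams and bigrams
                          sublinear_tf=True,
                          stop_words="english")
    X_text = tfv.fit_transform(df["text_all"])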



If you are applying these vectorizers only on the training set, make sure to dump them to the hard drive so that you can use them later on the validation set.
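For example, with joblib (the filename and the valid_df DataFrame are hypothetical):

    import joblib

    joblib.dump(tfv, "tfv.pkl")                          # fitted on training data only
    tfv = joblib.load("tfv.pkl")                         # reload later
    X_valid_text = tfv.transform(valid_df["text_all"])   # transform, never re-fit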



Next, we come to the stacker module. The stacker module is not a model stacker but a feature stacker: the different features from the processing steps described above can be combined using it.



You can horizontally stack all the features before putting them through further processing by using numpy hstack or sparse hstack depending on whether you have dense or sparse features.
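A sketch of both cases (the feature arrays x_num, x_ohe, and x_num_sp are hypothetical):

    import numpy as np
    from scipy import sparse

    X = np.hstack((x_num, x_ohe))                          # dense features
    X_sp = sparse.hstack((x_num_sp, X_text)).tocsr()       # sparse features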



The same can also be achieved with the FeatureUnion module in case there are other processing steps, such as PCA or feature selection (we will visit decomposition and feature selection later in this post).
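A minimal FeatureUnion sketch (the component choices here are illustrative):

    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import FeatureUnion

    # Run PCA and univariate selection in parallel, then stack their outputs
    fu = FeatureUnion([("pca", PCA(n_components=10)),
                       ("kbest", SelectKBest(k=5))])
    X_stacked = fu.fit_transform(X_train, y_train)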



Once we have stacked the features together, we can start applying machine learning models. At this stage, the only models you should go for are ensemble tree-based models (a minimal fit sketch follows the list). These models include:
RandomForestClassifier
RandomForestRegressor
ExtraTreesClassifier
ExtraTreesRegressor
XGBClassifier
XGBRegressor
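A minimal fit sketch with one of them (X_train, y_train, and X_valid are hypothetical):

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(X_train, y_train)
    valid_preds = clf.predict_proba(X_valid)[:, 1]   # positive-class probabilities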

We cannot apply linear models to the above features since they are not normalized. To use linear models, one can use Normalizer or StandardScaler from scikit-learn.

These normalization methods work only on dense features and don’t give very good results when applied to sparse features. Yes, you can apply StandardScaler to sparse matrices, but only without centering on the mean (parameter: with_mean=False).
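A sketch of both variants (X_train and X_train_sparse are hypothetical dense and sparse matrices):

    from sklearn.preprocessing import StandardScaler

    scl = StandardScaler()                       # dense data
    X_train_scaled = scl.fit_transform(X_train)

    scl_sp = StandardScaler(with_mean=False)     # sparse data: scale without centering
    X_train_sp_scaled = scl_sp.fit_transform(X_train_sparse)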

If the above steps give a “good” model, we can go for hyperparameter optimization; in case they don’t, we can go for the following steps to improve our model.

The next steps include decomposition methods.



For the sake of simplicity, we will leave out LDA and QDA transformations. For high-dimensional data, PCA is generally used to decompose the data. For images, start with 10-15 components and increase this number as long as the quality of the result improves substantially. For other types of data, we select 50-60 components initially (we tend to avoid PCA as long as we can deal with the numerical data as it is).
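A minimal PCA sketch along those lines:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=60)      # ~10-15 for images, ~50-60 otherwise
    X_train_pca = pca.fit_transform(X_train)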



For text data, after converting the text to a sparse matrix, go for Singular Value Decomposition (SVD). A variation of SVD called TruncatedSVD can be found in scikit-learn.
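A minimal sketch, reusing the TF-IDF matrix from above:

    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=120)   # 120-200 usually works for TF-IDF/counts
    X_text_svd = svd.fit_transform(X_text)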



The number of SVD components that generally works well for TF-IDF or counts is between 120 and 200. Any number above this might improve the performance, but not substantially, and it comes at the cost of computing power.

After evaluating the models’ performance further, we move on to scaling the datasets, so that we can evaluate linear models too. The normalized or scaled features can then be sent to the machine learning models or feature selection modules.



There are multiple ways in which feature selection can be achieved. One of the most common is greedy feature selection (forward or backward). In greedy forward selection we choose one feature, train a model, and evaluate its performance on a fixed evaluation metric; we keep adding and removing features one by one, recording the model’s performance at every step, and then select the features which gave the best evaluation score. One implementation of greedy feature selection with AUC as the evaluation metric can be found here: https://github.com/abhishekkrthakur/greedyFeatureSelection. It must be noted that this implementation is not perfect and must be changed/modified according to the requirements.
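A simplified sketch of the forward variant (evaluate_auc and n_features are hypothetical helpers standing in for the linked implementation):

    # Greedy forward selection: repeatedly add the single feature that improves
    # the validation score the most; stop when no feature helps.
    good_features, best_score = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in range(n_features):                    # n_features: total column count
            if f in good_features:
                continue
            score = evaluate_auc(good_features + [f])  # hypothetical scoring helper
            if score > best_score:
                best_score, best_feature = score, f
                improved = True
        if improved:
            good_features.append(best_feature)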

Other, faster methods of feature selection include selecting the best features from a model: we can either look at the coefficients of a logit model, or we can train a random forest to select the best features and then use them later with other machine learning models.
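One way to do this in scikit-learn is SelectFromModel (a sketch, not necessarily the author’s exact code):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Keep only features whose importance exceeds the default threshold
    sfm = SelectFromModel(RandomForestClassifier(n_estimators=50, n_jobs=-1))
    X_train_sel = sfm.fit_transform(X_train, y_train)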



Remember to keep the number of estimators low and hyperparameter optimization minimal so that you don’t overfit.

Feature selection can also be achieved using Gradient Boosting Machines. It is good to use xgboost instead of the GBM implementation in scikit-learn, since xgboost is much faster and more scalable.
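A minimal xgboost sketch, ranking features by the model’s built-in importances:

    import xgboost as xgb

    model = xgb.XGBClassifier(n_estimators=50)
    model.fit(X_train, y_train)
    importances = model.feature_importances_   # rank features, keep the top ones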



We can also do feature selection of sparse datasets using RandomForestClassifier / RandomForestRegressor and xgboost.

Another popular method for feature selection on positive sparse datasets is chi-squared (chi2) based feature selection, and we also have that implemented in scikit-learn.
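A minimal sketch, assuming a non-negative feature matrix such as counts or TF-IDF:

    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 requires non-negative features
    skb = SelectKBest(chi2, k=20)
    X_train_chi2 = skb.fit_transform(X_train, y_train)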



Here, we use chi2 in conjunction with SelectKBest to select 20 features from the data. This also becomes a hyperparameter we want to optimize to improve the result of our machine learning models.

Don’t forget to dump any kind of transformer you use at these steps. You will need them to evaluate performance on the validation set.

The next (or intermediate) major step is model selection + hyperparameter optimization.



We generally use the following algorithms in the process of selecting a machine learning model:
Classification:
  • Random Forest
  • GBM
  • Logistic Regression
  • Naive Bayes
  • Support Vector Machines
  • k-Nearest Neighbors
Regression:
  • Random Forest
  • GBM
  • Linear Regression
  • Ridge
  • Lasso
  • SVR


Which parameters should I optimize? How do I choose parameters closest to the best ones? These are a couple of questions people come up with most of the time. One cannot answer these questions without experience with different models and parameters on a large number of datasets. Also, people who have experience are not willing to share their secrets. Luckily, I have quite a bit of experience, and I’m willing to give away some of the stuff.

Let’s break down the hyperparameters, model-wise. For some of them, proper values cannot be stated in advance; go for Random Search on those hyperparameters.
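A minimal random-search sketch with scikit-learn (the parameter ranges here are illustrative, not taken from the original table):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {"n_estimators": [120, 300, 500, 800, 1200],
                  "max_depth": [5, 8, 15, 25, 30, None],
                  "min_samples_leaf": [1, 2, 5, 10]}
    search = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1), param_dist,
                                n_iter=20, scoring="roc_auc", cv=3)
    search.fit(X_train, y_train)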

In my opinion, and strictly my opinion, the above models will outperform any others, and we don’t need to evaluate any other models.

Once again, remember to save the transformers:
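For example, with joblib (hypothetical filename):

    import joblib
    joblib.dump(scl, "scaler.pkl")   # e.g., the fitted StandardScaler from earlier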



And apply them on validation set separately:
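A matching sketch (transform only; never re-fit on the validation set):

    scl = joblib.load("scaler.pkl")
    X_valid_scaled = scl.transform(X_valid)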



The above rules and the framework have performed very well on most of the datasets I have dealt with. Of course, they have also failed for very complicated tasks. Nothing is perfect, and we keep improving on what we learn. Just like in machine learning.

Get in touch with me with any doubts: beyonditas [at] gmail [dot] com

Friday, July 8, 2016

Evolution of R

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with procedures written in C, C++, .NET, Python, or FORTRAN for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft, and an official part of the GNU project called GNU S.

Evolution of R

R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
  • A large group of individuals has contributed to R by sending code and bug reports.
  • Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Features of R

As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R:
  • R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
  • R has an effective data handling and storage facility.
  • R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
  • R provides a large, coherent and integrated collection of tools for data analysis.
  • R provides graphical facilities for data analysis and display, either directly on the computer or on paper.
In conclusion, R is the world’s most widely used statistics programming language. It’s the #1 choice of data scientists, supported by a vibrant and talented community of contributors. R is taught in universities and deployed in mission-critical business applications. This tutorial will teach you R programming with suitable examples, in simple and easy steps.

Wednesday, June 29, 2016

If social networks were countries, which would they be?

Facebook CEO Mark Zuckerberg speaks on stage during the Facebook F8 conference in San Francisco, California

If Facebook were a country, it would be substantially bigger than China. The size of Facebook's user base translates to around one in seven of the global population using it each month - around 1.65 billion people.
The role of digital technology in breaking down physical borders is one of the many trends in the Fourth Industrial Revolution. As social media continues to open up new opportunities for businesses and societies, how do today's networks compare?
Facebook 
According to Statista, Facebook had over 1.65 billion monthly active users in the first quarter of 2016. The number of monthly active mobile users also passed 1.5 billion in the same quarter. China's population, by comparison, is around 1.37 billion. 

WhatsApp 
While not technically a social network, it's worth including the messaging giant in this list due to the 1 billion-plus people using it each month. Monthly active users isn't the best metric for measuring messaging apps (you either use them daily-ish or not at all), but the MAU figure has grown impressively, from 700 million in January 2015 to 1 billion now, putting it within sight of India, which has a population of 1.25 billion. The messaging app also handles over 64 billion messages and 600 million photos each day.

Top 15 countries by population, and the social media giants
Instagram 

The photo- and video-sharing app reported over 400 million monthly active users worldwide in September 2015, just ahead of the US population of 319 million. Nearly all of these are engaging with the service via the mobile app, although there is also a desktop version. The number of Instagram users in the US is predicted to pass 106 million by 2018.

Twitter 

The network for those happy to keep their musings to 140 characters or fewer, Twitter has over 305 million monthly active users, with around 80% living outside the US. The social network somewhat upset the apple cart last year with the introduction of a tailored algorithm to order tweets, moving away from a live feed, which upset some users. Growth has slowed, as has the company's stock price, but it's still the go-to place for breaking news alerts and a glimpse of the world in real time.

Google+ 

Google doesn't particularly like talking about its MAUs, and it's fair to say it isn't the obvious destination when people want to share something about themselves. At last count, the network had over 300 million users, which would make it bigger than Indonesia, and a tad smaller than the USA. 

LinkedIn 

LinkedIn's monthly active user base is growing robustly, with around 100 million people currently using the site each month. Over 400 million have an account, however. The social network generates revenue from three areas: hiring solutions, advertising, and premium subscriptions. The 100 million MAUs put it just behind the Philippines in terms of size.

Snapchat 

The newest member of the social media giants, it was reported back in January last year that Snapchat had over 100 million monthly active users, which would make it around the same size as Ethiopia. However, data is hard to come by, with some other sources suggesting the figure could be as high as 200 million.

Thursday, June 23, 2016

The New Rules for Becoming a Data Scientist

Summary:  What do you need to do to get an entry level job in data science?

This article is written for anyone who is considering becoming a data scientist.  That includes young people just starting their bachelor’s degrees and folks in the first two or three years of their careers who want to make the switch.
It’s not for folks who know they’re going to pursue one of the new Master’s degrees in Data Science, nor for Ph.D. candidates.  It’s for folks looking for entry-level jobs that are specifically on the data science career ladder.

Is There a Data Science Career Progression That Doesn’t Require an Advanced Degree?
Yes, there is.  As with many high-skill professions, that’s not to say an advanced degree won’t make it easier, but there are definitely ways to enter this market with only a bachelor’s degree.
If you’ve been practicing data science for more than five or ten years you also know that the majority of us over 35 don’t have specific data science degrees.  We came to data science via a variety of related disciplines and gained our cred largely based on performance and experience.  It’s only the cohort under 35 working in data science that’s likely to have a DS-specific degree, advanced or bachelor’s.
The flak this article is likely to draw is not over the level of degree required or the types of experience, but the just-below-boiling controversy about who gets to call themselves a data scientist.  The problem in our profession, and I’m not going to solve it here, is that there is no accepted nomenclature that differentiates the various skill levels of data scientists, or who gets to wear that title at all.
Employers aren’t helping since actual data science jobs may be called engineer, analyst, developer, team lead or many other less exciting sounding titles.  Other employers are giving data science titles to folks who are not really doing data science, but more descriptive analytics and straight EDW work.
So for simplicity’s sake I’m going to call our target audience folks who are seeking positions as Junior or Associate Data Scientists.  Specifically, that means doing work that involves detecting signals in the data that can be used to make predictions about future behavior, not simple descriptive historical analysis of what’s happened in the past.

For Beginners What Does the Market Look Like and What Type of Work Will You Do?
There are two key points to understand here.  The first is that the data science market has divided into two distinctly different segments, Production and Development.
Production:  This is by far the largest and most mature segment, where predictive analytics has been used the longest and where it is best integrated to create truly data-driven businesses.  Large B2C service businesses dominate this group, specifically insurance, financial services, cable and telcos, healthcare, plus retail, ecommerce, and some manufacturing.  These companies are widely distributed geographically, so you can work pretty much anywhere.  The primary data science activities are predictive analytics and recommenders.
Development:  This is the new and sexy world of data science that gets all the press coverage.  In these enterprises the data science and the code are the product.  Think Google, Facebook, eHarmony, Apple, and the thousands of start-ups that are either developing new analytic and big data platforms, or products with embedded analytics.  This is also where you find the newest developments in data science including deep learning for image, text, and speech recognition, much of IoT (some crossover here to the production world), and all the flavors of AI.
The Development world is geographically concentrated in a few areas that we all know: the Bay area, Silicon Beach, New York, Boston, and maybe Austin.  This is exciting and heady stuff where you will probably devote upwards of 60% to 70% of your substantial starting salary to rent. 
As a new Associate Data Scientist you are much more likely to find your first career step in the Production world.

The Four Paths of Data Science
The second main point is that your career progression in DS will probably take you down one of four paths represented by different types of data scientists.  These four types are ultimately differentiated by what they spend their time doing. 
The best analysis that I’ve seen on this comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013.  You can find the original at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp and I strongly encourage you to read it.
There are 40 pages of good analysis here, or for the CliffsNotes version see my previous article, How to Become a Data Scientist.
In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.
1.    Data Businesspeople are those that are most focused on the organization and how data projects yield profit. At the entry level you’ll be performing the junior duties of blending and cleaning data and preparing basic predictive models.
2.    Data Developer.  Focused on the technical problem of managing data — how to get it, store it, and learn from it. At the entry level you’ll be working with Hadoop as well as structured data.  If you are more interested in the data science infrastructure side, this may be for you, and it is a particularly good path for current analysts and IT staff to move up into the data science career path.
3.    Data Creatives.  Often tackle the entire soup-to-nuts analytics process on their own: from extracting and blending data, to performing advanced analyses and building models, to creating visualizations and interpretations.  This is a more senior role innovating new types of predictive analytic use cases, data products, and services.  This may also be you if you find yourself in a company with little or no experience with advanced analytics but you’re unlikely to get this job fresh out of college with no experience.  Data Creatives are heavily present in the Development world.
4.    Data Researchers.  Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD.  These are folks who are innovating data science at its most fundamental level.
According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems.



This is an important decision since you need to do activities within data science that you like.  This may lead you toward an advanced degree or simply to develop your skills through experience.  It’s not something you have to decide from day one, but it is one you’ll want to consider early in your career.

The Skills You’ll Need to Enter the Data Science Market
If you were shopping for a two-year Master’s Degree in Data Science you’d have lots to pick from.  If you search for Bachelor’s degrees in Data Science you’ll find a good selection but at many institutions the undergraduate degree is more likely to be titled ‘Computer Science’ leaving you to wonder if you’re actually getting the knowledge that you need.
If you have a choice, pick a college that specifically offers a Data Science degree.  If you don’t have that choice you’ll have to analyze and select the blocks of learning that you’ll need.
Yes, you need to be grounded in the broad aspects of computer science, but in addition there are specific skills and knowledge you’ll need to master.  The best description I’ve seen for this incremental learning is also an excellent guide for those of you who have recently finished your bachelor’s.  It’s from an article by Amy Gershkoff, the Chief Data Officer at Zynga, and describes their in-house program for growing their own data scientists.
Zynga’s in-house program is 12 to 18 months.  To be considered there are a variety of performance requirements and academically the candidate needs a minimum of two previous semesters of coursework in statistics, economics, computer science, or similar.  At Zynga, some of this is in an on-line academic environment and some is mentored by their in-house data scientists.  This could easily be the course list for your undergraduate program.  I have added some observations of my own.

Phase I: Foundational Statistical Theory
Participants learn the basics of probability theory and statistical analysis including sampling theory, hypothesis testing, and statistical distributions.  For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others.  Usually one to two semesters of an online statistics course (such as Princeton University’s online course) covers this material.

Phase II: Foundational Programming Skills
To be an effective data scientist, knowledge of scripting languages is a requirement.  Selecting which ones is a matter of discussion.  My take is this:
SQL:  Not really a hard data science language but reflects the fact that you’re likely to have to extract data yourself from relational databases.  Also, SQL is now almost universally available as a query language on Hadoop (it’s really no longer accurate to call it NoSQL).
Python:  The big discussion over the last five or so years has been R versus Python.  Python is my pick as a production language, with a very generous data science library.  More importantly, as Spark has come on so quickly as the preferred tool on Hadoop, Python works easily there while R does not.  In the most recent surveys you’ll see Python pulling away from R.
SAS: Yes SAS.  SAS was practically the original DS scripting language before R and Python.  Although it’s included here under programming skills you can learn to use the SAS packages via drag-and-drop UI just as easily.  Depending on what survey you’re reading you may or may not see SAS on each list, but in the Production world SAS is extremely common and having this skill is a definite competitive advantage.  IBM SPSS is an option but SAS has a huge lead in adoption.  You will rarely encounter SAS in the Development world.

Phase III: Machine Learning
Participants learn both supervised and unsupervised learning techniques.  Supervised learning techniques include decision trees, random forests, logistic regression, neural networks, and SVMs.  Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.
Only a year or two ago you could not be an effective data scientist without knowing the inner workings of these algorithms, including how to manipulate their tuning parameters to optimize results.  The late-breaking news, however, is the new availability of completely automated predictive analytic platforms where selection and operation of the ML algorithms is handled by AI.
The likelihood that your new employer will have any of these new platforms on hand is still fairly slim but growing by the day.  Perhaps you will be the one to suggest they utilize them.  They can really speed up the modeling process.  Until then, you need to know what’s going on under the hood of all the major ML algorithms.

Phase IV: Big Data Toolbox
It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets.  For this reason, basic knowledge of tools such as Hadoop, Spark, and an analytics platform for large data sets constitutes a dedicated module.
It’s here that you’ll learn how those models you built in the last section are put into operation to assist business decisions.  Until they’re operationalized, they’re of no value.
It’s also here that you’ll learn the basics of streaming versus batch both in model development and implementation.  Spark has come on very fast with extremely high adoption rates and is the basic tool now for both batch and streaming.

Should You Specialize Early?
In the Development world you will increasingly only be selected if you have a specialty.  In the Production world you are likely to have more opportunities if you don’t specialize.  Having said that, there are two areas you may want to examine, which can be picked up fairly rapidly and are considered specializations within the Production world.
Supply Chain Forecasting:  There are some very specific techniques and packages associated with true demand-driven supply chain forecasting that can provide a unique entry into the world of manufacturing or logistics.
IoT for Manufacturing:  This is the use of predictive models on streaming data from SCADA systems and the like to predict the quality of output during a production run or the imminent failure of a piece of capital equipment.
If you wanted to make your living in an area dominated by manufacturing you would consider adding these to your portfolio early in your career.
For the most part however, if you’re in the Production world, predictive modeling and recommenders will be a complete toolset for several years. 
Remember also that our profession is changing fast.  It is already well past the time that a single data scientist could master the entire field.  Employers may still be looking for unicorns but very rapidly there will be emerging specialty fields you may consider as your career progresses.  Deep learning, natural language processing, image processing, and AI are all examples that will take either additional education or serious OJT.
What about the rumors of those outsized salaries, even for beginners?  Well, they are at least partly true, in that you will earn a well-above-average salary compared to other analyst or IT staff positions.  You’re not going to get a Silicon Valley salary if you’re working in Milwaukee.
The best salary and skills studies come from O’Reilly.  Their most recent survey for example says that a Master’s degree will only add about $3,500 per year to your earnings.  This is a well done survey that evaluates not only salary but time spent in different tasks, tools used, and other factors.  Be sure to carefully evaluate who filled out the surveys and whether you think they are representative.  There are no purely objective bias-free surveys in our profession.

As Your Career Progresses
Data science has been and continues to be a field in which knowledge of tools as well as business is paramount.  We utilize a complex toolbox to extract, blend, clean, transform, engineer, model, and implement models that can create business value from data that only a few years ago was not considered valuable.
It should come as no surprise that innovation is simplifying and automating the toolbox of existing tools even as new tools are arising.  In the past if we were expert carpenters with great skill with our tools, in the future we will be more like architects bringing a broad range of tools and design skills to bear to build value.
In management consulting, where I spent many years, we used to say that a consultant needs three legs to stand on: domain knowledge (knowledge of a particular industry), process knowledge (deep understanding of a particular process such as planning, manufacturing, or accounting), and methodology (in management consulting this means process improvement, reengineering, strategy development, or package implementation, among others).  As your career progresses you should build your own foundation on these three principles, where methodology becomes the skills of data science that you’ve mastered.  The other two legs, deep knowledge of one or more industries and one or more business processes, will be why future employers seek you out.