Data Science Jedi: dataviz

Viser innlegg med etiketten dataviz. Vis alle innlegg

søndag 7. august 2016

The 7 Steps of a Data Project

By Alivia Smith, User Marketing Manager - Dataiku

It’s hard to know where to start once you’ve decided that yes, you want to become more data-driven. Just looking at all the technologies you have to understand and all the languages you’re supposed to master is enough to make your dizzy.

Well, building your first data project is actually not that hard. And yes, Dataiku DSS helps, but what will really helps you is understanding the data science process. Becoming data driven is about this: knowing the basic steps and following them to go from raw data to building a machine learning model.

The steps to complete a data project have been conceptualized a while ago as the KDD process (forKnowledge Discovery in Databases), and made popular with lots of vintage looking graphs like this one.

This is our take on the steps of a data project in this awesome age of big data!

STEP 1: UNDERSTAND THE BUSINESS

Understanding the business is the key to assuring the success of your data project. To motivate the different actors necessary to getting your project from design to production, your project must be the answer to a clear business need. So before you even think about the data, go out and talk to the people who could need to make their processes or their business better with data. Then sit down and define a timeline and concrete indicators to measure. I know, processes and politics seem boring, but in the end, they turn out to be quite useful!

If you’re working on a personal project, playing around with a dataset or an API, this may seem irrelevant. It’s not. Just downloading a cool open data set is not enough. I can’t tell you how many cool datasets I downloaded and never did anything with… So settle on a question to answer, or a product to build!

STEP 2: GET YOUR DATA

Once you’ve gotten your goal figured out, it’s time to start looking for your data. Mixing and merging data from as many data sources as possible is what makes a data project great, so look as far as possible.

Here are a few ways to get yourself some data:

Connect to a database: ask your data and IT teams for the data that’s available, or open your private database up, and start digging through it, and understanding what information your company has been collecting.
Use APIs: think of the APIs to all the tools your company’s been using, and the data these guys have been collecting. You have to work on getting these all set up so you can use those email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support ticket somebody submitted, etc. If you’re not an expert coder, plugins in DSS give you lots of possibilities to bring in external data!
Look for open data: the Internet is full of datasets to enrich what you have with extra information; census data will help you add the average revenue for the district where your user lives, or open street maps can show you how many coffee shops are on his street. A lot of countries have open data platforms (like data gov in the US). If you’re working on a fun project outside of work, these open data sets are also an incredible resource! Check out kaggle, or this github with lots of datasets for example
Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like twitter, or facebook, to analyze your followers and friends. It’s extremely easy to set up these connections with tools like ifttt. For example, I have a bunch of recipes that collect the music I listen to, the places I visit, my steps and the kilometers I run, the contacts I add, etc. And this can be useful for businesses as well! You can analyze very interesting trends on twitter, or even monitor the competition.

STEP 3: EXPLORE AND CLEAN YOUR DATA

(AKA the dreaded preprocessing step that typically takes up 80% of the time dedicated to a data project)

Once you’ve gotten your data, it’s time to get to work on it! Start digging to see what you’ve got and how you can link everything together to answer your original goal. Start taking notes on your first analyses, and ask questions to business people, or the IT guys, to understand what all your variables mean! Because not everyone will get that c06xx is a product category referring to something awesome.

Once you understand your data, it’s time to clean it! You’ve probably noticed that even though you have a country feature for instance, you’ve got different spellings, or even missing data. It’s time to look at every one of your columns to make sure your data is homogeneous and clean.

Warning! This is probably the longest, most annoying step of your data project. Data scientists report data cleaning is about 80% of the time spent on a project. So it’s going to suck a little bit. Luckily, tools like Dataiku DSS can make this much faster!

STEP 4: ENRICH YOUR DATASET

Now that you’ve got clean data, it’s time to manipulate it to get the most value out of it. This is the time to join all your different sources, and group logs, to get your data down to the essential features.

You’ll then start manipulating the data to extract lots of valuable features. For example, getting a country and even a town out of a visitor’s IP address. Extracting time of day, or week of year from your dates to get something more meaningful.

The possibilities are pretty much endless, and you’ll get a pretty good idea by scrolling through Dataiku DSS’s processors in the Lab of the operations you can execute.

STEP 5: BUILD VISUALISATIONS

building insights and graphs in data project

You now have a nice dataset (or maybe several), so this is a good time to start exploring it by building graphs. When you’re dealing with large volumes of data, they’re the best way to explore and communicate your findings.

You’ll find lots of tools available that make this step fun to prepare and to receive. The tricky part is always to be able to dig into your graphs to answer any question somebody would have about an insight. That’s when the data preparation comes in handy: you’re the guy who did the dirty work so you know the data like the palm of your hand!

If this is the final step of your project, it’s important to use APIs and plugins so you can push those insights to where your end users want to have them. So get integrated with their tools!

Your graphs don’t have to be the end of your project though. They’re a way to uncover more trends that you want to explain. They’re also a way to develop more interesting features. For example, by putting your data points on a map you could perhaps notice that specific geographic zones are more telling than specific countries or cities.

STEP 6: GET PREDICTIVE

By working with clustering algorithms (aka unsupervised), you can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express what feature is decisive in these results. Tools like Dataiku DSS help beginners run basic open source algorithms easily in clickable interfaces.

More advanced data scientists can then go even further and predict future trends with supervised algorithms. By analyzing past data, they find features that have impacted past trends, and use them to build predictions. More than just gaining knowledge, this final step can lead to building whole new products and processes. To get these in production though, you’ll need the intervention of data scientists and engineers, but it’s important to understand the process so all the parties involved (business users and analysts as well), will be able to understand what comes out in the end.

STEP 7: ITERATE

The main goal in any business project is to prove it’s effectiveness as fast as possible to justify, well, your job. Data projects are the same. By gaining time on data cleaning and enriching, you can go to the end of the project fast and get your first results. These first insights will be a great start to uncover more necessary cleaning, to develop more features in order to continuously improve results and model outputs.

Now that you’ve got the skills, get started right now by building projects in Dataiku DSS!

torsdag 23. juni 2016

The New Rules for Becoming a Data Scientist

Summary: What do you need to do to get an entry level job in data science?

This article is written for anyone who is considering becoming a data scientist. That includes young people just starting their bachelor’s degrees and folks in the first two or three years of their careers who want to make the switch.

It’s not for folks who know they are going to pursue one of the new Master’s in Data Science or Ph.D. candidates. It’s for folks looking for entry level jobs that are specifically on the data science career ladder.

Is There a Data Science Career Progression That Doesn’t Require an Advanced Degree?

Yes there is. Like many high skill professions that’s not to say that an advanced degree won’t make it easier but there are definitely ways to enter this market with only a bachelor’s degree.

If you’ve been practicing data science for more than five or ten years you also know that the majority of us over 35 don’t have specific data science degrees. We came to data science via a variety of related disciplines and gained our cred largely based on performance and experience. It’s only the cohort under 35 working in data science that’s likely to have a DS-specific degree, advanced or bachelor’s.

The flack this article is likely to draw is not over the level of degree required or the types of experience but the just-below-boiling controversy about who gets to call themselves a data scientist. The problem in our profession, and I’m not going to solve it here, is there is not an accepted nomenclature that differentiates the various skill levels of data scientists or who gets to wear that title at all.

Employers aren’t helping since actual data science jobs may be called engineer, analyst, developer, team lead or many other less exciting sounding titles. Other employers are giving data science titles to folks who are not really doing data science, but more descriptive analytics and straight EDW work.

So for simplicity’s sake I’m going to call our target audience folks who are seeking positions as Junior or Associate Data Scientists. Specifically that means doing work that involves detecting signals in the data that can be used to make predictions about future behavior. Not simple descriptive historical analysis of what’s happened in the past.

For Beginners What Does the Market Look Like and What Type of Work Will You Do?

There are two key points to understand here. The first is that the data science market has divided into two distinctly different segments, Production and Development.

Production: This is by far the largest and most mature segment where predictive analytics has been used for longest and where it is best integrated to create truly data-driven businesses. Large B2C service businesses dominate this group, specifically insurance, financial services, cable and telecos, healthcare, plus retail, ecommerce, and some manufacturing. These companies are widely distributed geographically so you can work pretty much anywhere. The primary data science activities are predictive analytics and recommenders.

Development: This is the new and sexy world of data science that gets all the press coverage. In these enterprises the data science and the code are the product. Think Google, Facebook, eHarmony, Apple, and the thousands of start-ups that are either developing new analytic and big data platforms, or products with embedded analytics. This is also where you find the newest developments in data science including deep learning for image, text, and speech recognition, much of IoT (some crossover here to the production world), and all the flavors of AI.

The Development world is geographically concentrated in a few areas that we all know: the Bay area, Silicon Beach, New York, Boston, and maybe Austin. This is exciting and heady stuff where you will probably devote upwards of 60% to 70% of your substantial starting salary to rent.

As a new Associate Data Scientist you are much more likely to find your first career step in the Production world.

The Four Paths of Data Science

The second main point is that your career progression in DS will probably take you down one of four paths represented by different types of data scientists. These four types are ultimately differentiated by what they spend their time doing.

The best analysis that I’ve seen on this comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. You can find the original at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp and I strongly encourage you to read it.

There are 40 pages of good analysis here or for the Cliff Notes version see my previous article How to Become a Data Scientist.

In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.

1. Data Businesspeople are those that are most focused on the organization and how data projects yield profit. At the entry level you’ll be performing the junior duties of blending and cleaning data and preparing basic predictive models.

2. Data Developer. Focused on the technical problem of managing data — how to get it, store it, and learn from it. At the entry level you’ll be working with Hadoop as well as structured data. If you are more interested in the data science infrastructure side this may be for you and is a particularly good path for a current analyst and IT staff to move up into the data science career path.

3. Data Creatives. Often tackle the entire soup-to-nuts analytics process on their own: from extracting and blending data, to performing advanced analyses and building models, to creating visualizations and interpretations. This is a more senior role innovating new types of predictive analytic use cases, data products, and services. This may also be you if you find yourself in a company with little or no experience with advanced analytics but you’re unlikely to get this job fresh out of college with no experience. Data Creatives are heavily present in the Development world.

4. Data Researchers. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD. These are folks who are innovating data science at its most fundamental level.

According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems. Here’s their chart.

This is an important decision since you need to do activities within data science that you like. This may lead you toward an advanced degree or simply to develop you skills through experience. It’s not something you have to decide from day one but one that you’ll want to consider early in your career.

The Skills You’ll Need to Enter the Data Science Market

If you were shopping for a two-year Master’s Degree in Data Science you’d have lots to pick from. If you search for Bachelor’s degrees in Data Science you’ll find a good selection but at many institutions the undergraduate degree is more likely to be titled ‘Computer Science’ leaving you to wonder if you’re actually getting the knowledge that you need.

If you have a choice, pick a college that specifically offers a Data Science degree. If you don’t have that choice you’ll have to analyze and select the blocks of learning that you’ll need.

Yes you need to be grounded in the broad aspects of computer science but in addition there are specific skills and knowledge you’ll need to master. The best description I’ve seen for this incremental learning is also an excellent guide for those of you who have recently finished your bachelors. It’s from an article by Amy Gershkoff, the Chief Data Officer at Zynga and describes their in-house program for growing their own data scientists.

Zynga’s in-house program is 12 to 18 months. To be considered there are a variety of performance requirements and academically the candidate needs a minimum of two previous semesters of coursework in statistics, economics, computer science, or similar. At Zynga, some of this is in an on-line academic environment and some is mentored by their in-house data scientists. This could easily be the course list for your undergraduate program. I have added some observations of my own.

Phase I: Foundational Statistical Theory

Participants learn the basics of probability theory and statistical analysis including sampling theory, hypothesis testing, and statistical distributions. For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others. Usually one to two semesters of an online statistics course (such as Princeton University’s online course) covers this material.

Phase II: Foundational Programming Skills

To be an effective data scientist, knowledge of scripting languages is a requirement. Selecting which ones is a matter of discussion. My take is this:

SQL: Not really a hard data science language but reflects the fact that you’re likely to have to extract data yourself from relational databases. Also, SQL is now almost universally available as a query language on Hadoop (it’s really no longer accurate to call it NoSQL).

Python: The big discussion over the last five or so years has been around R versus Python. Python is my pick as a production language with a very generous data science library. More importantly, as SPARK has come on so quickly as the preferred tool on Hadoop, Python works easily here while R does not. In the most recent surveys you’ll see Python pulling away from R.

SAS: Yes SAS. SAS was practically the original DS scripting language before R and Python. Although it’s included here under programming skills you can learn to use the SAS packages via drag-and-drop UI just as easily. Depending on what survey you’re reading you may or may not see SAS on each list, but in the Production world SAS is extremely common and having this skill is a definite competitive advantage. IBM SPSS is an option but SAS has a huge lead in adoption. You will rarely encounter SAS in the Development world.

Phase III: Machine Learning

Participants learn both supervised and unsupervised learning techniques. Supervised learning techniques include decision trees, Random Forrest, logistic regression, Neural Networks, and SVMs. Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.

Only a matter of a year or two ago you could not be an effective data scientist without knowing the inner workings of these algorithms including how to manipulate their tuning parameters to optimize results. The late breaking news however is the new availability of completely automated predictive analytic platformswhere selection and operation of the ML algorithms is handled by AI.

The likelihood that your new employer will have any of these new platforms on hand is still fairly slim but growing by the day. Perhaps you will be the one to suggest they utilize them. They can really speed up the modeling process. Until then, you need to know what’s going on under the hood of all the major ML algorithms.

Phase IV: Big Data Toolbox

It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets. For this reason, basic knowledge of tools such as Hadoop, Spark, and an analytics platform for large data sets constitutes a dedicated module.

It’s here that you’ll learn how those models you built in the last section are put into operation to assist business decisions. Until they’re operationalized, they’re of no value.

It’s also here that you’ll learn the basics of streaming versus batch both in model development and implementation. Spark has come on very fast with extremely high adoption rates and is the basic tool now for both batch and streaming.

Should You Specialize Early?

In the Development world you will increasingly only be selected if you have a specialty. In the Production world you are likely to have more opportunities if you don’t specialize. Having said that there are two areas you may want to examine which can be picked up fairly rapidly and are considered specializations within the Production world.

Supply Chain Forecasting: There are some very specific techniques and packages associated with true demand driven supply chain forecasting that can provide an unique entre in the world of manufacturing or logistics.

IoT for Manufacturing: This is the use of predictive models on streaming data from SCADA systems and the like to predict the quality of output during a production run or the imminent failure of a piece of capital equipment.

If you wanted to make your living in an area dominated by manufacturing you would consider adding these to your portfolio early in your career.

For the most part however, if you’re in the Production world, predictive modeling and recommenders will be a complete toolset for several years.

Remember also that our profession is changing fast. It is already well past the time that a single data scientist could master the entire field. Employers may still be looking for unicorns but very rapidly there will be emerging specialty fields you may consider as your career progresses. Deep learning, natural language processing, image processing, and AI are all examples that will take either additional education or serious OJT.

What about the rumors of those outsized salaries even for beginners? Well they are at least partly true in that you will earn a well above average salary compared to other analyst or IT staff positions. You’re not going get a Silicon Valley salary if you’re working in Milwaukee.

The best salary and skills studies come from O’Reilly. Their most recent survey for example says that a Master’s degree will only add about $3,500 per year to your earnings. This is a well done survey that evaluates not only salary but time spent in different tasks, tools used, and other factors. Be sure to carefully evaluate who filled out the surveys and whether you think they are representative. There are no purely objective bias-free surveys in our profession.

As Your Career Progresses

Data science has been and continues to be a field in which knowledge of tools as well as business in paramount. We utilize a complex toolbox to extract, blend, clean, transform, engineer, model, and implement models that can create business value from data that only a few years ago was not considered valuable.

It should come as no surprise that innovation is simplifying and automating the toolbox of existing tools even as new tools are arising. In the past if we were expert carpenters with great skill with our tools, in the future we will be more like architects bringing a broad range of tools and design skills to bear to build value.

In management consulting where I spent many years we used to say that a consultant needs three legs to stand on, domain knowledge (knowledge of a particular industry), process knowledge (deep understanding a particular process such as planning, manufacturing, or accounting), and methodology (in management consulting this means process improvement, reengineering, strategy development, or package implementation among others). As your career progresses you should build your own foundation on these three principles where methodology becomes the skills of data science that you’ve mastered. The other two legs, deep knowledge of one or more industries and one or more business processes will be why future employers seek you out.

fredag 17. juni 2016

How to Become a Data Scientist (Part 2/3)

Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.

In this chapter, we will explore the options at your disposal – but first – we will begin proceedings by discussing an important notion that concerns data science and learning.

Continual Learning

Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) is evolving so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).

Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Conferences and networking with the data science community are effective ways of keeping on top of the latest developments. And regularly reading books and papers is very important (on this: if you do not have a research background, it is worth learning how to read academic papers properly).

Play. Build. Experiment.

Going back to the message we touched on in Chapter One, there is only one-way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the points I wanted to make, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.

This is what should you take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:

“Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn”

The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.

Kaggle / Open-Source / Freelancing

If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:

“If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD”

There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.

Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.

Finally, if you can get freelancing work, this is a great way to build a track record and demonstrates that you can operate in a commercial environment. And rather conveniently, you could even utilize the Experfy platform for that purpose.

To PhD or not to PhD

Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted:

“The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again”

And apart from anything, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.

But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:

“If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?”

Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.

There is also the employability aspect to consider: are you more employable as a PhD graduate vs. spending the same time on self-study? I do not have sufficient evidence to comment, but either way, it is more important whether you have truly spent the time building up expert capability (and how you can evidence this). PhD’s are certainly valuable but there are great data scientists with PhD’s and great ones without.

Other University Degrees

So a PhD is not for you – perhaps it is the cost, or perhaps you have not yet developed the expertise necessary for research of this nature. Whatever the reason – there is no need to panic – because many universities are now offering Bachelors, Masters and Diplomas specifically designed for data science, where both computer science and mathematics/statistics are on the curriculum (the attentive reader will remember this from Chapter Two).

Courses like these will certainly take you in the right direction, but take note: they won't be enough to convert you into a ready-made data scientist, because as we know – that takes experience.

Online Courses

In a similar sense – even if you come from another quantitative field – a few online courses will not make you an expert, and remember: this is an expert field. But even if an online course was enough to master a chosen subject, you will still face competition, who – in all likeliness – will have far more practical and commercial experience in these areas. This is really important to be conscious of, and so we will return to this in Chapter Four.

All this being said, online courses are incredibly useful tools to help kick-start your journey, or begin learning a new area (like deep learning, for example). The most popular courses are found via Coursera, Udacity and edX, with Dylan Hogg describing Andrew Ng’s Machine Learning on Coursera as “an absolute pre-requisite for anyone who does not have a research background”.

The following is by no means a complete list, or mandatory for that matter, but these also stood out to Dylan and some of the other data scientists we have met so far:

Machine Learning: Intro to Machine Learning (Udacity)
Deep Learning: Deep Learning (Udacity), Neural Networks for Machine Learning (Coursera)
Spark: Big Data Analysis with Spark (edX), Distributed Machine Learning with Spark (edX)

During my interactions with the Experfy team, I've found out that they were also launching a training platform. You can see a preview here.

Books

Needless to say: good books are an invaluable resource and our favourite data scientists advocated the following:

Pattern Recognition and Machine Learning by Christopher Bishop
Machine Learning: a Probabilistic Perspective by Kevin P. Murphy
Why: A Guide to Finding and Using Causes by Samantha Kleinberg (if you want to know why this is important, take a look at Yanir Seroussi’s blog post on: Why You Should Stop Worrying About Deep Learning and Deepen Your Understanding of Causality Instead)
An Introduction to Statistical Learning by James, Witton, Hastie and Tibshirani, which, according to Dylan: “is a great introduction to statistical learning and is an accessible version of the more advanced classic”: Elements of Statistical Learning
And for a different suggestion, Will Hanninger recommended The Pyramid Principle by Barbara Minto. It does not cover data science specifically, but is valuable for problem solving and presenting

Presenting / Communicating

If you do not have a natural disposition for communicating – especially with non-technical people – this is something you will need to work on (see Chapter One for why). Gaining practice and obtaining feedback is the best way to improve your soft skills, although Yanir also recommended the classic book by Dale Carnegie: How to Win Friends and Influence People.

Advertisement