Data Science Jedi: June 2016

Wednesday, 29 June 2016

If social networks were countries, which would they be?

Facebook CEO Mark Zuckerberg speaks on stage during the Facebook F8 conference in San Francisco, California

If Facebook were a country, it would be substantially bigger than China. The size of Facebook's user base translates to more than one in five of the global population using it each month - around 1.65 billion people.

The role of digital technology in breaking down physical borders is one of the many trends in the Fourth Industrial Revolution. As social media continues to open up new opportunities for businesses and societies, how do today's networks compare?

Facebook
According to Statista, Facebook had over 1.65 billion monthly active users in the first quarter of 2016. The number of monthly active mobile users also passed 1.5 billion in the same quarter. China's population, by comparison, is around 1.37 billion.

WhatsApp
While not technically a social network, it's worth including the messaging giant in this list due to the 1 billion-plus people using it each month. Monthly active users isn't the best metric for measuring messaging apps (you either use them daily-ish or not at all) but the MAU figure has grown impressively from 700 million in January 2015 to 1 billion now, putting it within sight of India, which has a population of 1.25 billion. The messaging app also handles over 64 billion messages and 600 million photos each day.

Top 15 countries by population, and the social media giants

Instagram

The photo- and video-sharing app reported over 400 million monthly active users worldwide in September 2015, just ahead of the US population of 319 million. Nearly all of these are engaging with the service via the mobile app, although there is also a desktop version. The number of Instagram users in the US is predicted to pass 106 million by 2018.

Twitter

The network for those happy to keep their musings to 140 characters or fewer, Twitter has over 305 million monthly active users, around 80% of whom live outside the US. The social network upset the apple cart somewhat last year by introducing a tailored algorithm to order tweets, moving away from a purely live feed, which annoyed some users. Growth has slowed and the company's stock price has slipped, but it's still the go-to place for breaking news alerts and a glimpse of the world in real time.

Google+

Google doesn't particularly like talking about its MAUs, and it's fair to say it isn't the obvious destination when people want to share something about themselves. At last count, the network had over 300 million users, which would make it bigger than Indonesia, and a tad smaller than the USA.

LinkedIn

LinkedIn's monthly active user base is growing robustly, with around 100 million people currently using the site each month, although over 400 million have an account. The social network generates revenue from three areas: hiring solutions, advertising, and premium subscriptions. Its 100 million MAUs put it just behind the Philippines in terms of size.

Snapchat

Snapchat is the newest member of the social media giants. It was reported in January last year that it had over 100 million monthly active users, which would make it around the same size as Ethiopia. However, data is hard to come by, and some other sources suggest the figure could be as high as 200 million.
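To make the comparison concrete, here is a minimal Python sketch that merges the figures quoted above into a single ranking. It only includes countries whose populations are actually stated in this post (China, India and the US); Indonesia, the Philippines and Ethiopia are left out because no numbers are given for them. The `entries` list and the `rank_by_size` helper are illustrative names, not part of any published dataset.

```python
# Figures in millions, as quoted in the post above.
entries = [
    ("Facebook", 1650, "network"),
    ("WhatsApp", 1000, "network"),
    ("Instagram", 400, "network"),
    ("Twitter", 305, "network"),
    ("Google+", 300, "network"),
    ("LinkedIn", 100, "network"),
    ("Snapchat", 100, "network"),
    ("China", 1370, "country"),
    ("India", 1250, "country"),
    ("United States", 319, "country"),
]


def rank_by_size(rows):
    """Return rows sorted from the largest to the smallest head count."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = rank_by_size(entries)
for position, (name, millions, kind) in enumerate(ranking, start=1):
    print(f"{position:2d}. {name:<14} {millions:>5} million ({kind})")
```

On these numbers Facebook comes out ahead of China, and WhatsApp sits just behind India, which matches the claims in the text.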

Thursday, 23 June 2016

The New Rules for Becoming a Data Scientist

Summary: What do you need to do to get an entry level job in data science?

This article is written for anyone who is considering becoming a data scientist. That includes young people just starting their bachelor’s degrees and folks in the first two or three years of their careers who want to make the switch.

It’s not for folks who already know they are going to pursue one of the new Master’s degrees in Data Science or a Ph.D. It’s for folks looking for entry level jobs that are specifically on the data science career ladder.

Is There a Data Science Career Progression That Doesn’t Require an Advanced Degree?

Yes, there is. As with many high-skill professions, an advanced degree will make it easier, but there are definitely ways to enter this market with only a bachelor’s degree.

If you’ve been practicing data science for more than five or ten years you also know that the majority of us over 35 don’t have specific data science degrees. We came to data science via a variety of related disciplines and gained our cred largely based on performance and experience. It’s only the cohort under 35 working in data science that’s likely to have a DS-specific degree, advanced or bachelor’s.

The flak this article is likely to draw is not over the level of degree required or the types of experience but over the just-below-boiling controversy about who gets to call themselves a data scientist. The problem in our profession, and I’m not going to solve it here, is that there is no accepted nomenclature that differentiates the various skill levels of data scientists or determines who gets to wear that title at all.

Employers aren’t helping since actual data science jobs may be called engineer, analyst, developer, team lead or many other less exciting sounding titles. Other employers are giving data science titles to folks who are not really doing data science, but more descriptive analytics and straight EDW work.

So for simplicity’s sake I’m going to call our target audience folks who are seeking positions as Junior or Associate Data Scientists. Specifically that means doing work that involves detecting signals in the data that can be used to make predictions about future behavior. Not simple descriptive historical analysis of what’s happened in the past.

For Beginners What Does the Market Look Like and What Type of Work Will You Do?

There are two key points to understand here. The first is that the data science market has divided into two distinctly different segments, Production and Development.

Production: This is by far the largest and most mature segment, where predictive analytics has been used longest and where it is best integrated to create truly data-driven businesses. Large B2C service businesses dominate this group, specifically insurance, financial services, cable and telcos, healthcare, plus retail, ecommerce, and some manufacturing. These companies are widely distributed geographically so you can work pretty much anywhere. The primary data science activities are predictive analytics and recommenders.

Development: This is the new and sexy world of data science that gets all the press coverage. In these enterprises the data science and the code are the product. Think Google, Facebook, eHarmony, Apple, and the thousands of start-ups that are either developing new analytic and big data platforms, or products with embedded analytics. This is also where you find the newest developments in data science including deep learning for image, text, and speech recognition, much of IoT (some crossover here to the production world), and all the flavors of AI.

The Development world is geographically concentrated in a few areas that we all know: the Bay area, Silicon Beach, New York, Boston, and maybe Austin. This is exciting and heady stuff where you will probably devote upwards of 60% to 70% of your substantial starting salary to rent.

As a new Associate Data Scientist you are much more likely to find your first career step in the Production world.

The Four Paths of Data Science

The second main point is that your career progression in DS will probably take you down one of four paths represented by different types of data scientists. These four types are ultimately differentiated by what they spend their time doing.

The best analysis that I’ve seen on this comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. You can find the original at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp and I strongly encourage you to read it.

There are 40 pages of good analysis here or for the Cliff Notes version see my previous article How to Become a Data Scientist.

In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.

1. Data Businesspeople are those that are most focused on the organization and how data projects yield profit. At the entry level you’ll be performing the junior duties of blending and cleaning data and preparing basic predictive models.

2. Data Developers. Focused on the technical problem of managing data: how to get it, store it, and learn from it. At the entry level you’ll be working with Hadoop as well as structured data. If you are more interested in the data science infrastructure side this may be for you, and it is a particularly good path for current analysts and IT staff to move up into the data science career path.

3. Data Creatives. Often tackle the entire soup-to-nuts analytics process on their own: from extracting and blending data, to performing advanced analyses and building models, to creating visualizations and interpretations. This is a more senior role innovating new types of predictive analytic use cases, data products, and services. This may also be you if you find yourself in a company with little or no experience with advanced analytics but you’re unlikely to get this job fresh out of college with no experience. Data Creatives are heavily present in the Development world.

4. Data Researchers. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD. These are folks who are innovating data science at its most fundamental level.

According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems. Here’s their chart.

This is an important decision since you need to do activities within data science that you like. This may lead you toward an advanced degree or simply to develop your skills through experience. It’s not something you have to decide from day one but one that you’ll want to consider early in your career.

The Skills You’ll Need to Enter the Data Science Market

If you were shopping for a two-year Master’s Degree in Data Science you’d have lots to pick from. If you search for Bachelor’s degrees in Data Science you’ll find a good selection but at many institutions the undergraduate degree is more likely to be titled ‘Computer Science’ leaving you to wonder if you’re actually getting the knowledge that you need.

If you have a choice, pick a college that specifically offers a Data Science degree. If you don’t have that choice you’ll have to analyze and select the blocks of learning that you’ll need.

Yes, you need to be grounded in the broad aspects of computer science, but in addition there are specific skills and knowledge you’ll need to master. The best description I’ve seen for this incremental learning is also an excellent guide for those of you who have recently finished your bachelor’s. It’s from an article by Amy Gershkoff, the Chief Data Officer at Zynga, and describes their in-house program for growing their own data scientists.

Zynga’s in-house program is 12 to 18 months. To be considered there are a variety of performance requirements and academically the candidate needs a minimum of two previous semesters of coursework in statistics, economics, computer science, or similar. At Zynga, some of this is in an on-line academic environment and some is mentored by their in-house data scientists. This could easily be the course list for your undergraduate program. I have added some observations of my own.

Phase I: Foundational Statistical Theory

Participants learn the basics of probability theory and statistical analysis including sampling theory, hypothesis testing, and statistical distributions. For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others. Usually one to two semesters of an online statistics course (such as Princeton University’s online course) covers this material.
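As a rough illustration of the Phase I topics, the sketch below computes a sample standard deviation, a Pearson correlation, and a simple least-squares regression line in plain Python. The `hours` and `scores` values are made-up numbers for demonstration only, and the function names are my own; in practice you would likely reach for a statistics library instead.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Hypothetical data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
r = pearson_correlation(hours, scores)
print(f"std of scores: {sample_std(scores):.2f}")
print(f"correlation:   {r:.3f}")
print(f"regression:    score = {slope:.2f} * hours + {intercept:.2f}")
```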

Phase II: Foundational Programming Skills

To be an effective data scientist, knowledge of scripting languages is a requirement. Selecting which ones is a matter of discussion. My take is this:

SQL: Not really a hard data science language but reflects the fact that you’re likely to have to extract data yourself from relational databases. Also, SQL is now almost universally available as a query language on Hadoop (it’s really no longer accurate to call it NoSQL).

Python: The big discussion over the last five or so years has been around R versus Python. Python is my pick as a production language with a very generous data science library. More importantly, as Spark has come on so quickly as the preferred tool on Hadoop, Python works easily here while R does not. In the most recent surveys you’ll see Python pulling away from R.

SAS: Yes SAS. SAS was practically the original DS scripting language before R and Python. Although it’s included here under programming skills you can learn to use the SAS packages via drag-and-drop UI just as easily. Depending on what survey you’re reading you may or may not see SAS on each list, but in the Production world SAS is extremely common and having this skill is a definite competitive advantage. IBM SPSS is an option but SAS has a huge lead in adoption. You will rarely encounter SAS in the Development world.
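To show what "extracting data yourself" with SQL can look like from Python, here is a small sketch using the standard-library `sqlite3` module with an in-memory database. The `orders` table, its columns, and the rows are invented for the example; a real job would point at your employer's relational database instead.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5), ("carol", 50.0)],
)

# Aggregate per customer: the kind of extract you might feed into a model.
rows = conn.execute(
    "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(f"{customer:<6} orders={n_orders} total={total:.2f}")
```

The same `GROUP BY` pattern carries over to SQL-on-Hadoop engines, although connection details and dialect quirks will differ.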

Phase III: Machine Learning

Participants learn both supervised and unsupervised learning techniques. Supervised learning techniques include decision trees, random forests, logistic regression, neural networks, and SVMs. Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.

Only a matter of a year or two ago you could not be an effective data scientist without knowing the inner workings of these algorithms including how to manipulate their tuning parameters to optimize results. The late breaking news however is the new availability of completely automated predictive analytic platforms where selection and operation of the ML algorithms is handled by AI.

The likelihood that your new employer will have any of these new platforms on hand is still fairly slim but growing by the day. Perhaps you will be the one to suggest they utilize them. They can really speed up the modeling process. Until then, you need to know what’s going on under the hood of all the major ML algorithms.
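For a feel of what "under the hood" means, here is a minimal sketch of one supervised technique from the list above, logistic regression, trained by batch gradient descent in plain Python. The single centred feature, the toy labels, and the names `train_logistic` and `predict` are assumptions for illustration; `learning_rate` and `epochs` stand in for the kind of tuning parameters you would adjust by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Toy, already-centred feature: negative values are class 0, positive class 1.
xs = [-4, -3, -2, -1, 1, 2, 3, 4]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(xs, ys)
print(predict(w, b, -2.5), predict(w, b, 2.5))
```

Try changing `learning_rate` or `epochs` and watch how the fitted weight `w` moves; that is the sort of tuning the automated platforms aim to take off your hands.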

Phase IV: Big Data Toolbox

It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets. For this reason, basic knowledge of tools such as Hadoop, Spark, and an analytics platform for large data sets constitutes a dedicated module.

It’s here that you’ll learn how those models you built in the last section are put into operation to assist business decisions. Until they’re operationalized, they’re of no value.

It’s also here that you’ll learn the basics of streaming versus batch both in model development and implementation. Spark has come on very fast with extremely high adoption rates and is the basic tool now for both batch and streaming.
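The difference between batch and streaming can be sketched without any big data tooling. Below, a batch function sees the whole dataset at once, while a streaming object updates its mean and variance one record at a time using Welford's online algorithm. This is plain Python rather than Spark code, and `RunningStats`, `batch_stats`, and the `readings` values are illustrative names and numbers of my own.

```python
class RunningStats:
    """Streaming mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def batch_stats(values):
    """Batch mean and sample variance: needs all values in memory."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


readings = [12.0, 15.0, 11.0, 14.0, 18.0]  # hypothetical sensor readings

stream = RunningStats()
for reading in readings:  # records arrive one at a time
    stream.update(reading)

batch_mean, batch_variance = batch_stats(readings)
print(stream.mean, stream.variance)
print(batch_mean, batch_variance)
```

Both approaches agree on the final answer; the streaming version simply never needs to hold more than one record at a time.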

Should You Specialize Early?

In the Development world you will increasingly only be selected if you have a specialty. In the Production world you are likely to have more opportunities if you don’t specialize. Having said that there are two areas you may want to examine which can be picked up fairly rapidly and are considered specializations within the Production world.

Supply Chain Forecasting: There are some very specific techniques and packages associated with true demand driven supply chain forecasting that can provide a unique entrée into the world of manufacturing or logistics.

IoT for Manufacturing: This is the use of predictive models on streaming data from SCADA systems and the like to predict the quality of output during a production run or the imminent failure of a piece of capital equipment.

If you wanted to make your living in an area dominated by manufacturing you would consider adding these to your portfolio early in your career.

For the most part however, if you’re in the Production world, predictive modeling and recommenders will be a complete toolset for several years.

Remember also that our profession is changing fast. It is already well past the time that a single data scientist could master the entire field. Employers may still be looking for unicorns but very rapidly there will be emerging specialty fields you may consider as your career progresses. Deep learning, natural language processing, image processing, and AI are all examples that will take either additional education or serious OJT.

What about the rumors of those outsized salaries even for beginners? Well they are at least partly true in that you will earn a well above average salary compared to other analyst or IT staff positions. But you’re not going to get a Silicon Valley salary if you’re working in Milwaukee.

The best salary and skills studies come from O’Reilly. Their most recent survey for example says that a Master’s degree will only add about $3,500 per year to your earnings. This is a well done survey that evaluates not only salary but time spent in different tasks, tools used, and other factors. Be sure to carefully evaluate who filled out the surveys and whether you think they are representative. There are no purely objective bias-free surveys in our profession.

As Your Career Progresses

Data science has been and continues to be a field in which knowledge of tools as well as business is paramount. We utilize a complex toolbox to extract, blend, clean, transform, engineer, model, and implement models that can create business value from data that only a few years ago was not considered valuable.

It should come as no surprise that innovation is simplifying and automating the toolbox of existing tools even as new tools are arising. In the past if we were expert carpenters with great skill with our tools, in the future we will be more like architects bringing a broad range of tools and design skills to bear to build value.

In management consulting where I spent many years we used to say that a consultant needs three legs to stand on: domain knowledge (knowledge of a particular industry), process knowledge (deep understanding of a particular process such as planning, manufacturing, or accounting), and methodology (in management consulting this means process improvement, reengineering, strategy development, or package implementation among others). As your career progresses you should build your own foundation on these three principles where methodology becomes the skills of data science that you’ve mastered. The other two legs, deep knowledge of one or more industries and one or more business processes, will be why future employers seek you out.

Friday, 17 June 2016

How to Become a Data Scientist (Part 2/3)

Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.

In this chapter, we will explore the options at your disposal – but first – we will begin proceedings by discussing an important notion that concerns data science and learning.

Continual Learning

Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) is evolving so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).

Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Conferences and networking with the data science community are effective ways of keeping on top of the latest developments. And regularly reading books and papers is very important (on this: if you do not have a research background, it is worth learning how to read academic papers properly).

Play. Build. Experiment.

Going back to the message we touched on in Chapter One, there is only one way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the points I wanted to make, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.

This is what you should take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:

“Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn”

The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.

Kaggle / Open-Source / Freelancing

If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:

“If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD”

There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.
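To give a flavour of the ensemble idea James mentions, here is a deliberately tiny sketch of bagging: many one-split decision trees ("stumps"), each trained on a bootstrap sample, voting on the answer. It is a simplified relative of a random forest, not a real one (there is only one feature, so no feature subsampling), and the data, function names, and the assumption that class 1 has the larger values are all mine.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Pick the threshold with the fewest training errors.

    The stump predicts 1 when x > threshold, else 0 (assumes class 1 is 'high').
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best_threshold, best_errors = None, None
    for threshold in candidates:
        errors = sum((1 if x > threshold else 0) != y for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bootstrap_sample(xs, ys, rng):
    """Sample with replacement, retrying until both classes are present.

    Requires that ys itself contains both classes.
    """
    while True:
        indices = [rng.randrange(len(xs)) for _ in xs]
        sample_ys = [ys[i] for i in indices]
        if len(set(sample_ys)) > 1:
            return [xs[i] for i in indices], sample_ys


def train_bagged_stumps(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [train_stump(*bootstrap_sample(xs, ys, rng)) for _ in range(n_trees)]


def predict(thresholds, x):
    """Majority vote across all stumps."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: low values are class 0, high values are class 1.
xs = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = train_bagged_stumps(xs, ys)
print(predict(model, 2), predict(model, 14))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; the vote smooths those differences out, which is the core intuition behind the ensembles that do so well on Kaggle.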

Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.

Finally, if you can get freelancing work, this is a great way to build a track record and demonstrates that you can operate in a commercial environment. And rather conveniently, you could even utilize the Experfy platform for that purpose.

To PhD or not to PhD

Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted:

“The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again”

And apart from anything, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.

But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:

“If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?”

Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.

There is also the employability aspect to consider: are you more employable as a PhD graduate vs. spending the same time on self-study? I do not have sufficient evidence to comment, but either way, it is more important whether you have truly spent the time building up expert capability (and how you can evidence this). PhDs are certainly valuable but there are great data scientists with PhDs and great ones without.

Other University Degrees

So a PhD is not for you – perhaps it is the cost, or perhaps you have not yet developed the expertise necessary for research of this nature. Whatever the reason – there is no need to panic – because many universities are now offering Bachelors, Masters and Diplomas specifically designed for data science, where both computer science and mathematics/statistics are on the curriculum (the attentive reader will remember this from Chapter Two).

Courses like these will certainly take you in the right direction, but take note: they won't be enough to convert you into a ready-made data scientist, because as we know – that takes experience.

Online Courses

In a similar sense – even if you come from another quantitative field – a few online courses will not make you an expert, and remember: this is an expert field. But even if an online course were enough to master a chosen subject, you will still face competition, who – in all likelihood – will have far more practical and commercial experience in these areas. This is really important to be conscious of, and so we will return to this in Chapter Four.

All this being said, online courses are incredibly useful tools to help kick-start your journey, or begin learning a new area (like deep learning, for example). The most popular courses are found via Coursera, Udacity and edX, with Dylan Hogg describing Andrew Ng’s Machine Learning on Coursera as “an absolute pre-requisite for anyone who does not have a research background”.

The following is by no means a complete list, or mandatory for that matter, but these also stood out to Dylan and some of the other data scientists we have met so far:

Machine Learning: Intro to Machine Learning (Udacity)
Deep Learning: Deep Learning (Udacity), Neural Networks for Machine Learning (Coursera)
Spark: Big Data Analysis with Spark (edX), Distributed Machine Learning with Spark (edX)

During my interactions with the Experfy team, I've found out that they were also launching a training platform. You can see a preview here.

Books

Needless to say: good books are an invaluable resource and our favourite data scientists advocated the following:

Pattern Recognition and Machine Learning by Christopher Bishop
Machine Learning: a Probabilistic Perspective by Kevin P. Murphy
Why: A Guide to Finding and Using Causes by Samantha Kleinberg (if you want to know why this is important, take a look at Yanir Seroussi’s blog post on: Why You Should Stop Worrying About Deep Learning and Deepen Your Understanding of Causality Instead)
An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani, which, according to Dylan: “is a great introduction to statistical learning and is an accessible version of the more advanced classic”: Elements of Statistical Learning
And for a different suggestion, Will Hanninger recommended The Pyramid Principle by Barbara Minto. It does not cover data science specifically, but is valuable for problem solving and presenting

Presenting / Communicating

If you do not have a natural disposition for communicating – especially with non-technical people – this is something you will need to work on (see Chapter One for why). Gaining practice and obtaining feedback is the best way to improve your soft skills, although Yanir also recommended the classic book by Dale Carnegie: How to Win Friends and Influence People.
