Data Science Jedi: statistics


Monday, 15 August 2016

The Data Science Process 1/3

Congratulations! You’ve just been hired for your first job as a data scientist at Hotshot Inc., a startup in San Francisco that is the toast of Silicon Valley. It’s your first day at work. You’re excited to go and crunch some data and wow everyone around you with the insights you discover. But where do you start?

Over the (deliciously catered) lunch, you run into the VP of Sales at Hotshot Inc., introduce yourself and ask her, “What kinds of data challenges do you think I should be working on?”

The VP of Sales thinks carefully. You’re on the edge of your seat, waiting for her answer, the answer that will tell you exactly how you’re going to have this massive impact on the company of your dreams.

And she says, “Can you help us optimize our sales funnel and improve our conversion rates?”

The first thought that comes to your mind is: What? Is that a data science problem? You didn’t even mention the word ‘data’. What do I need to analyze? What does this mean?

Fortunately, your mentor data scientists have warned you already: this initial ambiguity is a regular situation that data scientists in industry encounter. All you have to do is systematically apply the data science process to figure out exactly what you need to do.

The data science process: a quick outline

When a non-technical supervisor asks you to solve a data problem, the description of your task can be quite ambiguous at first. It is up to you, as the data scientist, to translate the task into a concrete problem, figure out how to solve it and present the solution back to all of your stakeholders. We call the steps involved in this workflow the “Data Science Process.” This process involves several important steps:

Frame the problem: Who is your client? What exactly is the client asking you to solve? How can you translate their ambiguous request into a concrete, well-defined problem?
Collect the raw data needed to solve the problem: Is this data already available? If so, what parts of the data are useful? If not, what more data do you need? What kind of resources (time, money, infrastructure) would it take to collect this data in a usable form?
Process the data (data wrangling): Real, raw data is rarely usable out of the box. There are errors in data collection, corrupt records, missing values and many other challenges you will have to manage. You will first need to clean the data to convert it to a form that you can further analyze.
Explore the data: Once you have cleaned the data, you have to understand the information contained within at a high level. What kinds of obvious trends or correlations do you see in the data? What are the high-level characteristics and are any of them more significant than others?
Perform in-depth analysis (machine learning, statistical models, algorithms): This step is usually the meat of your project, where you apply all the cutting-edge machinery of data analysis to unearth high-value insights and predictions.
Communicate results of the analysis: All the analysis and technical results that you come up with are of little value unless you can explain to your stakeholders what they mean, in a way that’s comprehensible and compelling. Data storytelling is a critical and underrated skill that you will build and use here.
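To make the VP's request a little more concrete, here is a minimal sketch of what "conversion rates" in a sales funnel could look like once the problem is framed and the data is collected. The stage names and counts are hypothetical, invented purely for illustration; Hotshot Inc.'s real funnel may look quite different.

```python
# Hypothetical funnel counts, invented purely for illustration.
funnel = [
    ("visited_site", 10000),
    ("signed_up", 1200),
    ("started_trial", 300),
    ("purchased", 60),
]


def stage_conversion(stages):
    """Return (from_stage, to_stage, rate) for each consecutive pair of stages."""
    rates = []
    for (name_a, count_a), (name_b, count_b) in zip(stages, stages[1:]):
        # Guard against an empty stage to avoid dividing by zero.
        rate = count_b / count_a if count_a else 0.0
        rates.append((name_a, name_b, rate))
    return rates


rates = stage_conversion(funnel)
# Overall conversion: share of first-stage visitors who reach the last stage.
overall = funnel[-1][1] / funnel[0][1]

for name_a, name_b, rate in rates:
    print(f"{name_a} -> {name_b}: {rate:.1%}")
print(f"overall: {overall:.2%}")
```

Looking at the stage-to-stage rates rather than only the overall rate is one way to see where in the funnel prospects drop off, which is usually the first thing to explore.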

So how can you help the VP of Sales at Hotshot Inc.? In the next few emails, we will walk you through each step in the data science process, showing you how it plays out in practice. Stay tuned!

Thursday, 23 June 2016

The New Rules for Becoming a Data Scientist

Summary: What do you need to do to get an entry level job in data science?

This article is written for anyone who is considering becoming a data scientist. That includes young people just starting their bachelor’s degrees and folks in the first two or three years of their careers who want to make the switch.

It’s not for folks who already know they are going to pursue one of the new Master’s degrees in Data Science, or for Ph.D. candidates. It’s for folks looking for entry level jobs that are specifically on the data science career ladder.

Is There a Data Science Career Progression That Doesn’t Require an Advanced Degree?

Yes, there is. As with many high-skill professions, that’s not to say that an advanced degree won’t make it easier, but there are definitely ways to enter this market with only a bachelor’s degree.

If you’ve been practicing data science for more than five or ten years you also know that the majority of us over 35 don’t have specific data science degrees. We came to data science via a variety of related disciplines and gained our cred largely based on performance and experience. It’s only the cohort under 35 working in data science that’s likely to have a DS-specific degree, advanced or bachelor’s.

The flak this article is likely to draw is not over the level of degree required or the types of experience but the just-below-boiling controversy about who gets to call themselves a data scientist. The problem in our profession, and I’m not going to solve it here, is that there is no accepted nomenclature that differentiates the various skill levels of data scientists or who gets to wear that title at all.

Employers aren’t helping since actual data science jobs may be called engineer, analyst, developer, team lead or many other less exciting sounding titles. Other employers are giving data science titles to folks who are not really doing data science, but more descriptive analytics and straight EDW work.

So for simplicity’s sake I’m going to call our target audience folks who are seeking positions as Junior or Associate Data Scientists. Specifically that means doing work that involves detecting signals in the data that can be used to make predictions about future behavior. Not simple descriptive historical analysis of what’s happened in the past.

For Beginners What Does the Market Look Like and What Type of Work Will You Do?

There are two key points to understand here. The first is that the data science market has divided into two distinctly different segments, Production and Development.

Production: This is by far the largest and most mature segment, where predictive analytics has been used the longest and where it is best integrated to create truly data-driven businesses. Large B2C service businesses dominate this group, specifically insurance, financial services, cable and telcos, healthcare, plus retail, ecommerce, and some manufacturing. These companies are widely distributed geographically so you can work pretty much anywhere. The primary data science activities are predictive analytics and recommenders.

Development: This is the new and sexy world of data science that gets all the press coverage. In these enterprises the data science and the code are the product. Think Google, Facebook, eHarmony, Apple, and the thousands of start-ups that are either developing new analytic and big data platforms, or products with embedded analytics. This is also where you find the newest developments in data science including deep learning for image, text, and speech recognition, much of IoT (some crossover here to the production world), and all the flavors of AI.

The Development world is geographically concentrated in a few areas that we all know: the Bay area, Silicon Beach, New York, Boston, and maybe Austin. This is exciting and heady stuff where you will probably devote upwards of 60% to 70% of your substantial starting salary to rent.

As a new Associate Data Scientist you are much more likely to find your first career step in the Production world.

The Four Paths of Data Science

The second main point is that your career progression in DS will probably take you down one of four paths represented by different types of data scientists. These four types are ultimately differentiated by what they spend their time doing.

The best analysis that I’ve seen on this comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. You can find the original at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp and I strongly encourage you to read it.

There are 40 pages of good analysis here or for the Cliff Notes version see my previous article How to Become a Data Scientist.

In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.

1. Data Businesspeople are those that are most focused on the organization and how data projects yield profit. At the entry level you’ll be performing the junior duties of blending and cleaning data and preparing basic predictive models.

2. Data Developer. Focused on the technical problem of managing data — how to get it, store it, and learn from it. At the entry level you’ll be working with Hadoop as well as structured data. If you are more interested in the data science infrastructure side this may be for you and is a particularly good path for a current analyst and IT staff to move up into the data science career path.

3. Data Creatives. Often tackle the entire soup-to-nuts analytics process on their own: from extracting and blending data, to performing advanced analyses and building models, to creating visualizations and interpretations. This is a more senior role innovating new types of predictive analytic use cases, data products, and services. This may also be you if you find yourself in a company with little or no experience with advanced analytics but you’re unlikely to get this job fresh out of college with no experience. Data Creatives are heavily present in the Development world.

4. Data Researchers. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD. These are folks who are innovating data science at its most fundamental level.

According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems. Here’s their chart.

This is an important decision since you need to do activities within data science that you like. This may lead you toward an advanced degree or simply to develop your skills through experience. It’s not something you have to decide from day one but one that you’ll want to consider early in your career.

The Skills You’ll Need to Enter the Data Science Market

If you were shopping for a two-year Master’s Degree in Data Science you’d have lots to pick from. If you search for Bachelor’s degrees in Data Science you’ll find a good selection but at many institutions the undergraduate degree is more likely to be titled ‘Computer Science’ leaving you to wonder if you’re actually getting the knowledge that you need.

If you have a choice, pick a college that specifically offers a Data Science degree. If you don’t have that choice you’ll have to analyze and select the blocks of learning that you’ll need.

Yes, you need to be grounded in the broad aspects of computer science, but in addition there are specific skills and knowledge you’ll need to master. The best description I’ve seen for this incremental learning is also an excellent guide for those of you who have recently finished your bachelor’s degree. It’s from an article by Amy Gershkoff, the Chief Data Officer at Zynga, and describes their in-house program for growing their own data scientists.

Zynga’s in-house program is 12 to 18 months. To be considered there are a variety of performance requirements and academically the candidate needs a minimum of two previous semesters of coursework in statistics, economics, computer science, or similar. At Zynga, some of this is in an on-line academic environment and some is mentored by their in-house data scientists. This could easily be the course list for your undergraduate program. I have added some observations of my own.

Phase I: Foundational Statistical Theory

Participants learn the basics of probability theory and statistical analysis including sampling theory, hypothesis testing, and statistical distributions. For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others. Usually one to two semesters of an online statistics course (such as Princeton University’s online course) covers this material.
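As a rough illustration of the Phase I topics named above (standard deviation, correlation, and basic regression), here is a minimal sketch in plain Python. The `hours` and `score` values are made up for the example, and the helper names are my own, not part of any course material.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

intercept, slope = fit_line(hours, score)
r = pearson_r(hours, score)
print(f"std of score: {sample_std(score):.2f}")
print(f"correlation:  {r:.3f}")
print(f"fitted line:  score = {intercept:.1f} + {slope:.1f} * hours")
```

In practice you would reach for a statistics library, but writing these three formulas out once makes it clear that correlation and the regression slope are built from the same sums of deviations.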

Phase II: Foundational Programming Skills

To be an effective data scientist, knowledge of scripting languages is a requirement. Selecting which ones is a matter of discussion. My take is this:

SQL: Not really a hard data science language but reflects the fact that you’re likely to have to extract data yourself from relational databases. Also, SQL is now almost universally available as a query language on Hadoop (it’s really no longer accurate to call it NoSQL).

Python: The big discussion over the last five or so years has been around R versus Python. Python is my pick as a production language with a very generous data science library. More importantly, as Spark has come on so quickly as the preferred tool on Hadoop, Python works easily here while R does not. In the most recent surveys you’ll see Python pulling away from R.

SAS: Yes SAS. SAS was practically the original DS scripting language before R and Python. Although it’s included here under programming skills you can learn to use the SAS packages via drag-and-drop UI just as easily. Depending on what survey you’re reading you may or may not see SAS on each list, but in the Production world SAS is extremely common and having this skill is a definite competitive advantage. IBM SPSS is an option but SAS has a huge lead in adoption. You will rarely encounter SAS in the Development world.

Phase III: Machine Learning

Participants learn both supervised and unsupervised learning techniques. Supervised learning techniques include decision trees, Random Forest, logistic regression, Neural Networks, and SVMs. Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.

Only a matter of a year or two ago you could not be an effective data scientist without knowing the inner workings of these algorithms including how to manipulate their tuning parameters to optimize results. The late breaking news however is the new availability of completely automated predictive analytic platforms where selection and operation of the ML algorithms is handled by AI.

The likelihood that your new employer will have any of these new platforms on hand is still fairly slim but growing by the day. Perhaps you will be the one to suggest they utilize them. They can really speed up the modeling process. Until then, you need to know what’s going on under the hood of all the major ML algorithms.

Phase IV: Big Data Toolbox

It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets. For this reason, basic knowledge of tools such as Hadoop, Spark, and an analytics platform for large data sets constitutes a dedicated module.

It’s here that you’ll learn how those models you built in the last section are put into operation to assist business decisions. Until they’re operationalized, they’re of no value.

It’s also here that you’ll learn the basics of streaming versus batch both in model development and implementation. Spark has come on very fast with extremely high adoption rates and is the basic tool now for both batch and streaming.
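The difference between batch and streaming can be sketched without any big data tooling at all. The example below is a plain-Python illustration, not Spark code: the batch version needs the whole dataset in hand, while the streaming version updates its answer one record at a time and never stores the history. The class name and the sensor readings are hypothetical.

```python
class RunningMean:
    """Streaming mean: updated one observation at a time, no history stored."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: shift the mean toward x by 1/count of the gap.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def batch_mean(values):
    """Batch mean: requires all values to be available at once."""
    return sum(values) / len(values)


# Hypothetical sensor readings arriving over time.
readings = [10.0, 12.0, 11.0, 15.0, 12.0]

stream = RunningMean()
for reading in readings:
    current = stream.update(reading)
    print(f"after {stream.count} readings, running mean = {current:.2f}")

print(f"batch mean over all readings = {batch_mean(readings):.2f}")
```

Both approaches arrive at the same number here; what differs is when the answer is available and how much data has to be held in memory to produce it.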

Should You Specialize Early?

In the Development world you will increasingly only be selected if you have a specialty. In the Production world you are likely to have more opportunities if you don’t specialize. Having said that there are two areas you may want to examine which can be picked up fairly rapidly and are considered specializations within the Production world.

Supply Chain Forecasting: There are some very specific techniques and packages associated with true demand driven supply chain forecasting that can provide a unique entrée into the world of manufacturing or logistics.

IoT for Manufacturing: This is the use of predictive models on streaming data from SCADA systems and the like to predict the quality of output during a production run or the imminent failure of a piece of capital equipment.

If you wanted to make your living in an area dominated by manufacturing you would consider adding these to your portfolio early in your career.

For the most part however, if you’re in the Production world, predictive modeling and recommenders will be a complete toolset for several years.

Remember also that our profession is changing fast. It is already well past the time that a single data scientist could master the entire field. Employers may still be looking for unicorns but very rapidly there will be emerging specialty fields you may consider as your career progresses. Deep learning, natural language processing, image processing, and AI are all examples that will take either additional education or serious OJT.

What about the rumors of those outsized salaries even for beginners? Well they are at least partly true in that you will earn a well above average salary compared to other analyst or IT staff positions. You’re not going to get a Silicon Valley salary if you’re working in Milwaukee.

The best salary and skills studies come from O’Reilly. Their most recent survey for example says that a Master’s degree will only add about $3,500 per year to your earnings. This is a well done survey that evaluates not only salary but time spent in different tasks, tools used, and other factors. Be sure to carefully evaluate who filled out the surveys and whether you think they are representative. There are no purely objective bias-free surveys in our profession.

As Your Career Progresses

Data science has been and continues to be a field in which knowledge of tools as well as business is paramount. We utilize a complex toolbox to extract, blend, clean, transform, engineer, model, and implement models that can create business value from data that only a few years ago was not considered valuable.

It should come as no surprise that innovation is simplifying and automating the toolbox of existing tools even as new tools are arising. In the past if we were expert carpenters with great skill with our tools, in the future we will be more like architects bringing a broad range of tools and design skills to bear to build value.

In management consulting where I spent many years we used to say that a consultant needs three legs to stand on: domain knowledge (knowledge of a particular industry), process knowledge (deep understanding of a particular process such as planning, manufacturing, or accounting), and methodology (in management consulting this means process improvement, reengineering, strategy development, or package implementation among others). As your career progresses you should build your own foundation on these three principles where methodology becomes the skills of data science that you’ve mastered. The other two legs, deep knowledge of one or more industries and one or more business processes, will be why future employers seek you out.

Wednesday, 8 June 2016

R Passes SAS in Academia Use (finally)

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

Way back in 2012 I published a forecast that showed that the use of R for scholarly publications would likely pass the use of SAS in 2015. But I didn’t believe the forecast since I expected the sharp decline in SAS and SPSS use to level off. In 2013, the trend accelerated and I expected R to pass SAS in the middle of 2014. As luck would have it, Google changed their algorithm, somehow finding vast additional quantities of SAS and SPSS articles. I just collected data on the most recent complete year of scholarly publications, and it turns out that 2015 was indeed the year that R passed SAS to garner the #2 position. Once again, models do better than “expert” opinion! I’ve updated The Popularity of Data Analysis Software to reflect this new data and include it here to save you the trouble of reading the whole 45 pages of it.

If you’re interested in learning R, you might consider reading my books R for SAS and SPSS Users, or R for Stata Users. I also teach workshops on R, but I’m currently booked through mid October, so please plan ahead.

Figure 2a. Number of scholarly articles found in the most recent complete year (2015) for each software package.
Scholarly Articles
Scholarly articles are also rich in information and backed by significant amounts of effort. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool or even an object of study. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.

Figure 2a shows the number of articles found for each software package for the most recent complete year, 2015. SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. For the first time ever, R is in second place with around half as many articles. Although now in third place, SAS is nearly tied with R. Stata and MATLAB are essentially tied for fourth and fifth place. Starting with Java, usage slowly tapers off. Note that the general-purpose software C, C++, C#, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those as much rougher counts than the rest. Since Scala and Julia have a heavy data science angle to them, I cut them some slack by not adding any data science terms to the search, not that it helped them much!

From Spark on down, the counts appear to be zero. That’s not the case; the counts are just very low compared to the more popular packages, which are used in tens of thousands of articles. Figure 2b shows the software only for those packages that have fewer than 1,200 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. Spark and RapidMiner top out the list of these packages, followed by KNIME and BMDP. There’s a slow decline in the group that goes from Enterprise Miner to Salford Systems. Then comes a group of mostly relatively new arrivals beginning with Microsoft’s Azure Machine Learning. A package that’s not a new arrival is from Megaputer, whose Polyanalyst software has been around for many years now, with little progress to show for it. Dead last is Lavastorm, which to my knowledge is the only commercial package that includes Tibco’s internally written version of R, TERR.

Figure 2b. The number of scholarly articles for software that was used by fewer than 1,200 scholarly articles (i.e. the bottom part of Fig. 2a, rescaled.)
Figures 2a and 2b are useful for studying market share as it is now, but they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting such data is too time consuming since it must be re-collected every year (since Google’s search algorithms change). What I’ve done instead is collect data only for the past two complete years, 2014 and 2015. Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red. Those whose use is declining or “cooling” are shown in blue. Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2014.
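The calculation behind a chart like Figure 2c can be sketched in a few lines. The package names and article counts below are hypothetical placeholders, not the data from this article; only the 500-article cutoff and the "hot" versus "cooling" labels come from the text.

```python
# Hypothetical (articles_2014, articles_2015) counts; NOT the article's data.
counts = {
    "PackageA": (1000, 1300),
    "PackageB": (20000, 15000),
    "PackageC": (400, 900),  # fewer than 500 articles in 2014, so excluded
}

MIN_ARTICLES_2014 = 500  # cutoff stated in the text


def percent_change(old, new):
    """Percent change from old to new."""
    return (new - old) / old * 100.0


# Keep only packages with enough 2014 articles for the change to be meaningful.
changes = {
    name: percent_change(old, new)
    for name, (old, new) in counts.items()
    if old >= MIN_ARTICLES_2014
}

for name, pct in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    label = "hot" if pct > 0 else "cooling"
    print(f"{name}: {pct:+.1f}% ({label})")
```

The cutoff matters because a small base inflates percentages: the excluded hypothetical package would show a 125% jump from only 500 additional articles, which would dwarf far larger absolute shifts among the popular packages.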

Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2014 to 2015). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.
Python is the fastest growing. Note that the Python figures are strictly for data science use as defined here. The open-source KNIME and RapidMiner are the second and third fastest growing, respectively. Both use the easy yet powerful workflow approach to data science. Figure 2b showed that RapidMiner has almost twice the marketshare of KNIME, but here we see use of KNIME is growing faster. That may be due to KNIME’s greater customer satisfaction, as shown in the Rexer Analytics Data Science Survey. The companies are two of only four chosen by IT advisory firm Gartner, Inc. as having both a complete vision of the future and the ability to execute that vision (Fig. 3a).

R is in fourth place in growth, and given its second place in overall marketshare, it is in an enviable position.

At the other end of the scale are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use. Hadoop use declined slightly, perhaps as people turned to alternatives Spark and H2O.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I’ve plotted the same scholarly-use data for 1995 through 2015, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown in this particular graph. However, if you add up all the other software shown in Figure 2a, you come close. There still seems to be a slight decline in people reporting the particular software tool they used.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.
Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only a single point of SAS usage in 2015. The result is shown in Figure 2e. Freeing up so much space in the plot now allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack (recall that the curve for SAS has a steep downward slope). If the current trends continue, R will cross SPSS to become the #1 software for scholarly data science use by the end of 2017. Stata use is also growing more quickly than the rest. Note that trends have shifted before as discussed here. The use of Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another.

Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.
Using a logarithmic y-axis scales down the more popular packages, allowing us to see the full picture in a single image (Figure 2f). This view makes it more clear that R use has passed that of SAS, and that Stata use is closing in on it. However, even when one studies the y-axis values carefully, it can be hard to grasp how much the logarithmic transformation has changed the values. For example, in 2015 the value for SPSS is well over twice the value for R. The original scale shown in Figure 2d makes that quite clear.
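A quick calculation shows why a gap of "well over twice" can look small on a base 10 logarithmic axis. The two counts below are hypothetical round numbers chosen only to give a ratio above 2; they are not the values plotted in the figures.

```python
import math

# Hypothetical article counts, chosen only to illustrate the effect.
spss_articles = 80000
r_articles = 35000

ratio = spss_articles / r_articles
# On a log10 axis, the vertical gap is the difference of the logs,
# which equals the log of the ratio.
log_gap = math.log10(spss_articles) - math.log10(r_articles)

print(f"ratio on the original scale: {ratio:.2f}x")
print(f"gap on the log10 axis:       {log_gap:.2f} of one decade")
```

A factor of two on the original scale always corresponds to about 0.30 of a decade on a log10 axis, no matter how large the counts are, which is why the linear plot conveys the size of the gap more directly.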

Figure 2f. A logarithmic view of the number of scholarly articles found in each year by Google Scholar. This combines the previous two figures into one by compressing the y-axis with a base 10 logarithm.

Sunday, 5 June 2016

10 Minutes to Data Science

I have just been following the Johns Hopkins Executive Data Science course. In the first chapter of the course they say:

*In Data Science, the importance is science and not data. Data Science is only useful when we use data to answer the question.*
This is true up to a point. I have seen too many companies that brag about how big their data is but don't actually know how to pose a question. If the data can't be used to answer a question that helps the company grow, collecting it is not a good use of effort. Such companies push ever harder on data, data, data, but in the end machine learning only feeds on that data to make better predictions in service of the question. It's critical to pose the question first, and then get or build the data to answer it. More importantly, don't be afraid to get data from sources outside your company.
When investigating a problem and communicating to a broader audience, it's important to find the right question to answer, and then find the data needed to answer it. This is what all data science work looks like. Even in A/B testing, when we compute a confidence interval for the difference between the control and experiment groups, the real question is still how we can use data to answer our question.
In that course, Jeff Leek continues with the example of Moneyball. We can find evaluation metrics to measure players' skills, but the key question to answer is, "Can we be a winning team with a small budget?" Creating the best predictive model is not the most important thing. In the case of the Netflix Prize, the one-million-dollar algorithm could not be implemented because it was impractical to scale to all of Netflix's customers combined.
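The A/B testing remark above can be made concrete with a small sketch. It computes a 95% confidence interval for the difference in conversion rate between a control and an experiment group, using the normal (Wald) approximation, which is one common choice and assumes reasonably large samples. The function name and the visitor and conversion counts are hypothetical.

```python
import math


def diff_proportion_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Normal-approximation CI for p_b - p_a (z=1.96 gives roughly 95%)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical experiment: control converts 200/2000, experiment 260/2000.
low, high = diff_proportion_ci(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"difference in conversion rate: 95% CI = [{low:.4f}, {high:.4f}]")

if low > 0:
    print("The interval excludes zero: the experiment group likely converts better.")
else:
    print("The interval includes zero: the data does not settle the question.")
```

The interval is only a tool. The question it serves, whether the change is worth shipping, still has to be posed before the experiment is run.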

Statistics and Machine Learning

Statistics is mainly divided into two parts: Descriptive Statistics and Inferential Statistics. Descriptive Statistics, as the name implies, uses statistics to better understand the data. This includes using summary statistics and visualization to explore the data. Jake Vanderplas, the author of Python Data Science Handbook, showed on his blog how to use this approach to understand the patterns in Seattle's bicycle habits.
Inferential Statistics, in contrast, lets you use hypothesis testing and confidence intervals to make inferences about your assumptions. Suppose you have two groups with a numerical variable (female age vs. male age), and you want to find out whether these groups are significantly different from each other. Statistical inference requires you to follow a sound experimental design, so that your inference generalizes well to your population of interest and so that a correlation you find can support a causal claim. This method is useful for gaining insight into whether two variables are related to each other.
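For the two-group example above, one possible hypothesis test is an exact permutation test on the difference in means. This is a technique I am choosing for the sketch because it needs nothing beyond the standard library; the text does not say which test to use, and a t-test would be another common option. The ages are invented for illustration.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes and counts how often the gap in means is at least as
    large as the one observed.
    """
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        gap = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point noise does not drop exact ties.
        if gap >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical ages for two small groups.
female_ages = [23, 25, 28, 30, 31]
male_ages = [33, 35, 36, 40, 41]

p_value = permutation_p_value(female_ages, male_ages)
print(f"p-value: {p_value:.4f}")
```

A small p-value says the observed gap would be rare if group membership were unrelated to age. Full enumeration is only feasible for small samples like these; larger ones would call for random resampling or a parametric test.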
Machine Learning is one of the fields of artificial intelligence; it gives a machine the capability to learn from your data. Thanks to the growth in computing power and the hype around data science, machine learning has grown into a broad area. Two of its interesting topics are Supervised Learning and Unsupervised Learning. In Supervised Learning, given a set of inputs and their outputs, the algorithm learns to predict future outputs from future inputs. Think of a student who is given a stack of graded quizzes showing which answers are right and wrong; he learns from them and is then able to answer new quiz questions. An example of Unsupervised Learning is a clustering algorithm like the one we discussed earlier.
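The supervised learning idea, learning from labeled examples and then predicting the label of a new input, can be sketched with a one-nearest-neighbor classifier, which is about the simplest such method. The features and labels below are hypothetical, and the function name is my own.

```python
def nearest_neighbor_predict(train_inputs, train_labels, new_input):
    """Predict the label of the training example closest to new_input (1-NN)."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Index of the training example with the smallest distance to new_input.
    best = min(
        range(len(train_inputs)),
        key=lambda i: squared_distance(train_inputs[i], new_input),
    )
    return train_labels[best]


# Hypothetical "graded quizzes": (hours studied, hours slept) -> outcome.
train_inputs = [(1, 4), (2, 5), (8, 7), (9, 8)]
train_labels = ["fail", "fail", "pass", "pass"]

# A new, unlabeled student.
prediction = nearest_neighbor_predict(train_inputs, train_labels, (7, 6))
print(f"predicted outcome: {prediction}")
```

The labeled examples play the role of the graded quizzes: the method never sees a rule for passing, only past inputs paired with their known outcomes.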
Machine learning serves a different use case than statistical inference. Have you seen Kaggle competitions? Take a look at their leaderboards: the scores are all extremely close. Often anyone in the top 100 is already a winner. Competitors are willing to take on two to three times more complexity to get a 1% increase. This is often not practical, as we saw with the Netflix Prize, where the company gave up on the one-million-dollar code because it was too computationally expensive. So if you are in it for the accuracy game, go ahead! Otherwise, a simpler model is better.
So what do you use when you want to make a prediction? Use statistical modeling to understand what your prediction is. Use machine learning to make your prediction better. Statistical modeling is concerned with model complexity because we have to understand the model, whereas machine learning scales as complexity increases. Moreover, machine learning is concerned with tuning parameters for performance, while statistics is concerned with a parsimonious model (get understanding bett

Software Engineering

So why is software engineering important in data science? Because you will often have to get the data using a programming language. Sure, there are some reports that you can download from Google Analytics or a dashboard in your company, but they contain only summary and aggregate metrics. Things get harder if you need advanced or specialized metrics that require you to build the aggregation yourself. Log data will always contain some messy records. And if you have human-entered data, there is going to be human error. To deal with all of this, you need engineering skills. Pulling data from a database alone requires some programming or SQL skills. If you need additional data from open data sources on the web, you also need programming to get the data through an API.
So engineering skill is a critical part of data science. In fact, it is so critical that without it you won't get to the fun part, analyzing and making inferences or predictions, because you won't even know how to get the data and clean it. A software engineer alone can at least still do something: get some data, validate it by looking for inconsistencies, and compute some descriptive statistics to analyze it.
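
As a small illustration of building an aggregation yourself from messy log data, the sketch below counts successful page hits while skipping malformed lines. The log format (timestamp, page, status code), the sample lines, and the function name `successful_hits_per_page` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical log extract: "timestamp,page,status", with some messy lines.
raw_log = """\
2016-08-01 10:02:11,/pricing,200
2016-08-01 10:02:15,/signup,200
garbled line without fields
2016-08-01 10:03:40,/pricing,500
2016-08-01 10:04:02,/pricing,200
2016-08-01 10:05:19,/signup,
"""


def successful_hits_per_page(log_text):
    """Count lines with status 200 per page, ignoring malformed lines."""
    hits = Counter()
    for line in log_text.splitlines():
        parts = line.split(",")
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three fields
        _timestamp, page, status = parts
        if status.strip() == "200":
            hits[page.strip()] += 1
    return hits


hits = successful_hits_per_page(raw_log)
print(hits.most_common())
```

Real logs usually need more validation than this (timestamps, encodings, duplicate records), but the pattern of parsing, filtering and then aggregating stays the same.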

Toolbox

The data science toolbox is often reduced to a debate between R and Python, but they are two different things. Python comes from a software engineering background. Because Python first became popular in web development, it is strong at gathering and manipulating data. R, on the other hand, has a statistics background. A wide variety of statistical packages is available for it, and because statisticians also use visualization to gain insight into data, it is rich in visualization packages as well. So use Python when processing the data and R when you want to analyze it. Of course, the two sometimes overlap, and you can choose to stick to one language. You can manipulate strings in R, or do statistical analysis in Python. You have a few choices when you want to present the results of your analysis. For a narrative, you can use Jupyter or R Markdown to tell the story of your analysis. Sometimes you will want to engage your audience with your findings. In that case, you want to create an interactive visualization, and D3.js is great in this area.
So when you talk about data science, think more about the question that you want to answer. Can you get the information you need to answer the question? Do you have useful data to answer it? Even if you can answer it, is the answer practical to act on? Asking this series of questions will help you avoid missteps in the long run.

Saturday, 4 June 2016

The big data challenge: Extracting actual business value

You've got the tools and the power of the cloud to capture big data, but figuring out what you want from it and how to extract it is the final, crucial challenge.

Advances in data networks and storage mean organizations capture far more data than they ever have - perhaps a stream of measurements from manufacturing equipment, from vehicles, or from game-changers like web-enabled refrigerators (no, I've never seen one either).

The enterprise CTO may have the data storage part all figured out - their MongoDB cloud database is in place, or they rent DBaaS from Cloudant. But why? What does an enterprise do with all this unstructured data?

The first thing is to identify what the enterprise wants. Analytics can be an area of blind faith – if the enterprise is not clear about its big data needs, it may just hope that something good pops out.

Identify the big data needs.

Big data analytics, like all IT, is subordinate to business needs. An organization must figure out their requirements before working on big data.

No two organizations are the same, so there is always a variation in needs. The IT department may receive requirements like these.

Crunch data for instant reports.
Decode telemetry on the fly.
Find a needle in a haystack in a vast quantity of signals.
Find the regular operational patterns in a vast quantity of signals.

Analytics is a service-oriented area so the CTO could just finish his work there and outsource the rest. If he decides to keep it in-house, he needs a few more things.

Get some analytics applications.

Analytics applications help turn large data sets into business value. The enterprise uses analytics tools to tackle the difficult job of doing something useful with their unstructured data.

Data analytics products are one of the big data technologies and live in a data scientist's toolbox. Analytics products don't usually deliver ready-made business value.

When an organization purchases analytics applications, they must leave plenty of cash for the training budget. Complex tools are not intuitive.

Write a big data policy.

Managing large data sets is a difficult job. The big data manager has plenty of moving parts to configure to meet these requirements.

What is the retention policy? What parts of the data pool can be deleted, and when? What happens to the rest of the historical data?
What is the data protection policy? Who gets to view data? What are the privacy implications? What are the legal restrictions?
Where is the data stored? If a cloud provider is holding the data, how do we get it back?
What kind of meta-data is required? How can anyone identify the purpose of a big data store?
How many data sets are there, and how can they be blended?

Assemble an analysis team.

The first part of building a team is partnering up a business executive and an IT sponsor. Both are required.

There may be a data warehouse and data miners in the organization, but probably no data scientists. There are a few ways of getting some.

Hire experts. Pros are in demand.
Hire people with the right capability and let them learn.
Spot the budding statisticians in your organization and grab them.

Spotting capability means looking for clues. John Foreman is chief scientist at Mailchimp and writes a blog on data science. If someone is a fan of his work, that's a clue. Perhaps one of the data miners has an artistic streak. The person obsessively dragging consumer behaviour out of click trails is worth talking to.

That still leaves some gaps.

A few huge organizations, like telecoms companies and global retailers, have been battling with the problem of analytics for decades. They have specialist teams, home-grown tools, and years of experience. Alongside their expensive specialized capabilities, a brave new world of big data and commoditized data analytics is appearing. There is quite a way to go.

The enterprise is doing new things with existing data sets, rather than collecting new data.
Plenty of big data tools exist, but few tools are ready for business users.
Organizations in many parts of the world have not started exploiting big data.
Better machine learning is required to extract signal from noise.

It takes statistical, technical and business expertise to get value from big data. Even where the analytics tools exist, they must be tailored for business needs - it's not a one-size-fits-all world.

Over to you, big data startups around the world. Plug those gaps.

Sunday, 8 May 2016

Which Of The Five Types Of Data Science Does Your Startup Need?

Startups, you are doing data science wrong. That’s the title of a post penned by Ryan Weald in GigaOm this week. Weald echoes DJ Patil’s idea: “product-focused data science is different than the current business intelligence style of data science.”

Weald points to a different model of data scientist, an engineer, not a statistician, who can perform queries and based upon some insights, improve the product with a few code changes and a push to git.

I like Weald’s post but disagree on one point. I don’t think there is one type of data scientist, but five.

Quantitative, exploratory data scientists tend to have PhDs and use theory to understand behavior. I count Hal Varian, Chief Economist at Google, and Redpoint’s own Jamie Davidson, among them. Varian’s team researches the advertiser dynamics within the ads auction and compares those dynamics to theoretical auction models like the Vickrey auction. By combining theory and exploratory research, these data scientists improve products.
Operational data scientists often work in the finance, sales or operations teams at Google. In the AdSense ops team where I started, we had a star data analyst who each week would discuss our team’s performance: our email response times, the satisfaction scores of our publishers, and changes in publisher behavior by segment. His work provided a feedback loop to improve the team’s tactics and efficiency. Only infrequently were these insights used to influence product.
Product data scientists tend to belong to product management or engineering. This is the group of data scientists Weald writes about. PMs and engineers sift through logs and analysis tools to understand the way users interact with a product and leverage that knowledge to refine the product. At Google, the ads quality team analyzed user click data to improve ad targeting.
Marketing data scientists segment the user base, evaluate the performance of advertising campaigns, match product features to customer segments, and design content marketing campaigns. The marketing data scientist creates awareness and leads for the sales team, helping generate revenue.
Research data scientists create insights as a product. Nate Silver is arguably the most famous of them. Silver’s work doesn’t influence a product; the analysis is the product itself. Sometimes the data science leads to a thought leadership whitepaper, or a blog post, or a financial report. It’s rarer for startups to employ research scientists because the output isn’t tied to revenue. But larger companies like Google do, think tanks do, financial institutions do.

These five types of data scientists span almost every department of knowledge work. Sometime in the past thirty years, data science became inextricable from the day-to-day operation of these teams. Product, marketing, eng, sales all use data to make decisions. These teams use data to identify, understand and implement feedback loops and to reinforce the behavior a company desires.

To talk about data scientists might be too myopic. Your startup may need a research data scientist or one with a PhD. Or it may need an engineer with an understanding of basic statistics who can work up and down the Rails stack. Or another type altogether.

Like any role, when hiring or recruiting a data scientist it’s important to identify the key problems facing the business and the relevant skills the right candidate will need to solve those challenges.

Tableau explains why FC Barcelona is still the best team in Spain (and the whole world)

Introduction

Futbol Club Barcelona, also known as Barça, is a professional football club based in Barcelona, Catalonia, Spain. Founded in 1899 by a group of Swiss, English and Catalan footballers led by Joan Gamper, the club has become a symbol of Catalan culture and Catalanism, hence the motto "Més que un club" (More than a club). Unlike many other football clubs, the supporters own and operate Barcelona. It is the world's second-richest football club in terms of revenue, with an annual turnover of $613 million, and the third most valuable sports team, worth $2.6 billion. The official Barcelona anthem is the "Cant del Barça", and it is known to all 480 000 000 fans around the world.

Team Philosophy

Johan Cruyff and Charly Rexach returned to Barcelona in 1988 and began to install a philosophy that would change the way people play and view football forever. Both Cruyff and Rexach admit that it wasn't a completely new philosophy: it was adapted from ideas given to them by Michels, ideas that many believe had been given to Europe by the Hungarian side of the early 1950s. The foundation of this philosophy was, and still is, built upon the basic template of touch, technique, maintaining possession, and stretching the pitch with continuous circulation of the ball (Tiqui-Taka). These were elements that, at the time, were not valued by many Barcelona supporters (Hunter, 2012).

The chart below (OptaPro) shows Barcelona's dominance over the other La Liga teams and its playing philosophy of short passes and ball possession.

La Liga

Team Discipline and attractiveness (support by fans)

Why is FC Barcelona better than any other team this season, too?

Conclusion

With this in mind, it seems that a lot of clubs steal ideas from Barcelona that they see on the surface, such as tactics, and implement them in the short term, but fail to intertwine them into their own specifically moulded model. It is one thing to use Barcelona as inspiration, but it must be remembered that Barcelona's philosophy is tailored to THEIR own needs, no one else's, and the coaching and playing staff have grown together surrounded by it. They have lived by it and through it. Due to Barcelona's on-field success, many are using various aspects of the Barcelona model to shape their own coaching methods, training programmes and playing style, often missing the point that a club philosophy needs to be self-defined and fully committed to by all. There are many different styles of playing football, and Barcelona have their own unique style, but it should be remembered that this playing style is born out of wider and deeper beliefs in the cultural values of their own independent philosophy.
