Sunday, June 5, 2016

10 Minutes to Data Science



I just followed Johns Hopkins' Executive Data Science course. In the first chapter, the instructors say:

*In Data Science, what matters is the science, not the data. Data Science is only useful when we use data to answer a question.*
That is true to a large extent. I've seen too many companies that brag about how big their data is but don't actually know how to pose a question. If the data can't be used to answer a question that helps the company grow, it isn't worth much. They keep pressing the point of data, data, data, but in the end machine learning just feeds on that data to make a better prediction that answers the question. It's critical to pose a question first, then try to get or build the data to answer it. And just as importantly, don't be afraid to pull in data from sources outside your company.
When investigating a problem and communicating it to a broader audience, it's important to find the right question to answer first, and then find the data related to answering it. This is how all data science work looks. Even in A/B testing, where we measure the confidence interval between the control and experiment groups, the real point is still how we can use data to answer our question.
In that course, Jeff Leek continues with the example of Moneyball. We can find evaluation metrics to measure players' skills, but the key question to answer is, "Can we be a winning team on a small budget?" Building the best predictive model is not always what matters most: in the case of the Netflix Prize, the million-dollar algorithm was never put into production because it was too complex to scale to the full customer base.


Statistics and Machine Learning

Statistics is mainly divided into two parts: Descriptive Statistics and Inferential Statistics. Descriptive Statistics, as the name implies, uses statistics to better understand the data. This includes using summary statistics and visualization to explore it. Jake VanderPlas, the author of the Python Data Science Handbook, showed how to use this approach to understand the patterns in Seattle's bicycling habits on his blog.
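Here is a minimal sketch of what descriptive statistics looks like in practice with pandas and matplotlib. The file name and column names below are hypothetical stand-ins for a daily bicycle-count dataset, not the actual data from that blog post.

```python
# A minimal sketch of descriptive statistics: summary numbers plus a plot.
# "bicycle_counts.csv" and its columns (Date, Count) are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bicycle_counts.csv", parse_dates=["Date"], index_col="Date")

# Summary statistics: mean, standard deviation, quartiles, etc.
print(df["Count"].describe())

# Visual exploration: weekly totals make seasonal patterns visible.
df["Count"].resample("W").sum().plot(title="Weekly bicycle counts")
plt.show()
```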
Inferential Statistics, on the other hand, lets you use hypothesis testing and confidence intervals to make an inference about your assumption. Suppose you have two groups with a numerical variable (female age vs. male age), and you want to find out whether these groups are significantly different from each other. Statistical inference requires you to follow an experimental design, so that your inference generalizes well to your population of interest, and so that a correlation you find can more credibly suggest causation. This method is useful for getting insight into whether two variables are related to each other.
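A minimal sketch of that two-group comparison, assuming SciPy is available; the age samples here are simulated purely for illustration.

```python
# A minimal sketch of a two-sample hypothesis test with SciPy.
# female_age and male_age are simulated samples standing in for real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
female_age = rng.normal(loc=34, scale=8, size=200)  # hypothetical sample
male_age = rng.normal(loc=36, scale=8, size=200)    # hypothetical sample

# Welch's t-test: are the two group means significantly different?
t_stat, p_value = stats.ttest_ind(female_age, male_age, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Reject the null hypothesis of equal means at the 5% level only if p < 0.05.
```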
Machine Learning is a field of artificial intelligence that gives a machine the capability to learn from your data. Thanks to the modern era, where computing power has grown along with the hype around data science, machine learning has become a broad area. Two of its interesting topics are Supervised Learning and Unsupervised Learning. In Supervised Learning, given a set of inputs and outputs, the machine tries to predict future outputs from future inputs. Think of it as a student who, given a stack of graded quizzes marked with right and wrong answers, learns and becomes able to answer new quizzes. An example of Unsupervised Learning, on the other hand, is a clustering algorithm like the one we discussed earlier.
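To make the contrast concrete, here is a small sketch with scikit-learn, using its built-in iris data as a stand-in dataset; the model choices (logistic regression, k-means) are just illustrative.

```python
# A minimal sketch contrasting supervised and unsupervised learning
# with scikit-learn. The iris dataset is only a stand-in example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: learn from labeled examples, then predict unseen inputs.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels, just group similar rows together.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```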
Machine Learning is a different use case compared to Statistical Inference. Have you seen Kaggle competitions? Take a look at a leaderboard, and all the scores are extremely close; often just making the top 100 already counts as winning. Competitors are willing to take on 2-3 times more complexity to get a 1% increase. In practice that rarely pays off, and as we saw with the Netflix Prize, the million-dollar code was abandoned because its computation was too expensive. So if you're in it for the accuracy game, go ahead! Otherwise, a simpler model is better.
So what do you use when you want to make a prediction? Use statistical modeling to understand what your prediction is based on. Use machine learning to make your prediction better. Statistical modeling cares about the complexity of the model because we have to understand it, while machine learning scales up as complexity increases. Moreover, machine learning cares about tuning parameters for performance, while statistics cares about a parsimonious model (better understanding with a simpler model).
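A small sketch of the two mindsets side by side, assuming statsmodels and scikit-learn; the data is synthetic and exists only to show the difference between reading coefficients and tuning for fit.

```python
# A minimal sketch: an interpretable statistical model (OLS in statsmodels)
# next to a more flexible predictor (gradient boosting in scikit-learn).
# X and y are synthetic data used purely for illustration.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Statistical modeling: parsimonious, and you can read the coefficients.
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# Machine learning: tune parameters to squeeze out predictive performance.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05).fit(X, y)
print("in-sample R^2:", gbr.score(X, y))
```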


Software Engineering

So why is software engineering important in data science? Because you will often have to get data using a programming language. Sure, there are reports you can download from Google Analytics or from any dashboard in your company, but those only contain summary and aggregated metrics. Things get harder when you need advanced or specialized metrics that require you to build the aggregation yourself. Log data is always messy, and if you have human-entered data, there will be human error. Handling all of that takes engineering skills. Pulling data from a database alone needs some programming or SQL skills. And if you need additional open data from the web, you also need programming to get it through an API.
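As a rough sketch of what "getting the data yourself" means, here is one query against a local SQLite database and one web API call. The database file, table name, columns, and URL are placeholders, not real resources.

```python
# A minimal sketch of pulling data yourself: one SQL aggregation and one
# API call. "analytics.db", the pageviews table, and the URL are placeholders.
import sqlite3
import pandas as pd
import requests

# SQL: aggregate raw events into the metric you actually need.
conn = sqlite3.connect("analytics.db")  # hypothetical database file
daily = pd.read_sql_query(
    "SELECT date, COUNT(*) AS visits FROM pageviews GROUP BY date",
    conn,
)

# API: enrich internal data with a source from outside the company.
resp = requests.get("https://api.example.com/v1/weather",  # placeholder URL
                    params={"city": "Seattle"})
resp.raise_for_status()
weather = pd.DataFrame(resp.json())
```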
So engineering skill is a critical part of data science. In fact, it's so critical that without it you never get to the fun part, analyzing and making inferences or predictions, because you won't even know how to get the data and clean it. A software engineer alone, at least, can still do something: they can get some data, validate it by looking for inconsistencies, and compute some descriptive statistics to analyze it.


Toolbox

The data science toolbox often comes down to a debate between choosing R or Python, but they are two different things. Python comes from a software engineering background: since it first became famous for web development, gathering and manipulating data has become one of its strengths. R, on the other hand, has a statistician's background: there is a wide variety of statistical packages available, and because statisticians also use visualization to gain insight into data, it is rich in visualization packages too. So use Python when you are processing data, and R when you want to analyze it. Of course, the two overlap, and you can choose to stick to one language; you can manipulate strings in R, or do statistical analysis in Python. You also have a few choices when you want to present the results of your analysis. For a narrative, you can use Jupyter or R Markdown to tell the story of your analysis. Sometimes you end up wanting to engage your audience with your findings; in that case you want to create an interactive visualization, and D3.js is great in this area.
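As a small example of the "process with Python, then present" workflow, here is a sketch of cleaning and aggregating a few made-up records into the tidy summary you would then chart or narrate in a notebook.

```python
# A minimal sketch of data processing in Python before handing the result
# to a chart or a Jupyter narrative. The records below are made up.
import pandas as pd

raw = pd.DataFrame({
    "city": [" Oslo ", "bergen", "OSLO", "Bergen ", "oslo"],
    "spend": [120.0, 80.0, 150.0, 95.0, 110.0],
})

# String manipulation: normalize messy labels before aggregating.
raw["city"] = raw["city"].str.strip().str.title()

# Aggregation: the tidy summary you would then present or visualize.
summary = raw.groupby("city")["spend"].agg(["count", "mean"])
print(summary)
```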
So when you talk about data science, think first about the question you want to answer. Can you get the information needed to answer it? Do you have useful data for it? And if you can answer it, is the answer practical to act on? Walking through this series of questions will help you avoid missteps in the long run.
