Data Science Jedi: databases


Thursday, 11 August 2016

Last day(s) to participate for a chance to win a FREE space at our Data Science Boot Camp

A free place on our pioneering Data Science Boot Camp training programme is being offered by specialist recruitment agency, MBN Solutions. Places on the much-anticipated course, aimed at upskilling those with raw analytical grounding into bona fide data scientists, are worth £7,000. The average cost of recruiting a data science specialist is £15,000.

The Data Lab has partnered with New York’s globally renowned The Data Incubator (whose courses are reputedly harder to get into than Harvard) to develop the three-week data Boot Camp as part of a drive to plug the nation’s data skills gap. It is aimed at helping to unlock the economic potential of data, estimated to be worth £17 billion* in Scotland alone.

To apply for the MBN Solutions sponsored place, potential participants need to submit a video explaining how they would use the data science Boot Camp training in their current organisations. The video should be a maximum of two minutes long and include:

Your current role and experience
Why you want to take part in the course
Why you believe improving your skills in data science is important
How you hope to use the skills you will learn in the course to improve your work
What impact you expect to achieve for your organisation as a result of your skills

The video must be uploaded to YouTube and the link sent to skills@thedatalab.com by 12th August.

Michael Young, CEO of MBN Solutions, said: “With the average cost of recruiting a data scientist £15,000, the Boot Camp presents an incredible opportunity to upskill current staff and invest in your company’s data science offering.

“The Data Incubator is recognised as the go-to experts in the data training sector globally and, by sponsoring a place for a budding data scientist, we are helping to enhance Scotland’s pipeline of data science talent.

“Every day we see fantastic, innovative data science projects going on in our clients’ organisations. Scotland is leading the way in data science in the UK and The Data Lab are really driving the data agenda forward. Some countries are only just waking up to the potential of data. This course marks a really exciting time for Scotland and The Data Lab and we at MBN Solutions are thrilled to be a part of it.”

Brian Hills, Head of Data at The Data Lab, said: “We’re very pleased to have MBN Solutions sponsor a place on the Boot Camp which will take us one step closer to exploiting the data opportunity in great demand and short supply.

“It is going to be an incredible three weeks with attendees gaining a highly sought after data science skillset and learnings from world-leaders in data science.

“It’s crucial Scotland remains ahead of the curve in data science. By investing in our pipeline of talent and learning from international experts, we are securing our future and taking critical steps toward exploiting the data potential available here in Scotland.”

The pioneering training initiative will allow Scottish businesses to fast-track potential returns by using data analysis to drive insight and decision-making across industry. There are only a few places left for the Boot Camp, which will take place in September in Edinburgh. It will focus on developing practical application skills such as advanced Python, machine learning and data visualisation in a collaborative environment.

For further information on the Boot Camp, how to apply, and how to enter the competition, please check out our Boot camp page, download our brochure or email skills@thedatalab.com

About The Data Incubator

The Data Incubator is a data science education company based in NYC, DC, and SF with both corporate training and hiring offerings. They leverage real-world business cases to offer customized, in-house training solutions in data and analytics. They also offer partners the opportunity to hire from their 8-week fellowship, which trains PhDs to become data scientists. The fellowship selects 2% of its 2000+ quarterly applicants and is free for fellows. Hiring companies (including eBay, Capital One, AIG, and Genentech) pay a recruiting fee only if they successfully hire. You can read more about The Data Incubator on Harvard Business Review, VentureBeat, or The Next Web, or read about their alumni at Palantir or the NYTimes.

About MBN Solutions

In a field saturated by many lookalike recruitment consultancies, MBN is a truly different business. Priding ourselves on values of deep, real subject matter knowledge in the Data Science, Big Data, Analytics and Technology space, a passionate approach to developing our own consultants and a strategy placing our clients at the heart of our business, MBN are a true market defining ‘People Solutions’ business.

Sunday, 7 August 2016

The 7 Steps of a Data Project

By Alivia Smith, User Marketing Manager - Dataiku

It’s hard to know where to start once you’ve decided that yes, you want to become more data-driven. Just looking at all the technologies you have to understand and all the languages you’re supposed to master is enough to make you dizzy.

Well, building your first data project is actually not that hard. And yes, Dataiku DSS helps, but what will really help you is understanding the data science process. Becoming data-driven is about this: knowing the basic steps and following them to go from raw data to building a machine learning model.

The steps to complete a data project were conceptualized a while ago as the KDD process (for Knowledge Discovery in Databases), and made popular with lots of vintage-looking graphs like this one.

This is our take on the steps of a data project in this awesome age of big data!

STEP 1: UNDERSTAND THE BUSINESS

Understanding the business is the key to ensuring the success of your data project. To motivate the different actors necessary to getting your project from design to production, your project must be the answer to a clear business need. So before you even think about the data, go out and talk to the people who need to make their processes or their business better with data. Then sit down and define a timeline and concrete indicators to measure. I know, processes and politics seem boring, but in the end, they turn out to be quite useful!

If you’re working on a personal project, playing around with a dataset or an API, this may seem irrelevant. It’s not. Just downloading a cool open data set is not enough. I can’t tell you how many cool datasets I downloaded and never did anything with… So settle on a question to answer, or a product to build!

STEP 2: GET YOUR DATA

Once you’ve gotten your goal figured out, it’s time to start looking for your data. Mixing and merging data from as many data sources as possible is what makes a data project great, so look as far as possible.

Here are a few ways to get yourself some data:

Connect to a database: ask your data and IT teams for the data that’s available, or open your private database up, and start digging through it, and understanding what information your company has been collecting.
Use APIs: think of the APIs to all the tools your company’s been using, and the data these guys have been collecting. You have to work on getting these all set up so you can use those email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support ticket somebody submitted, etc. If you’re not an expert coder, plugins in DSS give you lots of possibilities to bring in external data!
Look for open data: the Internet is full of datasets to enrich what you have with extra information; census data will help you add the average revenue for the district where your user lives, or OpenStreetMap can show you how many coffee shops are on their street. A lot of countries have open data platforms (like data.gov in the US). If you’re working on a fun project outside of work, these open data sets are also an incredible resource! Check out Kaggle, or this GitHub repository with lots of datasets, for example.
Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like Twitter or Facebook, to analyze your followers and friends. It’s extremely easy to set up these connections with tools like IFTTT. For example, I have a bunch of recipes that collect the music I listen to, the places I visit, my steps and the kilometers I run, the contacts I add, etc. And this can be useful for businesses as well! You can analyze very interesting trends on Twitter, or even monitor the competition.
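As a minimal sketch of mixing two of the sources above, the snippet below joins rows from a database with an open-data file. Everything in it is a hypothetical stand-in: the in-memory SQLite table plays the role of a company database, the inline CSV plays the role of a census download, and the table, column and district names are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical company data: an in-memory SQLite table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, district TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "North"), (2, "South"), (3, "North")],
)

# Hypothetical open data: a census-style CSV with the average revenue per district.
census_csv = io.StringIO("district,avg_revenue\nNorth,42000\nSouth,35000\n")
avg_revenue = {
    row["district"]: int(row["avg_revenue"]) for row in csv.DictReader(census_csv)
}

# Enrich each user record with the open-data value for its district
# (None when the district is not present in the open data).
enriched = [
    {"user_id": user_id, "district": district, "avg_revenue": avg_revenue.get(district)}
    for user_id, district in conn.execute(
        "SELECT user_id, district FROM users ORDER BY user_id"
    )
]
conn.close()

for record in enriched:
    print(record)
```

In a real project the lookup key (here the district name) is usually the hard part: it has to exist, and be spelled the same way, in both sources.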

STEP 3: EXPLORE AND CLEAN YOUR DATA

(AKA the dreaded preprocessing step that typically takes up 80% of the time dedicated to a data project)

Once you’ve gotten your data, it’s time to get to work on it! Start digging to see what you’ve got and how you can link everything together to answer your original goal. Start taking notes on your first analyses, and ask questions to business people, or the IT guys, to understand what all your variables mean! Because not everyone will get that c06xx is a product category referring to something awesome.

Once you understand your data, it’s time to clean it! You’ve probably noticed that even though you have a country feature for instance, you’ve got different spellings, or even missing data. It’s time to look at every one of your columns to make sure your data is homogeneous and clean.
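To make the country example concrete, here is a small sketch of normalizing spellings and counting missing values. The mapping table and the sample values are assumptions chosen for illustration; a real project would build the mapping from the spellings actually observed in the data.

```python
# Hypothetical mapping from observed spellings to one canonical country name.
CANONICAL = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def clean_country(value):
    """Return a canonical country name, or None when the value is missing or unknown."""
    if value is None:
        return None
    # Ignore surrounding whitespace, letter case and dots ("U.S." becomes "us").
    key = value.strip().lower().replace(".", "")
    if not key:
        return None
    return CANONICAL.get(key)


raw = ["USA", " u.s. ", "United Kingdom", "", None, "UK"]
cleaned = [clean_country(v) for v in raw]
missing = sum(1 for v in cleaned if v is None)

print(cleaned)
print("missing values:", missing)
```

Returning None for unknown spellings, rather than guessing, keeps the problem visible so the mapping can be extended deliberately.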

Warning! This is probably the longest, most annoying step of your data project. Data scientists report data cleaning is about 80% of the time spent on a project. So it’s going to suck a little bit. Luckily, tools like Dataiku DSS can make this much faster!

STEP 4: ENRICH YOUR DATASET

Now that you’ve got clean data, it’s time to manipulate it to get the most value out of it. This is the time to join all your different sources, and group logs, to get your data down to the essential features.

You’ll then start manipulating the data to extract lots of valuable features. For example, getting a country and even a town out of a visitor’s IP address. Extracting time of day, or week of year from your dates to get something more meaningful.
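The date example can be sketched in a few lines. The function name, the feature names and the cut-offs for the parts of the day are assumptions for illustration, not a fixed convention.

```python
from datetime import datetime


def date_features(timestamp):
    """Derive a few simple features from an ISO-format timestamp string."""
    dt = datetime.fromisoformat(timestamp)
    # Hypothetical cut-offs for the part of the day.
    if dt.hour < 6:
        part = "night"
    elif dt.hour < 12:
        part = "morning"
    elif dt.hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    return {
        "hour": dt.hour,
        "part_of_day": part,
        "week_of_year": dt.isocalendar()[1],  # ISO 8601 week number
        "is_weekend": dt.weekday() >= 5,      # Saturday is 5, Sunday is 6
    }


features = date_features("2016-08-07T14:30:00")
print(features)
```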

The possibilities are pretty much endless, and you’ll get a pretty good idea by scrolling through Dataiku DSS’s processors in the Lab of the operations you can execute.

STEP 5: BUILD VISUALISATIONS


You now have a nice dataset (or maybe several), so this is a good time to start exploring it by building graphs. When you’re dealing with large volumes of data, they’re the best way to explore and communicate your findings.

You’ll find lots of tools available that make this step fun to prepare and to receive. The tricky part is always to be able to dig into your graphs to answer any question somebody would have about an insight. That’s when the data preparation comes in handy: you’re the one who did the dirty work, so you know the data like the back of your hand!

If this is the final step of your project, it’s important to use APIs and plugins so you can push those insights to where your end users want to have them. So get integrated with their tools!

Your graphs don’t have to be the end of your project though. They’re a way to uncover more trends that you want to explain. They’re also a way to develop more interesting features. For example, by putting your data points on a map you could perhaps notice that specific geographic zones are more telling than specific countries or cities.

STEP 6: GET PREDICTIVE

By working with clustering algorithms (aka unsupervised), you can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express what feature is decisive in these results. Tools like Dataiku DSS help beginners run basic open source algorithms easily in clickable interfaces.
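To give a feel for what a clustering algorithm does, here is a bare-bones k-means sketch in plain Python. It is one common clustering method, picked here as an assumption; the text above does not say which algorithms are meant. The sample points and the starting centroids are made up.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid is the mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels


# Hypothetical two-feature events forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])

print(labels)
print(centroids)
```

The final centroids are what make the result readable: comparing them feature by feature shows which features separate the groups.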

More advanced data scientists can then go even further and predict future trends with supervised algorithms. By analyzing past data, they find features that have impacted past trends, and use them to build predictions. More than just gaining knowledge, this final step can lead to building whole new products and processes. To get these in production though, you’ll need the intervention of data scientists and engineers, but it’s important to understand the process so all the parties involved (business users and analysts as well), will be able to understand what comes out in the end.
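As a very small illustration of learning from past data to predict a future value, the sketch below fits a straight line by ordinary least squares and extrapolates one step ahead. Linear regression is chosen here only because it is the simplest supervised method; the monthly figures are invented and deliberately perfectly linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past data: month number and sales for that month.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # prediction for month 6

print(slope, intercept, forecast)
```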

STEP 7: ITERATE

The main goal in any business project is to prove its effectiveness as fast as possible to justify, well, your job. Data projects are the same. By gaining time on data cleaning and enriching, you can get to the end of the project fast and get your first results. These first insights will be a great start for uncovering more necessary cleaning and developing more features, in order to continuously improve results and model outputs.

Now that you’ve got the skills, get started right now by building projects in Dataiku DSS!

Friday, 8 July 2016

Evolution of R

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with procedures written in the C, C++, .NET, Python or FORTRAN languages for efficiency.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

R is free software distributed under a GNU-style copyleft, and an official part of the GNU project called GNU S.

Evolution of R

R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.

A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Features of R

As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R −

R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the computer or printed on paper.

In conclusion, R is the world’s most widely used statistics programming language. It's the #1 choice of data scientists and is supported by a vibrant and talented community of contributors. R is taught in universities and deployed in mission-critical business applications. This tutorial will teach you R programming along with suitable examples in simple and easy steps.

Friday, 17 June 2016

How to Become a Data Scientist (Part 2/3)

Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.

In this chapter, we will explore the options at your disposal – but first – we will begin proceedings by discussing an important notion that concerns data science and learning.

Continual Learning

Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) is evolving so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).

Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Conferences and networking with the data science community are effective ways of keeping on top of the latest developments. And regularly reading books and papers is very important (on this: if you do not have a research background, it is worth learning how to read academic papers properly).

Play. Build. Experiment.

Going back to the message we touched on in Chapter One, there is only one way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the points I wanted to make, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.

This is what you should take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:

“Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn”

The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.

Kaggle / Open-Source / Freelancing

If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:

“If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD”

There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.

Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.

Finally, if you can get freelancing work, this is a great way to build a track record and demonstrates that you can operate in a commercial environment. And rather conveniently, you could even utilize the Experfy platform for that purpose.

To PhD or not to PhD

Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted:

“The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again”

And apart from anything, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.

But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:

“If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?”

Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.

There is also the employability aspect to consider: are you more employable as a PhD graduate than after spending the same time on self-study? I do not have sufficient evidence to comment, but either way, what matters more is whether you have truly spent the time building up expert capability (and how you can evidence this). PhDs are certainly valuable, but there are great data scientists with PhDs and great ones without.

Other University Degrees

So a PhD is not for you – perhaps it is the cost, or perhaps you have not yet developed the expertise necessary for research of this nature. Whatever the reason – there is no need to panic – because many universities are now offering Bachelors, Masters and Diplomas specifically designed for data science, where both computer science and mathematics/statistics are on the curriculum (the attentive reader will remember this from Chapter Two).

Courses like these will certainly take you in the right direction, but take note: they won't be enough to convert you into a ready-made data scientist, because as we know – that takes experience.

Online Courses

In a similar sense – even if you come from another quantitative field – a few online courses will not make you an expert, and remember: this is an expert field. But even if an online course were enough to master a chosen subject, you would still face competition, who – in all likelihood – will have far more practical and commercial experience in these areas. This is really important to be conscious of, and so we will return to this in Chapter Four.

All this being said, online courses are incredibly useful tools to help kick-start your journey, or begin learning a new area (like deep learning, for example). The most popular courses are found via Coursera, Udacity and edX, with Dylan Hogg describing Andrew Ng’s Machine Learning on Coursera as “an absolute pre-requisite for anyone who does not have a research background”.

The following is by no means a complete list, or mandatory for that matter, but these also stood out to Dylan and some of the other data scientists we have met so far:

Machine Learning: Intro to Machine Learning (Udacity)
Deep Learning: Deep Learning (Udacity), Neural Networks for Machine Learning (Coursera)
Spark: Big Data Analysis with Spark (edX), Distributed Machine Learning with Spark (edX)

During my interactions with the Experfy team, I've found out that they were also launching a training platform. You can see a preview here.

Books

Needless to say: good books are an invaluable resource and our favourite data scientists advocated the following:

Pattern Recognition and Machine Learning by Christopher Bishop
Machine Learning: a Probabilistic Perspective by Kevin P. Murphy
Why: A Guide to Finding and Using Causes by Samantha Kleinberg (if you want to know why this is important, take a look at Yanir Seroussi’s blog post on: Why You Should Stop Worrying About Deep Learning and Deepen Your Understanding of Causality Instead)
An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani, which, according to Dylan: “is a great introduction to statistical learning and is an accessible version of the more advanced classic”: Elements of Statistical Learning
And for a different suggestion, Will Hanninger recommended The Pyramid Principle by Barbara Minto. It does not cover data science specifically, but is valuable for problem solving and presenting

Presenting / Communicating

If you do not have a natural disposition for communicating – especially with non-technical people – this is something you will need to work on (see Chapter One for why). Gaining practice and obtaining feedback is the best way to improve your soft skills, although Yanir also recommended the classic book by Dale Carnegie: How to Win Friends and Influence People.

Wednesday, 15 June 2016

The Professionalization of Data Science

There has been much discussion and debate about the definition of data science and the new rare breed of sexy bird called the data scientist. The Data Science Association defines "Data Science" as the scientific study of the creation, validation and transformation of data to create meaning; and the "Data Scientist" as a professional who uses scientific methods to liberate and create meaning from raw data.

While these definitions may appear overbroad, think about the definitions of a lawyer or physician. A lawyer is a legal professional who can help prevent or solve legal issues and a physician is a health professional who can help prevent or cure health issues. Like the professionalization of law and medicine in the past hundred years, data science is at the very beginning of becoming a profession - with competency standards and a Data Science Code of Professional Conduct.

This means that data science will evolve into a profession where data scientists specialize in different areas - like lawyers and physicians. When you need to hire a lawyer you usually consider the special area of law that a lawyer practices. If you have a tax problem you hire a tax lawyer, not a divorce lawyer. If you have a heart problem you do not hire a gynecologist.

The simple truth is that data science is a vast and complicated field and - like law and medicine - much too big and complex for a person to master in one lifetime. My colleague Gary Mazzaferro has been exploring the concepts and ideas surrounding data science and definitions as formalizations aligning with knowledge economies and the knowledge / science / technology maturity models. Gary has (to date) defined the following data science specializations and types of data scientists:

Data Science: A field of systematic interdisciplinary study to elucidate relationships across and within Formal, Social, Natural and Special Sciences phenomena through the application of scientific methods. Interdisciplinary areas include analytical processes, mathematics, probability and statistics, logic, modeling, machine learning, algorithms, communications, traditional sciences, business, public policy and philosophy.

Blue Sky Data Science: A purely curiosity-driven exploratory branch of Data Science oriented towards developing and establishing understanding of relationships across and within phenomena, with no focus on specific goals or immediate application.

Basic Data Science: A branch of Data Science research focused on clearly defined goals and oriented towards developing and establishing understanding of relationships across and within phenomena.

Applied Data Science: A branch of Data Science oriented toward the development of practical applications, technologies and other interventions, including engineering practices. Applied Data Science bridges the gap between Basic Data Science and the engineering domains to provide predictable, usable tools to industries, including standard methods and practices.

Data Science Practice: The regular performance of Applied Data Science activities and methods for private and public organizations. May practice externally or internally. Practice may necessitate additional disciplines based on the needs of the organization including domain expertise and communications supporting presentation and reporting activities.

Data Scientist: A person that studies or has expert knowledge of the interdisciplinary field of Data Science.

Blue Sky Data Scientist: A person that studies or researches in the branch of Blue Sky Data Science.

Basic Data Scientist: A person that studies, researches or has expert knowledge in the branch of Basic Data Science.

Applied Data Scientist: A person that studies or researches in the branch of Applied Data Science.

Note that this is a preliminary list and is not complete. The profession of data science will evolve to create many specializations. After all, it took law and medicine over one hundred years to evolve as professions with different specialties.

Tuesday, 14 June 2016

Moneyball: Sports Analytics in Soccer to Predict Performance and Outcomes

There is no doubt that soccer is the most popular sport in the world, and its popularity is growing in the US. Over 25 million fans watched the U.S. in the 2015 FIFA Women’s World Cup, earning Fox over $40 million in ad revenue. Similarly, the United States’ 2-2 draw with Portugal in the 2014 World Cup was seen by an average of 24.7 million viewers on Univision and ESPN.

What's Sports Analytics?

Sports analytics is the set of processes that identify and acquire knowledge and insight about potential players’ performances, based on a variety of data sources such as game data and individual player performance data. These advanced and sophisticated types of analytics should be able to extract valuable, actionable insights for coaches and managers to use.

Sports analytics can be utilized in various domains including:
Predicting the outcome of a game
Predicting the performances of teams or individual players
Building new strategies for upcoming competitions
Deciding the price of a player if a club was to rent/sell/buy him or her
Connecting players to brands and sponsors

Of course, not all teams use analytical tools. In addition to the costs involved, there’s also the problem of explaining complex analytical methods to coaches in ways they can understand. Thus, soccer analytics is more widespread in big clubs where they have the necessary financial power to utilize these methods.

But that should not necessarily be the case. Since traditional sports analysts do not reveal the logic behind their methods, we’ve decided to give you a flavor of what can be done with soccer data. While this post will not answer all the questions you have about soccer analytics, it can help you understand how to get started.

In this project, we used a very limited number of player attributes (see the section under Player’s Features/Attributes) that are easy and inexpensive to gather, both to reverse engineer the most advanced Rating and Performance index (the results are presented in Figures 1-12) and to propose a more robust and simpler model for future player ratings and performance prediction (see the section under Machine-Learning and AI Models). The program runs on Spark in a cloud environment, and can be used for terabyte- to petabyte-scale data spanning multiple years, thousands of players and hundreds of attributes.

What's Soccer Analytics?

Soccer Analytics is the art of creating insights and actionable decisions using soccer-related data. While predictive analytics uses big data to determine the probability or the likelihood of a certain outcome, intelligent descriptive analytics looks at big data and analyzes it using machine learning and artificial intelligence methods to come up with suggestions that will improve the likelihood of a desired outcome.

Some important concepts to know while conducting this analysis are:
Game Modeling: Modeling of the game before, during and after the game using scientific techniques to match or predict a set of outcomes.
Expert Player Rating: Players ratings given by an expert. These ratings take a black-box approach, and they vary according to the prior knowledge of the expert.
Soccer Performance Analytics: A tool that helps players, coaches and managers quantitatively assess player and team performance, improve both, and design a set of winning strategies for upcoming game(s).

Here are some questions to ask when running analysis on soccer players:
How do expert ratings differ from ratings generated by data-driven Machine Learning models?
What are the most important players’ attributes linked to their performance?
What criteria do experts use when they evaluate players? Is there a way to reverse engineer their criteria?
Which attributes are important for each specific position?
Can we use ratings and players’ attributes to predict the outcome of a game?
Can we aggregate the players’ ratings to come up with a team rating?
Is there a way to correlate the team rating to the outcome of the game?
Do the outcomes of games influence expert ratings more than individual performance indicators do?
Can we predict the outcome of a new game given the past performance of the players?
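
One way to approach the team-rating question above is sketched below. It is only an illustration under stated assumptions: the team names, player ratings, and minutes played are hypothetical, and weighting by minutes is one of many possible aggregation choices, not the method used in this project.

```python
from collections import defaultdict

# Hypothetical (team, player, rating, minutes played) records; not data from this project.
PLAYER_RATINGS = [
    ("Reds", "A", 7.5, 90),
    ("Reds", "B", 6.0, 45),
    ("Reds", "C", 8.0, 90),
    ("Blues", "D", 6.5, 90),
    ("Blues", "E", 7.0, 90),
    ("Blues", "F", 5.5, 30),
]


def team_ratings(records):
    """Aggregate player ratings into one minutes-weighted rating per team."""
    weighted_sum = defaultdict(float)
    total_minutes = defaultdict(float)
    for team, _player, rating, minutes in records:
        weighted_sum[team] += rating * minutes
        total_minutes[team] += minutes
    return {team: weighted_sum[team] / total_minutes[team] for team in weighted_sum}


ratings = team_ratings(PLAYER_RATINGS)
# The team with the higher aggregate rating is treated as the favourite.
favourite = max(ratings, key=ratings.get)
print(ratings, favourite)
```

Whether such a team rating actually tracks match outcomes would still have to be checked against real results.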

Using advanced analytics and visualization tools such as Machine Learning and network analytics to predict the outcomes of soccer games is becoming more and more popular, as easier-to-use tools move these methods into the mainstream.
Some insights to our approach:

In this project, we used a selected set of clustering and classification techniques; the best model was selected and ranked through a Train-Validation-Test process.
Companies such as OPTA, Prozone, Amisco, and WhoScored now collect rich soccer data, which can be used to conduct accurate assessments.
For our project, we used a rich data set containing more than 210 player attributes, including 198 performance statistics. Some or all of the attributes were used to calculate the players’ overall performance and ratings. Some of the most advanced Expert Ratings include the Capello Index, the Castrol Index, and WhoScored.com. These ratings include each player’s cumulative ratings and game-based ratings.
Many Machine Learning models can be used for classification-regression and clustering. For classification-regression, you can use SVMs, logistic regression, linear regression, naive Bayes, Regression by Discretization using J48, Additive Regression with Decision Stump, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression, Multilayer Perceptron, and RBF Network. For clustering, you can use k-means, affinity propagation, Agglomerative Clustering (Ward, Average, and Complete), Gaussian mixture, power iteration clustering (PIC), and latent Dirichlet allocation (LDA). Furthermore, you can use dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA) to reduce the feature space. In our case study, we tested all the models; the best results were combined and are presented in Figures 1-12, without elaborating on specific models or how they can be aggregated.
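
To make the Train-Validation-Test idea concrete, here is a minimal sketch in plain Python. It is not the pipeline used in this project: the data are synthetic, the two candidate models (a mean baseline and a one-attribute least-squares line) are stand-ins for the much richer model families listed above, and the 60/20/20 split is an assumption.

```python
import random


def split(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and cut them into train, validation and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_mean(rows):
    """Baseline: always predict the mean rating seen in training."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Ordinary least squares for one attribute: y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, rows):
    """Mean squared error of a fitted model on (attribute, rating) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic (attribute, rating) pairs: rating = 40 + 0.5 * attribute plus small noise.
noise = random.Random(1)
data = [(x, 40 + 0.5 * x + noise.uniform(-1, 1)) for x in range(40, 100)]

train_rows, val_rows, test_rows = split(data)
candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(train_rows) for name, fit in candidates.items()}   # fit on train only
val_error = {name: mse(model, val_rows) for name, model in fitted.items()}
best = min(val_error, key=val_error.get)                               # select on validation
test_error = mse(fitted[best], test_rows)                              # report on test
print(best, round(test_error, 3))
```

The point of the three-way split is that the test rows play no part in fitting or in choosing the model, so the final error is an honest estimate for unseen players.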
Players' Features/Attributes:

In this project, we used a subset of the players’ features/attributes from the following list. Keep in mind that your model should be able to select and rank these attributes based on their importance. In this post, we do not elaborate on which features the models selected, as these will depend on the specific approach you choose to take when building your model.
Nationality, Club, League, Age, Height, Strong Foot, Position (GK, CB, RB, LB, DM, CM, RM, LM, AM, RW, LW, SS, CF)
Attacking Prowess, Ball Control, Dribbling, Low Pass, Lofted Pass, Finishing
Place Kicking, Swerve, Header, Defensive Prowess, Ball Winning, Kicking Power, Speed, Explosive Power, Body Balance, Jump, Stamina, Goalkeeping, Saving, Form, Injury Resistance, Weak Foot Use, Weak Foot Accuracy, Trickster, Mazing Run, Speeding Bullet, Incisive Run, Long Ball Expert, Early Cross, Long Ranger
Scissors Feint, Flip Flap, Marseille Turn, Sombrero, Cut Behind & Turn, Scotch Move, Long Range Drive, Knuckle Shot, Acrobatic Finishing, First-time Shot, One-touch Pass, Weighted Pass, Pinpoint Crossing, Outside Curler, Low Punt Trajectory, Long Throw, GK Long Throw, Man Marking, Track Back, Captaincy, Super-sub, Fighting Spirit
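
As a rough illustration of what "select and rank these attributes" can mean, the sketch below ranks a few attributes by the absolute Pearson correlation between each attribute and the rating. This is a simple filter-style ranking chosen for illustration, not the selection method used by our models, and the five player rows are hypothetical values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical attribute values and ratings for five forwards; not data from this project.
players = [
    {"Finishing": 90, "Speed": 80, "Goalkeeping": 42, "rating": 88},
    {"Finishing": 85, "Speed": 88, "Goalkeeping": 40, "rating": 85},
    {"Finishing": 70, "Speed": 75, "Goalkeeping": 41, "rating": 74},
    {"Finishing": 78, "Speed": 90, "Goalkeeping": 43, "rating": 80},
    {"Finishing": 65, "Speed": 70, "Goalkeeping": 42, "rating": 70},
]


def rank_attributes(rows, target="rating"):
    """Return (attribute, |correlation with target|) pairs, strongest first."""
    ys = [row[target] for row in rows]
    attributes = [key for key in rows[0] if key != target]
    scores = {a: abs(pearson([row[a] for row in rows], ys)) for a in attributes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_attributes(players)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Correlation only captures linear, one-attribute-at-a-time relationships, so position-specific or model-based importance measures may rank the attributes differently.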

While we do not intend to disclose the final features selected by our optimized model and strategy, we used the following ML-AI based aggregator model (Figure 13 shows the structure of our Committee Machine).
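
For readers unfamiliar with the term, a committee machine combines the predictions of several models into one. The sketch below shows one common, simple variant: a weighted average where each member's weight is proportional to the inverse of its validation error. It is an assumption-laden illustration, not the Multi-Level-Aggregator Tree of Figure 13; the member names, errors, and predictions are hypothetical.

```python
def committee_weights(validation_errors):
    """Give each member a weight proportional to 1 / its validation error."""
    inverse = {name: 1.0 / error for name, error in validation_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def committee_predict(predictions, weights):
    """Combine member predictions into a single weighted-average prediction."""
    return sum(weights[name] * predictions[name] for name in weights)


# Hypothetical validation errors (MSE) for three committee members.
validation_errors = {"tree": 4.0, "svm": 2.0, "mlp": 4.0}
weights = committee_weights(validation_errors)

# Hypothetical rating predictions from each member for one player.
member_predictions = {"tree": 80.0, "svm": 84.0, "mlp": 82.0}
rating = committee_predict(member_predictions, weights)
print(weights, rating)
```

Members with lower validation error pull the combined rating towards their own prediction, which is the basic intuition behind aggregating models rather than picking a single winner.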
Figures:

Finally, here's a list of the plots and figures created as a result of this analysis:

Figure 1. Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Forward Players. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Figure 2. Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Goalkeepers. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Figure 3. Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Defensive Players. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Figure 4. Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the MidField Players. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Figure 5. Clustering (similarities of players’ clusters) using advanced Visual-Analytics-Clustering for Forward Players. The similarities of the individual players are shown by the lines and their strength with the width of lines on clustering plot.

Figure 6. Clustering (similarities of players’ clusters) using advanced Visual-Analytics-Clustering for Forward Players. Similar clusters are closer to each other.

Figure 7. Typical performance of our Regression model’s Prediction for the rating of the Forward players (Prediction Vs. Actual). Train data for Forward Players.

Figure 8. Typical performance of our Regression model’s Prediction for the rating of the Forward players (Prediction Vs. Actual). Validation data for Forward Players.

Figure 9. Typical performance of our Regression model’s Prediction for the rating of the Forward players (Prediction Vs. Actual). Test data for Forward Players.

Figure 10. Typical performance of our Regression model’s Prediction for the rating of the players (Prediction Vs. Actual). Test data for Forward-MidField-Defensive Players.

Figure 11. Typical performance of our Regression model’s Prediction for the rating of the players (Prediction Vs. Actual). Validation data for Forward-MidField-Defensive Players.

Figure 12. Typical performance of our Regression model’s Prediction for the rating of the players (Prediction Vs. Actual). Test data for Forward-MidField-Defensive Players.

Figure 13. Committee Machine, and Intelligent Multi-Level-Aggregator Tree (MAT) platform.

Monday, 13 June 2016

Everything you ever wanted or needed to know about Big Data #BigData

Big Data is a phrase that gets bandied about quite a bit in the media, the board room – and everywhere in between. It’s been used, overused and used incorrectly so many times that it’s become difficult to know what it really means. Is it a tool? Is it a technology? Is it just a buzzword used by data scientists to scare us? Is it really going to change the world? Or ruin it?

This post is all about demystifying the mess that has become Big Data, and more importantly demonstrating how you can use it to improve your bottom line.

What Is Big Data?

First of all, what is Big Data? In its purest form, Big Data describes a volume of both structured and unstructured data so large that it is difficult to process using traditional techniques. So Big Data is just what it sounds like – a whole lot of data.

The concept of Big Data is a relatively new one, and it represents both the increasing amount and the varied types of data that are now being collected. Proponents of Big Data often refer to this as the “datafication” of the world. As more and more of the world’s information moves online and becomes digitized, analysts can start to use it as data. Things like social media, online books, music, videos and the growing number of sensors have all added to the astounding increase in the amount of data available for analysis.

Everything you do online is now stored and tracked as data. Reading a book on your Kindle generates data about what you’re reading, when you read it, how fast you read it and so on. Similarly, listening to music generates data about what you’re listening to, when, how often and in what order. Your smartphone is constantly uploading data about where you are, how fast you’re moving and what apps you’re using.

What’s also important to keep in mind is that Big Data isn’t just about the amount of data we’re generating, it’s also about all the different types of data (text, video, search logs, sensor logs, customer transactions, etc.). In fact, Big Data has four important characteristics that are known in the industry as the 4 V’s:

Volume – the increasing amount of data that is generated every second
Velocity – the speed at which data is being generated
Variety – the different types of data being generated
Veracity – the messiness of data, i.e. its unstructured nature

Based on the incredible amount, speed, variety and unstructuredness of the data we are now generating and storing, it’s no surprise that it quickly became unmanageable using traditional storing and analysis methods. This is where the term Big Data becomes confusing, because it is often used to refer to the new technologies, tools and processes that have sprung up to accommodate this vast amount of data.

Glossary of Big Data Terms

Inevitably, much of the confusion around Big Data comes from the variety of new (for many) terms that have sprung up around it. Here is a quick run-down of the most popular ones:

Algorithm – mathematical formula run by software to analyze data
Amazon Web Services (AWS) – collection of cloud computing services that help businesses carry out large-scale computing operations without needing the storage or processing power in-house
Cloud (computing) – running software on remote servers rather than locally
Data Scientist – an expert in extracting insights and analysis from data
Hadoop – collection of programs that allow for the storage, retrieval and analysis of very large data sets
Internet of Things (IoT) – refers to objects (like sensors) that collect, analyze and transmit their own data (often without human input)
Predictive Analytics – using analytics to predict trends or future events
Structured v Unstructured data – structured data is anything that can be organized in a table so that it relates to other data in the same table. Unstructured data is everything that can’t.
Web scraping – the process of automating the collection and structuring of data from web sites (usually through writing code)
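
To make the last two glossary entries more tangible, here is a small sketch of web scraping that turns markup into structured rows. It uses only Python's standard library and parses an inline HTML string rather than fetching a live page; the table contents and the TableParser class name are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page fragment standing in for a downloaded web page.
PAGE = """
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Alice</td><td>12</td></tr>
  <tr><td>Bea</td><td>9</td></tr>
</table>
"""

parser = TableParser()
parser.feed(PAGE)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]  # one dict per table row
print(records)
```

Once the page is reduced to records like these, it can be loaded into a table or database and analyzed like any other structured data.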

Why Has It Become So Popular

Big Data’s recent popularity is due in large part to advances in technology and infrastructure that allow for the processing, storing and analysis of so much data. Computing power has increased considerably in the past five years while at the same time dropping in price – making it more accessible to small and midsize companies. In the same vein, the infrastructure and tools for large-scale data analysis have gotten more powerful, less expensive and easier to use. According to

As the technology has gotten more powerful and less expensive, numerous companies have emerged to take advantage of it by creating products and services that help businesses to take advantage of all Big Data has to offer. According to Inc, in 2012 the Big Data industry was worth $3.2 billion and growing quickly. They went on to say that “Total [Big Data] industry revenue is expected to reach nearly $17 billion by 2015, growing about seven times faster than the overall IT market”. For more on the size and projected growth of the Big Data industry, check out this Forbes article.
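
As a quick sanity check on those figures, the implied compound annual growth rate can be worked out directly. The sketch below assumes the two quoted values ($3.2 billion in 2012 and roughly $17 billion in 2015) and a three-year span; the function name cagr is ours.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start value, an end value and a span."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in billions of dollars; "nearly $17 billion" is treated as 17.0.
growth = cagr(3.2, 17.0, 2015 - 2012)
print(f"Implied annual growth: {growth:.1%}")
```

Under these assumptions the industry would have needed to grow by roughly three quarters every year, which gives a sense of how aggressive the projection was.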

Businesses have also started taking notice of the Big Data trend. In a recent survey, “Eighty-seven percent of enterprises believe big data analytics will redefine the competitive landscape of their industries within the next three years.”

Why Should Businesses Care?

Data has always been used by businesses to gain insights through analysis. The emergence of Big Data means that they can now do this on an even greater scale, taking into account more and more factors. By analyzing greater volumes from a more varied set of data, businesses can derive new insights with a greater degree of accuracy. This directly contributes to improved performance and decision making within an organization.

Big Data is fast becoming a crucial way for companies to outperform their peers. Good data analysis can highlight new growth opportunities, identify and even predict market trends, be used for competitor analysis, generate new leads and much more. Learning to use this data effectively will give businesses greater transparency into their operations, better predictions, faster sales and bigger profits.

Best Big Data Tools

Taking advantage of all that Big Data has to offer can seem like a daunting task, but there are a number of tools (both free and paid) that can help businesses to collect, store, analyze and derive insight from Big Data. Here are just a few…

OpenRefine

OpenRefine is a data cleaning tool that allows you to pre-process your data for analysis. This is especially useful if you are analyzing unstructured data or combining multiple data sets into one for analysis.
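
One of the cleaning techniques OpenRefine is known for is clustering values that are spelled differently but mean the same thing, using a "fingerprint" key. The sketch below is a simplified re-implementation of that idea in plain Python, not OpenRefine's own code: it lowercases, strips punctuation, and sorts the unique tokens, then groups values that share a key. The club names are illustrative.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Illustrative messy values, as might appear when merging several data sets.
names = ["Real Madrid", "real madrid ", "Madrid, Real", "FC Barcelona", "Barcelona FC.", "Chelsea"]
clusters = cluster(names)
print(clusters)
```

Each resulting group is a candidate for merging into a single canonical spelling, which a human would normally review before applying.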

WolframAlpha

WolframAlpha provides detailed responses to technical searches and performs very complex calculations. For business users, it presents information as charts and graphs, and is excellent for high-level pricing history, commodity information, and topic overviews.

import.io

import.io allows you to turn the unstructured data displayed on web pages into structured tables of data that can be accessed over an API.

Tableau

Tableau is a visualization tool that makes it easy to look at your data in new ways. In the analytics process, Tableau’s visuals allow you to quickly investigate a hypothesis, sanity check your instincts or build a compelling infographic to convince your audience with.

Google Fusion Tables

Google Fusion Tables is a versatile tool for data analysis, large data set visualization and mapping.

Best Additional Resources (blog posts, case studies, books, videos, etc)

If you’re interested in learning more about Big Data and how you can use it, here are a few of our favorite resources:

Blogs

No Free Hunch (Kaggle) – Kaggle hosts a number of predictive modeling competitions. Its competition and data science blog covers all things related to the sport of data science.
SmartData Collective – SmartData Collective is an online community moderated by Social Media Today that provides information on the latest trends in business intelligence and data management.
FlowingData – FlowingData explores the ways in which data scientists, designers, and statisticians use analysis, visualization, and exploration to understand data and ourselves.
KDnuggets – KDnuggets is a comprehensive resource for anyone with a vested interest in the data science community.
Data Elixir – Data Elixir is a great roundup of data news from across the web; you can get a weekly digest sent straight to your inbox.

Online Courses/Learning Resources

DataCamp – DataCamp is a resource for learning data analysis and R interactively.
School of Data – School of Data offers a variety of courses designed for everyone, from the data science-newbie to the professional seeking inspiration.
Udemy – Udemy is the world’s largest destination for online courses with many in the data science field.
w3schools – W3schools offers online tutorials for learning basic coding and data analysis skills.

Videos

The Data Science Revolution – an expert panel that considers the future of data science and the ethics involved with data analytics and enhanced predictive powers.
Turning Big Data into Big Analytics – focuses on the opportunity businesses have when dealing correctly with their data and serves as a case study for data science professionals.

Books

Big Data: A Revolution That Will Transform How We Live, Work and Think – a fascinating survey of big data’s growing effect on just about everything: business, government, science and medicine, privacy, and even on the way we think.
Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance – a clear understanding, blueprint, and step-by-step approach to building your own big data strategy.
Data Science for Business: What you need to know about data mining and data-analytic thinking – introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for extracting useful knowledge and business value from the data you collect.

Looking Ahead

What the future of Big Data really holds, no one can predict. The rapid development of new technologies, especially in the machine learning space, will undoubtedly overtake any predictions we try to make. What is certain is that Big Data is here to stay. The amount of data we are producing is only going to increase, and by analyzing it we can learn and eventually be able to predict some pretty cool things. Very soon, Big Data will touch and transform every industry and every piece of your daily life.

Wrapping Up

Whether or not you believe the hype that Big Data will change the world, the fact remains that learning how to use the recent influx of data effectively can help you make better, more informed decisions. The thing to take away from Big Data isn’t its largeness, it’s the variety. You don’t necessarily need to analyze a lot of data to get accurate insights; you just need to make sure you are analyzing the right data. To really take advantage of this data revolution, you need to start thinking about new and varied data sources that can give you a more well-rounded picture of your customers, market and competitors. With today’s Big Data technologies, everything can be used as data – giving you unparalleled access to market factors.

What’s your take on the future of Big Data? Leave a comment for us below!
