Data Science Jedi: algorithm

Viser innlegg med etiketten algorithm. Vis alle innlegg

søndag 7. august 2016

The 7 Steps of a Data Project

By Alivia Smith, User Marketing Manager - Dataiku

It’s hard to know where to start once you’ve decided that yes, you want to become more data-driven. Just looking at all the technologies you have to understand and all the languages you’re supposed to master is enough to make your dizzy.

Well, building your first data project is actually not that hard. And yes, Dataiku DSS helps, but what will really helps you is understanding the data science process. Becoming data driven is about this: knowing the basic steps and following them to go from raw data to building a machine learning model.

The steps to complete a data project have been conceptualized a while ago as the KDD process (forKnowledge Discovery in Databases), and made popular with lots of vintage looking graphs like this one.

This is our take on the steps of a data project in this awesome age of big data!

STEP 1: UNDERSTAND THE BUSINESS

Understanding the business is the key to assuring the success of your data project. To motivate the different actors necessary to getting your project from design to production, your project must be the answer to a clear business need. So before you even think about the data, go out and talk to the people who could need to make their processes or their business better with data. Then sit down and define a timeline and concrete indicators to measure. I know, processes and politics seem boring, but in the end, they turn out to be quite useful!

If you’re working on a personal project, playing around with a dataset or an API, this may seem irrelevant. It’s not. Just downloading a cool open data set is not enough. I can’t tell you how many cool datasets I downloaded and never did anything with… So settle on a question to answer, or a product to build!

STEP 2: GET YOUR DATA

Once you’ve gotten your goal figured out, it’s time to start looking for your data. Mixing and merging data from as many data sources as possible is what makes a data project great, so look as far as possible.

Here are a few ways to get yourself some data:

Connect to a database: ask your data and IT teams for the data that’s available, or open your private database up, and start digging through it, and understanding what information your company has been collecting.
Use APIs: think of the APIs to all the tools your company’s been using, and the data these guys have been collecting. You have to work on getting these all set up so you can use those email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support ticket somebody submitted, etc. If you’re not an expert coder, plugins in DSS give you lots of possibilities to bring in external data!
Look for open data: the Internet is full of datasets to enrich what you have with extra information; census data will help you add the average revenue for the district where your user lives, or open street maps can show you how many coffee shops are on his street. A lot of countries have open data platforms (like data gov in the US). If you’re working on a fun project outside of work, these open data sets are also an incredible resource! Check out kaggle, or this github with lots of datasets for example
Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like twitter, or facebook, to analyze your followers and friends. It’s extremely easy to set up these connections with tools like ifttt. For example, I have a bunch of recipes that collect the music I listen to, the places I visit, my steps and the kilometers I run, the contacts I add, etc. And this can be useful for businesses as well! You can analyze very interesting trends on twitter, or even monitor the competition.

STEP 3: EXPLORE AND CLEAN YOUR DATA

(AKA the dreaded preprocessing step that typically takes up 80% of the time dedicated to a data project)

Once you’ve gotten your data, it’s time to get to work on it! Start digging to see what you’ve got and how you can link everything together to answer your original goal. Start taking notes on your first analyses, and ask questions to business people, or the IT guys, to understand what all your variables mean! Because not everyone will get that c06xx is a product category referring to something awesome.

Once you understand your data, it’s time to clean it! You’ve probably noticed that even though you have a country feature for instance, you’ve got different spellings, or even missing data. It’s time to look at every one of your columns to make sure your data is homogeneous and clean.

Warning! This is probably the longest, most annoying step of your data project. Data scientists report data cleaning is about 80% of the time spent on a project. So it’s going to suck a little bit. Luckily, tools like Dataiku DSS can make this much faster!

STEP 4: ENRICH YOUR DATASET

Now that you’ve got clean data, it’s time to manipulate it to get the most value out of it. This is the time to join all your different sources, and group logs, to get your data down to the essential features.

You’ll then start manipulating the data to extract lots of valuable features. For example, getting a country and even a town out of a visitor’s IP address. Extracting time of day, or week of year from your dates to get something more meaningful.

The possibilities are pretty much endless, and you’ll get a pretty good idea by scrolling through Dataiku DSS’s processors in the Lab of the operations you can execute.

STEP 5: BUILD VISUALISATIONS

building insights and graphs in data project

You now have a nice dataset (or maybe several), so this is a good time to start exploring it by building graphs. When you’re dealing with large volumes of data, they’re the best way to explore and communicate your findings.

You’ll find lots of tools available that make this step fun to prepare and to receive. The tricky part is always to be able to dig into your graphs to answer any question somebody would have about an insight. That’s when the data preparation comes in handy: you’re the guy who did the dirty work so you know the data like the palm of your hand!

If this is the final step of your project, it’s important to use APIs and plugins so you can push those insights to where your end users want to have them. So get integrated with their tools!

Your graphs don’t have to be the end of your project though. They’re a way to uncover more trends that you want to explain. They’re also a way to develop more interesting features. For example, by putting your data points on a map you could perhaps notice that specific geographic zones are more telling than specific countries or cities.

STEP 6: GET PREDICTIVE

By working with clustering algorithms (aka unsupervised), you can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express what feature is decisive in these results. Tools like Dataiku DSS help beginners run basic open source algorithms easily in clickable interfaces.

More advanced data scientists can then go even further and predict future trends with supervised algorithms. By analyzing past data, they find features that have impacted past trends, and use them to build predictions. More than just gaining knowledge, this final step can lead to building whole new products and processes. To get these in production though, you’ll need the intervention of data scientists and engineers, but it’s important to understand the process so all the parties involved (business users and analysts as well), will be able to understand what comes out in the end.

STEP 7: ITERATE

The main goal in any business project is to prove it’s effectiveness as fast as possible to justify, well, your job. Data projects are the same. By gaining time on data cleaning and enriching, you can go to the end of the project fast and get your first results. These first insights will be a great start to uncover more necessary cleaning, to develop more features in order to continuously improve results and model outputs.

Now that you’ve got the skills, get started right now by building projects in Dataiku DSS!

fredag 8. juli 2016

Evolution of R

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

R is free software distributed under a GNU-style copy left, and an official part of the GNU project called GNU S.

Evolution of R

R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.

A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Features of R

As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R −

R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
R has an effective data handling and storage facility,
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display either directly at the computer or printing at the papers.

As a conclusion, R is world’s most widely used statistics programming language. It's the # 1 choice of data scientists and supported by a vibrant and talented community of contributors. R is taught in universities and deployed in mission critical business applications. This tutorial will teach you R programming along with suitable examples in simple and easy steps.

torsdag 23. juni 2016

The New Rules for Becoming a Data Scientist

Summary: What do you need to do to get an entry level job in data science?

This article is written for anyone who is considering becoming a data scientist. That includes young people just starting their bachelor’s degrees and folks in the first two or three years of their careers who want to make the switch.

It’s not for folks who know they are going to pursue one of the new Master’s in Data Science or Ph.D. candidates. It’s for folks looking for entry level jobs that are specifically on the data science career ladder.

Is There a Data Science Career Progression That Doesn’t Require an Advanced Degree?

Yes there is. Like many high skill professions that’s not to say that an advanced degree won’t make it easier but there are definitely ways to enter this market with only a bachelor’s degree.

If you’ve been practicing data science for more than five or ten years you also know that the majority of us over 35 don’t have specific data science degrees. We came to data science via a variety of related disciplines and gained our cred largely based on performance and experience. It’s only the cohort under 35 working in data science that’s likely to have a DS-specific degree, advanced or bachelor’s.

The flack this article is likely to draw is not over the level of degree required or the types of experience but the just-below-boiling controversy about who gets to call themselves a data scientist. The problem in our profession, and I’m not going to solve it here, is there is not an accepted nomenclature that differentiates the various skill levels of data scientists or who gets to wear that title at all.

Employers aren’t helping since actual data science jobs may be called engineer, analyst, developer, team lead or many other less exciting sounding titles. Other employers are giving data science titles to folks who are not really doing data science, but more descriptive analytics and straight EDW work.

So for simplicity’s sake I’m going to call our target audience folks who are seeking positions as Junior or Associate Data Scientists. Specifically that means doing work that involves detecting signals in the data that can be used to make predictions about future behavior. Not simple descriptive historical analysis of what’s happened in the past.

For Beginners What Does the Market Look Like and What Type of Work Will You Do?

There are two key points to understand here. The first is that the data science market has divided into two distinctly different segments, Production and Development.

Production: This is by far the largest and most mature segment where predictive analytics has been used for longest and where it is best integrated to create truly data-driven businesses. Large B2C service businesses dominate this group, specifically insurance, financial services, cable and telecos, healthcare, plus retail, ecommerce, and some manufacturing. These companies are widely distributed geographically so you can work pretty much anywhere. The primary data science activities are predictive analytics and recommenders.

Development: This is the new and sexy world of data science that gets all the press coverage. In these enterprises the data science and the code are the product. Think Google, Facebook, eHarmony, Apple, and the thousands of start-ups that are either developing new analytic and big data platforms, or products with embedded analytics. This is also where you find the newest developments in data science including deep learning for image, text, and speech recognition, much of IoT (some crossover here to the production world), and all the flavors of AI.

The Development world is geographically concentrated in a few areas that we all know: the Bay area, Silicon Beach, New York, Boston, and maybe Austin. This is exciting and heady stuff where you will probably devote upwards of 60% to 70% of your substantial starting salary to rent.

As a new Associate Data Scientist you are much more likely to find your first career step in the Production world.

The Four Paths of Data Science

The second main point is that your career progression in DS will probably take you down one of four paths represented by different types of data scientists. These four types are ultimately differentiated by what they spend their time doing.

The best analysis that I’ve seen on this comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. You can find the original at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp and I strongly encourage you to read it.

There are 40 pages of good analysis here or for the Cliff Notes version see my previous article How to Become a Data Scientist.

In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.

1. Data Businesspeople are those that are most focused on the organization and how data projects yield profit. At the entry level you’ll be performing the junior duties of blending and cleaning data and preparing basic predictive models.

2. Data Developer. Focused on the technical problem of managing data — how to get it, store it, and learn from it. At the entry level you’ll be working with Hadoop as well as structured data. If you are more interested in the data science infrastructure side this may be for you and is a particularly good path for a current analyst and IT staff to move up into the data science career path.

3. Data Creatives. Often tackle the entire soup-to-nuts analytics process on their own: from extracting and blending data, to performing advanced analyses and building models, to creating visualizations and interpretations. This is a more senior role innovating new types of predictive analytic use cases, data products, and services. This may also be you if you find yourself in a company with little or no experience with advanced analytics but you’re unlikely to get this job fresh out of college with no experience. Data Creatives are heavily present in the Development world.

4. Data Researchers. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD. These are folks who are innovating data science at its most fundamental level.

According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems. Here’s their chart.

This is an important decision since you need to do activities within data science that you like. This may lead you toward an advanced degree or simply to develop you skills through experience. It’s not something you have to decide from day one but one that you’ll want to consider early in your career.

The Skills You’ll Need to Enter the Data Science Market

If you were shopping for a two-year Master’s Degree in Data Science you’d have lots to pick from. If you search for Bachelor’s degrees in Data Science you’ll find a good selection but at many institutions the undergraduate degree is more likely to be titled ‘Computer Science’ leaving you to wonder if you’re actually getting the knowledge that you need.

If you have a choice, pick a college that specifically offers a Data Science degree. If you don’t have that choice you’ll have to analyze and select the blocks of learning that you’ll need.

Yes you need to be grounded in the broad aspects of computer science but in addition there are specific skills and knowledge you’ll need to master. The best description I’ve seen for this incremental learning is also an excellent guide for those of you who have recently finished your bachelors. It’s from an article by Amy Gershkoff, the Chief Data Officer at Zynga and describes their in-house program for growing their own data scientists.

Zynga’s in-house program is 12 to 18 months. To be considered there are a variety of performance requirements and academically the candidate needs a minimum of two previous semesters of coursework in statistics, economics, computer science, or similar. At Zynga, some of this is in an on-line academic environment and some is mentored by their in-house data scientists. This could easily be the course list for your undergraduate program. I have added some observations of my own.

Phase I: Foundational Statistical Theory

Participants learn the basics of probability theory and statistical analysis including sampling theory, hypothesis testing, and statistical distributions. For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others. Usually one to two semesters of an online statistics course (such as Princeton University’s online course) covers this material.

Phase II: Foundational Programming Skills

To be an effective data scientist, knowledge of scripting languages is a requirement. Selecting which ones is a matter of discussion. My take is this:

SQL: Not really a hard data science language but reflects the fact that you’re likely to have to extract data yourself from relational databases. Also, SQL is now almost universally available as a query language on Hadoop (it’s really no longer accurate to call it NoSQL).

Python: The big discussion over the last five or so years has been around R versus Python. Python is my pick as a production language with a very generous data science library. More importantly, as SPARK has come on so quickly as the preferred tool on Hadoop, Python works easily here while R does not. In the most recent surveys you’ll see Python pulling away from R.

SAS: Yes SAS. SAS was practically the original DS scripting language before R and Python. Although it’s included here under programming skills you can learn to use the SAS packages via drag-and-drop UI just as easily. Depending on what survey you’re reading you may or may not see SAS on each list, but in the Production world SAS is extremely common and having this skill is a definite competitive advantage. IBM SPSS is an option but SAS has a huge lead in adoption. You will rarely encounter SAS in the Development world.

Phase III: Machine Learning

Participants learn both supervised and unsupervised learning techniques. Supervised learning techniques include decision trees, Random Forrest, logistic regression, Neural Networks, and SVMs. Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.

Only a matter of a year or two ago you could not be an effective data scientist without knowing the inner workings of these algorithms including how to manipulate their tuning parameters to optimize results. The late breaking news however is the new availability of completely automated predictive analytic platformswhere selection and operation of the ML algorithms is handled by AI.

The likelihood that your new employer will have any of these new platforms on hand is still fairly slim but growing by the day. Perhaps you will be the one to suggest they utilize them. They can really speed up the modeling process. Until then, you need to know what’s going on under the hood of all the major ML algorithms.

Phase IV: Big Data Toolbox

It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets. For this reason, basic knowledge of tools such as Hadoop, Spark, and an analytics platform for large data sets constitutes a dedicated module.

It’s here that you’ll learn how those models you built in the last section are put into operation to assist business decisions. Until they’re operationalized, they’re of no value.

It’s also here that you’ll learn the basics of streaming versus batch both in model development and implementation. Spark has come on very fast with extremely high adoption rates and is the basic tool now for both batch and streaming.

Should You Specialize Early?

In the Development world you will increasingly only be selected if you have a specialty. In the Production world you are likely to have more opportunities if you don’t specialize. Having said that there are two areas you may want to examine which can be picked up fairly rapidly and are considered specializations within the Production world.

Supply Chain Forecasting: There are some very specific techniques and packages associated with true demand driven supply chain forecasting that can provide an unique entre in the world of manufacturing or logistics.

IoT for Manufacturing: This is the use of predictive models on streaming data from SCADA systems and the like to predict the quality of output during a production run or the imminent failure of a piece of capital equipment.

If you wanted to make your living in an area dominated by manufacturing you would consider adding these to your portfolio early in your career.

For the most part however, if you’re in the Production world, predictive modeling and recommenders will be a complete toolset for several years.

Remember also that our profession is changing fast. It is already well past the time that a single data scientist could master the entire field. Employers may still be looking for unicorns but very rapidly there will be emerging specialty fields you may consider as your career progresses. Deep learning, natural language processing, image processing, and AI are all examples that will take either additional education or serious OJT.

What about the rumors of those outsized salaries even for beginners? Well they are at least partly true in that you will earn a well above average salary compared to other analyst or IT staff positions. You’re not going get a Silicon Valley salary if you’re working in Milwaukee.

The best salary and skills studies come from O’Reilly. Their most recent survey for example says that a Master’s degree will only add about $3,500 per year to your earnings. This is a well done survey that evaluates not only salary but time spent in different tasks, tools used, and other factors. Be sure to carefully evaluate who filled out the surveys and whether you think they are representative. There are no purely objective bias-free surveys in our profession.

As Your Career Progresses

Data science has been and continues to be a field in which knowledge of tools as well as business in paramount. We utilize a complex toolbox to extract, blend, clean, transform, engineer, model, and implement models that can create business value from data that only a few years ago was not considered valuable.

It should come as no surprise that innovation is simplifying and automating the toolbox of existing tools even as new tools are arising. In the past if we were expert carpenters with great skill with our tools, in the future we will be more like architects bringing a broad range of tools and design skills to bear to build value.

In management consulting where I spent many years we used to say that a consultant needs three legs to stand on, domain knowledge (knowledge of a particular industry), process knowledge (deep understanding a particular process such as planning, manufacturing, or accounting), and methodology (in management consulting this means process improvement, reengineering, strategy development, or package implementation among others). As your career progresses you should build your own foundation on these three principles where methodology becomes the skills of data science that you’ve mastered. The other two legs, deep knowledge of one or more industries and one or more business processes will be why future employers seek you out.

onsdag 15. juni 2016

The Professionalization of Data Science

There has been much discussion and debate about the definition of data science and the new rare breed of sexy bird called the data scientist. The Data Science Association defines "Data Science" as the scientific study of the creation, validation and transformation of data to create meaning; and the "Data Scientist" as a professional who uses scientific methods to liberate and create meaning from raw data.

While these definitions may appear overbroad, think about the definitions of a lawyer or physician. A lawyer is a legal professional who can help prevent or solve legal issues and a physician is a health professional who can help prevent or cure health issues. Like the professionalization of law and medicine in the past hundred years, data science is at the very beginning of becoming a profession - with competency standards and a Data Science Code of Professional Conduct.

This means that data science will evolve into a profession where data scientists specialize in different areas - like lawyers and physicians. When you need to hire a lawyer you usually consider the special area of law that a lawyer practices. If you have a tax problem you hire a tax lawyer, not a divorce lawyer. If you have a heart problem you do not hire a gynecologist.

The simple truth is that data science is a vast and complicated field and - like law and medicine - much too big and complex for a person to master in one lifetime. My colleague Gary Mazzaferro has been exploring the concepts and ideas surrounding data science and definitions as formalizations aligning with knowledge economies and the knowledge / science / technology maturity models. Gary has (to date) defined the following data science specializations and types of data scientists:

Data Science: A field of systematic interdisciplinary study to elucidate relationships across and within Formal, Social Natural and Special Sciences phenomenon through the application of scientific methods. Interdisciplinary areas include analytical processes, mathematics, probability and statistics, logic, modeling, machine learning, algorithms, communications, traditional sciences, business, public policy and philosophy.

Blue Sky Data Science: A purely curiosity driven exploratory branch of Data Science oriented towards the development and establish understanding about relationships across and within phenomenon with no focus on specific goals and immediate application.

Basic Data Science: A branch of Data Science research focused on clearly defined goals and oriented towards the development and establish understanding about relationships across and within phenomenon.

Applied Data Science: A branch of Data Science oriented toward the development of practical applications, technologies other interventions including engineering practices. Applied Data Science bridges the gap between Basic Data Science and the engineering domains to provide predicable, usable tools to industries including standard methods and practices.

Data Science Practice: The regular performance of Applied Data Science activities and methods for private and public organizations. May practice externally or internally. Practice may necessitate additional disciplines based on the needs of the organization including domain expertise and communications supporting presentation and reporting activities.

Data Scientist: A person that studies or has expert knowledge of the interdisciplinary field of Data Science.

Blue Sky Data Scientist: A person that studies or researches in the branch of Blue Sky Data Science.

Basic Data Scientist: A person that studies, researches or has expert knowledge in the branch of Basic Data Science.

Applied Data Scientist: A person that studies or researches in the branch of Applied Science.

Note that this is a preliminary list and is not complete. The profession of data science will evolve to create many specializations. After all, it took law and medicine over one hundred years to evolve as professions with different specialties.

mandag 13. juni 2016

Everything you ever wanted or needed to know about Big Data #BigData

Big Data is a phrase that gets bandied about quite a bit in the media, the board room – and everywhere in between. It’s been used, overused and used incorrectly so many times that it’s become difficult to know what it really means. Is it a tool? Is it a technology? Is it just a buzzword used by data scientists to scare us? Is it really going to change the world? Or ruin it?

This post is all about demystifying the mess that has become Big Data, and more importantly demonstrating how you can use it to improve your bottom line.

What Is Big Data?

First of all, what is Big Data? In it’s purest form, Big Data is used to describe the massive volume of both structured and unstructured data that is so large it is difficult to process using traditional techniques. So Big Data is just what it sounds like – a whole lot of data.

The concept of Big Data is a relatively new one and it represents both the increasing amount and the varied types of data that is now being collected. Proponents of Big Data often refer to this as the “datification” of the world. As more and more of the world’s information moves online and becomes digitized, it means that analysts can start to use it as data. Things like social media, online books, music, videos and the increased amount of sensors have all added to the astounding increase in the amount of data that has become available for analysis.

Everything you do online is now stored and tracked as data. Reading a book on your Kindle generates data about what you’re reading, when you read it, how fast you read it and so on. Similarly, listening to music generates data about what you’re listening to, when how often and in what order. Your smart phone is constantly uploading data about where you are, how fast you’re moving and what apps you’re using.

What’s also important to keep in mind is that Big Data isn’t just about the amount of data we’re generating, it’s also about all the different types of data (text, video, search logs, sensor logs, customer transactions, etc.). In fact, Big Data has four important characteristics that are known in the industry as the 4 V’s:

Volume – the increasing amount of data that is generated every second
Velocity – the speed at which data is being generated
Variety – the different types of data being generated
Veracity – the messiness of data, ie. it’s unstructured nature

Based on the incredible amount, speed, variety and unstructuredness of the data we are now generating and storing, it’s no surprise that it quickly became unmanageable using traditional storing and analysis methods. This is where the term Big Data becomes confusing, because it is often used to refer to the new technologies, tools and processes that have sprung up to accommodate this vast amount of data.

Glossary of Big Data Terms

Inevitably, much of the confusion around Big Data comes from the variety of new (for many) terms that have sprung up around it. Here is a quick run-down of the most popular ones:

Algorithm – mathematical formula run by software to analyze data
Amazon Web Services (AWS) – collection of cloud computing services that help businesses carry out large-scale computing operations without needing the storage or processing power in-house
Cloud (computing) – running software on remote servers rather than locally
Data Scientist – an expert in extracting insights and analysis from data
Hadoop – collection of programs that allow for the storage, retrieval and analysis of very large data sets
Internet of Things (IoT) – refers to objects (like sensors) that collect, analyze and transmit their own data (often without human input)
Predictive Analytics – using analytics to predict trends or future events
Structured v Unstructured data – structured data is anything that can be organized in a table so that it relates to to other data in the same table. Unstructured data is everything that can’t.
Web scraping – the process of automating the collection and structuring of data from web sites (usually through writing code)

Why Has It Become So Popular

Big Data’s recent popularity has been due in large part to new advances in technology and infrastructure that allow for the processing, storing and analysis of so much data. Computing power has increased considerably in the past five years while at the same time dropping in price – making it more accessible to small and midsize companies. In the same vein, the infrastructure and tools for large-scale data analysis has gotten more powerful, less expensive and easier to use. According to

As the technology has gotten more powerful and less expensive, numerous companies have emerged to take advantage of it by creating products and services that help businesses to take advantage of all Big Data has to offer. According to Inc, in 2012 the Big Data industry was worth $3.2 billion and growing quickly. They went on to say that “Total [Big Data] industry revenue is expected to reach nearly $17 billion by 2015, growing about seven times faster than the overall IT market”. For more on the size and projected growth of the Big Data industry, check out this Forbes article.

Businesses have also started taking notice of the Big Data trend. In a recent survey, “Eighty-seven percent of enterprises believe big data analytics will redefine the competitive landscape of their industries within the next three years.”

Why Should Businesses Care?

Data has always been used by businesses to gain insights through analysis. The emergence of Big Data means that they can now do this on an even greater scale, taking into account more and more factors. By analyzing greater volumes from a more varied set of data, businesses can derive new insights with a greater degree of accuracy. This directly contributes to improved performance and decision making within an organization.

Big Data is fast becoming a crucial way for companies to outperform their peers. Good data analysis can highlight new growth opportunities, identify and even predict market trends, be used for competitor analysis, generate new leads and much more. Learning to use this data effectively will give businesses greater transparency into their operations, better predictions, faster sales and bigger profits.

Best Big Data Tools

Taking advantage of all that Big Data has to offer can seem like a daunting task, but there are a number of tools (both free and paid) that can help businesses to collect, store, analyze and derive insight from Big Data. Here are just a few…

OpenRefine

OpenRefine is a data cleaning software that allows you to pre-process your data for analysis. This is especially useful if you are analyzing unstructured data or combining multiple data sets into one for analysis.

WolframAlpha

WorlframAlpha provides detailed responses to technical searches and does very complex calculations. For business users, it presents information charts and graphs, and is excellent for high level pricing history, commodity information, and topic overviews.

import.io

import.io is allows you to turn the unstructured data displayed on web pages into structured tables of data that can be accessed over an API.

Tableau

Tableau is a visualization tool that makes it easy to look at your data in new ways. In the analytics process, Tableau’s visuals allow you to quickly investigate a hypothesis, sanity check your instincts or build a compelling infographic to convince your audience with.

Google Fusion Tables

Google Fusion Tables is a versatile tool for data analysis, large data set visualization and mapping.

Best Additional Resources (blog posts, case studies, books, videos, etc)

If you’re interested in learning more about Big Data and how you can use it, here are a few of our favorite resources:

Blogs

No Free Hunch (kaggle) – Kaggle hosts a number of predictive modeling competitions. Their competition and data science blog, covers all things related to the sport of data science.
SmartData Collective – SmartData Collective is an online community moderated by Social Media Today that provides information on the latest trends in business intelligence and data management.
FlowingData – FlowingData explores the ways in which data scientists, designers, and statisticians use analysis, visualization, and exploration to understand data and ourselves.
KDnuggets – KDnuggets is a comprehensive resource for anyone with a vested interest in the data science community.
Data Elixir – Data Elixir is a great roundup of data news across the web, you can get a weekly digest sent straight to your inbox.

Online Courses/Learning Resources

DataCamp – DataCamp is a resource for learning data analysis and R interactively.
School of Data – School of Data offers a variety of courses designed for everyone, from the data science-newbie to the professional seeking inspiration.
Udemy – Udemy is the world’s largest destination for online courses with many in the data science field.
w3schools – W3schools is great online tutorials for learning basic coding and data analysis skills.

Videos

The Data Science Revolution – an expert panel that considerations of the future of data science and the ethics involved with data analytics and enhanced predictive powers.
Turning Big Data into Big Analytics – focuses on the opportunity businesses have when dealing correctly with their data and serves as a case study for data science professionals.

Books

Big Data: A Revolution That Will Transform How We Live, Work and Think – a fascinating survey of big data’s growing effect on just about everything: business, government, science and medicine, privacy, and even on the way we think.
Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance – a clear understanding, blueprint, and step-by-step approach to building your own big data strategy.
Data Science for Business: What you need to know about data mining and data-analytic thinking – introduces the fundamental principles of data science, and walks you through the “data-analytic thinking” necessary for extracting useful knowledge and business value from the data you collect.

Looking Ahead

What the future of Big Data really holds, no one can predict. The rapid development of new technologies, especially in the machine learning space, will undoubtedly usurp any predictions we try to make. What is certain, is that Big Data is here to stay. The amount of data we are producing is only going to increase and by analyzing it, we can learn and eventually be able to predict some pretty cool things. Very soon, Big Data will touch and transform every industry and every piece of your daily life.

Wrapping Up

Whether or not you believe the hype about whether Big Data will change the world, the fact remains that learning how to use the recent influx of data effectively can help you make better, more informed decisions. The thing to take away from Big Data isn’t it’s largeness, it’s the variety. You don’t necessarily need to analyze a lot of data to get accurate insights, you just need to make sure you are analyzing theright data. To really take advantage of this data revolution, you need to start thinking about new and varied data sources that can give you a more well rounded picture of your customers, market and competitors. With today’s Big Data technologies, everything can be used as data – giving you unparalleled access to market factors.

What’s your take on the future of Big Data? Leave a comment for us below!

Advertisement

søndag 7. august 2016

The 7 Steps of a Data Project

STEP 1: UNDERSTAND THE BUSINESS

STEP 2: GET YOUR DATA

STEP 3: EXPLORE AND CLEAN YOUR DATA

STEP 4: ENRICH YOUR DATASET

STEP 5: BUILD VISUALISATIONS

STEP 6: GET PREDICTIVE

STEP 7: ITERATE

fredag 8. juli 2016

Evolution of R

Evolution of R

Features of R

torsdag 23. juni 2016

The New Rules for Becoming a Data Scientist

onsdag 15. juni 2016

The Professionalization of Data Science

mandag 13. juni 2016

Everything you ever wanted or needed to know about Big Data #BigData

What Is Big Data?

Glossary of Big Data Terms

Why Has It Become So Popular

Why Should Businesses Care?

Best Big Data Tools

OpenRefine

WolframAlpha

import.io

Tableau

Google Fusion Tables

Best Additional Resources (blog posts, case studies, books, videos, etc)

Blogs

Online Courses/Learning Resources

Videos

Books

Looking Ahead

Wrapping Up

Advertise

Advertise