Data Science Jedi: mathematics

Thursday, June 23, 2016

The New Rules for Becoming a Data Scientist

Summary: What do you need to do to get an entry level job in data science?

This article is written for anyone who is considering becoming a data scientist. That includes young people just starting their bachelor’s degrees and folks in the first two or three years of their careers who want to make the switch.

It’s not for folks who already know they are going to pursue one of the new Master’s degrees in Data Science, or for Ph.D. candidates. It’s for folks looking for entry level jobs that are specifically on the data science career ladder.

Is There a Data Science Career Progression That Doesn’t Require an Advanced Degree?

Yes, there is. As with many high-skill professions, an advanced degree will make it easier, but there are definitely ways to enter this market with only a bachelor’s degree.

If you’ve been practicing data science for more than five or ten years you also know that the majority of us over 35 don’t have specific data science degrees. We came to data science via a variety of related disciplines and gained our cred largely based on performance and experience. It’s only the cohort under 35 working in data science that’s likely to have a DS-specific degree, advanced or bachelor’s.

The flak this article is likely to draw is not over the level of degree required or the types of experience but over the just-below-boiling controversy about who gets to call themselves a data scientist. The problem in our profession, and I’m not going to solve it here, is that there is no accepted nomenclature that differentiates the various skill levels of data scientists or settles who gets to wear that title at all.

Employers aren’t helping since actual data science jobs may be called engineer, analyst, developer, team lead or many other less exciting sounding titles. Other employers are giving data science titles to folks who are not really doing data science, but more descriptive analytics and straight EDW work.

So for simplicity’s sake I’m going to call our target audience folks who are seeking positions as Junior or Associate Data Scientists. Specifically that means doing work that involves detecting signals in the data that can be used to make predictions about future behavior. Not simple descriptive historical analysis of what’s happened in the past.

For Beginners What Does the Market Look Like and What Type of Work Will You Do?

There are two key points to understand here. The first is that the data science market has divided into two distinctly different segments, Production and Development.

Production: This is by far the largest and most mature segment, where predictive analytics has been used for longest and where it is best integrated to create truly data-driven businesses. Large B2C service businesses dominate this group, specifically insurance, financial services, cable and telcos, healthcare, plus retail, ecommerce, and some manufacturing. These companies are widely distributed geographically so you can work pretty much anywhere. The primary data science activities are predictive analytics and recommenders.

Development: This is the new and sexy world of data science that gets all the press coverage. In these enterprises the data science and the code are the product. Think Google, Facebook, eHarmony, Apple, and the thousands of start-ups that are either developing new analytic and big data platforms, or products with embedded analytics. This is also where you find the newest developments in data science including deep learning for image, text, and speech recognition, much of IoT (some crossover here to the production world), and all the flavors of AI.

The Development world is geographically concentrated in a few areas that we all know: the Bay area, Silicon Beach, New York, Boston, and maybe Austin. This is exciting and heady stuff where you will probably devote upwards of 60% to 70% of your substantial starting salary to rent.

As a new Associate Data Scientist you are much more likely to find your first career step in the Production world.

The Four Paths of Data Science

The second main point is that your career progression in DS will probably take you down one of four paths represented by different types of data scientists. These four types are ultimately differentiated by what they spend their time doing.

The best analysis that I’ve seen on this comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. You can find the original at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp and I strongly encourage you to read it.

There are 40 pages of good analysis here or for the Cliff Notes version see my previous article How to Become a Data Scientist.

In short, they conclude there are four types of Data Scientists differentiated not so much by the breadth of knowledge, which is similar, but their depth in specific areas and how each type prefers to interact with data science problems.

1. Data Businesspeople. Those most focused on the organization and how data projects yield profit. At the entry level you’ll be performing the junior duties of blending and cleaning data and preparing basic predictive models.

2. Data Developers. Focused on the technical problem of managing data: how to get it, store it, and learn from it. At the entry level you’ll be working with Hadoop as well as structured data. If you are more interested in the data science infrastructure side this may be for you, and it is a particularly good path for current analysts and IT staff to move up into the data science career path.

3. Data Creatives. Often tackle the entire soup-to-nuts analytics process on their own: from extracting and blending data, to performing advanced analyses and building models, to creating visualizations and interpretations. This is a more senior role innovating new types of predictive analytic use cases, data products, and services. This may also be you if you find yourself in a company with little or no experience with advanced analytics but you’re unlikely to get this job fresh out of college with no experience. Data Creatives are heavily present in the Development world.

4. Data Researchers. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD. These are folks who are innovating data science at its most fundamental level.

According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems. Here’s their chart.

This is an important decision since you need to do activities within data science that you like. This may lead you toward an advanced degree or simply to develop your skills through experience. It’s not something you have to decide from day one but one that you’ll want to consider early in your career.

The Skills You’ll Need to Enter the Data Science Market

If you were shopping for a two-year Master’s Degree in Data Science you’d have lots to pick from. If you search for Bachelor’s degrees in Data Science you’ll find a good selection but at many institutions the undergraduate degree is more likely to be titled ‘Computer Science’ leaving you to wonder if you’re actually getting the knowledge that you need.

If you have a choice, pick a college that specifically offers a Data Science degree. If you don’t have that choice you’ll have to analyze and select the blocks of learning that you’ll need.

Yes, you need to be grounded in the broad aspects of computer science, but in addition there are specific skills and knowledge you’ll need to master. The best description I’ve seen for this incremental learning is also an excellent guide for those of you who have recently finished your bachelor’s degree. It’s from an article by Amy Gershkoff, the Chief Data Officer at Zynga, and describes their in-house program for growing their own data scientists.

Zynga’s in-house program is 12 to 18 months. To be considered there are a variety of performance requirements and academically the candidate needs a minimum of two previous semesters of coursework in statistics, economics, computer science, or similar. At Zynga, some of this is in an on-line academic environment and some is mentored by their in-house data scientists. This could easily be the course list for your undergraduate program. I have added some observations of my own.

Phase I: Foundational Statistical Theory

Participants learn the basics of probability theory and statistical analysis including sampling theory, hypothesis testing, and statistical distributions. For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others. Usually one to two semesters of an online statistics course (such as Princeton University’s online course) covers this material.
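To make the Phase I topics a little more concrete, here is a minimal sketch of three of them (standard deviation, correlation, and basic regression) using only Python's standard library. The data and variable names are hypothetical, invented purely for illustration.

```python
import math
import statistics

# Hypothetical sample: hours of practice vs. test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

mean_x = statistics.mean(hours)
mean_y = statistics.mean(scores)
std_y = statistics.stdev(scores)  # sample standard deviation

# Sums of squared deviations and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Ordinary least squares fit: score = intercept + slope * hours.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"std dev of scores: {std_y:.2f}")
print(f"correlation: {r:.3f}")
print(f"fit: score = {intercept:.1f} + {slope:.1f} * hours")
```

In practice a statistics package would do this for you, but working through the sums by hand once is a reasonable way to check that the definitions have sunk in.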

Phase II: Foundational Programming Skills

To be an effective data scientist, knowledge of scripting languages is a requirement. Selecting which ones is a matter of discussion. My take is this:

SQL: Not really a hard data science language but reflects the fact that you’re likely to have to extract data yourself from relational databases. Also, SQL is now almost universally available as a query language on Hadoop (it’s really no longer accurate to call it NoSQL).

Python: The big discussion over the last five or so years has been around R versus Python. Python is my pick as a production language with a very generous data science library. More importantly, as Spark has come on so quickly as the preferred tool on Hadoop, Python works easily here while R does not. In the most recent surveys you’ll see Python pulling away from R.

SAS: Yes SAS. SAS was practically the original DS scripting language before R and Python. Although it’s included here under programming skills you can learn to use the SAS packages via drag-and-drop UI just as easily. Depending on what survey you’re reading you may or may not see SAS on each list, but in the Production world SAS is extremely common and having this skill is a definite competitive advantage. IBM SPSS is an option but SAS has a huge lead in adoption. You will rarely encounter SAS in the Development world.
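As an illustration of the SQL point above, pulling your own data out of a relational database, here is a small sketch that uses Python's built-in sqlite3 module. The in-memory database, the customers table, and its columns are all hypothetical stand-ins for whatever source system you would actually query.

```python
import sqlite3

# In-memory database standing in for a production relational source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, churned INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (region, churned) VALUES (?, ?)",
    [("east", 1), ("east", 0), ("east", 1), ("west", 0), ("west", 0), ("west", 1)],
)

# Aggregate in SQL so only the summary rows come back to Python.
query = """
    SELECT region, COUNT(*) AS n, AVG(churned) AS churn_rate
    FROM customers
    GROUP BY region
    ORDER BY region
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, n, churn_rate in rows:
    print(f"{region}: {n} customers, churn rate {churn_rate:.2f}")
```

The same GROUP BY pattern carries over to most SQL engines, though connection details and dialect quirks will differ.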

Phase III: Machine Learning

Participants learn both supervised and unsupervised learning techniques. Supervised learning techniques include decision trees, random forests, logistic regression, neural networks, and SVMs. Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.
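For a feel of what one of these supervised techniques does under the hood, here is a bare-bones sketch of logistic regression fitted by batch gradient descent in plain Python. The data, the learning rate, and the epoch count are assumptions chosen for illustration; a real project would normally reach for a library implementation.

```python
import math
import statistics

# Hypothetical training data: one feature (e.g. monthly usage) and a 0/1 label.
usage = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
label = [0, 0, 0, 0, 1, 1, 1, 1]

# Standardize the feature so gradient descent behaves well.
mu = statistics.mean(usage)
sigma = statistics.pstdev(usage)
z = [(x - mu) / sigma for x in usage]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Fit weight w and bias b by batch gradient descent on the log loss.
w, b = 0.0, 0.0
learning_rate = 0.1  # assumed value, a tuning parameter
epochs = 2000        # assumed value, a tuning parameter
for _ in range(epochs):
    preds = [sigmoid(w * zi + b) for zi in z]
    grad_w = sum((p - y) * zi for p, y, zi in zip(preds, label, z)) / len(z)
    grad_b = sum(p - y for p, y in zip(preds, label)) / len(z)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b


def predict(x):
    """Return 1 if the fitted model puts the probability at 0.5 or above."""
    return 1 if sigmoid(w * (x - mu) / sigma + b) >= 0.5 else 0


print([predict(x) for x in usage])
print(predict(1.2), predict(3.8))
```

The model learns a single weight and bias; everything else in the loop is bookkeeping for the gradient of the log loss.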

Only a year or two ago you could not be an effective data scientist without knowing the inner workings of these algorithms, including how to manipulate their tuning parameters to optimize results. The late-breaking news, however, is the new availability of completely automated predictive analytic platforms where selection and operation of the ML algorithms is handled by AI.

The likelihood that your new employer will have any of these new platforms on hand is still fairly slim but growing by the day. Perhaps you will be the one to suggest they utilize them. They can really speed up the modeling process. Until then, you need to know what’s going on under the hood of all the major ML algorithms.
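To illustrate what manipulating a tuning parameter looks like in its simplest form, the sketch below tries several values of k for a k-nearest-neighbours classifier and keeps the one with the best leave-one-out accuracy. k-nearest neighbours is not one of the algorithms listed above; it is swapped in here only because it fits in a few lines. The data and candidate values are hypothetical.

```python
from collections import Counter

# Hypothetical labelled points: (feature value, class).
data = [(1.0, "low"), (1.3, "low"), (1.8, "low"), (2.4, "high"),
        (2.1, "low"), (2.9, "high"), (3.3, "high"), (3.6, "high"),
        (2.6, "low"), (4.0, "high")]


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(points, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, cls) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        hits += knn_predict(rest, x, k) == cls
    return hits / len(points)


# Try each candidate value of the tuning parameter k and keep the best.
scores = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Automated platforms run essentially this kind of search, over many more algorithms and parameters, on your behalf.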

Phase IV: Big Data Toolbox

It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets. For this reason, basic knowledge of tools such as Hadoop, Spark, and an analytics platform for large data sets constitutes a dedicated module.

It’s here that you’ll learn how those models you built in the last section are put into operation to assist business decisions. Until they’re operationalized, they’re of no value.

It’s also here that you’ll learn the basics of streaming versus batch both in model development and implementation. Spark has come on very fast with extremely high adoption rates and is the basic tool now for both batch and streaming.
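The batch versus streaming distinction can be sketched without any big data tooling at all. In the example below, the batch path computes a mean and variance with the full dataset in hand, while the streaming path uses Welford's online algorithm to update the same statistics one value at a time. The readings and the RunningStats class are hypothetical, and this plain-Python sketch is not how Spark itself is programmed.

```python
import statistics


class RunningStats:
    """Welford's online algorithm: update mean and variance one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two values, so return 0.0.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


readings = [12.0, 15.5, 11.2, 14.8, 13.1, 16.4]

# Batch: the whole dataset is available before anything is computed.
batch_mean = statistics.mean(readings)
batch_var = statistics.variance(readings)

# Streaming: values arrive one by one and are never stored.
stream = RunningStats()
for value in readings:
    stream.update(value)

print(batch_mean, stream.mean)
print(batch_var, stream.variance)
```

Both paths arrive at the same numbers; the difference is that the streaming version never needs the full dataset in memory.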

Should You Specialize Early?

In the Development world you will increasingly only be selected if you have a specialty. In the Production world you are likely to have more opportunities if you don’t specialize. Having said that there are two areas you may want to examine which can be picked up fairly rapidly and are considered specializations within the Production world.

Supply Chain Forecasting: There are some very specific techniques and packages associated with true demand-driven supply chain forecasting that can provide a unique entrée into the world of manufacturing or logistics.

IoT for Manufacturing: This is the use of predictive models on streaming data from SCADA systems and the like to predict the quality of output during a production run or the imminent failure of a piece of capital equipment.

If you wanted to make your living in an area dominated by manufacturing you would consider adding these to your portfolio early in your career.
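As a rough idea of what watching streaming sensor data for impending trouble can look like, here is a sketch that flags readings far outside the recent rolling average. It is a simple statistical rule standing in for a trained predictive model, and the function name, window size, threshold, and temperature values are all assumptions made for illustration.

```python
from collections import deque
import statistics


def find_anomalies(readings, window=5, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from the recent mean."""
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(readings):
        # Only start checking once the rolling window is full.
        if len(recent) == window:
            mean = statistics.mean(recent)
            spread = statistics.stdev(recent)
            if spread > 0 and abs(value - mean) / spread > threshold:
                flagged.append(i)
        recent.append(value)
    return flagged


# Hypothetical temperature readings from one sensor; index 8 is a spike.
temperatures = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 19.8, 20.0, 27.5, 20.1]
anomalies = find_anomalies(temperatures)
print(anomalies)
```

A production system would typically replace the fixed threshold with a model trained on historical failures, but the one-reading-at-a-time structure stays the same.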

For the most part however, if you’re in the Production world, predictive modeling and recommenders will be a complete toolset for several years.

Remember also that our profession is changing fast. It is already well past the time that a single data scientist could master the entire field. Employers may still be looking for unicorns but very rapidly there will be emerging specialty fields you may consider as your career progresses. Deep learning, natural language processing, image processing, and AI are all examples that will take either additional education or serious OJT.

What about the rumors of those outsized salaries even for beginners? Well, they are at least partly true in that you will earn a well above average salary compared to other analyst or IT staff positions. But you’re not going to get a Silicon Valley salary if you’re working in Milwaukee.

The best salary and skills studies come from O’Reilly. Their most recent survey for example says that a Master’s degree will only add about $3,500 per year to your earnings. This is a well done survey that evaluates not only salary but time spent in different tasks, tools used, and other factors. Be sure to carefully evaluate who filled out the surveys and whether you think they are representative. There are no purely objective bias-free surveys in our profession.

As Your Career Progresses

Data science has been and continues to be a field in which knowledge of tools as well as business is paramount. We utilize a complex toolbox to extract, blend, clean, transform, engineer, model, and implement models that can create business value from data that only a few years ago was not considered valuable.

It should come as no surprise that innovation is simplifying and automating the existing toolbox even as new tools are arising. If in the past we were expert carpenters with great skill with our tools, in the future we will be more like architects, bringing a broad range of tools and design skills to bear to build value.

In management consulting, where I spent many years, we used to say that a consultant needs three legs to stand on: domain knowledge (knowledge of a particular industry), process knowledge (deep understanding of a particular process such as planning, manufacturing, or accounting), and methodology (in management consulting this means process improvement, reengineering, strategy development, or package implementation among others). As your career progresses you should build your own foundation on these three principles, where methodology becomes the skills of data science that you’ve mastered. The other two legs, deep knowledge of one or more industries and one or more business processes, will be why future employers seek you out.

Wednesday, June 1, 2016

Industry Speaks: Top 33 Big Data Predictions for 2016

By Alex Woodie

What will happen in big data in 2016? You’d think that would be a cinch to answer, what with all the deep neural net and prescriptive analytic progress being made these days. But in fact the big data predictions from the industry are all over the map. Datanami received dozens of predictions from prominent players in the industry. Here is a culled collection of the most interesting ones.

Oracle sees the rise of a new type of user: the Data Civilian. “While complex statistics may still be limited to data scientists, data-driven decision-making shouldn’t be,” Big Red says. “In the coming year, simpler big data discovery tools will let business analysts shop for datasets in enterprise Hadoop clusters, reshape them into new mashup combinations, and even analyze them with exploratory machine learning techniques.”

Nucleus Research is going out on a limb and predicting the death of big data as we know it. “In the past two years everyone and their dog seems to have launched a big data solution of some kind. It’s time for the shiny object syndrome to stop,” it says. “Instead of attacking the monolithic and daunting task of big data analysis, users will approach and access it like any data.”

Since even canines are pulling down big VC money for data startups, it may be time to start asking tough questions, according to Keri Smith, senior vice president at Opera Solutions. “What is the real ROI of a big data solution?” Smith asks. “How can companies get beyond departmental deployments to maximize the value of big data across the enterprise? And what are the meaningful use cases across a variety of verticals? If your company isn’t asking these questions and actively seeking answers, it should soon.”

We’ll see the rise of Data Jedis in 2016, says Matt Bencke, CEO of Spare5. “More jobs will be changed by AI than ever before and the ‘Data Jedis’ will become the most sought after employees,” he writes. “Machine learning+human insights will infiltrate new industries including healthcare and security and employees will need to adapt to providing a different service or get left behind in 2015.”

Data science will be big in banking, predicts Mike Weston, CEO of data science consultancy Profusion. “The financial industry is one of the pioneers of data science techniques,” he writes. “Nevertheless, the adoption of data science has been far from uniform across all banking services. In 2016 I expect this picture to change. Better use of data and personalisation of services will move from the financial markets to retail banking. It will have a profound impact on marketing, customer service and product development.”

The prospect of advanced AI giving rise to robot overlords scares Elon Musk. But according to Jans Aasman, a cognitive scientist and CEO of Franz, AI should be placed in the “friendlies” column. “Artificial intelligence and cognitive computing will make personalized medicine a reality, help save the lives of people with rare diseases and improve the overall state of healthcare in 2016 and beyond,” he says.

Chief Data Officers (CDOs) will become the “new it girl” of information tech, complicating office politics forever, argues Michael Ludwig, head of Blazent’s Office of the CTO. “Driven by the complexity of big data and the need for complete and accurate data, the CDO will become increasingly important,” he writes. “As a result, the CTO and CIO will need to make room for the CDO, and tension will emerge within the C-suite until clearly defined roles and associated teams are established.”

Not everybody sees it that way, including Craig Zawada, Chief Visionary Officer at PROS. “In 2016, we’ll begin to see erosion in the appointment of Chief Data Officers, a role of the past. Instead, Chief Insight Officers will emerge in 2016 as crucial leaders in the big data compilation process.”

CIOs, yeah baby!

But can the mighty CIO get his mojo back? Cazena founder and CEO Prat Moghe looks into his crystal ball, and says it’s so. “In 2016, CIOs will take advantage of enterprise-ready cloud services to become brokers of cloud services that meet IT mandates for governance, compliance and security as well as business needs for agility and responsiveness,” he writes.

Streaming analytics will start to mature and prove its worth in the big data lineup, predicts Phu Hoang, the CEO and co-founder of DataTorrent. “While lots of companies have already accepted that real-time streaming is valuable, we’ll see users looking to take it one step further to quantify their streaming use cases. In the next year, customers using streaming tools will reach new levels of sophistication and demand a quantified ROI for streaming analytics,” he says.

Real-time analytics will be hot next year. We get it. But one technology, Apache Kafka, stands taller than the rest, according to MongoDB’s VP of strategy Kelly Stirman. “Kafka will become an essential integration point in enterprise data infrastructure, facilitating the creation of intelligent, distributed systems,” Stirman writes. “Kafka and other streaming systems like Spark and Storm will complement databases as critical pieces of the enterprise stack for managing data across applications and data centers.”

Like drums? Then you’re going to love 2016, says Badri Raghavan, the chief data scientist at FirstFuel Software. “In the months ahead, we will see organizations and individuals tap data and analytics to deliver personalized and engaging experiences across industries including energy, sports, social good and music. For instance, people will be able to use data to change a song based on their personal preferences (e.g., lots of drum).”

How will the IoT impact the semiconductor business? IT legend Ray Zinn has a few thoughts on that. “You will see greater divisions between design and fabrication,” he writes. “Fabs will have the mission of scale to serve a few billions consumers and the nascent Internet of Things (IoT) markets. Design will become uniquely divorced from fabrication, splitting the market risk. Design firms will survive best by innovation, and fabs through ruthless efficiency. The question is what comes next? There will inevitably be new markets and devices that will drive a new growth spurt. The IoT is the sleeping giant, but I doubt the only one snoozing.”

Machine learning, big data automation, and artificial intelligence were big in 2015, and will get bigger next year, says Abdul Razack, SVP & head of platforms, big data and analytics at Infosys. “In 2016, the pace at which enterprises more widely adopt artificial intelligence to replace manual, repetitive tasks will rapidly increase,” Razack says, citing the $1 billion AI investment made recently by Toyota. Big data automation is already growing, but next year “it will be more widely used to accentuate the unique human ability to take complex problems and deliver creative solutions to them.” The self-driving cars from Tesla have built-in machine learning, but next year, “machine learning will quietly find its way into the household, making the objects around us not just connected.”

Lots of people see exciting things happening in the big data space in 2016. Not Charles Caldwell, the vice president of solutions engineering and services at Logi Analytics. “When I look ahead to 2016, I don’t see a lot of exciting things happening. Other vendors have come out with their predictions around cloud, visual analytics and mobile, but most of those things are old trends that are settling down. In my opinion, 2016 will be a year of consolidation and ground building for the next big thing.”

The “Not In Your Wildest Dreams” award goes to Peter Eicher, senior manager for product marketing at Catalogic Software. We’re not talking about his prediction that copy data management (CDM) “is a technology whose time has come as evidenced not only by the new vendors in the space but by old school players chiming in with ‘me too’ arguments.” That makes total sense. No, we’re calling Peter out for his crazy prediction that the New York Knicks win the NBA Championship. “Yeah, not happening,” he admits. “I can’t be right all the time. On the other hand, that prediction has been wrong for 42 years running. One of these days….”

The “Debbie Downer” award for big data goes to BlueTalon CEO Eric Tilenius for his prediction that the pace of big data breaches at major enterprises may rise. “In 2016, the lack of unified data governance could lead to the biggest security disruption that enterprises have ever faced—comparable to the disruption caused to the traditional enterprise perimeter by the entry of mobile,” he writes. “Relying on a fragmented approach to control data access, where inconsistent policies are applied across an ever-changing data landscape, will leave gaping holes in the protection of enterprise data.”

Are you into microservices? If not, you will be soon, according to SaaS heavy Workday. “It’s clear that the on-premise versus cloud battle is over. Cloud has won,” the company says. “Yet, not all cloud architectures will be created equal. Microservices architectures will go beyond the realm of consumer Internet designs like Netflix and become the most important architecture advancement in enterprise applications since the shift to the cloud.”

Big data is hard, and companies will struggle with it next year, says Ulrik Pederson, CTO of TARGIT. “2016 will see an expansion of big data analytics with tools that make it possible for business users to perform comprehensive self-service exploration with big data when they need it, without major hand holding from IT,” he writes. “Corresponding with my first predication, I anticipate a huge increase in advanced analytics projects across industries. However, that doesn’t mean they’ll be successful…. I wouldn’t be surprised to hear of many vendors and customers struggling to implement successfully.”

The International Institute of Analytics sees the rise of analytics microservices to facilitate embedded analytics. The IIA also sees progress being made in areas of cognitive technology, data science, and data curation. Oh, and the analytics talent crunch will ease as many new university programs come online, the group says.

People who aren’t data geeks will get into the big data swing of things, says Bruno Aziza, Chief Marketing Officer of OLAP-on-Hadoop provider AtScale. “As Hadoop becomes more accessible to non-data geeks, marketers will begin to access more data for better decision making,” he writes. “Hadoop’s deeper and wider view of data will enable marketers to capture behaviors leading to decisions and understand the processes underlying customer journeys.”

We’ll see more HPC tech making its way into the mainstream, particularly as it pertains to storage, predicts storage giant DDN. “Storage, data management and application acceleration technologies from the HPC industry will continue being tapped at even a higher rate in 2016 to meet the evolving requirements of performance and scale and will replace traditional IT infrastructures at even a higher rate,” the company says.

Impressed with open source big data tech? You haven’t seen anything yet, says Pentaho CEO Quentin Gallivan. “The explosion of cool new tools like Spark, Docker, Kafka, Solr–emerging open source tools designed to enable large-scale, high-volume analytics on petabytes of data are moving from the ‘awkward teenager’ phase to the ‘bearded hipster’ phase,” Gallivan writes.

Spark will kill MapReduce, but save Hadoop, says Monte Zweben, co-founder and CEO of RDBMS-on-Hadoop vendor Splice Machine. “MapReduce is quite esoteric. Its slow, batch nature and high level of complexity can make it unattractive for many enterprises,” he writes. “Spark, because of its speed, is much more natural, mathematical, and convenient for programmers. Spark will reinvigorate Hadoop, and in 2016, nine out of every 10 projects on Hadoop will be Spark-related projects.”

But that doesn’t mean every Spark project will involve Hadoop, says Bob Muglia, the CEO of Snowflake Computing. “Today, Spark is part of Hadoop distributions and is widely associated with Hadoop. Expect to see that change in 2016 as Spark goes its own way, establishing a separate, vibrant ecosystem. In fact, you can expect to see the major cloud vendors release their own Spark PaaS offerings. Will we see an Elastic Spark? Good chance.”

Organizations will reset on Apache Hadoop, says Dan Graham, general manager of enterprise systems at Teradata. “As Hadoop and related open source technologies move beyond knowledge gathering and the hype abates, enterprises will hit the reset button on (not abandon) their Hadoop deployments to address lessons learned – particularly around governance, data integration, security, and reliability.”

The junk drawer problem is one of the Hadoop community’s biggest challenges. But never fear–Master Data Man(agement) is here! “MDM will become ubiquitous,” writes Manish Sood, CEO and founder of Reltio. “MDM as a discipline has long only been affordable by large companies with big IT teams and budget for hardware, software and multi-year implementation projects…A new breed of data-driven applications will come built-in with MDM as table stakes. As a consequence of delivering both operational and analytical functionality, the reliable data foundation of each application is powered by an MDM engine.”

Hadoop will be at a crossroads in 2016, but which fork will it take? Mike Maciag, COO of Altiscale, gives us his prediction. “In 2016, we will see industry standards for Hadoop solidify. In the beginning of 2015, we saw the launch of the Open Data Platform Initiative (ODPi), which established standards for how key projects in the Big Data ecosystem can work together. ODPi doubled in membership during the course of the year as the benefits to standardization for customers became even more clear. We expect to see more growth and recognition in 2016, allowing new technologies and applications to meet the Hadoop ecosystem standards being established by the ODPi.”

We’ll see the emergence of IoT 2.0, predicts Zebra Technologies. “The IoT market will transition to more mature, industry and adaptable solutions from what used to be closed, proprietary first-generation offerings. With an open-source approach, organizations will be able to choose from a larger pool of service providers and their respective APIs.”

The IoT may herald the rise of a post-scarcity economy, predicts OpenText CEO Mark Barrenechea. “Imagine algorithms as apps for applying big data analysis over the connected masses of information generated by the IoT and its billions upon billions of connected devices in every aspect of our lives,” he writes. “Owning the data, analyzing the data, and improving and innovating become the keys to corporate success—all empowered by a connected digital society.”

The rise of converged platforms that can handle both analytic and transactional workloads will take a leap forward, foresees John Schroeder, CEO of MapR Technologies. “In 2016, we will see converged approaches become mainstream as leading companies reap the benefits of combining production workloads with analytics to adjust quickly to changing customer preferences, competitive pressures, and business conditions. This convergence speeds the ‘data to action’ cycle for organizations and removes the time lag between analytics and business impact.”

Another proponent of a single stack emerging in 2016 is Stefan Groschupf, the CEO of Datameer. “When a technology category is new, various companies emerge with individual products that aim to provide a solution for a portion of the space,” he writes. “This leaves customers buying a number of tools and trying to learn how to use them together. Eventually, that just won’t do, and customers tend towards an integrated stack of products – or a widely-scoped product – from a single vendor. 2016 will mark the beginning of that transition for big data products.”

Outsourcing will be big in 2016, predicts Anil Kaul, CEO of big data service provider Absolutdata. “A gigantic amount of valuable information can be generated from big data, but accessing this could be challenging and it typically lies beyond the scope of routine business intelligence,” he writes. “Many companies today are partnering with third parties to create and execute big data analytics strategies. Integrating external experts into the big data team may be the best way for companies to stay ahead in this quickly evolving space.”

Tuesday, 26 April 2016

Top 10 Big Data Technologies Of Present Times

Over the last few years, Big Data technologies have been getting their due attention, and several trends and innovations have emerged in this space.

Wednesday, October 22, 2014: Big Data is a broad concept that comprises several trends and technology developments. Here we'll discuss the top ten emerging Big Data technologies.

1. Column-oriented databases:

Traditional row-oriented databases are excellent at online transaction processing, but their query performance falls short as data volumes grow. The newer column-oriented databases store data by column rather than by row. This allows heavy data compression and faster query times.
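As a rough illustration of why a columnar layout helps, the sketch below pivots a small table from rows into columns, sums a single column and run-length encodes another. The sample data and the helper names `to_columns` and `run_length_encode` are made up for this example; real column stores use far more elaborate storage and compression schemes.

```python
from itertools import groupby

# Hypothetical sample table: each row is (order_id, country, amount).
rows = [
    (1, "NO", 120.0),
    (2, "NO", 80.0),
    (3, "SE", 45.5),
    (4, "SE", 200.0),
    (5, "SE", 10.0),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Compress a column into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["order_id", "country", "amount"])

# An aggregate over one column reads only that column's values.
total_amount = sum(columns["amount"])

# Repeated neighbouring values within a column compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount)
print(encoded_country)
```

In the row layout, the same sum would have to walk every field of every row; in the column layout it reads one contiguous list.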

2. Streaming Big Data analytics:

There are several projects in this category, including Storm, Spark, DataTorrent, Spring XD and SQLstream. Apache Storm is an open source distributed real-time computation system which simplifies the processing of data streams in real time. Spark is a data processing platform which is compatible with Hadoop. DataTorrent is a real-time streaming platform which enables businesses to process data as it arrives. Spring XD supports streams for event-driven data, while SQLstream provides a distributed stream processing platform for streaming analytics, visualization and continuous integration of machine data.
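The core idea these platforms share, computing results as events arrive instead of after loading a complete dataset, can be sketched in plain Python. The example below is not the API of any of the products named above; `tumbling_window_counts` and the sample events are hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) for each window as soon as it closes.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed to
    arrive in timestamp order. Windows with no events are skipped.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete, so emit it right away.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts


# Hypothetical event stream: (timestamp in seconds, event type).
events = [(0, "click"), (3, "view"), (9, "click"), (10, "click"), (25, "view")]
results = list(tumbling_window_counts(events, 10))

for window_start, counts in results:
    print(window_start, dict(counts))
```

Because the function is a generator, each window's counts become available without waiting for the stream to end, which is the property real streaming engines provide at cluster scale.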

3. Schema-less databases, or NoSQL databases:

This database category includes key-value stores and document stores. These databases focus on the storage and retrieval of large volumes of unstructured, semi-structured or even structured data.
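A minimal in-memory sketch of the two models may help. It assumes nothing about any particular NoSQL product: the store is an ordinary dictionary, the collection an ordinary list, and the `find` helper is hypothetical.

```python
import json

# Key-value store: an opaque value is looked up by its key.
kv_store = {}
kv_store["session:42"] = json.dumps({"user": "anna", "cart": [101, 205]})

# Document store: records in one collection need not share the same fields.
documents = [
    {"_id": 1, "name": "Anna", "email": "anna@example.com"},
    {"_id": 2, "name": "Bjorn", "tags": ["vip"], "address": {"city": "Oslo"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields match every criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


session = json.loads(kv_store["session:42"])
matches = find(documents, name="Bjorn")

print(session["cart"])
print(matches)
```

Neither structure declares a schema up front: the second document carries fields the first one lacks, and the key-value store never inspects what it holds.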

4. SQL-in-Hadoop:

This category includes Apache Hive, Shark, Apache Drill, Presto and Phoenix, among many others. These tools make it possible to run SQL queries against large datasets in distributed storage and to manage those datasets. Shark is a data warehouse system which supports Hive's query language. Apache Drill is an Apache incubation project designed for scalability and backed by MapR. Presto is an open source distributed SQL query engine, and Phoenix is an open source SQL query engine for Apache HBase.

5. MapReduce:

It's a programming paradigm which allows job execution to scale massively across thousands of servers or clusters of servers. It consists of two tasks, the Map task and the Reduce task. The Map task converts an input dataset into a different set of key-value pairs (tuples), while the Reduce task combines those tuples into a smaller set of tuples.
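The classic illustration is a word count. The single-process sketch below only shows the shape of the paradigm and is not Hadoop's API: the function names are hypothetical, and the `shuffle` step stands in for the grouping that a real framework performs between the two tasks across many machines.

```python
from collections import defaultdict


def map_task(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return key, sum(values)


# Hypothetical input; each line could live on a different server.
lines = ["big data big ideas", "data beats ideas"]

mapped = [pair for line in lines for pair in map_task(line)]
word_counts = dict(
    reduce_task(key, values) for key, values in shuffle(mapped).items()
)

print(word_counts)
```

Since each call to `map_task` depends on one line only and each call to `reduce_task` on one key only, both stages can run in parallel on separate machines.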

6. Hadoop:

Hadoop is an open source platform for handling Big Data which can work with multiple data sources. It has other applications too, and it's largely used for constantly changing data such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

7. PIG:

PIG brings the Hadoop project closer to developers and business users by providing a Perl-like language that allows query execution over data stored on a Hadoop cluster. PIG was originally a Yahoo! project, but it's now completely open source.

8. Big Data Lambda Architecture:

Lambda Architecture is a hybrid approach which combines real-time data with pre-computed data to provide a near-real-time view of the data at all times. Frameworks that implement it include Summingbird by Twitter and Lambdoop.
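How the two kinds of data come together at query time can be sketched as follows. This is a toy model under stated assumptions, not the API of Summingbird or Lambdoop: the page-view counts, `speed_view` and `query` are all hypothetical.

```python
from collections import Counter

# Batch view: page-view counts precomputed over all historical data
# (hypothetical numbers, refreshed only when the batch job reruns).
batch_view = Counter({"page_a": 1000, "page_b": 250})

# Events that arrived after the last batch run.
recent_events = ["page_a", "page_c", "page_a"]


def speed_view(events):
    """Real-time view: counts over only the not-yet-batched events."""
    return Counter(events)


def query(batch, realtime, key):
    """Serving step: merge the precomputed and real-time views for a key."""
    return batch.get(key, 0) + realtime.get(key, 0)


realtime_view = speed_view(recent_events)

print(query(batch_view, realtime_view, "page_a"))
print(query(batch_view, realtime_view, "page_c"))
```

When the next batch run absorbs the recent events, the real-time view can be discarded and rebuilt from whatever arrives afterwards.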

9. PLATFORA:

It almost copies Hadoop and it requires developer knowledge to operate. It's a platform which turns queries into Hadoop jobs with immediate effect, creating an abstraction layer that simplifies working with the datasets in Hadoop.

10. SkyTree:

It's a high performance machine learning data analytics platform which handles Big Data. It's an essential part of Big Data.

Courtesy: TechRepublic and InfoQ

Sanchari Banerjee, EFYTIMES News Network
