
Monday, 6 June 2016

Life Cycle of a Data Science Project

When working with big data, data scientists benefit from following a well-defined data science workflow. Whether the goal is to tell a story through data visualization or to build a data model, the workflow process matters. A standard workflow for data science projects keeps the various teams within an organization in sync, so that avoidable delays are prevented.


The end goal of any data science project is to produce an effective data product. The usable results produced at the end of a data science project are referred to as a data product. A data product can be anything (a dashboard, a recommendation engine, or anything else that facilitates business decision-making) that helps solve a business problem and answer a business question. To reach that end goal, however, data scientists have to follow a formalized, step-by-step workflow process. The lifecycle of data science projects should not focus merely on the process but should place more emphasis on data products. This post outlines the standard workflow process that data scientists follow in data science projects.









Data science projects do not have a nice, clean lifecycle with well-defined steps like the software development lifecycle (SDLC). They often run into delivery delays and repeated hold-ups, because some steps in the lifecycle are non-linear, highly iterative, and cyclical between the data science team and various other teams in an organization. It is very difficult for data scientists to determine at the outset which is the best way to proceed. Although the data science workflow process might not be clean, data scientists ought to follow a certain standard workflow to achieve the output.







People often confuse the lifecycle of a data science project with that of a software engineering project. That should not be the case, as data science is more science and less engineering. There is no one-size-fits-all workflow for all data science projects, and data scientists have to determine which workflow best fits the business requirements. However, there is a standard workflow for data science projects based on one of the oldest and most popular methodologies, CRISP-DM (Cross-Industry Standard Process for Data Mining). It was developed for data mining projects but is now adopted by most data scientists, with modifications to suit the requirements of the data science project.


According to a recent KDnuggets poll that asked “What main methodology are you using for your analytics, data mining, or data science projects?”, CRISP-DM remained the top methodology/workflow for data mining and data science projects, with 43% of the projects using it.


Every step in the lifecycle of a data science project depends on various data scientist skills and data science tools. The typical lifecycle involves jumping back and forth among interdependent data science tasks using a variety of data science programming tools. The data science process begins with asking an interesting business question that guides the overall workflow of the project.


Standard Lifecycle of Data Science Projects


The data science project lifecycle is similar to the CRISP-DM lifecycle, which defines the following six standard steps for data mining projects:
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment


The lifecycle of data science projects is an enhancement of the CRISP-DM workflow process, with some alterations:
Data Acquisition
Data Preparation
Hypothesis and Modelling
Evaluation and Interpretation
Deployment
Operations
Optimization


1) Data Acquisition

To do data science, you need data. The first step in the lifecycle of a data science project is to identify the person who knows what data to acquire and when to acquire it, based on the question to be answered. This person need not be a data scientist; anyone who understands the real differences between the available data sets and can make hard-hitting decisions about the organization's data investment strategy is the right person for the job.
A data science project begins with identifying the data sources, which could be logs from web servers, social media data, data from online repositories such as the US Census datasets, data streamed from online sources via APIs, data obtained through web scraping, data held in an Excel spreadsheet, or data from any other source. Data acquisition involves acquiring data from all the identified internal and external sources that can help answer the business question.
A major challenge that data professionals often encounter in the data acquisition step is tracking where each data slice comes from and whether the acquired slice is up to date. It is important to track this information throughout the lifecycle of a data science project, as data might have to be re-acquired to test other hypotheses or run updated experiments.
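One possible way to keep track of where each data slice came from is to store a small provenance record next to it. The sketch below is only an illustration using the Python standard library: the `SliceRecord` structure, the function names, and the example source path are all hypothetical and not part of any particular tool.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SliceRecord:
    """Provenance for one acquired data slice (hypothetical structure)."""
    name: str
    source: str
    acquired_at: datetime
    sha256: str


def record_slice(name, source, payload, acquired_at=None):
    """Build a provenance record and fingerprint the raw bytes."""
    acquired_at = acquired_at or datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()
    return SliceRecord(name, source, acquired_at, digest)


def is_stale(record, max_age, now=None):
    """Return True when the slice is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - record.acquired_at > max_age


# Example usage with a made-up source path and payload.
record = record_slice("weblogs", "file:///var/log/example/access.log", b"GET / 200\n")
print(record.source, record.sha256[:12], is_stale(record, timedelta(days=7)))
```

Storing a checksum alongside the timestamp makes it possible to tell, when data is re-acquired for a new experiment, whether the slice actually changed.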


2) Data Preparation

This step is often referred to as the data cleaning or data wrangling phase. Data scientists often complain that it is the most boring and time-consuming task, involving the identification of various data quality issues. Data acquired in the first step of a data science project is usually not in a usable format for the required analysis and might contain missing entries, inconsistencies, and semantic errors.
Having acquired the data, data scientists have to clean and reformat it, either by manually editing it in a spreadsheet or by writing code. This step of the lifecycle does not in itself produce meaningful insights. However, through regular data cleaning, data scientists can identify what foibles exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results. After reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy to load into one of the data science tools.
Exploratory data analysis forms an integral part of this stage, as summarizing the clean data can help identify outliers, anomalies, and patterns that can be used in subsequent steps. This is the step that helps data scientists answer the question of what they actually want to do with the data.
“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there,” said John Tukey, an American mathematician.
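As a rough illustration of cleaning by code, the sketch below drops rows with missing entries, normalises inconsistent labels, and then flags possible outliers in the cleaned values. The sample data, column names, and function names are invented for the example, and the outlier rule (Tukey's fences at 1.5 times the interquartile range) is just one common choice, not something this post prescribes.

```python
import csv
import io
import json
import statistics

# Invented sample: inconsistent city labels, one missing value, one odd reading.
RAW = """city,temp_c
Oslo,12.5
 oslo ,13.0
Bergen,
Bergen,11.0
Oslo,12.0
Bergen,11.5
Oslo,13.5
Bergen,12.0
Oslo,95.0
"""


def clean_rows(text):
    """Normalise city labels and drop rows with a missing temperature."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        temp = row["temp_c"].strip()
        if not temp:  # missing entry
            continue
        rows.append({"city": row["city"].strip().title(), "temp_c": float(temp)})
    return rows


def flag_outliers(values, k=1.5):
    """Return values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


rows = clean_rows(RAW)
temps = [r["temp_c"] for r in rows]
print(json.dumps(rows[:2]))   # cleaned rows are easy to export as JSON
print(flag_outliers(temps))   # candidates to inspect, not to delete blindly
```

Flagged values are only candidates for a closer look; whether a reading is an error or a real anomaly is a judgement that depends on the business question.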


3) Hypothesis and Modelling

This is the core activity of a data science project: writing, running, and refining programs that analyse data and derive meaningful business insights from it. These programs are often written in languages such as Python, R, MATLAB, or Perl. Diverse machine learning techniques are applied to the data to identify the machine learning model that best fits the business needs. All the contending machine learning models are trained on the training data sets.
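To make the idea of contending models concrete, here is a minimal sketch that trains two very simple candidates on the same training set: a baseline that always predicts the training mean, and a straight line fitted by ordinary least squares. The data and the function names are made up for illustration; real projects would normally use a machine learning library rather than hand-written fitting code.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x


# Invented training data; every candidate is trained on the same set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.0, 4.0, 6.0, 8.0]
candidates = {
    "mean": fit_mean(train_x, train_y),
    "line": fit_line(train_x, train_y),
}
for name, model in candidates.items():
    print(name, model(5.0))
```

Keeping a trivial baseline among the candidates gives a reference point: a more complex model that cannot beat it is probably not worth deploying.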


4) Evaluation and Interpretation

Different modelling tasks call for different evaluation metrics. For instance, if the machine learning model aims to predict daily stock prices, then the RMSE (root mean squared error) should be considered for evaluation. If the model aims to classify spam emails, then performance metrics such as average accuracy, AUC, and log loss should be considered. A common question when evaluating a machine learning model is which dataset to use to measure its performance. Looking at the performance metrics on the training dataset is helpful but can be misleading, because the numbers obtained might be overly optimistic as the model has already adapted to the training dataset. Model performance should be measured and compared on validation and test sets to identify the best model based on model accuracy and over-fitting.
Steps 1 to 4 above are iterated as data is acquired continuously and the business understanding becomes much clearer.
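The metrics mentioned above can be written down in a few lines. The sketch below gives plain Python versions of RMSE, accuracy, and binary log loss, plus a simple shuffled split into training, validation, and test sets. The function names, the 60/20/20 proportions, and the fixed seed are assumptions made for the example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle, then cut into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


train_set, val_set, test_set = split(range(10))
print(len(train_set), len(val_set), len(test_set))
```

Models are fitted on the training set, compared on the validation set, and the test set is kept aside for a final, less optimistic estimate of the chosen model.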


5) Deployment

Machine learning models might have to be recoded before deployment, because data scientists might favour the Python programming language while the production environment supports Java. After this, the models are first deployed in a pre-production or test environment before being deployed into production.
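One way to reduce the recoding effort, offered here only as a suggestion and not as something the post describes, is to export the fitted parameters in a language-neutral format such as JSON and keep a small reference scorer that the production port can be checked against. The model type, field names, and parameter values below are hypothetical.

```python
import json


def export_linear_model(intercept, slope):
    """Serialise plain parameters rather than Python objects."""
    return json.dumps({"type": "linear", "intercept": intercept, "slope": slope})


def load_and_predict(blob, x):
    """Reference scorer; a port in another language should match this arithmetic."""
    params = json.loads(blob)
    return params["intercept"] + params["slope"] * x


blob = export_linear_model(0.5, 2.0)
print(blob)
print(load_and_predict(blob, 3.0))
```

Running the same inputs through the reference scorer and the recoded model in the pre-production environment is a cheap check that the two agree before going live.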


6) Operations/Maintenance

This step involves developing a plan for monitoring and maintaining the data science project in the long run. Model performance is monitored, and any performance degradation is clearly tracked in this phase. Data scientists can also archive what they learned from a specific data science project for shared learning and to speed up similar projects in the near future.
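A very small monitoring sketch might compare the recent average error of the live model with the error measured at deployment time. The class name, the rolling window size, and the 25% tolerance below are arbitrary assumptions for illustration, not recommended values.

```python
from collections import deque
from statistics import mean


class PerformanceMonitor:
    """Track recent absolute errors and flag degradation against a baseline."""

    def __init__(self, baseline_error, tolerance=0.25, window=50):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # keeps only the latest observations

    def observe(self, y_true, y_pred):
        """Record the absolute error of one live prediction."""
        self.errors.append(abs(y_true - y_pred))

    def degraded(self):
        """True when the recent mean error exceeds baseline by the tolerance."""
        if not self.errors:
            return False
        return mean(self.errors) > self.baseline_error * (1 + self.tolerance)


monitor = PerformanceMonitor(baseline_error=1.0, window=3)
for actual, predicted in [(10.0, 9.5), (12.0, 11.5), (11.0, 10.5)]:
    monitor.observe(actual, predicted)
print(monitor.degraded())
```

In practice the true outcome often arrives later than the prediction, so a monitor like this can only be updated once the actual values become known.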


7) Optimization

This is the final phase of any data science project. It involves retraining the machine learning model in production whenever new data sources come in, or taking the necessary steps to keep up the performance of the machine learning model.
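The retraining decision can be sketched as two small rules: retrain when enough new data has arrived or the live error has drifted, and keep the retrained model only if it does better on held-out data. The thresholds, function names, and toy models below are assumptions for illustration; the "keep the better of the two" rule is one common approach rather than something this post specifies.

```python
def should_retrain(new_rows, live_error, baseline_error, min_new_rows=1000, tolerance=0.25):
    """Retrain when enough new data has arrived or the error has drifted."""
    enough_new_data = new_rows >= min_new_rows
    degraded = live_error > baseline_error * (1 + tolerance)
    return enough_new_data or degraded


def mean_abs_error(model, rows):
    """Mean absolute error of a model over (x, y) pairs."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def pick_model(current, candidate, holdout, error_fn=mean_abs_error):
    """Keep the current model unless the retrained one has lower holdout error."""
    if error_fn(candidate, holdout) < error_fn(current, holdout):
        return candidate
    return current


# Toy example: the retrained candidate fits the held-out pairs better.
holdout = [(1.0, 2.0), (2.0, 4.0)]
current_model = lambda x: x
retrained_model = lambda x: 2 * x
if should_retrain(new_rows=5000, live_error=1.0, baseline_error=1.0):
    chosen = pick_model(current_model, retrained_model, holdout)
    print(chosen is retrained_model)
```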
Having a well-defined workflow makes any data science project less frustrating for data professionals to work on. The lifecycle described above is not definitive and can be altered to improve the efficiency of a specific data science project as per the business requirements.















