
Thursday 11 August 2016

Another approach to Personal Finance

Re-Inventing Personal Finance using Data Science



Existing software and new approach

Existing Personal Finance applications are usually boring because they depend entirely on manual input of your data: the right segment, the right amount, just tedious. On top of that, you can count on manual-input errors, together with the impossibility of updating your financial status live, to make the experience even worse. These and many other reasons make existing Personal Finance applications nearly useless.

To avoid typing data into the application yourself, you need a live feed of your transaction data (credit card usage, bank payments, etc.), with manual input only for cash amounts. Cash is a small problem, however, as we tend to avoid it as much as possible and mostly pay electronically.

Most banks offer their customers a digital bank account where all transactions are visible, and that is the best source for avoiding manual input. So why do we not ask for a built-in application that serves as a Personal Finance app, with even more ways to serve you?



The solution

This application can save lives: it can make you better at your personal finances, help avert financial crises, and help banks better understand you as a customer. It is not only you as a person who benefits, but the entire society and even the bank itself. A bank can score its customers' credit much more accurately and can avoid risky loans, risky interest rates for a particular customer, and so on.

To build this app we need to consider many things, especially the techniques that Business Intelligence solutions can offer us, while keeping security and impersonation in mind, since we are working with very sensitive data.

Therefore, I am delighted to present to you PFI, which stands for Personal Finance Intelligence: an unusual approach compared to the Personal Finance solutions on the market today.

Personal Finance Intelligence (PFI) aims to be a Business Intelligence application built into your digital banking service, serving as your personal finance and budget-planning assistant.

Inspired by the Norwegian TV show “Luksusfellen”, this Business Intelligence approach may be a solution for all those who fail to manage their own finances well, for those who want to improve their finances and save more, and, last but not least, for the bank itself.
The foundation of this concept is a Customer Analytics Data Center with the power to process data at the transaction level. The duty of the data center is to collect, structure, clean, model, and present the data to the bank's customers the way a usual Personal Finance application does, except that the data is updated automatically. This is the reporting (presentation) layer of your financial picture, but the application can offer much more, and here is why!

In addition to a standard PF application, this solution also includes benchmarking against a standardized customer (Ola Norman) representing the min, max, or average of a data set segmented by the customer's chosen properties, for example: how do I stand against customers aged 28-35 from east Oslo in food and beverage spending this month?
To give you more control and help you plan your finances well, targeting will be an integrated service inside the application: users (bank customers) can set targets manually for costs or income, or can let the application's algorithm fill them in with projections based on each customer's historical data. You can activate a flagging service so you are warned when approaching certain limits on your expenses, and run an algorithm that optimizes the use of the remaining budget so you do not go broke.
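As a sketch of how such a benchmark could be computed, here is a minimal Python example. The customer IDs, segment, and amounts are all invented for illustration, and a real system would of course pull the segment data from the Customer Analytics Data Center.

```python
# Hypothetical benchmarking sketch: compare one customer's monthly
# "food & beverage" spend (NOK) against peers in the same segment
# (e.g. age 28-35, east Oslo). All figures here are invented.
from statistics import mean

# monthly spend per peer customer in the chosen segment
segment_spend = {"cust_01": 3200, "cust_02": 4100, "cust_03": 2850, "cust_04": 3600}

def benchmark(my_spend, peers):
    """Return min/avg/max for the peer segment and the customer's delta vs. average."""
    values = list(peers.values())
    avg = mean(values)
    return {
        "segment_min": min(values),
        "segment_avg": avg,
        "segment_max": max(values),
        "delta_vs_avg": my_spend - avg,
    }

report = benchmark(4500, segment_spend)
```

A positive `delta_vs_avg` would mean the customer spends more than the segment average this month.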
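The flagging and remaining-budget idea could be sketched like this; the 80% warning threshold and the even daily spread of the remaining budget are my own illustrative assumptions, not part of the concept above.

```python
# Minimal sketch of the flagging service: warn when cumulative spending
# in a category approaches a user-set (or projected) monthly target.
def budget_flag(spent, target, warn_ratio=0.8):
    """Return a status flag for a spending category (warn_ratio is an assumption)."""
    if spent >= target:
        return "over_budget"
    if spent >= warn_ratio * target:
        return "approaching_limit"
    return "ok"

def remaining_daily_budget(spent, target, days_left):
    """Spread what is left of the budget evenly over the remaining days."""
    return max(target - spent, 0) / max(days_left, 1)
```

For example, with a 1000 NOK food target and 850 NOK already spent, the flag would read `approaching_limit`.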


Big Data can help make it even better

The latest technological advances can make this approach even more interesting and meaningful. Imagine what Big Data and Data Science could do with external data from customers who choose to open their social media accounts to the bank's application. Social media behavior is very telling and can bring valuable segmentation into the customer categorization.

Machine learning algorithms can make the decision-making and budgeting much better, drawing on other customers' decisions and budgeting techniques.



Considerations

As I mentioned before, data impersonation and security can be a showstopper, since benchmarking data sets necessarily include other customers' data. There is a potential for data to leak from one customer to another, so the system must ensure consistency on both sides, and the bank must keep everything under control. Transaction details can also expose banks' ‘hidden’ costs and fees; many banks will hesitate to offer this service to their customers for that reason alone, while on the other side customers have a legitimate right to such information.


Conclusion

The beneficiaries of this approach are not only the customers and the world economy, but also the bank itself, in cases where it wants to evaluate customers (credit checks) and react to certain financial statuses. Today's credit scoring systems make poorer decisions because they miss important data.

I am in the process of building the business concept and the technical architecture of this approach. My team and I would love to share it in detail, including implementation, with any company, association, or bank in the world that is interested in offering this service to its customers. This could be the best preventive measure for keeping the world financial system sustainable, so it does not crash as it has before.



© Copyright All rights reserved to Besim Ismaili 03051982



Oslo, January 2015

Sunday 5 June 2016

10 Minutes to Data Science



I just followed the Johns Hopkins Executive Data Science team. In the first chapter of the course they say:

*In Data Science, the important thing is the science, not the data. Data Science is only useful when we use data to answer a question.*
That is actually true to some extent. I've seen too many companies that brag about how big their data is but don't actually know how to pose a question. If the data can't be used to answer a question that makes the company grow, it's not a good investment. They keep pushing data, data, data, but in the end machine learning just feeds on that data to make a better prediction that answers the question. It's critical to pose the question first, and then try to get or build the data to answer it. And, more importantly, don't be afraid to get other data from sources outside your company.
When investigating a problem and communicating with a broader audience, it's important to find the right question to answer, and then find the data related to answering it. This is what all data science work looks like. Even in A/B testing, when we measure the confidence interval between the control and experiment groups, the real point is still how we can use data to answer our question.
In that course, Jeff Leek continues with Moneyball. We can find evaluation metrics to measure players' skills, but the key question to answer is: "Can we be a winning team with a small budget?" Creating the best predictive model is not the most important thing; in the case of the Netflix Prize, the million-dollar algorithm could not be implemented because it was impossible to scale to all the customers combined.


Statistics and Machine Learning

Statistics is mainly divided into two parts: Descriptive Statistics and Inferential Statistics. Descriptive Statistics, as the name implies, uses statistics to better understand the data. This includes using summary statistics and visualization to explore the data. Jake Vanderplas, the author of the Python Data Science Handbook, showed how to use this method to understand the pattern of Seattle's bicycle habits on his blog.
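A minimal descriptive-statistics pass might look like this in Python. The daily counts are invented for illustration; at scale you would reach for something like pandas' `describe()` instead of the standard library.

```python
# Summary statistics over a small invented sample of daily ride counts,
# using only the standard library.
from statistics import mean, median

ride_counts = [120, 135, 98, 160, 142, 110, 155]  # hypothetical daily counts

summary = {
    "n": len(ride_counts),
    "mean": mean(ride_counts),
    "median": median(ride_counts),
    "min": min(ride_counts),
    "max": max(ride_counts),
}
```

Even this tiny summary already tells you the center and spread of the sample before you plot anything.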
Inferential Statistics, on the other hand, lets you use hypothesis testing and confidence intervals to make inferences about your assumptions. Suppose you have two groups with a numerical variable (female ages vs. male ages), and you want to find out whether these groups are significantly different from each other. Statistical inference requires you to follow an experimental design, so that your inference generalizes well to your population of interest and so that a correlation you find can suggest causation. This method is useful for gaining insight into whether two variables are correlated with each other.
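To make the two-group comparison concrete, here is a hand-rolled Welch's t-statistic over two invented samples; in practice you would use `scipy.stats.ttest_ind(..., equal_var=False)`, which also gives you the p-value.

```python
# Welch's t-statistic for two independent samples with unequal variances.
# The age samples below are invented for illustration only.
from statistics import mean, variance
from math import sqrt

group_a = [34, 29, 31, 35, 30]  # e.g. female ages (invented)
group_b = [27, 26, 30, 25, 28]  # e.g. male ages (invented)

def welch_t(a, b):
    """t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

t = welch_t(group_a, group_b)
```

A large |t| hints that the group means differ, but only the full test (with degrees of freedom and a p-value) lets you call the difference significant.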
Machine Learning is a field of artificial intelligence that gives a machine the capability to learn from your data. Thanks to the modern era, in which computers have grown in computational power and data science has gained hype, machine learning has grown into a broad area. Two of the interesting topics are Supervised Learning and Unsupervised Learning. In Supervised Learning, given a set of inputs and outputs, the machine tries to predict future outputs for future inputs. Think of it as a student who is given a bunch of quiz papers with the right and wrong answers marked: he learns from them and will be able to answer the quiz. An example of Unsupervised Learning is a clustering algorithm, as discussed earlier.
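The "learn from answered quizzes" analogy can be illustrated with a toy supervised learner: a 1-nearest-neighbour classifier written from scratch. The feature vectors and labels below are invented assumptions, and real work would use a library such as scikit-learn.

```python
# Toy supervised learning: 1-nearest-neighbour classification.
# Training pairs are (feature_vector, label); all data is invented.
def nearest_neighbour(train, query):
    """Return the label of the training point closest to the query."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda pair: sq_dist(pair[0], query))[1]

training_set = [((1.0, 1.0), "low_spender"), ((5.0, 6.0), "high_spender")]
label = nearest_neighbour(training_set, (1.2, 0.8))
```

The "training" here is just memorizing labelled examples; prediction copies the answer of the most similar quiz paper the student has seen.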
Machine Learning is a different use case from Statistical Inference. Have you seen Kaggle competitions? Take a look at a leaderboard: all the scores are extremely close, and often anyone in the top hundred could be considered a winner. Competitors are willing to take on two to three times more complexity to get a 1% increase. This is not always practical, and we saw from the Netflix Prize that the million-dollar code was abandoned because its computation was too expensive. So if you are in it for the accuracy game, go ahead! Otherwise, a simpler model is better.
So what do you use when you want to make a prediction? Use statistical modeling to understand what your prediction means, and use machine learning to make your prediction better. Statistical modeling worries about model complexity because we have to understand the model, while machine learning scales as complexity increases. Moreover, machine learning is concerned with tuning parameters for performance, while statistics is concerned with a parsimonious model (getting better understanding from a simpler model).


Software Engineering

So why is software engineering important in data science? Because you will often have to get data using a programming language. Sure, there is some reporting you can download from Google Analytics or any dashboard in your company, but those are only summary and aggregation metrics. Things get harder if you need advanced or specialized metrics that require you to build the aggregations yourself. Log data will always contain some mess, and if you have human-input data, there are going to be human errors. To handle all of that, you need engineering skills. Pulling data from a database alone requires some programming or SQL skills. If you need additional data from open data on the web, you also need programming to get it through an API.
So engineering skill is a critical part of data science. In fact, it's so critical that without engineering skills you won't get to the fun part, analyzing and making inferences or predictions, because you won't even know how to get the data and clean it. A software engineer alone, on the other hand, can still do something: they can get some data, validate it by analyzing inconsistencies, and produce some descriptive statistics to analyze it.
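As a tiny example of the "pull data with SQL" step, here is an in-memory SQLite session with an invented transactions table; the table name, columns, and amounts are assumptions for illustration.

```python
# Pulling an aggregation from a database with SQL, using Python's
# built-in sqlite3 module and an in-memory database with invented data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("food", 250.0), ("food", 120.0), ("transport", 60.0)],
)
rows = conn.execute(
    "SELECT category, SUM(amount) FROM transactions"
    " GROUP BY category ORDER BY category"
).fetchall()
conn.close()
```

This is exactly the kind of custom aggregation a dashboard won't give you out of the box.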


Toolbox

One debate about the data science toolbox is choosing between R and Python, but they are two different things. Python comes from a software engineering background: since Python first became famous in web development, gathering and manipulating data has become very important there. R, on the other hand, has a statistical background: a wide variety of statistical packages is available, and since statisticians also use visualization to gain insight about data, it is rich in visualization packages too. So use Python when processing the data, and R when you want to analyze it. Of course, the two sometimes overlap, and you can choose to stick to one language: you can manipulate strings in R, or do statistical analysis in Python. You also have a few choices for presenting the results of your analysis. For narrative, you can use Jupyter or R Markdown to tell the story of your analysis. Sometimes you end up wanting to engage your audience with your findings; in that case you want interactive visualization, and D3.js is great in this area.
So when you talk about data science, think first about the question you want to answer. Can you get information that answers the question? Do you have useful data to answer it? And if you can answer it, is the answer practically feasible? Working through this series of questions will save you from missteps in the long run.