The love of Star Wars is strong in my family. I have it, my husband has it, and my kids have it. With The Force Awakens debuting on Friday, December 18, I thought I'd revisit a project I worked on two years ago.
With my love of all things Star Wars and a personal interest in data science and text analytics, I thought it could be fun to explore the scripts from the original Star Wars trilogy (A New Hope, The Empire Strikes Back, and Return of the Jedi). I teamed up with my husband, Adam Maness, and we worked together to ingest the data (acquired from the Internet Movie Script Database) and analyze it.
As an aside, the first time I delivered this presentation was at a conference in DC. My presentation was technically scheduled in a time slot after the conclusion of the conference. I was worried that no one would stick around to watch my talk, so I started tweeting about Princess Leia delivering the presentation to try to drum up some support.
I was scheduled to fly in that morning, so I decided to "cinnamon bun" my hair, then change into my Leia costume when I got to the conference. After completing one "bun", I realized the dress wouldn't go on over my hair. I ended up having to wear the entire get-up to the airport.
I expected some strange looks, but I actually had a lot of fun. A guy in security asked me what I was trying to do. I looked at him and said, "I'm on a diplomatic mission to Alderaan!" I had others ask me about my religion (the Force, of course). I got more than a few wishes of "may the Force be with you", and took pictures with some folks. Want to feel like a celebrity? Walk around an airport dressed as Princess Leia. I highly recommend it!
Back to the scripts! First, how are we defining data science? I found this graphic by Drew Conway from Project FLA, and thought it summed up my views on data science nicely:
The math and statistics knowledge is a given for Data Jedi (Data Jedi just sounds cooler than Data Scientist...). In order to really be efficient and effective, they also need substantive expertise and some hacking skills. In the realm of the Data Jedi, the hacking skills are the data skills. If a Data Jedi always has to rely on someone else to wrangle their data, they won't be as agile as they could be. The next piece really speaks to that fact.
The data process, as is typically the case, was the most time-consuming part of the project. We kept going back and forth on the best way to represent the data. There are a number of different approaches that could have been pursued. At the end of the day, we had to figure out how to take large chunks of unstructured data and add appropriate structure. We also recognized that it would be important to derive some additional structured variables to give us more to look at and make the analysis more interesting.
To provide some context, the original data looked something like this:
It's just a wall of text. This is where the Data Jedi's Discipline comes into play. In our example, SAS® Data Integration Studio was the Discipline. SAS® Data Integration Studio is a point-and-click environment that enabled us to build and edit data as well as manage the metadata.
The primary data process involved creating a reference to the original data, specifying the delimiters and file parameters, then viewing the output.
The first version of the dataset looked like this:
We found that the initial pass wasn't going to work without some additional parameters. Since the ultimate plan was to analyze the text, having the text broken out by line break would lead to inaccurate results, because a single speech would be split across several rows. Just as analytics is iterative, so are data processes.
After some trial and error, we ended up with a dataset that combined the data from all three movies into a dataset called Trilogy. We were able to associate character lines with the proper character, pull out location information, and associate lines with the source film.
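For readers who want to try the same idea outside of SAS, here is a minimal Python sketch of that structuring step. It is not the Data Integration Studio job we built. It assumes a screenplay-like layout in which scene headings start with INT. or EXT., character names sit alone on an uppercase line, and a blank line ends a speech. The sample text and the `parse_script` function are made up for illustration.

```python
import re

# Made-up sample in a screenplay-like layout (an assumption, not the real script data).
SAMPLE = """\
INT. REBEL SHIP - CORRIDOR

THREEPIO
Did you hear that?
They have shut down the main reactor.

LUKE
I have a bad feeling
about this.

EXT. DESERT - DAY

LUKE
Where are we going?
"""

SCENE = re.compile(r"^(INT\.|EXT\.)\s+(.*)$")      # scene heading, e.g. "INT. SHIP"
SPEAKER = re.compile(r"^[A-Z][A-Z0-9 '\-]*$")      # a name alone on an uppercase line


def parse_script(text, film):
    """Turn raw script text into one record per speech."""
    records = []
    location = None
    speaker = None
    buffer = []

    def flush():
        # Join wrapped dialogue so one speech is one record, not one row per line break.
        if speaker and buffer:
            records.append({
                "film": film,
                "location": location,
                "character": speaker,
                "text": " ".join(buffer),
            })

    for raw in text.splitlines():
        line = raw.strip()
        scene = SCENE.match(line)
        if scene:
            flush()
            location, speaker, buffer = scene.group(2), None, []
        elif not line:
            flush()
            speaker, buffer = None, []
        elif speaker is None and SPEAKER.match(line):
            speaker = line
        elif speaker is not None:
            buffer.append(line)
    flush()
    return records
```

Calling `parse_script(SAMPLE, "A New Hope")` yields one dictionary per speech, each tagged with film, location, and character. Real scripts are messier (stage directions, parentheticals, inconsistent indentation), so expect to iterate on the patterns just as we iterated on the file parameters.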
The next step was to begin analyzing the data using text analytics. SAS® Contextual Analysis is the Data Jedi's lightsaber. It's an elegant weapon, for a more civilized age.
With SAS® Contextual Analysis, we were able to classify, cluster, create term and concept maps, build rules, and score the data. The end goal was to create more structure from the unstructured data.
Initial exploration indicated some problems in the data. Taking a look at the terms list and their associated synonyms showed that Luke was appearing as a person, a location, an organization, and a proper noun. That means that when analyzed with text analytics, he'd appear multiple times in a variety of contexts.
Ideally we'd have Luke just appearing as a person, or better yet, a Jedi. To accomplish this, we created a synonym list. We went through the exercise for most of the key characters. Here's a small snippet of the synonym list:
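The idea behind a synonym list can also be sketched in a few lines of Python. This is only an illustration of the concept, not the SAS synonym list format, and the entries below are hypothetical rather than taken from our actual list.

```python
import re

# Hypothetical entries for illustration; not the project's actual synonym list.
SYNONYMS = {
    "luke": "Luke Skywalker",
    "luke skywalker": "Luke Skywalker",
    "vader": "Darth Vader",
    "lord vader": "Darth Vader",
    "darth vader": "Darth Vader",
}

# Try the longest variants first so "luke skywalker" wins over "luke".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(term) for term in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Replace every known variant with its canonical term."""
    return _PATTERN.sub(lambda m: SYNONYMS[m.group(1).lower()], text)


print(normalize("Lord Vader is looking for Luke."))
```

Once every variant collapses to one canonical term, a character is counted once per mention instead of being scattered across several entries in the terms list.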
Now armed with a synonym list, we can move on to some visuals. Let's start with one of the most pivotal characters in the entire saga...Darth Vader:
This term map picks up on the relationship between Darth Vader, the Emperor, and Luke Skywalker.
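A term map is built on SAS's own statistics, but the underlying intuition is co-occurrence: which terms keep showing up alongside a term of interest. Here is a rough Python sketch of that intuition only. The function name, the sample lines, and the simple substring matching are all assumptions made for illustration.

```python
from collections import Counter


def cooccurring_terms(lines, target, terms):
    """Count how often each term appears in the same line as the target term."""
    counts = Counter()
    target = target.lower()
    for line in lines:
        lowered = line.lower()
        if target not in lowered:
            continue  # only lines that mention the target contribute
        for term in terms:
            if term.lower() != target and term.lower() in lowered:
                counts[term] += 1
    return counts


# Made-up lines, not quotes from the scripts.
lines = [
    "Vader serves the Emperor.",
    "Luke will face Vader.",
    "The Emperor has foreseen it.",
    "Vader and the Emperor wait for Luke.",
]
print(cooccurring_terms(lines, "Vader", ["Emperor", "Luke", "Vader"]))
```

Substring matching is crude (it would also match a term buried inside a longer word), so a real analysis would tokenize the text first.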
Next, let's take a look at Captain, then Admiral Piett. In case you aren't sure who Piett is, here's a picture. He's a very solemn guy.
In this term map we can see Piett's rise from captain to admiral of the Star Destroyer thanks to Admiral Ozzel's untimely demise at the mercy of Darth Vader's Force choke. It's also apparent that he's often standing on the bridge of the ship:
Next, we have markers for the Dark Side--fear, anger, hatred:
Digging into the Force shows the connection to the Dark Side:
...and the coup de grace. This one is my personal favorite because it shows the relationship between Luke Skywalker and bringing balance back to the Force. This part makes me very excited to see what happens in The Force Awakens!
Exploration of unstructured data doesn't stop with term mapping. We can also look at topic clusters. This is an example of automated topic extraction. It shows the connections between Darth Vader, the Emperor, Luke Skywalker, and the word father. It shows a word cloud and some context down below:
Exploring that topic shows us that we have a contextual rule that was automatically generated around Darth Vader:
We can drill into the rule to see the syntax. You can also modify the rule, if necessary. Machines sometimes work with language in ways that a person may not. In this example, Darth Vader is classified as darth, Darth Vader, or vader. We might want to remove "darth" because that title is a Sith title, and isn't specific to Vader.
We also have the power to build our own rule sets and classifications based on business rules. In this example, I created a few categories--Bad Guys, Good Guys, Systems (as in planetary), The Dark Side, The Force, and Weapons. In the example below, we can see that the Bad Guys category has rules outlined for Darth Vader, Jabba the Hutt, Emperor, Admiral Piett, Bounty Hunters, and Sith. "Bad Guys" is purely subjective on my part. Bounty Hunters aren't really "bad", but for my example, I've labeled them in that manner.
Rules don't have to be complex or complicated. In the case of my Bounty Hunters, the actual rules are just classifiers. If the name Boba Fett is mentioned, lump him into Bounty Hunters under Bad Guys. For reporting purposes, Boba, Greedo, IG-88, Dengar and Bossk are all going to show up as Bounty Hunters, and not as individuals.
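To show how simple a classifier rule of this kind can be, here is a small Python sketch. It is not the rule syntax used by SAS Contextual Analysis. The `RULES` table and the `classify` function are hypothetical and only mirror the categories described above.

```python
# Hypothetical rule table modeled on the categories described in the post.
RULES = {
    ("Bad Guys", "Bounty Hunters"): ["Boba Fett", "Greedo", "IG-88", "Dengar", "Bossk"],
    ("Bad Guys", "Darth Vader"): ["Darth Vader", "Vader"],
}


def classify(text):
    """Return the sorted (category, subcategory) labels whose terms appear in the text."""
    lowered = text.lower()
    return sorted(
        label
        for label, terms in RULES.items()
        if any(term.lower() in lowered for term in terms)
    )


print(classify("Boba Fett and Bossk are waiting on the bridge."))
```

Because the label is attached to the group rather than to each name, a mention of any listed bounty hunter is reported under Bounty Hunters instead of as an individual.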
After having engaged in the data management processes, visual exploration, and rule creation, we end up with a dataset that has 14 variables. That's 14 structured fields from that original wall of text!
What can we do with all of this new, structured data? We can further visualize it! SAS® Visual Analytics is the Data Jedi's Mind Tricks. These are the visualizations you were looking for! Let's look at a few distributions of the new variables. First, the distribution of spoken lines. These shouldn't surprise many fans. I love Luke Skywalker. I really do, but the man whines a lot. Han is always ready with a snarky or sarcastic remark, and C-3PO is an anxiety-ridden mess.
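A distribution of spoken lines is, at its core, a count of records per character. Here is a minimal Python sketch of that tally, assuming one record per speech with a `character` field (a hypothetical layout). The records below are made up and do not reflect the actual counts from the trilogy.

```python
from collections import Counter


def lines_per_character(records):
    """Return (character, line count) pairs, most talkative first."""
    counts = Counter(record["character"] for record in records)
    return counts.most_common()


# Made-up records; the real dataset has one row per speech in the scripts.
records = [
    {"character": "LUKE"},
    {"character": "HAN"},
    {"character": "LUKE"},
    {"character": "THREEPIO"},
    {"character": "LUKE"},
    {"character": "HAN"},
]
print(lines_per_character(records))
```

The resulting pairs can be handed to any charting tool to draw the kind of bar chart a distribution view provides.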
Next, a breakdown in mentions of the Dark Side versus the Force:
Finally, a breakdown in the mention of weapons, with the Death Star getting an overwhelming majority of the mentions:
Our Data Jedi's Mind Tricks also has the ability to create word clouds with topic detection and sentiment analysis. Keep in mind, this is a conglomeration of the scripts from all three movies in the original trilogy. Here's a word cloud with some contextual information below about Luke:
...and who doesn't love Ewoks?
I hope you have enjoyed this analytical tour through the first three films in the Star Wars saga. The original paper that this post was based on can be found here: Star Wars and the Art of Data Science.
Have questions? Want to see more? Contact me!
Enjoy The Force Awakens, and may the Force be with you!