The love of Star Wars is strong in my family. I have it, my husband has it, and my kids have it. With The Force Awakens debuting on Friday, December 18, I thought I'd revisit a project I worked on two years ago.

As an aside, the first time I delivered this presentation was at a conference in DC. My presentation was technically scheduled in a time slot after the conclusion of the conference. I was worried that no one would stick around to watch my talk, so I started tweeting about Princess Leia delivering the presentation to try to drum up some support.
I was scheduled to fly in that morning, so I decided to "cinnamon bun" my hair, then change into my Leia costume when I got to the conference. After completing one "bun", I realized the dress wouldn't go on over my hair. I ended up having to wear the entire get-up to the airport.
I expected some strange looks, but I actually had a lot of fun. A guy in security asked me what I was trying to do. I looked at him and said, "I'm on a diplomatic mission to Alderaan!" Others asked me about my religion (the Force, of course). I got more than a few wishes of "may the Force be with you", and took pictures with some folks. Want to feel like a celebrity? Walk around an airport dressed as Princess Leia. I highly recommend it!
Back to the scripts! First, how are we defining data science? I found this graphic by Drew Conway from Project FLA, and thought it summed up my views on data science nicely:

The data process, as is typically the case, was the most time-consuming part of the project. We kept going back and forth on the best way to represent the data, and there were a number of different approaches we could have pursued. At the end of the day, we had to figure out how to take large chunks of unstructured data and add appropriate structure. We also recognized that it would be important to derive some additional structured variables to give us more to look at and make the analysis more interesting.
To provide some context, the original data looked something like this:
It's just a wall of text. This is where the Data Jedi's Discipline comes into play. In our example, SAS® Data Integration Studio was the Discipline. SAS® Data Integration Studio is a point-and-click environment that enabled us to build and edit data as well as manage the metadata.
The primary data process involved creating a reference to the original data, specifying the delimiters and file parameters, then viewing the output.

The first version of the dataset looked like this:

After some trial and error, we ended up with a dataset that combined the data from all three movies into a dataset called Trilogy. We were able to associate character lines with the proper character, pull out location information, and associate lines with the source film.
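We did this work in a point-and-click tool, but the same idea can be sketched in a few lines of Python. This is only an illustration, not what we actually ran: it assumes a common screenplay layout (scene headings that start with INT. or EXT., speaker cues on their own all-caps line, dialogue on the lines that follow), and the names `Line`, `parse_script`, and `SAMPLE` are made up for the example. The sample text is invented, not quoted from the scripts.

```python
import re
from dataclasses import dataclass

# Assumed layout (hypothetical, not confirmed for the real script files):
# scene headings start with INT. or EXT.; speaker cues are all-caps lines.
SCENE_RE = re.compile(r"^(?:INT\.|EXT\.)\s+(.+)$")
SPEAKER_RE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*$")


@dataclass
class Line:
    film: str
    location: str
    character: str
    text: str


def parse_script(raw: str, film: str) -> list[Line]:
    """Turn raw script text into one structured row per spoken line."""
    rows: list[Line] = []
    location = ""
    speaker = None
    buffer: list[str] = []

    def flush() -> None:
        # Emit the dialogue collected so far, then reset the speaker state.
        nonlocal speaker, buffer
        if speaker and buffer:
            rows.append(Line(film, location, speaker, " ".join(buffer)))
        speaker, buffer = None, []

    for raw_line in raw.splitlines():
        line = raw_line.strip()
        if not line:
            flush()  # a blank line ends the current block
            continue
        scene = SCENE_RE.match(line)
        if scene:
            flush()
            location = scene.group(1)
        elif speaker is None and SPEAKER_RE.match(line):
            speaker = line
        elif speaker is not None:
            buffer.append(line)
        # Anything else is treated as stage direction and skipped.
    flush()
    return rows


# Invented sample text in the assumed layout.
SAMPLE = """INT. REBEL BLOCKADE RUNNER - MAIN PASSAGEWAY

THREEPIO
This is a made-up first line
that wraps onto a second line.

Troopers rush past the droids.

EXT. TATOOINE - DESERT

LUKE
Another invented line.
"""

trilogy = parse_script(SAMPLE, film="A New Hope")
for row in trilogy:
    print(row.film, "|", row.location, "|", row.character, "|", row.text)
```

Running the same function over each film's script and concatenating the results would give a single table along the lines of Trilogy, with film, location, character, and line as columns.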

With SAS® Contextual Analysis, we were able to classify, cluster, create term and concept maps, build rules, and score the data. The end goal was to create more structure from the unstructured data.
Initial exploration indicated some problems in the data. A look at the term list and its associated synonyms showed that Luke was appearing as a person, a location, an organization, and a proper noun. That means that when analyzed with text analytics, he'd appear multiple times in a variety of contexts.

Ideally we'd have Luke just appearing as a person, or better yet, a Jedi. To accomplish this, we created a synonym list. We went through the exercise for most of the key characters. Here's a small snippet of the synonym list:
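To make the idea concrete outside of the SAS tooling, here is a minimal Python sketch of how a synonym list collapses variants into a single term before counting. The entries in `SYNONYMS` are hypothetical and are not the project's actual list, and `canonical` is a helper invented for this example.

```python
from collections import Counter

# Hypothetical synonym list: each variant (lowercased) maps to one canonical term.
SYNONYMS = {
    "luke": "Luke Skywalker",
    "skywalker": "Luke Skywalker",
    "luke skywalker": "Luke Skywalker",
    "threepio": "C-3PO",
    "see-threepio": "C-3PO",
    "c-3po": "C-3PO",
}


def canonical(term: str) -> str:
    """Return the canonical form of a term, or the term itself if unmapped."""
    return SYNONYMS.get(term.strip().lower(), term)


# Invented mentions, as they might come out of a first pass over the text.
mentions = ["Luke", "LUKE", "Skywalker", "Threepio", "Han"]
counts = Counter(canonical(m) for m in mentions)
print(counts)
```

Without the mapping, "Luke", "LUKE", and "Skywalker" would be counted as three separate terms. With it, they roll up into one entry, which is the effect we were after for the key characters.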


Next, let's take a look at Captain (later Admiral) Piett. In case you aren't sure who Piett is, here's a picture. He's a very solemn guy.







We can drill into the rule to see the syntax. You can also modify the rule, if necessary. Machines sometimes work with language in ways that a person may not. In this example, Darth Vader is classified as darth, Darth Vader, or vader. We might want to remove "darth" because it is a Sith title and isn't specific to Vader.
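Here is a rough Python analogue of what editing that rule does. It uses an ordinary regular expression, not SAS Contextual Analysis rule syntax, and `build_rule`, `find_mentions`, and the sample sentence are all invented for illustration.

```python
import re


def build_rule(terms):
    """Compile a case-insensitive whole-word matcher for a list of terms."""
    # Longest terms first so "darth vader" is preferred over "darth" alone.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_mentions(rule, text):
    """Return every span of text the rule matches, in order."""
    return [m.group(0) for m in rule.finditer(text)]


# The rule as generated: any of the three terms counts as Vader.
original_rule = build_rule(["darth", "darth vader", "vader"])
# The edited rule: "darth" on its own no longer counts.
revised_rule = build_rule(["darth vader", "vader"])

# Invented sentence containing a "Darth" who is not Vader.
sample = "Darth Vader waits. Vader is patient. Darth Sidious is another Sith."

print(find_mentions(original_rule, sample))
print(find_mentions(revised_rule, sample))
```

With the original term list, the "Darth" in front of another Sith's name is picked up as a Vader mention. Dropping "darth" from the list removes that false match while still catching "Darth Vader" and "Vader".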










Have questions? Want to see more? Contact me!
Enjoy The Force Awakens, and may the Force be with you!