I’m surprised that neither Stephen Wolfram nor Nick Felton has yet tackled the “change in my pocket” analysis.
I’ve talked before about issues with using social media data to predict election outcomes. Yesterday, Mashable’s 78th infographic of the day looked at a new wrench in the gears: spam:
The same techniques used by social spammers advertising free iPads and Viagra are now being used to spread bogus political messages across social media, blogs and news sites.
Yet another instance that should bring home the point that social media mention volume isn’t everything. A low-noise data source (likely coming from a tool with robust filtering capabilities) is the only way to reduce the impact of this nonsense on data quality.
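To make the filtering point concrete, here is a minimal sketch of the kind of heuristics a listening tool might apply before counting mentions. The data shape and thresholds are hypothetical, not any vendor’s actual method; the spam signals used (verbatim repeats, link-only posts) are just the most obvious ones.

```python
import re
from collections import Counter

def filter_mentions(mentions, max_repeats=2):
    """Drop mentions that look like spam: near-verbatim repeats and
    posts that are empty once their links are stripped (hypothetical
    heuristics for illustration)."""
    def normalize(text):
        # Strip URLs and collapse whitespace so copy-paste spam matches.
        no_links = re.sub(r"https?://\S+", "", text)
        return re.sub(r"\s+", " ", no_links).strip().lower()

    counts = Counter(normalize(m) for m in mentions)
    clean = []
    for m in mentions:
        key = normalize(m)
        if counts[key] > max_repeats:
            continue  # identical text posted many times -> likely spam
        if len(key) < 5:
            continue  # nothing left after stripping links
        clean.append(m)
    return clean
```

Even crude rules like these change the volume numbers dramatically once bot-driven repeats are in the stream, which is exactly why raw mention counts can’t be trusted.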
This is the blog-post equivalent of arriving on the platform just as the 3 train pulls in. Two items on NYC Subway data crossed my radar this morning:
- The infographic above uses ridership data to reveal the busiest and calmest stations.
- The MTA is going to open up real-time data on trains to developers. I have an app on my phone that does some whiz-bang things with augmented reality and trip planning, but it has a kludgy system for notifying me of delays and can’t tell me when the next train will actually arrive.
Here’s a Big Problem for practitioners of social listening to solve: what happens when the “people” responsible for Consumer-Generated Media aren’t actually people? Whether you’re taking a sample or analyzing in aggregate, the pool is contaminated.
I didn’t have a chance to pick an NCAA bracket this year. I’m not too upset, as it means that my winning streak is intact (I won my office pool several years ago with a bracket titled “I actually hate Duke”). While they don’t account for the psychology of an office pool, I take a hard look at predictions from FiveThirtyEight’s Nate Silver and others before I complete my bracket.
This time of year is also exciting for mathematically inclined sports fans because it means that MIT’s Sloan School of Management hosts its Sports Analytics Conference.
I recognize that the point of this post is measuring what matters in the digital space, but this section totally reads like the treatment for Moneyball 2. Somewhere Jonah Hill is getting ready for his second Oscar nomination.
I find it interesting that the tone here is so much more nonchalant than it was for the New York Times piece on Target a little while ago.
The video associated with this post is great and well worth your 15 minutes. But two bullet points deserve to be shared in their entirety:

5. "Data quality sucks, just get over it."

That is the title of my post from June 2006. And look how far we've come. : )

Multiply all of that a million times when it comes to big data. We will have dirty data. We will have no idea what to do with videos or spoken text or (omg!) social media overload. We will be missing primary keys. We will suffer from a lack of clean meta data (or sometimes any meta data!). We will realize the shallow limits of sentiment analysis. We will cry from the pain of the painful business process fixes that usually result in good data.

And yet, we are standing on a mountain of gold.

Do the best you can in terms of collecting, processing, and storing data of the cleanest possible quality. Know when to shift to data analysis. Start making decisions. Make small ones at first. (Remember, even they will be revolutionary, as these datasets have never come together!) Make bigger ones over time, as you understand the limitations of what you are dealing with.

Here's the kiss of death: Big data implementation projects where the first touch of an Analyst will come 18 months after the project was first conceived. You see, the world will have changed so dramatically in 18 months that nothing you possibly spec'ed for is relevant any more.

Think smart. Move fast. Slowly become Godlike over time.

6. Eliminating noise is even more important than finding a signal.

This might be a little controversial. But stay with me.

Thus far in the history of data analysis, the objective for our queries has been trying to find the signal amongst all the noise in the data. That has worked very well. We had clean business questions. The data size was smaller and the data set was more complete and we often knew what we were looking for. Known knowns and known unknowns. (See video above.)

With big data, it is so much more important to be magnificent at knowing what to ignore. You must know how to separate out all the noise in the disparate huge datasets to even have a fighting chance to start to look for the signal.

It is amazing but true. If you are not magnificent at knowing what to ignore, you'll never get a chance to pay attention to the stuff to which you should be paying attention.

Your business savvy. Your analytical gut instinct. Tuning your algorithms to first ignore and then hunt for insights. That is what will have a material impact.

http://www.kaushik.net/avinash/big-data-imperative-driving-big-action/
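The "ignore first, then hunt" idea quoted above can be sketched as a tiny two-stage pipeline. Everything here is hypothetical for illustration: the noise markers, the data shape, and the "signal" (plain mention volume) are assumptions, not anything from Kaushik's post.

```python
# Hypothetical two-stage pipeline: aggressively ignore noise first,
# then look for signal only in what survives.
NOISE_MARKERS = ("free ipad", "viagra", "click here")  # assumed spam phrases

def ignore_noise(posts):
    """Stage 1: discard anything matching known noise patterns."""
    return [p for p in posts if not any(m in p.lower() for m in NOISE_MARKERS)]

def find_signal(posts, topic):
    """Stage 2: only now measure mention volume for a topic."""
    return sum(topic.lower() in p.lower() for p in posts)
```

The point of the ordering is the whole argument: run `find_signal` on the raw stream and spam inflates the count; run `ignore_noise` first and the number you report actually means something.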
Sorry folks, I’m taken :-)