alexjamesfitz:

The New York Times sheds some more light on PRISM here. Two key points:

1) All the technology companies specifically denied giving the NSA “direct access” to their servers. As many began to suspect, the methods discussed in this report don’t involve “direct access,” but they certainly involve “access.”

2) Tech companies might be denying PRISM because the employees with knowledge of the program are barred from speaking about it.

What Palantir describes in a 2009 NPR interview (http://n.pr/196Owkl) seems like PRISM, though I am not certain.

(via shortformblog)

Storage herder

These are excerpts from two recent posts by @StorageHangover. The first is about a trendy favorite, Big Data, and why it brings responsibility along with opportunity.

Via Bigger data isn’t better data

Bigger is better. It’s a lesson that’s drilled into us from childhood —  there are people, groups or even entire countries trying to be the biggest… Biggest building, biggest car, biggest ball of twine — they’re all out there. You ate the biggest steak on the menu? You’re awesome — here’s your big tee shirt.

Bigger is Just Bigger

The modern data center is not immune to the Bigger is Better Syndrome. Big servers, big networks and, most of all, big data. Sorry, that should be Big Data with a capital “B” and “D” because Big Data is its own Big Thing and clearly of Big Importance.

Everything is about Big Data these days — how to store it, how to make it accessible, how to manage it, how to secure it and make more of it. And don’t forget how to get it back in case you lose it.

Next, StorageHangover discusses something more general, but less intuitive.

Remember, it isn’t applicable to tapes only!

Via Data storage on tape! What tape?

The value of storing data on tape for the long-term is lost… Companies change backup systems, business applications… When generations of tapes are shipped to the vault, the technology used to create or read that data is often swapped out multiple times during the tape’s existence.

The result is a proprietary, locked repository containing years of a company’s identity … As time passes, the value of the data continues to decrease while the complexity of accessing the data increases.

The risk associated with maintaining that data also continues to increase over time.

GraphChi is the little dog

GraphChi[huahua] is a spin-off of the GraphLab[rador retriever] project:
GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in similar vertex-centric model as GraphLab. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel.
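To make "vertex-centric" and "asynchronously" a little more concrete, here is a minimal sketch in plain Python. It is not GraphChi's API, and it keeps the whole graph in memory rather than processing it from disk. The function name `connected_components` and the toy edge list are my own assumptions for illustration. Each vertex runs the same small update, and because labels are written in place, a change is immediately visible to the vertices updated after it in the same sweep.

```python
from collections import defaultdict


def connected_components(edges, num_vertices):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Each vertex starts out as its own component.
    labels = list(range(num_vertices))

    def update(vertex):
        """Vertex-centric update: adopt the smallest label seen locally."""
        best = min([labels[vertex]] + [labels[n] for n in neighbors[vertex]])
        changed = best != labels[vertex]
        # Written in place, so later updates in this sweep already see it.
        labels[vertex] = best
        return changed

    # Sweep over all vertices until no label changes.
    any_changed = True
    while any_changed:
        any_changed = False
        for vertex in range(num_vertices):
            if update(vertex):
                any_changed = True
    return labels


# Hypothetical toy graph: a chain 0-1-2, a pair 3-4, and a loner 5.
print(connected_components([(0, 1), (1, 2), (3, 4)], 6))
```

A synchronous version would read from a frozen copy of the labels and need more sweeps to converge; the in-place write is what the quoted "immediately visible" refers to.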

The motivation for GraphChi is for anyone with a PC to be able to run large-scale graph analyses, such as those that represent social networks. I wonder if it is applicable to my favorite useful operations research methods? Those weren’t relevant to social networks. Instead, we used linear programming with the Simplex method and similar algorithms to find shortest paths. Also, variations on the Traveling Salesman problem were used for transportation logistics, for both airline scheduling and shipping fleets.

GraphChi and, I think, GraphLab are both open source and available on Google Code Project Hosting.

Intel Graph

Then there’s also Graph Builder, originally from Intel. That is also a GraphLab project. Based on the one trackback on the 2 Nov 2012 blog post, this is the status:
Yet it is pretty hard to do distributed graph analytics. Intel will opensource their GraphBuilder, which allows you to construct large-scale graphs on top of Hadoop and process them with GraphLab

Nimsoft is the company that acquired Watchmouse, a long-time favorite of mine for performance monitoring, security, and probably more. Nimsoft is probably best characterized as a comprehensive cloud computing provider. I read a post on their company blog about the intense focus on big data:

This technology enables real time analysis of social network information… it is all about mining the trillions of trivial postings that we collectively make every second on the social networks of our choice. I know that “trivial” is harsh.

Why such interest in it then?

Mining and refining trivia can be used in innovative ways. Mostly it is about targeted advertising, but it is also about trend recognition and the analysis of collective thinking.

Sometimes the analysis of collective thinking is intriguing. Google’s Social Collider is one example. Another is Culturomics, which was associated with development of the Ngram Viewer. That was great. But such work also has a tendency to let questionable ideas roam farther than they would otherwise, particularly when they benefit from the veneer of faux-analytics.

Why else is big data so compelling? Well, there is the technological challenge too.

What we have been using so far is inadequate for this job. With classical technology, and particularly SQL based databases, retrieval performance degrades exponentially with volume. Even the concept of “collect, store and analyze” has to be rethought. Now it is more like “collect, cache, analyze, store result”. To do that in real time with a variable and unpredictable arrival rate of data requires massive parallelism and efficiency of execution. It reminds me of the early days of computing when data storage structures were designed for performance and code execution times were measured and constantly optimized.
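A rough sketch of the “collect, cache, analyze, store result” idea, in Python. This is only an illustration under my own assumptions: the class name `RollingAnalyzer`, the rolling-mean analysis, and the in-memory list standing in for durable storage are all hypothetical, and nothing here addresses the parallelism the quote calls for. The point is just that raw readings live briefly in a bounded cache and only the analyzed result is kept.

```python
from collections import deque


class RollingAnalyzer:
    """Keep a small cache of recent readings; persist only the analysis."""

    def __init__(self, window_size):
        # Bounded cache: the oldest reading is dropped automatically.
        self.cache = deque(maxlen=window_size)
        # Stand-in for durable storage of results (an assumption).
        self.results = []

    def collect(self, value):
        """Collect one reading, analyze the cached window, store the result."""
        self.cache.append(value)
        rolling_mean = sum(self.cache) / len(self.cache)
        self.results.append(rolling_mean)
        return rolling_mean


analyzer = RollingAnalyzer(window_size=3)
for reading in [10, 20, 30, 40]:
    analyzer.collect(reading)
print(analyzer.results)
```

The raw stream is never stored in full, which is the difference from “collect, store and analyze”.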

Technology innovators race to produce the fastest, most efficient analytical tool with the most linear performance profile. Are they doing this in order to accomplish anything productive? Sentiment analysis, the zeitgeist, and the living pulse of our collective psyche, desires and dreams are cool to contemplate. Beyond that… I don’t have a clear vision.

Post-script

Consider too this article about venture capital funding pouring into big data companies (ComputerWorld). Some of these companies are neither start-ups, nor particularly innovative in storage or processing of enormous data sets:

Curt Monash warned investors to beware of the hype surrounding the technology. “A great example of hype is anybody calling Birst a ‘big data’ or ‘big data analytics’ company,” he said.

A prior Computerworld article described how Birst recently received $26 million in funding from Sequoia Capital and others, and has raised a rather hefty $46 million overall. Yet Birst went into business back in 2005, as a cloud-based business intelligence service. It has only recently begun presenting its products and software as a tool set for analyzing and deriving deeper meaning from petabyte-scale data sets.

As Curt Monash says:

“If anything, Birst is a ‘little data’ analytics company that claims, as a differentiating feature, that it can handle ordinary-sized data sets as well.”

nosql:

By Forrester’s Boris Evelson and Brian Hopkins:

Big data: Techniques and technologies that make handling data at extreme scale economical.

And the explanation of the 4 Vs: volume, velocity, variability, variety.

Original title and link: A Definition of Big Data (NoSQL database © myNoSQL)

Maybe…

Machine learning in Wonderland: bigger data or better algorithms

empiricator:

Data flows in constantly and time-sensitive predictions need to be made in order to efficiently manage supply volatility. Solving this problem requires machine learning, econometrics, and statistical tools.

Bigger Data or better algorithms?

(Source: metamarketsgroup.com, via empiricator)

Interview with Amy Heineike - Mathematician

simplystatistics:


Amy Heineike is the Director of Mathematics at Quid, a startup that seeks to understand technology development and dissemination through data analysis….

I recall reading about the oddly named Quid via TechCrunch some months ago, and later about Amy Heineike’s achievement in snagging a much-desired job among the data, mathematics and statistics crowd! Both Amy and Quid seem to be doing well!

Background

I dug around and unearthed that TechCrunch post introducing Quid (September 2010), whose primary focus seemed to be a critique of the Quid site’s typeface. (There is no reason a quantitative analysis-based company is obliged to have an ugly website!)

Better coverage was available via a more recent New York Times article (October 2011):

Quid tracks job listings, customer wins and funding valuations at first, second, and third funding rounds to give venture investors a better yardstick — and better quantitative tools — in valuing startups…In fact, Quid is an offshoot of YouNoodle.com, who offered an intriguing calculator that it claimed could predict the exit or liquidity valuation of startups.

The following are excerpts from Simply Statistics’ interview of Ms. Heineike:

What skills do you think are most important for statisticians moving into the tech industry?

Technical statistical [expertise] is the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users… A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).

Generally speaking, the earlier stage the company that you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done. Later stage teams and big tech companies may have roles that are purer statistics.

There is a real opportunity for people who have good statistical and computational skills [at a graduate degree level] to get into the startup and tech scenes now. Getting involved in an open source project, working with version control in a team, or sharing your code on Github are all good ways to start.

It’s really important to be able to show that you want to build products though. Imagine the clients or users of the company and see if you get excited about building something that they will use.

Go ahead and read the full interview on the Simply Statistics tumblr.

Make sure to leave enough time to have a look around in general, as it is a fine tumblr for the mathematically and statistically inclined, well, for those with practical inclinations!

The problem with Big Data

evilmartini:

So today I’m unleashing my pent up rage on the “Big Data” crew; devotees and neophytes alike.

You do not have a big data problem. You have a functional ignorance problem.

Go back and read that a few times if necessary. Or to put it another way:

“Before you turned to big data, did you first try ‘small data’(tm)”

Or to put it yet a third and more direct way

“What’s your question?”

Most people who are “turning to big data” in their time of need don’t even know the question that they are questing for. As a result, many of the current “big data” set (pun intended) are collecting exabytes of data to hide their collective ignorance…

Start-ups are cheapening the term by using it to prop up an endless series of questionable business models and generally bad ideas.

The companies that are at the forefront of “Big Data” are there because they are solving interesting problems - at scale. Most people mistake problems at scale for the scale of their problems. What do I mean by this? Consider, Google set out to build a system that could search the web and help you find the information you wanted when you wanted it. They did not set out to build a business on top of map reduce… Netflix started out with the mission of helping me find the movies that I want to watch and get them to me… They didn’t start out with the mission of becoming the leader in applied machine learning and recommendations. The product provides the reason for the technology. 

What we now call machine learning, was called descriptive statistics in the ’60s and ’70s…

    Models often have hidden flaws that only get exposed in the real world. Your model can be fine with test data, but break on production data. The sad part is that you won’t know that it’s broken until your auto suggestion algorithm starts recommending home Euthanasia kits to people searching for elder care books.

    This post isn’t to vilify every company that mentions data analysis as being core to their product… This post is an attempt to throw the wet blanket of reality onto the bonfire of investment that seems to be throwing perfectly good VC cash down the drain in the hopes that analyzing the data from your last game-mechanic-social-coupon-buying hot thing will finally have them making money instead of spending it….

    Three places to look before you hit big data

    Classical descriptive statistics.

    Every time you map reduce without drawing a box plot, God murders a marmoset. For most data sets, starting with box plots and histograms does no harm and provides valuable insight on how to proceed. Far too often, we have one tiny nail to drive and we reach for a gigantic sledge hammer.
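For what it’s worth, the numbers behind a box plot take only a few lines of standard-library Python. This is a minimal sketch, not anything from the original post: the function names and sample values are made up for illustration, and the 1.5 × IQR fence is the usual box-plot convention for flagging outliers. It needs Python 3.8 or later for `statistics.quantiles`.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the numbers a box plot draws."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def outliers(values):
    """Flag points beyond 1.5 * IQR from the quartiles (box-plot whiskers)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]


# Hypothetical sample: mostly small values and one suspicious reading.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(five_number_summary(sample))
print(outliers(sample))
```

A summary like this on a modest sample often says whether the data deserves heavier machinery at all.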

    Simulation

    Computers are awesome at simulating things… Instead of recording down every possible piece of information - try recording a little bit of information and simulating the rest. This very technique lies at the heart of one of the more powerful techniques in the statistical tool box: Monte Carlo Simulation.
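As a small illustration of “record a little, simulate the rest”, here is a hedged Python sketch. Suppose only the mean and standard deviation of daily demand were recorded. The function name, the parameters, and the assumption that days are independent and normally distributed are all mine, not the original author’s. The simulation then estimates how often a month’s total would exceed some capacity.

```python
import random


def estimate_overflow_probability(mean, stdev, days, capacity,
                                  trials=20_000, seed=42):
    """Monte Carlo estimate of P(total demand over `days` exceeds `capacity`).

    Assumes each day is an independent normal draw, which is an
    illustrative simplification, not a claim about real data.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    overflows = 0
    for _ in range(trials):
        total = sum(rng.gauss(mean, stdev) for _ in range(days))
        if total > capacity:
            overflows += 1
    return overflows / trials


# Hypothetical inputs: two recorded numbers instead of every transaction.
p = estimate_overflow_probability(mean=100, stdev=15, days=30, capacity=3000)
print(f"Estimated chance of exceeding capacity: {p:.3f}")
```

With capacity set exactly at the expected total, the estimate should land near one half; moving the capacity up or down shows how quickly the risk changes.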

    Ignore it and it Might go Away/ be unimportant

    That’s right - my favorite technique is to go focus on something else: Like your business model… Some questions just aren’t that important or intriguing. Let me revise that: MOST questions just aren’t that important. Are you asking questions that are central to improving your product or service?

    We live in interesting times for data. If we are careful with the ideas that are now available to us, and the terms we use to describe them, data big and small will start driving decisions that can help everyone.

    If we continue down this road of using “big data” as a crutch for weak business models and terrible “products” … VCs, investors and industry will see our statistically driven ideas as mere tech-bubble snake oil.

    (Source: evilmartini)

    The sum is greater than the parts

    nosql:

    Someone Is Monetizing Big Data and It Is Not for Our Benefit

    Similarly some of the banks have admitted that they will be mining data related to the transactions we perform to understand our buying behavior. This data can then be sold to retailers or e-marketers to generate specific offers that may suit our lifestyle. It may be creepy to get an e-coupon out of the blue on your birthday (or anniversary) from a retailer that you would have shopped with some time back, but it could also have some nice benefits. On top of that, each one of us leaves behind digital tracks when we search or browse through different sites looking for something on the internet. If such data can be tagged to us, it can demonstrate our common interests.

    Call me a privacy freak, but I find this unacceptable. And I have a very hard time understanding what’s in it for us[1].


    1. That’s the mildest way to say that I cannot really imagine any benefits for us.

    Original title and link: Someone Is Monetizing Big Data and It Is Not for Our Benefit (NoSQL database©myNoSQL)

    My concern is that the aggregation of personal data, and its ease of access, will magnify exposure. A lot. Network effect. That sounds vague and foreboding. If so, peruse the article, then have a look at the SocialIntelligence company website. 

    myNoSQL

    nosql:

    Very interesting customer base numbers for Sybase IQ, Vertica, SAND Technology, Infobright published by Curt Monash—most are in the hundreds, except for Sybase IQ.

    This got me thinking about what numbers NoSQL companies would have. Are any of them sharing such numbers? I’d speculate that most of them are in the tens, with 10gen (MongoDB) leading the space with probably a couple of hundred at best.

    Original title and link: Columnar DBMS Vendor Customer Metrics (NoSQL database©myNoSQL)