
Comments

Tedder42

They are geographically constrained: they can only compete for talent in the (very busy) lower SF market.

Shops that are geographically distributed are faring much better -- Google is a perfect example of this, and even Amazon distributes its engineers out of Seattle to a degree.

The tech market runs completely backwards from the normal economy.

BP

I'm not sure how I feel about Netflix using Cassandra. I agree with Ted Dziuba that it's a solution without a problem, for engineers who think they're clever -- but I could be agreeing out of ignorance, as I'll never work on a system that scales to that level.

It just seems to me that everything Cassandra touches has problems and eventually moves away from the platform. Reddit has had endless problems with Cassandra and Twitter moved away from it almost as fast as they adopted it. Even Facebook, who invented it, have dropped it.

I hope this makes the Netflix backend perform better... but it just seems to keep getting worse and worse as they insist upon dabbling in whatever new and fashionable tech interests them in a given week. Me? I want the website from 2007 back. You know, the one that just plain worked.

-BP

Knaldskalle

I don't know anything about anything with regards to the engineering/technical side of things, but I'm with BP in wanting the "old" site back. Lately it's been a buggy experience: search pages are "currently unavailable", clicking search results just refreshes the search page, and clicking links to Netflix from elsewhere leads to the front page (and not the movie linked to).

Jeremy Hanna

So in response to BP and others:
1. Reddit's problem was not with Cassandra; it was with EBS volumes on EC2. They've since moved to ephemeral nodes and it works great. Also, their system isn't 100% Cassandra. They were negatively affected by the AWS outage last month because some of their Postgres machines were using EBS in the US-East region.

2. Twitter has moved away from Cassandra? No, no it hasn't. They had originally planned to move their entire tweet store to Cassandra, but decided the engineering resources would be better spent on newer projects like the Rainbird realtime analytics system, among other things. See http://techcrunch.com/2011/02/04/twitter-rainbird/

3. Facebook has never used open-source Cassandra. They threw Cassandra over the wall into open-source land in 2008; they haven't contributed to it beyond that, nor have they used any updates to Apache Cassandra. Their old, very special-purpose Cassandra still powers inbox search for now, on a cluster of 150-200 machines. They chose a different route for their new messaging system, in part because they have a lot of in-house Hadoop/HBase developers and a lot of ex-Googlers (Hadoop and HBase are based on Google's designs). Again, they have very little experience with Apache Cassandra, if any. So to say they were moving away from Apache Cassandra, even though they "tested" with it, is disingenuous.

3. Any instability with the current Netflix streaming doesn't necessarily have anything to do with Cassandra.

4. Speaking of stability, Netflix weathered the AWS US-East outage just fine, in part because of Cassandra's replication and its use of EC2 ephemeral nodes. So in that case Cassandra was a help, not a liability -- there's a toy sketch of the replication idea just below.
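
To make the replication point concrete, here's a toy sketch of the idea in plain Python. The zone names, keys, and settings are all made up for illustration -- this is the general quorum-replication concept, not Netflix's actual configuration:

```python
# Toy model: replication factor 3 with quorum reads/writes means
# losing one availability zone (or one ephemeral node) costs nothing.
REPLICATION_FACTOR = 3
QUORUM = REPLICATION_FACTOR // 2 + 1  # 2 of 3 replicas must answer

class Replica:
    def __init__(self, zone):
        self.zone = zone
        self.up = True
        self.data = {}

replicas = [Replica(z) for z in ("us-east-1a", "us-east-1b", "us-east-1c")]

def quorum_write(key, value):
    # A write succeeds as long as a quorum of replicas acknowledges it.
    live = [r for r in replicas if r.up]
    for r in live:
        r.data[key] = value
    return len(live) >= QUORUM

def quorum_read(key):
    # A read succeeds as long as a quorum of replicas can answer.
    answers = [r.data.get(key) for r in replicas if r.up]
    if len(answers) < QUORUM:
        raise RuntimeError("too few live replicas for a quorum read")
    return answers[0]

assert quorum_write("user:123:queue", ["Luther"])
replicas[0].up = False  # a whole availability zone dies
assert quorum_read("user:123:queue") == ["Luther"]  # data is still there
```

The design point is that no single node or zone is special: any quorum of replicas can serve reads and writes, which is why losing ephemeral disks is survivable.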

Please don't sow doubt about a technology you don't know the complete story on. We're using it where I work, and many other companies are using it. It works great.

BP

Oh great, I say I won't post here anymore and then you come along.

OK, let's play.

(1) http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html - Cassandra is one of the parties responsible for reddit's constant downtime. Everyone knows this. Reddit admits it. Nowhere in my post did I say Cassandra was the sole reason reddit went down, but you're attempting to give the impression that it wasn't responsible at all. Who's trying to push an agenda here again?

(2) http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html - They were originally going to use Cassandra to store tweets as well, but decided not to. This remains unchanged, AFAIK. When you refuse to change over the software that powers the core of your service, and instead relegate it to the most useless aspects of your backend, that sends an extremely strong negative signal.

(3) LOL. That's the stupidest fucking response I have ever read. "Facebook has never used open-source Cassandra" - fucking so? http://www.quora.com/Is-Cassandra-still-powering-inbox-at-Facebook - Facebook finds Cassandra lacking for their new inbox/e-mail/IM system, so they are moving to HBase. Seriously, though, I'm not sure "Facebook uses (x) to perform (y)" is all that smart an argument, considering the massive amount of problems the service has.

(3) YOU HAVE TWO THREES! OH FUCK, THE SYSTEM HAS FAILED! I never said it did. Learn to read. I said relying on new, unproven, fashionable tech is crap when you are running an uptime- and response-time-critical application. Cassandra can't handle either at the large volume it was specifically designed for.

(4) You're being disingenuous again. Netflix weathered the AWS US-East outage because their fucking nodes weren't affected. They said so in their blog post on the subject. Fucking really, you're going to lie to make your point? Cassandra had nothing to do with their uptime; it had everything to do with the way Netflix manages their backend at the server level.

Perhaps you should include in your post a disclaimer that you work for a consultancy firm that has a strong self-interest in pimping out Cassandra. "Where I work" includes your clients, and you certainly wouldn't want to be shit-talking the software you're selling them support for, now would you?

Don't be disingenuous, punk. It works great because you're paid to say it works great.

I bet you fucking love Ruby on Rails.

-BP

BP

Man, I'd really kill for Roy to come in here and set the story straight, regardless of who is right or wrong.

Someone stand in front of a mirror and say his name three times fast. Remember to have a laptop ready with this page open so he can respond after gutting you with a meat hook.

-BP

BP

Ah crap, the syntax of that sentence sucked. It looks like I'm saying Netflix weren't affected by the outage at all, when in reality I'm trying to say that the way they engineered their backend is what saved their asses. If they were using Cassandra merely for its performance gains then they would have been screwed, but because they're Smart Peoples (tm) who do stuff like this (http://www.quora.com/Did-Netflix-suffer-any-disruption-during-the-recent-AWS-outage) they weren't taken down like so many AWS-dependent websites were. Perhaps that's because Netflix are using Cassandra correctly whereas everyone else isn't, but I'd also wager that the engineers at Netflix are smart enough not to rely purely on Cassandra for critical applications because they know that it's unreliable.

I really don't get the whole NoSQL thing. It's fashionable, but seems extremely limited in actual applicability. It's an incompetently over-engineered solution to a problem that doesn't exist. Google don't even use it for their most important service (AdSense), and they bloody pioneered the entire concept.

Of course, I could be wrong and, most likely, I am. I'm not an engineer. I don't understand the systems the same way you likely do. What I do have a keen spider sense for is bullshit. Had you come along and simply said "NoSQL is in its infancy; it has problems, but it is great for (x)" then there wouldn't be a problem. Instead you come in and start arguing technicalities.

TIL NBA Western Conference Finals Referees Also Moonlight as Software Developers. For reals.

-BP

Jeremy Hanna

BP:
re: reddit - that post was about their original move to cassandra. they tried to replace their cache with a four-node cassandra cluster. it was bad planning, but they were also bitten by the fact that bootstrapping was a pain in cassandra 0.6; that particular issue was addressed in 0.7. The incident I thought you were referring to happened about two months ago - they blogged about it, but it was essentially that latency within amazon's network for ebs instances was variable, which is a bad fit for cassandra, so they moved to ephemeral disks. (the toy sketch below shows why variable latency hurts so much.)
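
To see why variable latency specifically is so painful, here's a toy illustration with invented numbers (not reddit's actual measurements): a quorum operation waits on the 2nd-fastest of 3 replicas, so even occasional slow disks blow up the tail latency of whole requests:

```python
# Toy simulation: steady disks vs. disks that occasionally stall.
import random

def disk_latency_ms(variable):
    # "ephemeral-like": consistently ~5 ms.
    # "EBS-like": usually ~5 ms, but ~1 request in 10 stalls for 200 ms.
    if variable and random.random() < 0.10:
        return 200.0
    return random.uniform(4.0, 6.0)

def quorum_latency_ms(variable, replicas=3, quorum=2):
    # A quorum request completes when the `quorum`-th fastest replica answers.
    return sorted(disk_latency_ms(variable) for _ in range(replicas))[quorum - 1]

for label, variable in (("steady disks  ", False), ("variable disks", True)):
    samples = sorted(quorum_latency_ms(variable) for _ in range(10000))
    print(label, "p99 =", round(samples[int(0.99 * len(samples))], 1), "ms")
# steady disks stay around 6 ms at p99; variable disks balloon to ~200 ms
# at p99, even though the *median* disk is exactly as fast.
```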

re: twitter - I know several members of the cassandra team there. the decision not to move their tweet store was a matter of resource allocation. they still use it heavily and have several clusters. I was in the huge packed room at Strata in February when Kevin Weil talked about Rainbird and what it does for them. It sure didn't sound like some toy project to make the research guys happy.

re: facebook. they just open sourced it and that's it. the community picked it up, dusted it off, and have done a lot of great things with it.

I'm not sure where you get the impression that it's unreliable. I don't think reddit would still be using it a year later if they thought it was crap and could use something else. I don't think twitter would have devoted so many resources to it if they thought it was as unreliable as you say. I don't think netflix would stay operational very long if they relied on something that only sort of worked. Face it: it's had its warts over the last year, but it solves some pretty specific problems. It sounds like you're just trying to play the NoSQL-is-nonsense card, but the dollars invested in it by all of these companies contradict that. I know there is a lot of marketing hype about a lot of tech these days, but when you actually need this kind of system (not just because it's cool), it works very well.

Jeremy Hanna

And no, I don't get paid to say nice things about Cassandra as you implied. I work for The Dachis Group in Austin, Texas.

Also, Netflix was safe in AWS at least partially because Cassandra is able to run on ephemeral nodes: if a node goes down, which occasionally happens in EC2, data is not lost.

I wasn't trying to attack you. I just get tired of people drawing inaccurate conclusions from parts of various stories.

BP

Thank you, Jeremy. That's a fantastic reply.

I don't mean to be playing the "NoSQL is nonsense" card. If there's a use for a technology -- and in this case it's obvious more mature technologies weren't cutting it in data- and IO-heavy applications like social networking -- then I firmly believe new technologies will be created and, eventually, mature. I can see how my posts would definitely give off that impression, and I'm sorry I didn't clarify or research my points better. Thanks for being the bigger man and stopping this snowball before it got any bigger.

That said, and please stick with me here: Cassandra powers, at least in part, a lot of services that are having massive problems with reliability. Most of these services are run from other people's managed clouds (Twitter being the exception, I believe?). Again, most of these services rarely, if ever, communicate their issues to their users - reddit being a glorious exception. It's fantastic that you were able to sit in on the talk at Strata, but for a large chunk of the rest of us, downtime is rarely explained at length, and even when it is, it's light on detail. That leads to some fairly rampant speculation, of which I'm guilty.

Combine all of the above with the fact that, when downtime is explained, it's almost always the result of administrator error: "We misconfigured our servers" or "We weren't monitoring for errors" or the like. I realize that as an early adopter of software that's being field-tested you are always going to run into problems, which unfortunately can give the impression that it's the software that is unreliable rather than the admins who are incompetent, lazy, forgetful, or just plain doing the best they can with what they've got. And that sucks, because it affects everyone involved.

I'm not trying to convince you of my viewpoint, or even explain away the errors in my arguments -- merely to explain why I thought the way I did. You've got to admit, Web 2.0 has an extremely bad reputation for (1) extensive downtime, (2) massive error rates, and (3) not interacting with its users, at any length, when both of the above eventually happen. It's unfair to solely blame the software, but a lot of rather brilliant people were positing that it was in fact the software, and an especially poor implementation of it, that was the problem.

It would appear that those opinions could very well be every bit as outdated as the facts surrounding Cassandra's previous faults.

I still maintain that Cassandra could eventually go the very same route as Ruby. Again, I'm not an engineer; I don't understand the levels at which this software is meant to work, and if I tried, my head might explode. That said, some extremely smart people have made arguments against a large portion of what Cassandra is supposed to do, and knowing how fad-heavy a lot of the younger Web 2.0 crowd can be, it's easy to believe it's being used for purposes it's ill-suited for, if it's really suited for any. I couldn't claim to know. It's hard to wade through what is mere tech evangelizing and what is just a turf pissing match, though I thought, perhaps wrongly, that I did a good job of separating it all.

What I can claim to believe, though empirical evidence is so often flawed, is that Reddit is less reliable than Twitter, which is less reliable than Facebook, which is less reliable than even the busiest of traditionally hosted (e.g. non-distributed, non-cloud-based) sites, like imgur. That's an unfair comparison, but Netflix now is far less reliable than Netflix circa 2007, and aside from a massive jump in subscribers, they've also substantially changed the way they host and serve data. It's easy, although most likely unfair, to blame the software and its implementation. You make a convincing argument that Netflix might have far more downtime without their current setup, but the amount of errors people have been having over the past few months makes a far more convincing argument that they're doing something very wrong. Again, maybe it's just an issue of scale. You'd know better than I would.

Sorry for the long post. Thanks again, and very much, for your response. You give me hope that a lot of time isn't being wasted reinventing the wheel, again, and after reading about Cassandra's progress in 2010, I'm thinking it might pay off.

-BP

Jeremy Hanna

BP - sorry if I gave a negative impression. Also, when I said "disingenuous", I was referring to facebook's explanation of how they reviewed different solutions. I don't know, but I don't think that decision had that much to do with technology.

I know that all of these bleeding-edge technologies have their rough edges, but they're all progressing. They fulfill very specific needs for large data systems - in the case of cassandra and hbase, at least. As with anything new, they do get used and misused for things they aren't a good fit for.

I know what you mean about reliability - I think all of these different companies are pushing limits. I've got a roku and have seen the streaming errors recently and hope they can be fixed as well. I'm pretty impressed by what they're able to do though, despite that. I'm sure it will get better.

Royrapoport

@BP, I'm flattered. Sadly, I fear your faith in me is misplaced.

Let's get the disclaimer out of the way and note that I don't speak for my employer (now, or ever on hackingnetflix).

I don't want to address too specifically what Netflix is doing with Cassandra, partially out of my own ignorance, partially because I try to not disclose things that we may prefer to not disclose, and partially because that's not particularly interesting to me. I do want to address some other points, however.

Firstly, reference: Ted Dziuba's NoSQL rant

Ted's rant is certainly entertaining, but I don't find it particularly compelling. Some specific points I take issue with are:


  • It's about a year old, and Cassandra 0.7 addresses his concern about downtime for column family definition changes (0.7 is stable, BTW; there's a quick sketch of a live schema change just after this list);
  • His condescending message to DBAs ("In the meantime, DBAs should not be worried, because any company that has the resources to hire a DBA likely has decision makers who understand business reality.") is entertaining, but misplaced. I've talked to our DBAs. They're excited to work with Cassandra (and, in fact, we've repurposed some DBAs away from Oracle and onto Cassandra. This was an entirely voluntary effort on the part of the DBAs involved);
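
For what it's worth, here's roughly what a live schema change looks like against a 0.7-era cluster. This is a sketch assuming the pycassa Python client and a node on localhost; the keyspace and column family names are invented for illustration, not taken from our actual schema:

```python
# Sketch: online schema changes against a running Cassandra 0.7+ cluster,
# using the pycassa client. Names below are made up for illustration.
from pycassa.system_manager import SystemManager, UTF8_TYPE

sys_mgr = SystemManager('localhost:9160')

# Add a brand-new column family to a live keyspace -- no cluster restart,
# no downtime for the column families already being served.
sys_mgr.create_column_family('MyKeyspace', 'ViewingHistory',
                             comparator_type=UTF8_TYPE)

# Changes to an existing column family's definition are likewise online.
sys_mgr.alter_column_family('MyKeyspace', 'ViewingHistory',
                            comment='changed without taking anything down')
sys_mgr.close()
```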

I want to make a different case, though, and it's in reference to two comments y'all made earlier:


  • NoSQL is in its infancy; and
  • Man, Netflix sure seemed a lot more stable back in 2007.

I think a lot of what we're doing -- and that's a pretty global definition of "we," which includes not just Netflix, but also Twitter, Facebook, and others with explosive growth -- involves having to either manage technology that is in its infancy or build new technology and capabilities (and if you're building something new, by definition it is in its infancy). And I think that'll inevitably cause you problems.

A slight detour:

Back in 1997, I took a job working as a senior unix/network engineer for Macromedia. Macromedia at the time was getting more and more web traffic against its (static) website. We were hosting the shockwave and flash players, and they were becoming very popular.

In 1998, the Macromedia website was hosted on five Sun Ultra Enterprise 2s (UE2s). These were moderately priced (IIRC, $20-$40K) servers, and the site ran so well on them that when we had a load balancer failure that left four of the five systems out of rotation, we found out only because some people reported the site was running "kinda slow." The site used very simple technologies (mostly Apache and static HTML files) and was rock solid, bullet proof, and basically hard to kill.

Around 1998, I got a new CIO (my boss^2, if I recall correctly). His name was Stephen Elop, and you may know him better as the current CEO of Nokia. Stephen went on a full-blown evangelizing campaign built around the assertion that our static site sucked, and that what we really needed was a fully dynamic, personalized website. 13 years later this seems obvious, but at the time it was a considerably interesting and fresh perspective.

Macromedia's new, personalized, website (engineered with the help of a big consultancy) was just choke-full of potential. Emphasis on "choke." Our personalization software (Broadvision) was buggy; the vendor recommended we reboot our servers "as often as the business will tolerate;" and the hardware requirements to run this thing were staggering. We went from five UE2s running the site to a lovely three-tier model with Sun Enterprise 4500s in the app and db roles. These were heinously expensive -- on the order of around $100K or so, not fully configured -- and when we inevitably had to pump them full of RAM (they were awesome because they could support up to 14GB of RAM (!!)), we paid through the nose. I seem to recall we were paying about $7000/GB to fill these machines up, and we were running them all full of RAM.

And the site's reliability and uptime suffered horribly. Which, as one of the people responsible for the site being up, made me sad.

You could question Stephen's timing, somewhat; you could certainly question his choice of personalization platform. But you cannot, even with the benefit of hindsight, question his advocacy for us having a dynamic site.

Features and capabilities are complications. This page about Watch Complications is really interesting to me, and I really like its statement that "The more complications in a watch, the more difficult it is to design, create, assemble, and repair." That's true not just for watches.

The truth is that the more complicated what you do is, the more you offer your customer, the harder it is for you to support what you've got. And the earlier you do this, the higher the pioneer tax you pay.

One of the things I love about working here is that Netflix is so committed to increasing the scope of what it offers its customers. I've been a customer since around 2001, with a break around 2002, and then solidly since 2003. In that time, I've gone from getting DVDs, to DVDs plus Blu-ray, to being able to watch movies on my PC, then my Mac; and now today, at home, I've got a Wii, PS3, Xbox, AppleTV, a Samsung BD player, and two Android phones which someday soon will be able to stream Netflix. We've gotten tons more content, and better at offering it to people. We've expanded the device UI from "here's your queue, and god help you if you want anything outside of it" to "you'll never have to log into the website again" (well, almost).

We've been adding complications (in the horology sense) all over the place. At the same time, Netflix has a breathtakingly courageous approach to risk and to trailblazing (one of the reasons why we chose to go to the Cloud -- in some respects before it was really ready for us). "Risk-averse" isn't in our dictionary. We try to take smart risks, but it's better to fall on one's face than sit on one's butt. In short, I've never before worked at a company that was more brash than me :).

This combination of tendencies -- expand capabilities, adopt early -- can (and does) hurt uptime. And I've got to tell you -- that sucks. I hate it when customers can't stream. I hate it when family members call me to report a problem with the site. I hate it when I sit down (as I did Sunday evening) to watch the episode of Luther I was in the middle of earlier ... and fail.

It'll get better. I really do believe that. The 2007 website, in the context of what we can do today, sucks as a full-featured online offering. Nobody wants it back. But as time goes on and this stuff gets more established, I do believe our reliability (which, mind you, isn't necessarily terrible -- I suspect we may be doing better than Twitter, but I don't have any numbers to compare) will improve. Part of my job these days is to actually help make this better: getting us aware of issues faster; helping us understand, when an issue is ongoing, where it's originating; and giving people vastly smarter than I am the tools they need to dig in and fix their stuff quickly, so you, and my sister, and my spouse at home see some moving pictures the next time you hit 'play' on your choice of NRD.

I've gone and written a post of practically Jollyesque length. Sorry about that -- this is interesting stuff to me. If you've actually managed to read to the end of this, I hope it's not been a waste of time. If not ... that's OK, I'm used to TL;DR.
