Google Plenary Session
- Google-scale services will have total machine failure many times a day
- Fancy machines won’t raise MTBF enough to avoid needing fault-tolerant software; once you have fault-tolerant software you can buy tons of cheap hardware and just go for massive redundancy
- ~1000 machines will see your query
- Failures are too frequent to rely on human involvement
- GFS replicates 3x for security - but may be more for load (heavily used chunks are copied more)
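The arithmetic behind these points can be sketched roughly as follows. This is a back-of-the-envelope illustration, not anything presented in the talk: the fleet size, the per-machine MTBF, and the per-replica unavailability are all hypothetical numbers, and the model assumes independent failures at a constant rate (which real data centers violate, e.g. when cooling fails for a whole room).

```python
def expected_failures_per_day(machines: int, mtbf_years: float) -> float:
    """Expected machine failures per day for a fleet.

    Assumes each machine fails independently at a constant rate of
    1 / MTBF. Both inputs are hypothetical, not figures from the talk.
    """
    mtbf_days = mtbf_years * 365.0
    return machines / mtbf_days


def prob_all_replicas_down(p_single: float, replicas: int = 3) -> float:
    """Probability that every replica of a chunk is unavailable at once.

    Assumes replica failures are independent; p_single is the
    (hypothetical) chance that any one replica is down.
    """
    return p_single ** replicas


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 machines, 3-year MTBF each.
    print(round(expected_failures_per_day(10_000, 3.0), 1), "failures/day")
    # Hypothetical: each replica is down 0.1% of the time, 3 copies.
    print(prob_all_replicas_down(0.001, 3))
```

Under these assumptions, doubling the per-machine MTBF only halves the daily failure count, while adding one more replica multiplies the chance of losing access to a chunk by another factor of `p_single`, which is the case for spending on redundancy rather than on fancier machines.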
Data center cooling failure: “When the fire trucks show up you know there’s a problem”
Commodity hardware is the best way to get fault tolerance
“Cheap and nasty works” - disks stacked on motherboards w/acrylic, held together by velcro (“but make sure there’s enough cooling”).
“The gleaming racks from the boom are all gone”
“We treat commodity disks like server disks and pay the price sometimes”
Anecdote about disks which got so overloaded that the ASICs desoldered off the board
Write better software since you need it anyway (the whole discussion keeps coming back to this principle)
When you can unplug an entire rack and not lose any data you’re doing it right
Q&A
Sun engineer asked whether they’re doomed (paraphrasing only slightly)
A: If you can fit everything on one or two machines, or lack the talent to make a non-trivial investment in massive fault-tolerant software, buy high-quality hardware. The server hardware & software community has largely failed to deliver true fault tolerance. Network and other non-trivial equipment is not “cheap” hardware.
Security - does Google have special problems
A: No more than any other large site
Questions about slopped-together machines
A: “It got us through the dot-com days but it has a lot of problems”
Disposal costs
A: Couldn’t comment in detail (IPO restrictions) but “we can’t afford to give anything away”
Return to cheap hardware discussion
A: It gets down to how well your problem can be subdivided. Some will be stuck on big iron forever
Scope of the engineering for Google-level fault tolerance
A: It’s more a mindset than a single project - “a bunch of smart people worked on it”.
Software testing: how does Google test things
A: Testing at every level. Extensive real-time monitoring and partial rollout (e.g. <1% of Google’s traffic is a sizable test case)
Hardware design
A: Yes, the disks are velcroed in
Software deployment
A: Massive home-brew system
Will Google release GFS?
A: Tentative internal discussions
Has Google ever had a massive failure?
A: Yes - all the time (e.g. data center on fire). Continued to hammer the “get fault tolerance right” message.
Can Google release the “Day in Google” map animation?
A: Possibly, yes
How many 9s reliability?
A: IPO - can’t comment
Are quoted MTBFs actually true?
A: I don’t know but we use disks in unexpected ways
FREENIX
Glitz: Hardware Accelerated Image Compositing Using OpenGL
Peter Nilsson and David Reveman, Umeå University
- Part of Cairo (PDF 1.4 vector-based graphic subsystem)
- Significant performance improvements vs. imlib/xrender/etc. - often order-of-magnitude increases
High Performance X Servers in the Kdrive Architecture
Eric Anholt, LinuxFund.org
How Xlib Is Implemented (and What We’re Doing About It)
Jamey Sharp, Portland State University
Sysadmin Guru Session
- Political problems, often requires out-of-band networks
- Least privilege / Kerberos systems
Config management: almost everyone does it, usually cfengine
Ticketing systems: why not tie a ticket number to every config change?
RT very popular - the only system which most users liked
Eventum http://dev.mysql.com/downloads/other/eventum/
Project management systems
Nobody likes them - expensive, bad products for the wrong problem
Documentation
Wikis are very popular
Search is critical
Moving FAQs from tickets to docs
Working with users
Surveys - frequently but make sure they’re open-ended enough. No substitute for simply talking with people.
UseLinux: Data
Linux Genomics
- Extensive collaboration - weblogs, forums, public database, etc.
- Heavy use of OSS: Linux, MySQL, Perl plus a significant amount of scientific software
Thin-client Linux
- Business needs for Cardiologist
- Easy & fast networking
- Shared file access
- Easy expansion: sending techs out is expensive
The case for thin clients
Hardware, support, software, and training costs are all lower
Ongoing costs plummet: in the speaker’s medical practice, per-user cost went from $830/year to $230.
The tools
OpenLDAP, Zabbix monitoring, LTSP
The results
100 users on xSeries 335, ~20Mbps on 100BT