Hadoop SDK and Tutorials for Microsoft .NET Developers

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview, so YMMV, but it looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is particularly interesting: it builds on LINQ, the established way for .NET developers to query most data sources, to deliver the capabilities of Hive, the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on GitHub, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in a lot of depth.

You can read more about the partnership between Hortonworks and Microsoft here, and you can download a preview of HDP for Windows here, or sign up for HDInsight over here. And if you’re hungry for more Hadoop tutorials, grab our own Hortonworks Sandbox here.

The post Hadoop SDK and Tutorials for Microsoft .NET Developers appeared first on Hortonworks.

Hortonworks

Moving Hadoop Beyond Batch with Apache YARN

Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development perspective.  Once ready, our team at Hortonworks will apply our usual enterprise rigor in providing a tested and integrated distribution that includes Hadoop 2.0 along with the other enterprise-focused services our customers and partners require.

In my roles both at Hortonworks and in the open-source Apache Hadoop community, I’m asked a lot of questions regarding the key aspects and motivations behind Hadoop 2.0. Here is some information to sate the curious mind.

First-generation success inspires second-generation focus

In the early days of Hadoop at Yahoo!, we had a very particular objective: store and process very large amounts of data to support our internet search efforts.  And so the first generation of Hadoop was a purpose-built system for web-scale data processing that was embraced by Yahoo! as well as other technology-savvy early adopters such as Facebook.

As usage at Yahoo! began to expand so did the number of ways that users wanted to interact with the data stored in Hadoop. As with any successful open-source project, the broader ecosystem of Hadoop users responded by contributing additional capabilities to the Hadoop community, with some of the most popular examples being Apache Hive for SQL-based querying, Apache Pig for scripted data processing and Apache HBase as a NoSQL database.

These additional open source projects opened the door for a much richer set of applications to be built on top of Hadoop – but they didn’t really address the design limitations inherent in Hadoop; specifically, that it was designed as a single application system with MapReduce at the core (i.e. batch-oriented data processing).

Do we need SQL ON Hadoop or SQL IN Hadoop?

Fast forward to today, and we see that Hadoop’s momentum has continued and many more enterprises (not just web scale companies) want to store ALL incoming data in Hadoop, and then enable their users to interact with it in a host of different ways: batch, interactive, analyzing data streams as they arrive, and more. And most importantly, they need to be able to do all of this simultaneously, without any single application or query consuming all of the resources of the cluster.

Nothing illustrates this dynamic more clearly than the current industry noise around SQL on Hadoop. All kinds of vendors are clamoring to provide better SQL access to data stored in Hadoop, and so they should, since SQL is understood by many users. Since Apache Hive has been the de facto SQL interface to Hadoop data for many years, we’ve found most users would like to continue to leverage the power of Hive in support of these additional interactive SQL use cases.

But building SQL access on top of Hadoop just highlights the challenge of Hadoop being a single-application system. When I run a SQL query on that data, it could consume all the resources of the cluster and cause performance issues for the other applications and jobs running in the cluster, which is not a good outcome, to say the least.

YARN enables SQL IN Hadoop and many more applications

When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop so that it could run multiple applications against relevant data sets, and do so in a way where multiple types of applications can operate efficiently and predictably within the same cluster. This is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single-application system into a multi-application operating system.

Getting back to the SQL ON Hadoop point, with YARN we now have the ability to run SQL IN Hadoop. By being IN Hadoop (built on YARN), SQL access becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed. Why stop at SQL? What about machine learning or modeling? What about processing events (data) as they arrive? Would it not be nice to manage all of these through a common system?

Enter YARN.

By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than through commercial bolt-ons that complicate the environment for customers.

And so that is the trailer for the story for Hadoop 2.0: Unleashing the Power of YARN. Coming soon to a cluster near you, summer of 2013! Stay tuned!

The post Moving Hadoop Beyond Batch with Apache YARN appeared first on Hortonworks.

Hortonworks

E-Commerce Services with PORTICA GmbH Marketing Support

PORTICA GmbH uses the SimpleShow to demonstrate the possibilities of its e-commerce services. The fulfillment service provider PORTICA offers numerous options for Sh…

Welcome to the first eCommerce service provider in the Netherlands. We make your eCommerce success a reality through our close collaboration with our selected p…

SVForum’s Big Data Conference: Business Enterprise

Panel Discussion: Business Enterprise Lenin Gali, Ubisoft Jeffrey Krone, Zettaset, Inc. Yuvaraj Athur Raghuvir, SAP Rohit Valia, Platform Computing, an IBM C…

http://www.patrickschwerdtfeger.com/sbi Strategic Business Insights – this video defines and discusses “Big Data” and its implications for business managemen…

The Evolution Of Hacker News


The idea of a VC having its own news aggregator was a bit outlandish in 2007. But Y Combinator was in an unusual position in those days anyway. Startup accelerators had been a highly visible part of the dot-com crash, and Silicon Valley was still skeptical of the concept nearly a decade later. So YC set out to be something different — a community of hackers building companies on their own terms.

Hacker News was initially built by YC co-founder Paul Graham as a demonstration of Arc, a new programming language he’d been working on. He quickly realized that it could help bring together the companies he was supporting and the rest of the folks who wanted in. With 1.6 million page views and 200,000 unique visitors on a given weekday, it’s now a key part of the venture firm’s success.

The site quickly took off, as former Redditors flocked to it to talk about tech and startups (the site was then known as Startup News).

Having a big audience isn’t really the goal. In comparison, Reddit, Hacker News’ inspiration and the first big YC exit, has seen as much as 4.4 million page views in a given day.

A Community For Ex-Redditors

As Graham explains, once the site started seeing traction, he realized it was more than a way to test Arc. He wanted to make Hacker News a place to recreate the way Reddit felt in the good old days, when most of its community was made up of hackers. As Reddit drew more traffic, the hacker focus of the site shifted. The community’s user base became diluted as it grew, and Hacker News was a new home for some of the early Reddit hackers.

Graham wrote in February 2007:

Reddit used to have a good concentration of startup-related links, but that was because so many of Reddit’s initial users were connected in some way to Y Combinator. Now that Reddit is so much more popular, the top links tend to be images, or videos, or political news.

Another goal of Hacker News, says Graham, was to be a place where founders could share ideas and communicate. In the spirit of Y Combinator’s own incubator, Hacker News was focused on being a community for entrepreneurs and founders in the tech community: a place where they could freely post and where Y Combinator could also get to know potential founders and leaders in the tech world.

“From the beginning we had a real community, and some of the core group of refugees from Reddit are still prominent on Hacker News today,” Graham explains. Part of what attracted many to Hacker News was its simplicity and voting system. The product’s UI, design and color scheme have remained relatively constant over the past six years.

Thomas Ptacek, one of the site’s first users, explains that he was a Slashdot user and then a Reddit user, and flocked to Hacker News (at the time Startup News) because it was more relevant to the technology and startup community. He found Hacker News to be a refreshing change from past forums where the quality of commenting was declining.

Here’s how Hacker News works: Users submit links to stories, and stories are ranked according to a voting system, similar to Reddit. The difference between the two sites lies in the details of that voting system. While you can vote stories up, you cannot vote stories down (but you can flag stories). According to Graham, 100 upvotes will get a story to the top of the front page of the site. You can only downvote a comment if you have enough “karma” on the site, which is another compelling element of Hacker News. Karma is determined by the number of upvotes on a user’s submissions and comments minus the number of downvotes.
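As a rough illustration of the karma rule described above, here is a minimal Python sketch. It is not Hacker News’ actual code (the site is written in Arc): the `Item` type, the function names and the `threshold` value of 500 are all assumptions made up for this example, since the article does not say how much karma is “enough.”

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A story or comment posted by one user (hypothetical type)."""
    upvotes: int
    downvotes: int = 0  # stories cannot be downvoted, so this stays 0 for them


def karma(items):
    """Upvotes on a user's submissions and comments minus the downvotes."""
    return sum(item.upvotes - item.downvotes for item in items)


def can_downvote_comment(user_karma, threshold=500):
    """Only users with enough karma may downvote comments.

    The threshold is an assumed placeholder, not the site's real value.
    """
    return user_karma >= threshold
```

For example, a user with one story at 10 upvotes and one comment at 5 upvotes and 2 downvotes would have a karma of 13 under this sketch.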

In terms of the design, Graham says he wanted Hacker News to look like your list of processes in a terminal window. The look and feel of the site was aimed at hackers themselves who are familiar with tabular data.

Graham will occasionally add new features, some of which are on the backend of the site. For example, as comments get more deeply nested and heated in terms of exchange, the reply link takes longer to appear. There is a purposeful drag implemented on this, says Graham, because deeply nested discussions are rarely interesting.
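The article does not say how the delay is calculated, so the following Python sketch is only one way such a drag could work. The linear formula, the `free_depth` and `minutes_per_level` parameters and both function names are assumptions for illustration, not the site’s real rule.

```python
from datetime import datetime, timedelta


def reply_delay(depth, free_depth=2, minutes_per_level=1.0):
    """Return how long the reply link stays hidden for a comment at `depth`.

    Assumed rule: shallow comments get no delay, and each level of nesting
    beyond `free_depth` adds `minutes_per_level` minutes.
    """
    extra_levels = max(0, depth - free_depth)
    return timedelta(minutes=extra_levels * minutes_per_level)


def can_reply(posted_at, depth, now):
    """True once the comment is old enough for its reply link to appear."""
    return now - posted_at >= reply_delay(depth)
```

Under these assumed values, a top-level comment can be answered immediately, while a comment nested five levels deep would show its reply link only after three minutes.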

Another subtle feature addition: a flame-war detector. Graham has been consistently deploying and updating proprietary software that determines whether there is a flame war, where people argue heatedly. When these flame wars take place (which Graham says can often get ugly and personal), the story in which the commenting is taking place is moved further down the page.

Graham has also created sophisticated spam-detection software, which was just updated with new code six months ago. With the update, Graham says that it’s rare for spam to last on the site for more than 10 minutes. If a user does spam the site or engages in personally vicious behaviors, they run the risk of being banned. But in an interesting twist, called “hellbanning,” the user may not actually know they are banned.
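The idea behind hellbanning can be sketched in a few lines of Python. This is a hedged illustration of the general concept, not Graham’s implementation: the `Comment` type, the `visible_comments` function and the set of banned usernames are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    author: str
    text: str


def visible_comments(comments, viewer, hellbanned):
    """Return the comments a given viewer should see.

    Comments by hellbanned authors are hidden from everyone except the
    author, so a banned user still sees their own posts and gets no
    signal that anything has changed.
    """
    return [
        c for c in comments
        if c.author not in hellbanned or c.author == viewer
    ]
```

With a thread containing one comment from a regular user and one from a banned user, other readers would see only the first, while the banned user would still see both.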

On the backend, Hacker News runs on one core, and Graham calls this a “remarkable feat of scaling.”

In terms of human moderation, Graham himself has been spending three to four hours per day simply moderating the site. And that’s in addition to all of his duties running Y Combinator. While a number of other YC alums have moderating abilities, Graham has been the main human element of the site. “It was becoming my life,” he says. Around six months ago, Graham brought on someone else, whom he chose not to name, to moderate the site. He says the individual is affiliated with Y Combinator and is a “prudent and thoughtful guy,” and has been doing a great job ever since.

Hacker News has a strong affiliation with Y Combinator, as well. Graham explains that founders usually all create a Hacker News account when they apply, and that user name is the founder’s identity at Y Combinator. Hacker News also features a jobs page that shows any jobs available at Y Combinator companies. He adds that this jobs portal is very useful for Y Combinator, as the majority of the site’s audience is made up of programmers and engineers.

There is also an internal page that is only visible to YC founders that has a list of recent stories about YC startups. And if you are a YC founder, your username will show up in orange to other YC founders to enable these entrepreneurs to recognize and meet each other.

Graham says that Hacker News gets a lot of complaints that it has a bias toward featuring stories about Y Combinator startups, but he says there is no such bias. Instead, the culture at the incubator is to use Hacker News, and with more than 1,000 YC alumni who have graduated from the incubator, many of these founders are still active on the news site and post links to their fellow founders’ launches and news.

“It was a small intellectual village and now it is a giant city.”

Growth has its downside. What keeps Graham up at night is worrying about the dilution of the quality of Hacker News. He explains that the site was a community of insiders in the hacker world, and it has gradually been getting diluted. “That is what I spend all my time thinking about,” he says.

He worries that Hacker News will become what he calls “an old crumbling building.”

“The community has been in a perpetual but slow decline because the site is growing,” he says.

Ptacek agrees that the value of Hacker News has changed a bit. “I don’t get a community feel as much, whereas in the beginning it was a small group of people who all know each other,” he says. “It’s less likely now to see the same people from thread to thread.”

One of Graham’s biggest pain points is the “schoolyard quarrels” he finds on the site on a daily basis, and he wishes “users would stop misbehaving.” He cites the example of users organizing voting rings to purposefully vote up stories, which caused Graham to develop additional software to detect this. He adds that more users are trolling under newly created accounts, and are deliberately starting flame wars on the site.

“I wish I could get people to stop posting comments that are stupid or mean,” he says. “It takes only one or two negative comments and a discussion turns into a flame war.”

Graham adds that he gets a lot of vitriol from users personally with accusations of bias or censoring. He clarifies that he, and the other human editor, rarely take links down unless they are dupes. Even with tabloid or gossip stories that surface, Graham will not take them down. Users with high karma points tend to flag these stories, he adds, and they can then be taken down.

“Hacker News makes me sad a lot,” says Graham. “I wish the community would behave the way they did when it was a little village.”

Users are noticing Graham’s frustrations. Ptacek observes that Graham is careful not to tell people what to say or think, but it’s clear that he wants people to treat each other better and that he grows sadder over time.

Could This Be A Business?

While Graham is open about not wanting to be the next Reddit, it’s hard to ignore the fact that Hacker News could be a business. Reddit is reportedly raising cash at a $400 million valuation. While Hacker News has a fraction of the traffic that Reddit does, the smaller site could actually have an impressive valuation as a business without any funding or employees.

Graham himself uses the site as his primary source of news. He’s even found Y Combinator companies through Hacker News. A user in the community posted a link to Watsi, a non-profit that allows people in dire need of medical care to raise money for procedures and health care. Graham noticed Watsi the second time it was posted on Hacker News and thought it was an amazing idea. He cold-called the founders and convinced them to be the first ever YC-backed nonprofit. Graham recently took a board seat at Watsi, his first board position ever.

But Graham is adamant that Hacker News is not a business and would not become a business. There are no ads on the site, and he has no interest in making money from ads. He admits that through the jobs page he indirectly makes money, as he is an investor in Y Combinator companies and will inevitably profit if a company’s hires help its business. Nor would he be interested in selling the site.

While it’s clear that Graham has his frustrations with the community, when he talks about the site’s defining moments, he sounds like he is speaking about his own child. One of his most distinct memories about the site is the day following Steve Jobs’ death, when every story on the front page was about the Apple founder.

“Users did it collectively as a tribute, and I found this a really remarkable way to show the power of a community. I thought this is really a living, breathing thing. It was like a bunch of birds flying through the sky forming themselves as an S.”

“There are really good reasons to engage with Hacker News,” says Ptacek. “There is no better place to stay engaged with the hacker community…At the end of day it is a message board. Having a place where you can reach and talk to groups of people is an important concept.”

As for the future of Hacker News, it’s clear that Graham is focused on maintaining quality and making sure that the community treats each other with respect and kindness. “I hope that most Hacker News readers know that I am doing this for their sake,” he says.


TechCrunch