
Wednesday, October 14, 2015

Data as a Service: JBoss Data Virtualization and Hadoop powering your Big Data solutions

Guest blog by Syed Rasheed, Senior Product Marketing Manager
Twitter @Junooni, eMail [email protected]


Red Hat and Cloudera have announced the formation of a strategic alliance. From a JBoss perspective, the key objective of the alliance is to leverage big data enterprise-wide and not let Hadoop become another data silo. Cloudera combined with Red Hat JBoss Data Virtualization integrates Hadoop with existing information sources, including data warehouses, SQL and NoSQL databases, enterprise and cloud applications, and flat and XML files. The solution creates business-friendly, reusable, virtual data models with unified views by combining and transforming data from multiple sources, including Hadoop. The result is integrated data available on demand to external applications through standard SQL and web services interfaces.
The reality at the vast majority of organizations is that data is spread across too many applications and systems. Most organizations don’t know what they’ve lost because their data is fragmented across the organization. This problem does not go away just because an organization adopts a big data technology like Hadoop; in fact, it gets more complicated. Some organizations try to solve the problem by hard-coding access to their data stores. This simple approach does break down silos, but it does so inefficiently and brings lock-in with it. Lock-in makes applications less portable, a key metric for future-proofing IT. It also impedes organizational agility, because hard-coding data store access is time consuming, makes IT more complex, and incurs technical debt. Successful businesses need to break down data silos and make data accessible to all applications and stakeholders (often a requirement for real-time contextual services).
A much better approach to solving this problem is abstraction through data virtualization. It is a powerful tool, well suited to the loose coupling approach prescribed by the Modern Enterprise Model. Data virtualization helps applications retrieve and manipulate data without needing to know the technical details of each data store. Once implemented, organizational data can be accessed through a simple REST API or a familiar SQL interface, as the sketch below illustrates.
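As a rough illustration of that SQL-style access, here is a minimal Java sketch using the Teiid JDBC driver that underpins JBoss Data Virtualization. The VDB name (SalesVDB), view name (CustomerView), credentials, and port are hypothetical placeholders, not values from any shipped example.

```java
// Hypothetical sketch: query a virtual view through the Teiid JDBC driver.
// Assumes the Teiid JDBC client jar is on the classpath and a VDB named
// "SalesVDB" is deployed on a local JBoss Data Virtualization server.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataVirtQuery {
    public static void main(String[] args) throws Exception {
        // Teiid JDBC URL format: jdbc:teiid:<vdb-name>@mm://<host>:<port>
        String url = "jdbc:teiid:SalesVDB@mm://localhost:31000";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // "CustomerView" stands in for any virtual view in the VDB.
             ResultSet rs = stmt.executeQuery("SELECT name, region FROM CustomerView")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " / " + rs.getString("region"));
            }
        }
    }
}
```

The application never learns whether CustomerView is backed by Hadoop, MySQL, or a flat file; that mapping lives entirely in the virtual data model.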
Data Virtualization (or an abstracted Data as a Service) plugs into the Modern Enterprise Platform as a higher-order layer, offering the following advantages:
  • Better business decisions due to organization wide accessibility of all data
  • Higher organizational agility
  • Loosely coupled services making future proofing easier
  • Lower cost
Data virtualization is therefore a critical part of the big data solution. It facilitates and improves the use of big data in the enterprise by:
  • Abstracting big data into relational-like views
  • Integrating it with existing enterprise sources
  • Adding real-time query capabilities to big data
  • Providing full support for standards-based interfaces like REST and OData, in addition to JDBC and ODBC (see the sketch after this list)
  • Adding security and governance to the big data infrastructure
  • Flattening data silos through a unified data layer
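For the REST/OData side of that interface support, here is a minimal sketch of reading the same hypothetical view over HTTP. The endpoint path is illustrative; the real path depends on the VDB and model names in your deployment.

```java
// Hypothetical sketch: read a virtual view over an OData/REST endpoint.
// The URL below is a placeholder; adjust host, VDB, and view names to match
// the actual deployment.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DataVirtRest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/odata/SalesVDB/CustomerView?$format=json"))
                .header("Accept", "application/json")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // rows from the virtual view as JSON
    }
}
```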
Want to learn more, download, and get started with JBoss Data Virtualization? Visit http://www.jboss.org/products/datavirt
Data Virtualization by Example: https://github.com/datavirtualizationbyexample
Interested in the community version? Visit http://teiid.jboss.org/


Tuesday, September 29, 2015

Red Hat and the Strata+Hadoop World Conference

This week I am at the Strata + Hadoop World conference in New York City, presented by O'Reilly and Cloudera.  Red Hat has a booth, and product highlights include JBoss Data Virtualization, Red Hat Storage, and Red Hat Enterprise Linux.  Take a look at some of the information being highlighted below:


The theme of the conference is Make Data Work.  The Keynotes will be streamed live during the conference on Wednesday, September 30 and Thursday, October 1.  The live stream schedule is below:

Wednesday, September 30

8:45am - 8:50am: Roger Magoulas @rogerm (O'Reilly Media), Doug Cutting @cutting (Cloudera), Alistair Croll @acroll (Solve For Interesting)
8:50am - 9:05am: Mike Olson @mikeolson (Cloudera)
9:05am - 9:15am: AnnMarie Thomas @amptMN (School of Engineering and Schulze School of Entrepreneurship, University of St. Thomas)
9:15am - 9:25am: Joseph Sirosh (Microsoft)
9:25am - 9:30am: Ron Kasabian (Intel), Michael Draugelis @mdraugelis (Penn Medicine)
9:30am - 9:35am: Tim Howes @howes28 (ClearStory Data)
9:35am - 9:40am: Jim McHugh (Cisco)
9:40am - 9:50am: Joy Johnson @joyjohnson (AudioCommon)
9:50am - 10:00am: David Boyle @beglen (BBC Worldwide)
10:00am - 10:10am: DJ Patil @dpatil (White House Office of Science and Technology Policy)
10:10am - 10:25am: Katherine Milkman @Katy_Milkman (Wharton School at the University of Pennsylvania)
10:25am - 10:30am: Ben Lorica @bigdata (O'Reilly Media)
10:30am - 10:45am: Jeff Jonas @jeffjonas (IBM)

Thursday, October 1

8:45am - 8:50am: Roger Magoulas @rogerm (O'Reilly Media), Doug Cutting @cutting (Cloudera), Alistair Croll @acroll (Solve For Interesting)
8:50am - 9:00am: Doug Wolfe (CIA)
9:00am - 9:10am: Daniel Goroff @DGoroff (Alfred P. Sloan Foundation)
9:10am - 9:20am: Jack Norris @Norrisjack (MapR Technologies)
9:20am - 9:25am: Ben Werther @bwerther (Platfora)
9:25am - 9:30am: Paul Kent @hornpolish (SAS)
9:30am - 9:45am: Farrah Bostic @farrahbostic (The Difference Engine)
9:45am - 9:50am: Shivakumar Vaithyanathan (IBM)
9:50am - 10:00am: Jake Porway (DataKind)
10:00am - 10:20am: Maciej Ceglowski @pinboard (Pinboard.in)
10:20am - 10:40am: Maria Konnikova (The New Yorker | Mastermind)
10:40am - 10:45am: Roger Magoulas @rogerm (O'Reilly Media), Doug Cutting @cutting (Cloudera), Alistair Croll @acroll (Solve For Interesting)

Tuesday, July 28, 2015

What is the Hadoop Ecosystem?


In some of our articles and demos we have examples of JBoss Data Virtualization (Teiid) using Hadoop as a data source through Hive.  When creating data virtualization examples with Hadoop environments such as Hortonworks Data Platform, Cloudera QuickStart, etc., a lot of open source projects are included.  I wanted to highlight some of those so that you have an overview of the Hadoop ecosystem.  You can find the information below, as well as details on more projects in the ecosystem, in the Hadoop Ecosystem Table.

MapReduce - MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's MapReduce: Simplified Data Processing on Large Clusters paper. The current Apache MapReduce version is built on the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". It is a framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation: unlike the original Apache Hadoop MapReduce (also called MR1), YARN can run applications that do not follow the MapReduce model. Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
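The canonical illustration of the model is word count: the mapper emits a (word, 1) pair per token, and the reducer sums the counts per word. A minimal version against the MapReduce Java API (input and output paths come from the command line):

```java
// Canonical MapReduce example: count word occurrences across an input path.
// Map emits (word, 1); Reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // one count per occurrence
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```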
HDFS - The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
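A minimal sketch of the HDFS Java API, writing a file and reading it back; the NameNode address is a hypothetical placeholder:

```java
// Minimal HDFS sketch: write a file and read it back.
// The fs.defaultFS value is a placeholder; point it at your NameNode.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020"); // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
            out.writeBytes("hello from hdfs\n");
        }
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```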
HBase - Inspired by Google BigTable. A non-relational distributed database providing random, real-time read/write operations on column-oriented, very large tables (BDDB: Big Data Data Base). It's the Hadoop database, and it commonly serves as the backing store for the output of MapReduce jobs.
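A minimal sketch of a put/get round trip with the HBase Java client; the table "users" and column family "info" are hypothetical and assumed to exist already:

```java
// Minimal HBase sketch: write one cell, then read it back.
// Assumes a table "users" with column family "info" already exists, and that
// hbase-site.xml on the classpath points at the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "row1", column info:name.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);
            // Read it back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```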
Hive - Data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides a SQL-like language (not SQL-92 compliant): HiveQL.
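A minimal sketch of issuing HiveQL from Java over JDBC, assuming a HiveServer2 instance on the default port and a hypothetical table "pageviews":

```java
// Minimal Hive sketch: run a HiveQL aggregation over JDBC.
// Assumes the Hive JDBC driver is on the classpath and HiveServer2 is
// reachable at localhost:10000; "pageviews" is a hypothetical table.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + ": " + rs.getLong("hits"));
            }
        }
    }
}
```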
Pig - Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce. Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
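To make the data-flow style concrete, here is a minimal word count expressed as a Pig Latin flow and driven from Java through PigServer; the paths and local execution mode are illustrative:

```java
// Minimal Pig sketch: a word-count data flow in Pig Latin, driven from Java.
// Each statement describes one step in the flow; there are no loops or branches.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // use MAPREDUCE on a cluster
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcount-out"); // compiles the flow into MapReduce job(s)
    }
}
```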
Zookeeper - A coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate clusters and provide highly available distributed services; perhaps the most famous of those are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component, implemented in Java, that runs on an ensemble of dedicated servers. ZooKeeper simplifies building distributed systems, making development more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service that gained wide adoption within its data centers. ZooKeeper, not surprisingly, is a close clone of Chubby, designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
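A minimal sketch of the ZooKeeper Java API: connect to a (hypothetical) local ensemble and register an ephemeral znode, the primitive behind patterns like leader election and service discovery. The parent path /workers is assumed to exist.

```java
// Minimal ZooKeeper sketch: connect, then create an ephemeral sequential znode.
// localhost:2181 and /workers are placeholders; a real application would also
// handle session events rather than only waiting for the initial connect.
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // An ephemeral znode disappears automatically when the session ends,
        // which is what makes it useful for leader election and discovery.
        String path = zk.create("/workers/worker-", "host1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered at " + path);
        zk.close();
    }
}
```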
Mahout - A machine learning and math library built on top of MapReduce.

Also, you can visit the Big Data Insights Page to learn more about Red Hat Products in relation to the Hadoop Ecosystem.


Wednesday, October 15, 2014

Beyond Big Data Webinar Series

We are starting the Data Integration Webinar Series, which includes 5 webinars that are described below.  Register at https://vts.inxpo.com/scripts/Server.nxp?LASCmd=AI:4;F:APIUTILS!51004&PageID=657CAFB4-296B-4008-B864-F38CDE62F376

Data Integration Webinar Series

Information is power. It gives your organization the competitive edge it needs to win. But with so much data and so little time, how do you leverage all this data to actually yield valuable business insights?

Big data solutions are a great start. But what about being able to easily access the right data, at the right time?

Watch the webinar series to learn how to overcome the challenges of big data, data bottlenecks, and data integration. Topics in this 5-part series are:
  • Implementing a data strategy and architecture to avoid data problems
  • Delivering data-rich user experiences with high performance and scalability
  • How to quickly and easily create a virtual data services layer
  • Analytics based decision making
  • Avoiding Hadoop data silos

The 3 big problems with data and how to avoid them

Date: November 5, 2014
Time: 11:00 a.m. – 12:00 p.m. EST

No matter what your organization looks like, chances are you're wrestling with at least one of the following data challenges:
  • Data silos that are difficult to access when needed
  • Point-to-point integration that simply does not scale
  • Data sprawl leading to security and compliance risks
Join this webinar to learn how to implement a data strategy and architecture to avoid these problems.

Speakers:
Syed Rasheed, senior product marketing manager, Red Hat
Ken Johnson, director of product management, Red Hat

Slow data is a fast way to lose your best customers

Date: November 12, 2014
Time: 11:00 a.m. – 12:00 p.m. EST

Worried about big data? You should be terrified by “slow data.” When customers are forced to experience slow or inaccurate service they will run, not walk, away from your organization. Meeting real-time customer expectations while juggling huge volumes of data is a challenge. Learn how your organization can successfully provide stellar data-rich user experiences while maintaining high enterprise performance and scalability.

Speaker:
Vamsi Chemitiganti, chief solution architect, Red Hat

Integration intervention: get your apps and data up to speed

Date: November 19, 2014
Time: 11:00 a.m. – 12:00 p.m. EST

The most efficient and agile applications and services can be dragged down by the point-to-point data connections of a traditional data integration stack. Virtualized data services eliminate this friction and speed up your applications.

Join this webinar to see how to quickly and easily create a virtual data services layer to plug data into your SOA infrastructure, allowing your entire solution to operate with agility and efficiency.

Speakers:
Syed Rasheed, senior product marketing manager, Red Hat
Kenny Peeples, JBoss technology evangelist, Red Hat

Making good decisions? Want to? Data analytics is the key

Date: December 2, 2014
Time: 11:00 a.m. – 12:00 p.m. EST

You've turned big data into information, but how can you use it to make better decisions and respond faster to customer needs? How do you turn big data into smart data?

Join this webinar and learn how to:
  • Apply advanced business rules to virtualized data services.
  • Identify and act upon important information that may otherwise be lost in a sea of big data.
Speakers:
Kim Palko, senior product manager, Red Hat
Prakash Aradhya, senior product manager, Red Hat
Kenny Peeples, JBoss technology evangelist, Red Hat

Don't let Hadoop become a new data silo

Date: December 9, 2014
Time: 11:00 a.m. – 12:00 p.m. EST

Organizations today no longer suffer from a lack of data. They suffer from a lack of the right data, at the right time. Barriers such as having data spread across too many applications and systems do not go away just because an organization is using big data technology. In fact, they get more complicated.

In this webinar you'll learn how to:
  • Make calling data from Hadoop as easy as any SQL data source.
  • Seamlessly combine data from Hadoop-based systems with existing data silos.
  • Deliver truly unified and actionable information to maximize return on data assets.
Speakers:
Syed Rasheed, senior product marketing manager, Red Hat
Kenny Peeples, JBoss technology evangelist, Red Hat

Thursday, September 11, 2014

Evolving your data into a strategic asset with Hortonworks and JBoss

This week we had the Discover Red Hat and Apache Hadoop for the Modern Data Architecture – Part 2 webinar.  I wanted to share the overview along with the demo videos.  The replay of the webinar is here.  The full slide deck is here.  The Hortonworks and Red Hat partner landing page is here.  The source code, supporting files, and how-to guide are here.  The part 2 webinar contains 2 use cases that utilize the Hortonworks Data Platform and JBoss Data Virtualization.



Use Case 1 combines sentiment data from Hadoop with data from traditional relational sources.

Objective:
Determine if sentiment data from the first week of the Iron Man 3 movie is a predictor of sales
Problem:
Cannot utilize social data and sentiment analysis with the sales management system
Solution:
Leverage JBoss Data Virtualization to mash up sentiment analysis data with ticket and merchandise sales data in MySQL into a single view of the data.






Use Case 2 revolves around geographically distributed Hadoop clusters, with Data Virtualization securing data by user role.  Use Case 2 combines data from 2 Hortonworks sandboxes.

Objective:
Secure data according to role with row-level security and column masking
Problem:
Cannot hide region data from region-specific users
Solution:
Leverage JBoss Data Virtualization to provide row-level security and column masking


The walk-throughs of the demos are below for both use cases. There are 4 videos, 2 for each use case, covering the setup/configuration and then running the demo.

Use Case 1 Configuration and Running



Use Case 2 Configuration and Running


Wednesday, September 3, 2014

Discover Red Hat and Apache Hadoop for the Modern Data Architecture


I will be doing 3 joint webinars in September with Hortonworks.  Please register here and join us for Hadoop and Data Virtualization use cases.

As an enterprise's big data program matures and Apache Hadoop becomes more deeply embedded in critical operations, the ability to support and operate it efficiently and reliably becomes increasingly important. To aid enterprises in operating a modern data architecture at scale, Red Hat and Hortonworks have collaborated to integrate HDP with Red Hat's proven platform technologies.

Join us in this interactive series as we demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data.

Here's what you'll be signing up for:

Webinar 1, September 3 @10am PST: Red Hat and Hortonworks: Delivering the open modern data architecture
Webinar 2, September 10 @10am PST: Red Hat JBoss Data Virtualization and HDP: Evolving your data into a strategic asset (demo/deep dive)
Webinar 3, September 17 @10am PST: Red Hat JBoss and Hortonworks: Enabling the Data Lake (demo/deep dive)

You'll receive an individualized email for each webinar (3 in total) upon registration.

Speakers:
John Kreisa, VP Strategic Marketing, Hortonworks
Raghuram Thiagarajan, Director, Product Management, Hortonworks
Robert Cardwell, VP Strategic Partnerships and Alliances, Red Hat
Syed Rasheed, Senior Principal Product Marketing, Red Hat
Kim Palko, Principal Product Manager, Middleware, Red Hat
Kenneth Peeples, Principal Product Marketing Manager, Middleware, Red Hat