
Wednesday, October 14, 2015

Data as a Service: JBoss Data Virtualization and Hadoop powering your Big Data solutions

Guest blog by Syed Rasheed, Senior Product Marketing Manager
Twitter @Junooni, email [email protected]


Red Hat and Cloudera have announced the formation of a strategic alliance. From a JBoss perspective, the key objective of the alliance is to leverage big data enterprise-wide and not let Hadoop become another data silo. Cloudera combined with Red Hat JBoss Data Virtualization integrates Hadoop with existing information sources, including data warehouses, SQL and NoSQL databases, enterprise and cloud applications, and flat and XML files. The solution creates business-friendly, reusable, virtual data models with unified views by combining and transforming data from multiple sources, including Hadoop. The result is integrated data available on demand to external applications through standard SQL and web services interfaces.
The reality at the vast majority of organizations is that data is spread across too many applications and systems. Most organizations don't know what they've lost because their data is fragmented across the organization. This problem does not go away just because an organization adopts big data technology like Hadoop; in fact, it gets more complicated. Some organizations try to solve it by hard coding access to each data store. This simplistic approach breaks down silos inefficiently and brings lock-in with it. Lock-in makes applications less portable, a key metric for future proofing IT. It also impedes organizational agility, because hard coding data store access is time consuming and makes IT more complex, incurring technical debt. Successful businesses need to break down data silos and make data accessible to all applications and stakeholders (often a requirement for real-time contextual services).
A much better approach to solving this problem is abstraction through data virtualization. It is a powerful tool, well suited to the loose coupling approach prescribed by the Modern Enterprise Model. Data virtualization lets applications retrieve and manipulate data without needing to know the technical details of each data store. Once implemented, organizational data can be accessed through a simple REST API or a familiar SQL interface.
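As a concrete illustration, here is a minimal sketch of that SQL access path using Teiid's JDBC driver. The VDB name (customers), the view name (CustomerView), and the credentials are hypothetical; 31000 is Teiid's default JDBC port, and the Teiid client JAR is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataVirtQuery {
    public static void main(String[] args) throws Exception {
        // Teiid JDBC URL: jdbc:teiid:<vdb-name>@mm://<host>:<port>
        // "customers" is a hypothetical VDB name; 31000 is Teiid's default JDBC port.
        String url = "jdbc:teiid:customers@mm://localhost:31000";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // The virtual view may federate Hadoop/Hive with an RDBMS,
             // but the client only ever sees plain SQL.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, total_orders FROM CustomerView")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getInt("total_orders"));
            }
        }
    }
}
```

Note that nothing in the client code reveals where the data physically lives; repointing the view at a different source requires no application change.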
Data Virtualization (or an abstracted Data as a Service) plugs into the Modern Enterprise Platform as a higher-order layer, offering the following advantages:
  • Better business decisions due to organization wide accessibility of all data
  • Higher organizational agility
  • Loosely coupled services making future proofing easier
  • Lower cost
Data virtualization is therefore a critical part of the big data solution. It facilitates and improves the use of big data in the enterprise by:
  • Abstracting big data into relational-like views
  • Integrating it with existing enterprise sources
  • Adding real time query capabilities to big data
  • Providing full support for standards-based interfaces such as REST and OData, in addition to JDBC and ODBC (see the sketch after this list)
  • Adding security and governance to the big data infrastructure
  • Flattening data silos through a unified data layer
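For the REST/OData side, a plain Java HTTP client is enough to read a virtual view. The sketch below assumes the same hypothetical customers VDB; the endpoint path follows Teiid's /odata/&lt;vdb&gt;/&lt;view&gt; convention, but the exact URL layout depends on the product version.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ODataClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical OData endpoint exposed by the "customers" VDB; the exact
        // URL layout (/odata/<vdb>/<view>) may differ by server version.
        URL url = new URL("http://localhost:8080/odata/customers/CustomerView?$format=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON payload
            }
        }
    }
}
```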
To learn more, download, and get started with JBoss Data Virtualization, visit http://www.jboss.org/products/datavirt
Data Virtualization by Example https://github.com/datavirtualizationbyexample
Interested in the community version? Visit http://teiid.jboss.org/


Tuesday, July 28, 2015

What is the Hadoop Ecosystem?


In some of our articles and demos we have examples of JBoss Data Virtualization (Teiid) using Hadoop as a data source through Hive. When creating Data Virtualization examples with Hadoop environments such as Hortonworks Data Platform, Cloudera Quickstart, etc., there are a lot of open source projects involved. I wanted to highlight some of these so that you have an overview of the Hadoop ecosystem. You can find the information below, as well as more project details, in the Hadoop Ecosystem Table.

MapReduce - MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from the Google paper MapReduce: Simplified Data Processing on Large Clusters. The current Apache MapReduce version is built on the Apache YARN framework. YARN stands for "Yet-Another-Resource-Negotiator". It is a framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation: unlike the original Apache Hadoop MapReduce (also called MR1), YARN can run applications that do not follow the MapReduce model. Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
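To make the programming model concrete, here is the canonical word-count job, essentially as it appears in the Hadoop MapReduce tutorial; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```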
HDFS - The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. The HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby, coordinated through ZooKeeper.
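A minimal sketch of writing and reading a file through the HDFS Java API follows; the NameNode address (hdfs://localhost:8020) and the file path are assumptions for a local single-node setup, and would normally come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; set here for clarity.
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");
        // Write: the file is split into blocks and replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello, HDFS!\n");
        }
        // Read it back through the same FileSystem abstraction.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}
```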
HBase - Inspired by Google BigTable, HBase is a non-relational, distributed database providing random, real-time read/write access to very large, column-oriented tables. It is the Hadoop database, and it commonly backs Hadoop MapReduce jobs by storing their output in Apache HBase tables.
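A minimal sketch of those random, real-time reads and writes using the HBase Java client (1.x-style API); the users table and its info column family are hypothetical and assumed to already exist, and connection details come from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumes an existing table 'users' with column family 'info'.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random real-time write: one cell addressed by (row, family, qualifier).
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random real-time read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```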
Hive - Data warehouse infrastructure originally developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL-92 compliant): HiveQL.
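A minimal sketch of querying Hive through the HiveServer2 JDBC driver, the same access path our Data Virtualization examples use under the covers; the URL, credentials, and pageviews table are assumptions for a default local installation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 default port is 10000; the driver
        // (org.apache.hive.jdbc.HiveDriver) is auto-registered from hive-jdbc.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into cluster jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}
```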
Pig - Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop, making use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce: it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops, because traditional procedural and object-oriented languages describe control flow, with data flow a side effect of the program; Pig Latin instead focuses on data flow.
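A minimal sketch of that compile-to-MapReduce data flow, driven from Java via Pig's embedded PigServer API; the input file and its schema are made up for illustration, and local mode is used so the script runs without a cluster.

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery line is Pig Latin; '/tmp/clicks.csv' is a made-up input.
        pig.registerQuery("clicks = LOAD '/tmp/clicks.csv' USING PigStorage(',') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP clicks BY user;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(clicks);");

        // Pig compiles this data flow into one or more jobs and runs them.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```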
Zookeeper - A coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate clusters and provide highly available distributed services; perhaps the most famous of these are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper simplifies the development of distributed systems, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service that gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
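A minimal sketch using the ZooKeeper Java client: registering an ephemeral, sequential znode, the basic building block behind coordination patterns like leader election and service discovery. The ensemble address and the /workers path are assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) local ensemble; the watcher logs session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("event: " + event);
            }
        });

        // Make sure the parent znode exists before registering under it.
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // An ephemeral znode disappears when this session dies; SEQUENTIAL
        // appends a monotonically increasing suffix to the name.
        String path = zk.create("/workers/worker-", "host1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + path);

        // Read the registration back, then end the session.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```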
Mahout - A machine learning and math library built on top of MapReduce.

Also, you can visit the Big Data Insights page to learn more about Red Hat products in relation to the Hadoop ecosystem.


Monday, May 18, 2015

Data Virtualization Primer - Introduction

This week we are starting the Data Virtualization Primer, which I am splitting into three series: The Basics, The Connectors, and The Solutions. My goal is to publish one or two articles a week, each covering a topic that can be reviewed in a short amount of time. Demos and examples will be included, and some topics will be broken into multiple parts to make them easier to digest. The planned outline is below, along with our first topic: the Data Virtualization introduction.

Series 1 - The Basics
  1. Introduction
  2. The Concepts (SOAs, Data Services, Connectors, Models, VDBs)
  3. Architecture
  4. On Premise Server Installation
  5. JBDS and Integration Stack Installation
  6. WebUI Installation
  7. Teiid Designer - Using simple CSV/XML Datasources (Teiid Project, Perspective, Federation, VDB)
  8. JBoss Management Console
  9. The WebUI
  10. The Dashboard Builder
  11. OData with VDB
  12. JDBC Client
  13. ODBC Client
  14. DV on Openshift
  15. DV on Containers (Docker)
Series 2 - The Connectors

This series will cover the connectors, with an example demo of each.

Series 3 - The Solutions
  1. Big Data Example
  2. IoT Example
  3. Cloud Example
  4. Mobile Example
  5. JBoss Multi-product examples (Fuse/BPMS/BRMS/Feedhenry/DG)