Thursday, July 30, 2015

Friday is System Administrator Day

Happy SysAdmin Day!

Wait… what exactly is SysAdmin Day? Oh, it’s only the single greatest 24 hours on the planet… and pretty much the most important holiday of the year. It’s also the perfect opportunity to pay tribute to the heroic men and women who, come rain or shine, prevent disasters, keep IT secure and put out tech fires left and right.
At this point, you may be thinking, “Great. I get it. My sysadmin is a rock star. But now what?” Glad you asked! Proper observation of SysAdmin Day includes (but is not limited to):
  • Cake & Ice cream
  • Pizza
  • Cards
  • Gifts
  • Words of gratitude
  • Custom t-shirts celebrating the epic greatness of your SysAdmin(s)
  • Balloons
  • Streamers
  • Confetti
Friday, July 31, 2015, is the 16th annual System Administrator Appreciation Day. On this special international day, give your System Administrator something that shows that you truly appreciate their hard work and dedication. (All day Friday, 24 hours, your own local time-zone).
Let’s face it, System Administrators get no respect 364 days a year. This is the day that all fellow System Administrators across the globe will be showered with expensive sports cars and large piles of cash in appreciation of their diligent work. But seriously, we are asking for a nice token gift and some public acknowledgement. It’s the least you could do.
Consider all the daunting tasks and long hours (weekends too). Let’s be honest, sometimes we don’t know our System Administrators as well as they know us. Remember, this is one day to recognize your System Administrator for their workplace contributions and to promote professional excellence. Thank them for all the things they do for you and your business.

Wednesday, July 29, 2015

Java EE and Node.js Comparison



I was looking for a quick comparison of Java EE and Node.js and came across a blog post from StrongLoop that I thought did a good job, including an example for both.


In their post, they showed how to create a REST-based web service using Java EE. This service returned a list of beers from the fantastic Pabst Brewing Company. Then they created the same application using StrongLoop LoopBack and Node. With very little code in Node, they created a REST web service that returns the name and description of the beers. In addition, the LoopBack API also provides default actions for the entire CRUD process.
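To make the comparison concrete, here is a minimal sketch of what the Java EE side of such a service might look like using JAX-RS. The Beer class, resource path and sample data are illustrative only (they are not taken from the StrongLoop post), and a JSON provider plus an @ApplicationPath activator class are assumed to be supplied by the application server.

// Minimal JAX-RS sketch: a REST endpoint that returns beers as JSON.
// Beer class, path and sample data are illustrative; JSON serialization and
// JAX-RS activation are assumed to be provided by the application server.
import java.util.Arrays;
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/beers")
public class BeerResource {

    public static class Beer {
        public String name;
        public String description;

        public Beer(String name, String description) {
            this.name = name;
            this.description = description;
        }
    }

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<Beer> listBeers() {
        // In a real service this list would come from a database (e.g., via JPA).
        return Arrays.asList(
                new Beer("Pabst Blue Ribbon", "American lager"),
                new Beer("Old Milwaukee", "Classic American lager"));
    }
}

The equivalent LoopBack application needs little more than a model definition, since LoopBack generates the CRUD REST endpoints automatically.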

The table below shows some of the comparisons discussed in the blog:
Feature | Java EE | Node.js
Great IDE support | Yes, multiple choices including Eclipse, Sublime and Idea | Yes, multiple choices including Visual Studio, Eclipse, Sublime
Dependency management | Maven | NPM
Enterprise ready | Yes, in use today | Yes, in use today
Large ecosystem of libraries | Yes | Yes
Requires JVM | Yes | No
Common frameworks | Spring, JEE | Express
Database support | Yes | Yes
ORM frameworks | Yes | Yes
Testing frameworks | Yes | Yes


References:
https://strongloop.com/strongblog/node-js-java-getting-started/
http://blog.shinetech.com/2013/10/22/performance-comparison-between-node-js-and-java-ee/
http://www.infoworld.com/article/2883328/java/java-vs-nodejs-an-epic-battle-for-developer-mindshare.html

Tuesday, July 28, 2015

What is the Hadoop Ecosystem?


In some of our articles and demos we have examples of JBoss Data Virtualization (Teiid) using Hadoop as a data source through Hive.  When creating examples of data virtualization with Hadoop environments such as Hortonworks Data Platform, Cloudera Quickstart, etc., there are a lot of open source projects included.  I wanted to highlight some of those so that you have an overview of the Hadoop ecosystem.  You can find the information below, as well as detail on more projects in the ecosystem, in the Hadoop ecosystem table.

Map Reduce - MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's MapReduce: Simplified Data Processing on Large Clusters paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN’s execution model is more generic than the earlier MapReduce implementation: YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing. (A minimal word-count sketch of the MapReduce model appears after this list.)
HDFS - The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
HBase - Google BigTable inspired. A non-relational distributed database providing random, real-time read/write operations on very large, column-oriented tables (BDDB: Big Data Data Base). It’s the Hadoop database, and it’s commonly used to back the output of Hadoop MapReduce jobs with Apache HBase tables.
Hive - Data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides a SQL-like language (not SQL-92 compliant): HiveQL.
Pig - Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce. Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
Zookeeper - It’s a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services; perhaps the most famous of those are Apache HBase, Storm and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper simplifies building distributed systems, making development more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
Mahout - Machine learning library and math library, on top of MapReduce.
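To illustrate the MapReduce programming model described above, here is a minimal word-count sketch using the Hadoop MapReduce Java API. It is illustrative only; the Job driver that configures input/output paths and submits the job to YARN is omitted.

// Word-count sketch illustrating the MapReduce model (illustrative only).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}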

Also, you can visit the Big Data Insights Page to learn more about Red Hat Products in relation to the Hadoop Ecosystem.


Monday, July 27, 2015

Suggested Minimum Requirements for Data Virtualization Server and Designer

This week I wanted to share some minimum requirement recommendations for the Data Virtualization (Teiid) Server and Designer, which are below.  You can find more detail in the Product Documentation, such as the three considerations that help determine the minimal JVM footprint: concurrency, data volume and plan processing.  Also, you can find more detailed environment recommendations through the sizing tool.  The JBoss Data Virtualization Sizing Architecture Tool is a simple web application that helps plan your JBoss Data Virtualization deployment.  It presents a series of questions to gather information about the business environment for the deployment. When all questions are answered, the tool recommends a JBoss Data Virtualization configuration to support the business requirement. This recommendation includes:
  • The recommended number of servers
  • The recommended number of cores
  • How much memory is required, and the JVM size needed, for each node
The following recommended minimum requirements should be thought of as a starting point. They should be adjusted based on the formulas in the product documentation, the recommendations from the sizing tool, and expected usage.

JBoss Developer Studio Integration Stack (Teiid Designer) – without application server
  • 2 GB RAM will get you started, but more is needed for large models
  • Modern Processor
  • 500 MB disk space for installed product files
  • 2+GB for model projects and related artifacts

The minimum sizing for the DV server is:
  • 16 GB JVM memory size
  • Modern multi-core (dual or better) processor or multi-socket system with modern multi-core processors
  • 20+ GB disk space for the JBoss server product and DV components:
  • 1 GB disk for installed product files
  • 5+ GB for log files and deployed artifacts
  • 15 GB (default) for BufferManager maxBufferSpace
  • If ModeShape (repository) will be used, you will need to increase the disk space by a minimum of 5 GB more.

Thursday, July 23, 2015

Connecting to Cloudera Quickstart Virtual Machine from Data Virtualization and SQuirreL

http://www.redhat.com/en/files/resources/en-rhjb-ventana-research-infographic.pdf
One of the great capabilities of JBoss Data Virtualization is the ability to connect to Hadoop through Hive, which was added as part of Data Virtualization (DV) 6.0.  This gives us the ability to aggregate data from multiple data sources, including big data.  This also gives us the ability to analyze our data through many different tools and standards.  From the Re-Think Data Integration infographic, more than one-quarter of companies see virtualizing data as a critical approach to integrating big data.  With DV 6.1, Cloudera Impala was added for fast SQL query access to data stored in Hadoop.  So I wanted to add an example of how to use Cloudera Impala as a data source for DV.  To find out how Cloudera Impala fits into the Hadoop ecosystem, take a look at the Impala Concepts and Architecture documentation.

Differences between Hive and Impala

First let's take a look at an overview of Hive and Impala.  Cloudera Impala is a native Massive Parallel Processing (MPP) query engine which enables users to perform interactive analysis of data stored in HBase or HDFS.  The Apache Hive data warehouse software facilitates querying and managing large data sets residing in distributed storage and provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.  I grabbed some of the information from the Cloudera Impala FAQ.

How does Impala compare to Hive (and Pig)?  Impala is different from Hive because it uses its own daemons that are spread across the cluster to run queries.  Since Impala does not rely on MapReduce, it avoids the startup overhead of MapReduce jobs, which allows Impala to return results in real time.
Can any Impala query also be executed in Hive? Yes.  There are some minor differences in how some queries are handled, but Impala queries can also be completed in Hive.  Impala SQL is a subset of HiveQL.  Impala is maintained by Cloudera while Hive is maintained by Apache.
Can I use Impala to query data already loaded into Hive and HBase? There are no additional steps to allow Impala to query tables managed by Hive, whether they are stored in HDFS or HBase.  Impala is configured to access the Hive metastore.
Is Hive an Impala requirement?  The Hive metastore service is a requirement.  Impala shares the same metastore database as Hive, allowing Impala and Hive to access the same tables transparently.  Hive itself is optional.
What are good use cases for Impala as opposed to Hive or MapReduce?  Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets.  Hive and MapReduce are appropriate for very long running batch-oriented tasks such as ETL.

Cloudera Setup

There are several options that Cloudera offers to test their product: quickstart VMs, Cloudera Live and a local install of CDH.  For a quick setup I chose the VirtualBox quickstart virtual machine for CDH 5.4.x, a single-node Hadoop cluster with examples for easy learning.  The VMs run CentOS 6.4 and are available for VMware, VirtualBox and KVM.  All of them require a 64-bit host OS.  I also chose a bridged network and increased the memory and CPUs. Once the VM is downloaded, you extract the files from the ZIP, import the VM and make the network/memory/CPU setting changes.  Then the VM can be started.


Once you launch the VM, you are automatically logged in as the cloudera user. The account details are:
  • username: cloudera
  • password: cloudera
The cloudera account has sudo privileges in the VM. The root account password is cloudera.  The root MySQL password (and the password for other MySQL user accounts) is also cloudera.  Hue and Cloudera Manager use the same credentials.  Then we want to browse to Cloudera Manager at quickstart.cloudera:7180/cmf/home and make sure our services, such as Impala, Hive and YARN, are started.


Now that we have the VM set up, we want to make sure we can add data.  I ran through Cloudera's Tutorial Exercises 1-3 at quickstart.cloudera/tutorial/home.  Then we can see the tables and data through the Hue UI (quickstart.cloudera:8888) through Impala and Hive.


Now we have our Big Data Environment running so let's move onto testing Impala and Hive.

SQuirreL Testing

SQuirreL SQL Client is a free, open source graphical SQL client written in Java that allows you to view the structure of a JDBC-compliant database, browse the data in tables, issue SQL commands, etc.  First we have to download the JDBC drivers for Impala and Hive; we will go through Impala in SQuirreL first.  I downloaded the Impala JDBC v2.5.22 driver and then unzipped the jdbc4 zip file.  We add the JDBC driver by clicking on the Drivers tab, then the Add Driver button.  In the Extra Class Path tab, click Add, browse to the extracted jar files and add them along with the Name and Class Name.


Once the driver is added, it should have a green check mark.


Next we click on the Aliases tab and add a new connection.  


We add the driver and URL and then click connect.  Once connected we can browse to the tables, select a table and preview content.


Now we have previewed the data through Impala in SQuirreL; next we want to test Hive as well. We download the Hive 1.2.1 driver from Apache Hive.  We do the same as above and add the driver by pointing to the jars in the lib directory of the download and using Driver Class org.apache.hive.jdbc.HiveDriver.  Once the driver is added, we create a session connecting to Hive.


We can view the tables and content, which shows we can now use Hive and Impala through DV to access the data in Cloudera.
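The same connections can also be exercised from plain JDBC code, which is a handy sanity check before wiring the drivers into DV. The sketch below assumes the Cloudera Impala jdbc4 driver class used later in this post, the Quickstart VM hostname and the default ports (21050 for Impala, 10000 for HiveServer2); adjust the URLs for your environment.

// Sanity-check sketch: list tables over JDBC against Impala and Hive.
// Hostname, ports and the noSasl setting are assumptions based on the Quickstart VM.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClouderaJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Impala via the Cloudera JDBC4 driver
        Class.forName("com.cloudera.impala.jdbc4.Driver");
        listTables("jdbc:impala://quickstart.cloudera:21050/;auth=noSasl");

        // Hive via the Apache Hive JDBC driver (HiveServer2)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        listTables("jdbc:hive2://quickstart.cloudera:10000/default");
    }

    private static void listTables(String url) throws Exception {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}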


Data Virtualization Testing

Now we can move onto testing Cloudera Impala with DV.  DV 6.1 GA can be downloaded from jboss.org.  We will also use JBoss Developer Studio and the JBoss Integration Stack.

Data Virtualization Setup

Run through the DV install instructions to install the product.  Then we want to update the files for the Cloudera Impala JDBC driver.

To configure Cloudera Impala as a data source with DV:

1) Make sure the server is not running.
2) In the modules directory create the org/apache/hadoop/impala/main folder.


3) Within the folder we want to create the module.xml file.

<?xml version="1.0" encoding="UTF-8"?>
<module xmlns="urn:jboss:module:1.0" name="org.apache.hadoop.impala">
    <resources>
      <resource-root path="ImpalaJDBC4.jar"/>
      <resource-root path="hive_metastore.jar"/>
      <resource-root path="hive_service.jar"/>
      <resource-root path="libfb303-0.9.0.jar"/>
      <resource-root path="libthrift-0.9.0.jar"/>
      <resource-root path="TCLIServiceClient.jar"/> 
      <resource-root path="ql.jar"/>        
    </resources>

    <dependencies>
        <module name="org.apache.log4j"/>
        <module name="org.slf4j"/>
 <module name="org.apache.commons.logging"/>
        <module name="javax.api"/>
        <module name="javax.resource.api"/>        
    </dependencies>
</module>

Note that the resources section points to the jar files included with the Impala driver.

4) Now copy the above JDBC jar files in the resources section to the folder.
5) Next we update the standalone.xml file in the standalone/configuration folder to add the driver and datasource.

<datasource jndi-name="java:/impala-ds" pool-name="ImpalaDS" enabled="true" use-java-context="true">
     <connection-url>jdbc:impala://10.1.10.168:21050/;auth=noSasl</connection-url>
     <driver>impala</driver>
</datasource>

<driver name="impala" module="org.apache.hadoop.impala">
     <driver-class>com.cloudera.impala.jdbc4.Driver</driver-class>
</driver>

6) Now we can start the server.
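Once the server is running, the datasource is available to deployed applications under the JNDI name java:/impala-ds defined above. Here is a minimal sketch of using it from application code; the customers table comes from the Cloudera tutorial data, and everything else is illustrative.

// Sketch: using the Impala datasource from code deployed on the DV/EAP server.
// The JNDI name matches the datasource defined above; the query is illustrative.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class ImpalaDataSourceCheck {
    public long countCustomers() throws Exception {
        DataSource ds = (DataSource) new InitialContext().lookup("java:/impala-ds");
        try (Connection conn = ds.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM customers")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}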

JBoss Developer Studio Setup

Run through the install instructions for installing JBoss Developer Studio and installing the Teiid components from the Integration Stack.  Create a new workspace, e.g. dvdemo-cloudera.  Then we will create an example view.

1. Create a new Teiid Model Project, e.g. clouderaimpalatest


2. Next, import metadata using JDBC from a database into a new or existing relational model, using a Data Tools JDBC data source connection profile.


3.  Next we create a new connection profile with the Generic JDBC Profile Type.


4. We create a new driver, e.g. Impala Driver, by adding all the jars and setting the connection settings.  The username/password should be ignored.







5. Next we import the metadata and select the types of objects in the database to import.  We will just choose the tables and then the sources folder.




6. We add a new server and set it as externally managed.  We start the server externally and then click the start button within JBDS.


7. Within the sources folder in the Teiid Perspective, we right-click one of the tables, then choose Modeling and Preview Data.  If everything is set up properly, the data will display.


8. Now we can create a layer, or View, to add an abstraction layer.  In our case we are just going to create a View for a one-to-one View-to-Source example.  But to show the power of DV, we would normally aggregate or federate multiple sources into a view, either in this layer or in another layer above that uses the lower layers for greater abstraction and flexibility.  We will test with the customers table.  After creating our view with the table, we tie the source table to the view.  We also set the primary key on customerid so that OData is available when we create the VDB.  We can also preview the data on the view.


9. We create a VDB that we can deploy to the server and execute.


10. After right-clicking on the clouderaimpalatest.vdb, we click Deploy so it is deployed to the server.  Next we can browse the OData endpoint to show the data as a consumer.

-First we take a look at the metadata


-Then we can list all the customers
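For reference, here is a rough sketch of hitting the OData endpoint programmatically rather than from a browser. The URL pattern, VDB name, view name and credentials are assumptions based on this walkthrough; adjust them for your deployment and security setup.

// Sketch: consuming the VDB's OData endpoint as a client. URLs, VDB/view names
// and credentials are illustrative assumptions; DV typically secures OData with
// HTTP basic auth, so supply a user that has the required role.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class ODataPreview {
    public static void main(String[] args) throws Exception {
        // Metadata document for the deployed VDB (assumed VDB name: clouderaimpalatest)
        fetch("http://localhost:8080/odata/clouderaimpalatest/$metadata");
        // Entity set exposed by the view layer (assumed view name: customers)
        fetch("http://localhost:8080/odata/clouderaimpalatest/customers?$format=json");
    }

    private static void fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        String credentials = Base64.getEncoder().encodeToString("user:user".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}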


References

https://hive.apache.org/
https://developer.jboss.org/wiki/ConnectToAHadoopSourceUsingHive2
http://www.cloudera.com/content/cloudera/en/downloads/connectors/impala/jdbc/impala-jdbc-v2-5-5.html
https://github.com/onefoursix/Cloudera-Impala-JDBC-Example

Tuesday, July 21, 2015

Tested Integrations for Data Virtualization

Every JBoss® Data Virtualization (JDV) release is tested and supported on a variety of market-leading operating systems, Java™ Virtual Machines (JVMs), and database combinations. In addition, Red Hat provides a tool to help plan your JBoss Data Virtualization deployment: the JBoss Data Virtualization Sizing Architecture Tool.

The following databases and database drivers were tested as part of the JBoss Data Virtualization release process for use as a persistence store for platform components. A persistence store is used by product components to store state or other information. Product components include the ModeShape hierarchical database, S-RAMP repository and Teiid/JDV external materialized views.

Databases | JDBC Driver
IBM DB2 9.7 | IBM DB2 JDBC Universal Driver Architecture 4
Oracle 12c | Oracle JDBC Driver v12
Oracle 11g | Oracle JDBC Driver v11
MySQL 5.5 | MySQL Connector/J 5
Microsoft SQL Server 2012 | Microsoft SQL Server JDBC Driver 4
Microsoft SQL Server 2008 R2 SP2 | Microsoft SQL Server JDBC Driver 4
PostgreSQL 9.2 | JDBC4 PostgreSQL Driver, Version 9

The following were tested as part of the JBoss Data Virtualization release process as data sources that may be accessed through the platform.

Data source | Database Driver/Supported Interface/Notes
Oracle 12c | Oracle JDBC Driver v12
Oracle 11g R2 | Oracle JDBC Driver v11
Oracle 10g R2 | Oracle JDBC Driver v10
IBM DB2 9.7 | IBM DB2 JDBC Universal Driver Architecture v4
Microsoft SQL Server 2008, 2012 | Microsoft SQL Server JDBC Driver 4
Sybase ASE 15 | jConnect for JDBC 3.0 v7
MySQL 5.5 | mysql-connector-java v5.5
MySQL 5.1 | mysql-connector-java v5.1
PostgreSQL 9.2 | JDBC4 PostgreSQL Driver, Version 9.2
PostgreSQL 8.4 | JDBC4 PostgreSQL Driver, Version 8.3
Teradata 12 | Teradata JDBC driver, v13.00
Netezza 6.0.2 | Netezza JDBC driver, v6
Greenplum Database 4.1 | JDBC3 PostgreSQL Driver, Version 8.3
Ingres 10 | Ingres JDBC driver, v3.0
MongoDB v2.4 | via Teiid MongoDB resource adapter
Apache Cassandra 2.1 (technical preview) | via Teiid Cassandra resource adapter
JBoss Data Grid 6.4 | via Data Grid HotRod client
JBoss Data Services Platform 5.x | JBoss Data Services Platform (Teiid) JDBC Driver
MetaMatrix 5.5.4 | MetaMatrix 5.5.4 JDBC Driver
LDAP 3 | via Teiid LDAP resource adapter
Salesforce.com | API version 22
SAP NetWeaver Gateway | OData + SAP Annotations for OData v2.0
ModeShape Hierarchical Database | via bundled ModeShape JDBC driver
Files | Delimited and Fixed-length
XML Files
XML over HTTP | HTTP 1.1
SOAP Web Services | SOAP 1.1, WSDL 1.2
Cloudera CDH 5 | v2.1 Impala JDBC Driver; v0.13, 0.14 Hive JDBC Driver
Hortonworks Data Platform 2 | v0.13, 0.14 Hive JDBC driver
Apache Hive 0.12, 0.13, 0.14 | v0.12, 0.13, 0.14 Hive JDBC driver
Google Spreadsheets | via Teiid Google resource adapter
Microsoft Excel 2010 | Apache POI
Microsoft Access 2010 | JDBC-ODBC bridge driver