Architecting HBase Applications by Jean-Marc Spaggiari and Kevin O’Dell


Author: Jean-Marc Spaggiari and Kevin O’Dell
ISBN: 9781491915813
File size: 12.1 MB
Year: 2015
Pages: 400
Language: English
File format: PDF
Category: Programming


 

Architecting HBase Applications
Jean-Marc Spaggiari and Kevin O’Dell

Copyright © 2015 Jean-Marc Spaggiari and Kevin O’Dell. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Marie Beaugureau
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

September 2015: First Edition
Revision History for the First Edition: 2015-08-25, First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491915813 for release details.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91581-3 [LSI]

Table of Contents

1. Underlying storage engine - Description
   Ingest/pre-processing
   Processing/Serving
   User Experience

2. Underlying storage engine - Implementation
   Table design
   Table schema
   Table parameters
   Implementation
   Data conversion
   Generate Test Data
   Create Avro schema
   Implement MapReduce transformation
   HFile validation
   Bulk loading
   Data validation
   Table size
   File content
   Data indexing
   Data retrieval
   Going further
   Bigger input file
   One region table
   Impact on table parameters

Chapter 1. Underlying storage engine - Description

The first use case we examine comes from Omneo, a division of Camstar, a Siemens company. Omneo is a big data analytics platform that assimilates data from disparate sources to provide a 360-degree view of product quality data across the supply chain. Manufacturers of all sizes are confronted with massive amounts of data, and manufacturing data sets exhibit the key attributes of big data: they are high volume, rapidly generated, and come in many varieties. When tracking products built through a complex, multi-tier supply chain, the challenges are exacerbated by the lack of a centralized data store and the absence of a unified data format. Omneo ingests data from all areas of the supply chain, such as manufacturing, test, assembly, repair, service, and field. Omneo offers this system to its end customers under a Software as a Service (SaaS) model.
This platform must provide users with the ability to investigate product quality issues, analyze contributing factors, and identify items for containment and control. The ability to offer a rapid turnaround for early detection and correction of problems can result in greatly reduced costs and significantly improved consumer brand confidence. Omneo must start by building a unified data model that links the data sources so users can explore the factors that impact product quality throughout the product lifecycle. Furthermore, Omneo has to provide a centralized data store and applications that facilitate the analysis of all product quality data in a single, unified environment.

Omneo evaluated numerous NoSQL systems and other data platforms. Omneo’s parent company Camstar has been in business for over 30 years, giving them a well-established IT operations system. When Omneo was created, they were given carte blanche to build their own system. Knowing the daunting task of handling all of the data at hand, they decided against building a traditional EDW. They also looked at other big data technologies such as Cassandra and MongoDB, but ended up selecting Hadoop as the foundation for the Omneo platform. The primary reason for the decision came down to the ecosystem, or lack thereof, offered by the other technologies. The fully integrated ecosystem that Hadoop offered with MapReduce, HBase, Solr, and Impala allowed Omneo to handle the data in a single platform without the need to migrate it between disparate systems.

The solution must be able to handle numerous products and customers’ data being ingested and processed on the same cluster. This can make handling data volumes and sizing quite precarious, as one customer could provide eighty to ninety percent of the total records. As of this writing, Omneo hosts multiple customers on the same cluster, for a rough count of more than 6 billion records stored on roughly 50 nodes. The total combined set of data in the HDFS filesystem is approximately 100 TB. This is important to note: as we get into the overall architecture of the system, we will point out where duplicating data is mandatory and where savings can be introduced by using a unified data format.

Omneo has fully embraced the Hadoop ecosystem for its overall architecture, so it only makes sense for the architecture to also take advantage of Hadoop’s Avro data serialization system. Avro is a popular file format for storing data in the Hadoop world. Avro allows a schema to be stored with the data, making it easier for different processing systems such as MapReduce, HBase, and Impala/Hive to access the data without serializing and deserializing it over and over again.

The high-level Omneo architecture is composed of the following tiers:

• Ingest/pre-processing
• Processing/Serving
• User Experience

Ingest/pre-processing

Figure 1-1. Batch ingest using HDFS API

The ingest/pre-processing phase includes acquiring the flat files, landing them in HDFS, and converting the files into Avro. As shown in Figure 1-1, Omneo currently receives all files in a batch manner. The files arrive in a CSV format or in an XML format. The files are loaded into HDFS through the HDFS API, as sketched in the example below. Once the files are loaded into Hadoop, a series of transformations are performed to join the relevant data sets together. Most of these joins are done based on a primary key in the data. In the case of electronic manufacturing, this is normally a serial number that identifies the product throughout its lifecycle. These transformations are all handled through the MapReduce framework. Omneo wanted to provide a graphical interface for consultants to integrate the data rather than write custom MapReduce code; to accomplish this they partnered with Pentaho to expedite time to production. Once the data has been transformed and joined together, it is serialized into the Avro format.
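The text does not show the ingest code itself, so here is a minimal sketch, under stated assumptions, of how a raw CSV file could be landed in HDFS through the FileSystem API. The file and directory paths and the class name are hypothetical, not Omneo’s actual layout.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: land a local CSV file in HDFS using the FileSystem API.
    // Paths and names are placeholders for illustration only.
    public class CsvIngest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // handle to the default (HDFS) filesystem

        Path localFile = new Path("file:///data/incoming/sensors-2015-08-25.csv");
        Path hdfsDir   = new Path("/landing/raw/");          // landing zone before Avro conversion

        fs.mkdirs(hdfsDir);                                  // no-op if the directory already exists
        fs.copyFromLocalFile(false, true, localFile, hdfsDir); // keep the source, overwrite the destination
        fs.close();
      }
    }

From here, the MapReduce transformation jobs described above pick the raw files up from the landing directory and write out Avro.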
Processing/Serving

Figure 1-2. Using Avro for a unified storage format

Once the data has been converted into Avro, it is loaded into HBase. Since the data is already being presented to Omneo in batches, we take advantage of this and use bulk loads. The data is loaded into a temporary HBase table using the bulk loading tool. The previously mentioned MapReduce jobs output HFiles that are ready to be loaded into HBase. The HFiles are loaded through the completebulkload tool, which works by passing in a URL that the tool uses to locate the files in HDFS. Next, the bulk load tool loads each file into the relevant region being served by each RegionServer. Occasionally a region has been split after the HFiles were created, in which case the bulk load tool automatically splits the new HFile according to the correct region boundaries.

Figure 1-3. Data flowing into HBase

Once the data is loaded into the staging table, it is then pushed out to the two main serving engines, Solr and Impala. Omneo uses a staging table to limit the amount of data read from HBase to feed the other processing engines. The reason behind using a staging table lies in the HBase key design. One of the cool things about this HBase use case is the simplicity of the schema design. Normally many hours are spent figuring out the best way to build a composite key that will allow for the most efficient access patterns, and we will discuss composite keys in later chapters. However, in this use case the row key is a simple MD5 hash of the product serial number. Each column stores an Avro record, and the column name contains the unique ID of the Avro record it stores. The Avro record is a denormalized data set containing all attributes for the record. A sketch of how such a cell might be assembled is shown below.
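The book does not show the cell-building code at this point, so the following is a hypothetical sketch of the key design just described: the row key is the MD5 hash of the serial number, the column qualifier is the unique record ID, and the value is the serialized Avro record. In the real pipeline these cells are written as HFiles by MapReduce and bulk loaded; a client-side Put is used here only to keep the example short, and the column family name is an assumption.

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical sketch of the cell layout described in the text:
    //   row key       = MD5(serial number)
    //   column family = "e" (assumed name)
    //   qualifier     = unique event/record ID
    //   value         = serialized Avro record bytes
    public class CellLayoutSketch {
      public static Put buildPut(String serialNumber, String eventId, byte[] avroBytes)
          throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] rowKey = md5.digest(Bytes.toBytes(serialNumber)); // 16-byte MD5 row key

        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("e"), Bytes.toBytes(eventId), avroBytes);
        return put;
      }
    }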
After the data is loaded into the staging HBase table, it is propagated into two other serving engines: the first is Cloudera Search (Solr) and the second is Impala. Figure 1-4 shows the overall load of data into Solr.

Figure 1-4. Managing full and incremental Solr index updates

The data is loaded into Search through the use of a customized MapReduceIndexerTool. By default, the MapReduceIndexerTool works on flat files read from HDFS. Given the batch aspect of the Omneo use case, they modified the indexer tool to read from HBase and write directly into the Solr collection through MapReduce. Figure 1-4 illustrates the two flows of data from HBase into the Solr collections. There are two collections in play for each customer: CollectionA (active), CollectionB (backup), and an alias that links to the “active” collection. During an incremental index, only the current collection is updated from the staging HBase table through the MapReduceIndexerTool. In the first part of the diagram, the HBase staging table is loading into CollectionA and the alias is pointing to the active collection (CollectionA). The dual-collection-with-an-alias approach offers the ability to drop all of the documents in a single collection and reprocess the data without suffering an outage. This gives Omneo the ability to alter the schema and push it out to production without taking downtime. The second part of the diagram illustrates this action: the MapReduceIndexerTool re-indexes the main HBase table into CollectionB while the alias still points to CollectionA. Once the indexing step completes, the alias is swapped to point at CollectionB, and incremental indexing is pointed at CollectionB until the dataset needs to be re-indexed again.
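The alias swap itself is a single call to Solr’s Collections API (the CREATEALIAS action re-points an existing alias name at a new collection). The book does not show that step here; below is a minimal sketch that issues the call over plain HTTP, with a hypothetical host, alias, and collection name rather than Omneo’s actual values.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical sketch: re-point a Solr alias at the freshly rebuilt collection
    // using the Collections API CREATEALIAS action.
    public class AliasSwapSketch {
      public static void main(String[] args) throws Exception {
        String solrHost   = "http://solr-node1:8983";    // any node in the SolrCloud cluster
        String alias      = "customer1_active";          // alias the application queries
        String collection = "customer1_CollectionB";     // collection that just finished re-indexing

        URL url = new URL(solrHost + "/solr/admin/collections"
            + "?action=CREATEALIAS&name=" + alias + "&collections=" + collection);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();              // 200 means the alias now points at the new collection
        try (InputStream in = conn.getInputStream()) {
          while (in.read() != -1) { /* drain the response body */ }
        }
        conn.disconnect();
        System.out.println("CREATEALIAS returned HTTP " + status);
      }
    }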
This is where the use case really gets interesting. HBase serves two main functions in the overall architecture. The first is to handle the MDM (master data management), since HBase allows updates. In this case, HBase is the system of record that Impala and Solr use; if there is an issue with the Impala or Solr datasets, they are rebuilt from the HBase dataset. In HBase, attempting to redefine the row key typically results in having to rebuild the entire dataset. Omneo first attempted to tackle faster lookups for secondary and tertiary fields by leveraging composite keys, but it turns out the end users like to change the primary lookups based on the metrics they are looking at. This is one of the reasons Omneo avoided composite keys and instead used Solr to add extra indexes on top of HBase. The second, and most important, function is that HBase actually stores all of the records being served to the end user.

Let’s look at a couple of sample fields from Omneo’s Solr schema.xml. Omneo flags only the HBase row key and the required Solr fields (id and version) as stored, which writes those values directly to HDFS. The other fields are flagged as indexed, which stores the data in a special index directory. Indexing a field makes it searchable, sortable, and facetable, and the index is also held in memory. Stored fields are retrievable through search and are persisted to the HDFS filesystem. The typical records that Omneo ingests can contain anywhere from hundreds to thousands of columns, depending on the product being ingested, but for the purposes of faceting and natural-language searching, typically only a small subset of those fields is necessary. The number of fields indexed will vary per customer and use case. This is a very common pattern, as the actual data results displayed to the customer are served from the application calling scans and multigets against HBase based on the stored row key. Indexing only the necessary fields serves several purposes:

• All of the data and facets are served out of memory, offering tighter and more predictable SLAs.
• The current state of Solr Cloud on HDFS writes the data to HDFS per shard and replica. If HDFS replication is set to the default factor of 3, then a shard with two replicas will have nine copies of the data on HDFS. This does not normally affect a Search deployment, as memory or CPU is usually the bottleneck before storage, but it does use more storage.
• Indexing the fields offers lightning-fast counts for the indexed fields, which can help avoid costly SQL or pre-HBase MapReduce-based aggregations.

The data is also loaded from HBase into Impala tables, derived from the Avro schemas and converted into the Parquet file format. Impala serves as Omneo’s data warehouse for the end users. The data is populated in the same manner as the Solr data, with incremental updates loaded from the HBase staging table and full rebuilds pulled from the main HBase table. As the data is pulled from the HBase tables, it is denormalized into a handful of tables to allow for an access pattern conducive to Impala. The model used is another portion of the secret sauce of Omneo’s business model:

Figure 1-5. Nah, we can’t share that. Get your own!

User Experience

Normally we do not spend much time looking at the end application, as applications tend to differ widely, but in this case it is important to discuss how everything comes together. Combining the different engines into a streamlined user experience is the big data dream. This is how companies move from playing around to truly delivering a monumental product.

Figure 1-6. Overall data flow diagram including the user interaction

The application makes use of all three serving engines in a single interface. This is important for a couple of key reasons. First, it increases analyst productivity: the analyst no longer has to switch between different UIs or CLIs. Second, the analyst is able to use the right tool for the right job. One of the major issues we see in the field is customers attempting to use one tool to answer all of the questions. By allowing Solr to serve facets and handle natural-language searches, HBase to serve the full-fidelity records, and Impala to handle the aggregations and SQL questions, Omneo is able to offer the analyst a 360-degree view of the data.

Figure 1-7. Check out this Solr goodness

Let’s start by looking at the Solr/HBase side of the house; these are the two most intertwined services of the Omneo application. As mentioned before, Solr stores the actual HBase row key and indexes the vast majority of the other fields that users like to search and facet against. As the user drills down or adds new facets, the raw records are not served back from Solr, but rather pulled from HBase using a multiget of the top 50 records (a sketch of this pattern appears at the end of the chapter). This allows the analyst to see the full-fidelity record behind the facets and searches. The same holds true if the analyst wishes to export the records to a flat file: the application calls a scan of the HBase table and writes out the results for the end user.

Figure 1-8. Leveraging Impala for custom analytics

On the Impala side of the house, also known as Performance Analytics, models are built and managed to handle SQL-like workloads, workloads that would otherwise be forced into HBase or Solr. Performance Analytics was designed to run a set of pre-packaged application queries against the data to calculate Key Performance Indicators (KPIs). The solution does not allow random free-form SQL queries, as long-running rogue queries can cause performance degradation in a multi-tenant application. In the end, the users can select the KPIs they want to look at and add extra functions to the queries (sums, averages, maximums, and so on).
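The Solr-to-HBase interaction described above boils down to two calls: a Solr query that returns only row keys, followed by an HBase multiget for those keys. The following sketch shows the shape of that round trip; the collection name, field names, table name, and column family are assumptions for illustration, not Omneo’s actual values, and it assumes the row key was stored in Solr as a plain string.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    // Hypothetical sketch of the Solr -> HBase round trip: ask Solr for the top 50
    // matching row keys, then multiget the full-fidelity records from HBase.
    public class SearchThenMultiget {
      public static void main(String[] args) throws Exception {
        // 1. Query Solr, requesting only the stored row key field (field name assumed).
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://solr-node1:8983/solr/customer1_active").build();
        SolrQuery query = new SolrQuery("eventType:warning");
        query.setFields("rowkey");
        query.setRows(50);
        QueryResponse response = solr.query(query);

        List<Get> gets = new ArrayList<>();
        for (SolrDocument doc : response.getResults()) {
          // Assumes the row key is stored as a string; a hex-encoded key would need decoding first.
          gets.add(new Get(Bytes.toBytes((String) doc.getFieldValue("rowkey"))));
        }
        solr.close();

        // 2. Multiget the full records from HBase (table and family names assumed).
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("sensors"))) {
          Result[] results = table.get(gets);   // batched internally per region server
          for (Result result : results) {
            // Each cell value holds a serialized Avro record; deserialize as needed.
            System.out.println("Fetched row: " + Bytes.toStringBinary(result.getRow()));
          }
        }
      }
    }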
Chapter 2. Underlying storage engine - Implementation

In the previous chapter, we described how Omneo uses the different Hadoop technologies to implement its use case. In this chapter we will look in detail at all the different parts involving HBase. This implementation chapter will not go into each and every detail, but it will give you all the required tools and examples to understand what is important in this phase.

As usual when implementing an HBase project, the first thing we consider is the table schema, which is the most important part of every HBase project. This can sometimes be easy, as it is for the current use case, but it can also sometimes require a lot of time. It is a good practice to always start with this task, keeping in mind how data is received from your application (write path) and how you will need to retrieve it (read path). Read and write access patterns will dictate most of the table design.

Table design

Table schema

Table design for the Omneo use case is pretty easy, but let’s work through the steps so you can apply a similar approach to your own table schema design. We want both read and write paths to be efficient. In Omneo’s case, data is received from external systems in bulk. Therefore, unlike other ingestion patterns where data is inserted one single value at a time, here it can be processed directly in bulk format and doesn’t require single random writes or updates based on the key. On the read side, the user needs to be able to retrieve all the information for a specific sensor very quickly by searching on any combination of sensor ID, event ID, date, and event type. There is no way we can design a key that makes all of those retrieval criteria efficient. We will have to rely on an external index which, given all of our criteria, will give us back a key we can use to query HBase. Given that the key will be retrieved from this external index and that we don’t have to look it up or scan for it, we can simply use a hash of the sensor ID, with the column qualifier being the event ID. You can refer to “Generate Test Data” later in this chapter to get a preview of the data format. Sensors can have very similar IDs, such as 42, 43, and 44, but sensor IDs can also span a wide range (e.g., 40000-49000). If we use the original sensor ID as the key, we might encounter hot-spots on specific regions due to the keys’ sequential nature. You can read more about hot-spotting in ???.

Hashing keys

One option to deal with hot-spotting is to simply pre-split the table based on those different known IDs to make sure they are correctly distributed across the cluster. However, what if the distribution of those IDs changes in the future? Then the splits might no longer be correct, and we might end up hot-spotting some regions again. If today all IDs are between 40xxxx and 49xxxx, regions will be split from the beginning to 41, 41 to 42, 42 to 43, and so on. But if tomorrow a new group of sensors is added with IDs in the 30xxxx to 39xxxx range, they will all end up in the first region. Since it is not possible to forecast what the future IDs will be, we need to find a solution that ensures a good distribution whatever the IDs turn out to be. When hashing data, even two initially close keys produce very different results. From our example above, 42 produces 50a2fabfdd276f573ff97ace8b11c5f4 as its MD5 hash, while 43 produces f0287f33eba7192e2a9c6a14f829aa1a. As you can see, unlike the original sensor IDs 42 and 43, sorting those two MD5 hashes puts them far from one another. And even as new IDs arrive, since they are translated into hexadecimal values, they will always be distributed between 0 and F. Using such a hashing approach ensures a good distribution of the data across all the regions, while still giving us direct access to the data for a specific sensor ID. The hash approach cannot be used when you need to scan your data in the initial order of the key, as the MD5 version of the key perturbs the original ordering, distributing the rows throughout the table.
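To make the hashing concrete, here is a small sketch that hex-encodes the MD5 hash of a few sensor IDs, the same operation used to derive the row keys above. The exact strings quoted in the text may have been produced with a slightly different input encoding, so treat the printed values as illustrative; the point is that close IDs hash to values that sort far apart.

    import java.security.MessageDigest;

    // Sketch: hex-encoded MD5 of a sensor ID, as used for the row key design above.
    // Close IDs such as 42 and 43 hash to values that sort far apart, which is what
    // spreads the rows evenly across regions.
    public class HashKeySketch {
      static String md5Hex(String sensorId) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(sensorId.getBytes("UTF-8"))) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }

      public static void main(String[] args) throws Exception {
        for (String id : new String[] {"42", "43", "44"}) {
          System.out.println(id + " -> " + md5Hex(id));
        }
      }
    }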
Column qualifier

For the column qualifier, the event ID will be used. The event ID is a hash value received from the downstream system, unique to the given event for this specific sensor. Each event has a specific type, such as “alert”, “warning”, or “RMA” (Return Merchandise Authorization). At first, we considered using the event type as the column qualifier. However, a sensor can encounter a single event type multiple times, and each “warning” a sensor encountered would overwrite the previous “warning” unless we used HBase’s “versions” feature. Using the unique event ID as the column qualifier allows multiple events of the same type for the same sensor to be stored without having to code extra logic around HBase’s “versions” feature to retrieve all of a sensor’s events.

Table parameters

Most HBase table parameters can be tuned to improve performance. However, only the parameters that apply to this specific use case are listed in this section. The full list of parameters available at table creation is covered in ???.

Compression

The first parameter we examine is the compression algorithm used when writing table data to disk. HBase writes the data into HFiles in a block format. Each block is 64 KB by default and is not compressed. Blocks store the data belonging to one region and column family. Because a table’s data contains related information and usually follows a common pattern, compressing those blocks almost always gives good results; for example, it is worthwhile to compress column families containing logs or customer information. HBase supports multiple compression algorithms: LZO, GZ (for GZip), Snappy, and LZ4. Each compression algorithm has its own pros and cons. For each algorithm, consider the performance impact of compressing and decompressing the data versus the compression ratio, that is, whether the data was sufficiently compressed to warrant running the compression algorithm. Snappy is very fast in all operations but has a fairly low compression ratio, while GZ is more resource intensive but compresses better. The algorithm you choose depends on your use case, and it is recommended to test a few of them on a sample dataset to validate compression rate and performance. As an example, a 1.6 GB CSV file generates 2.2 GB of uncompressed HFiles, while the exact same dataset uses only 1.5 GB with LZ4; Snappy-compressed HFiles for the same dataset take 1.5 GB as well. Since read and write latency are important for us, we will use Snappy for our table. Be aware of the availability of the various compression libraries on the different Linux distributions; for example, Debian does not include Snappy libraries by default. Due to licensing, LZO and LZ4 libraries are usually not bundled with common Apache Hadoop distributions and must be installed separately.
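The table creation itself is covered later in the chapter; as a preview, here is a hedged sketch of creating a table whose single column family uses Snappy compression through the Java admin API (HBase 1.x style). The table and family names are placeholders, not necessarily the ones used in the book.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.io.compress.Compression;

    // Hypothetical sketch: create a table with one column family compressed with Snappy,
    // matching the compression choice discussed above.
    public class CreateCompressedTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

          HTableDescriptor table = new HTableDescriptor(TableName.valueOf("sensors"));
          HColumnDescriptor family = new HColumnDescriptor("e");
          family.setCompressionType(Compression.Algorithm.SNAPPY); // compress blocks on disk
          table.addFamily(family);

          admin.createTable(table);
        }
      }
    }

The HBase shell equivalent would be along the lines of create 'sensors', {NAME => 'e', COMPRESSION => 'SNAPPY'}.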

Book Description

Lots of HBase books, online HBase guides, and HBase mailing lists and forums are available if you need to know how HBase works. But if you want to take a deep dive into use cases, features, and troubleshooting, Architecting HBase Applications is the right source for you. With this book, you’ll learn a controlled set of APIs that coincide with use-case examples and easily deployed use-case models, as well as sizing and best practices to help jump-start your enterprise application development and deployment.

• Learn the design patterns, and not just the components, necessary for a successful HBase deployment
• Go in depth into all the HBase shell operations and API calls required to implement the documented use cases
• Become familiar with the most common issues faced by HBase users, identify the causes, and understand the consequences
• Learn document-specific API calls that are tricky or very important for users
• Get use-case examples for every topic presented
