IBM leads BigInsights for Hadoop out behind barn. Shots heard
IBM has announced the retirement of the basic plan for its data analytics software platform, BigInsights for Hadoop.
The basic plan of the service will be retired in a month, on December 7 of this year.
"IBM took BigInsights for Hadoop out behind the shed, and only a gunshot was heard..."
That is how the well-known British outlet The Register recently reported the retirement of IBM's BigInsights product.
BigInsights is a big data analytics product in which IBM layered its own analytics technology on top of Apache Hadoop. After two years of uncertainty about the product's future, IBM has finally decided to shut it down.
Coincidentally, a recent Gartner article pointed out that "more than 70% of Hadoop deployments fail to deliver the expected business value..."
What happened to Hadoop and big data?
From the perspective of database management systems (DBMS), we analyze the capabilities of the common product types: RDBMS, MPP, Hadoop, NoSQL and NewSQL. What are the data processing characteristics of each type?
2. Comparison of several common data technologies
We first try to pin down the concept of big data, a much-abused term. According to Gartner, big data has the following characteristics (the 3 Vs):
Volume: the amount of data is large enough
Velocity: data access concurrency is high enough for real-time use
Variety: there are more types of data
On the other hand, big data is still data. Managing conventional data is inseparable from the familiar ACID transaction properties, which guarantee atomicity, consistency, isolation and durability when working with data. With these four dimensions (volume, velocity, variety and ACID), we can compare the products listed above.
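To make the ACID dimension concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. The accounts table, the names and the transfer function are hypothetical and exist only for illustration; the point is that a failed transfer leaves no partial update behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic unit (hypothetical schema)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer debits the source account before the check fails, yet the rollback restores the earlier balance. That all-or-nothing behavior is what the ACID score in this comparison is meant to capture.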
Here, several popular database management technologies are scored along these four dimensions on a 5-point scale: 5 is the highest score, indicating the strongest capability, and 1 is the lowest, indicating the weakest. NewSQL products such as TiDB and CockroachDB have also appeared recently, but database software is among the most complicated software there is, because it has to serve the usage scenarios of all kinds of applications. If history is any guide, it will take at least three years or so before the performance of these NewSQL products can be adequately evaluated, so we skip them for the time being.
Let's take a look at the reasons for the scores of various databases.
3. Relational database
The relational database management system (RDBMS) is the oldest database type. Relational databases are represented by Oracle, SQL Server, MySQL, PostgreSQL and others, and are the databases we are most familiar with. Their characteristics are:
1. Limited by a single-machine architecture, the amount of data they can process is limited, usually no more than a few TB (VOLUME score: 2)
2. Constrained by transaction handling, concurrency is not high, but responses usually arrive in milliseconds (VELOCITY score: 3)
3. The rigorous relational model cannot handle unstructured data (VARIETY score: 1)
4. Transactional support is unparalleled (ACID score: 5)
4. MPP data warehouse
MPP, short for Massively Parallel Processing database, is often used to implement enterprise data warehouse and operational data store (ODS) requirements. MPP emerged mainly to address the limited data volume that relational databases can manage. An MPP database divides the data into partitions and distributes them across horizontally scaled nodes, with a scheduling node coordinating the computation. Each time a query is executed, it is broken down into multiple subqueries that are sent to the compute nodes and run in parallel. This architecture scales capacity by adding nodes.
The data in an MPP system is sharded, and each node accesses only its local portion. This has a number of performance advantages over shared storage (such as Oracle RAC), so most MPP systems, such as Teradata, Greenplum and Vertica, use this shared-nothing architecture with direct-attached storage (DAS). In general, MPP systems have a complete and mature SQL optimizer and support mainstream SQL standards, along with features such as geographic analysis, full-text search and data mining. With the exception of Greenplum (GP), almost all MPP systems are closed source and are generally associated with words like "expensive" and "complex".
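The partition-and-parallel-subquery flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular MPP product is implemented: threads stand in for compute nodes, and the function names (partition, node_subquery, coordinator_query) and the order data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, num_nodes, key):
    """Hash-distribute rows across nodes; each node owns only its own shard."""
    shards = [[] for _ in range(num_nodes)]
    for row in rows:
        shards[hash(row[key]) % num_nodes].append(row)
    return shards

def node_subquery(shard, region):
    """Subquery run on one compute node against its local data only."""
    amounts = [r["amount"] for r in shard if r["region"] == region]
    return sum(amounts), len(amounts)

def coordinator_query(shards, region):
    """Scheduling node: fan the subqueries out in parallel, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: node_subquery(s, region), shards))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count

# Hypothetical order data: odd order IDs belong to "east", even ones to "west".
rows = [
    {"order_id": i, "region": "east" if i % 2 else "west", "amount": i * 10}
    for i in range(1, 101)
]
shards = partition(rows, num_nodes=4, key="order_id")
total, count = coordinator_query(shards, "east")
print(total, count)
```

Re-partitioning the same rows across eight shards instead of four gives the same answer with less data per node, which is the scaling property the section describes. The single coordinator_query function also hints at why the coordination node can become the bottleneck.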
MPP is theoretically infinitely horizontally scalable, but in practice it is often difficult to go beyond a hundred or so nodes because of the bottleneck at the control or coordination node, so the VOLUME score is 4 rather than a perfect score. MPP systems mainly serve analytic workloads, where concurrency is often low; they are optimized for multi-node parallel analysis rather than high concurrency, so the VELOCITY score is 2. MPP is also broadly based on the relational model and is basically as helpless with unstructured data as the RDBMS, so the VARIETY score is 1.
5. Hadoop
Next up, in chronological order, is Hadoop. Apache Hadoop is open source software released in 2007. Hadoop is based on the MapReduce and Google File System (GFS) designs that Google published, with HDFS as Hadoop's counterpart to GFS. Its greatest contribution is that it allows companies to manage large amounts of data on very inexpensive x86 servers. Until then, organizations had to purchase expensive enterprise-class storage devices to manage massive amounts of data. From this point of view, Hadoop has brought great value to the enterprise, and this really is its strength.
However, Hadoop's weaknesses also come by the basketful: security, data management, query speed, complexity and more. After 10 years of development, many of these areas now have relatively good solutions. Only query speed remains a pain point in many Hadoop deployments. The reason for the low performance lies in HDFS, the mechanism Hadoop uses to store files: HDFS does not support indexing. Suppose you want to find the pronunciation and definition of an unfamiliar character in the dictionary. Because you do not know how to pronounce it, you cannot look it up by pinyin, so you may have to rummage through the entire dictionary. Finding content in HDFS works by scanning (SCAN), which means reading the data from beginning to end until you find what you want. You can imagine how this operation performs.
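The dictionary analogy can be illustrated with a small Python sketch that contrasts a full scan with an indexed lookup. It does not use HDFS or any Hadoop API; the in-memory records list and the helper names are hypothetical stand-ins for an unindexed file and for the kind of index a database would maintain.

```python
def scan_lookup(records, word):
    """Full scan: inspect records from the first to the last until a match is found."""
    steps = 0
    for entry_word, definition in records:
        steps += 1
        if entry_word == word:
            return definition, steps
    return None, steps

def build_index(records):
    """Build a hash index that maps each word directly to its definition."""
    return {entry_word: definition for entry_word, definition in records}

# Hypothetical "dictionary" of 100,000 entries.
records = [(f"word{i:06d}", f"definition {i}") for i in range(100_000)]
index = build_index(records)

scanned, steps = scan_lookup(records, "word099999")  # touches every record
indexed = index.get("word099999")                    # a single hash lookup
print(steps, scanned == indexed)
```

The scan needs as many comparisons as there are records in the worst case, while the indexed lookup cost does not grow with the data. That gap is the query speed problem the text attributes to scan-based access.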
Hadoop's scores:
1. Scales out massively on cheap x86 servers and low-end storage, easily supporting TB to PB data volumes (VOLUME score: 5)
2. The HDFS file storage system accepts data in any format (VARIETY score: 5)
3. On performance Hadoop unceremoniously ranks last, but concurrent access is still acceptable (VELOCITY score: 2)
4. ACID transactions are even further out of reach (ACID score: 1)
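For a side-by-side view, the scores stated so far can be collected into a small Python structure. This is only a convenience sketch: the numbers are the ones given in the text, the MPP ACID score is left as None because the text does not state it, and the best_for helper is hypothetical.

```python
# Scores as stated in the text (5 = strongest, 1 = weakest).
scores = {
    "RDBMS":  {"volume": 2, "velocity": 3, "variety": 1, "acid": 5},
    "MPP":    {"volume": 4, "velocity": 2, "variety": 1, "acid": None},  # ACID score not stated in the text
    "Hadoop": {"volume": 5, "velocity": 2, "variety": 5, "acid": 1},
}

def best_for(dimension):
    """Return the product type(s) with the highest stated score on one dimension."""
    rated = {name: s[dimension] for name, s in scores.items() if s[dimension] is not None}
    top = max(rated.values())
    return sorted(name for name, value in rated.items() if value == top)

for dimension in ("volume", "velocity", "variety", "acid"):
    print(dimension, best_for(dimension))
```

No product type leads on every dimension, which is the point of the comparison.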