The rapid growth of data volumes has created significant challenges for traditional centralized databases.
In contrast with standalone databases, distributed databases can extend mission-critical functions, including transaction management, data storage, and data query, to multiple nodes or even multiple regions. There are three main types of distributed database deployments:
1. Distributed middleware + standalone database
Such a distributed database system consists of two layers:
a) The upper layer is distributed middleware. It provides a set of unified sharding rules and can parse SQL statements, forward requests, and combine results.
b) The bottom layer consists of open-source MySQL or PostgreSQL standalone databases, providing data storage and execution capabilities.
This type of deployment uses a mature kernel to address scalability issues. It has a strong ecosystem, is low-cost, and is easy to use.
However, there are obvious shortcomings too. Its functionality, global transaction processing, and high availability capabilities need to be enhanced. Deployment is complex and requires redundant devices. Most importantly, because an open-source database kernel is used, the database is always exposed to risks related to open-source code changes, patents, and release policies. This solution is therefore not ideal for financial, government, and enterprise customers.
2. Distributed databases based on distributed storage
This type of deployment is based on distributed storage. Most public cloud databases use this type of architecture. Huawei Cloud TaurusDB is a typical example.
This type of deployment enables databases to be reasonably scalable. Data consistency mainly depends on the distributed storage engine. The upper-layer compute nodes are stateless, and the shared storage provides cross-node reads and writes. This architecture makes full use of the advanced features provided by distributed storage, making it easier to establish a competitive edge, but the scalability of this architecture has its limits, especially for write nodes.
This architecture depends heavily on distributed storage and is expensive to deploy on premises.
3. Native distributed databases
Native distributed databases are designed based on distributed database theories, such as distributed consistency protocols. They combine distributed storage, transactions, and compute. The system automatically shards data and stores multiple replicas while ensuring consistency among these replicas and transactions through a consistency protocol.
This architecture makes it easier to leverage the advantages of the database itself in terms of performance, complex SQL processing, and enterprise-grade capabilities. A cluster can be scaled as needed, and the scaling is transparent to applications. Data consistency is protected by transaction-layer consistency protocols for improved security. Thanks to flexible deployment and a multi-active architecture, there are few hardware dependencies, so servers can be easily added to a cluster to ensure high availability.
Customers in the finance and government sectors typically have experience with sharding and distributed middleware products, so the barrier to adopting this solution is low, and it is quite popular among them.
Huawei Cloud GaussDB is a typical example of this solution. Backed by Huawei's more than 20 years of strategic investment into databases, GaussDB has been able to incorporate extensive practical experience from the finance sector. It is a reliable choice for digital transformation, migration of core data to the cloud, and for distributed reconstruction.
Native distributed databases are a type of distributed database transparent to user applications. However, using distributed relational databases involves some challenges:
1. Security and trustworthiness
The complexity of distributed and cloud environments can lead to increased security risks, for example, data leakage and loss. Also, in these environments, it is hard to manage identity authentication, access control, data transmission, and storage security.
2. Correctness and performance of transaction systems
In a distributed database cluster, multiple nodes often work together on a single operation. A unified solution is needed to maintain atomicity, consistency, isolation, and durability (ACID) across the entire cluster and avoid transaction failures.
In addition, the transaction manager can become a performance bottleneck under heavy concurrency, for example when allocating unique transaction identifiers, obtaining global transaction snapshots, or handling the intensive network communication and lock wait events caused by frequent interactions.
3. Distributed query
A distributed system must return accurate query results as quickly as possible, which places higher demands on query performance.
4. High availability
A distributed database system has to ensure service continuity even when there are exceptions, such as hardware faults or system bugs.
Challenges and key technologies when using distributed databases
GaussDB, a distributed database, provides a wide array of features delivering high performance, high availability, and robust security to address the challenges described above. Some highlights of GaussDB are described below:
In traditional encryption, data is encrypted on the server, which means key administrators still have access. GaussDB is a fully-encrypted database: users keep their encryption keys in their own hands, and encryption and decryption are performed only on the client. Data is encrypted throughout its full lifecycle, from storage and transmission to query, so even DBAs are blocked from unauthorized access.
Fully-encrypted database
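To make the client-side model above concrete, here is a minimal sketch, assuming a symmetric key held only by the application and using the third-party Python "cryptography" package purely for illustration. GaussDB's real driver, key management, and encrypted query execution are far more sophisticated than this.

```python
# Minimal sketch of client-side ("always encrypted") column handling.
# The key never leaves the client; the server only ever stores ciphertext.
from cryptography.fernet import Fernet

client_key = Fernet.generate_key()      # held only by the data owner
cipher = Fernet(client_key)

def encrypt_param(value: str) -> bytes:
    """Encrypt a sensitive parameter before it is sent in an INSERT/SELECT."""
    return cipher.encrypt(value.encode())

def decrypt_result(ciphertext: bytes) -> str:
    """Decrypt a value returned by the server; only the client can do this."""
    return cipher.decrypt(ciphertext).decode()

# The server (and its DBAs) would only ever see something like this:
stored = encrypt_param("4111-1111-1111-1111")
print(stored)                 # opaque ciphertext
print(decrypt_result(stored)) # plaintext recovered on the client only
```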
As shown in the figure below, GaussDB does not use the traditional transaction list management approach. Instead, it compares Commit Sequence Numbers (CSNs) to determine transaction visibility.
How GTM-Lite works
When a transaction starts, a CSN is obtained from GTM-Lite based on the transaction isolation level and used as the snapshot point for the query statements executed in the transaction. (For a Repeatable Read transaction, a CSN only needs to be obtained once, when the transaction starts. For a Read Committed transaction, a CSN is obtained each time a SELECT statement is executed.)
When a transaction is committed, a new CSN is requested from GTM-Lite and is recorded.
GTM-Lite uses CSNs to quickly determine which transactions are visible to a given transaction without traversing the transaction list, which would consume lots of compute resources. It provides CSNs through lock-free atomic operations, eliminating lock waits. Only one CSN is needed for transaction interaction between nodes, so the network overhead is independent of the transaction scale. GTM-Lite offers high-performance transaction processing and ensures strong global consistency, eliminating the performance bottlenecks associated with a single GTM.
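As a rough illustration of the CSN comparison described above, here is a minimal sketch in Python. The counter, the Tuple class, and the function names are illustrative assumptions, not GaussDB internals; the point is that visibility becomes a single numeric comparison rather than a transaction-list scan.

```python
# Minimal sketch of CSN-based visibility, assuming a single global counter
# standing in for GTM-Lite's lock-free CSN allocator.
import itertools

_csn_counter = itertools.count(start=1)   # stands in for an atomic fetch-and-add

def next_csn() -> int:
    return next(_csn_counter)

class Tuple:
    def __init__(self, value, commit_csn=None):
        self.value = value
        self.commit_csn = commit_csn      # None until the writing txn commits

def begin_snapshot() -> int:
    """Repeatable Read: call once per transaction.
    Read Committed: call before every SELECT."""
    return next_csn()

def visible(tup: Tuple, snapshot_csn: int) -> bool:
    """A row version is visible if its writer committed before the snapshot
    was taken -- a single comparison, no transaction-list traversal."""
    return tup.commit_csn is not None and tup.commit_csn <= snapshot_csn

# Example: a row committed at CSN 5 is visible to a snapshot taken at CSN 8,
# but not to one taken at CSN 3.
row = Tuple("balance=100", commit_csn=5)
print(visible(row, 8), visible(row, 3))   # True False
```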
1. Distributed execution
How does GaussDB process the SQL statements of an application in a distributed database cluster?
1) SQL statements of an application are sent to a CN.
2) The CN uses a database optimizer to generate a distributed execution plan. All DNs process data based on this plan.
3) Data is distributed across the DNs using consistent hashing (see the sketch below). A DN may need to obtain data from other DNs during processing. GaussDB provides three types of streams (Broadcast, Gather, and Redistribution) to transfer data between DNs.
4) The DNs return result sets to the CN.
5) The CN sends the results to the application.
Process of a distributed query
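Step 3 above mentions consistent hashing. The following minimal sketch shows one common way a coordinator could map a distribution key to a DN on a hash ring; the DN names, virtual-node count, and hash function are assumptions for illustration, not GaussDB's actual placement logic.

```python
# Minimal sketch of hash-based row placement across DNs using a hash ring.
import bisect, hashlib

class HashRing:
    def __init__(self, dns, vnodes=16):
        self._ring = []                              # sorted (hash, dn) points
        for dn in dns:
            for v in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{dn}#{v}"), dn))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def dn_for(self, distribution_key: str) -> str:
        """Pick the first DN clockwise from the key's position on the ring."""
        h = self._hash(distribution_key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["DN1", "DN2", "DN3"])
for order_id in ("1001", "1002", "1003"):
    print(order_id, "->", ring.dn_for(order_id))
```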
Let's take a closer look at how data is transferred between nodes, using the execution logic of the SQL statement shown in the following figure as an example.
SQL statement execution logic
Take two DNs as an example. During execution, data is sent to different DNs based on a redistribution key.
After receiving the joined data of tables C and D, the Redistribute operator uses the redistribution key to calculate whether to send the data to DN1 or DN2. After collecting the redistributed data, the Redistribute Collector passes it to the upper-layer Join operator for the join calculation.
Data transfer between CNs and DNs
In addition, the GaussDB optimizer selects the Stream operator that delivers the best possible SQL performance, based on the available statistics, to transfer data between CNs and DNs.
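The redistribution step can be sketched as follows, assuming two DNs and a simple modulo hash (both are illustrative choices, not GaussDB internals): each DN buckets its locally produced rows by the hash of the redistribution key, and the Stream operator ships each bucket to the DN that owns it, so matching rows meet on the same node.

```python
# Minimal sketch of the Redistribute stream described above.
NUM_DNS = 2

def owner_dn(join_key) -> int:
    return hash(join_key) % NUM_DNS          # which DN should receive this row

def redistribute(local_rows, key_col):
    """Split locally produced rows (e.g. the C JOIN D result) into one batch
    per target DN; the Stream operator would then send each batch over the
    network to its Redistribute Collector."""
    batches = {dn: [] for dn in range(NUM_DNS)}
    for row in local_rows:
        batches[owner_dn(row[key_col])].append(row)
    return batches

rows_on_dn1 = [{"id": 1, "cd": "c1d1"}, {"id": 2, "cd": "c2d2"}]
print(redistribute(rows_on_dn1, "id"))
```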
2. A fully parallel architecture
GaussDB uses a fully parallel architecture, covering node-level parallelism (MPP), thread-level parallelism (SMP), instruction-level parallelism (SIMD), and LLVM-based code generation (CodeGen), to make full use of compute resources and take query performance to the next level.
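As a small illustration of intra-query (SMP-style) parallelism, the sketch below splits a scan-and-aggregate across worker threads and merges the partial results. It is a simplified stand-in for what a parallel executor does, not GaussDB code.

```python
# Minimal sketch of SMP-style intra-query parallelism: a scan + aggregate is
# split into chunks processed by worker threads and the partial sums are merged.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(row["amount"] for row in chunk)

def parallel_aggregate(rows, workers=4):
    chunk_size = max(1, len(rows) // workers)
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

table = [{"amount": i} for i in range(1_000)]
print(parallel_aggregate(table))   # 499500
```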
1. Redo logs
Redo logs can be used in the following scenarios to improve system availability:
1) If a database suffers from a severe fault that leads to a breakdown, you can use redo logs to restore data.
2) In an HA architecture, primary and standby nodes synchronize data using redo logs.
3) During backup and restoration, point-in-time recovery (PITR) is enabled by archiving redo logs.
GaussDB uses write-ahead logging (WAL) to provide redo logs. Updates follow a no-force-at-commit policy: when a transaction commits, the modified data pages are not forced to disk. This lets GaussDB improve availability without compromising performance. To ensure that data can be restored after a fault, WAL writes consecutive, sequential log entries first and delays the writes of the random, scattered data blocks. Deferring these scattered writes and batching them later improves overall performance.
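The no-force-at-commit idea can be sketched as follows: the commit path appends sequential redo records, while the scattered dirty pages are flushed later in a batch. This is a minimal illustration with made-up names, not GaussDB's storage engine.

```python
# Minimal sketch of write-ahead logging with a no-force-at-commit policy:
# the commit only waits for the sequential log append, while the modified
# (dirty) pages are written back later in batches.
class MiniWAL:
    def __init__(self):
        self.log = []            # sequential redo records (would be fsynced)
        self.dirty_pages = {}    # page_id -> new contents, flushed lazily

    def update(self, txn_id, page_id, new_value):
        self.log.append(("UPDATE", txn_id, page_id, new_value))  # WAL first
        self.dirty_pages[page_id] = new_value                    # in memory only

    def commit(self, txn_id):
        self.log.append(("COMMIT", txn_id))  # durable once this record is on disk
        # note: dirty pages are NOT forced out here

    def checkpoint(self):
        """Batch-write the scattered dirty pages sequentially later on."""
        flushed = dict(self.dirty_pages)
        self.dirty_pages.clear()
        return flushed

wal = MiniWAL()
wal.update(txn_id=7, page_id=42, new_value="balance=900")
wal.commit(txn_id=7)
print(wal.log)          # redo records are enough to replay the change after a crash
print(wal.checkpoint())
```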
2. Distributed deployment
To ensure system stability and reliability, GaussDB supports multiple HA deployment architectures. Two examples are described here:
1) Two-city three-DC geo-redundancy
Two active-active data centers are deployed in the same city to run services, while a remote data center (DC) is deployed for disaster recovery. High availability can be achieved across nodes, AZs, and data centers in the event of faults. Plus, cross-city remote disaster recovery (DR) capabilities are supported.
HA deployment of three DCs in two cities (regions)
2) Intra-city three-AZ HA + remote DR
A 3-replica cluster is deployed across three logical AZs in a city (Region 1), and another 3-replica cluster is deployed in a single AZ in another city (Region 2). This solution provides node- and AZ-level DR within the same region as well as cross-region remote DR.
Intra-city three-AZ HA + remote DR
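For illustration only, the replica layout described above might be expressed as plain data like the sketch below; the region and AZ names, replica counts, and quorum rule are assumptions, not a real GaussDB configuration format.

```python
# Minimal sketch of the intra-city three-AZ + remote DR topology.
topology = {
    "region_1": {                      # primary city: 3 replicas across 3 AZs
        "az1": {"replicas": 1, "role": "primary"},
        "az2": {"replicas": 1, "role": "standby"},
        "az3": {"replicas": 1, "role": "standby"},
    },
    "region_2": {                      # remote city: DR cluster in a single AZ
        "az1": {"replicas": 3, "role": "disaster_recovery"},
    },
}

def survives(failed_azs, region="region_1"):
    """Service continues while a quorum (2 of 3) of region_1 replicas remains."""
    healthy = sum(az["replicas"] for name, az in topology[region].items()
                  if name not in failed_azs)
    return healthy >= 2

print(survives({"az2"}))          # True: one AZ failure is tolerated
print(survives({"az2", "az3"}))   # False: fail over to region_2 instead
```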
How Distributed Database Technologies Will Evolve in the Future
With the advent of new technologies, such as complete resource pooling, new networking architectures, and foundation models, and driven by new requirements and use cases, distributed database technologies are expected to evolve in a number of ways.
How distributed database technologies will evolve in the future
Customers in the government and financial sectors, especially those who need multi-region and multi-center disaster recovery, prioritize high availability. To suit the needs of these customers, Huawei Cloud GaussDB provides a broad range of solutions, including intra-city active-active, remote DR, and two-city three-DC geo-redundancy solutions, an enhanced intra-city active-active synchronization solution, and asynchronous data replication and multi-region multi-active HA solutions.
In the future, distributed databases will use multi-active architectures to support global deployments.
Hardware and software are complementary. Advanced hardware, such as GPUs, FPGAs, and high-speed networks, combined with Huawei's full-stack software and hardware capabilities in chips, servers, storage, networking, operating systems, and databases, can significantly enhance performance and improve availability. Huawei has been stepping up efforts to promote hardware-software synergy in the following ways:
First, the persistence logic of databases is integrated into the technical foundation for decoupled storage and compute. Distributed databases support large capacities, deliver significantly improved scalability, and provide a consistent experience for customers.
Second, computing tasks are pushed down from compute nodes to storage nodes, and parallel processing is enabled for complex queries. In this way, the computing logic can make full use of the underlying storage pools while still remaining transparent to applications.
Last, distributed databases require excellent performance. Enabling I/O aggregation can boost performance, but network quality is also crucial as transaction speed relies heavily on network and processing latency. Huawei is always striving to minimize network latency in distributed databases.
Hardware-software synergy delivers enterprise-grade reliability, in addition to outstanding performance and scalability.
An enterprise-grade HTAP architecture combines OLTP and OLAP workloads. With this architecture, distributed databases can better support complex analytical queries while still handling high-concurrency transaction requests with ease. This helps enterprises slash costs and improve decision-making efficiency.
Core HTAP technologies:
1. Transparent routing: Row-store engines, column-store engines, or row-column combinations are selected automatically for accurate, real-time queries. This is user-friendly and increases the commercial value of HTAP products.
2. Performance improvement: OLTP workloads prioritize low latency and high throughput, while OLAP workloads are optimized for complex queries. By combining the two, enterprise-grade HTAP not only provides general execution optimization technologies, such as parallel execution, compiled execution, and vectorized execution, but also speeds up complex queries.
3. Data freshness: An HTAP architecture can meet diverse customer needs by offering high data freshness and exceptional performance.
4. Resource isolation: To ensure both high-performance OLTP and real-time OLAP, HTAP strikes a balance between resource isolation, data freshness, and performance.
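As a toy illustration of transparent routing (item 1 in the list above), the sketch below classifies statements with a deliberately naive keyword rule and routes them to a row or column engine; a real HTAP optimizer relies on cost models and statistics rather than string matching.

```python
# Minimal sketch of the "transparent routing" idea: short, point-lookup style
# statements go to the row store, analytical scans go to the column store.
ANALYTIC_HINTS = ("GROUP BY", "SUM(", "AVG(", "COUNT(", "JOIN")

def route(sql: str) -> str:
    s = sql.upper()
    if any(hint in s for hint in ANALYTIC_HINTS):
        return "column_store"     # OLAP-style query: vectorized, column engine
    return "row_store"            # OLTP-style query: low-latency row engine

print(route("SELECT balance FROM account WHERE id = 42"))
print(route("SELECT region, SUM(amount) FROM orders GROUP BY region"))
```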
A cloud-native multi-primary architecture is a step up from a traditional single-active architecture. It can be used to:
1. Minimize service downtime and ensure high availability.
2. Improve performance and scalability by leveraging hardware-software synergy, for instance, by using computing pushdown combined with parallel processing.
Security, compliance, and privacy are now primary concerns for countries, organizations, and individuals alike. These requirements have catalyzed technological advancement. In the future, customers in every industry will have to meet increasingly strict requirements on trustworthiness and security.
Fully-encrypted processing is a key database technology developed by Huawei Cloud to protect privacy. Data, especially sensitive data, is encrypted and secured throughout its lifecycle. Attackers cannot steal critical information from the database server regardless of the state the data is in. Full-lifecycle privacy and security of enterprise data are guaranteed.
Machine learning (ML) has been widely used to enhance data management. It is used for data cleansing, data analytics, query rewriting, and database diagnostics. However, traditional ML algorithms cannot address generalization and inference problems. This is where foundation models come in. They play a crucial role in enabling intelligent data management.
Powered by AI and foundation models, distributed databases are evolving towards end-to-end autonomous, intelligent databases, covering everything from consulting and development to O&M.
1. Consulting: expert-level assistance for custom solutions
2. Development: assistance in improving SQL development efficiency
3. O&M: predictive maintenance for higher reliability
Distributed databases have exceptional performance, high availability, excellent scalability, and high cost-effectiveness. They have become quite popular in the financial sector, where there are some of the strictest database requirements. Such databases will see more widespread adoption in other sectors as well. However, distributed databases are still in the early stages of development and have lots more potential to unlock.