
11 posts tagged with "Data insights"


Intercontinental Data Sync - A Comparative Study for Performance Tuning

· 5 min read
John Li
Chief Executive Officer

When it comes to moving data across vast distances, particularly between continents, businesses often face a range of challenges that can impact performance. At BladePipe, we regularly help enterprises tackle these hurdles. The most common question we receive is: What’s the best way to deploy BladePipe for optimal performance?

While we can offer general advice based on our experience, the reality is that these tasks come with many variables. This article explores the best practice for intercontinental data migration and sync, blending theory with hands-on insights from real-world experiments.

Challenges of Intercontinental Data Sync

Intercontinental data migration is no easy feat. There are two primary challenges that stand in the way of fast and reliable data transfers:

  • Unavoidable network latency: For instance, network latency between Singapore and the U.S. typically ranges from 150 ms to 300 ms, significantly higher than the sub-5 ms latency of typical relational database INSERT/UPDATE operations (see the rough estimate after this list).

  • Complex factors affecting network quality: Factors such as packet loss and routing paths can degrade the performance of intercontinental data transfers. Unlike intranet communication, intercontinental transfers pass through multiple layers of switches and routers in data centers and backbone networks.
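
A rough back-of-the-envelope estimate shows why round-trip latency, rather than database speed, dominates intercontinental writes. The numbers below (200 ms round trip, 5 ms per statement, a 1,000-row batch taking about 50 ms on the server) are assumptions for illustration, not measurements:

```python
# Why round-trip latency dominates intercontinental writes, and why batching matters.
rtt_s = 0.200    # assumed intercontinental round-trip time: 200 ms
stmt_s = 0.005   # assumed per-statement execution time on the database: 5 ms

# Synchronous single-row writes: one network round trip per row.
rows_per_sec_single = 1 / (rtt_s + stmt_s)              # ~5 rows/s

# Batched writes: 1,000 rows per round trip; assume the batch takes ~50 ms server-side.
batch = 1000
batch_exec_s = 0.050
rows_per_sec_batched = batch / (rtt_s + batch_exec_s)   # ~4,000 rows/s

print(f"single-row: {rows_per_sec_single:.0f} rows/s, batched: {rows_per_sec_batched:.0f} rows/s")
```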

Beyond these, it’s critical to consider the load on both the source and target databases, network bandwidth, and the volume of data being transferred.

When using BladePipe, understanding its data extraction and writing mechanisms is essential to determine the best deployment strategy.

BladePipe Migration & Sync Techniques

Data Migration Techniques

For relational databases, BladePipe uses JDBC-based data scanning, with support for resumable migration using techniques like pagination. Additionally, it supports parallel data migration—both inter-table and intra-table parallelism (via multiple tasks with specific filters).
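
As an illustration of the resumable, pagination-based scanning described above, here is a minimal keyset-pagination sketch. It is not BladePipe's implementation; the DB-API cursor, table name, and the assumption that the primary key is the first selected column are all illustrative:

```python
def scan_table(cursor, table, pk, page_size=2000):
    """Scan a table in primary-key order, one page per query, so the scan
    can resume from the last seen key after an interruption (keyset pagination)."""
    last_pk = None
    while True:
        if last_pk is None:
            cursor.execute(f"SELECT * FROM {table} ORDER BY {pk} LIMIT %s", (page_size,))
        else:
            cursor.execute(
                f"SELECT * FROM {table} WHERE {pk} > %s ORDER BY {pk} LIMIT %s",
                (last_pk, page_size),
            )
        rows = cursor.fetchall()
        if not rows:
            break
        yield rows              # hand one page to the writer
        last_pk = rows[-1][0]   # checkpoint: resume after this key

# Usage (illustrative):
# for page in scan_table(cursor, "orders", "id"):
#     write_page(page)
```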

On the target side, since all data is inserted via INSERT operations, BladePipe uses several batch writing techniques:

  • Batching
  • Splitting and parallel writing
  • Bulk inserts
  • INSERT rewriting (e.g., converting multiple single-row inserts into one INSERT ... VALUES (),(),() statement; see the sketch below)
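
Below is a minimal sketch of the INSERT-rewriting idea: many single-row inserts are collapsed into one parameterized multi-row statement. This is an illustration, not BladePipe's code; the table and column names are made up:

```python
def rewrite_insert(table, columns, rows):
    """Rewrite many single-row inserts into one multi-row
    INSERT ... VALUES (...),(...),(...) statement (parameterized)."""
    placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([placeholders] * len(rows))
    )
    params = [value for row in rows for value in row]  # flatten row values
    return sql, params

# Usage with any DB-API driver that uses %s placeholders (names are illustrative):
# sql, params = rewrite_insert("orders", ["id", "amount"], [(1, 9.9), (2, 19.9)])
# cursor.execute(sql, params)
```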

Data Sync Techniques

BladePipe supports different methods for capturing incremental changes depending on the source database. Here’s a quick look:

| Source Database | Incremental Capture Method |
|---|---|
| MySQL | Binlog parsing |
| PostgreSQL | Logical WAL subscription |
| Oracle | LogMiner parsing |
| SQL Server | SQL Server CDC table scan |
| MongoDB | Oplog scan / ChangeStream |
| Redis | PSYNC command |
| SAP HANA | Trigger |
| Kafka | Message subscription |
| StarRocks | Periodic incremental scan |
| ... | ... |

These methods largely rely on the source database to emit incremental changes, which can vary based on network conditions.

On the target side, data sync is more demanding than migration: more operations (INSERT/UPDATE/DELETE) need to be handled while order consistency must be preserved. BladePipe offers a variety of techniques to improve data sync performance:

| Optimization | Description |
|---|---|
| Batching | Reduce network overhead and help with merge performance |
| Partitioning by unique key | Ensure data order consistency |
| Partitioning by table | Looser method when unique key changes occur |
| Multi-statement execution | Reduce network latency by concatenating SQL |
| Bulk load | For data sources with full-image and upsert capabilities, INSERT/UPDATE operations are converted into INSERT for batch overwriting |
| Distributed tasks | Allow parallel writes of the same amount of data using multiple tasks |
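
To make the "Partitioning by unique key" idea concrete, here is a simplified sketch: every change event is routed to a partition by hashing its unique key, so changes to the same row stay in order while different rows can be written in parallel. The event shape and names are illustrative, not BladePipe internals:

```python
from collections import defaultdict

def partition_events(events, parallelism):
    """Assign each change event to a partition by hashing its unique key,
    so updates to the same row keep their order within one partition."""
    partitions = defaultdict(list)
    for event in events:
        # event is assumed to look like {"table": ..., "pk": ..., "op": ..., "row": ...}
        slot = hash((event["table"], event["pk"])) % parallelism
        partitions[slot].append(event)
    return partitions

events = [
    {"table": "orders", "pk": 1, "op": "INSERT", "row": {"id": 1}},
    {"table": "orders", "pk": 1, "op": "UPDATE", "row": {"id": 1}},
    {"table": "orders", "pk": 2, "op": "INSERT", "row": {"id": 2}},
]
for slot, batch in partition_events(events, parallelism=4).items():
    print(slot, [e["op"] for e in batch])  # each partition can be written by its own worker
```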

Exploring the Best Practice

BladePipe’s design emphasizes performance optimizations on the target side, which are more controllable. Typically, we recommend deploying BladePipe near the source data source to mitigate the impact of network quality on data extraction.

But does this theory hold up in practice? To test this, we conducted an intercontinental MySQL-to-MySQL migration and sync experiment.

Experimental Setup

Resources:

  • Source MySQL: located in Singapore (4 cores, 8GB RAM)
  • Target MySQL: located in Silicon Valley, USA (4 cores, 8GB RAM)
  • BladePipe: deployed on VMs in both Singapore and Silicon Valley (8 cores, 16GB RAM)

Test Plan: We migrated and synchronized the same data twice to compare performance with BladePipe deployed in different locations.

Process

  1. Generate 1.3 million rows of data in Singapore MySQL.
  2. Use BladePipe deployed in Singapore to migrate data to the U.S. and record performance.
  3. Make data changes (INSERT/UPDATE) in Singapore MySQL and record sync performance.
  4. Stop the DataJob and delete target data.
  5. Use BladePipe deployed in the U.S. to migrate the data again from Singapore MySQL and record performance.
  6. Make data changes in Singapore MySQL again and record sync performance.

Results & Analysis

| Deployment Location | Task Type | Performance |
|---|---|---|
| Source (Singapore) | Migration | 6.5k records/sec |
| Target (Silicon Valley) | Migration | 15k records/sec |
| Source (Singapore) | Sync | 8k records/sec |
| Target (Silicon Valley) | Sync | 32k records/sec |

Surprisingly, deploying BladePipe at the target (Silicon Valley) significantly outperformed the source-side deployment.

Potential Reasons:

  • Network policies and bandwidth differences between the two locations.
  • Target-side batch writes are less affected by poor network conditions compared to binlog/logical scanning on the source side.
  • Other unpredictable network variables.

Recommendations

While the experiment offers valuable insights into intercontinental data migration and sync, real-world environments can differ:

  • Production databases may be under heavy load, impacting the ability to push incremental changes efficiently.
  • Dedicated network lines may offer more consistent network quality.
  • Gateway rules and security policies vary across data centers, affecting performance.

Our recommendation: During the POC phase, deploy BladePipe on both the source and target sides, compare performance, and choose the best deployment strategy based on real-world results.

Data Masking in Real-time Replication

· 6 min read
John Li
Chief Executive Officer

In today’s data-driven world, keeping sensitive information safe is more important than ever. That’s where data masking comes in. It hides or replaces private data so teams can work freely without risking exposure. In this blog, we’ll dive into data masking—what it is, when to use it, and how modern tools make it easy to mask your data as you move it.

What is Data Masking?

When moving or syncing data, especially personally identifiable information (PII), data masking is a key step. It keeps your data safe, private, and compliant—especially when you're migrating, testing, or sharing data. Any time sensitive data is being transferred, data masking should be part of the plan. It helps prevent leaks and protects your business.

There are two main types of data masking: static and dynamic.

Static data masking means masking data in bulk. It creates a new dataset where sensitive information is hidden or replaced. This masked data is safe to use in non-production environments like development, testing, or analytics.

Dynamic data masking happens in real-time. It shows different data to different users based on their roles or permissions. It is usually used in live production systems.

In this blog, we'll focus on static data masking, and how to statically mask data in data replication.

Use Cases

Data masking is useful in many situations where there’s a risk of data breach. It’s especially important when people from different departments—or even outside the organization—need to access the data. Masking keeps private information safe and secure.

Once data is statically masked and separated from the live production system, teams of different departments can use it freely—read it, write it, test with it—without risking the real data. Here are some common use cases for static data masking:

  • Software development and testing: Developers often need real data to test new features or troubleshoot bugs. But dev environments usually aren’t as secure as production environments. Static masking hides the sensitive parts of the data, so developers can work safely without seeing private info.

  • Scientific research: Researchers need lots of real-world data to get meaningful results. But using raw data with personal or sensitive info is not compliant with privacy laws. With data masking, researchers get access to realistic data, just without the sensitive details, keeping things both useful and compliant.

  • Data sharing: Businesses often need to share data with partners or third-party vendors. Sharing raw data is risky because of the potential for a data breach. Masking it first removes that risk. Partners get the insights they need, but none of the sensitive stuff. It’s a win-win for privacy and collaboration.

Common Static Data Masking Techniques

There are several ways to apply static data masking. Each method helps hide sensitive information.

| Masking Type | How It Works | Example |
|---|---|---|
| Substitution | Replace real data with fake but seemingly realistic values | Rose → Monica |
| Shuffling | Mix up the order of characters or fields | 12345 → 54123 |
| Encryption | Use algorithms like AES or RSA to encrypt the data | 123456 → Xy1#Rt |
| Masking | Hide part of the data with asterisks | 13812345678 → 138****5678 |
| Truncation | Keep only part of the original data | 622712345678 → 6227 |

Data Masking in Real-time Replication

In the use cases mentioned above, we often need both data migration/syncing and data masking. The best approach? Mask the data during the sync process itself. That way, teams get masked data right away—no need for extra tools. It’s faster, simpler, and safer. Plus, it lowers the risk of leaks and helps you stay compliant.

BladePipe, a professional end-to-end data replication tool, makes this easy. It supports data transformation during sync. Before, users had to write custom code to do masking while syncing, which is not ideal for non-developers. Now, with BladePipe’s new scripting support, masking can be done with built-in scripts. You can set masking rules for specific fields. When the data sync task runs, it automatically calls the script and applies the transformation. That means: “Sync and mask data at the same time.”

This works for full data migration, incremental sync, data verification and correction.

BladePipe now supports built-in masking rules, including masking and truncation. You can mask your data in several flexible ways (a sketch of these rules follows the list):

  • Keep only the part after a certain character
  • Keep only the part before a certain character
  • Mask the part after a certain character
  • Mask the part before a certain character
  • Mask a specific part of the string
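
As a rough illustration of rules like these (not BladePipe's built-in scripts), the sketch below implements three of them in plain Python:

```python
def mask_after(value, ch, mask="*"):
    """Mask everything after the first occurrence of `ch`."""
    head, sep, tail = value.partition(ch)
    return head + sep + mask * len(tail) if sep else value

def keep_before(value, ch):
    """Keep only the part before the first occurrence of `ch`."""
    return value.split(ch, 1)[0]

def mask_range(value, start, end, mask="*"):
    """Mask a specific part of the string by index range."""
    return value[:start] + mask * (end - start) + value[end:]

print(mask_after("john.li@example.com", "@"))   # local part kept, domain masked
print(keep_before("john.li@example.com", "@"))  # john.li
print(mask_range("13812345678", 3, 7))          # 138****5678
```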

Procedure

Here we show how to mask data in real time while replicating data from MySQL to MySQL.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Step 2: Add DataSources

  1. Log in to the BladePipe Cloud.
  2. Click DataSource > Add DataSource.
  3. Select the source and target DataSource type, and fill out the setup form respectively.

Step 3: Create a DataJob

  1. Click DataJob > Create DataJob.
  2. Select the source and target DataSources.
  3. Select Incremental for DataJob Type, together with the Full Data option.
  4. Select the tables to be replicated.
  5. In the Data Processing step, select the table on the left side of the page and click Operation > Data Transform.
  6. Select the column(s) that need data transformation, and click the icon next to Expression on the right side of the dialog box. Select the data transformation script in the pop-up dialog box, and click it to automatically copy the script.
  7. Paste the copied script into the Expression input box, and replace col in @params['col'] of the script with the corresponding column name.
  8. In the Test Value input box, enter a test value and click Test. Then you can view how the data is masked.
  9. Confirm the DataJob creation.
  10. Now the DataJob is created and started. The selected data is being masked in real time when moving to the target instance.

Wrapping Up

Data masking isn’t just a checkbox for compliance—it’s a smart move to protect your business and your users. Especially when working with real data in non-production environments or sharing it with others, static data masking gives you the safety net you need without slowing things down.

By integrating data masking directly into the data migration and sync process, tools like BladePipe make it easier than ever. No more juggling extra tools or writing custom code. You get clean, safe, ready-to-use data—all in one smooth step.

Whether you're testing, analyzing, or sharing data, masking should be part of your workflow. And now, it’s finally simple enough for everyone to use.

Real-Time Data Sync: 4 Questions We Get All the Time

· 5 min read
John Li
Chief Executive Officer

We work closely with teams building real-time systems, migrating databases, or bridging heterogeneous data platforms. Along the way, we hear a lot of recurring questions. So we figured—why not write them down?

This is Part 1 of a practical Q&A series on real-time data sync. In this post, I'd like to share thoughts on the following questions:

How should I choose between official and third-party tools?

Mature database vendors typically provide their own tools for data migration or cold/hot backup, like Oracle GoldenGate or MySQL's built-in dump utilities.

Official tools often deliver:

  • The best possible performance for the migration and sync of that database.
  • Compatibility with obscure engine-specific features.
  • Support for special cases that third-party tools often cannot handle (e.g., Oracle GoldenGate parsing Redo logs).

But they also tend to:

  • Offer limited or no support for other databases.
  • Be less flexible for niche or custom workflows.
  • Lock you in, making data exit harder than data entry.

Third-party tools shine when:

  • You're syncing across platforms (e.g. MySQL > Kafka/Iceberg/Elasticsearch).
  • You need advanced features like filtering and transformation.
  • The official tool simply doesn't support your use case.

In short:

  • If it’s homogeneous migration or backup, use the official tool.
  • If it’s heterogeneous sync or anything custom, go with a third-party tool.

Can my project rely on “real-time” sync latency?

In short: any data sync process that doesn't guarantee distributed transaction consistency comes with some latency risk. Even distributed transactions come at a cost—usually via redundant replication and sacrificing write performance or availability.

Latency typically falls into two categories: fault-induced latency and business-induced latency.

Fault-induced Latency:

  • Issues with the sync tool itself, such as memory limits or bugs.
  • Source/target database failures—data can't be pulled or written properly.
  • Constraint conflicts on the target side, leading to write errors.
  • Incomplete schema on the target side causing insert failures.

Business-induced Latency:

  • Bulk data imports or data corrections on the source side.
  • Traffic spikes during business peaks exceeding the tool’s processing capacity.

You can reduce the chances of delays (via task tuning, schema change rule setting, and database resource planning), but you’ll never fully eliminate them. So the real question becomes:

Do you have a fallback plan (e.g. graceful degradation) when latency hits?

That would significantly mitigate the risks brought by high latency.

What does real-time data sync mean to my project?

Two words: incremental + real-time.

Unlike traditional batch-based ETL, a good real-time sync tool:

  • Captures only what changes, saving massive bandwidth.
  • Delivers changes within seconds, enabling use cases like fraud detection or live analytics.
  • Preserves deletes and DDLs, whereas traditional ETL often relies on external metadata services.

Think of it like this: You don’t want to re-copy 1 billion rows every night when only 100 changed. Real-time sync gives you the speed and precision needed to power fast, reliable data products.

And with modern architectures—where one DB handles transactions, another serves queries, and a third powers ML—real-time sync is the glue holding it all together.

How do I keep pipeline stability and data integrity over time?

Most stability issues come from three factors: schema changes, traffic pattern shifts, and network environment issues. Mitigating or planning for these risks greatly improves stability.

Schema Changes:

  • Incompatibilities between schema change methods (e.g., native DDL, online tools like pt-osc or gh-ost) and the sync tool’s capabilities.
  • Uncoordinated changes to target schemas may cause errors or schema misalignment.
  • Changes on the target side (e.g., schema changes or writes) may conflict with sync logic, causing inconsistency between the source and target schemas or constraint conflicts.

Traffic Shifts:

  • Business surges causing unexpected peak loads that outstrip the sync tool’s capacity, leading to memory exhaustion or lag.
  • Ops activities like mass data corrections causing large data volumes and sync bottlenecks.

Network Environment:

  • Missing database whitelisting for sync nodes. Sync tasks may fail due to connection issues.
  • High latency in cross-region setups causing read/write problems.

You can reduce these risks significantly via change control setting, load testing during peak traffic, and pre-launch resource validation.

Data loss issues typically result from:

  • Mismatched parallelism strategy causing write disorder.
  • Conflicting writes on the target side.
  • Excessive latency not handled in time, causing source-side logs to be purged before sync.

How to fight back:

  • Parallelism strategy mismatches often occur due to cascading updates or primary key reuse. You may need to fall back to table-level sync granularity, then verify and correct data to ensure consistency.
  • Target-side writes should be prevented via access control and database usage standardization.
  • Excessive latency must be caught via robust alerting. Also, extend log retention (ideally 24+ hours) on the source database.

With these measures in place, you can significantly enhance sync stability and data reliability—laying a solid foundation for data-driven business operations.

Data Verification - Definition, Benefits and Best Practice

· 5 min read
John Li
Chief Executive Officer

When data moves from one system to another, you may have a question: is all the data stored correctly in the target system? If not, how can I identify the missing or incorrect data? Data verification is introduced to resolve your concern. Verification acts as a safeguard, ensuring that all data is accurately replicated, intact, and functional in the new system.

What is Data Verification?

Data verification is the process of ensuring that all data has been accurately and completely replicated from the source instance to the target instance. It involves validating data integrity, consistency, and correctness to confirm that no data is lost, altered, or corrupted during the replication process.

Why is Data Verification Needed?

Ensuring Data Quality

In data replication, some data records may be skipped or fail to move to the target instance, resulting in data loss and inconsistencies. Verification plays a key role in ensuring that data is completely and accurately moved from the source to the target.

Key aspects of data verification:

  • Completeness: Ensure that all data of the source instance is present in the target instance.
  • Integrity: Confirm that the data has not been altered or tampered with.
  • Consistency: Verify that the data in the source instance is in line with that in the target instance.

Enhancing Data Reliability

Stakeholders, including users and management, need confidence that the data replication is successfully done. Data verification provides solid evidence on data reliability. When data is verified, users have more trust in what they get, and more confidence to use the data for analytics.

Supporting Decision-making

Accurate and complete data is the backbone for data-driven insights. Any minor inconsistency, if not identified and corrected, may lead to misunderstanding and huge costs. Data verification ensures that the data represents the accurate and real situation, offering a basis for wise decision-making.

How to Verify Data?

Manual Verification

Manual verification involves human effort to check data integrity, completeness, and consistency. For small datasets or specific cases requiring human judgment, you may find it a cost-effective choice, because no specialized tools are needed. However, when there are hundreds of thousands of records to be verified, the manual way is time-consuming and labor-intensive, and human errors tend to occur. That makes it hard to trust the data quality even after verification.

Automated Verification

Compared with the manual way, automated tools are faster and more efficient, especially for large datasets. A large volume of data can be verified in only a few seconds, helping accelerate your data replication project. No human intervention is needed in this process, reducing human errors and ensuring the consistency of every verification. Also, automated tools can usually correct discrepancies automatically, saving you much time and energy.

Best Practice

Here, we introduce a tool for automatic data verification and correction after data replication -- BladePipe.

BladePipe fetches data from the source instance batch by batch, then uses the primary key to fetch the corresponding data from the target instance using SQL IN or RANGE. Rows with no matching data in the target are marked as Loss; the remaining rows are then compared field by field.

By default, all data is verified. You can also narrow the range of data to be verified using filtering conditions. For any discrepancies found, BladePipe performs two additional verifications to minimize false results caused by data sync latency, significantly improving the reliability of verification results.
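
The sketch below illustrates the general compare-by-primary-key approach described above: take a batch of source rows, look up the same keys on the target with SQL IN, and compare field by field. It is a simplification, not BladePipe's implementation; the cursor, table, and column names are assumptions:

```python
def verify_batch(dst_cur, table, pk, columns, batch):
    """Compare one batch of source rows against the target by primary key.
    `batch` is assumed to be tuples with the primary key first.
    Returns (loss, diff): keys missing on the target, and keys whose fields differ."""
    keys = [row[0] for row in batch]
    placeholders = ", ".join(["%s"] * len(keys))
    dst_cur.execute(
        f"SELECT {pk}, {', '.join(columns)} FROM {table} WHERE {pk} IN ({placeholders})",
        keys,
    )
    target = {row[0]: row[1:] for row in dst_cur.fetchall()}

    loss, diff = [], []
    for row in batch:
        key, fields = row[0], row[1:]
        if key not in target:
            loss.append(key)                      # marked as Loss: no matching row on the target
        elif tuple(fields) != tuple(target[key]):
            diff.append(key)                      # field-by-field mismatch
    return loss, diff
```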

With BladePipe, data can be verified and corrected in a few clicks.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Step 2: Add DataSources

  1. Log in to the BladePipe Cloud.
  2. Click DataSource > Add DataSource.
  3. Select the source and target DataSource type, and fill out the setup form respectively.

Step 3: Create a DataJob

  1. Click DataJob > Create DataJob.

  2. Select the source and target DataSources, and click Test Connection to ensure the connections to the source and target DataSources are both successful.

  3. Select Verification and Correction for DataJob Type, and configure the following items:

    • Select One-time for Verification.
    • Select Correction Mode: Revise after Check / NONE.
      • Revise after Check: The data will be automatically corrected after the verification is completed.
      • NONE: The data will not be automatically corrected after the verification is completed.

  4. Select the tables to be verified. Only existing tables can be selected.
  5. Select the columns to be verified.
  6. Confirm the DataJob creation. Then go back to the DataJob page, and check the data verification result.

Summary

Data verification is a vital process in data migration and sync to ensure data accuracy, consistency, and completeness. With automated tools like BladePipe, data verification is easier than ever before. In just a few clicks, data can be verified and corrected right after migration and sync.

Data Transformation in ETL (2025 Guide)

· 4 min read
John Li
Chief Executive Officer

ETL (Extract, Transform, Load) is a fundamental process in data integration and data warehousing. In this process, data transformation is a key step. It’s the stage where raw, messy data gets cleaned up and reorganized so it’s ready for analysis, business use and decision-making.

In this blog, we will break down data transformation to help you better understand and process data in ETL.

What is Data Transformation in ETL?

In the ETL process, data transformation is the middle step that turns extracted data from various sources into a consistent, usable format for the target system (like a data warehouse or analytics tool). This step applies rules, logic, and algorithms to:

  • Clean up errors and inconsistencies
  • Standardize formats (like dates and currencies)
  • Enrich data with new calculations or derived fields
  • Restructure data to fit the needs of the business or target system

Without transformation, data from different sources would be incompatible, error-prone, or simply not useful for downstream processing like reporting, analytics, or machine learning.

Why is Data Transformation Important?

  • Ensure Data Quality: Fix errors, fill in missing values, and remove duplicates so the data is accurate and trustworthy.
  • Improve Compatibility: Convert data into a format compatible with the target system, and handle schema differences, which are vital for combining data from different sources.
  • Enhance Performance & Efficiency: Filter unnecessary data early, reducing storage and processing costs. Optimize data structure through partitioning and indexing for faster queries.
  • Enable Better Analytics & Reporting: Aggregate, summarize, and structure data so it’s ready for dashboards and reports.

10 Types of Data Transformation

Here are the most common types of data transformation you’ll find in ETL pipelines, with simple explanations and examples:

| Transformation Type | Explanation | Example/Use Case |
|---|---|---|
| Data Cleaning | Removes errors and fixes inconsistencies to improve quality | Replace missing values in a "Country" column with "Unknown" |
| Data Mapping | Matches source data fields to the target schema so data lands in the right place | Map "cust_id" from the source to "customer_id" in the target |
| Data Aggregation | Summarizes detailed data into a higher-level view | Sum daily sales into monthly totals |
| Bucketing/Binning | Groups continuous data into ranges or categories for easier analysis | Group ages into ranges (18–25, 26–35, etc.) |
| Data Derivation | Creates new fields by applying formulas or rules to existing fields | Derive "Profit" by subtracting "Cost" from "Revenue" in a sales dataset |
| Filtering | Selects only relevant or necessary records | Filter out only 2024 sales records from the entire sales table |
| Joining | Combines data from multiple sources or tables based on a common key | Join a "Customers" table with an "Orders" table on "CustomerID" to analyze order history |
| Splitting | Breaks up fields into multiple columns for granularity or clarity | Split "Full Name" into "First Name" and "Last Name" |
| Normalization | Standardizes scales or units | Convert currencies to USD |
| Sorting and Ordering | Arranges records based on one or more fields, either ascending or descending | Sort a customer list by "Signup Date" in descending order to identify recent users |
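
As a quick illustration of a few of these transformation types (mapping, cleaning, derivation, and bucketing), here is a small Python sketch; the field names and rules are made up for the example:

```python
def transform(record):
    """Apply a few common transformation types to one record (illustrative only)."""
    out = {}
    out["customer_id"] = record["cust_id"]                          # data mapping: rename field
    out["country"] = (record.get("country") or "Unknown").strip()   # cleaning: fill missing value, trim spaces
    out["profit"] = record["revenue"] - record["cost"]              # derivation: new field from existing ones
    age = record["age"]                                             # bucketing: group continuous values
    out["age_band"] = "18-25" if age <= 25 else "26-35" if age <= 35 else "36+"
    return out

print(transform({"cust_id": 42, "country": None, "revenue": 120.0, "cost": 80.0, "age": 29}))
```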

Automate Data Transformation with BladePipe

BladePipe is a real-time end-to-end data replication tool. It supports various ways to transform data. With a user-friendly interface, complex end-to-end transformations can be done in a few clicks.

Compared with traditional data transformation approaches, BladePipe offers the following features:

  • Real-time Transformation: Any incremental data is captured, transformed and loaded in real time, critical in projects requiring extremely low latency.

  • Flexibility: BladePipe offers multiple built-in transformations with no manual scripting required. For special transformations, custom code can cater to personalized needs.

  • Ease of Use: Most operations are done in an intuitive interface with wizards. Except for transformation via custom code, data transformations don't require any code.

Data Filtering

BladePipe allows you to specify a filter condition as a SQL WHERE clause, so that only relevant records are processed and loaded, improving ETL performance.

Data Cleaning

BladePipe has several built-in data transformation scripts, covering common use cases. For example, you can simply remove leading and trailing spaces from strings, standardizing the data format.

Data Mapping

In BladePipe, the table names and field names can be mapped to the target instance based on certain rules. Besides, you can name each table as you like.

Wrapping Up

Data transformation is the engine that powers the effective ETL process. By cleaning, standardizing, and enriching raw data, it ensures organizations have reliable, actionable information for decision-making. Whether you’re combining sales data, cleaning up customer lists, or preparing data for machine learning, transformation is what makes your data truly useful.

What is Geo-Redundancy? A Comprehensive Guide

· 4 min read
John Li
Chief Executive Officer

Geo-redundancy is the practice of replicating and storing your critical IT infrastructure and data across multiple, strategically chosen locations.

Why is Geo-Redundancy Needed?

The main aim is to ensure continuous availability and resilience against local failures or disasters. Imagine that your system is built in a single data center or region: what will happen if a power outage hits that region? A catastrophe for your business. However, if you replicate systems and data across different regions in advance, the workload can fail over to another available data center, and the service will always be online.

Another vital purpose of geo-redundancy is backup and data protection. Compared with single-location data storage, geo-redundancy safeguards data by replicating and maintaining copies of data in multiple places, minimizing the risk of data loss.

How Does Geo-Redundancy Work?

Geo-redundancy can be implemented using two primary patterns:

  • Active-Active: All regions are operational and handle requests simultaneously. This ensures load balancing and fault tolerance but requires robust synchronization mechanisms to maintain data consistency.

  • Active-Passive: A secondary region remains on standby and takes over only if the primary region fails. This is simpler to implement but may result in underutilized resources.

How to Set Up Geo-Redundancy?

To establish an effective geo-redundant system, the following steps can be considered:

  1. Assess Business Requirements: Determine the number of data centers to be deployed based on the scale and impact of the business. Then, decide the locations of the data centers according to the distribution of users and their access needs.

  2. Replicate Data: Select the data replication mode that is right for your business, and start to replicate data across chosen geographic locations, ensuring that replication methods align with the consistency requirements.

  3. Establish Failover Procedures: Develop and document procedures for automatic or manual failover to secondary systems, ensuring minimal downtime during transitions.

  4. Monitor and Regularly Test: Establish a monitoring system to monitor each data center and system components in real time to promptly detect and handle potential problems. Conduct failover and disaster recovery tests periodically to validate the effectiveness of geo-redundant configurations and update procedures based on test outcomes.

Common Challenges

Setting up and maintaining a geo-redundant system is a complex process, and the challenges you may be concerned about include:

  • Data Consistency: Data is replicated among several data centers, making it hard to track and verify data consistency.

  • Cost Management: Deploying and maintaining multiple data centers can significantly increase operational costs.

  • Complexity of Configuration: Setting up geo-redundancy requires careful planning and expertise to avoid misconfigurations that could compromise system integrity.

  • Latency and Performance: Long distances between regions can introduce latency, affecting your system's performance.

How Does BladePipe Help Achieve Geo-Redundancy?

BladePipe, a real-time end-to-end data replication tool, presents various features to reduce the complexity of a geo-redundancy solution.

  • Real-time Data Sync: BladePipe replicates data between databases, data warehouses and other data sources using the change data capture (CDC) technique. Only changed data is replicated, keeping latency extremely low.

  • Bidirectional Data Flow: BladePipe can realize two-way data sync without circular data replication. This functionality plays a key role in realizing Active-Active geo-redundancy.

  • Data Verification and Correction: The built-in data verification and correction functionality helps to check the data on a regular basis, safeguarding data integrity and consistency.

  • User-friendly Interface: All operations in BladePipe are done in an intuitive way by clicking the mouse. No coding is required.

Conclusion

Geo-redundancy is an essential component of modern IT infrastructure. By understanding its key concepts, organizations can build resilient systems capable of withstanding regional failures and minimizing downtime. BladePipe, as a real-time data movement tool, is a perfect choice to help establish a robust geo-redundant system, making the whole process efficient, time-saving and effortless.

Redis Sync at Scale: A Smarter Way to Handle Big Keys

· 4 min read
John Li
Chief Executive Officer

In enterprise-grade data replication workflows, Redis is widely adopted thanks to its blazing speed and flexible data structures. But as data grows, so do the keys in Redis—literally. Over time, it’s common to see Redis keys ballooning with hundreds of thousands of elements in structures like Lists, Sets, or Hashes.

These “big keys” are usually one of the roots of poor performance in a full data migration or sync, slowing down processes or even bringing them to a crashing halt.

That’s why BladePipe, a professional data replication platform, recently rolled out a fresh round of enhancements to its Redis support. This includes expanded command coverage, a data verification feature, and, most importantly, major improvements to big key sync.

Let’s dig into how these improvements work and how they keep Redis migrations smooth and reliable.

Challenges of Big Key Sync

In high-throughput, real-time applications, it’s common for a single Redis key to contain a massive amount of elements. When it comes to syncing that data, a few serious issues can pop up:

  • Out-of-Memory (OOM) Crashes: Reading big keys all at once can cause the sync process to blow up memory usage, sometimes leading to OOM.
  • Protocol Size Limits: Redis commands and payloads have strict limits (e.g., 512MB for a single command via the RESP protocol). Exceed those limits, and Redis will reject the operation.
  • Target-Side Write Failures: Even if the source syncs properly, the target Redis might fail to process oversized writes, leading to data sync interruption.

How BladePipe Tackles Big Key Syncs

To address these issues, BladePipe introduces lazy loading and sharded sync mechanisms specifically tailored for big keys without sacrificing data integrity.

Lazy Loading

Traditional data sync tools often attempt to load an entire key into memory in one go. BladePipe flips the script by using on-demand loading. Instead of stuffing the entire key into memory, BladePipe streams it shard-by-shard during the sync process.

This dramatically reduces memory usage and minimizes the risk of OOM crashes.

Sharded Sync

The heart of BladePipe’s big key optimization lies in breaking big keys into smaller shards. Each shard contains a configurable number of elements and is sent to the target Redis in multiple commands.

  • Configurable parameter: parseFullEventBatchSize
  • Default value: 1024 elements per shard
  • Supported types: List, Set, ZSet, Hash

Example: If a Set contains 500,000 elements, BladePipe will divide it into ~490 shards, each with up to 1024 elements, and send them as separate SADD commands.
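
A minimal sketch of this shard-by-shard idea using the redis-py client is shown below. The host names, key, and batch size are placeholders, and the real product handles more data types plus ordering and retry concerns:

```python
import redis

def sync_big_set(src, dst, key, batch_size=1024):
    """Copy one large Set shard-by-shard: stream members from the source
    in batches and write each batch to the target as a separate SADD."""
    buffer = []
    for member in src.sscan_iter(key, count=batch_size):  # lazy, cursor-based scan
        buffer.append(member)
        if len(buffer) >= batch_size:
            dst.sadd(key, *buffer)                         # one shard per command
            buffer.clear()
    if buffer:
        dst.sadd(key, *buffer)                             # flush the final partial shard

src = redis.Redis(host="source-redis", port=6379)
dst = redis.Redis(host="target-redis", port=6379)
sync_big_set(src, dst, "big:set:key")
```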

Shard-by-Shard Sync Process

Here’s a breakdown of how it works:

  1. Shard Planning: BladePipe inspects the total number of elements in a big key and calculates how many shards are needed based on the parameter parseFullEventBatchSize.
  2. Shard Construction & Dispatch: Each shard is formatted into a Redis-compatible command and sent to the target sequentially.
  3. Order & Integrity Guarantees: Shards are written in the correct order, preserving data consistency on the target Redis.

Real-World Results

To benchmark the improvements, BladePipe ran sync tests with a mixed dataset:

  • 1 million regular keys (String, List, Hash, Set, ZSet)
  • 50,000 large keys (~30MB each; max ~35MB)

Here’s what performance looked like:

The result shows that even with big keys in the mix, BladePipe achieved a steady sync throughput of 4–5K RPS from Redis to Redis, which is enough to handle the daily production workloads for most businesses without compromising accuracy.

Wrapping Up

Big keys don’t have to be big problems. With lazy loading and sharded sync, BladePipe provides a reliable and memory-safe way to handle full Redis migrations—even for your biggest keys.

7 Best Change Data Capture (CDC) Tools in 2025

· 7 min read
John Li
Chief Executive Officer

Change Data Capture (CDC) is a technique that identifies and tracks changes to data stored in a database, such as inserts, updates, and deletes. By capturing these changes, CDC enables efficient data replication between systems without full data reloads. It’s widely used in modern data pipelines to power real-time analytics, maintain data lakes, update caches, and support event-driven architectures.

Why do You Need CDC?

  • Real-time Data Flow: As the name implies, data changes are captured as they happen in near real-time. So, when something updates in the source database, it's reflected almost immediately elsewhere. This feature perfectly suits the use cases requiring real-time change sync across different databases or systems.
  • Reduced Resource Requirement: CDC optimizes resource utilization to reduce operational costs by monitoring and extracting database changes in real-time, which requires fewer computing resources and provides better performance.
  • Greater Efficiency: Only data that has changed is synchronized, which is exponentially more efficient than replicating an entire database and enhances the accuracy of data and analytics.
  • Agile Business Insights: CDC enables data collection in real-time, allowing teams across organizations to access recent data for making data-driven decisions quickly and improving accuracy of decision-making.

7 Best CDC Tools in 2025

Debezium

Debezium is an open-source distributed platform for change data capture. Built on top of Apache Kafka, Debezium captures row-level changes from various databases, like MySQL, PostgreSQL, MongoDB, and others, and streams these changes to Kafka for downstream processing. Key Features:

  • Open source: Debezium is actively developed with a strong community, and it's free of cost.
  • Kafka Integration: It is built on Apache Kafka, enabling scalable, fault-tolerant streaming of change events.
  • Snapshot & Stream Modes: It can take an initial snapshot of existing data and then continue with real-time streaming.

Fivetran

Fivetran is a fully managed data integration platform that simplifies and automates the process of moving data from various sources into centralized destinations like data warehouses or lakes. It handles schema changes, data normalization, and continuous updates without manual intervention. Key Features:

  • Real-Time Data Movement: It continuously updates data with low latency, using CDC where supported to reduce load and improve speed.
  • Data Normalization: It standardizes data structures and formats across sources to ensure consistency in your data warehouse.
  • Transformations with dbt Integration: It enables in-warehouse transformations using SQL or dbt, making it easy to prepare data for analytics.

Airbyte

Airbyte is an open-source data integration platform that supports log-based CDC from databases like Postgres, MySQL, and SQL Server. To assist log-based CDC, Airbyte uses Debezium to capture various operations like INSERT and UPDATE. Key Features:

  • Open-Source & Extensible: It is fully open-source with a modular design that allows users to build and customize connectors easily.
  • A Wide Range of Connector Support: It supports over 300 connectors, enabling data ingestion from APIs, databases, SaaS tools, and more.
  • Orchestration Integration: It is compatible with Airflow and Dagster, allowing integration into existing workflows.

BladePipe

BladePipe is a real-time end-to-end data replication tool that moves data between 30+ databases, message queues, search engines, caching, real-time data warehouses, data lakes, etc.

BladePipe tracks, captures and delivers data changes automatically and accurately with ultra-low latency (less than 3 seconds), greatly improving the efficiency of data integration. It provides sound solutions for use cases requiring real-time data replication, fueling data-driven decision-making and business agility.

Key Features:

  • Real-time Data Sync: The latency is extremely low, less than 3 seconds in most cases.
  • Intuitive Operation: It offers a visual management interface for easy creation and monitoring of DataJobs. Almost all operations can be done by clicking the mouse.
  • Flexibility of Transformation: It supports filtering and mapping, and has multiple built-in data transformation scripts, which are friendly for non-developers. Also, users can implement special transformations using custom code.
  • Data Accuracy: It supports data verification and correction right after replication, making it easy for users to check the accuracy and integrity of data in the target instance.
  • Monitoring & Alerting: It has built-in tools for monitoring task health, performance metrics, and error handling. It also supports various ways for alert notification.

Qlik Replicate

Qlik Replicate is a high-performance data replication and change data capture (CDC) solution designed to enable real-time data movement across diverse systems. It supports a wide range of source and target platforms, including relational databases, data warehouses, cloud services, and big data environments. Key Features:

  • Cloud and Hybrid Support: It works across on-premises, cloud, and hybrid environments, suitable for building modern data architectures.
  • High Performance & Scalability: It is optimized for high-volume data replication with minimal impact on source systems.
  • Broad Source and Target Support: It supports a wide range of platforms including Oracle, SQL Server, MySQL, PostgreSQL, SAP, Mainframe, Snowflake, Amazon Redshift, Google BigQuery, and more.

Striim

Striim is a real-time data integration and streaming platform. With built-in change data capture (CDC) capabilities, Striim enables low-latency replication from transactional databases to modern destinations such as data warehouses, lakes, and analytics platforms. Key Features:

  • Real-Time Data Integration: It captures and delivers data changes instantly using log-based CDC.
  • Source & Target Support: It supports a wide range of sources and destinations, including databases, data warehouses, lakes, etc.
  • User-friendly UI: It offers a drag-and-drop interface and SQL support for building, deploying, and managing data pipelines.

Oracle GoldenGate

Oracle GoldenGate is a software package for enabling the replication of data in heterogeneous data environments. It enables continuous replication of transactional data between databases, whether on-premises or in the cloud, with minimal impact on source systems. Key Features:

  • Log-Based Replication: It uses transaction logs for non-intrusive, high-performance data capture without impacting source systems.
  • Cloud Integration: It seamlessly integrates with Oracle Cloud Infrastructure (OCI) and other cloud platforms for hybrid and multi-cloud deployments.
  • Data Transformation: It allows filtering, mapping, and transformation of data during replication.

How to Choose the CDC Tool that Works for You?

Choosing the right CDC tool depends on the specific needs and requirements of your organization. Here are some factors to consider:

  • Data Sources and Targets: Ensure that the CDC tool supports the data sources and targets you need to integrate.
  • Real-time Requirements: Evaluate the latency requirements of your applications and choose a CDC tool that can meet those needs.
  • Scalability: Consider the volume of data you need to process and choose a CDC tool that can scale to handle your workload.
  • Ease of Use: Look for a CDC tool that is easy to set up, configure, and manage.
  • Cost: Compare the pricing of different CDC tools and choose one that fits your budget.
  • Existing Infrastructure: Assess how well the CDC tool integrates with your current data infrastructure and tools.
  • Specific Use Cases: Align the tool's capabilities with your specific use cases, such as real-time analytics, data warehousing, or application integration.
  • Security and Compliance: Ensure the tool meets your organization's security and compliance requirements.
  • Support and Documentation: Check for comprehensive documentation, community support, and vendor support options.

Wrapping Up

CDC tools are about efficiency. They maintain consistency between systems without the cost of bulk data transfers, making real-time business insights possible. To choose the right CDC tool for your project, you have to consider multiple factors. Align a tool’s capabilities with your technical requirements and business goals, and select a CDC solution that ensures reliable, real-time data replication tailored to your project.

If you are looking for an efficient, stable and easy-to-use CDC tool, BladePipe is well-placed as it offers an out-of-the-box solution for real-time data movement. Whether you're building real-time analysis, syncing data across services, or preparing datasets for machine learning, BladePipe helps you move and shape data quickly, reliably, and efficiently.

A Comprehensive Guide to Wide Table

· 7 min read
John Li
Chief Executive Officer

In real-world business scenarios, even a basic report often requires joining 7 or 8 tables. This can severely impact query performance. Sometimes it takes hours for business teams to get a simple analysis done.

This article dives into how wide table technology helps solve this pain point. We’ll also show you how to build wide tables with zero code, making real-time cross-table data integration easier than ever.

The Challenge with Complex Queries

As business systems grow more complex, so do their data models. In an e-commerce system, for instance, tables recording orders, products, and users are naturally interrelated:

  • Order table: product ID (linked to Product table), quantity, discount, total price, buyer ID (linked to User table), etc.
  • Product table: name, color, texture, inventory, seller (linked to User table), etc.
  • User table: account info, phone numbers, emails, passwords, etc.

Relational databases are great at normalizing data and ensuring efficient storage and transaction performance. But when it comes to analytics, especially queries involving filtering, aggregation, and multi-table JOINs, the traditional schema becomes a performance bottleneck.

Take a query like "Top 10 products by sales in the last month": the more JOINs involved, the more complex and slower the query. And the number of possible query plans grows rapidly:

| Tables Joined | Possible Query Plans |
|---|---|
| 2 | 2 |
| 4 | 24 |
| 6 | 720 |
| 8 | 40,320 |
| 10 | 3,628,800 |

(The counts above are n! for n tables joined, i.e., the number of possible join orders.)

For CRM or ERP systems, joining 5+ tables is standard. Then, the real question becomes: How to find the best query plan efficiently?

To tackle this, two main strategies have emerged: Query Optimization and Precomputation, with wide tables being a key form of the latter.

Query Optimization vs Precomputation

Query Optimization

One solution is to reduce the number of possible query plans to accelerate query speed. This is called pruning. Two common approaches are used:

  • RBO (Rule-Based Optimizer): RBO doesn't consider the actual distribution of your data. Instead, it tweaks SQL query plans based on a set of predefined, static rules. Most databases have some common optimization rules built in, like predicate pushdown. Depending on their specific business needs and architectural design, different databases also have their own unique optimization rules. Take SAP HANA, for instance: it powers SAP ERP operations and is designed for in-memory processing with lots of joins. Because of this, its optimizer rules are noticeably different from other databases.
  • CBO (Cost-Based Optimizer): CBO evaluates I/O, CPU and other resource consumption, and picks the plan with the lowest cost. This type of optimization dynamically adjusts based on the specific data distribution and the features of your SQL query. Even two identical SQL queries might end up with completely different query plans if the parameter values are different. CBO typically relies on a sophisticated and complex statistics subsystem, including crucial information like the volume of data in each table and data distribution histograms based on primary keys.

Most modern databases combine both RBO and CBO.

Precomputation

Precomputation assumes the relationships between tables are stable, so instead of joining on every query, it pre-joins data ahead of time into a wide table. When data is changed, only changes are delivered to the wide table. The idea has been around since the early days of materialized views in relational databases.

Compared with live queries, precomputation massively reduces runtime computation. But it's not perfect:

  • Limited JOIN semantics: Hard to handle anything beyond LEFT JOIN efficiently.
  • Heavy updates: A single change on the “1” side of a 1-to-N relation can cause cascading updates, challenging service reliability.
  • Functionality trade-offs: Precomputed tables lack the full flexibility of live queries (e.g. JOINs, filters, functions).

Best Practice: Combine Both

In the real world, a hybrid approach works best: use precomputation to generate intermediate wide tables, and use live queries on top of those to apply filters and aggregations.

  • Precomputation: A popular approach is stream computing, with stream processing databases emerging in recent years. Materialized views in traditional relational databases or data warehouses also offer an excellent solution.

  • Live queries: Real-time analytics databases deliver significant performance boosts in data filtering and aggregation, thanks to columnar and hybrid row-column data structures, new instruction sets like AVX-512, high-performance computing hardware such as FPGAs and GPUs, and software techniques like distributed computing.

BladePipe's Wide Table Evolution

BladePipe started with a high-code approach: users had to write scripts to fetch related table data and construct wide tables manually during data sync. It worked, but required too much effort to scale.

Now, BladePipe supports visual wide table building, enabling zero-code configuration. Users can select a driving table and the lookup tables directly in the UI to define JOINs. The system handles both initial data migration and real-time updates.

It currently supports visual wide table creation in the following pipelines:

  • MySQL -> MySQL/StarRocks/Doris/SelectDB
  • PostgreSQL/SQL Server/Oracle/MySQL -> MySQL
  • PostgreSQL -> StarRocks/Doris/SelectDB

More supported pipelines are coming soon.

How Visual Wide Table Building Works in BladePipe

Key Definitions

In BladePipe, a wide table consists of:

  • Driving Table: The main table used as the data source. Only one driving table can be selected.
  • Lookup Tables: Additional tables joined to the driving table. Multiple lookup tables are supported.

By default, the join behavior follows Left Join semantics: all records from the driving table are preserved, regardless of whether corresponding records exist in lookup tables.

BladePipe currently supports two types of join structures:

  • Linear: e.g., A.b_id = B.id AND B.c_id = C.id. Each table can only be selected once, and circular references are not allowed.
  • Star: e.g., A.b_id = B.id AND A.c_id = C.id. Each lookup table connects directly to the driving table. Cycles are not allowed.

In both cases, table A is the driving table, while B, C, etc. are lookup tables.

Data Change Rule

If the target is a relational DB (e.g. MySQL):

  • Driving table INSERT: Fields from lookup tables are automatically filled in.
  • Driving table UPDATE/DELETE: Lookup fields are not updated.
  • Lookup table INSERT: If downstream tables exist, the operation is converted to an UPDATE to refresh Lookup fields.
  • Lookup table UPDATE: If downstream tables exist, no changes are applied to related fields.
  • Lookup table DELETE: If downstream tables exist, the operation is converted to an UPDATE with all fields set to NULL.

If the target is an overwrite-style DB (e.g. StarRocks, Doris):

  • All operations (INSERT, UPDATE, DELETE) on the Driving table will auto-fill Lookup fields.

  • All operations on Lookup tables are ignored.

    info

    If you want to include lookup table updates when the target is an overwrite-style database, set up a two-stage pipeline:

    1. Source DB → relational DB wide table
    2. Wide table → overwrite-style DB
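
To make the driving-table INSERT rule above concrete, here is a toy sketch that builds one wide-table row by filling fields from the lookup tables, following the earlier orders/products/users example. All names and data are made up; this is not BladePipe's implementation:

```python
# Lookup tables held as simple dicts keyed by their primary key (toy data).
products = {101: {"product_name": "Mouse", "seller_id": 7}}
users = {7: {"seller_name": "Alice"}}

def on_order_insert(order):
    """Driving-table (orders) INSERT: build the wide-table row by filling
    fields from the lookup tables (left-join semantics: missing lookups stay None)."""
    product = products.get(order["product_id"], {})
    seller = users.get(product.get("seller_id"), {})
    return {
        "order_id": order["id"],
        "quantity": order["quantity"],
        "product_name": product.get("product_name"),
        "seller_name": seller.get("seller_name"),
    }

print(on_order_insert({"id": 1, "product_id": 101, "quantity": 2}))
```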

Step-by-Step Guide

  1. Log in to BladePipe. Go to DataJob > Create DataJob.

  2. In the Tables step,

    1. Choose the tables that will participate in the wide table.
    2. Click Batch Modify Target Names > Unified table name, and enter a name as the wide table name.
  3. In the Data Processing step,

    1. On the left panel, select the Driving Table and click Operation > Wide Table to define the join.
      • Specify Lookup Columns (multiple columns are supported).
      • Select additional fields from the Lookup Table and define how they map to wide table columns. This helps avoid naming conflicts across different source tables.
    info

    1. If a Lookup Table joins to another table, make sure to include the relevant Lookup columns. For example, in A.b_id = B.id AND B.c_id = C.id, when selecting fields from B, c_id must be included.
    2. When multiple Driving or Lookup tables contain fields with the same name, always map them to different target column names to avoid collisions.

    2. Click Submit to save the configuration.
    3. Click Lookup Tables on the left panel to check whether the field mappings are correct.

  4. Continue with the DataJob creation process, and start the DataJob.

Wrapping up

Wide tables are a powerful way to speed up analytics by precomputing complex JOINs. With BladePipe’s visual builder, even non-engineers can set up and maintain real-time wide tables across multiple data systems.

Whether you're a data architect or a DBA, this tool helps streamline your analytics layer and power up your dashboards with near-instant queries.

BladePipe vs. Airbyte: Features, Pricing and More (2025)

· 7 min read
John Li
Chief Executive Officer

In today’s data-driven landscape, building reliable pipelines is a business imperative, and the right integration tool can make a difference.

Two modern tools are BladePipe and Airbyte. BladePipe focuses on real-time end-to-end replication, while Airbyte offers a rich connector ecosystem for ELT pipelines. So, which one fits your use case?

In this blog, we break down the core differences between BladePipe and Airbyte to help you make an informed choice.

Intro

What is BladePipe?

BladePipe is a real-time end-to-end data replication tool. Founded in 2019, it’s built for high-throughput, low-latency environments, powering real-time analytics, AI applications, or microservices that require always-fresh data.

The key features include:

  • Real-time replication, with a latency less than 10 seconds.
  • End-to-end pipeline for great reliability and easy maintenance.
  • One-stop management of the whole lifecycle from schema evolution to monitoring and alerting.
  • Zero-code RAG building for simpler and smarter AI.

What is Airbyte?

Airbyte was founded in 2020. It is an open-source data integration platform focused on ELT pipelines. It offers a large library of pre-built and marketplace connectors for moving batch data from various sources to popular data warehouses and other destinations.

The key features include:

  • Focus on batch-based ELT pipelines.
  • Extensive connector ecosystem.
  • Open-source core with paid enterprise version.
  • Support for custom connectors with minimal code.

Feature Comparison

| Features | BladePipe | Airbyte |
| --- | --- | --- |
| Sync Mode | Real-time CDC-first/ETL | ELT-first/(Batch) CDC |
| Batch and Streaming | Batch and Streaming | Batch only |
| Sync Latency | ≤ 10 seconds | ≥ 1 minute |
| Data Connectors | 40+ connectors built by BladePipe | 50+ maintained connectors, 500+ marketplace connectors |
| Source Data Fetch | Pull and push hybrid | Pull-based |
| Data Transformation | Built-in transformations and custom code | dbt and SQL |
| Schema Evolution | Strong support | Limited |
| Verification & Correction | Yes | No |
| Deployment Options | Cloud (BYOC)/Self-hosted | Self-hosted (OSS)/Cloud (Managed) |
| Security | SOC 2, ISO 27001, GDPR | SOC 2, ISO 27001, GDPR, HIPAA Conduit |
| Support | Enterprise-level support | Community (free) and Enterprise-level support |

Pipeline Latency

Airbyte realizes data movement through batch-based extraction and loading. It supports Debezium-based CDC, which is applicable to only a few sources, and only for tables with primary keys. In Airbyte CDC, changes are pulled and loaded in scheduled batches (e.g., every 5 minutes or 1 hour). That puts latency at minutes or even hours, depending on the sync frequency.

BladePipe is built around real-time Change Data Capture (CDC). Unlike batch-based CDC, BladePipe captures changes in the source the moment they occur and delivers them to the destination with sub-second latency. Real-time CDC is applicable to almost all of its 40+ connectors.

In summary, Airbyte usually has higher latency. BladePipe CDC is more suitable for real-time architectures where freshness, latency, and data integrity are essential.

Data Connectors

Airbyte clearly leads in the breadth of supported sources and destinations. To date, Airbyte supports over 550 connectors, most of which are API-based. Airbyte also allows custom connector building through its Connector Builder, greatly extending its connector reach. However, only around 50 of these connectors are Airbyte-official with an SLA; the rest are open-source connectors maintained by the community.

BladePipe, on the other hand, focuses on depth over breadth. It now supports 40+ connectors, which are all self-built and actively maintained. It targets critical real-time infrastructure: OLTPs, OLAPs, message middleware, search engines, data warehouses/lakes, vector databases, etc. This makes it a better fit for real-time applications, where data freshness and change tracking matter more than diversity of sources.

In summary, Airbyte stands out for its extensive connector coverage, while BladePipe focuses on real-time change delivery across multiple sources. Choose the tool that suits your specific needs.

Data Transformation

Airbyte, as an ELT-first platform, uses a post-load transformation model, where data is loaded into the target first and transformation is applied afterward. It offers two options: a serialized JSON object or a normalized version as tables. For advanced users, custom transformations can be done via SQL and through integration with dbt. But the transformation capabilities are limited because data is transformed only after being loaded.
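
To illustrate that post-load model, a transformation in an ELT pipeline typically runs as SQL (often via dbt) against data that has already landed in the warehouse. The sketch below is a generic example with hypothetical schema and column names, not an actual Airbyte-generated model.

```sql
-- Post-load (ELT) transformation: raw data is loaded first, then reshaped in the warehouse.
-- Here a cleaned, deduplicated view is derived from a raw loaded table.
CREATE OR REPLACE VIEW analytics.orders_clean AS
SELECT order_id,
       customer_id,
       CAST(order_total AS DECIMAL(12, 2)) AS order_total,
       order_status,
       loaded_at
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY loaded_at DESC) AS rn
    FROM   raw.orders o
) t
WHERE rn = 1;  -- keep only the latest version of each order
```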

BladePipe performs data transformation in real time, before data is loaded. Configure the transformation method when creating a pipeline, and the rest is done automatically. BladePipe supports built-in data transformations in a visual way, including data filtering, data masking, column pruning, mapping, etc. Complex transformations can be done via custom code. With BladePipe, data is ready by the time it leaves the pipeline.

In summary, Airbyte's data transformation capabilities are limited due to its ELT approach to data replication. BladePipe offers both built-in transformations and custom code to satisfy various needs, and the transformations happen in real time.

Support

Airbyte provides free and paid technical support. Open-source users can seek help in the community or solve issues by themselves; it's free of charge but can be time-consuming for urgent production issues. Cloud customers can get help by chatting with Airbyte team members and contributors. Enterprise-level support is a separate paid tier with custom SLAs and access to training.

BladePipe offers a more white-glove support experience. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs. Its technical team is closely involved in onboarding and tuning pipelines. Besides, for all customers, alert notifications can be sent via email and webhook to ensure pipeline reliability.

In summary, both Airbyte and BladePipe provide documentation and technical support for better understanding and use. Just think about your needs and make the right choice.

Use Case Comparison

Based on the features stated above, the performance of the two tools varies in different use cases.

| Use Case | BladePipe | Airbyte |
| --- | --- | --- |
| Data sync between relational databases | Excellent | Average |
| Data sync between online business databases (RDB, data warehouse, message, cache, search engine) | Excellent | Average |
| Data lakehouse support | Average | Excellent |
| SaaS sources support | Average | Average |
| Multi-cloud data sync | Excellent | Average |

Pricing Model Comparison

Pricing is one of the key factors to consider when evaluating tools, especially for startups and organizations with large amounts of data to replicate. BladePipe and Airbyte differ greatly in their pricing models.

BladePipe

BladePipe offers two plans to choose from:

  • Cloud: $0.01 per million rows of full data or $10 per million rows of incremental data. You can easily evaluate the costs via the price calculator. It is available on AWS Marketplace.
  • Enterprise: The costs are based on the number of pipelines and the duration you need. Talk to the sales team about specific costs.

Airbyte

Airbyte has four plans to consider:

  • Open Source: Free to use for self-hosted deployment.
  • Cloud: $2.50 per credit, starting at $10/month (4 credits).
  • Team: Custom pricing for cloud deployment. Talk to the sales team about specific costs.
  • Enterprise: Custom pricing for self-hosted deployment. Talk to the sales team about specific costs.

Summary

Here's a quick comparison of costs between BladePipe BYOC and Airbyte Cloud.

| Million Rows per Month | BladePipe* (BYOC) | Airbyte (Cloud) |
| --- | --- | --- |
| 1 M | $210 | $450 |
| 10 M | $300 | $1,000 |
| 100 M | $1,200 | $3,000 |
| 1000 M | $10,200 | $14,000 |

*: includes one AWS EC2 t2.xlarge for the BladePipe Worker at $200/month; the remaining BladePipe cost corresponds to incremental data priced at $10 per million rows.

In summary, BladePipe is much cheaper than Airbyte, and the cost gap widens as more data is moved per month. If you have a tight budget or need to integrate billions of rows of data, BladePipe is a cost-effective option.

Final Thoughts

The right tool is critical for any business, and the choice should depend on your use case. This article lists a number of considerations and key differences. To summarize, Airbyte excels at extensive connectors and an open ecosystem, while BladePipe is designed for real-time end-to-end data use cases.

If your organization is building applications that rely on always-fresh data, such as AI assistants, real-time search, or event streaming, BladePipe is likely a better fit.

If your organization needs to integrate data from various APIs or wants to maintain connectors with in-house staff, you may try Airbyte.

BladePipe vs. Fivetran: Features, Pricing and More (2025)

· 7 min read
John Li
John Li
Chief Executive Officer

In today’s data-driven landscape, businesses rely heavily on efficient data integration platforms to consolidate and transform data from multiple sources. Two prominent players in this space are Fivetran and BladePipe, both offering solutions to automate and streamline data movement across cloud and on-premises environments.

This blog provides a clear comparison of BladePipe and Fivetran as of 2025, covering their core features, pricing models, deployment options, and suitability for different business needs.

Quick Intro

What is BladePipe?

BladePipe is a data integration platform known for its extremely low latency and high performance, facilitating efficient migration and sync of data across both on-premises and cloud databases. Founded in 2019, it's built for analytics, microservices, and AI-focused use cases that emphasize real-time data.

The key features include:

  • Real-time replication, with latency of less than 10 seconds.
  • End-to-end pipeline for great reliability and easy maintenance.
  • One-stop management of the whole lifecycle from schema evolution to monitoring and alerting.
  • Zero-code RAG building for simpler and smarter AI.

What is Fivetran?

Fivetran is a global leader in automated data movement and is widely trusted by many companies. It offers a fully managed ELT (Extract-Load-Transform) service that automates data pipelines with prebuilt connectors, ensuring robust data sync and automatic adaptation to source schema changes.

The key features include:

  • Managed ELT pipelines, automating the entire Extract-Load-Transform process.
  • Extensive connectors (700+ prebuilt connectors).
  • Strong data transformation ability with dbt integration and built-in models.
  • Automatic schema handling, reducing human efforts.

Feature Comparison

| Features | BladePipe | Fivetran |
| --- | --- | --- |
| Sync Mode | Real-time CDC-first/ETL | ELT/Batch CDC |
| Batch and Streaming | Batch and Streaming | Batch only |
| Sync Latency | ≤ 10 seconds | ≥ 1 minute |
| Data Connectors | 40+ connectors built by BladePipe | 700+ connectors, 450+ are Lite (API) connectors |
| Source Data Fetch | Pull and Push hybrid | Pull-based |
| Data Transformation | Built-in transformations and custom code | Post-load transformation and dbt integration |
| Schema Evolution | Strong support | Strong support |
| Verification & Correction | Yes | No |
| Deployment Options | Self-hosted/Cloud (BYOC) | Self-hosted/Hybrid/SaaS |
| Security | SOC 2, ISO 27001, GDPR | SOC 2, ISO 27001, GDPR, HIPAA |
| Support | Enterprise-level support | Tiered support (Standard, Enterprise, Business Critical) |
| SLA | Available | Available |

Pipeline Latency

Fivetran adopts batch-based CDC, which means data is read at batch intervals. It offers a range of sync frequencies, from as low as 1 minute (for Enterprise/Business Critical plans) to 24 hours. In practice, latency is typically around 10 minutes. Besides, batch extraction adds pressure on the source end.

BladePipe uses real-time Change Data Capture (CDC) for data integration. That means it instantly grabs data changes from your source and delivers them to the destination within seconds. This approach is a big shift from traditional batch-based CDC methods. In BladePipe, real-time CDC works with nearly all of its 40+ connectors.

In summary, BladePipe outperforms Fivetran on latency, making it ideal for use cases that require always-fresh data.

Data Connectors

Fivetran offers an extensive library of 700+ pre-built connectors covering databases, APIs, files, and more, satisfying diverse business needs. Among these, around 450 are Lite (API) connectors built for specific use cases with limited endpoints.

BladePipe offers over 40 pre-built connectors. It focuses on essential systems for real-time needs, like OLTPs, OLAPs, messaging tools, search engines, data warehouses/lakes, and vector databases. This makes it a great choice for real-time projects where getting fresh data quickly is a fundamental requirement.

In summary, Fivetran excels with its broad range of connectors, while BladePipe focuses on data delivery for critical real-time infrastructure. Choose the right tool that works for you.

Reliability

Fivetran's reliability has been a point of concern. Its status page shows 15 or more incidents per month, including connector failures, third-party service errors, and other service degradations. It has even experienced an outage lasting more than 2 days.

BladePipe is built with production-grade reliability at its core. It provides real-time dashboards for monitoring every step of data movement. Alert notifications can be triggered for latency, failures, or data loss. That makes it easy to maintain pipelines and solve problems, enhancing reliability.

In summary, BladePipe shows a more reliable system performance than Fivetran, and its monitoring and alerting mechanism brings even stronger support for stable pipelines.

Support

Fivetran offers documentation, a support portal, and email support for the Standard plan. However, some customers complain about long response times. Enterprise and Business Critical plans enjoy a 1-hour support response, but the costs are much higher.

BladePipe offers a more white-glove support experience. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs. Its technical team works closely with clients during onboarding and when fine-tuning data pipelines.

In summary, both Fivetran and BladePipe provide documentation and technical support for better understanding and use.

Use Case Comparison

Based on the features stated above, the performance of the two tools varies in different use cases.

| Use Case | BladePipe | Fivetran |
| --- | --- | --- |
| Data sync between relational databases | Excellent | Average |
| Data sync between online business databases (RDB, data warehouse, message, cache, search engine) | Excellent | Average |
| Data lakehouse support | Average | Average |
| SaaS sources support | Average | Excellent |
| Multi-cloud data sync | Excellent | Average |

Pricing Model Comparison

Pricing is a crucial consideration when evaluating data integration tools, especially for startups and organizations with extensive data replication needs. Fivetran and BladePipe employ significantly different pricing models.

Fivetran

Fivetran has four plans to consider: Free, Standard, Enterprise, and Business Critical. The Free plan covers low volumes (e.g., up to 500,000 MAR). The other three plans adopt MAR-based tiered pricing. See more details on the pricing page.

Besides, Fivetran separately charges for data transformation based on the models users run in a month, making the costs even higher.

As of March 2025, Fivetran's pricing model has changed to connector-level pricing. Pricing and discounts are often applied per individual connector instead of across the entire account. This means that if you have many connectors, your total cost might increase even if your overall data volume hasn't changed.

BladePipe

BladePipe offers two plans to choose from:

  • Cloud: $0.01 per million rows of full data and $10 per million rows of incremental data. You can easily evaluate the costs via the price calculator. It is available on AWS Marketplace.
  • Enterprise: The costs are based on the number of pipelines and the duration you need. Talk to the sales team about specific costs.

Summary

Here's a quick comparison of costs between BladePipe BYOC and Fivetran(Standard).

| Million Rows per Month | BladePipe* (BYOC) | Fivetran (Standard) |
| --- | --- | --- |
| 1 M | $210 | $500+ |
| 10 M | $300 | $1,350+ |
| 100 M | $1,200 | $2,900+ |

*: includes one AWS EC2 t2.xlarge for the BladePipe Worker, $200/month.

In summary, BladePipe is a better choice when it comes to costs, considering the following factors:

  • Cost-effectiveness: BladePipe is much cheaper than Fivetran when moving the same amount of data. Besides, BladePipe doesn't charge for data transformation separately.

  • Cost Predictability: BladePipe's direct per-million-row pricing offers more immediate cost predictability, especially for large, consistent data volumes. Fivetran's MAR-based billing can be less predictable due to the nature of "active rows", the separate data transformation charge, and the new connector-level pricing.

Final Thoughts

Choosing between Fivetran and BladePipe depends heavily on your organization's specific data integration needs and priorities. Fivetran provides extensive coverage of connectors and an automated ELT experience for analytics. BladePipe features real-time CDC, ideal for mission-critical data syncs. In terms of pricing, BladePipe is a cost-effective choice for start-ups and organizations with a tight budget.

Evaluate your specific data sources, latency requirements, budget, internal team resources, and desired level of support to make the most suitable choice.
