by Joseph Brady, Director of Business Development / Cloud Alliance Lead at Treehouse Software, Inc.
Many enterprises have relied on traditional data infrastructures for decades, but older data solutions can’t keep up with the volume and variety of data being collected today. This blog takes a look at building a data lake on Amazon Web Services (AWS), which allows you to store all of your data in one central repository.
What is a Data Lake?
A data lake is a centralized repository for storing structured and unstructured data at any scale. A customer can store data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
Depending on its requirements, a typical organization will need both a data warehouse and a data lake, because they serve different needs and use cases.
A data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, and the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust.
A data lake is different because it stores both relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not defined when the data is captured. This means you can store all of your data without careful up-front design and without knowing which questions you might need answered in the future. You can then apply different types of analytics to the data, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, to uncover insights.
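To make the "schema is not defined when data is captured" idea concrete, here is a minimal sketch of applying a schema at read time instead. It is only an illustration: the records, field names, and the `read_with_schema` helper are all made up for this example and are not part of any AWS or tcVISION API.

```python
import json

# Raw records stored as-is; the fields vary by source (all values are made up).
raw_lines = [
    '{"source": "orders_db", "order_id": 1001, "amount": "19.99"}',
    '{"source": "mobile_app", "device": "ios", "event": "checkout", "amount": 5}',
    '{"source": "iot", "sensor": "t-17", "temp_c": 21.4}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time (hypothetical helper).

    Keeps only the records that contain every requested field and casts
    each field to the requested type. The stored data is never modified.
    """
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# One question: which sources reported an amount?
revenue_rows = list(read_with_schema(raw_lines, {"source": str, "amount": float}))

# A different question over the same raw data: sensor temperatures.
sensor_rows = list(read_with_schema(raw_lines, {"sensor": str, "temp_c": float}))
```

The same stored lines answer both questions; only the schema passed in at read time changes.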
As organizations with data warehouses see the benefits of data lakes, they are evolving their warehouses to include data lakes and to enable diverse query capabilities, data science use cases, and advanced capabilities for discovering new information models. Gartner names this evolution the “Data Management Solution for Analytics” or “DMSA.”
Data lakes built on Amazon S3 provide the foundation for analytics and innovation, and AWS Partner Network (APN) Partners have demonstrated success in helping enterprises evaluate and use the tools and best practices for collecting, storing, governing, and analyzing data.
Key data lake-enabling features of Amazon S3 include the following:
- Decoupling of storage from compute and data processing – With Amazon S3, you can cost-effectively store all data types in their native formats. You can then launch as many or as few virtual servers as you need using Amazon Elastic Compute Cloud (EC2), and you can use AWS analytics tools to process your data. You can optimize your EC2 instances to provide the right ratios of CPU, memory, and bandwidth for best performance.
- Centralized data architecture – Amazon S3 makes it easy to build a multi-tenant environment, where many users can bring their own data analytics tools to a common set of data. This improves both cost and data governance over that of traditional solutions, which require multiple copies of data to be distributed across multiple processing platforms.
- Integration with clusterless and serverless AWS services – Use Amazon S3 with Amazon Athena, Amazon Redshift Spectrum, Amazon Rekognition, and AWS Glue to query and process data. Amazon S3 also integrates with AWS Lambda serverless computing to run code without provisioning or managing servers. With all of these capabilities, you only pay for the actual amounts of data you process or for the compute time that you consume.
- Standardized APIs – Amazon S3 RESTful APIs are simple, easy to use, and supported by most major third-party independent software vendors (ISVs), including leading Apache Hadoop and analytics tool vendors. This allows customers to bring the tools they are most comfortable with and knowledgeable about to help them perform analytics on data in Amazon S3.
— Source: AWS
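One way to picture a common set of data shared by many tools is a consistent object key layout. The sketch below is an assumption for illustration only: the zone and dataset names and the `object_key` and `keys_under` helpers are hypothetical, and the `year=/month=/day=` folders follow the Hive-style partition naming that query engines such as Amazon Athena can recognize. It builds key strings locally and does not call Amazon S3.

```python
from datetime import date


def object_key(zone, dataset, day, filename):
    """Build a partitioned object key (hypothetical naming convention)."""
    return (
        f"{zone}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def keys_under(keys, prefix):
    """Return the keys that share a prefix, the way a prefix listing would."""
    return [key for key in keys if key.startswith(prefix)]


# Made-up datasets and dates, purely for illustration.
keys = [
    object_key("raw", "customers", date(2020, 3, 1), "part-0001.csv"),
    object_key("raw", "customers", date(2020, 3, 2), "part-0001.csv"),
    object_key("raw", "orders", date(2020, 3, 1), "part-0001.json"),
]

march_customers = keys_under(keys, "raw/customers/year=2020/month=03/")
```

With a layout like this, each tool can read only the prefix it needs instead of keeping its own copy of the data.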
Getting to the Lake: Mainframe-to-AWS Data Replication with tcVISION…
An enterprise’s long-standing mainframe data can reside in one database or a combination of databases, including IBM Db2, IBM VSAM, IBM IMS/DB, Software AG Adabas, CA IDMS, and CA Datacom, or even in sequential files. Treehouse Software’s tcVISION writes .csv or JSON to Amazon S3, which provides an optimal foundation for a data lake because of its virtually unlimited scalability.
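As a rough illustration of why CSV and JSON output is convenient downstream, the following sketch reads one extract in each format into the same record shape. The column names (`CUST_ID`, `NAME`, `BALANCE`) and sample values are hypothetical; this does not describe the actual layout of tcVISION output files.

```python
import csv
import io
import json

# Hypothetical extracts; the real layout depends on the source tables.
csv_extract = "CUST_ID,NAME,BALANCE\n1,Ada,100.50\n2,Grace,250.00\n"
json_extract = '{"CUST_ID": 3, "NAME": "Edsger", "BALANCE": 75.25}\n'


def normalize(record):
    """Map a raw record to one common shape with typed fields."""
    return {
        "cust_id": int(record["CUST_ID"]),
        "name": record["NAME"],
        "balance": float(record["BALANCE"]),
    }


def read_csv(text):
    """Parse CSV text with a header row into normalized records."""
    return [normalize(row) for row in csv.DictReader(io.StringIO(text))]


def read_json_lines(text):
    """Parse one JSON object per line into normalized records."""
    return [normalize(json.loads(line)) for line in text.splitlines() if line.strip()]


customers = read_csv(csv_extract) + read_json_lines(json_extract)
```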
Here is a look at three options for data replication with tcVISION:
1. tcVISION Bulk Load of Mainframe Data – An initial load of the database, in which tables can be loaded in parallel on one or more EC2 instances. tcVISION reads the database directly from the mainframe, and can use FTP or S3 to move the unloaded data to the EC2 instance.
2. tcVISION Enterprise Change Data Capture (CDC) Integration – A multi-platform solution for real-time, continuous data synchronization and replication, based on log-based Change Data Capture technology and mainframe batch integration. It turns data exchange into a single-step operation, because it includes a powerful repository for data transparency, data transformation, and apply-processing logic for the targets. Transfer mainframe data to your AWS targets continuously and in real time for Big Data, analytics, business intelligence, ERP, CRM, application modernization, mainframe offload, or mainframe migration initiatives.
3. tcVISION Bi-directional Data Replication – tcVISION supports bi-directional, real-time data synchronization, so that changes on either platform are reflected on the other (e.g., a change to a PostgreSQL table is reflected back on the mainframe). The customer can then modernize their application on the cloud, open systems, etc. without disrupting the existing critical work on the legacy system.
The tcVISION solution focuses on CDC when transferring information between mainframe data sources and Cloud databases and applications. Through an innovative technology, changes occurring in any mainframe application data are tracked and captured, and then published to a variety of Cloud targets.
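The general idea of capturing changes and applying them to a target can be sketched in a few lines. This is a generic illustration of change data capture, not tcVISION's implementation: the event format (`op`, `key`, `row`) and the `apply_change` helper are assumptions made up for this example, and the target is a plain in-memory dictionary standing in for a cloud database table.

```python
def apply_change(table, event):
    """Apply one captured change event to a target table (hypothetical format).

    `table` maps a primary key to a row dictionary.
    """
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        # Merge the changed columns into the existing row, if there is one.
        table[key] = {**table.get(key, {}), **event["row"]}
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


# Made-up change events, in the order they were captured from the source log.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "balance": 100}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "balance": 250}},
    {"op": "update", "key": 1, "row": {"balance": 75}},
    {"op": "delete", "key": 2},
]

target = {}
for event in events:
    apply_change(target, event)
```

Because only the changes are shipped, the target stays in step with the source without reloading whole tables; applying the events in capture order is what keeps the final state correct.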
Further reading: tcVISION Mainframe-to-AWS data replication is featured on the AWS Partner Network Blog…
Treehouse Software is an AWS Technology Partner, and AWS published a blog about tcVISION’s Mainframe-to-AWS data replication capabilities, including a technical overview, security, high availability, scalability, and a step-by-step example of the creation of tcVISION metadata and scripts for replicating mainframe Db2 z/OS data to Amazon Aurora.
Read the blog here: AWS Partner Network (APN) Blog: Real-Time Mainframe Data Replication to AWS with tcVISION from Treehouse Software.
Contact Treehouse Software Today…
Treehouse Software has been helping mainframe enterprises since 1982, and our extensive experience, deep knowledge, and wide-ranging capabilities in mainframe technologies make us a valued partner and a trusted advisor to customers.
No matter where you want your mainframe data to go – the cloud, open systems, or any LUW target – tcVISION from Treehouse Software is your answer.
Just fill out the Treehouse Software Product Demonstration Request Form and a Treehouse representative will contact you to set up a time for your online tcVISION demonstration.