Merge pull request #10 from helios-pipeline/feat/adding-videos
feat: added videos, removed public from image links
Kuanchiliao1 authored Aug 20, 2024
2 parents 4b8e169 + 30ce547 commit 6bcd139
Showing 9 changed files with 40 additions and 30 deletions.
11 changes: 9 additions & 2 deletions docs/.vitepress/theme/components/HomePage.vue
@@ -123,8 +123,11 @@ export default {
</div>
</section>
<section class="content-section cli">
<img class="box image" src="/home/terminal.png">
</img>
<!-- <img class="box image" src="/home/terminal.png">
</img> -->
<video class="video" width="500" height="500" autoplay loop muted>
<source src="/home/helios-cli.mp4" type="video/mp4">
</video>
<div class="content">
<h2>Automated Deployment</h2>
<p>Helios CLI configures Helios deployment with AWS credentials, deploys the entire Helios stack to AWS using a single command, and destroys the stack when needed.</p>
@@ -297,6 +300,10 @@ section.content-section.cli {
z-index: -1;
}
.cli .video {
border-radius: 10px;
box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5);
}
/* Main Background Trail */
main::after {
content: "";
2 changes: 1 addition & 1 deletion docs/automating-deployment.md
@@ -14,7 +14,7 @@ How it works:
<p><Icon name="CubeTransparentIcon" /><span> Under the hood, Helios leverages the AWS Cloud Development Kit (AWS CDK) which we will go into more detail below.</span></p>
</div>

![CLI](public/case_study/cli_dropshadow.png)
![CLI](/case_study/cli_dropshadow.png)

## AWS CDK
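
With the CDK, infrastructure is declared as ordinary code and synthesized into CloudFormation templates. As a rough illustration only, here is a minimal, hypothetical CDK v2 stack in Python; it is not Helios' actual infrastructure code:

```python
# Minimal, hypothetical AWS CDK (v2, Python) stack shown only to illustrate
# infrastructure-as-code; Helios' real stack defines far more resources.
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct


class HeliosDemoStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # One illustrative resource: an S3 bucket (e.g. for database backups).
        s3.Bucket(self, "BackupBucket", versioned=True)


app = cdk.App()
HeliosDemoStack(app, "HeliosDemoStack")
app.synth()
```

Running `cdk deploy` against an app like this provisions the declared resources, and `cdk destroy` tears them down.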

10 changes: 5 additions & 5 deletions docs/building-helios.md
@@ -26,11 +26,11 @@ However, row-based storage is often inefficient when it comes to analytical quer

The following is a basic event table along with a basic analytic query that asks the question “How many events has each user initiated?”

![Event Table](public/case_study/eventtable.png)
![Event Table](/case_study/eventtable.png)
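
In SQL terms, that question is a single aggregation. A minimal sketch of issuing it from Python against ClickHouse follows; the `events` table, its `user_id` column, and the connection details are illustrative assumptions, not the case study's actual schema:

```python
# Hypothetical query answering "How many events has each user initiated?"
# Table and column names are illustrative only.
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")  # connect to a local ClickHouse server

rows = client.execute(
    """
    SELECT user_id, count() AS event_count
    FROM events
    GROUP BY user_id
    ORDER BY event_count DESC
    """
)

for user_id, event_count in rows:
    print(user_id, event_count)
```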

Below is an example of how row-based databases might execute an analytical query:

##### ![Row Table](public/case_study/rowbased.png)
##### ![Row Table](/case_study/rowbased.png)

#####

@@ -40,7 +40,7 @@ Columnar databases such as Clickhouse and Apache Druid use column-based storage.

Below is an example of how column-based databases might execute an analytical query:

![Column Table](public/case_study/columnbased.png)
![Column Table](/case_study/columnbased.png)

One key limitation is that the data in a columnar database is not well-suited for frequent updates or deletions of existing data. Fortunately, this drawback should not be limiting for Helios users. By nature, many real-time streaming data use-cases mostly require database reads and insertions that do not modify existing data.

@@ -64,7 +64,7 @@ Of the criteria listed above, ClickHouse's impressive read and write latency par

### Single Node vs Node Cluster

![Node Cluster](public/case_study/node_cluster_opt.png)
![Node Cluster](/case_study/node_cluster_opt.png)

We explored several options when determining the optimal deployment strategy for the Helios production ClickHouse server. While many database deployments utilize clustered architectures for high availability and scalability, with modern implementations often leveraging containerization and orchestration tools like Kubernetes, we found this approach less suitable for ClickHouse.

@@ -93,7 +93,7 @@ Lambda functions typically operate on a shared pool of virtual CPUs, essentially

This preparation time for setting up the execution environment, known as a cold start, is not charged, but it adds latency to the Lambda invocation. One could pay to ensure dedicated CPUs are available and thereby avoid this cold-start latency.

![Cold Starts](public/case_study/lambdacoldstarts.png)
![Cold Starts](/case_study/lambdacoldstarts.png)

Ultimately, we decided to stick with the default setup to save users money, accepting cold starts rather than implementing a warm Lambda. We believe the latency impact from this initial setup is of minimal concern given the nature of event streams; after the first execution, each Lambda execution environment stays active as long as it is continually invoked.

14 changes: 7 additions & 7 deletions docs/helios-architecture.md
@@ -8,7 +8,7 @@ To meet the requirements of Amazon Kinesis users looking to explore and analyze
<p><Icon name="WindowIcon" /><span><strong>Interface</strong> - A user-friendly graphical interface allowing users to conduct analyses and visualize results.</span></p>
</div>

![Core Arch](public/case_study/core_full_color.png)
![Core Arch](/case_study/core_full_color.png)

Given that potential Helios users are already leveraging Amazon Kinesis, it made sense to host all of Helios' infrastructure within the AWS ecosystem.

@@ -30,15 +30,15 @@ The following diagram illustrates our storage architecture:

### ClickHouse Database Server

![Clickhouse Arch](public/case_study/core_clickhouse_highlight.png)
![Clickhouse Arch](/case_study/core_clickhouse_highlight.png)

The main functions of the ClickHouse database are to store event data consumed from Kinesis streams and to make this data available for querying. The database is deployed on an Amazon EC2 instance (i.e. virtual server).

With storage in place, the next phase of our architecture design was to implement an integration between a user’s Amazon Kinesis streams and the ClickHouse instance.

## Connection

![Connection Arch](public/case_study/core_connector_highlight.png)
![Connection Arch](/case_study/core_connector_highlight.png)

Efficiently transferring events from Kinesis streams to our ClickHouse database presented a challenge. We needed a solution that could handle high-volume data ingestion, perform necessary decoding, and ensure reliable delivery.

@@ -62,9 +62,9 @@ Helios’ Lambda Processor is an AWS serverless function that serves as a connec

Using an event-based trigger, the function ingests <TippyWrapper content="In Amazon Kinesis, this is formally called a 'record'. However, for consistency and clarity in our discussion, we will continue to refer to it as an 'event' throughout this case study.">event data</TippyWrapper> from AWS Kinesis streams, and decodes the Kinesis event payload into a JSON object.

![Kinesis Integration 1](public/case_study/kinesis_integration1.png)
![Kinesis Integration 1](/case_study/kinesis_integration1.png)

![Kinesis Integration 2](public/case_study/kinesis_integration2.png)
![Kinesis Integration 2](/case_study/kinesis_integration2.png)

Once the Lambda decodes the payload from a stream, the data needs to be sent to the associated destination table within Clickhouse. To retrieve the table ID, the Lambda interacts with a key-value database, DynamoDB, which contains a mapping of stream IDs to table IDs.
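
A minimal sketch of that flow is shown below. The handler follows the standard Kinesis-to-Lambda event shape, while the DynamoDB table name, its key attributes, and the use of the stream ARN as the stream ID are assumptions for illustration rather than Helios' actual implementation:

```python
# Hypothetical sketch of a Kinesis-triggered Lambda that decodes records
# and resolves the destination table via a DynamoDB mapping.
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
stream_table_map = dynamodb.Table("stream_table_map")  # assumed table name


def handler(event, context):
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded; decode into a JSON object.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Look up which ClickHouse table this stream maps to. Here the stream
        # ARN stands in for the stream ID; the real key scheme is an assumption.
        stream_id = record["eventSourceARN"]
        mapping = stream_table_map.get_item(Key={"stream_id": stream_id})
        table_id = mapping["Item"]["table_id"]

        # ...insert `payload` into the ClickHouse table identified by `table_id`...
```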

@@ -76,7 +76,7 @@ While the storage and connection components form the backbone of Helios, the ana

### Application Server

![App Server](public/case_study/core_client_server_highlight.png)
![App Server](/case_study/core_client_server_highlight.png)

The Helios web application, hosted on an Amazon EC2 instance, serves as the primary interface for users. Implemented with a Flask backend and a React frontend, its core features include:

@@ -86,4 +86,4 @@ The Helios web application, hosted on an Amazon EC2 instance, serves as the prim
</div>

Now that you have a good understanding of how Helios works, in the next section we will cover why we designed it in this way as well as the trade-offs made throughout the building of Helios. Here is our architecture so far:
![Core Arch](public/case_study/core_full_color.png)
![Core Arch](/case_study/core_full_color.png)
17 changes: 10 additions & 7 deletions docs/improving-core-platform.md
@@ -4,7 +4,7 @@ Below we detail the problems we encountered and the solutions we implemented to

## Quarantine Tables

![Quarantine Arch](public/case_study/full_storage_highlight.png)
![Quarantine Arch](/case_study/full_storage_highlight.png)

The initial version of Helios lacked error handling for failed database insertions of event records within the Lambda connector, potentially leading to data loss and difficult-to-parse error messages. To mitigate these issues and enhance system reliability, we implemented a comprehensive error handling and data quarantine system. Below, we outline the key features of this new system:

@@ -15,7 +15,10 @@ The initial version of Helios lacked error handling for failed database insertio
<p><Icon name="SparklesIcon"/><span><strong>AI Summary</strong>: For users who provide a ChatGPT AI key during deployment, we've integrated an AI-powered feature to enhance the error analysis process. This feature leverages a custom ChatGPT system prompt to summarize and interpret the errors stored in the quarantine table, significantly aiding users in their debugging efforts.</span></p>
</div>

\[Image of quarantine table \- needs css done first\]
<video class="video" width="700" height="400" muted autoplay loop style="border-radius: 5px; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5);">
<source src="/case_study/quartable.mp4" type="video/mp4">
</video>


These enhancements collectively improve Helios’ error handling, while providing tools for error analysis and resolution.
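
As a rough sketch of the quarantine pattern described above (the table layout, column names, and single-column destination table are assumptions, not Helios' actual schema):

```python
# Hypothetical sketch: when an insert fails, keep the raw event and the error
# message in a quarantine table instead of dropping the record.
import json
from datetime import datetime, timezone

from clickhouse_driver import Client

client = Client(host="localhost")


def insert_with_quarantine(table_id: str, event: dict) -> None:
    try:
        # Assumed destination table with a single String column named `event`.
        client.execute(f"INSERT INTO {table_id} (event) VALUES", [(json.dumps(event),)])
    except Exception as err:
        # Preserve the payload and the failure reason for later analysis.
        client.execute(
            "INSERT INTO quarantine (table_id, raw_event, error, failed_at) VALUES",
            [(table_id, json.dumps(event), str(err), datetime.now(timezone.utc))],
        )
```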

@@ -27,7 +30,7 @@ To improve Helios' performance and efficiency, we implemented several optimizati

Our initial Connector function processed Kinesis records in small batches of 10. We optimized this by increasing the maximum batch size to 100 records and setting a batch window of one second.

![Batch Size](public/case_study/stream_efficiency.png)
![Batch Size](/case_study/stream_efficiency.png)

This change significantly reduces Lambda invocations, leading to less overhead and improved cost efficiency. Larger batches also enhance overall throughput and make better use of allocated resources. The one-second window ensures a balance between efficiency and near real-time processing, as records are sent when either the batch size is reached or the time window expires.

@@ -37,7 +40,7 @@ These optimizations improve system performance while reducing costs for users, a

Parallelization in the context of Lambdas means that multiple Lambda instances can run at the same time. By default, the Lambda parallelization factor is set to 1, meaning only one Lambda instance at a time processes data from a single Kinesis <TippyWrapper content="A shard is a unit of capacity within a Kinesis stream that provides a fixed amount of data throughput and serves as a partition for organizing events.">shard</TippyWrapper>. We adjusted this setting to 10, allowing up to ten Lambda instances to process data from a single Kinesis shard simultaneously. This significantly improves our ingestion capacity and scalability, increasing our system's ability to handle high-volume data streams quickly and efficiently.

![Parallelization](public/case_study/lambdakinesislimit.png)
![Parallelization](/case_study/lambdakinesislimit.png)
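
Both the batching and parallelization settings live on the Lambda's Kinesis event source mapping. In Helios these values are set as part of the deployed stack; the sketch below uses boto3 purely for illustration, and the stream ARN and function name are placeholders:

```python
# Hypothetical sketch: wiring a Kinesis stream to a Lambda with the tuned
# batching and parallelization settings discussed above.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    FunctionName="helios-connector",       # placeholder function name
    StartingPosition="LATEST",
    BatchSize=100,                          # up to 100 records per invocation
    MaximumBatchingWindowInSeconds=1,       # or flush after one second
    ParallelizationFactor=10,               # up to 10 concurrent invocations per shard
)
```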

### Caching DynamoDB requests

@@ -47,15 +50,15 @@ Our caching strategy ensures that subsequent invocations of the same Lambda inst

## Database Backups

![Database Arch](public/case_study/full_backup_highlight.png)
![Database Arch](/case_study/full_backup_highlight.png)

In the first iteration of Helios, the ClickHouse database lacked backup capabilities. For instance, if the EC2 instance went down, our data would not be recoverable. Addressing this vulnerability was a critical improvement for version two.

As part of our backup strategy, we integrated an Amazon S3 (Simple Storage Service) bucket into our AWS-based architecture. This addition plays a crucial role in enhancing our data resilience and disaster recovery capabilities for our ClickHouse database running on an EC2 instance. The S3 bucket serves as a highly durable and scalable object storage solution, allowing us to implement daily backups of the ClickHouse database efficiently. By leveraging S3's virtually unlimited storage capacity and 99.999999999% (11 9's) of durability, we ensure that our critical data is preserved securely over the long term.

The backup process runs as an automated daily cron job, capturing a full snapshot of the ClickHouse database and transferring it to the designated S3 bucket.

![Daily Cron Job](public/case_study/dailycjob.png)
![Daily Cron Job](/case_study/dailycjob.png)
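
A hedged sketch of what that daily step might look like, assuming a ClickHouse version that supports the built-in `BACKUP` command; the bucket, database name, and credential handling are placeholders, and a cron entry would invoke this script once a day:

```python
# Hypothetical daily backup step: snapshot a ClickHouse database to S3.
# Bucket, database name, and credentials are placeholders only.
from datetime import date

from clickhouse_driver import Client

client = Client(host="localhost")

backup_uri = f"https://helios-backups.s3.amazonaws.com/clickhouse/{date.today()}"
client.execute(
    f"BACKUP DATABASE default TO S3('{backup_uri}', '<ACCESS_KEY_ID>', '<SECRET_ACCESS_KEY>')"
)
```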

## Export to CSV

@@ -71,4 +74,4 @@ To summarize our core platform improvements, we bolstered the platform with feat

After the above improvements, our final architecture looks like this:

![Full Arch](public/case_study/full_full_color.png)
![Full Arch](/case_study/full_full_color.png)
14 changes: 7 additions & 7 deletions docs/introduction.md
@@ -18,7 +18,7 @@ An **event** is a state change in a system or application. This could be as simp
#### Event streaming

![Event Streaming](public/case_study/eventbroker.png)
![Event Streaming](/case_study/eventbroker.png)

**Event streaming** is the continuous transmission and processing of events from various sources in real-time or near real-time. **Real-time** has different meanings depending on the context. For instance, in high-frequency trading, real-time might mean microseconds, whereas in social media analytics, it could mean within a few minutes. In this case study, we define real-time as end-to-end latency within 5 seconds from event consumption by Helios from a streaming platform to event data being available for querying. This definition is based on our load testing, which will be detailed later in this case study.
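
For a concrete picture of an event entering a stream, here is a minimal, hypothetical producer; the stream name and payload are illustrative only:

```python
# Hypothetical sketch: emitting a single event to an existing Kinesis stream.
import json

import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="clickstream-events",   # placeholder stream name
    Data=json.dumps({"user_id": 42, "action": "add_to_cart"}).encode(),
    PartitionKey="42",                 # determines which shard receives the event
)
```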

@@ -45,7 +45,7 @@ Event streaming serves a variety of functions, including:

While event streaming platforms excel at ingesting and processing high-volume, real-time data, they present a significant challenge for data analysis and exploration: data accessibility. Event streaming platforms are optimized for throughput and real-time processing, not for ad-hoc querying or historical analysis. This makes it difficult for analysts to explore past data or perform complex analyses on the fly.

![Black Box](public/case_study/blackbox.png)
![Black Box](/case_study/blackbox.png)

This limitation can significantly impact a team's ability to derive timely insights from their streaming data. To illustrate this challenge more concretely, let's consider a common use case in the e-commerce industry.

@@ -73,7 +73,7 @@ For example, Tinybird is a data platform that allows users to explore real-time

There are similar services including StarTree, which operates in the same space as Tinybird and serves customers including Stripe.

![Tinybird](public/case_study/tinybird_arch.png)
![Tinybird](/case_study/tinybird_arch.png)

Managed services offer quick setup and powerful features, but come with specific trade-offs.

@@ -101,19 +101,19 @@ At its core, Helios is comprised of:
<p><Icon name="LinkIcon" /><span>Helios Amazon Kinesis Integration: Links existing Kinesis streams to the Helios infrastructure.</span></p>
</div>

![Kinesis Connection](public/case_study/kinesis_to_helios.png)
![Kinesis Connection](/case_study/kinesis_to_helios.png)

<div class="icon-list">
<p><Icon name="WindowIcon" /><span>Helios web application: offers an interface for connecting existing streams to the Helios backend infrastructure and an integrated SQL console querying and analyzing Kinesis event streams.</span></p>
</div>

![Web app](public/case_study/webapp.png)
![Web app](/case_study/webapp.png)

<div class="icon-list">
<p><Icon name="CommandLineIcon" /><span>Helios CLI: configures Helios deployment with AWS credentials; deploys the entire Helios stack to AWS using a single command; and destroys the stack when needed. We will go into more detail within the Automating Deployment section.</span></p>
</div>

![CLI](public/case_study/cli_dropshadow.png)
![CLI](/case_study/cli_dropshadow.png)

As with any tool, the suitability of Helios depends on each team's specific requirements, existing infrastructure, and resources. We encourage potential users to evaluate how our offering aligns with their particular needs and constraints.

@@ -123,6 +123,6 @@ To summarize, teams have numerous options for viewing and analyzing events withi

Each option offers distinct advantages and limitations, making it crucial for teams to carefully evaluate their specific requirements before selecting the most suitable approach.

![Comparison Table](public/case_study/comparetableshadow.png)
![Comparison Table](/case_study/comparetableshadow.png)

Having explored the problem space and current solutions, we will now dive into Helios' internal workings. The upcoming section will break down our architecture, examining how each component functions in detail.
Binary file added docs/public/case_study/quartable.mp4
Binary file not shown.
Binary file added docs/public/home/helios-cli.mp4
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/scaling.md
@@ -10,6 +10,6 @@ Vertical scaling, often referred to as "scaling up," involves increasing the res

The diagram below illustrates three primary vertical scaling options available for the Helios architecture within the AWS ecosystem. These options focus on enhancing the capabilities of our existing EC2 instances, <TippyWrapper content="EBS is AWS's persistent storage for EC2 instances">Elastic Block Store (EBS)</TippyWrapper> volumes, and Lambda functions, allowing us to boost processing power and storage capacity without fundamentally altering our system's architecture. By leveraging these vertical scaling techniques, we can efficiently address growing workloads and ensure Helios continues to deliver solid performance as user needs evolve.

![Scaling Helios](public/case_study/scaling_components.png)
![Scaling Helios](/case_study/scaling_components.png)

Understanding our system's limits is key to effective scaling. In this next section, we'll walk through our load testing process and system capabilities.
