diff --git a/website/en/community/community.md b/website/en/community/community.md index 2fd43f23b..f80830525 100644 --- a/website/en/community/community.md +++ b/website/en/community/community.md @@ -1,12 +1,21 @@ --- order: 1 --- + # Community - - [Contribute Guide](contribute.md) - - [How to get involved](contribute.md#how-to-get-involved) - - [How to submit a github pull request](contribute.md#submit-a-pull-request) - - [Open a github issue](contribute.md#open-a-github-issue) - - [Developing tips](contribute.md#developing-tips) - - [Subscribe Mailing list](mailing.md) - - [Team introduction](team.md) \ No newline at end of file +English | [简体中文](../../zh/community/community.md) + +----- + +- [Contribute Guide](contribute.md) + - [How to get involved](contribute.md#how-to-get-involved) + - [Pull Request Guide](pr_guide.md) + - [Open a github issue](contribute.md#open-a-github-issue) + - [Developing tips](contribute.md#developing-tips) + - [BitSail Release Guide](release_guide.md) +- [Connector Quick Start](connector_quick_start.md) + - [Source Connector Details](source_connector_detail.md) + - [Sink Connector Details](sink_connector_detail.md) +- [Subscribe Mailing list](mailing.md) +- [Team introduction](team.md) \ No newline at end of file diff --git a/website/en/community/connector_quick_start.md b/website/en/community/connector_quick_start.md new file mode 100644 index 000000000..133be8b79 --- /dev/null +++ b/website/en/community/connector_quick_start.md @@ -0,0 +1,281 @@ +--- +order: 7 +--- + +# Connector Quick Start + +English | [简体中文](../../zh/community/connector_quick_start.md) + +----- + +## Introduction + +This article is aimed at BitSail's Connector developers. It comprehensively explains the whole process of developing a complete Connector from the developer's perspective, and quickly gets started with Connector development. + +## contents + +First, the developer needs to fork the BitSail repository. For more details, refer to [Fork BitSail Repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo). And then use git clone the repository to the local, and import it into the IDE. At the same time, create your own working branch and use this branch to develop your own Connector. project address: https://github.com/bytedance/bitsail.git. + +The project structure is as follows: + +![](../../images/community/connector_quick_start/code_structure_en.png) + +## Development Process + +BitSail is a data integration engine based on a distributed architecture, and Connectors will execute concurrently. And the BitSail framework is responsible for task scheduling, concurrent execution, dirty data processing, etc. Developers only need to implement the corresponding interface. 
The specific development process is as follows: + +- Project configuration, developers need to register their own Connector in the `bitsail/bitsail-connectors/pom.xml` module, add their own Connector module in `bitsail/bitsail-dist/pom.xml`, and register configuration files for your connector , so that the framework can dynamically discover it at runtime + - ![](../../images/community/connector_quick_start/connector_pom.png) + + - ![](../../images/community/connector_quick_start/dist_pom.png) +- Connector development, implement the abstract methods provided by Source and Sink, refer to the follow-up introduction for details +- Data output type, the currently supported data type is the BitSail Row type, whether it is the data type that the Source passes to the downstream in the Reader, or the data type that the Sink consumes from the upstream, it should be the BitSail Row type + +# Architecture + +The current design of the Source API is also compatible with streaming and batch scenarios, in other words, it supports pull & push scenarios at the same time. Before that, we need to go through the interaction model of each component in the traditional streaming batch scenario. + +## Batch Model + +In traditional batch scenarios, data reading is generally divided into the following steps: + +- `createSplits`:It is generally executed on the client side or the central node. The purpose is to split the complete data into as many `rangeSplits` as possible according to the specified rules. `createSplits` is executed once in the job life cycle. +- `runWithSplit`: Generally, it is executed on the execution node. After the execution node is started, it will request the existing `rangeSplit` from the central node, and then execute it locally; after the execution is completed, it will request the central node again until all the `splits `are executed. +- `commit`:After the execution of all splits is completed, the `commit` operation is generally performed on the central node to make the data visible to the outside world. + +## Stream Model + +In traditional streaming scenarios, data reading is generally divided into the following steps: + +- `createSplits`: generally executed on the client side or the central node, the purpose is to divide the data stream into `rangeSplits` according to the sliding window or tumbling window strategy, and `createSplits` will always be executed according to the divided windows during the life cycle of the streaming job. +- `runWithSplit`: Generally executed on the execution node, the central node will send `rangeSplit` to the executable node, and then execute locally on the executable node; after the execution is completed, the processed `splits` data will be sent downstream. +- `commit`: After the execution of all splits is completed, the `retract message` is generally sent to the target data source, and the results are dynamically displayed in real time. + +## BitSail Model + +![](../../images/community/connector_quick_start/bitsail_model.png) + +- `createSplits`: BitSail divides rangeSplits through the `SplitCoordinator` module. `createSplits` will be executed periodically in the life cycle of streaming jobs, but only once in batch jobs. +- `runWithSplit`: Execute on the execution node. The execution node in BitSail includes `Reader` and `Writer` modules. The central node will send `rangeSplit` to the executable node, and then execute locally on the executable node; after the execution is completed, the processed `splits` data will be sent downstream. 
+- `commit`: After the `writer` completes the data writing, the `committer` completes the submission. When `checkpoint` is not enabled, `commit` will be executed once after all `writers` are finished; when `checkpoint` is enabled, `commit` will be executed once every checkpoint. + +# Source Connector + +## Introduction + +![](../../images/community/connector_quick_start/source_connector.png) + +- Source: The life cycle management component of the data reading component is mainly responsible for interacting with the framework, structuring the job, and not participating in the actual execution of the job. +- SourceSplit: Source data split. The core purpose of the big data processing framework is to split large-scale data into multiple reasonable Splits +- State:Job status snapshot. When the checkpoint is enabled, the current execution status will be saved. +- SplitCoordinator: SplitCoordinator assumes the role of creating and managing Split. +- SourceReader: The component that is actually responsible for data reading will read the data after receiving the Split, and then transmit the data to the next operator. + +Developers first need to create the `Source` class, which needs to implement the `Source` and `ParallelismComputable` interfaces. It is mainly responsible for interacting with the framework and structuring the job. It does not participate in the actual execution of the job. BitSail's `Source` adopts the design idea of stream-batch integration, sets the job processing method through the `getSourceBoundedness` method, defines the `readerConfiguration` through the `configure` method, and performs data type conversion through the `createTypeInfoConverter` method, and can obtain user-defined data in the yaml file through `FileMappingTypeInfoConverter` The conversion between source type and BitSail type realizes customized type conversion. Then we define the data fragmentation format `SourceSplit` class of the data source and the `SourceSplitCoordinator` class that will manage the Split role, and finally complete the `SourceReader` to read data from the Split. + +| Job Type | Boundedness | +| -------- | --------------------------- | +| batch | Boundedness.*BOUNDEDNESS* | +| stream | Boundedness.*UNBOUNDEDNESS* | + +Each `SourceReader` is executed in an independent thread. As long as we ensure that the splits assigned by the `SourceSplitCoordinator` to different `SourceReader` do not overlap, the developer doesn't consider any concurrency details during the execution cycle of `SourceReader`. You only need to pay attention to how to read data from the constructed split, then complete the data type conversion, and convert the external data type into BitSail’s Row type and pass it downstream. 
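+
+Putting these pieces together, a `Source` implementation might look like the sketch below. The method bodies reuse the `configure`, `getSourceBoundedness`, `createReader` and `createSplitCoordinator` snippets shown in the Source Connector Details document; the class names `FakeSourceSplit`, `EmptySourceState` and `FakeSourceSplitCoordinator`, the generic parameters, and the `READER_PARALLELISM_NUM` option are illustrative assumptions rather than the exact classes in the repository.
+
+```Java
+public class FakeSource implements Source<Row, FakeSourceSplit, EmptySourceState>, ParallelismComputable {
+
+  private BitSailConfiguration readerConfiguration;
+  private BitSailConfiguration commonConfiguration;
+
+  @Override
+  public void configure(ExecutionEnviron execution, BitSailConfiguration readerConfiguration) {
+    this.readerConfiguration = readerConfiguration;
+    this.commonConfiguration = execution.getCommonConfiguration();
+  }
+
+  @Override
+  public Boundedness getSourceBoundedness() {
+    // Batch jobs are bounded; streaming jobs are unbounded.
+    return Mode.BATCH.equals(Mode.getJobRunMode(commonConfiguration.get(CommonOptions.JOB_TYPE)))
+        ? Boundedness.BOUNDEDNESS
+        : Boundedness.UNBOUNDEDNESS;
+  }
+
+  @Override
+  public SourceReader<Row, FakeSourceSplit> createReader(SourceReader.Context readerContext) {
+    // The reader performs the actual data reading; see the Reader example below.
+    return new FakeSourceReader(readerConfiguration, readerContext);
+  }
+
+  @Override
+  public SourceSplitCoordinator<FakeSourceSplit, EmptySourceState> createSplitCoordinator(
+      SourceSplitCoordinator.Context<FakeSourceSplit, EmptySourceState> coordinatorContext) {
+    // The coordinator creates and assigns splits to readers.
+    return new FakeSourceSplitCoordinator(readerConfiguration, coordinatorContext);
+  }
+
+  @Override
+  public TypeInfoConverter createTypeInfoConverter() {
+    return new BitSailTypeInfoConverter();
+  }
+
+  @Override
+  public String getReaderName() {
+    return "fake";
+  }
+
+  @Override
+  public ParallelismAdvice getParallelismAdvice(BitSailConfiguration commonConf,
+                                                BitSailConfiguration selfConf,
+                                                ParallelismAdvice upstreamAdvice) {
+    // Illustrative: take a user-configured reader parallelism, defaulting to 1.
+    int parallelism = selfConf.getUnNecessaryOption(FakeReaderOptions.READER_PARALLELISM_NUM, 1);
+    return ParallelismAdvice.builder()
+        .adviceParallelism(parallelism)
+        .enforceDownStreamChain(true)
+        .build();
+  }
+}
+```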
+ +## Reader example + +```Java +public class FakeSourceReader extends SimpleSourceReaderBase { + + private final BitSailConfiguration readerConfiguration; + private final TypeInfo[] typeInfos; + + private final transient int totalCount; + private final transient RateLimiter fakeGenerateRate; + private final transient AtomicLong counter; + + private final FakeRowGenerator fakeRowGenerator; + + public FakeSourceReader(BitSailConfiguration readerConfiguration, Context context) { + this.readerConfiguration = readerConfiguration; + this.typeInfos = context.getTypeInfos(); + this.totalCount = readerConfiguration.get(FakeReaderOptions.TOTAL_COUNT); + this.fakeGenerateRate = RateLimiter.create(readerConfiguration.get(FakeReaderOptions.RATE)); + this.counter = new AtomicLong(); + this.fakeRowGenerator = new FakeRowGenerator(readerConfiguration, context.getIndexOfSubtask()); + } + + @Override + public void pollNext(SourcePipeline pipeline) throws Exception { + fakeGenerateRate.acquire(); + pipeline.output(fakeRowGenerator.fakeOneRecord(typeInfos)); + } + + @Override + public boolean hasMoreElements() { + return counter.incrementAndGet() <= totalCount; + } +} +``` + +# Sink Connector + +## Introduction + +![](../../images/community/connector_quick_start/sink_connector.png) + +- Sink: life cycle management of data writing components, mainly responsible for interaction with the framework, framing jobs, it does not participate in the actual execution of jobs. +- Writer: responsible for writing the received data to external storage. +- WriterCommitter (optional): Commit the data to complete the two-phase commit operation; realize the semantics of exactly-once. + +Developers first need to create a `Sink` class and implement the `Sink` interface, which is mainly responsible for the life cycle management of the data writing component and the construction of the job. Define the configuration of `writerConfiguration` through the `configure` method, perform data type conversion through the `createTypeInfoConverter` method, and write the internal type conversion to the external system, the same as the `Source` part. Then we define the `Writer` class to implement the specific data writing logic. When the `write` method is called, the BitSail Row type writes the data into the cache queue, and when the `flush` method is called, the data in the cache queue is flushed to the target data source. 
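+
+Before looking at the `Writer`, the sketch below shows the shape of the corresponding `Sink` class. It follows the `Sink` interface described in the Sink Connector Details document; the generic parameters and the hard-coded batch size and field names are illustrative assumptions, not the exact `PrintSink` code in the repository.
+
+```Java
+public class PrintSink implements Sink<Row, String, Integer> {
+
+  private BitSailConfiguration writerConfiguration;
+
+  @Override
+  public String getWriterName() {
+    return "print";
+  }
+
+  @Override
+  public void configure(BitSailConfiguration commonConfiguration, BitSailConfiguration writerConfiguration) {
+    this.writerConfiguration = writerConfiguration;
+  }
+
+  @Override
+  public Writer<Row, String, Integer> createWriter(Writer.Context<Integer> context) {
+    // Illustrative values; a real sink would derive these from writerConfiguration.
+    int batchSize = 10;
+    List<String> fieldNames = Arrays.asList("id", "name");
+    return new PrintWriter(batchSize, fieldNames);
+  }
+
+  @Override
+  public TypeInfoConverter createTypeInfoConverter() {
+    return new BitSailTypeInfoConverter();
+  }
+
+  // createCommitter() is optional; the default empty committer is enough when
+  // exactly-once semantics are not required.
+}
+```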
+ +## Writer example + +```Java +public class PrintWriter implements Writer { + private static final Logger LOG = LoggerFactory.getLogger(PrintWriter.class); + + private final int batchSize; + private final List fieldNames; + + private final List writeBuffer; + private final List commitBuffer; + + private final AtomicInteger printCount; + + public PrintWriter(int batchSize, List fieldNames) { + this(batchSize, fieldNames, 0); + } + + public PrintWriter(int batchSize, List fieldNames, int alreadyPrintCount) { + Preconditions.checkState(batchSize > 0, "batch size must be larger than 0"); + this.batchSize = batchSize; + this.fieldNames = fieldNames; + this.writeBuffer = new ArrayList<>(batchSize); + this.commitBuffer = new ArrayList<>(batchSize); + printCount = new AtomicInteger(alreadyPrintCount); + } + + @Override + public void write(Row element) { + String[] fields = new String[element.getFields().length]; + for (int i = 0; i < element.getFields().length; ++i) { + fields[i] = String.format("\"%s\":\"%s\"", fieldNames.get(i), element.getField(i).toString()); + } + + writeBuffer.add("[" + String.join(",", fields) + "]"); + if (writeBuffer.size() == batchSize) { + this.flush(false); + } + printCount.incrementAndGet(); + } + + @Override + public void flush(boolean endOfInput) { + commitBuffer.addAll(writeBuffer); + writeBuffer.clear(); + if (endOfInput) { + LOG.info("all records are sent to commit buffer."); + } + } + + @Override + public List prepareCommit() { + return commitBuffer; + } + + @Override + public List snapshotState(long checkpointId) { + return Collections.singletonList(printCount.get()); + } +} +``` + +# Register the connector into the configuration file + +Register a configuration file for your connector so that the framework can dynamically discover it at runtime. The configuration file is defined as follows: + +Taking hive as an example, developers need to add a json file in the resource directory. The example name is `bitsail-connector-hive.json`, as long as it does not overlap with other connectors. + +```Json +{ + "name": "bitsail-connector-hive", + "classes": [ + "com.bytedance.bitsail.connector.hive.source.HiveSource", + "com.bytedance.bitsail.connector.hive.sink.HiveSink" + ], + "libs": [ + "bitsail-connector-hive-${version}.jar" + ] +} +``` + +# Test module + +In the module where the Source or Sink connector is located, add an ITCase test case, and then support it according to the following process. 
+ +- Start data source through the test container + +![](../../images/community/connector_quick_start/test_container.png) + +- Write the corresponding configuration file + +```Json +{ + "job": { + "common": { + "job_id": 313, + "instance_id": 3123, + "job_name": "bitsail_clickhouse_to_print_test", + "user_name": "test" + }, + "reader": { + "class": "com.bytedance.bitsail.connector.clickhouse.source.ClickhouseSource", + "jdbc_url": "jdbc:clickhouse://localhost:8123", + "db_name": "default", + "table_name": "test_ch_table", + "split_field": "id", + "split_config": "{\"name\": \"id\", \"lower_bound\": 0, \"upper_bound\": \"10000\", \"split_num\": 3}", + "sql_filter": "( id % 2 == 0 )", + "columns": [ + { + "name": "id", + "type": "int64" + }, + { + "name": "int_type", + "type": "int32" + }, + { + "name": "double_type", + "type": "float64" + }, + { + "name": "string_type", + "type": "string" + }, + { + "name": "p_date", + "type": "date" + } + ] + }, + "writer": { + "class": "com.bytedance.bitsail.connector.legacy.print.sink.PrintSink" + } + } +} +``` + +- Submit the job through EmbeddedFlinkCluster.submit method + +```Java +@Test +public void testClickhouseToPrint() throws Exception { + BitSailConfiguration jobConf = JobConfUtils.fromClasspath("clickhouse_to_print.json"); + EmbeddedFlinkCluster.submitJob(jobConf); +} +``` + +# Submit your PR + +After the developer implements his own Connector, he can associate his own issue and submit a PR to github. Before submitting, the developer remembers to add documents to the Connector. After passing the review, the Connector contributed by everyone becomes a part of BitSail. We follow the level of contribution and select active Contributors to become our Committers and participate in major decisions in the BitSail community. We hope that everyone will actively participate! \ No newline at end of file diff --git a/website/en/community/contribute.md b/website/en/community/contribute.md index f020ccbaf..5e26567a1 100644 --- a/website/en/community/contribute.md +++ b/website/en/community/contribute.md @@ -1,9 +1,13 @@ --- order: 2 --- + # Contributor Guide + English | [简体中文](../../zh/community/contribute.md) +----- + BitSail community welcomes contributions from anyone! ## How To Get Involved @@ -68,6 +72,7 @@ If it is the first time to submit a pull request, you can read this doc [About P - Commit changes to the branch and push to the fork repo - Create a pull request to the ***BitSail*** repo +If you are a freshman to open source projects, you can read [How to submit a github pull request](pr_guide.md) for a more detailed guide. ## Ask for a code review After you have your pull request ready, with all the items from the pull request checklist being completed. Tag a committer to review you pull request. diff --git a/website/en/community/mailing.md b/website/en/community/mailing.md index c620e0fb9..578e0ee1c 100644 --- a/website/en/community/mailing.md +++ b/website/en/community/mailing.md @@ -1,8 +1,13 @@ --- -order: 3 +order: 5 --- + # Subscribe Mailing Lists +English | [简体中文](../../zh/community/mailing.md) + +----- + Currently, BitSail community use Google Group as the mailing list provider. 
You need to subscribe to the mailing list before starting a conversation diff --git a/website/en/community/pr_guide.md b/website/en/community/pr_guide.md new file mode 100644 index 000000000..cd9a0b709 --- /dev/null +++ b/website/en/community/pr_guide.md @@ -0,0 +1,294 @@ +--- +order: 3 +--- + +# Pull Request Guide + +English | [简体中文](../../zh/community/pr_guide.md) + +----- + +![](../../images/community/pr_guide/repository_structure.png) + +## Fork BitSail to your repository + +![](../../images/community/pr_guide/repository_fork.png) + +## Git account configuration + + The role of user name and email address: User name and email address are variables of the local git client. Each commit will be recorded with the user name and email address. Github's contribution statistics are based on email addresses. + + Check your account and email address: + +```Bash +$ git config user.name +$ git config user.email +``` + + If you are using git for the first time, or need to modify account, execute the following command, replacing the username and email address with your own. + +```Bash +$ git config --global user.name "username" +$ git config --global user.email "your_email@example.com" +``` + +## Clone the Fork repository to local + + You can choose HTTPS or SSH mode, and the following operations will use SSH mode as an example. If you use HTTPS mode, you only need to replace all the SSH url in the command with HTTPS url. + +### HTTPS + +```Bash +$ git clone git@github.com:{your_github_id}/bitsail.git +``` + +### SSH + +```Bash +$ git clone https://github.com/{your_github_id}/bitsail.git +``` + +![](../../images/community/pr_guide/git_clone_example.png) + +## Set origin and upstream + +```Bash +$ git remote add origin git@github.com:{your_github_id}/bitsail.git +$ git remote add upstream git@github.com:bytedance/bitsail.git +$ git remote -v +origin git@github.com:{your_github_id}/bitsail.git (fetch) +origin git@github.com:{your_github_id}/bitsail.git (push) +upstream git@github.com:bytedance/bitsail.git (fetch) +upstream git@github.com:bytedance/bitsail.git (push) +``` + + If the `origin` setting of `git` is wrong, you can execute `git `*`remote`*` rm `*`origin`* to clear and reset it. + + The `upstream` is the same, setting errors can be cleared by `git `*`remote`*` rm `*`upstream`* and reset. + +## Create your working branch + +```Bash +// view all branches +$ git branch -a +// Create a new loacl branch +$ git branch {your_branch_name} +// switch to new branch +$ git checkout {your_branch_name} +// Push the local branch to the fork repository +$ git push -u origin +``` + + Branch name example: add-sink-connector-redis + + After that, you can write and test the code in your own working branch, and synchronize it to your personal branch in time. + +```Bash +$ git add . +$ git commit -m "[BitSail] Message" +$ git push -u origin +``` + +## Synchronize source code + + BitSail will carefully consider the update and iteration of the interface or version. If the developer has a short development cycle, he can do a synchronization with the original warehouse before submitting the code. However, if unfortunately encountering a major version change, the developer can follow up at any time Changes to the original repository. + + Here, in order to ensure the cleanness of the code branch, it is recommended to use the rebase method for merging. 
+ +```Bash +$ git fetch upstream +$ git rebase upstream/master +``` + + During the rebase process, file conflicts may be reported + + For example, in the following situation, we need to manually merge the conflicting files: `bitsail-connectors/pom.xml` + +```Bash +$ git rebase upstream/master +Auto-merging bitsail-dist/pom.xml +Auto-merging bitsail-connectors/pom.xml +CONFLICT (content): Merge conflict in bitsail-connectors/pom.xml +error: could not apply 054a4d3... [BitSail] Migrate hadoop source&sink to v1 interface +Resolve all conflicts manually, mark them as resolved with +"git add/rm ", then run "git rebase --continue". +You can instead skip this commit: run "git rebase --skip". +To abort and get back to the state before "git rebase", run "git rebase --abort". +Could not apply 054a4d3... [BitSail] Migrate hadoop source&sink to v1 interface +``` + + The conflicting parts are shown below, bounded by `=======`, decide whether you want to keep only the changes of the branch, only the changes of the other branch, or make completely new changes (possibly containing changes of both branches). Remove the conflict markers `<<<<<<<`, `=======`, `>>>>>>>` and make the desired changes in the final merge. + +```Plain + + bitsail-connectors-legacy + connector-print + connector-elasticsearch + connector-fake + connector-base + connector-doris + connector-kudu + connector-rocketmq + connector-redis + connector-clickhouse +<<<<<<< HEAD + connector-druid +======= + connector-hadoop +>>>>>>> 054a4d3 ([BitSail] Migrate hadoop source&sink to v1 interface) + +``` + + After combine: + +```Plain + + bitsail-connectors-legacy + connector-print + connector-elasticsearch + connector-fake + connector-base + connector-doris + connector-kudu + connector-rocketmq + connector-redis + connector-clickhouse + connector-druid + connector-hadoop + +``` + + Execute `git add ` after combine: + +```Bash +$ git add bitsail-connectors/pom.xml +$ git rebase --continue +``` + + Afterwards, the following window will appear. This is the Vim editing interface. The editing mode can be done according to Vim. Usually we only need to edit the Commit information on the first line, or not. After completion, follow the exit method of Vim, and press`: w q ↵`。 + +![](../../images/community/pr_guide/git_rebase_example.png) + + After that, the following appears to indicate that the rebase is successful. 
+ +```Bash +$ git rebase --continue +[detached HEAD 9dcf4ee] [BitSail] Migrate hadoop source&sink to v1 interface + 15 files changed, 766 insertions(+) + create mode 100644 bitsail-connectors/connector-hadoop/pom.xml + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/constant/HadoopConstants.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/error/TextInputFormatErrorCode.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/format/HadoopDeserializationSchema.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/option/HadoopReaderOptions.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/sink/HadoopSink.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/sink/HadoopWriter.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/HadoopSource.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/coordinator/HadoopSourceSplitCoordinator.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/reader/HadoopSourceReader.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/reader/HadoopSourceReaderCommonBasePlugin.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/split/HadoopSourceSplit.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/resources/bitsail-connector-unified-hadoop.json +Successfully rebased and updated refs/heads/add-v1-connector-hadoop. +``` + + At this point, we can see that our `commit` has been mentioned on the front: + +![](../../images/community/pr_guide/commit_info.png) + + The code may not be pushed normally after rebase: + +```Bash +$ git push +To github.com:love-star/bitsail.git + ! [rejected] add-v1-connector-hadoop -> add-v1-connector-hadoop (non-fast-forward) +error: failed to push some refs to 'github.com:love-star/bitsail.git' +hint: Updates were rejected because the tip of your current branch is behind +hint: its remote counterpart. Integrate the remote changes (e.g. +hint: 'git pull ...') before pushing again. +hint: See the 'Note about fast-forwards' in 'git push --help' for details. +``` + + At this time, `git push -f` is required to force the push. Forced push is a risky operation. Please check carefully before the operation to avoid the problem that irrelevant code is forcibly overwritten. + +```Bash +git push -f +Enumerating objects: 177, done. +Counting objects: 100% (177/177), done. +Delta compression using up to 12 threads +Compressing objects: 100% (110/110), done. +Writing objects: 100% (151/151), 26.55 KiB | 1.40 MiB/s, done. +Total 151 (delta 40), reused 0 (delta 0), pack-reused 0 +remote: Resolving deltas: 100% (40/40), completed with 10 local objects. +To github.com:love-star/bitsail.git + + adb90f4...b72d931 add-v1-connector-hadoop -> add-v1-connector-hadoop (forced update) +``` + + At this point, the branch has been synchronized with the upstream repository, and subsequent code writing will be based on the latest. 
+ +## Submit your code + + When the developer completes the development, he first needs to complete a `rebase` of the warehouse. For details, refer to the scenario of `synchronizing source code`. After rebase, git's history looks like this: + +![](../../images/community/pr_guide/git_history.png) + + As shown on Github + +![](../../images/community/pr_guide/github_status.png) + + We hope to keep only one Commit for each PR to ensure the cleanness of the branch. If there are multiple commits, they can be merged into one commit in the end. The specific operation is as follows: + +```Bash +git reset --soft HEAD~N(N is the reset submit number) +git add . +git commit -m "[BitSail] Message" +git push -f +``` + + example: + +```Bash +$ git reset --soft HEAD~4 +$ git add . +$ git commit -m "[BitSail] Migrate hadoop source&sink to v1 interface" +$ git push -f +``` + + After the reset: + +![](../../images/community/pr_guide/after_git_reset.png) + +## Submit your PR + +![](../../images/community/pr_guide/github_pr.png) + + When submitting PR, you should pay attention to the specifications of Commit message and PR message: + +![](../../images/community/pr_guide/create_pr.png) + +### Commit message specification + +1. Create a Github issue or claim an existing issue +2. Describe what you would like to do in the issue description. +3. Include the issue number in the commit message. The format follows below. + +```Plain +[BitSail#${IssueNumber}][${Module}] Description +[BitSail#1234][Connector] Improve reader split algorithm to Kudu source connector + +//For Minor change +[Minor] Description +``` + +1. List of module. Chose the most related one if your changes affect multiple modules. e.g. If you are adding a feature to the kafka connector and end up modifying code in common, components and cores, you should still use the [Connector] as module name. + +```Plain +[Common] bitsail-common +[Core] base client component cores +[Connector] all connector related changes +[Doc] documentation or java doc changes +[Build] build, dependency changes +``` + +### PR message specification + + The PR message should summarize the cause and effect of the problem clearly. If there is a corresponding issue, the issue address should be attached to ensure that the problem is traceable. \ No newline at end of file diff --git a/website/en/community/release_guide.md b/website/en/community/release_guide.md new file mode 100644 index 000000000..a455da946 --- /dev/null +++ b/website/en/community/release_guide.md @@ -0,0 +1,111 @@ +--- +order: 4 +--- + +# BitSail Release Guide + +English | [简体中文](../../zh/community/release_guide.md) + +----- + +## Procedure to submit a pull request + +SOP to submit a new commit + +1. Create a Github issue or claim an existing issue +2. Describe what you would like to do in the issue description. +3. Include the issue number in the commit message. The format follows below. + +```Plain +[BitSail#${IssueNumber}][${Module}] Description +[BitSail#1234][Connector] Improve reader split algorithm to Kudu source connector + +//For Minor change +[Minor] Description +``` + +4. List of module. Chose the most related one if your changes affect multiple modules. e.g. If you are adding a feature to kafka connector and end up modifying code in common, components and cores, you should still use the [Connector] as module name. 
+ +```Plain +[Common] bitsail-common +[Core] base client component cores +[Connector] all connector related changes +[Doc] documentation or java doc changes +[Build] build, dependency changes +``` + +## Procedure to release + +![img](../../images/community/release_guide/release_procedure.png) + +### 1. Decide to release + +Because we don't have many users subscribed to the mailing list for now. Using Github issue to discuss release related topic should have better visibility. + +We could start a new discussion on Github with topics like + +`0.1.0` Release Discussion + +Deciding to release and selecting a Release Manager is the first step of the release process. This is a consensus-based decision of the entire community. + +Anybody can propose a release on the Github issue, giving a solid argument and nominating a committer as the Release Manager (including themselves). There’s no formal process, no vote requirements, and no timing requirements. Any objections should be resolved by consensus before starting the release. + +### 2. Prepare for the relase + +A. Triage release-blocking issues + +B. Review and update documentation + +C. Cross team testing + +D. Review Release Notes + +E. Verify build and tests + +F. Create a release branch + +G. Bump the version of the master + +### 3. Build a release candidate + +Since we don't have a maven central access for now, we will build a release candidate on github and let other users test it. + +A. Add git release tag + +B. Publish on Github for the public to download + +### 4. Vote on the release candidate + +Once the release branch is ready and release candidate is available on Github. The release manager will ask other committers to test the release candidate and start a vote on corresponding Github Issue. We need 3 blinding votes from PMC members at least. + +### 5. Fix issue + +Any issues identified during the community review and vote should be fixed in this step. + +Code changes should be proposed as standard pull requests to the master branch and reviewed using the normal contributing process. Then, relevant changes should be cherry-picked into the release branch. The cherry-pick commits should then be proposed as the pull requests against the release branch, again reviewed and merged using the normal contributing process. + +Once all issues have been resolved, you should go back and build a new release candidate with these changes. + +### 6. Finalize the release + +Once the release candidate passes the voting, we could finalize the release. + +A. Change the release branch version from `x.x.x-rc1` to `x.x.x`. e.g. `0.1.0-rc1` to `0.1.0` + +B. `git commit -am "[MINOR] Update release version to reflect published version ${RELEASE_VERSION}"` + +C. Push to the release branch + +D. Resolve related Github issues + +E. Create a new Github release, off the release version tag, you pushed before + +### 7. Promote the release + +24 hours after we publish the release, promote the release on all the community channels, including WeChat, slack, mailing list. 
+ +### Reference: + +Flink release guide: [Creating a Flink Release](https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release) + +Hudi release guide: [Apache Hudi Release Guide](https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide) \ No newline at end of file diff --git a/website/en/community/sink_connector_detail.md b/website/en/community/sink_connector_detail.md new file mode 100644 index 000000000..41ee8b29e --- /dev/null +++ b/website/en/community/sink_connector_detail.md @@ -0,0 +1,398 @@ +--- +order: 9 +--- + +# Sink Connector Details + +English | [简体中文](../../zh/community/sink_connector_detail.md) + +----- + +## Introduction + +![](../../images/community/connector_quick_start/sink_connector.png) + +- Sink: life cycle management of data writing components, mainly responsible for interaction with the framework, framing jobs, it does not participate in the actual execution of jobs. +- Writer: responsible for writing the received data to external storage. +- WriterCommitter (optional): Commit the data to complete the two-phase commit operation; realize the semantics of exactly-once. + +Developers first need to create a `Sink` class and implement the `Sink interface`, which is mainly responsible for the life cycle management of the data writing component and the construction of the job. Define the configuration of `writerConfiguration` through the configure method, perform data type conversion through the `createTypeInfoConverter` method, and `write` the internal type conversion to the external system, the same as the `Source` part. Then we define the `Writer` class to implement the specific data writing logic. When the `write` method is called, the `BitSail Row` type writes the data into the cache queue, and when the `flush` method is called, the data in the cache queue is flushed to the target data source. + +## Sink + +The life cycle management of the data writing component is mainly responsible for the interaction with the framework and the construction of the job. It does not participate in the actual execution of the job. + +For each Sink task, we need to implement a class that inherits the Sink interface. + +![](../../images/community/sink_connector/sink_diagram.png) + +### Sink Interface + +```Java +public interface Sink extends Serializable { + + /** + * @return The name of writer operation. + */ + String getWriterName(); + + /** + * Configure writer with user defined options. + * + * @param commonConfiguration Common options. + * @param writerConfiguration Options for writer. + */ + void configure(BitSailConfiguration commonConfiguration, BitSailConfiguration writerConfiguration) throws Exception; + + /** + * Create a writer for processing elements. + * + * @return An initialized writer. + */ + Writer createWriter(Writer.Context context) throws IOException; + + /** + * @return A converter which supports conversion from BitSail {@link TypeInfo} + * and external engine type. + */ + default TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); + } + + /** + * @return A committer for commit committable objects. + */ + default Optional> createCommitter() { + return Optional.empty(); + } + + /** + * @return A serializer which convert committable object to byte array. + */ + default BinarySerializer getCommittableSerializer() { + return new SimpleBinarySerializer(); + } + + /** + * @return A serializer which convert state object to byte array. 
+ */ + default BinarySerializer getWriteStateSerializer() { + return new SimpleBinarySerializer(); + } +} +``` + +### configure method + +Responsible for configuration initialization, usually extracting necessary configuration from commonConfiguration and writerConfiguration. + +#### example + +ElasticsearchSink: + +```Java +public void configure(BitSailConfiguration commonConfiguration, BitSailConfiguration writerConfiguration) { + writerConf = writerConfiguration; +} +``` + +### createWriter method + +Responsible for generating a connector Writer class inherited from the Writer interface. Pass in construction configuration parameters as needed, and note that the passed in parameters must be serializable. + +```Java +@Override +public Writer createWriter(Writer.Context context) { + return new RedisWriter<>(redisOptions, jedisPoolOptions); +} +``` + +### createTypeInfoConverter method + +Type conversion, convert the internal type and write it to the external system, same as the Source part. + +### createCommitter method + +The optional method is to write the specific data submission logic, which is generally used in scenarios where the data exactly-once semantics needs to be guaranteed. After the writer completes the data writing, the committer completes the submission, and then realizes the two-phase submission. For details, please refer to the implementation of Doris Connector. + +## Writer + +specific data write logic + +![](../../images/community/sink_connector/writer_diagram.png) + +### Writer Interface + +```Java +public interface Writer extends Serializable, Closeable { + + /** + * Output an element to target source. + * + * @param element Input data from upstream. + */ + void write(InputT element) throws IOException; + + /** + * Flush buffered input data to target source. + * + * @param endOfInput Flag indicates if all input data are delivered. + */ + void flush(boolean endOfInput) throws IOException; + + /** + * Prepare commit information before snapshotting when checkpoint is triggerred. + * + * @return Information to commit in this checkpoint. + * @throws IOException Exceptions encountered when preparing committable information. + */ + List prepareCommit() throws IOException; + + /** + * Do snapshot for at each checkpoint. + * + * @param checkpointId The id of checkpoint when snapshot triggered. + * @return The current state of writer. + * @throws IOException Exceptions encountered when snapshotting. + */ + default List snapshotState(long checkpointId) throws IOException { + return Collections.emptyList(); + } + + /** + * Closing writer when operator is closed. + * + * @throws IOException Exception encountered when closing writer. + */ + default void close() throws IOException { + + } + + interface Context extends Serializable { + + TypeInfo[] getTypeInfos(); + + int getIndexOfSubTaskId(); + + boolean isRestored(); + + List getRestoreStates(); + } +} +``` + +### Construction method + +Initialize the connection object of the data source according to the configuration, and establish a connection with the target data source. 
+ +#### example + +```Java +public RedisWriter(BitSailConfiguration writerConfiguration) { + // initialize ttl + int ttl = writerConfiguration.getUnNecessaryOption(RedisWriterOptions.TTL, -1); + TtlType ttlType; + try { + ttlType = TtlType.valueOf(StringUtils.upperCase(writerConfiguration.get(RedisWriterOptions.TTL_TYPE))); + } catch (IllegalArgumentException e) { + throw BitSailException.asBitSailException(RedisPluginErrorCode.ILLEGAL_VALUE, + String.format("unknown ttl type: %s", writerConfiguration.get(RedisWriterOptions.TTL_TYPE))); + } + int ttlInSeconds = ttl < 0 ? -1 : ttl * ttlType.getContainSeconds(); + log.info("ttl is {}(s)", ttlInSeconds); + + // initialize commandDescription + String redisDataType = StringUtils.upperCase(writerConfiguration.get(RedisWriterOptions.REDIS_DATA_TYPE)); + String additionalKey = writerConfiguration.getUnNecessaryOption(RedisWriterOptions.ADDITIONAL_KEY, "default_redis_key"); + this.commandDescription = initJedisCommandDescription(redisDataType, ttlInSeconds, additionalKey); + this.columnSize = writerConfiguration.get(RedisWriterOptions.COLUMNS).size(); + + // initialize jedis pool + JedisPoolConfig jedisPoolConfig = new JedisPoolConfig(); + jedisPoolConfig.setMaxTotal(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MAX_TOTAL_CONNECTIONS)); + jedisPoolConfig.setMaxIdle(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MAX_IDLE_CONNECTIONS)); + jedisPoolConfig.setMinIdle(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MIN_IDLE_CONNECTIONS)); + jedisPoolConfig.setMaxWait(Duration.ofMillis(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MAX_WAIT_TIME_IN_MILLIS))); + + String redisHost = writerConfiguration.getNecessaryOption(RedisWriterOptions.HOST, RedisPluginErrorCode.REQUIRED_VALUE); + int redisPort = writerConfiguration.getNecessaryOption(RedisWriterOptions.PORT, RedisPluginErrorCode.REQUIRED_VALUE); + String redisPassword = writerConfiguration.get(RedisWriterOptions.PASSWORD); + int timeout = writerConfiguration.get(RedisWriterOptions.CLIENT_TIMEOUT_MS); + + if (StringUtils.isEmpty(redisPassword)) { + this.jedisPool = new JedisPool(jedisPoolConfig, redisHost, redisPort, timeout); + } else { + this.jedisPool = new JedisPool(jedisPoolConfig, redisHost, redisPort, timeout, redisPassword); + } + + // initialize record queue + int batchSize = writerConfiguration.get(RedisWriterOptions.WRITE_BATCH_INTERVAL); + this.recordQueue = new CircularFifoQueue<>(batchSize); + + this.logSampleInterval = writerConfiguration.get(RedisWriterOptions.LOG_SAMPLE_INTERVAL); + this.jedisFetcher = RetryerBuilder.newBuilder() + .retryIfResult(Objects::isNull) + .retryIfRuntimeException() + .withStopStrategy(StopStrategies.stopAfterAttempt(3)) + .withWaitStrategy(WaitStrategies.exponentialWait(100, 5, TimeUnit.MINUTES)) + .build() + .wrap(jedisPool::getResource); + + this.maxAttemptCount = writerConfiguration.get(RedisWriterOptions.MAX_ATTEMPT_COUNT); + this.retryer = RetryerBuilder.newBuilder() + .retryIfResult(needRetry -> Objects.equals(needRetry, true)) + .retryIfException(e -> !(e instanceof BitSailException)) + .withWaitStrategy(WaitStrategies.fixedWait(3, TimeUnit.SECONDS)) + .withStopStrategy(StopStrategies.stopAfterAttempt(maxAttemptCount)) + .build(); +} +``` + +### write method + +When this method is called, the BitSail Row type data will be written to the cache queue, and various formats of Row type data can also be preprocessed here. If the size of the cache queue is set here, then flush is called after the cache queue is full. 
+ +#### example + +redis:Store data in `BitSail Row` format directly in a cache queue of a certain size + +```Java +public void write(Row record) throws IOException { + validate(record); + this.recordQueue.add(record); + if (recordQueue.isAtFullCapacity()) { + flush(false); + } +} +``` + +Druid:Preprocess the data in `BitSail Row` format and convert it into `StringBuffer` for storage. + +```Java +@Override +public void write(final Row element) { + final StringJoiner joiner = new StringJoiner(DEFAULT_FIELD_DELIMITER, "", ""); + for (int i = 0; i < element.getArity(); i++) { + final Object v = element.getField(i); + if (v != null) { + joiner.add(v.toString()); + } + } + // timestamp column is a required field to add in Druid. + // See https://druid.apache.org/docs/24.0.0/ingestion/data-model.html#primary-timestamp + joiner.add(String.valueOf(processTime)); + data.append(joiner); + data.append(DEFAULT_LINE_DELIMITER); +} +``` + +### flush method + +This method mainly implements flushing the data in the cache of the `write` method to the target data source. + +#### example + +redis: flush the BitSail Row format data in the cache queue to the target data source. + +```Java +public void flush(boolean endOfInput) throws IOException { + processorId++; + try (PipelineProcessor processor = genPipelineProcessor(recordQueue.size(), this.complexTypeWithTtl)) { + Row record; + while ((record = recordQueue.poll()) != null) { + + String key = (String) record.getField(0); + String value = (String) record.getField(1); + String scoreOrHashKey = value; + if (columnSize == SORTED_SET_OR_HASH_COLUMN_SIZE) { + value = (String) record.getField(2); + // Replace empty key with additionalKey in sorted set and hash. + if (key.length() == 0) { + key = commandDescription.getAdditionalKey(); + } + } + + if (commandDescription.getJedisCommand() == JedisCommand.ZADD) { + // sorted set + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), parseScoreFromString(scoreOrHashKey), value.getBytes())); + } else if (commandDescription.getJedisCommand() == JedisCommand.HSET) { + // hash + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), scoreOrHashKey.getBytes(), value.getBytes())); + } else if (commandDescription.getJedisCommand() == JedisCommand.HMSET) { + //mhset + if ((record.getArity() - 1) % 2 != 0) { + throw new BitSailException(CONVERT_NOT_SUPPORT, "Inconsistent data entry."); + } + List datas = Arrays.stream(record.getFields()) + .collect(Collectors.toList()).stream().map(o -> ((String) o).getBytes()) + .collect(Collectors.toList()).subList(1, record.getFields().length); + Map map = new HashMap<>((record.getArity() - 1) / 2); + for (int index = 0; index < datas.size(); index = index + 2) { + map.put(datas.get(index), datas.get(index + 1)); + } + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), map)); + } else { + // set and string + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), value.getBytes())); + } + } + retryer.call(processor::run); + } catch (ExecutionException | RetryException e) { + if (e.getCause() instanceof BitSailException) { + throw (BitSailException) e.getCause(); + } else if (e.getCause() instanceof RedisUnexpectedException) { + throw (RedisUnexpectedException) e.getCause(); + } + throw e; + } catch (IOException e) { + throw new RuntimeException("Error while init jedis client.", e); + } +} +``` + +Druid: Submit the sink job to the data source using HTTP post. 
+ +```Java +private HttpURLConnection provideHttpURLConnection(final String coordinatorURL) throws IOException { + final URL url = new URL("http://" + coordinatorURL + DRUID_ENDPOINT); + final HttpURLConnection con = (HttpURLConnection) url.openConnection(); + con.setRequestMethod("POST"); + con.setRequestProperty("Content-Type", "application/json"); + con.setRequestProperty("Accept", "application/json, text/plain, */*"); + con.setDoOutput(true); + return con; + } + + public void flush(final boolean endOfInput) throws IOException { + final ParallelIndexIOConfig ioConfig = provideDruidIOConfig(data); + final ParallelIndexSupervisorTask indexTask = provideIndexTask(ioConfig); + final String inputJSON = provideInputJSONString(indexTask); + final byte[] input = inputJSON.getBytes(); + try (final OutputStream os = httpURLConnection.getOutputStream()) { + os.write(input, 0, input.length); + } + try (final BufferedReader br = + new BufferedReader(new InputStreamReader(httpURLConnection.getInputStream(), StandardCharsets.UTF_8))) { + final StringBuilder response = new StringBuilder(); + String responseLine; + while ((responseLine = br.readLine()) != null) { + response.append(responseLine.trim()); + } + LOG.info("Druid write task has been sent, and the response is {}", response); + } + } +``` + +### close method + +Closes any previously created target data source connection objects. + +#### example + +```Java +public void close() throws IOException { + bulkProcessor.close(); + restClient.close(); + checkErrorAndRethrow(); +} +``` \ No newline at end of file diff --git a/website/en/community/source_connector_detail.md b/website/en/community/source_connector_detail.md new file mode 100644 index 000000000..33c7a5871 --- /dev/null +++ b/website/en/community/source_connector_detail.md @@ -0,0 +1,1549 @@ +--- +order: 8 +--- + +# Source Connector Details + +English | [简体中文](../../zh/community/source_connector_detail.md) + +----- + +## Introduction + +![](../../images/community/connector_quick_start/bitsail_model.png) + +- Source: The life cycle management component of the data reading component is mainly responsible for interacting with the framework, structuring the job, and not participating in the actual execution of the job. +- SourceSplit: Source data split, the core purpose of the big data processing framework is to split large-scale data into multiple reasonable Splits +- State:Job status snapshot, when the checkpoint is enabled, the current execution status will be saved. +- SplitCoordinator: SplitCoordinator assumes the role of creating and managing Split. +- SourceReader: The component that is actually responsible for data reading will read the data after receiving the Split, and then transmit the data to the next operator. + +## Source + +The life cycle management of the data reading component is mainly responsible for the interaction with the framework and the construction of the job, and it does not participate in the actual execution of the job. + +Take RocketMQSource as an example: the Source method needs to implement the Source and ParallelismComputable interfaces. + +![](../../images/community/source_connector/source_diagram.png) + +### Source Interface + +```Java +public interface Source extends Serializable { + + /** + * Run in client side for source initialize; + */ + void configure(ExecutionEnviron execution, BitSailConfiguration readerConfiguration) throws IOException; + + /** + * Indicate the Source type. + */ + Boundedness getSourceBoundedness(); + + /** + * Create Source Reader. 
+ */ + SourceReader createReader(SourceReader.Context readerContext); + + /** + * Create split coordinator. + */ + SourceSplitCoordinator createSplitCoordinator(SourceSplitCoordinator.Context coordinatorContext); + + /** + * Get Split serializer for the framework,{@link SplitT}should implement from {@link Serializable} + */ + default BinarySerializer getSplitSerializer() { + return new SimpleBinarySerializer<>(); + } + + /** + * Get State serializer for the framework, {@link StateT}should implement from {@link Serializable} + */ + default BinarySerializer getSplitCoordinatorCheckpointSerializer() { + return new SimpleBinarySerializer<>(); + } + + /** + * Create type info converter for the source, default value {@link BitSailTypeInfoConverter} + */ + default TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); + } + + /** + * Get Source's name. + */ + String getReaderName(); +} +``` + +#### configure method + +We mainly do the distribution and extraction of some client configurations, and can operate on the configuration of the runtime environment `ExecutionEnviron` and `readerConfiguration`. + +##### example + +```Java +@Override +public void configure(ExecutionEnviron execution, BitSailConfiguration readerConfiguration) { + this.readerConfiguration = readerConfiguration; + this.commonConfiguration = execution.getCommonConfiguration(); +} +``` + +#### getSourceBoundedness method + +Set the processing method of the job, which is to use the stream processing method, the batch processing method, or the stream-batch unified processing method. In the stream-batch integrated examples, we need to set different processing methods according to different types of jobs。 + +| Job Type | Boundedness | +| -------- | --------------------------- | +| batch | Boundedness.*BOUNDEDNESS* | +| stream | Boundedness.*UNBOUNDEDNESS* | + +##### Unified example + +```Java +@Override +public Boundedness getSourceBoundedness() { + return Mode.BATCH.equals(Mode.getJobRunMode(commonConfiguration.get(CommonOptions.JOB_TYPE))) ? + Boundedness.BOUNDEDNESS : + Boundedness.UNBOUNDEDNESS; +} +``` + +##### Batch example + +```Java +public Boundedness getSourceBoundedness() { + return Boundedness.BOUNDEDNESS; +} +``` + +#### createTypeInfoConverter method + +A type converter used to specify the Source connector; we know that most external data systems have their own type definitions, and their definitions will not be completely consistent with BitSail’s type definitions; in order to simplify the conversion of type definitions, we support the relationship between the two being mapped through the configuration file, thereby simplifying the development of the configuration file. + +It is the parsing of the `columns` in the `reader` part of the task description Json file. The type of different fields in the `columns` will be parsed from the `ClickhouseReaderOptions.COLUMNS` field to `readerContext.getTypeInfos()` according to the above description file。 + +##### example + +- `BitSailTypeInfoConverter` + - Default `TypeInfoConverter`,Directly parse the string of the `ReaderOptions.COLUMNS` field, what type is in the `COLUMNS` field, and what type is in `TypeInfoConverter`. +- `FileMappingTypeInfoConverter` + - It will bind the `{readername}-type-converter.yaml` file during BitSail type system conversion to map the database field type and BitSail type. The `ReaderOptions.COLUMNS` field will be mapped to `TypeInfoConverter` after being converted by this mapping file. 
+ +###### FileMappingTypeInfoConverter + +Databases connected through JDBC, including MySql, Oracle, SqlServer, Kudu, ClickHouse, etc. The characteristic of the data source here is to return the obtained data in the form of `java.sql.ResultSet` interface. For this type of database, we often design the `TypeInfoConverter` object as `FileMappingTypeInfoConverter`. This object will be bound to `{readername}-type-converter.yaml` file during BitSail type system conversion, which is used to map the database field type and BitSail type. + +```Java +@Override +public TypeInfoConverter createTypeInfoConverter() { + return new FileMappingTypeInfoConverter(getReaderName()); +} +``` + +For the parsing of the `{readername}-type-converter.yaml` file, take `clickhouse-type-converter.yaml` as an example. + +```Plain +# Clickhouse Type to BitSail Type +engine.type.to.bitsail.type.converter: + + - source.type: int32 + target.type: int + + - source.type: float64 + target.type: double + + - source.type: string + target.type: string + + - source.type: date + target.type: date.date + + - source.type: null + target.type: void + +# BitSail Type to Clickhouse Type +bitsail.type.to.engine.type.converter: + + - source.type: int + target.type: int32 + + - source.type: double + target.type: float64 + + - source.type: date.date + target.type: date + + - source.type: string + target.type: string +``` + +The role of this file is to analyze the `columns` in the `reader `part of the job description json file. The types of different fields in the `columns` will be parsed from the `ClickhouseReaderOptions.COLUMNS` field to `readerContext.getTypeInfos()` according to the above description file. + +```Json +"reader": { + "class": "com.bytedance.bitsail.connector.clickhouse.source.ClickhouseSource", + "jdbc_url": "jdbc:clickhouse://localhost:8123", + "db_name": "default", + "table_name": "test_ch_table", + "split_field": "id", + "split_config": "{\"name\": \"id\", \"lower_bound\": 0, \"upper_bound\": \"10000\", \"split_num\": 3}", + "sql_filter": "( id % 2 == 0 )", + "columns": [ + { + "name": "id", + "type": "int64" + }, + { + "name": "int_type", + "type": "int32" + }, + { + "name": "double_type", + "type": "float64" + }, + { + "name": "string_type", + "type": "string" + }, + { + "name": "p_date", + "type": "date" + } + ] +}, +``` + +![](../../images/community/source_connector/file_mapping_converter.png) + +This method is not only applicable to databases, but also applicable to all scenarios that require type mapping between the engine side and the BitSail side during type conversion. + +###### BitSailTypeInfoConverter + +Usually, the default method is used for type conversion, and the string is directly parsed for the `ReaderOptions.COLUMNS`field. + +```Java +@Override +public TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); +} +``` + +BitSailTypeInfoConverter + +Usually, the default method is used for type conversion, and the string is directly parsed for the `ReaderOptions.COLUMNS` field. 
+ +```Java +@Override +public TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); +} +``` + +Take Hadoop as an example: + +```Json +"reader": { + "class": "com.bytedance.bitsail.connector.hadoop.source.HadoopSource", + "path_list": "hdfs://127.0.0.1:9000/test_namespace/source/test.json", + "content_type":"json", + "reader_parallelism_num": 1, + "columns": [ + { + "name":"id", + "type": "int" + }, + { + "name": "string_type", + "type": "string" + }, + { + "name": "map_string_string", + "type": "map" + }, + { + "name": "array_string", + "type": "list" + } + ] +} +``` + +![](../../images/community/source_connector/bitsail_converter.png) + +#### createSourceReader method + +Write the specific data reading logic. The component responsible for data reading will read the data after receiving the Split, and then transmit the data to the next operator. + +The specific parameters passed to construct `SourceReader` are determined according to requirements, but it must be ensured that all parameters can be serialized. If it is not serializable, an error will occur when `createJobGraph` is created. + +##### example + +```Java +public SourceReader createReader(SourceReader.Context readerContext) { + return new RocketMQSourceReader( + readerConfiguration, + readerContext, + getSourceBoundedness()); +} +``` + +#### createSplitCoordinator method + +Writing specific data split and split allocation logic, the SplitCoordinator assumes the role of creating and managing Splits + +The specific parameters passed to construct `SplitCoordinator `are determined according to requirements, but it must be ensured that all parameters can be serialized. If it is not serializable, an error will occur when `createJobGraph` is created. + +##### example + +```Java +public SourceSplitCoordinator createSplitCoordinator(SourceSplitCoordinator + .Context coordinatorContext) { + return new RocketMQSourceSplitCoordinator( + coordinatorContext, + readerConfiguration, + getSourceBoundedness()); +} +``` + +### ParallelismComputable Interface + +```Java +public interface ParallelismComputable extends Serializable { + + /** + * give a parallelism advice for reader/writer based on configurations and upstream parallelism advice + * + * @param commonConf common configuration + * @param selfConf reader/writer configuration + * @param upstreamAdvice parallelism advice from upstream (when an operator has no upstream in DAG, its upstream is + * global parallelism) + * @return parallelism advice for the reader/writer + */ + ParallelismAdvice getParallelismAdvice(BitSailConfiguration commonConf, + BitSailConfiguration selfConf, + ParallelismAdvice upstreamAdvice) throws Exception; +} +``` + +#### getParallelismAdvice method + +Used to specify the parallel number of downstream readers. Generally, there are the following methods: + +- Use `selfConf.get(ClickhouseReaderOptions.READER_PARALLELISM_NUM)` to specify the degree of parallelism. +- Customize your own parallelism division logic. + +##### example + +For example, in RocketMQ, we can define that each reader can handle up to 4 queues. *`DEFAULT_ROCKETMQ_PARALLELISM_THRESHOLD `*`= 4` + +Obtain the corresponding degree of parallelism through this custom method. 
+ +```Java +public ParallelismAdvice getParallelismAdvice(BitSailConfiguration commonConfiguration, + BitSailConfiguration rocketmqConfiguration, + ParallelismAdvice upstreamAdvice) throws Exception { + String cluster = rocketmqConfiguration.get(RocketMQSourceOptions.CLUSTER); + String topic = rocketmqConfiguration.get(RocketMQSourceOptions.TOPIC); + String consumerGroup = rocketmqConfiguration.get(RocketMQSourceOptions.CONSUMER_GROUP); + DefaultLitePullConsumer consumer = RocketMQUtils.prepareRocketMQConsumer(rocketmqConfiguration, String.format(SOURCE_INSTANCE_NAME_TEMPLATE, + cluster, + topic, + consumerGroup, + UUID.randomUUID() + )); + try { + consumer.start(); + Collection messageQueues = consumer.fetchMessageQueues(topic); + int adviceParallelism = Math.max(CollectionUtils.size(messageQueues) / DEFAULT_ROCKETMQ_PARALLELISM_THRESHOLD, 1); + + return ParallelismAdvice.builder() + .adviceParallelism(adviceParallelism) + .enforceDownStreamChain(true) + .build(); + } finally { + consumer.shutdown(); + } + } +} +``` + +## SourceSplit + +The data fragmentation format of the data source requires us to implement the SourceSplit interface. + +![](../../images/community/source_connector/source_split_diagram.png) + +### SourceSplit Interface + +We are required to implement a method to obtain splitId. + +```Java +public interface SourceSplit extends Serializable { + String uniqSplitId(); +} +``` + +For the specific slice format, developers can customize it according to their own needs. + +### example + +#### Database + +Generally, the primary key is used to divide the data into maximum and minimum values; for classes without a primary key, it is usually recognized as a split and no longer split, so the parameters in the split include the maximum and minimum values of the primary key, and a Boolean type `readTable`. If there is no primary key class or the primary key is not split, the entire table will be regarded as a split. Under this condition, `readTable` is true. If the primary key is split according to the maximum and minimum values, it is set to false.。 + +Take ClickhouseSourceSplit as an example: + +```Java +@Setter +public class ClickhouseSourceSplit implements SourceSplit { + public static final String SOURCE_SPLIT_PREFIX = "clickhouse_source_split_"; + private static final String BETWEEN_CLAUSE = "( `%s` BETWEEN ? AND ? )"; + + private final String splitId; + + /** + * Read whole table or range [lower, upper] + */ + private boolean readTable; + private Long lower; + private Long upper; + + public ClickhouseSourceSplit(int splitId) { + this.splitId = SOURCE_SPLIT_PREFIX + splitId; + } + + @Override + public String uniqSplitId() { + return splitId; + } + + public void decorateStatement(PreparedStatement statement) { + try { + if (readTable) { + lower = Long.MIN_VALUE; + upper = Long.MAX_VALUE; + } + statement.setObject(1, lower); + statement.setObject(2, upper); + } catch (SQLException e) { + throw BitSailException.asBitSailException(CommonErrorCode.RUNTIME_ERROR, "Failed to decorate statement with split " + this, e.getCause()); + } + } + + public static String getRangeClause(String splitField) { + return StringUtils.isEmpty(splitField) ? 
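        // If no split field is configured, the split covers the whole table and no range
        // clause is generated; otherwise build the `BETWEEN ? AND ?` predicate whose two
        // placeholders decorateStatement later fills with the split's lower/upper bounds.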
null : String.format(BETWEEN_CLAUSE, splitField); + } + + @Override + public String toString() { + return String.format( + "{\"split_id\":\"%s\", \"lower\":%s, \"upper\":%s, \"readTable\":%s}", + splitId, lower, upper, readTable); + } +} +``` + +#### Message queue + +Generally, splits are divided according to the number of partitions registered in the topic in the message queue. The slice should mainly include the starting point and end point of consumption and the queue of consumption. + +Take RocketMQSplit as an example: + +```Java +@Builder +@Getter +public class RocketMQSplit implements SourceSplit { + + private MessageQueue messageQueue; + + @Setter + private long startOffset; + + private long endOffset; + + private String splitId; + + @Override + public String uniqSplitId() { + return splitId; + } + + @Override + public String toString() { + return "RocketMQSplit{" + + "messageQueue=" + messageQueue + + ", startOffset=" + startOffset + + ", endOffset=" + endOffset + + '}'; + } +} +``` + +#### File system + +Generally, files are divided as the smallest granularity, and some formats also support splitting a single file into multiple sub-Splits. The required file slices need to be packed in the file system split. + +Take `FtpSourceSplit` as an example: + +```Java +public class FtpSourceSplit implements SourceSplit { + + public static final String FTP_SOURCE_SPLIT_PREFIX = "ftp_source_split_"; + + private final String splitId; + + @Setter + private String path; + @Setter + private long fileSize; + + public FtpSourceSplit(int splitId) { + this.splitId = FTP_SOURCE_SPLIT_PREFIX + splitId; + } + + @Override + public String uniqSplitId() { + return splitId; + } + + @Override + public boolean equals(Object obj) { + return (obj instanceof FtpSourceSplit) && (splitId.equals(((FtpSourceSplit) obj).splitId)); + } + +} +``` + +In particular, in the Hadoop file system, we can also use the wrapper of the `org.apache.hadoop.mapred.InputSpli`t class to customize our Split. 
+ +```Java +public class HadoopSourceSplit implements SourceSplit { + private static final long serialVersionUID = 1L; + private final Class splitType; + private transient InputSplit hadoopInputSplit; + + private byte[] hadoopInputSplitByteArray; + + public HadoopSourceSplit(InputSplit inputSplit) { + if (inputSplit == null) { + throw new NullPointerException("Hadoop input split must not be null"); + } + + this.splitType = inputSplit.getClass(); + this.hadoopInputSplit = inputSplit; + } + + public InputSplit getHadoopInputSplit() { + return this.hadoopInputSplit; + } + + public void initInputSplit(JobConf jobConf) { + if (this.hadoopInputSplit != null) { + return; + } + + checkNotNull(hadoopInputSplitByteArray); + + try { + this.hadoopInputSplit = (InputSplit) WritableFactories.newInstance(splitType); + + if (this.hadoopInputSplit instanceof Configurable) { + ((Configurable) this.hadoopInputSplit).setConf(jobConf); + } else if (this.hadoopInputSplit instanceof JobConfigurable) { + ((JobConfigurable) this.hadoopInputSplit).configure(jobConf); + } + + if (hadoopInputSplitByteArray != null) { + try (ObjectInputStream objectInputStream = new ObjectInputStream(new ByteArrayInputStream(hadoopInputSplitByteArray))) { + this.hadoopInputSplit.readFields(objectInputStream); + } + + this.hadoopInputSplitByteArray = null; + } + } catch (Exception e) { + throw new RuntimeException("Unable to instantiate Hadoop InputSplit", e); + } + } + + private void writeObject(ObjectOutputStream out) throws IOException { + + if (hadoopInputSplit != null) { + try ( + ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(); + ObjectOutputStream objectOutputStream = new ObjectOutputStream(byteArrayOutputStream) + ) { + this.hadoopInputSplit.write(objectOutputStream); + objectOutputStream.flush(); + this.hadoopInputSplitByteArray = byteArrayOutputStream.toByteArray(); + } + } + out.defaultWriteObject(); + } + + @Override + public String uniqSplitId() { + return hadoopInputSplit.toString(); + } +} +``` + +## State + +In scenarios where checkpoints are required, we usually use `Map` to preserve the current execution state. + +### Unified example + +In the streaming-batch unified scenario, we need to save the state to recover from the abnormally interrupted streaming job. + +Take RocketMQState as an example: + +```Java +public class RocketMQState implements Serializable { + + private final Map assignedWithSplitIds; + + public RocketMQState(Map assignedWithSplitIds) { + this.assignedWithSplitIds = assignedWithSplitIds; + } + + public Map getAssignedWithSplits() { + return assignedWithSplitIds; + } +} +``` + +### Batch example + +For batch scenarios, we can use EmptyState to not store the state. If state storage is required, a similar design scheme is adopted for the stream-batch unified scenario. + +```Java +public class EmptyState implements Serializable { + + public static EmptyState fromBytes() { + return new EmptyState(); + } +} +``` + +## SourceSplitCoordinator + +The core purpose of the big data processing framework is to split large-scale data into multiple reasonable Splits, and the SplitCoordinator assumes the role of creating and managing Splits. 
+ +![](../../images/community/source_connector/source_split_coordinator_diagram.png) + +### SourceSplitCoordinator Interface + +```Java +public interface SourceSplitCoordinator extends Serializable, AutoCloseable { + + void start(); + + void addReader(int subtaskId); + + void addSplitsBack(List splits, int subtaskId); + + void handleSplitRequest(int subtaskId, @Nullable String requesterHostname); + + default void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) { + } + + StateT snapshotState() throws Exception; + + default void notifyCheckpointComplete(long checkpointId) throws Exception { + } + + void close(); + + interface Context { + + boolean isRestored(); + + /** + * Return the state to the split coordinator, for the exactly-once. + */ + StateT getRestoreState(); + + /** + * Return total parallelism of the source reader. + */ + int totalParallelism(); + + /** + * When Source reader started, it will be registered itself to coordinator. + */ + Set registeredReaders(); + + /** + * Assign splits to reader. + */ + void assignSplit(int subtaskId, List splits); + + /** + * Mainly use in boundedness situation, represents there will no more split will send to source reader. + */ + void signalNoMoreSplits(int subtask); + + /** + * If split coordinator have any event want to send source reader, use this method. + * Like send Pause event to Source Reader in CDC2.0. + */ + void sendEventToSourceReader(int subtaskId, SourceEvent event); + + /** + * Schedule to run the callable and handler, often used in un-boundedness mode. + */ + void runAsync(Callable callable, + BiConsumer handler, + int initialDelay, + long interval); + + /** + * Just run callable and handler once, often used in boundedness mode. + */ + void runAsyncOnce(Callable callable, + BiConsumer handler); + } +} +``` + +### Construction method + +In the construction method, developers generally mainly perform some configuration settings and create containers for shard information storage. + +Take the construction of ClickhouseSourceSplitCoordinator as an example: + +```Java +public ClickhouseSourceSplitCoordinator(SourceSplitCoordinator.Context context, + BitSailConfiguration jobConf) { + this.context = context; + this.jobConf = jobConf; + this.splitAssignmentPlan = Maps.newConcurrentMap(); +} +``` + +In the scenario where State is customized, it is necessary to save and restore the state stored in `SourceSplitCoordinator.Context` during checkpoint. + +Take RocketMQSourceSplitCoordinator as an example: + +```Java +public RocketMQSourceSplitCoordinator( + SourceSplitCoordinator.Context context, + BitSailConfiguration jobConfiguration, + Boundedness boundedness) { + this.context = context; + this.jobConfiguration = jobConfiguration; + this.boundedness = boundedness; + this.discoveryInternal = jobConfiguration.get(RocketMQSourceOptions.DISCOVERY_INTERNAL); + this.pendingRocketMQSplitAssignment = Maps.newConcurrentMap(); + + this.discoveredPartitions = new HashSet<>(); + if (context.isRestored()) { + RocketMQState restoreState = context.getRestoreState(); + assignedPartitions = restoreState.getAssignedWithSplits(); + discoveredPartitions.addAll(assignedPartitions.keySet()); + } else { + assignedPartitions = Maps.newHashMap(); + } + + prepareConsumerProperties(); +} +``` + +### start method + +Extract split metadata required by some data sources. 
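As a rough sketch of the overall pattern (the `discoverSplits` and `handleSplitChanges` helpers and the `discoveryIntervalSeconds` field below are hypothetical placeholders, not BitSail APIs), a `start` implementation usually discovers splits once for a bounded job and schedules periodic re-discovery for an unbounded one, using the `Context` scheduling methods shown above:

```Java
@Override
public void start() {
  if (Boundedness.BOUNDEDNESS == boundedness) {
    // Bounded (batch) job: discover the splits once and hand them to the assignment logic.
    context.runAsyncOnce(
        this::discoverSplits,
        this::handleSplitChanges);
  } else {
    // Unbounded (streaming) job: re-run split discovery periodically so that newly created
    // partitions or files are picked up during the lifetime of the job.
    context.runAsync(
        this::discoverSplits,
        this::handleSplitChanges,
        0,
        discoveryIntervalSeconds);
  }
}
```

The concrete connector implementations below follow this pattern.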
+ +#### Unified example + +Take RocketMQSourceSplitCoordinator as an example: + +```Java +private void prepareRocketMQConsumer() { + try { + consumer = RocketMQUtils.prepareRocketMQConsumer(jobConfiguration, + String.format(COORDINATOR_INSTANCE_NAME_TEMPLATE, + cluster, topic, consumerGroup, UUID.randomUUID())); + consumer.start(); + } catch (Exception e) { + throw BitSailException.asBitSailException(RocketMQErrorCode.CONSUMER_CREATE_FAILED, e); + } +} + +@Override +public void start() { + prepareRocketMQConsumer(); + splitAssigner = new FairRocketMQSplitAssigner(jobConfiguration, assignedPartitions); + if (discoveryInternal > 0) { + context.runAsync( + this::fetchMessageQueues, + this::handleMessageQueueChanged, + 0, + discoveryInternal + ); + } else { + context.runAsyncOnce( + this::fetchMessageQueues, + this::handleMessageQueueChanged + ); + } +} +``` + +#### Batch example + +Take ClickhouseSourceSplitCoordinator as an example: + +```Java +public void start() { + List splitList; + try { + SimpleDivideSplitConstructor constructor = new SimpleDivideSplitConstructor(jobConf); + splitList = constructor.construct(); + } catch (IOException e) { + ClickhouseSourceSplit split = new ClickhouseSourceSplit(0); + split.setReadTable(true); + splitList = Collections.singletonList(split); + LOG.error("Failed to construct splits, will directly read the table.", e); + } + + int readerNum = context.totalParallelism(); + LOG.info("Found {} readers and {} splits.", readerNum, splitList.size()); + if (readerNum > splitList.size()) { + LOG.error("Reader number {} is larger than split number {}.", readerNum, splitList.size()); + } + + for (ClickhouseSourceSplit split : splitList) { + int readerIndex = ReaderSelector.getReaderIndex(readerNum); + splitAssignmentPlan.computeIfAbsent(readerIndex, k -> new HashSet<>()).add(split); + LOG.info("Will assign split {} to the {}-th reader", split.uniqSplitId(), readerIndex); + } +} +``` + +### Assigner + +Assign the divided splits to the Reader. During the development process, we usually let the SourceSplitCoordinator focus on processing the communication with the Reader. The actual split distribution logic is generally encapsulated in the Assigner. This Assigner can be an encapsulated `Split Assign function`, or it can be a `Split Assigner class`. + +#### Assign function example + +Take ClickhouseSourceSplitCoordinator as an example: + +The `tryAssignSplitsToReader` function assigns the divided slices stored in `splitAssignmentPlan` to the corresponding Reader. 
+ +```Java +private void tryAssignSplitsToReader() { + Map> splitsToAssign = new HashMap<>(); + + for (Integer readerIndex : splitAssignmentPlan.keySet()) { + if (CollectionUtils.isNotEmpty(splitAssignmentPlan.get(readerIndex)) && context.registeredReaders().contains(readerIndex)) { + splitsToAssign.put(readerIndex, Lists.newArrayList(splitAssignmentPlan.get(readerIndex))); + } + } + + for (Integer readerIndex : splitsToAssign.keySet()) { + LOG.info("Try assigning splits reader {}, splits are: [{}]", readerIndex, + splitsToAssign.get(readerIndex).stream().map(ClickhouseSourceSplit::uniqSplitId).collect(Collectors.toList())); + splitAssignmentPlan.remove(readerIndex); + context.assignSplit(readerIndex, splitsToAssign.get(readerIndex)); + context.signalNoMoreSplits(readerIndex); + LOG.info("Finish assigning splits reader {}", readerIndex); + } +} +``` + +#### Assigner class example + +Take RocketMQSourceSplitCoordinator as an example: + +```Java +public class FairRocketMQSplitAssigner implements SplitAssigner { + + private BitSailConfiguration readerConfiguration; + + private AtomicInteger atomicInteger; + + public Map rocketMQSplitIncrementMapping; + + public FairRocketMQSplitAssigner(BitSailConfiguration readerConfiguration, + Map rocketMQSplitIncrementMapping) { + this.readerConfiguration = readerConfiguration; + this.rocketMQSplitIncrementMapping = rocketMQSplitIncrementMapping; + this.atomicInteger = new AtomicInteger(CollectionUtils + .size(rocketMQSplitIncrementMapping.keySet())); + } + + @Override + public String assignSplitId(MessageQueue messageQueue) { + if (!rocketMQSplitIncrementMapping.containsKey(messageQueue)) { + rocketMQSplitIncrementMapping.put(messageQueue, String.valueOf(atomicInteger.getAndIncrement())); + } + return rocketMQSplitIncrementMapping.get(messageQueue); + } + + @Override + public int assignToReader(String splitId, int totalParallelism) { + return splitId.hashCode() % totalParallelism; + } +} +``` + +### addReader method + +Call Assigner to add splits to Reader. 

#### Batch example

Take ClickhouseSourceSplitCoordinator as an example:

```Java
public void addReader(int subtaskId) {
  LOG.info("Found reader {}", subtaskId);
  tryAssignSplitsToReader();
}
```

#### Unified example

Take RocketMQSourceSplitCoordinator as an example:

```Java
private void notifyReaderAssignmentResult() {
  Map> tmpRocketMQSplitAssignments = new HashMap<>();

  for (Integer pendingAssignmentReader : pendingRocketMQSplitAssignment.keySet()) {

    if (CollectionUtils.isNotEmpty(pendingRocketMQSplitAssignment.get(pendingAssignmentReader))
        && context.registeredReaders().contains(pendingAssignmentReader)) {

      tmpRocketMQSplitAssignments.put(pendingAssignmentReader, Lists.newArrayList(pendingRocketMQSplitAssignment.get(pendingAssignmentReader)));
    }
  }

  for (Integer pendingAssignmentReader : tmpRocketMQSplitAssignments.keySet()) {

    LOG.info("Assigning splits to reader {}, splits = {}.", pendingAssignmentReader,
        tmpRocketMQSplitAssignments.get(pendingAssignmentReader));

    context.assignSplit(pendingAssignmentReader,
        tmpRocketMQSplitAssignments.get(pendingAssignmentReader));
    Set removes = pendingRocketMQSplitAssignment.remove(pendingAssignmentReader);
    removes.forEach(removeSplit -> {
      assignedPartitions.put(removeSplit.getMessageQueue(), removeSplit.getSplitId());
    });

    LOG.info("Assigned splits to reader {}", pendingAssignmentReader);

    if (Boundedness.BOUNDEDNESS == boundedness) {
      LOG.info("Signal reader {} no more splits assigned in future.", pendingAssignmentReader);
      context.signalNoMoreSplits(pendingAssignmentReader);
    }
  }
}

@Override
public void addReader(int subtaskId) {
  LOG.info(
      "Adding reader {} to RocketMQ Split Coordinator for consumer group {}.",
      subtaskId,
      consumerGroup);
  notifyReaderAssignmentResult();
}
```

### addSplitsBack method

Splits that were assigned to a Reader but not fully processed are returned here and need to be reassigned. The reassignment strategy can be defined by the developer; a common strategy is hash modulo. All the Splits in the returned list are redistributed and then assigned to the appropriate Readers.

#### Batch example

Take ClickhouseSourceSplitCoordinator as an example:

`ReaderSelector` uses the hash modulo strategy to redistribute the Split list.

The `tryAssignSplitsToReader` method then assigns the redistributed Splits to the Readers through the Assigner.

```Java
public void addSplitsBack(List splits, int subtaskId) {
  LOG.info("Source reader {} return splits {}.", subtaskId, splits);

  int readerNum = context.totalParallelism();
  for (ClickhouseSourceSplit split : splits) {
    int readerIndex = ReaderSelector.getReaderIndex(readerNum);
    splitAssignmentPlan.computeIfAbsent(readerIndex, k -> new HashSet<>()).add(split);
    LOG.info("Re-assign split {} to the {}-th reader.", split.uniqSplitId(), readerIndex);
  }

  tryAssignSplitsToReader();
}
```

#### Unified example

Take RocketMQSourceSplitCoordinator as an example:

`addSplitChangeToPendingAssignment` uses the hash modulo strategy to redistribute the Split list.

`notifyReaderAssignmentResult` then assigns the redistributed Splits to the Readers through the Assigner.
+ +```Java +private synchronized void addSplitChangeToPendingAssignment(Set newRocketMQSplits) { + int numReader = context.totalParallelism(); + for (RocketMQSplit split : newRocketMQSplits) { + int readerIndex = splitAssigner.assignToReader(split.getSplitId(), numReader); + pendingRocketMQSplitAssignment.computeIfAbsent(readerIndex, r -> new HashSet<>()) + .add(split); + } + LOG.debug("RocketMQ splits {} finished assignment.", newRocketMQSplits); +} + +@Override +public void addSplitsBack(List splits, int subtaskId) { + LOG.info("Source reader {} return splits {}.", subtaskId, splits); + addSplitChangeToPendingAssignment(new HashSet<>(splits)); + notifyReaderAssignmentResult(); +} +``` + +### snapshotState method + +Store the snapshot information of the processing split, which is used in the construction method when restoring. + +```Java +public RocketMQState snapshotState() throws Exception { + return new RocketMQState(assignedPartitions); +} +``` + +### close method + +Closes all open connectors that interact with the data source to read metadata information during split method. + +```Java +public void close() { + if (consumer != null) { + consumer.shutdown(); + } +} +``` + +## SourceReader + +Each SourceReader is executed in an independent thread. As long as we ensure that the slices assigned by SourceSplitCoordinator to different SourceReaders have no intersection, we can ignore any concurrency details during the execution cycle of SourceReader. + +![](../../images/community/source_connector/source_reader_diagram.png) + +### SourceReader Interface + +```Java +public interface SourceReader extends Serializable, AutoCloseable { + + void start(); + + void pollNext(SourcePipeline pipeline) throws Exception; + + void addSplits(List splits); + + /** + * Check source reader has more elements or not. + */ + boolean hasMoreElements(); + + /** + * There will no more split will send to this source reader. + * Source reader could be exited after process all assigned split. + */ + default void notifyNoMoreSplits() { + + } + + /** + * Process all events which from {@link SourceSplitCoordinator}. + */ + default void handleSourceEvent(SourceEvent sourceEvent) { + } + + /** + * Store the split to the external system to recover when task failed. + */ + List snapshotState(long checkpointId); + + /** + * When all tasks finished snapshot, notify checkpoint complete will be invoked. + */ + default void notifyCheckpointComplete(long checkpointId) throws Exception { + + } + + interface Context { + + TypeInfo[] getTypeInfos(); + + String[] getFieldNames(); + + int getIndexOfSubtask(); + + void sendSplitRequest(); + } +} +``` + +### Construction method + +Here it is necessary to complete the extraction of various configurations related to data source access, such as database name table name, message queue cluster and topic, identity authentication configuration, and so on. 
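A practical convention, illustrated by the hedged sketch below (the class, field, and `ExampleClient` names are made up for illustration and are not part of the BitSail codebase): the constructor should only extract lightweight, serializable configuration and capture the type information from the `Context`, while heavyweight clients such as database connections or message-queue consumers are created later in `start()`. The connector examples in this article follow the same split of responsibilities.

```Java
public class ExampleSourceReader {

  // Lightweight, serializable state extracted in the constructor.
  private final BitSailConfiguration readerConfiguration;
  private final TypeInfo[] typeInfos;
  private final String[] fieldNames;

  // Heavyweight client: not serializable, so it is created in start() and marked transient.
  private transient ExampleClient client;

  public ExampleSourceReader(BitSailConfiguration readerConfiguration, SourceReader.Context context) {
    this.readerConfiguration = readerConfiguration;
    this.typeInfos = context.getTypeInfos();
    this.fieldNames = context.getFieldNames();
  }

  public void start() {
    // The connection details were already extracted from readerConfiguration above.
    this.client = ExampleClient.connect(readerConfiguration);
  }

  // pollNext / addSplits / hasMoreElements / snapshotState are omitted in this sketch.
}
```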
+ +#### example + +```Java +public RocketMQSourceReader(BitSailConfiguration readerConfiguration, + Context context, + Boundedness boundedness) { + this.readerConfiguration = readerConfiguration; + this.boundedness = boundedness; + this.context = context; + this.assignedRocketMQSplits = Sets.newHashSet(); + this.finishedRocketMQSplits = Sets.newHashSet(); + this.deserializationSchema = new RocketMQDeserializationSchema( + readerConfiguration, + context.getTypeInfos(), + context.getFieldNames()); + this.noMoreSplits = false; + + cluster = readerConfiguration.get(RocketMQSourceOptions.CLUSTER); + topic = readerConfiguration.get(RocketMQSourceOptions.TOPIC); + consumerGroup = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_GROUP); + consumerTag = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_TAG); + pollBatchSize = readerConfiguration.get(RocketMQSourceOptions.POLL_BATCH_SIZE); + pollTimeout = readerConfiguration.get(RocketMQSourceOptions.POLL_TIMEOUT); + commitInCheckpoint = readerConfiguration.get(RocketMQSourceOptions.COMMIT_IN_CHECKPOINT); + accessKey = readerConfiguration.get(RocketMQSourceOptions.ACCESS_KEY); + secretKey = readerConfiguration.get(RocketMQSourceOptions.SECRET_KEY); +} +``` + +### start method + +Obtain the access object of the data source, such as the execution object of the database, the consumer object of the message queue, or the recordReader object of the file system. + +#### example + +Message queue + +```Java +public void start() { + try { + if (StringUtils.isNotEmpty(accessKey) && StringUtils.isNotEmpty(secretKey)) { + AclClientRPCHook aclClientRPCHook = new AclClientRPCHook( + new SessionCredentials(accessKey, secretKey)); + consumer = new DefaultMQPullConsumer(aclClientRPCHook); + } else { + consumer = new DefaultMQPullConsumer(); + } + + consumer.setConsumerGroup(consumerGroup); + consumer.setNamesrvAddr(cluster); + consumer.setInstanceName(String.format(SOURCE_READER_INSTANCE_NAME_TEMPLATE, + cluster, topic, consumerGroup, UUID.randomUUID())); + consumer.setConsumerPullTimeoutMillis(pollTimeout); + consumer.start(); + } catch (Exception e) { + throw BitSailException.asBitSailException(RocketMQErrorCode.CONSUMER_CREATE_FAILED, e); + } +} +``` + +Database + +```Java +public void start() { + this.connection = connectionHolder.connect(); + + // Construct statement. + String baseSql = ClickhouseJdbcUtils.getQuerySql(dbName, tableName, columnInfos); + String querySql = ClickhouseJdbcUtils.decorateSql(baseSql, splitField, filterSql, maxFetchCount, true); + try { + this.statement = connection.prepareStatement(querySql); + } catch (SQLException e) { + throw new RuntimeException("Failed to prepare statement.", e); + } + + LOG.info("Task {} started.", subTaskId); +} +``` + +FTP + +```Java +public void start() { + this.ftpHandler.loginFtpServer(); + if (this.ftpHandler.getFtpConfig().getSkipFirstLine()) { + this.skipFirstLine = true; + } +} +``` + +### addSplits method + +Add the Splits list assigned by SourceSplitCoordinator to the current Reader to its own processing queue or set. + +#### example + +```Java +public void addSplits(List splits) { + LOG.info("Subtask {} received {}(s) new splits, splits = {}.", + context.getIndexOfSubtask(), + CollectionUtils.size(splits), + splits); + + assignedRocketMQSplits.addAll(splits); +} +``` + +### hasMoreElements method + +In an unbounded stream computing scenario, it will always return true to ensure that the Reader thread is not destroyed. 
+ +In a batch scenario, false will be returned after the slices assigned to the Reader are processed, indicating the end of the Reader's life cycle. + +```Java +public boolean hasMoreElements() { + if (boundedness == Boundedness.UNBOUNDEDNESS) { + return true; + } + if (noMoreSplits) { + return CollectionUtils.size(assignedRocketMQSplits) != 0; + } + return true; +} +``` + +### pollNext method + +When the addSplits method adds the slice processing queue and hasMoreElements returns true, this method is called, and the developer implements this method to actually interact with the data. + +Developers need to pay attention to the following issues when implementing the pollNext method: + +- Reading of split data + - Read data from the constructed split. +- Conversion of data types + - Convert external data to BitSail's Row type + +#### example + +Take RocketMQSourceReader as an example: + +Select a split from the split queue for processing, read its information, and then convert the read information into `BitSail's Row type` and send it downstream for processing. + +```Java +public void pollNext(SourcePipeline pipeline) throws Exception { + for (RocketMQSplit rocketmqSplit : assignedRocketMQSplits) { + MessageQueue messageQueue = rocketmqSplit.getMessageQueue(); + PullResult pullResult = consumer.pull(rocketmqSplit.getMessageQueue(), + consumerTag, + rocketmqSplit.getStartOffset(), + pollBatchSize, + pollTimeout); + + if (Objects.isNull(pullResult) || CollectionUtils.isEmpty(pullResult.getMsgFoundList())) { + continue; + } + + for (MessageExt message : pullResult.getMsgFoundList()) { + Row deserialize = deserializationSchema.deserialize(message.getBody()); + pipeline.output(deserialize); + if (rocketmqSplit.getStartOffset() >= rocketmqSplit.getEndOffset()) { + LOG.info("Subtask {} rocketmq split {} in end of stream.", + context.getIndexOfSubtask(), + rocketmqSplit); + finishedRocketMQSplits.add(rocketmqSplit); + break; + } + } + rocketmqSplit.setStartOffset(pullResult.getNextBeginOffset()); + if (!commitInCheckpoint) { + consumer.updateConsumeOffset(messageQueue, pullResult.getMaxOffset()); + } + } + assignedRocketMQSplits.removeAll(finishedRocketMQSplits); +} +``` + +#### The way to convert to BitSail Row type + +##### RowDeserializer class + +Apply different converters to columns of different formats, and set them to the `Field` of the corresponding `Row Field`. 
+ +```Java +public class ClickhouseRowDeserializer { + + interface FiledConverter { + Object apply(ResultSet resultSet) throws SQLException; + } + + private final List converters; + private final int fieldSize; + + public ClickhouseRowDeserializer(TypeInfo[] typeInfos) { + this.fieldSize = typeInfos.length; + this.converters = new ArrayList<>(); + for (int i = 0; i < fieldSize; ++i) { + converters.add(initFieldConverter(i + 1, typeInfos[i])); + } + } + + public Row convert(ResultSet resultSet) { + Row row = new Row(fieldSize); + try { + for (int i = 0; i < fieldSize; ++i) { + row.setField(i, converters.get(i).apply(resultSet)); + } + } catch (SQLException e) { + throw BitSailException.asBitSailException(ClickhouseErrorCode.CONVERT_ERROR, e.getCause()); + } + return row; + } + + private FiledConverter initFieldConverter(int index, TypeInfo typeInfo) { + if (!(typeInfo instanceof BasicTypeInfo)) { + throw BitSailException.asBitSailException(CommonErrorCode.UNSUPPORTED_COLUMN_TYPE, typeInfo.getTypeClass().getName() + " is not supported yet."); + } + + Class curClass = typeInfo.getTypeClass(); + if (TypeInfos.BYTE_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getByte(index); + } + if (TypeInfos.SHORT_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getShort(index); + } + if (TypeInfos.INT_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getInt(index); + } + if (TypeInfos.LONG_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getLong(index); + } + if (TypeInfos.BIG_INTEGER_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> { + BigDecimal dec = resultSet.getBigDecimal(index); + return dec == null ? null : dec.toBigInteger(); + }; + } + if (TypeInfos.FLOAT_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getFloat(index); + } + if (TypeInfos.DOUBLE_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getDouble(index); + } + if (TypeInfos.BIG_DECIMAL_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getBigDecimal(index); + } + if (TypeInfos.STRING_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getString(index); + } + if (TypeInfos.SQL_DATE_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getDate(index); + } + if (TypeInfos.SQL_TIMESTAMP_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getTimestamp(index); + } + if (TypeInfos.SQL_TIME_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getTime(index); + } + if (TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> resultSet.getBoolean(index); + } + if (TypeInfos.VOID_TYPE_INFO.getTypeClass() == curClass) { + return resultSet -> null; + } + throw new UnsupportedOperationException("Unsupported data type: " + typeInfo); + } +} +``` + +##### Implement the DeserializationSchema interface + +Compared with implementing `RowDeserializer`, we hope that you can implement an implementation class that inherits the `DeserializationSchema` interface, and convert data in a certain format, such as `JSON` and `CSV`, into `BitSail Row type`. + +![](../../images/community/source_connector/deserialization_schema_diagram.png) + +In specific applications, we can use a unified interface to create corresponding implementation classes. 
+ +```Java +public class TextInputFormatDeserializationSchema implements DeserializationSchema { + + private BitSailConfiguration deserializationConfiguration; + + private TypeInfo[] typeInfos; + + private String[] fieldNames; + + private transient DeserializationSchema deserializationSchema; + + public TextInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration, + TypeInfo[] typeInfos, + String[] fieldNames) { + this.deserializationConfiguration = deserializationConfiguration; + this.typeInfos = typeInfos; + this.fieldNames = fieldNames; + ContentType contentType = ContentType.valueOf( + deserializationConfiguration.getNecessaryOption(HadoopReaderOptions.CONTENT_TYPE, HadoopErrorCode.REQUIRED_VALUE).toUpperCase()); + switch (contentType) { + case CSV: + this.deserializationSchema = + new CsvDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames); + break; + case JSON: + this.deserializationSchema = + new JsonDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames); + break; + default: + throw BitSailException.asBitSailException(HadoopErrorCode.UNSUPPORTED_ENCODING, "unsupported parser type: " + contentType); + } + } + + @Override + public Row deserialize(Writable message) { + return deserializationSchema.deserialize((message.toString()).getBytes()); + } + + @Override + public boolean isEndOfStream(Row nextElement) { + return false; + } +} +``` + +You can also customize the `DeserializationSchema` that currently needs to be parsed: + +```Java +public class MapredParquetInputFormatDeserializationSchema implements DeserializationSchema { + + private final BitSailConfiguration deserializationConfiguration; + + private final transient DateTimeFormatter localDateTimeFormatter; + private final transient DateTimeFormatter localDateFormatter; + private final transient DateTimeFormatter localTimeFormatter; + private final int fieldSize; + private final TypeInfo[] typeInfos; + private final String[] fieldNames; + private final List converters; + + public MapredParquetInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration, + TypeInfo[] typeInfos, + String[] fieldNames) { + + this.deserializationConfiguration = deserializationConfiguration; + this.typeInfos = typeInfos; + this.fieldNames = fieldNames; + this.localDateTimeFormatter = DateTimeFormatter.ofPattern( + deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_TIME_PATTERN)); + this.localDateFormatter = DateTimeFormatter + .ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_PATTERN)); + this.localTimeFormatter = DateTimeFormatter + .ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.TIME_PATTERN)); + this.fieldSize = typeInfos.length; + this.converters = Arrays.stream(typeInfos).map(this::createTypeInfoConverter).collect(Collectors.toList()); + } + + @Override + public Row deserialize(Writable message) { + int arity = fieldNames.length; + Row row = new Row(arity); + Writable[] writables = ((ArrayWritable) message).get(); + for (int i = 0; i < fieldSize; ++i) { + row.setField(i, converters.get(i).convert(writables[i].toString())); + } + return row; + } + + @Override + public boolean isEndOfStream(Row nextElement) { + return false; + } + + private interface DeserializationConverter extends Serializable { + Object convert(String input); + } + + private DeserializationConverter createTypeInfoConverter(TypeInfo typeInfo) { + Class typeClass = typeInfo.getTypeClass(); + + if (typeClass == 
TypeInfos.VOID_TYPE_INFO.getTypeClass()) { + return field -> null; + } + if (typeClass == TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass()) { + return this::convertToBoolean; + } + if (typeClass == TypeInfos.INT_TYPE_INFO.getTypeClass()) { + return this::convertToInt; + } + throw BitSailException.asBitSailException(CsvFormatErrorCode.CSV_FORMAT_COVERT_FAILED, + String.format("Csv format converter not support type info: %s.", typeInfo)); + } + + private boolean convertToBoolean(String field) { + return Boolean.parseBoolean(field.trim()); + } + + private int convertToInt(String field) { + return Integer.parseInt(field.trim()); + } +} +``` + +### snapshotState method + +Generate and save the snapshot information of State for `checkpoint`. + +#### example + +```Java +public List snapshotState(long checkpointId) { + LOG.info("Subtask {} start snapshotting for checkpoint id = {}.", context.getIndexOfSubtask(), checkpointId); + if (commitInCheckpoint) { + for (RocketMQSplit rocketMQSplit : assignedRocketMQSplits) { + try { + consumer.updateConsumeOffset(rocketMQSplit.getMessageQueue(), rocketMQSplit.getStartOffset()); + LOG.debug("Subtask {} committed message queue = {} in checkpoint id = {}.", context.getIndexOfSubtask(), + rocketMQSplit.getMessageQueue(), + checkpointId); + } catch (MQClientException e) { + throw new RuntimeException(e); + } + } + } + return Lists.newArrayList(assignedRocketMQSplits); +} +``` + +### hasMoreElements method + +The `sourceReader.hasMoreElements()` judgment will be made before calling the `pollNext` method each time. If and only if the judgment passes, the `pollNext` method will be called. + +#### example + +```Java +public boolean hasMoreElements() { + if (noMoreSplits) { + return CollectionUtils.size(assignedHadoopSplits) != 0; + } + return true; +} +``` + +### notifyNoMoreSplits method + +This method is called when the Reader has processed all splits. 

#### example

```Java
public void notifyNoMoreSplits() {
  LOG.info("Subtask {} received no more split signal.", context.getIndexOfSubtask());
  noMoreSplits = true;
}
``` \ No newline at end of file diff --git a/website/en/community/team.md b/website/en/community/team.md index bd62bb352..56896f193 100644 --- a/website/en/community/team.md +++ b/website/en/community/team.md @@ -1,8 +1,13 @@ --- -order: 4 +order: 6 --- + # Team +English | [简体中文](../../zh/community/team.md) + +----- + ## Contributors diff --git a/website/en/documents/start/README.md b/website/en/documents/start/README.md index 9493e0af0..841cd98d7 100644 --- a/website/en/documents/start/README.md +++ b/website/en/documents/start/README.md @@ -10,3 +10,4 @@ dir: - [Develop Environment Setup](env_setup.md) - [Deployment Guide](deployment.md) - [Job Configuration Guide](config.md) +- [BitSail Guide Video](quick_guide.md) \ No newline at end of file diff --git a/website/en/documents/start/config.md b/website/en/documents/start/config.md index e0934914f..ea1e5f4d5 100644 --- a/website/en/documents/start/config.md +++ b/website/en/documents/start/config.md @@ -1,6 +1,13 @@ +--- +order: 3 +--- + # Job Configuration Guide + English | [简体中文](../../../zh/documents/start/config.md) +----- + ***BitSail*** script configuration is managed by JSON structure, follow scripts show the complete structure: ``` json diff --git a/website/en/documents/start/deployment.md b/website/en/documents/start/deployment.md index 7f52470ce..6398ef9fc 100644 --- a/website/en/documents/start/deployment.md +++ b/website/en/documents/start/deployment.md @@ -1,5 +1,12 @@ +--- +order: 1 +--- + # Deployment Guide -English | [简体中文](../../../zh/documents/start/deployment.md) + +English | [简体中文](../../../zh/documents/start/deployment.md) + +----- + > At present, ***BitSail*** only supports flink deployment on Yarn.
Other platforms like `native kubernetes` will be release recently. diff --git a/website/en/documents/start/env_setup.md b/website/en/documents/start/env_setup.md index fc220c3ab..17ac156b7 100644 --- a/website/en/documents/start/env_setup.md +++ b/website/en/documents/start/env_setup.md @@ -1,4 +1,9 @@ +--- +order: 2 +--- + # Develop Environment Setup + English | [简体中文](../../../zh/documents/start/env_setup.md) ----- diff --git a/website/en/documents/start/quick_guide.md b/website/en/documents/start/quick_guide.md new file mode 100644 index 000000000..e4cbe8a4b --- /dev/null +++ b/website/en/documents/start/quick_guide.md @@ -0,0 +1,105 @@ +--- +order: 4 +--- + +# BitSail Guide Video + +English | [简体中文](../../../zh/documents/start/quick_guide.md) + +----- + +## BitSail demo video + +[BitSail demo video](https://zhuanlan.zhihu.com/p/595157599) + +## BitSail source code compilation + +BitSail has a built-in compilation script `build.sh` in the project, which is stored in the project root directory. Newly downloaded users can directly compile this script, and after successful compilation, they can find the corresponding product in the directory: `bitsail-dist/target/bitsail-dist-${rversion}-bin`. + +![](../../../images/documents/start/quick_guide/source_code_structure.png) + +## BitSail product structure + +![](../../../images/documents/start/quick_guide/compile_product_structure.png) + +![](../../../images/documents/start/quick_guide/product_structure.png) + +## BitSail job submission + +### Flink Session Job + +```Shell +Step 1: Start the Flink Session cluster + +Session operation requires the existence of hadoop dependencies in the local environment and the existence of the environment variable HADOOP_CLASSPATH. + +bash ./embedded/flink/bin/start-cluster.sh + +Step 2: Submit the job to the Flink Session cluster + +bash bin/bitsail run \ + --engine flink \ + --execution-mode run \ + --deployment-mode local \ + --conf examples/Fake_Print_Example.json \ + --jm-address +``` + +### Yarn Cluster Job + +```Shell +Step 1: Set the HADOOP_HOME environment variable + +export HADOOP_HOME=XXX + +Step 2: Set HADOOP_HOME so that the submission client can find the configuration path of the yarn cluster, and then submit the job to the yarn cluster + +bash ./bin/bitsail run --engine flink \ +--conf ~/dts_example/examples/Hive_Print_Example.json \ +--execution-mode run \ +--deployment-mode yarn-per-job \ +--queue default +``` + +## BitSail Demo + +### Fake->MySQL + +```Shell +// create mysql table +CREATE TABLE `bitsail_fake_source` ( + `id` bigint(20) NOT NULL AUTO_INCREMENT, + `name` varchar(255) DEFAULT NULL, + `price` double DEFAULT NULL, + `image` blob, + `start_time` datetime DEFAULT NULL, + `end_time` datetime DEFAULT NULL, + `order_id` bigint(20) DEFAULT NULL, + `enabled` tinyint(4) DEFAULT NULL, + `datetime` int(11) DEFAULT NULL, + PRIMARY KEY (`id`) +) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4; +``` + +### MySQL->Hive + +```Shell +// create hive table +CREATE TABLE `bitsail`.`bitsail_mysql_hive`( + `id` bigint , + `name` string , + `price` double , + `image` binary, + `start_time` timestamp , + `end_time` timestamp, + `order_id` bigint , + `enabled` int, + `datetime` int +)PARTITIONED BY (`date` string) +ROW FORMAT SERDE + 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' +STORED AS INPUTFORMAT + 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' +OUTPUTFORMAT + 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' +``` \ No newline at end of file diff --git 
a/website/images/community/connector_quick_start/bitsail_model.png b/website/images/community/connector_quick_start/bitsail_model.png new file mode 100644 index 000000000..d2e11e700 Binary files /dev/null and b/website/images/community/connector_quick_start/bitsail_model.png differ diff --git a/website/images/community/connector_quick_start/code_structure_en.png b/website/images/community/connector_quick_start/code_structure_en.png new file mode 100644 index 000000000..a507eeab1 Binary files /dev/null and b/website/images/community/connector_quick_start/code_structure_en.png differ diff --git a/website/images/community/connector_quick_start/code_structure_zh.png b/website/images/community/connector_quick_start/code_structure_zh.png new file mode 100644 index 000000000..2d67593e2 Binary files /dev/null and b/website/images/community/connector_quick_start/code_structure_zh.png differ diff --git a/website/images/community/connector_quick_start/connector_pom.png b/website/images/community/connector_quick_start/connector_pom.png new file mode 100644 index 000000000..51c1b59cc Binary files /dev/null and b/website/images/community/connector_quick_start/connector_pom.png differ diff --git a/website/images/community/connector_quick_start/dist_pom.png b/website/images/community/connector_quick_start/dist_pom.png new file mode 100644 index 000000000..5a6cfcd3c Binary files /dev/null and b/website/images/community/connector_quick_start/dist_pom.png differ diff --git a/website/images/community/connector_quick_start/sink_connector.png b/website/images/community/connector_quick_start/sink_connector.png new file mode 100644 index 000000000..6e66dbb0c Binary files /dev/null and b/website/images/community/connector_quick_start/sink_connector.png differ diff --git a/website/images/community/connector_quick_start/source_connector.png b/website/images/community/connector_quick_start/source_connector.png new file mode 100644 index 000000000..2f4531186 Binary files /dev/null and b/website/images/community/connector_quick_start/source_connector.png differ diff --git a/website/images/community/connector_quick_start/test_container.png b/website/images/community/connector_quick_start/test_container.png new file mode 100644 index 000000000..c6b2a1c86 Binary files /dev/null and b/website/images/community/connector_quick_start/test_container.png differ diff --git a/website/images/community/pr_guide/after_git_reset.png b/website/images/community/pr_guide/after_git_reset.png new file mode 100644 index 000000000..459c01ee6 Binary files /dev/null and b/website/images/community/pr_guide/after_git_reset.png differ diff --git a/website/images/community/pr_guide/commit_info.png b/website/images/community/pr_guide/commit_info.png new file mode 100644 index 000000000..b1e1cc190 Binary files /dev/null and b/website/images/community/pr_guide/commit_info.png differ diff --git a/website/images/community/pr_guide/create_pr.png b/website/images/community/pr_guide/create_pr.png new file mode 100644 index 000000000..c3290f3ba Binary files /dev/null and b/website/images/community/pr_guide/create_pr.png differ diff --git a/website/images/community/pr_guide/git_clone_example.png b/website/images/community/pr_guide/git_clone_example.png new file mode 100644 index 000000000..82a38c033 Binary files /dev/null and b/website/images/community/pr_guide/git_clone_example.png differ diff --git a/website/images/community/pr_guide/git_history.png b/website/images/community/pr_guide/git_history.png new file mode 100644 index 000000000..4a665b66b Binary 
files /dev/null and b/website/images/community/pr_guide/git_history.png differ diff --git a/website/images/community/pr_guide/git_rebase_example.png b/website/images/community/pr_guide/git_rebase_example.png new file mode 100644 index 000000000..9146b7848 Binary files /dev/null and b/website/images/community/pr_guide/git_rebase_example.png differ diff --git a/website/images/community/pr_guide/github_pr.png b/website/images/community/pr_guide/github_pr.png new file mode 100644 index 000000000..4a771b7ec Binary files /dev/null and b/website/images/community/pr_guide/github_pr.png differ diff --git a/website/images/community/pr_guide/github_status.png b/website/images/community/pr_guide/github_status.png new file mode 100644 index 000000000..9bc53121f Binary files /dev/null and b/website/images/community/pr_guide/github_status.png differ diff --git a/website/images/community/pr_guide/repository_fork.png b/website/images/community/pr_guide/repository_fork.png new file mode 100644 index 000000000..e84d36c13 Binary files /dev/null and b/website/images/community/pr_guide/repository_fork.png differ diff --git a/website/images/community/pr_guide/repository_structure.png b/website/images/community/pr_guide/repository_structure.png new file mode 100644 index 000000000..4984c932a Binary files /dev/null and b/website/images/community/pr_guide/repository_structure.png differ diff --git a/website/images/community/release_guide/release_procedure.png b/website/images/community/release_guide/release_procedure.png new file mode 100644 index 000000000..91b66253b Binary files /dev/null and b/website/images/community/release_guide/release_procedure.png differ diff --git a/website/images/community/sink_connector/sink_diagram.png b/website/images/community/sink_connector/sink_diagram.png new file mode 100644 index 000000000..ccde0543a Binary files /dev/null and b/website/images/community/sink_connector/sink_diagram.png differ diff --git a/website/images/community/sink_connector/writer_diagram.png b/website/images/community/sink_connector/writer_diagram.png new file mode 100644 index 000000000..3818e66f7 Binary files /dev/null and b/website/images/community/sink_connector/writer_diagram.png differ diff --git a/website/images/community/source_connector/bitsail_converter.png b/website/images/community/source_connector/bitsail_converter.png new file mode 100644 index 000000000..ce9c01a0b Binary files /dev/null and b/website/images/community/source_connector/bitsail_converter.png differ diff --git a/website/images/community/source_connector/deserialization_schema_diagram.png b/website/images/community/source_connector/deserialization_schema_diagram.png new file mode 100644 index 000000000..b82079a9f Binary files /dev/null and b/website/images/community/source_connector/deserialization_schema_diagram.png differ diff --git a/website/images/community/source_connector/file_mapping_converter.png b/website/images/community/source_connector/file_mapping_converter.png new file mode 100644 index 000000000..ae2763bc3 Binary files /dev/null and b/website/images/community/source_connector/file_mapping_converter.png differ diff --git a/website/images/community/source_connector/source_diagram.png b/website/images/community/source_connector/source_diagram.png new file mode 100644 index 000000000..01a65e2f6 Binary files /dev/null and b/website/images/community/source_connector/source_diagram.png differ diff --git a/website/images/community/source_connector/source_reader_diagram.png 
b/website/images/community/source_connector/source_reader_diagram.png new file mode 100644 index 000000000..487b1443d Binary files /dev/null and b/website/images/community/source_connector/source_reader_diagram.png differ diff --git a/website/images/community/source_connector/source_split_coordinator_diagram.png b/website/images/community/source_connector/source_split_coordinator_diagram.png new file mode 100644 index 000000000..af1679fe1 Binary files /dev/null and b/website/images/community/source_connector/source_split_coordinator_diagram.png differ diff --git a/website/images/community/source_connector/source_split_diagram.png b/website/images/community/source_connector/source_split_diagram.png new file mode 100644 index 000000000..19a5485b4 Binary files /dev/null and b/website/images/community/source_connector/source_split_diagram.png differ diff --git a/website/images/documents/start/quick_guide/compile_product_structure.png b/website/images/documents/start/quick_guide/compile_product_structure.png new file mode 100644 index 000000000..9d991fbb3 Binary files /dev/null and b/website/images/documents/start/quick_guide/compile_product_structure.png differ diff --git a/website/images/documents/start/quick_guide/product_structure.png b/website/images/documents/start/quick_guide/product_structure.png new file mode 100644 index 000000000..e47d08dd8 Binary files /dev/null and b/website/images/documents/start/quick_guide/product_structure.png differ diff --git a/website/images/documents/start/quick_guide/source_code_structure.png b/website/images/documents/start/quick_guide/source_code_structure.png new file mode 100644 index 000000000..00704118f Binary files /dev/null and b/website/images/documents/start/quick_guide/source_code_structure.png differ diff --git a/website/zh/community/community.md b/website/zh/community/community.md index 45e4cdfee..4d49442ba 100644 --- a/website/zh/community/community.md +++ b/website/zh/community/community.md @@ -1,12 +1,21 @@ --- order: 1 --- + # 社区 +[English](../../en/community/community.md) | 简体中文 + +----- + - [贡献指南](contribute.md) - - [如何参与](contribute.md#如何参与) - - [如何提交一个Pull Request](contribute.md#提交一个Pull-Request) - - [如何发起一个issue](contribute.md#打开一个GitHub-Issue) - - [开发Tips](contribute.md#开发小技巧) + - [如何参与](contribute.md#如何参与) + - [如何发起一个issue](contribute.md#打开一个GitHub-Issue) + - [PR发布指南](pr_guide.md) + - [开发Tips](contribute.md#开发小技巧) + - [BitSail 发版指南](release_guide.md) +- [Connector开发指南](connector_quick_start.md) + - [Source Connector 详解](source_connector_detail.md) + - [Sink Connector 详解](sink_connector_detail.md) - [邮件列表](mailing.md) - [团队介绍](team.md) \ No newline at end of file diff --git a/website/zh/community/connector_quick_start.md b/website/zh/community/connector_quick_start.md new file mode 100644 index 000000000..ecf6a4bf6 --- /dev/null +++ b/website/zh/community/connector_quick_start.md @@ -0,0 +1,283 @@ +--- +order: 7 +--- + +# Connector开发指南 + +[English](../../en/community/connector_quick_start.md) | 简体中文 + +----- + +## 简介 + +本文面向BitSail的Connector开发人员,通过开发者的角度全面的阐述开发一个完整Connector的全流程,快速上手Connector开发。 + +## 目录结构 + +首先开发者需要fork BitSail仓库,详情参考[Fork BitSail Repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo),之后通过git clone仓库到本地,并导入到IDE中。同时创建自己的工作分支,使用该分支开发自己的Connector。项目地址:https://github.com/bytedance/bitsail.git。 + +项目结构如下: + +![](../../images/community/connector_quick_start/code_structure_zh.png) + +## 开发流程 + +BitSail 是一款基于分布式架构的数据集成引擎,Connector会并发执行。并由BitSail 框架来负责任务的调度、并发执行、脏数据处理等,开发者只需要实现对应接口即可,具体开发流程如下: + +- 
工程配置,开发者需要在`bitsail/bitsail-connectors/pom.xml`模块中注册自己的Connector,同时在`bitsail/bitsail-dist/pom.xml`增加自己的Connector模块,同时为你的连接器注册配置文件,来使得框架可以在运行时动态发现它。 + - ![](../../images/community/connector_quick_start/connector_pom.png) + + - ![](../../images/community/connector_quick_start/dist_pom.png) +- Connector开发,实现Source、Sink提供的抽象方法,具体细节参考后续介绍。 +- 数据输出类型,目前支持的数据类型为BitSail Row类型,无论是Source在Reader中传递给下游的数据类型,还是Sink从上游消费的数据类型,都应该是BitSail Row类型。 + +## Architecture + +当前Source API的设计同时兼容了流批一批的场景,换言之就是同时支持pull & push 的场景。在此之前,我们需要首先再过一遍传统流批场景中各组件的交互模型。 + +### Batch Model + +传统批式场景中,数据的读取一般分为如下几步: + +- `createSplits`:一般在client端或者中心节点执行,目的是将完整的数据按照指定的规则尽可能拆分为较多的`rangeSplits`,`createSplits`在作业生命周期内有且执行一次。 +- `runWithSplit`: 一般在执行节点节点执行,执行节点启动后会向中心节点请求存在的`rangeSplit`,然后再本地进行执行;执行完成后会再次向中心节点请求直到所有`splits`执行完成。 +- `commit`:全部的split的执行完成后,一般会在中心节点执行`commit`的操作,用于将数据对外可见。 + +### Stream Model + +传统流式场景中,数据的读取一般分为如下几步: + +- `createSplits`:一般在client端或者中心节点执行,目的是根据滑动窗口或者滚动窗口的策略将数据流划分为`rangeSplits`,`createSplits`在流式作业的生命周期中按照划分窗口的会一直执行。 +- `runWithSplit`: 一般在执行节点节点执行,中心节点会向可执行节点发送`rangeSplit`,然后在可执行节点本地进行执行;执行完成后会将处理完的`splits`数据向下游发送。 +- `commit`:全部的split的执行完成后,一般会向目标数据源发送`retract message`,实时动态展现结果。 + +### BitSail Model + +![](../../images/community/connector_quick_start/bitsail_model.png) + +- `createSplits`:BitSail通过`SplitCoordinator`模块划分`rangeSplits`,在流式作业中的生命周期中`createSplits`会周期性执行,而在批式作业中仅仅会执行一次。 +- `runWithSplit`: 在执行节点节点执行,BitSail中执行节点包括`Reader`和`Writer`模块,中心节点会向可执行节点发送`rangeSplit`,然后在可执行节点本地进行执行;执行完成后会将处理完的`splits`数据向下游发送。 +- `commit`:`writer`在完成数据写入后,`committer`来完成提交。在不开启`checkpoint`时,`commit`会在所有`writer`都结束后执行一次;在开启`checkpoint`时,`commit`会在每次`checkpoint`的时候都会执行一次。 + +## Source Connector + +![](../../images/community/connector_quick_start/source_connector.png) + +- Source: 数据读取组件的生命周期管理,主要负责和框架的交互,构架作业,不参与作业真正的执行 +- SourceSplit: 数据读取分片;大数据处理框架的核心目的就是将大规模的数据拆分成为多个合理的Split +- State:作业状态快照,当开启checkpoint之后,会保存当前执行状态。 +- SplitCoordinator: 既然提到了Split,就需要有相应的组件去创建、管理Split;SplitCoordinator承担了这样的角色 +- SourceReader: 真正负责数据读取的组件,在接收到Split后会对其进行数据读取,然后将数据传输给下一个算子 + +Source Connector开发流程如下 + +1. 首先需要创建`Source`类,需要实现`Source`和`ParallelismComputable`接口,主要负责和框架的交互,构架作业,它不参与作业真正的执行 +2. `BitSail`的`Source`采用流批一体的设计思想,通过`getSourceBoundedness`方法设置作业的处理方式,通过`configure`方法定义`readerConfiguration`的配置,通过`createTypeInfoConverter`方法来进行数据类型转换,可以通过`FileMappingTypeInfoConverter`得到用户在yaml文件中自定义的数据源类型和BitSail类型的转换,实现自定义化的类型转换。 +3. 最后,定义数据源的数据分片格式`SourceSplit`类和闯将管理`Split`的角色`SourceSplitCoordinator`类 +4. 
最后完成`SourceReader`实现从`Split`中进行数据的读取。 + +| Job Type | Boundedness | +| -------- | --------------------------- | +| batch | Boundedness.*BOUNDEDNESS* | +| stream | Boundedness.*UNBOUNDEDNESS* | + +- 每个`SourceReader`都在独立的线程中执行,并保证`SourceSplitCoordinator`分配给不同`SourceReader`的切片没有交集 +- 在`SourceReader`的执行周期中,开发者只需要关注如何从构造好的切片中去读取数据,之后完成数据类型对转换,将外部数据类型转换成`BitSail`的`Row`类型传递给下游即可 + +### Reader示例 + +```Java +public class FakeSourceReader extends SimpleSourceReaderBase { + + private final BitSailConfiguration readerConfiguration; + private final TypeInfo[] typeInfos; + + private final transient int totalCount; + private final transient RateLimiter fakeGenerateRate; + private final transient AtomicLong counter; + + private final FakeRowGenerator fakeRowGenerator; + + public FakeSourceReader(BitSailConfiguration readerConfiguration, Context context) { + this.readerConfiguration = readerConfiguration; + this.typeInfos = context.getTypeInfos(); + this.totalCount = readerConfiguration.get(FakeReaderOptions.TOTAL_COUNT); + this.fakeGenerateRate = RateLimiter.create(readerConfiguration.get(FakeReaderOptions.RATE)); + this.counter = new AtomicLong(); + this.fakeRowGenerator = new FakeRowGenerator(readerConfiguration, context.getIndexOfSubtask()); + } + + @Override + public void pollNext(SourcePipeline pipeline) throws Exception { + fakeGenerateRate.acquire(); + pipeline.output(fakeRowGenerator.fakeOneRecord(typeInfos)); + } + + @Override + public boolean hasMoreElements() { + return counter.incrementAndGet() <= totalCount; + } +} +``` + +## Sink Connector + +![](../../images/community/connector_quick_start/sink_connector.png) + +- Sink:数据写入组件的生命周期管理,主要负责和框架的交互,构架作业,它不参与作业真正的执行。 +- Writer:负责将接收到的数据写到外部存储。 +- WriterCommitter(可选):对数据进行提交操作,来完成两阶段提交的操作;实现exactly-once的语义。 + +开发者首先需要创建`Sink`类,实现`Sink`接口,主要负责数据写入组件的生命周期管理,构架作业。通过`configure`方法定义`writerConfiguration`的配置,通过`createTypeInfoConverter`方法来进行数据类型转换,将内部类型进行转换写到外部系统,同`Source`部分。之后我们再定义`Writer`类实现具体的数据写入逻辑,在`write`方法调用时将`BitSail Row`类型把数据写到缓存队列中,在`flush`方法调用时将缓存队列中的数据刷写到目标数据源中。 + +### Writer示例 + +```Java +public class PrintWriter implements Writer { + private static final Logger LOG = LoggerFactory.getLogger(PrintWriter.class); + + private final int batchSize; + private final List fieldNames; + + private final List writeBuffer; + private final List commitBuffer; + + private final AtomicInteger printCount; + + public PrintWriter(int batchSize, List fieldNames) { + this(batchSize, fieldNames, 0); + } + + public PrintWriter(int batchSize, List fieldNames, int alreadyPrintCount) { + Preconditions.checkState(batchSize > 0, "batch size must be larger than 0"); + this.batchSize = batchSize; + this.fieldNames = fieldNames; + this.writeBuffer = new ArrayList<>(batchSize); + this.commitBuffer = new ArrayList<>(batchSize); + printCount = new AtomicInteger(alreadyPrintCount); + } + + @Override + public void write(Row element) { + String[] fields = new String[element.getFields().length]; + for (int i = 0; i < element.getFields().length; ++i) { + fields[i] = String.format("\"%s\":\"%s\"", fieldNames.get(i), element.getField(i).toString()); + } + + writeBuffer.add("[" + String.join(",", fields) + "]"); + if (writeBuffer.size() == batchSize) { + this.flush(false); + } + printCount.incrementAndGet(); + } + + @Override + public void flush(boolean endOfInput) { + commitBuffer.addAll(writeBuffer); + writeBuffer.clear(); + if (endOfInput) { + LOG.info("all records are sent to commit buffer."); + } + } + + @Override + public List prepareCommit() { + return commitBuffer; + } + + 
@Override + public List snapshotState(long checkpointId) { + return Collections.singletonList(printCount.get()); + } +} +``` + +## 将连接器注册到配置文件中 + +为你的连接器注册配置文件,来使得框架可以在运行时动态发现它,配置文件的定义如下: + +以hive为例,开发者需要在resource目录下新增一个json文件,名字示例为bitsail-connector-hive.json,只要不和其他连接器重复即可 + +```Plain +{ + "name": "bitsail-connector-hive", + "classes": [ + "com.bytedance.bitsail.connector.hive.source.HiveSource", + "com.bytedance.bitsail.connector.hive.sink.HiveSink" + ], + "libs": [ + "bitsail-connector-hive-${version}.jar" + ] +} +``` + +## 测试模块 + +在Source或者Sink连接器所在的模块中,新增ITCase测试用例,然后按照如下流程支持 + +- 通过test container来启动相应的组件 + +![](../../images/community/connector_quick_start/test_container.png) + +- 编写相应的配置文件 + +```Json +{ + "job": { + "common": { + "job_id": 313, + "instance_id": 3123, + "job_name": "bitsail_clickhouse_to_print_test", + "user_name": "test" + }, + "reader": { + "class": "com.bytedance.bitsail.connector.clickhouse.source.ClickhouseSource", + "jdbc_url": "jdbc:clickhouse://localhost:8123", + "db_name": "default", + "table_name": "test_ch_table", + "split_field": "id", + "split_config": "{\"name\": \"id\", \"lower_bound\": 0, \"upper_bound\": \"10000\", \"split_num\": 3}", + "sql_filter": "( id % 2 == 0 )", + "columns": [ + { + "name": "id", + "type": "int64" + }, + { + "name": "int_type", + "type": "int32" + }, + { + "name": "double_type", + "type": "float64" + }, + { + "name": "string_type", + "type": "string" + }, + { + "name": "p_date", + "type": "date" + } + ] + }, + "writer": { + "class": "com.bytedance.bitsail.connector.legacy.print.sink.PrintSink" + } + } +} +``` + +- 通过代码EmbeddedFlinkCluster.submit来进行作业提交 + +```Java +@Test +public void testClickhouseToPrint() throws Exception { + BitSailConfiguration jobConf = JobConfUtils.fromClasspath("clickhouse_to_print.json"); + EmbeddedFlinkCluster.submitJob(jobConf); +} +``` + +## 提交PR + +当开发者实现自己的Connector后,就可以关联自己的issue,提交PR到github上了,提交之前,开发者记得Connector添加文档,通过review之后,大家贡献的Connector就成为BitSail的一部分了,我们按照贡献程度会选取活跃的Contributor成为我们的Committer,参与BitSail社区的重大决策,希望大家积极参与! 
diff --git a/website/zh/community/contribute.md b/website/zh/community/contribute.md index 99d1607fd..b0b7ef78d 100644 --- a/website/zh/community/contribute.md +++ b/website/zh/community/contribute.md @@ -1,6 +1,7 @@ --- order: 2 --- + # 贡献者指引 [English](../../en/community/contribute.md) | 简体中文 @@ -60,7 +61,7 @@ BitSail项目使用了[Google Java Style Guide](https://google.github.io/stylegu 我们在构建过程中检查重叠的包。当您在构建过程中看到冲突错误时,请从 pom 文件中排除有冲突的包。 ## 提交一个Pull Request -如果是第一次提交 pull request,可以阅读这个文档 [什么是Pull Request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) +如果是第一次提交pull request,可以阅读这个文档 [什么是Pull Request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) - [Fork ***BitSail*** 代码库](https://docs.github.com/en/get-started/quickstart/fork-a-repo) - 在你的fork的代码库中生成一个新分支 @@ -68,6 +69,8 @@ BitSail项目使用了[Google Java Style Guide](https://google.github.io/stylegu - 提交对分支的更改并推送到你fork的仓库 - 向 ***BitSail*** 存储库创建pull request +如果你是初次涉猎开源项目,可以通过阅读[如何提交一个Pull Request](pr_guide.md)了解更详细的指南。 + ## 请求代码审查 准备好pull request后,请确保pull request模板清单中的所有项目都已完成。 在pull request中@任意一个项目的committer进行代码审查。 diff --git a/website/zh/community/mailing.md b/website/zh/community/mailing.md index 5888e3c3a..e40709a29 100644 --- a/website/zh/community/mailing.md +++ b/website/zh/community/mailing.md @@ -1,8 +1,13 @@ --- -order: 3 +order: 5 --- + # 邮件列表 +[English](../../en/community/mailing.md) | 简体中文 + +----- + 当前,BitSail社区通过谷歌群组作为邮件列表的提供者,邮件列表可以在绝大部分地区正常收发邮件。 在订阅BitSail小组的邮件列表后可以通过发送邮件发言 diff --git a/website/zh/community/pr_guide.md b/website/zh/community/pr_guide.md new file mode 100644 index 000000000..8e7dfe45f --- /dev/null +++ b/website/zh/community/pr_guide.md @@ -0,0 +1,303 @@ +--- +order: 3 +--- + +# PR发布指南 + +[English](../../en/community/pr_guide.md) | 简体中文 + +----- + +![](../../images/community/pr_guide/repository_structure.png) + +## Fork BitSail 到自己的仓库 + +![](../../images/community/pr_guide/repository_fork.png) + +## Git的账户配置 + +用户名和邮箱地址的作用:用户名和邮箱地址是本地git客户端的一个变量,每次commit都会用用户名和邮箱纪录,Github的contributions统计就是按邮箱来统计的。 + +查看自己的账户和邮箱地址: + +```Bash +$ git config user.name +$ git config user.email +``` + +如果是第一次使用,或者需要对其进行修改,执行以下命令,将用户名和邮箱地址替换为你自己的即可。 + +```Bash +$ git config --global user.name "username" +$ git config --global user.email "your_email@example.com" +``` + +## 将Fork仓库克隆到本地 + +可选HTTPS或者SSH方式,之后的操作会以SSH方式示例,如果采用HTTPS方式,只需要将命令中的SSH地址全部替换为HTTPS地址即可。 + +### HTTPS + +```Bash +$ git clone git@github.com:{your_github_id}/bitsail.git +``` + +### SSH + +```Bash +$ git clone https://github.com/{your_github_id}/bitsail.git +``` + +![](../../images/community/pr_guide/git_clone_example.png) + +## 设置origin和upstream + +```Bash +$ git remote add origin git@github.com:{your_github_id}/bitsail.git +$ git remote add upstream git@github.com:bytedance/bitsail.git +$ git remote -v +origin git@github.com:{your_github_id}/bitsail.git (fetch) +origin git@github.com:{your_github_id}/bitsail.git (push) +upstream git@github.com:bytedance/bitsail.git (fetch) +upstream git@github.com:bytedance/bitsail.git (push) +``` + +如果`git`的`origin`设置错误,可以执行`git `*`remote`*` rm `*`origin`**清除后重新设置* + +`upstream`同理,设置错误可以通过`git `*`remote`*` rm `*`upstream`*清除后重新设置 + +## 创建自己的工作分支 + +```Bash +查看所有分支 +$ git branch -a +在本地新建一个分支 +$ git branch {your_branch_name} +切换到我的新分支 +$ git checkout {your_branch_name} +将本地分支推送到fork仓库 +$ git push -u origin {your_branch_name} +``` + 
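+上面新建分支和切换分支的两步,也可以用一条命令完成(仅作示意,效果与分开执行相同):
+
+```Bash
+$ git checkout -b {your_branch_name}
+```
+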
+分支名称示例:add-sink-connector-redis + +之后就可以在自己的工作分支进行代码的编写,测试,并及时同步到你的个人分支。 + +```Bash +编辑区添加到暂存区 +$ git add . +暂存区提交到分支 +$ git commit -m "[BitSail] Message" +同步Fork仓库 +$ git push -u origin <分支名> +``` + +## 同步代码 + +BitSail对接口或者版本的更新迭代会谨慎的考量,如果开发者开发周期短,可以在提交代码前对原始仓库做一次同步即可,但是如果不幸遇到了大的版本变更,开发者可以随时跟进对原始仓库的变更。 + +这里为了保证代码分支的干净,推荐采用rebase的方式进行合并。 + +```Bash +$ git fetch upstream +$ git rebase upstream/master +``` + +在rebase过程中,有可能会报告文件的冲突 + +例如如下情况,我们要去手动合并产生冲突的文件`bitsail-connectors/pom.xml` + +```Bash +$ git rebase upstream/master +Auto-merging bitsail-dist/pom.xml +Auto-merging bitsail-connectors/pom.xml +CONFLICT (content): Merge conflict in bitsail-connectors/pom.xml +error: could not apply 054a4d3... [BitSail] Migrate hadoop source&sink to v1 interface +Resolve all conflicts manually, mark them as resolved with +"git add/rm ", then run "git rebase --continue". +You can instead skip this commit: run "git rebase --skip". +To abort and get back to the state before "git rebase", run "git rebase --abort". +Could not apply 054a4d3... [BitSail] Migrate hadoop source&sink to v1 interface +``` + +产生冲突的部分如下所示,`=======`为界, 决定您是否想只保持分支的更改、只保持其他分支的更改,还是进行全新的更改(可能包含两个分支的更改)。 删除冲突标记` <<<<<<<`、`=======`、`>>>>>>>`,并在最终合并中进行所需的更改。 + +```Plain + + bitsail-connectors-legacy + connector-print + connector-elasticsearch + connector-fake + connector-base + connector-doris + connector-kudu + connector-rocketmq + connector-redis + connector-clickhouse +<<<<<<< HEAD + connector-druid +======= + connector-hadoop +>>>>>>> 054a4d3 ([BitSail] Migrate hadoop source&sink to v1 interface) + +``` + +处理完成的示例: + +```Plain + + bitsail-connectors-legacy + connector-print + connector-elasticsearch + connector-fake + connector-base + connector-doris + connector-kudu + connector-rocketmq + connector-redis + connector-clickhouse + connector-druid + connector-hadoop + +``` + +处理完成之后执行`git add `,比如该例中执行: + +```Bash +$ git add bitsail-connectors/pom.xml +$ git rebase --continue +``` + +之后会出现如下窗口,这个是Vim编辑界面,编辑模式按照Vim的进行即可,通常我们只需要对第一行进行Commit信息进行编辑,也可以不修改,完成后按照Vim的退出方式,依次按`: w q 回车`即可。 + +![](../../images/community/pr_guide/git_rebase_example.png) + +之后出现如下表示rebase成功。 + +```Bash +$ git rebase --continue +[detached HEAD 9dcf4ee] [BitSail] Migrate hadoop source&sink to v1 interface + 15 files changed, 766 insertions(+) + create mode 100644 bitsail-connectors/connector-hadoop/pom.xml + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/constant/HadoopConstants.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/error/TextInputFormatErrorCode.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/format/HadoopDeserializationSchema.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/option/HadoopReaderOptions.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/sink/HadoopSink.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/sink/HadoopWriter.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/HadoopSource.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/coordinator/HadoopSourceSplitCoordinator.java + create mode 100644 
bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/reader/HadoopSourceReader.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/reader/HadoopSourceReaderCommonBasePlugin.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/java/com/bytedance/bitsail/connector/hadoop/source/split/HadoopSourceSplit.java + create mode 100644 bitsail-connectors/connector-hadoop/src/main/resources/bitsail-connector-unified-hadoop.json +Successfully rebased and updated refs/heads/add-v1-connector-hadoop. +``` + +此时可以看到我们的`commit`已经被提到了最前面: + +![](../../images/community/pr_guide/commit_info.png) + +rebase之后代码可能无法正常推送 + +```Bash +$ git push +To github.com:love-star/bitsail.git + ! [rejected] add-v1-connector-hadoop -> add-v1-connector-hadoop (non-fast-forward) +error: failed to push some refs to 'github.com:love-star/bitsail.git' +hint: Updates were rejected because the tip of your current branch is behind +hint: its remote counterpart. Integrate the remote changes (e.g. +hint: 'git pull ...') before pushing again. +hint: See the 'Note about fast-forwards' in 'git push --help' for details. +``` + +此时需要`git push -f` 强制推送,强制推送是一个有风险的操作,操作前请仔细检查以避免出现无关代码被强制覆盖的问题。 + +```Bash +$ git push -f +Enumerating objects: 177, done. +Counting objects: 100% (177/177), done. +Delta compression using up to 12 threads +Compressing objects: 100% (110/110), done. +Writing objects: 100% (151/151), 26.55 KiB | 1.40 MiB/s, done. +Total 151 (delta 40), reused 0 (delta 0), pack-reused 0 +remote: Resolving deltas: 100% (40/40), completed with 10 local objects. +To github.com:love-star/bitsail.git + + adb90f4...b72d931 add-v1-connector-hadoop -> add-v1-connector-hadoop (forced update) +``` + +此时分支已经和原始仓库同步,之后的代码编写都会建立在最新的基础上。 + +## 提交代码 + +当开发者开发完毕,首先需要完成一次仓库的rebase,具体参考同步代码的场景。rebase之后,git的历史如下所示: + +![](../../images/community/pr_guide/git_history.png) + +在Github界面如图所示 + +![](../../images/community/pr_guide/github_status.png) + +我们希望在提交PR前仅仅保留一个Commit以保证分支的干净,如果有多次提交,最后可以合并为一个提交。具体操作如下: + +```Bash +$ git reset --soft HEAD~N(N为需要合并的提交次数) +$ git add . +$ git commit -m "[BitSail] Message" +$ git push -f +``` + +比如此例中,执行 + +```Bash +$ git reset --soft HEAD~4 +$ git add . +$ git commit -m "[BitSail#106][Connector] Migrate hadoop source connector to v1 interface" +$ git push -f +``` + +合并后: + +![](../../images/community/pr_guide/after_git_reset.png) + +## 提交PR + +![](../../images/community/pr_guide/github_pr.png) + +提交PR时,应注意Commit message和PR message的规范: + +![](../../images/community/pr_guide/create_pr.png) + +### Commit message 规范 + +1. 创建一个新的Github issue或者关联一个已经存在的 issue +2. 在issue description中描述你想要进行的工作. +3. 在commit message关联你的issue,格式如下: + +```Plain +[BitSail#${IssueNumber}][${Module}] Description +[BitSail#1234][Connector] Improve reader split algorithm to Kudu source connector + +//For Minor change +[Minor] Description +``` + +1. 
commit message的module格式列表如下,如果开发者的工作关联了多个module,选择最相关的module即可,例如:如果你在 kafka connector添加了新的feature,并且改变了common、components和cores中的代码,这时commit message应该绑定的module格式为[Connector]。 + +```Plain +[Common] bitsail-common +[Core] base client component cores +[Connector] all connector related changes +[Doc] documentation or java doc changes +[Build] build, dependency changes +``` + +注意 + +- commit 需遵循规范,给维护者减少维护成本及工作量,对于不符合规范的commit,我们不予合并。 +- 对于解决同一个Issue的PR,只能存在一个commit message,如果出现多次提交的message,我们希望你能将commit message 压缩成一个。 +- message 尽量保持清晰简洁,但是也千万不要因为过度追求简洁导致描述不清楚,如果有必要,我们也不介意message过长,前提是,能够把解决方案、修复内容描述清楚。 + +### PR message规范 + +PR message应概括清楚问题的前因后果,如果存在对应issue要附加issue地址,保证问题是可追溯的。 diff --git a/website/zh/community/release_guide.md b/website/zh/community/release_guide.md new file mode 100644 index 000000000..09d5ddd67 --- /dev/null +++ b/website/zh/community/release_guide.md @@ -0,0 +1,111 @@ +--- +order: 4 +--- + +# BitSail 发版指南 + +[English](../../en/community/release_guide.md) | 简体中文 + +----- + +## 提交 pull request 的流程 + +提交一个新的 commit 的标准程序 + +1. 创建一个新的Github issue或者关联一个已经存在的 issue +2. 在issue description中描述你想要进行的工作. +3. 在commit message关联你的issue,格式如下: + +```Plain +[BitSail#${IssueNumber}][${Module}] Description +[BitSail#1234][Connector] Improve reader split algorithm to Kudu source connector + +//For Minor change +[Minor] Description +``` + +4. commit message的module格式列表如下,如果开发者的工作关联了多个module,选择最相关的module即可,例如:如果你在 kafka connector添加了新的feature,并且改变了common、components和cores中的代码,这时commit message应该绑定的module格式为[Connector]。 + +```Plain +[Common] bitsail-common +[Core] base client component cores +[Connector] all connector related changes +[Doc] documentation or java doc changes +[Build] build, dependency changes +``` + +## release 流程 + +![img](../../images/community/release_guide/release_procedure.png) + +### 1. release 决议阶段 + +因为目前的订阅mailing list的用户不多,使用 Github issue 来讨论release相关的话题应该会有更好的可见性。 + +我们可以在 Github 上开始一个新的讨论,主题如下 + +`0.1.0` 发布讨论 + +决定发布和选择Release Manager是发布过程的第一步。 这是整个社区基于共识的决定。 + +任何人都可以在 Github issue上提议一个release,给出可靠的论据并提名一名committer作为Release Manager (包括他们自己)。 没有正式的流程,没有投票要求,也没有时间要求。 任何异议都应在开始发布之前以协商的方式解决。 + +### 2. relase 准备阶段 + +A. 审理 release-blocking issues + +B. 审查和更新文件 + +C. 跨团队测试 + +D. 查看发行说明 + +E. 验证构建和测试 + +F. 创建发布分支 + +G. 修改master的版本 + +### 3. release candidate 构建阶段 + +由于我们暂时没有maven central access,我们将在github上构建一个release candidate,让其他用户测试。 + +A. 添加 git release 标签 + +B. 发布在Github上供公众下载 + +### 4. release candidate 投票阶段 + +一旦release分支和release candidate准备就绪,release manager将要求其他committers测试release candidate并开始对相应的 Github Issue 进行投票。 我们至少需要 3 个来自 PMC 成员的盲选票。 + +### 5. 问题修复阶段 + +社区审查和投票期间发现的任何问题都应在此步骤中解决。 + +代码更改应该以标准pull requests的方式提交给 master 分支,并使用标准的贡献流程进行审查。 之后将相关更改同步到发布分支中。 使用 cherry-pick 命令将这些代码更改的commits应用于release分支,再次使用标准的贡献过程进行代码审查和合并。 + +解决所有问题后,将更改构建新的release candidate。 + +### 6. release 结束阶段 + +一旦release candidate通过投票,我们就可以最终确定release。 + +A. 将release分支版本从 `x.x.x-rc1` 更改为 `x.x.x`。 例如 `0.1.0-rc1` 到 `0.1.0` + +B. `git commit -am “[MINOR] 更新release版本以反映release版本 ${RELEASE_VERSION}”` + +C.推送到release分支 + +D. 解决相关Github issue + +E. 创建一个新的 Github release,去掉之前推送的 release version 标签 + +### 7. 
release 发布阶段 + +在我们发布release后的 24 小时内,会在所有社区渠道上推广该版本,包括微信、飞书、mailing list。 + +### 参考: + +Flink release 指南: [Creating a Flink Release](https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release) + +Hudi release 指南: [Apache Hudi Release Guide](https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide) \ No newline at end of file diff --git a/website/zh/community/sink_connector_detail.md b/website/zh/community/sink_connector_detail.md new file mode 100644 index 000000000..88c5ef40b --- /dev/null +++ b/website/zh/community/sink_connector_detail.md @@ -0,0 +1,391 @@ +--- +order: 9 +--- + +# Sink Connector 详解 + +[English](../../en/community/sink_connector_detail.md) | 简体中文 + +----- + +## BitSail Sink Connector交互流程介绍 + +![](../../images/community/connector_quick_start/sink_connector.png) + +- Sink:数据写入组件的生命周期管理,主要负责和框架的交互,构架作业,它不参与作业真正的执行。 +- Writer:负责将接收到的数据写到外部存储。 +- WriterCommitter(可选):对数据进行提交操作,来完成两阶段提交的操作;实现exactly-once的语义。 + +开发者首先需要创建`Sink`类,实现`Sink`接口,主要负责数据写入组件的生命周期管理,构架作业。通过`configure`方法定义`writerConfiguration`的配置,通过`createTypeInfoConverter`方法来进行数据类型转换,将内部类型进行转换写到外部系统,同`Source`部分。之后我们再定义`Writer`类实现具体的数据写入逻辑,在`write`方法调用时将`BitSail Row`类型把数据写到缓存队列中,在`flush`方法调用时将缓存队列中的数据刷写到目标数据源中。 + +## Sink + +数据写入组件的生命周期管理,主要负责和框架的交互,构架作业,它不参与作业真正的执行。 + +对于每一个Sink任务,我们要实现一个继承Sink接口的类。 + +![](../../images/community/sink_connector/sink_diagram.png) + +### Sink接口 + +```Java +public interface Sink extends Serializable { + + /** + * @return The name of writer operation. + */ + String getWriterName(); + + /** + * Configure writer with user defined options. + * + * @param commonConfiguration Common options. + * @param writerConfiguration Options for writer. + */ + void configure(BitSailConfiguration commonConfiguration, BitSailConfiguration writerConfiguration) throws Exception; + + /** + * Create a writer for processing elements. + * + * @return An initialized writer. + */ + Writer createWriter(Writer.Context context) throws IOException; + + /** + * @return A converter which supports conversion from BitSail {@link TypeInfo} + * and external engine type. + */ + default TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); + } + + /** + * @return A committer for commit committable objects. + */ + default Optional> createCommitter() { + return Optional.empty(); + } + + /** + * @return A serializer which convert committable object to byte array. + */ + default BinarySerializer getCommittableSerializer() { + return new SimpleBinarySerializer(); + } + + /** + * @return A serializer which convert state object to byte array. + */ + default BinarySerializer getWriteStateSerializer() { + return new SimpleBinarySerializer(); + } +} +``` + +### configure方法 + +负责configuration的初始化,通过commonConfiguration中的配置区分流式任务或者批式任务,向Writer类传递writerConfiguration。 + +#### 示例 + +ElasticsearchSink: + +```Java +public void configure(BitSailConfiguration commonConfiguration, BitSailConfiguration writerConfiguration) { + writerConf = writerConfiguration; +} +``` + +### createWriter方法 + +负责生成一个继承自Writer接口的connector Writer类。 + +### createTypeInfoConverter方法 + +类型转换,将内部类型进行转换写到外部系统,同Source部分。 + +### createCommitter方法 + +可选方法,书写具体数据提交逻辑,一般用于想要保证数据exactly-once语义的场景,writer在完成数据写入后,committer来完成提交,进而实现二阶段提交,详细可以参考Doris Connector的实现。 + +## Writer + +具体的数据写入逻辑 + +![](../../images/community/sink_connector/writer_diagram.png) + +### Writer接口 + +```Java +public interface Writer extends Serializable, Closeable { + + /** + * Output an element to target source. 
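+   * Implementations usually buffer the element here and write the buffered data out in {@link #flush(boolean)}.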
+ * + * @param element Input data from upstream. + */ + void write(InputT element) throws IOException; + + /** + * Flush buffered input data to target source. + * + * @param endOfInput Flag indicates if all input data are delivered. + */ + void flush(boolean endOfInput) throws IOException; + + /** + * Prepare commit information before snapshotting when checkpoint is triggerred. + * + * @return Information to commit in this checkpoint. + * @throws IOException Exceptions encountered when preparing committable information. + */ + List prepareCommit() throws IOException; + + /** + * Do snapshot for at each checkpoint. + * + * @param checkpointId The id of checkpoint when snapshot triggered. + * @return The current state of writer. + * @throws IOException Exceptions encountered when snapshotting. + */ + default List snapshotState(long checkpointId) throws IOException { + return Collections.emptyList(); + } + + /** + * Closing writer when operator is closed. + * + * @throws IOException Exception encountered when closing writer. + */ + default void close() throws IOException { + + } + + interface Context extends Serializable { + + TypeInfo[] getTypeInfos(); + + int getIndexOfSubTaskId(); + + boolean isRestored(); + + List getRestoreStates(); + } +} +``` + +### 构造方法 + +根据writerConfiguration配置初始化数据源的连接对象。 + +#### 示例 + +```Java +public RedisWriter(BitSailConfiguration writerConfiguration) { + // initialize ttl + int ttl = writerConfiguration.getUnNecessaryOption(RedisWriterOptions.TTL, -1); + TtlType ttlType; + try { + ttlType = TtlType.valueOf(StringUtils.upperCase(writerConfiguration.get(RedisWriterOptions.TTL_TYPE))); + } catch (IllegalArgumentException e) { + throw BitSailException.asBitSailException(RedisPluginErrorCode.ILLEGAL_VALUE, + String.format("unknown ttl type: %s", writerConfiguration.get(RedisWriterOptions.TTL_TYPE))); + } + int ttlInSeconds = ttl < 0 ? 
-1 : ttl * ttlType.getContainSeconds(); + log.info("ttl is {}(s)", ttlInSeconds); + + // initialize commandDescription + String redisDataType = StringUtils.upperCase(writerConfiguration.get(RedisWriterOptions.REDIS_DATA_TYPE)); + String additionalKey = writerConfiguration.getUnNecessaryOption(RedisWriterOptions.ADDITIONAL_KEY, "default_redis_key"); + this.commandDescription = initJedisCommandDescription(redisDataType, ttlInSeconds, additionalKey); + this.columnSize = writerConfiguration.get(RedisWriterOptions.COLUMNS).size(); + + // initialize jedis pool + JedisPoolConfig jedisPoolConfig = new JedisPoolConfig(); + jedisPoolConfig.setMaxTotal(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MAX_TOTAL_CONNECTIONS)); + jedisPoolConfig.setMaxIdle(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MAX_IDLE_CONNECTIONS)); + jedisPoolConfig.setMinIdle(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MIN_IDLE_CONNECTIONS)); + jedisPoolConfig.setMaxWait(Duration.ofMillis(writerConfiguration.get(RedisWriterOptions.JEDIS_POOL_MAX_WAIT_TIME_IN_MILLIS))); + + String redisHost = writerConfiguration.getNecessaryOption(RedisWriterOptions.HOST, RedisPluginErrorCode.REQUIRED_VALUE); + int redisPort = writerConfiguration.getNecessaryOption(RedisWriterOptions.PORT, RedisPluginErrorCode.REQUIRED_VALUE); + String redisPassword = writerConfiguration.get(RedisWriterOptions.PASSWORD); + int timeout = writerConfiguration.get(RedisWriterOptions.CLIENT_TIMEOUT_MS); + + if (StringUtils.isEmpty(redisPassword)) { + this.jedisPool = new JedisPool(jedisPoolConfig, redisHost, redisPort, timeout); + } else { + this.jedisPool = new JedisPool(jedisPoolConfig, redisHost, redisPort, timeout, redisPassword); + } + + // initialize record queue + int batchSize = writerConfiguration.get(RedisWriterOptions.WRITE_BATCH_INTERVAL); + this.recordQueue = new CircularFifoQueue<>(batchSize); + + this.logSampleInterval = writerConfiguration.get(RedisWriterOptions.LOG_SAMPLE_INTERVAL); + this.jedisFetcher = RetryerBuilder.newBuilder() + .retryIfResult(Objects::isNull) + .retryIfRuntimeException() + .withStopStrategy(StopStrategies.stopAfterAttempt(3)) + .withWaitStrategy(WaitStrategies.exponentialWait(100, 5, TimeUnit.MINUTES)) + .build() + .wrap(jedisPool::getResource); + + this.maxAttemptCount = writerConfiguration.get(RedisWriterOptions.MAX_ATTEMPT_COUNT); + this.retryer = RetryerBuilder.newBuilder() + .retryIfResult(needRetry -> Objects.equals(needRetry, true)) + .retryIfException(e -> !(e instanceof BitSailException)) + .withWaitStrategy(WaitStrategies.fixedWait(3, TimeUnit.SECONDS)) + .withStopStrategy(StopStrategies.stopAfterAttempt(maxAttemptCount)) + .build(); +} +``` + +### write方法 + +该方法调用时会将BitSail Row类型把数据写到缓存队列中,也可以在这里对Row类型数据进行各种格式预处理。直接存储到缓存队列中,或者进行加工处理。如果这里设定了缓存队列的大小,那么在缓存队列写满后要调用flush进行刷写。 + +#### 示例 + +redis:将BitSail Row格式的数据直接存储到一定大小的缓存队列中 + +```Java +public void write(Row record) throws IOException { + validate(record); + this.recordQueue.add(record); + if (recordQueue.isAtFullCapacity()) { + flush(false); + } +} +``` + +Druid:将BitSail Row格式的数据做格式预处理,转化到StringBuffer中储存起来。 + +```Java +@Override +public void write(final Row element) { + final StringJoiner joiner = new StringJoiner(DEFAULT_FIELD_DELIMITER, "", ""); + for (int i = 0; i < element.getArity(); i++) { + final Object v = element.getField(i); + if (v != null) { + joiner.add(v.toString()); + } + } + // timestamp column is a required field to add in Druid. 
+ // See https://druid.apache.org/docs/24.0.0/ingestion/data-model.html#primary-timestamp + joiner.add(String.valueOf(processTime)); + data.append(joiner); + data.append(DEFAULT_LINE_DELIMITER); +} +``` + +### flush方法 + +该方法中主要实现将write方法的缓存中的数据刷写到目标数据源中。 + +#### 示例 + +redis:将缓存队列中的BitSail Row格式的数据刷写到目标数据源中。 + +```Java +public void flush(boolean endOfInput) throws IOException { + processorId++; + try (PipelineProcessor processor = genPipelineProcessor(recordQueue.size(), this.complexTypeWithTtl)) { + Row record; + while ((record = recordQueue.poll()) != null) { + + String key = (String) record.getField(0); + String value = (String) record.getField(1); + String scoreOrHashKey = value; + if (columnSize == SORTED_SET_OR_HASH_COLUMN_SIZE) { + value = (String) record.getField(2); + // Replace empty key with additionalKey in sorted set and hash. + if (key.length() == 0) { + key = commandDescription.getAdditionalKey(); + } + } + + if (commandDescription.getJedisCommand() == JedisCommand.ZADD) { + // sorted set + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), parseScoreFromString(scoreOrHashKey), value.getBytes())); + } else if (commandDescription.getJedisCommand() == JedisCommand.HSET) { + // hash + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), scoreOrHashKey.getBytes(), value.getBytes())); + } else if (commandDescription.getJedisCommand() == JedisCommand.HMSET) { + //mhset + if ((record.getArity() - 1) % 2 != 0) { + throw new BitSailException(CONVERT_NOT_SUPPORT, "Inconsistent data entry."); + } + List datas = Arrays.stream(record.getFields()) + .collect(Collectors.toList()).stream().map(o -> ((String) o).getBytes()) + .collect(Collectors.toList()).subList(1, record.getFields().length); + Map map = new HashMap<>((record.getArity() - 1) / 2); + for (int index = 0; index < datas.size(); index = index + 2) { + map.put(datas.get(index), datas.get(index + 1)); + } + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), map)); + } else { + // set and string + processor.addInitialCommand(new Command(commandDescription, key.getBytes(), value.getBytes())); + } + } + retryer.call(processor::run); + } catch (ExecutionException | RetryException e) { + if (e.getCause() instanceof BitSailException) { + throw (BitSailException) e.getCause(); + } else if (e.getCause() instanceof RedisUnexpectedException) { + throw (RedisUnexpectedException) e.getCause(); + } + throw e; + } catch (IOException e) { + throw new RuntimeException("Error while init jedis client.", e); + } +} +``` + +Druid:使用HTTP post方式提交sink作业给数据源。 + +```Java +private HttpURLConnection provideHttpURLConnection(final String coordinatorURL) throws IOException { + final URL url = new URL("http://" + coordinatorURL + DRUID_ENDPOINT); + final HttpURLConnection con = (HttpURLConnection) url.openConnection(); + con.setRequestMethod("POST"); + con.setRequestProperty("Content-Type", "application/json"); + con.setRequestProperty("Accept", "application/json, text/plain, */*"); + con.setDoOutput(true); + return con; + } + + public void flush(final boolean endOfInput) throws IOException { + final ParallelIndexIOConfig ioConfig = provideDruidIOConfig(data); + final ParallelIndexSupervisorTask indexTask = provideIndexTask(ioConfig); + final String inputJSON = provideInputJSONString(indexTask); + final byte[] input = inputJSON.getBytes(); + try (final OutputStream os = httpURLConnection.getOutputStream()) { + os.write(input, 0, input.length); + } + try (final BufferedReader br = + new 
BufferedReader(new InputStreamReader(httpURLConnection.getInputStream(), StandardCharsets.UTF_8))) { + final StringBuilder response = new StringBuilder(); + String responseLine; + while ((responseLine = br.readLine()) != null) { + response.append(responseLine.trim()); + } + LOG.info("Druid write task has been sent, and the response is {}", response); + } + } +``` + +### close方法 + +关闭之前创建的各种目标数据源连接对象。 + +#### 示例 + +```Java +public void close() throws IOException { + bulkProcessor.close(); + restClient.close(); + checkErrorAndRethrow(); +} +``` \ No newline at end of file diff --git a/website/zh/community/source_connector_detail.md b/website/zh/community/source_connector_detail.md new file mode 100644 index 000000000..20c1be933 --- /dev/null +++ b/website/zh/community/source_connector_detail.md @@ -0,0 +1,1544 @@ +--- +order: 8 +--- + +# Source Connector 详解 + +[English](../../en/community/source_connector_detail.md) | 简体中文 + +----- + +## BitSail Source Connector交互流程介绍 + +![](../../images/community/connector_quick_start/bitsail_model.png) + +- Source: 参与数据读取组件的生命周期管理,主要负责和框架的交互,构架作业,不参与作业真正的执行。 +- SourceSplit: 数据读取分片,大数据处理框架的核心目的就是将大规模的数据拆分成为多个合理的Split并行处理。 +- State:作业状态快照,当开启checkpoint之后,会保存当前执行状态。 +- SplitCoordinator: SplitCoordinator承担创建、管理Split的角色。 +- SourceReader: 真正负责数据读取的组件,在接收到Split后会对其进行数据读取,然后将数据传输给下一个算子。 + +## Source + +数据读取组件的生命周期管理,主要负责和框架的交互,构架作业,它不参与作业真正的执行。 + +以RocketMQSource为例:Source方法需要实现Source和ParallelismComputable接口。 + +![](../../images/community/source_connector/source_diagram.png) + +### Source接口 + +```Java +public interface Source + extends Serializable, TypeInfoConverterFactory { + + /** + * Run in client side for source initialize; + */ + void configure(ExecutionEnviron execution, BitSailConfiguration readerConfiguration) throws IOException; + + /** + * Indicate the Source type. + */ + Boundedness getSourceBoundedness(); + + /** + * Create Source Reader. + */ + SourceReader createReader(SourceReader.Context readerContext); + + /** + * Create split coordinator. + */ + SourceSplitCoordinator createSplitCoordinator(SourceSplitCoordinator.Context coordinatorContext); + + /** + * Get Split serializer for the framework,{@link SplitT}should implement from {@link Serializable} + */ + default BinarySerializer getSplitSerializer() { + return new SimpleBinarySerializer<>(); + } + + /** + * Get State serializer for the framework, {@link StateT}should implement from {@link Serializable} + */ + default BinarySerializer getSplitCoordinatorCheckpointSerializer() { + return new SimpleBinarySerializer<>(); + } + + /** + * Create type info converter for the source, default value {@link BitSailTypeInfoConverter} + */ + default TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); + } + + /** + * Get Source' name. 
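+   * When a FileMappingTypeInfoConverter is used, this name also decides which {readername}-type-converter.yaml mapping file is loaded.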
+ */ + String getReaderName(); +} +``` + +#### configure方法 + +主要去做一些客户端的配置的分发和提取,可以操作运行时环境ExecutionEnviron的配置和readerConfiguration的配置。 + +##### 示例 + +```Java +@Override +public void configure(ExecutionEnviron execution, BitSailConfiguration readerConfiguration) { + this.readerConfiguration = readerConfiguration; + this.commonConfiguration = execution.getCommonConfiguration(); +} +``` + +#### getSourceBoundedness方法 + +设置作业的处理方式,是采用流式处理方法、批式处理方法,或者是流批一体的处理方式,在流批一体的场景中,我们需要根据作业的不同类型设置不同的处理方式。 + +具体对应关系如下: + +| Job Type | Boundedness | +| -------- | --------------------------- | +| batch | Boundedness.*BOUNDEDNESS* | +| stream | Boundedness.*UNBOUNDEDNESS* | + +##### 流批一体场景示例 + +```Java +@Override +public Boundedness getSourceBoundedness() { + return Mode.BATCH.equals(Mode.getJobRunMode(commonConfiguration.get(CommonOptions.JOB_TYPE))) ? + Boundedness.BOUNDEDNESS : + Boundedness.UNBOUNDEDNESS; +} +``` + +##### 批式场景示例 + +```Java +public Boundedness getSourceBoundedness() { + return Boundedness.BOUNDEDNESS; +} +``` + +#### createTypeInfoConverter方法 + +用于指定Source连接器的类型转换器;我们知道大多数的外部数据系统都存在着自己的类型定义,它们的定义与BitSail的类型定义不会完全一致;为了简化类型定义的转换,我们支持了通过配置文件来映射两者之间的关系,进而来简化配置文件的开发。 + +在行为上表现为对任务描述Json文件中`reader`部分的`columns`的解析,对于`columns`中不同字段的type会根据上面描述文件从`ClickhouseReaderOptions.`*`COLUMNS`*字段中解析到`readerContext.getTypeInfos()`中。 + +##### 实现 + +- `BitSailTypeInfoConverter` + - 默认的`TypeInfoConverter`,直接对`ReaderOptions.`*`COLUMNS`*字段进行字符串的直接解析,*`COLUMNS`*字段中是什么类型,`TypeInfoConverter`中就是什么类型。 +- `FileMappingTypeInfoConverter` + - 会在BitSail类型系统转换时去绑定`{readername}-type-converter.yaml`文件,做数据库字段类型和BitSail类型的映射。`ReaderOptions.`*`COLUMNS`*字段在通过这个映射文件转换后才会映射到`TypeInfoConverter`中。 + +##### 示例 + +###### FileMappingTypeInfoConverter + +通过JDBC方式连接的数据库,包括MySql、Oracle、SqlServer、Kudu、ClickHouse等。这里数据源的特点是以`java.sql.ResultSet`的接口形式返回获取的数据,对于这类数据库,我们往往将`TypeInfoConverter`对象设计为`FileMappingTypeInfoConverter`,这个对象会在BitSail类型系统转换时去绑定`{readername}-type-converter.yaml`文件,做数据库字段类型和BitSail类型的映射。 + +```Java +@Override +public TypeInfoConverter createTypeInfoConverter() { + return new FileMappingTypeInfoConverter(getReaderName()); +} +``` + +对于`{readername}-type-converter.yaml`文件的解析,以`clickhouse-type-converter.yaml`为例。 + +```Plain +# Clickhouse Type to BitSail Type +engine.type.to.bitsail.type.converter: + + - source.type: int32 + target.type: int + + - source.type: float64 + target.type: double + + - source.type: string + target.type: string + + - source.type: date + target.type: date.date + + - source.type: null + target.type: void + +# BitSail Type to Clickhouse Type +bitsail.type.to.engine.type.converter: + + - source.type: int + target.type: int32 + + - source.type: double + target.type: float64 + + - source.type: date.date + target.type: date + + - source.type: string + target.type: string +``` + +这个文件起到的作用是进行job描述json文件中`reader`部分的`columns`的解析,对于`columns`中不同字段的type会根据上面描述文件从`ClickhouseReaderOptions.`*`COLUMNS`*字段中解析到`readerContext.getTypeInfos()`中。 + +```Json +"reader": { + "class": "com.bytedance.bitsail.connector.clickhouse.source.ClickhouseSource", + "jdbc_url": "jdbc:clickhouse://localhost:8123", + "db_name": "default", + "table_name": "test_ch_table", + "split_field": "id", + "split_config": "{\"name\": \"id\", \"lower_bound\": 0, \"upper_bound\": \"10000\", \"split_num\": 3}", + "sql_filter": "( id % 2 == 0 )", + "columns": [ + { + "name": "id", + "type": "int64" + }, + { + "name": "int_type", + "type": "int32" + }, + { + "name": "double_type", + "type": "float64" + }, + { + "name": "string_type", + "type": "string" + }, + { 
+ "name": "p_date", + "type": "date" + } + ] +}, +``` + +![](../../images/community/source_connector/file_mapping_converter.png) + +这种方式不仅仅适用于数据库,也适用于所有需要在类型转换中需要引擎侧和BitSail侧进行类型映射的场景。 + +###### BitSailTypeInfoConverter + +通常采用默认的方式进行类型转换,直接对`ReaderOptions.``COLUMNS`字段进行字符串的直接解析。 + +```Java +@Override +public TypeInfoConverter createTypeInfoConverter() { + return new BitSailTypeInfoConverter(); +} +``` + +以Hadoop为例: + +```Json +"reader": { + "class": "com.bytedance.bitsail.connector.hadoop.source.HadoopSource", + "path_list": "hdfs://127.0.0.1:9000/test_namespace/source/test.json", + "content_type":"json", + "reader_parallelism_num": 1, + "columns": [ + { + "name":"id", + "type": "int" + }, + { + "name": "string_type", + "type": "string" + }, + { + "name": "map_string_string", + "type": "map" + }, + { + "name": "array_string", + "type": "list" + } + ] +} +``` + +![](../../images/community/source_connector/bitsail_converter.png) + +#### createSourceReader方法 + +书写具体的数据读取逻辑,负责数据读取的组件,在接收到Split后会对其进行数据读取,然后将数据传输给下一个算子。 + +具体传入构造SourceReader的参数按需求决定,但是一定要保证所有参数可以序列化。如果不可序列化,将会在createJobGraph的时候出错。 + +##### 示例 + +```Java +public SourceReader createReader(SourceReader.Context readerContext) { + return new RocketMQSourceReader( + readerConfiguration, + readerContext, + getSourceBoundedness()); +} +``` + +#### createSplitCoordinator方法 + +书写具体的数据分片、分片分配逻辑,SplitCoordinator承担了去创建、管理Split的角色。 + +具体传入构造SplitCoordinator的参数按需求决定,但是一定要保证所有参数可以序列化。如果不可序列化,将会在createJobGraph的时候出错。 + +##### 示例 + +```Java +public SourceSplitCoordinator createSplitCoordinator(SourceSplitCoordinator + .Context coordinatorContext) { + return new RocketMQSourceSplitCoordinator( + coordinatorContext, + readerConfiguration, + getSourceBoundedness()); +} +``` + +### ParallelismComputable接口 + +```Java +public interface ParallelismComputable extends Serializable { + + /** + * give a parallelism advice for reader/writer based on configurations and upstream parallelism advice + * + * @param commonConf common configuration + * @param selfConf reader/writer configuration + * @param upstreamAdvice parallelism advice from upstream (when an operator has no upstream in DAG, its upstream is + * global parallelism) + * @return parallelism advice for the reader/writer + */ + ParallelismAdvice getParallelismAdvice(BitSailConfiguration commonConf, + BitSailConfiguration selfConf, + ParallelismAdvice upstreamAdvice) throws Exception; +} +``` + +#### getParallelismAdvice方法 + +用于指定下游reader的并行数目。一般有以下的方式: + +可以选择`selfConf.get(ClickhouseReaderOptions.`*`READER_PARALLELISM_NUM`*`)`来指定并行度。 + +也可以自定义自己的并行度划分逻辑。 + +##### 示例 + +比如在RocketMQ中,我们可以定义每1个reader可以处理至多4个队列*`DEFAULT_ROCKETMQ_PARALLELISM_THRESHOLD `*`= 4` + +通过这种自定义的方式获取对应的并行度。 + +```Java +public ParallelismAdvice getParallelismAdvice(BitSailConfiguration commonConfiguration, + BitSailConfiguration rocketmqConfiguration, + ParallelismAdvice upstreamAdvice) throws Exception { + String cluster = rocketmqConfiguration.get(RocketMQSourceOptions.CLUSTER); + String topic = rocketmqConfiguration.get(RocketMQSourceOptions.TOPIC); + String consumerGroup = rocketmqConfiguration.get(RocketMQSourceOptions.CONSUMER_GROUP); + DefaultLitePullConsumer consumer = RocketMQUtils.prepareRocketMQConsumer(rocketmqConfiguration, String.format(SOURCE_INSTANCE_NAME_TEMPLATE, + cluster, + topic, + consumerGroup, + UUID.randomUUID() + )); + try { + consumer.start(); + Collection messageQueues = consumer.fetchMessageQueues(topic); + int adviceParallelism = Math.max(CollectionUtils.size(messageQueues) / 
DEFAULT_ROCKETMQ_PARALLELISM_THRESHOLD, 1); + + return ParallelismAdvice.builder() + .adviceParallelism(adviceParallelism) + .enforceDownStreamChain(true) + .build(); + } finally { + consumer.shutdown(); + } + } +} +``` + +## SourceSplit + +数据源的数据分片格式,需要我们实现SourceSplit接口。 + +![](../../images/community/source_connector/source_split_diagram.png) + +### SourceSplit接口 + +要求我们实现一个实现一个获取splitId的方法。 + +```Java +public interface SourceSplit extends Serializable { + String uniqSplitId(); +} +``` + +对于具体切片的格式,开发者可以按照自己的需求进行自定义。 + +### 示例 + +#### JDBC类存储 + +一般会通过主键,来对数据进行最大、最小值的划分;对于无主键类则通常会将其认定为一个split,不再进行拆分,所以split中的参数包括主键的最大最小值,以及一个布尔类型的`readTable`,如果无主键类或是不进行主键的切分则整张表会视为一个split,此时`readTable`为`true`,如果按主键最大最小值进行切分,则设置为`false`。 + +以ClickhouseSourceSplit为例: + +```Java +@Setter +public class ClickhouseSourceSplit implements SourceSplit { + public static final String SOURCE_SPLIT_PREFIX = "clickhouse_source_split_"; + private static final String BETWEEN_CLAUSE = "( `%s` BETWEEN ? AND ? )"; + + private final String splitId; + + /** + * Read whole table or range [lower, upper] + */ + private boolean readTable; + private Long lower; + private Long upper; + + public ClickhouseSourceSplit(int splitId) { + this.splitId = SOURCE_SPLIT_PREFIX + splitId; + } + + @Override + public String uniqSplitId() { + return splitId; + } + + public void decorateStatement(PreparedStatement statement) { + try { + if (readTable) { + lower = Long.MIN_VALUE; + upper = Long.MAX_VALUE; + } + statement.setObject(1, lower); + statement.setObject(2, upper); + } catch (SQLException e) { + throw BitSailException.asBitSailException(CommonErrorCode.RUNTIME_ERROR, "Failed to decorate statement with split " + this, e.getCause()); + } + } + + public static String getRangeClause(String splitField) { + return StringUtils.isEmpty(splitField) ? 
null : String.format(BETWEEN_CLAUSE, splitField); + } + + @Override + public String toString() { + return String.format( + "{\"split_id\":\"%s\", \"lower\":%s, \"upper\":%s, \"readTable\":%s}", + splitId, lower, upper, readTable); + } +} +``` + +#### 消息队列 + +一般按照消息队列中topic注册的partitions的数量进行split的划分,切片中主要应包含消费的起点和终点以及消费的队列。 + +以RocketMQSplit为例: + +```Java +@Builder +@Getter +public class RocketMQSplit implements SourceSplit { + + private MessageQueue messageQueue; + + @Setter + private long startOffset; + + private long endOffset; + + private String splitId; + + @Override + public String uniqSplitId() { + return splitId; + } + + @Override + public String toString() { + return "RocketMQSplit{" + + "messageQueue=" + messageQueue + + ", startOffset=" + startOffset + + ", endOffset=" + endOffset + + '}'; + } +} +``` + +#### 文件系统 + +一般会按照文件作为最小粒度进行划分,同时有些格式也支持将单个文件拆分为多个子Splits。文件系统split中需要包装所需的文件切片。 + +以FtpSourceSplit为例: + +```Java +public class FtpSourceSplit implements SourceSplit { + + public static final String FTP_SOURCE_SPLIT_PREFIX = "ftp_source_split_"; + + private final String splitId; + + @Setter + private String path; + @Setter + private long fileSize; + + public FtpSourceSplit(int splitId) { + this.splitId = FTP_SOURCE_SPLIT_PREFIX + splitId; + } + + @Override + public String uniqSplitId() { + return splitId; + } + + @Override + public boolean equals(Object obj) { + return (obj instanceof FtpSourceSplit) && (splitId.equals(((FtpSourceSplit) obj).splitId)); + } + +} +``` + +特别的,在Hadoop文件系统中,我们也可以利用对`org.apache.hadoop.mapred.InputSplit`类的包装来自定义我们的Split。 + +```Java +public class HadoopSourceSplit implements SourceSplit { + private static final long serialVersionUID = 1L; + private final Class splitType; + private transient InputSplit hadoopInputSplit; + + private byte[] hadoopInputSplitByteArray; + + public HadoopSourceSplit(InputSplit inputSplit) { + if (inputSplit == null) { + throw new NullPointerException("Hadoop input split must not be null"); + } + + this.splitType = inputSplit.getClass(); + this.hadoopInputSplit = inputSplit; + } + + public InputSplit getHadoopInputSplit() { + return this.hadoopInputSplit; + } + + public void initInputSplit(JobConf jobConf) { + if (this.hadoopInputSplit != null) { + return; + } + + checkNotNull(hadoopInputSplitByteArray); + + try { + this.hadoopInputSplit = (InputSplit) WritableFactories.newInstance(splitType); + + if (this.hadoopInputSplit instanceof Configurable) { + ((Configurable) this.hadoopInputSplit).setConf(jobConf); + } else if (this.hadoopInputSplit instanceof JobConfigurable) { + ((JobConfigurable) this.hadoopInputSplit).configure(jobConf); + } + + if (hadoopInputSplitByteArray != null) { + try (ObjectInputStream objectInputStream = new ObjectInputStream(new ByteArrayInputStream(hadoopInputSplitByteArray))) { + this.hadoopInputSplit.readFields(objectInputStream); + } + + this.hadoopInputSplitByteArray = null; + } + } catch (Exception e) { + throw new RuntimeException("Unable to instantiate Hadoop InputSplit", e); + } + } + + private void writeObject(ObjectOutputStream out) throws IOException { + + if (hadoopInputSplit != null) { + try ( + ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(); + ObjectOutputStream objectOutputStream = new ObjectOutputStream(byteArrayOutputStream) + ) { + this.hadoopInputSplit.write(objectOutputStream); + objectOutputStream.flush(); + this.hadoopInputSplitByteArray = byteArrayOutputStream.toByteArray(); + } + } + out.defaultWriteObject(); + } + + @Override + public String 
uniqSplitId() { + return hadoopInputSplit.toString(); + } +} +``` + +## State + +在需要做checkpoint的场景下,通常我们会通过Map来保留当前的执行状态 + +### 流批一体场景 + +在流批一体场景中,我们需要保存状态以便从异常中断的流式作业恢复 + +以RocketMQState为例: + +```Java +public class RocketMQState implements Serializable { + + private final Map assignedWithSplitIds; + + public RocketMQState(Map assignedWithSplitIds) { + this.assignedWithSplitIds = assignedWithSplitIds; + } + + public Map getAssignedWithSplits() { + return assignedWithSplitIds; + } +} +``` + +### 批式场景 + +对于批式场景,我们可以使用`EmptyState`不存储状态,如果需要状态存储,和流批一体场景采用相似的设计方案。 + +```Java +public class EmptyState implements Serializable { + + public static EmptyState fromBytes() { + return new EmptyState(); + } +} +``` + +## SourceSplitCoordinator + +大数据处理框架的核心目的就是将大规模的数据拆分成为多个合理的Split,SplitCoordinator承担这个创建、管理Split的角色。 + +![](../../images/community/source_connector/source_split_coordinator_diagram.png) + +### SourceSplitCoordinator接口 + +```Java +public interface SourceSplitCoordinator extends Serializable, AutoCloseable { + + void start(); + + void addReader(int subtaskId); + + void addSplitsBack(List splits, int subtaskId); + + void handleSplitRequest(int subtaskId, @Nullable String requesterHostname); + + default void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) { + } + + StateT snapshotState() throws Exception; + + default void notifyCheckpointComplete(long checkpointId) throws Exception { + } + + void close(); + + interface Context { + + boolean isRestored(); + + /** + * Return the state to the split coordinator, for the exactly-once. + */ + StateT getRestoreState(); + + /** + * Return total parallelism of the source reader. + */ + int totalParallelism(); + + /** + * When Source reader started, it will be registered itself to coordinator. + */ + Set registeredReaders(); + + /** + * Assign splits to reader. + */ + void assignSplit(int subtaskId, List splits); + + /** + * Mainly use in boundedness situation, represents there will no more split will send to source reader. + */ + void signalNoMoreSplits(int subtask); + + /** + * If split coordinator have any event want to send source reader, use this method. + * Like send Pause event to Source Reader in CDC2.0. + */ + void sendEventToSourceReader(int subtaskId, SourceEvent event); + + /** + * Schedule to run the callable and handler, often used in un-boundedness mode. + */ + void runAsync(Callable callable, + BiConsumer handler, + int initialDelay, + long interval); + + /** + * Just run callable and handler once, often used in boundedness mode. 
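+     * For example, a coordinator can discover all of its splits once at start-up, as the RocketMQ example below does when periodic discovery is disabled.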
+ */ + void runAsyncOnce(Callable callable, + BiConsumer handler); + } +} +``` + +### 构造方法 + +开发者在构造方法中一般主要进行一些配置的设置和分片信息存储的容器的创建。 + +以ClickhouseSourceSplitCoordinator的构造为例: + +```Java +public ClickhouseSourceSplitCoordinator(SourceSplitCoordinator.Context context, + BitSailConfiguration jobConf) { + this.context = context; + this.jobConf = jobConf; + this.splitAssignmentPlan = Maps.newConcurrentMap(); +} +``` + +在自定义了State的场景中,需要对checkpoint时存储在`SourceSplitCoordinator.Context`的状态进行保存和恢复。 + +以RocketMQSourceSplitCoordinator为例: + +```Java +public RocketMQSourceSplitCoordinator( + SourceSplitCoordinator.Context context, + BitSailConfiguration jobConfiguration, + Boundedness boundedness) { + this.context = context; + this.jobConfiguration = jobConfiguration; + this.boundedness = boundedness; + this.discoveryInternal = jobConfiguration.get(RocketMQSourceOptions.DISCOVERY_INTERNAL); + this.pendingRocketMQSplitAssignment = Maps.newConcurrentMap(); + + this.discoveredPartitions = new HashSet<>(); + if (context.isRestored()) { + RocketMQState restoreState = context.getRestoreState(); + assignedPartitions = restoreState.getAssignedWithSplits(); + discoveredPartitions.addAll(assignedPartitions.keySet()); + } else { + assignedPartitions = Maps.newHashMap(); + } + + prepareConsumerProperties(); +} +``` + +### start方法 + +进行一些数据源所需分片元数据的提取工作,如果有抽象出来的Split Assigner类,一般在这里进行初始化。如果使用的是封装的Split Assign函数,这里会进行待分配切片的初始化工作。 + +#### 流批一体场景 + +以RocketMQSourceSplitCoordinator为例: + +```Java +private void prepareRocketMQConsumer() { + try { + consumer = RocketMQUtils.prepareRocketMQConsumer(jobConfiguration, + String.format(COORDINATOR_INSTANCE_NAME_TEMPLATE, + cluster, topic, consumerGroup, UUID.randomUUID())); + consumer.start(); + } catch (Exception e) { + throw BitSailException.asBitSailException(RocketMQErrorCode.CONSUMER_CREATE_FAILED, e); + } +} + +@Override +public void start() { + prepareRocketMQConsumer(); + splitAssigner = new FairRocketMQSplitAssigner(jobConfiguration, assignedPartitions); + if (discoveryInternal > 0) { + context.runAsync( + this::fetchMessageQueues, + this::handleMessageQueueChanged, + 0, + discoveryInternal + ); + } else { + context.runAsyncOnce( + this::fetchMessageQueues, + this::handleMessageQueueChanged + ); + } +} +``` + +#### 批式场景 + +以ClickhouseSourceSplitCoordinator为例: + +```Java +public void start() { + List splitList; + try { + SimpleDivideSplitConstructor constructor = new SimpleDivideSplitConstructor(jobConf); + splitList = constructor.construct(); + } catch (IOException e) { + ClickhouseSourceSplit split = new ClickhouseSourceSplit(0); + split.setReadTable(true); + splitList = Collections.singletonList(split); + LOG.error("Failed to construct splits, will directly read the table.", e); + } + + int readerNum = context.totalParallelism(); + LOG.info("Found {} readers and {} splits.", readerNum, splitList.size()); + if (readerNum > splitList.size()) { + LOG.error("Reader number {} is larger than split number {}.", readerNum, splitList.size()); + } + + for (ClickhouseSourceSplit split : splitList) { + int readerIndex = ReaderSelector.getReaderIndex(readerNum); + splitAssignmentPlan.computeIfAbsent(readerIndex, k -> new HashSet<>()).add(split); + LOG.info("Will assign split {} to the {}-th reader", split.uniqSplitId(), readerIndex); + } +} +``` + +### Assigner + +将划分好的切片分配给Reader,开发过程中,我们通常让SourceSplitCoordinator专注于处理和Reader 的通讯工作,实际split的分发逻辑一般封装在Assigner进行,这个Assigner可以是一个封装的Split Assign函数,也可以是一个抽象出来的Split Assigner类。 + +#### Assign函数示例 + +以ClickhouseSourceSplitCoordinator为例: + 
+tryAssignSplitsToReader函数将存储在splitAssignmentPlan中的划分好的切片分配给相应的Reader。 + +```Java +private void tryAssignSplitsToReader() { + Map> splitsToAssign = new HashMap<>(); + + for (Integer readerIndex : splitAssignmentPlan.keySet()) { + if (CollectionUtils.isNotEmpty(splitAssignmentPlan.get(readerIndex)) && context.registeredReaders().contains(readerIndex)) { + splitsToAssign.put(readerIndex, Lists.newArrayList(splitAssignmentPlan.get(readerIndex))); + } + } + + for (Integer readerIndex : splitsToAssign.keySet()) { + LOG.info("Try assigning splits reader {}, splits are: [{}]", readerIndex, + splitsToAssign.get(readerIndex).stream().map(ClickhouseSourceSplit::uniqSplitId).collect(Collectors.toList())); + splitAssignmentPlan.remove(readerIndex); + context.assignSplit(readerIndex, splitsToAssign.get(readerIndex)); + context.signalNoMoreSplits(readerIndex); + LOG.info("Finish assigning splits reader {}", readerIndex); + } +} +``` + +#### Assigner方法示例 + +以RocketMQSourceSplitCoordinator为例: + +```Java +public class FairRocketMQSplitAssigner implements SplitAssigner { + + private BitSailConfiguration readerConfiguration; + + private AtomicInteger atomicInteger; + + public Map rocketMQSplitIncrementMapping; + + public FairRocketMQSplitAssigner(BitSailConfiguration readerConfiguration, + Map rocketMQSplitIncrementMapping) { + this.readerConfiguration = readerConfiguration; + this.rocketMQSplitIncrementMapping = rocketMQSplitIncrementMapping; + this.atomicInteger = new AtomicInteger(CollectionUtils + .size(rocketMQSplitIncrementMapping.keySet())); + } + + @Override + public String assignSplitId(MessageQueue messageQueue) { + if (!rocketMQSplitIncrementMapping.containsKey(messageQueue)) { + rocketMQSplitIncrementMapping.put(messageQueue, String.valueOf(atomicInteger.getAndIncrement())); + } + return rocketMQSplitIncrementMapping.get(messageQueue); + } + + @Override + public int assignToReader(String splitId, int totalParallelism) { + return splitId.hashCode() % totalParallelism; + } +} +``` + +### addReader方法 + +调用Assigner,为Reader添加切片。 + +#### 批式场景示例 + +以ClickhouseSourceSplitCoordinator为例: + +```Java +public void addReader(int subtaskId) { + LOG.info("Found reader {}", subtaskId); + tryAssignSplitsToReader(); +} +``` + +#### 流批一体场景示例 + +以RocketMQSourceSplitCoordinator为例: + +```Java +private void notifyReaderAssignmentResult() { + Map> tmpRocketMQSplitAssignments = new HashMap<>(); + + for (Integer pendingAssignmentReader : pendingRocketMQSplitAssignment.keySet()) { + + if (CollectionUtils.isNotEmpty(pendingRocketMQSplitAssignment.get(pendingAssignmentReader)) + && context.registeredReaders().contains(pendingAssignmentReader)) { + + tmpRocketMQSplitAssignments.put(pendingAssignmentReader, Lists.newArrayList(pendingRocketMQSplitAssignment.get(pendingAssignmentReader))); + } + } + + for (Integer pendingAssignmentReader : tmpRocketMQSplitAssignments.keySet()) { + + LOG.info("Assigning splits to reader {}, splits = {}.", pendingAssignmentReader, + tmpRocketMQSplitAssignments.get(pendingAssignmentReader)); + + context.assignSplit(pendingAssignmentReader, + tmpRocketMQSplitAssignments.get(pendingAssignmentReader)); + Set removes = pendingRocketMQSplitAssignment.remove(pendingAssignmentReader); + removes.forEach(removeSplit -> { + assignedPartitions.put(removeSplit.getMessageQueue(), removeSplit.getSplitId()); + }); + + LOG.info("Assigned splits to reader {}", pendingAssignmentReader); + + if (Boundedness.BOUNDEDNESS == boundedness) { + LOG.info("Signal reader {} no more splits assigned in future.", 
pendingAssignmentReader); + context.signalNoMoreSplits(pendingAssignmentReader); + } + } +} + +@Override +public void addReader(int subtaskId) { + LOG.info( + "Adding reader {} to RocketMQ Split Coordinator for consumer group {}.", + subtaskId, + consumerGroup); + notifyReaderAssignmentResult(); +} +``` + +### addSplitsBack方法 + +对于一些Reader没有处理完的切片,进行重新分配,重新分配的策略可以自己定义,常用的策略是哈希取模,对于返回的Split列表中的所有Split进行重新分配后再Assign给不同的Reader。 + +#### 批式场景示例 + +以ClickhouseSourceSplitCoordinator为例: + +ReaderSelector使用哈希取模的策略对Split列表进行重分配。 + +tryAssignSplitsToReader方法将重分配后的Split集合通过Assigner分配给Reader。 + +```Java +public void addSplitsBack(List splits, int subtaskId) { + LOG.info("Source reader {} return splits {}.", subtaskId, splits); + + int readerNum = context.totalParallelism(); + for (ClickhouseSourceSplit split : splits) { + int readerIndex = ReaderSelector.getReaderIndex(readerNum); + splitAssignmentPlan.computeIfAbsent(readerIndex, k -> new HashSet<>()).add(split); + LOG.info("Re-assign split {} to the {}-th reader.", split.uniqSplitId(), readerIndex); + } + + tryAssignSplitsToReader(); +} +``` + +#### 流批一体场景示例 + +以RocketMQSourceSplitCoordinator为例: + +addSplitChangeToPendingAssignment使用哈希取模的策略对Split列表进行重分配。 + +notifyReaderAssignmentResult将重分配后的Split集合通过Assigner分配给Reader。 + +```Java +private synchronized void addSplitChangeToPendingAssignment(Set newRocketMQSplits) { + int numReader = context.totalParallelism(); + for (RocketMQSplit split : newRocketMQSplits) { + int readerIndex = splitAssigner.assignToReader(split.getSplitId(), numReader); + pendingRocketMQSplitAssignment.computeIfAbsent(readerIndex, r -> new HashSet<>()) + .add(split); + } + LOG.debug("RocketMQ splits {} finished assignment.", newRocketMQSplits); +} + +@Override +public void addSplitsBack(List splits, int subtaskId) { + LOG.info("Source reader {} return splits {}.", subtaskId, splits); + addSplitChangeToPendingAssignment(new HashSet<>(splits)); + notifyReaderAssignmentResult(); +} +``` + +### snapshotState方法 + +存储处理切片的快照信息,用于恢复时在构造方法中使用。 + +```Java +public RocketMQState snapshotState() throws Exception { + return new RocketMQState(assignedPartitions); +} +``` + +### close方法 + +关闭在分片过程中与数据源交互读取元数据信息的所有未关闭连接器。 + +```Java +public void close() { + if (consumer != null) { + consumer.shutdown(); + } +} +``` + +## SourceReader + +每个SourceReader都在独立的线程中执行,只要我们保证SourceSplitCoordinator分配给不同SourceReader的切片没有交集,在SourceReader的执行周期中,我们就可以不考虑任何有关并发的细节。 + +![](../../images/community/source_connector/source_reader_diagram.png) + +### SourceReader接口 + +```Java +public interface SourceReader extends Serializable, AutoCloseable { + + void start(); + + void pollNext(SourcePipeline pipeline) throws Exception; + + void addSplits(List splits); + + /** + * Check source reader has more elements or not. + */ + boolean hasMoreElements(); + + /** + * There will no more split will send to this source reader. + * Source reader could be exited after process all assigned split. + */ + default void notifyNoMoreSplits() { + + } + + /** + * Process all events which from {@link SourceSplitCoordinator}. + */ + default void handleSourceEvent(SourceEvent sourceEvent) { + } + + /** + * Store the split to the external system to recover when task failed. + */ + List snapshotState(long checkpointId); + + /** + * When all tasks finished snapshot, notify checkpoint complete will be invoked. 
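+   * For example, a reader that commits offsets in checkpoint (rather than in pollNext) can commit the consumed offsets at this point.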
+ */ + default void notifyCheckpointComplete(long checkpointId) throws Exception { + + } + + interface Context { + + TypeInfo[] getTypeInfos(); + + String[] getFieldNames(); + + int getIndexOfSubtask(); + + void sendSplitRequest(); + } +} +``` + +### 构造方法 + +这里需要完成和数据源访问各种配置的提取,比如数据库库名表名、消息队列cluster和topic、身份认证的配置等等。 + +#### 示例 + +```Java +public RocketMQSourceReader(BitSailConfiguration readerConfiguration, + Context context, + Boundedness boundedness) { + this.readerConfiguration = readerConfiguration; + this.boundedness = boundedness; + this.context = context; + this.assignedRocketMQSplits = Sets.newHashSet(); + this.finishedRocketMQSplits = Sets.newHashSet(); + this.deserializationSchema = new RocketMQDeserializationSchema( + readerConfiguration, + context.getTypeInfos(), + context.getFieldNames()); + this.noMoreSplits = false; + + cluster = readerConfiguration.get(RocketMQSourceOptions.CLUSTER); + topic = readerConfiguration.get(RocketMQSourceOptions.TOPIC); + consumerGroup = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_GROUP); + consumerTag = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_TAG); + pollBatchSize = readerConfiguration.get(RocketMQSourceOptions.POLL_BATCH_SIZE); + pollTimeout = readerConfiguration.get(RocketMQSourceOptions.POLL_TIMEOUT); + commitInCheckpoint = readerConfiguration.get(RocketMQSourceOptions.COMMIT_IN_CHECKPOINT); + accessKey = readerConfiguration.get(RocketMQSourceOptions.ACCESS_KEY); + secretKey = readerConfiguration.get(RocketMQSourceOptions.SECRET_KEY); +} +``` + +### start方法 + +初始化数据源的访问对象,例如数据库的执行对象、消息队列的consumer对象或者文件系统的连接。 + +#### 示例 + +消息队列 + +```Java +public void start() { + try { + if (StringUtils.isNotEmpty(accessKey) && StringUtils.isNotEmpty(secretKey)) { + AclClientRPCHook aclClientRPCHook = new AclClientRPCHook( + new SessionCredentials(accessKey, secretKey)); + consumer = new DefaultMQPullConsumer(aclClientRPCHook); + } else { + consumer = new DefaultMQPullConsumer(); + } + + consumer.setConsumerGroup(consumerGroup); + consumer.setNamesrvAddr(cluster); + consumer.setInstanceName(String.format(SOURCE_READER_INSTANCE_NAME_TEMPLATE, + cluster, topic, consumerGroup, UUID.randomUUID())); + consumer.setConsumerPullTimeoutMillis(pollTimeout); + consumer.start(); + } catch (Exception e) { + throw BitSailException.asBitSailException(RocketMQErrorCode.CONSUMER_CREATE_FAILED, e); + } +} +``` + +数据库 + +```Java +public void start() { + this.connection = connectionHolder.connect(); + + // Construct statement. 
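+  // The query keeps '?' placeholders for the split range, so each ClickhouseSourceSplit can later bind its own [lower, upper] bounds (see ClickhouseSourceSplit#decorateStatement).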
+  String baseSql = ClickhouseJdbcUtils.getQuerySql(dbName, tableName, columnInfos);
+  String querySql = ClickhouseJdbcUtils.decorateSql(baseSql, splitField, filterSql, maxFetchCount, true);
+  try {
+    this.statement = connection.prepareStatement(querySql);
+  } catch (SQLException e) {
+    throw new RuntimeException("Failed to prepare statement.", e);
+  }
+
+  LOG.info("Task {} started.", subTaskId);
+}
+```
+
+FTP
+
+```Java
+public void start() {
+  this.ftpHandler.loginFtpServer();
+  if (this.ftpHandler.getFtpConfig().getSkipFirstLine()) {
+    this.skipFirstLine = true;
+  }
+}
+```
+
+### addSplits方法
+
+将SourceSplitCoordinator给当前Reader分配的Splits列表添加到自己的处理队列(Queue)或者集合(Set)中。
+
+#### 示例
+
+```Java
+public void addSplits(List<RocketMQSplit> splits) {
+  LOG.info("Subtask {} received {}(s) new splits, splits = {}.",
+      context.getIndexOfSubtask(),
+      CollectionUtils.size(splits),
+      splits);
+
+  assignedRocketMQSplits.addAll(splits);
+}
+```
+
+### hasMoreElements方法
+
+在无界的流计算场景中,会一直返回true,保证Reader线程不被销毁。
+
+在批式场景中,分配给该Reader的切片处理完之后会返回false,表示该Reader生命周期的结束。
+
+```Java
+public boolean hasMoreElements() {
+  if (boundedness == Boundedness.UNBOUNDEDNESS) {
+    return true;
+  }
+  if (noMoreSplits) {
+    return CollectionUtils.size(assignedRocketMQSplits) != 0;
+  }
+  return true;
+}
+```
+
+### pollNext方法
+
+当addSplits方法把切片添加到处理队列、且hasMoreElements返回true时,该方法才会被调用,开发者在此方法中实现与数据源的真正交互。
+
+开发者在实现pollNext方法的时候需要关注下列问题:
+
+- 切片数据的读取
+  - 从构造好的切片中去读取数据。
+- 数据类型的转换
+  - 将外部数据转换成BitSail的Row类型。
+
+#### 示例
+
+以RocketMQSourceReader为例:
+
+从split队列中选取split进行处理,读取其信息,之后需要将读取到的信息转换成BitSail的Row类型,发送给下游处理。
+
+```Java
+public void pollNext(SourcePipeline<Row> pipeline) throws Exception {
+  for (RocketMQSplit rocketmqSplit : assignedRocketMQSplits) {
+    MessageQueue messageQueue = rocketmqSplit.getMessageQueue();
+    PullResult pullResult = consumer.pull(rocketmqSplit.getMessageQueue(),
+        consumerTag,
+        rocketmqSplit.getStartOffset(),
+        pollBatchSize,
+        pollTimeout);
+
+    if (Objects.isNull(pullResult) || CollectionUtils.isEmpty(pullResult.getMsgFoundList())) {
+      continue;
+    }
+
+    for (MessageExt message : pullResult.getMsgFoundList()) {
+      Row deserialize = deserializationSchema.deserialize(message.getBody());
+      pipeline.output(deserialize);
+      if (rocketmqSplit.getStartOffset() >= rocketmqSplit.getEndOffset()) {
+        LOG.info("Subtask {} rocketmq split {} in end of stream.",
+            context.getIndexOfSubtask(),
+            rocketmqSplit);
+        finishedRocketMQSplits.add(rocketmqSplit);
+        break;
+      }
+    }
+    rocketmqSplit.setStartOffset(pullResult.getNextBeginOffset());
+    if (!commitInCheckpoint) {
+      consumer.updateConsumeOffset(messageQueue, pullResult.getMaxOffset());
+    }
+  }
+  assignedRocketMQSplits.removeAll(finishedRocketMQSplits);
+}
+```
+
+#### 转换为BitSail Row类型的常用方式
+
+##### 自定义RowDeserializer类
+
+对不同格式的列应用不同的converter,并设置到Row中相应的Field。
+
+```Java
+public class ClickhouseRowDeserializer {
+
+  interface FiledConverter {
+    Object apply(ResultSet resultSet) throws SQLException;
+  }
+
+  private final List<FiledConverter> converters;
+  private final int fieldSize;
+
+  public ClickhouseRowDeserializer(TypeInfo<?>[] typeInfos) {
+    this.fieldSize = typeInfos.length;
+    this.converters = new ArrayList<>();
+    for (int i = 0; i < fieldSize; ++i) {
+      converters.add(initFieldConverter(i + 1, typeInfos[i]));
+    }
+  }
+
+  public Row convert(ResultSet resultSet) {
+    Row row = new Row(fieldSize);
+    try {
+      for (int i = 0; i < fieldSize; ++i) {
+        row.setField(i, converters.get(i).apply(resultSet));
+      }
+    } catch (SQLException e) {
+      throw BitSailException.asBitSailException(ClickhouseErrorCode.CONVERT_ERROR, e.getCause());
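+      // A failure on any column aborts the whole record via the exception above; the framework's dirty-data handling can then take over.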
+    }
+    return row;
+  }
+
+  private FiledConverter initFieldConverter(int index, TypeInfo<?> typeInfo) {
+    if (!(typeInfo instanceof BasicTypeInfo)) {
+      throw BitSailException.asBitSailException(CommonErrorCode.UNSUPPORTED_COLUMN_TYPE, typeInfo.getTypeClass().getName() + " is not supported yet.");
+    }
+
+    Class<?> curClass = typeInfo.getTypeClass();
+    if (TypeInfos.BYTE_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getByte(index);
+    }
+    if (TypeInfos.SHORT_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getShort(index);
+    }
+    if (TypeInfos.INT_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getInt(index);
+    }
+    if (TypeInfos.LONG_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getLong(index);
+    }
+    if (TypeInfos.BIG_INTEGER_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> {
+        BigDecimal dec = resultSet.getBigDecimal(index);
+        return dec == null ? null : dec.toBigInteger();
+      };
+    }
+    if (TypeInfos.FLOAT_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getFloat(index);
+    }
+    if (TypeInfos.DOUBLE_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getDouble(index);
+    }
+    if (TypeInfos.BIG_DECIMAL_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getBigDecimal(index);
+    }
+    if (TypeInfos.STRING_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getString(index);
+    }
+    if (TypeInfos.SQL_DATE_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getDate(index);
+    }
+    if (TypeInfos.SQL_TIMESTAMP_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getTimestamp(index);
+    }
+    if (TypeInfos.SQL_TIME_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getTime(index);
+    }
+    if (TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> resultSet.getBoolean(index);
+    }
+    if (TypeInfos.VOID_TYPE_INFO.getTypeClass() == curClass) {
+      return resultSet -> null;
+    }
+    throw new UnsupportedOperationException("Unsupported data type: " + typeInfo);
+  }
+}
+```
+
+##### 实现DeserializationSchema接口
+
+相对于实现RowDeserializer,我们更希望大家实现一个DeserializationSchema接口的实现类,把某种确定格式的数据(比如JSON、CSV)转换为BitSail的Row类型。
+
+![](../../images/community/source_connector/deserialization_schema_diagram.png)
+
+在具体的应用时,我们可以使用统一的接口创建相应的实现类:
+
+```Java
+public class TextInputFormatDeserializationSchema implements DeserializationSchema<Writable, Row> {
+
+  private BitSailConfiguration deserializationConfiguration;
+
+  private TypeInfo<?>[] typeInfos;
+
+  private String[] fieldNames;
+
+  private transient DeserializationSchema<byte[], Row> deserializationSchema;
+
+  public TextInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration,
+                                              TypeInfo<?>[] typeInfos,
+                                              String[] fieldNames) {
+    this.deserializationConfiguration = deserializationConfiguration;
+    this.typeInfos = typeInfos;
+    this.fieldNames = fieldNames;
+    ContentType contentType = ContentType.valueOf(
+        deserializationConfiguration.getNecessaryOption(HadoopReaderOptions.CONTENT_TYPE, HadoopErrorCode.REQUIRED_VALUE).toUpperCase());
+    switch (contentType) {
+      case CSV:
+        this.deserializationSchema =
+            new CsvDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames);
+        break;
+      case JSON:
+        this.deserializationSchema =
+            new JsonDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames);
+        break;
+      default:
+        throw BitSailException.asBitSailException(HadoopErrorCode.UNSUPPORTED_ENCODING, "unsupported parser type: " + contentType);
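+        // Only the CSV and JSON content types are wired up above; any other configured type surfaces as an unsupported-encoding error.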
+    }
+  }
+
+  @Override
+  public Row deserialize(Writable message) {
+    return deserializationSchema.deserialize((message.toString()).getBytes());
+  }
+
+  @Override
+  public boolean isEndOfStream(Row nextElement) {
+    return false;
+  }
+}
+```
+
+也可以为当前需要解析的类自定义专用的DeserializationSchema:
+
+```Java
+public class MapredParquetInputFormatDeserializationSchema implements DeserializationSchema<Writable, Row> {
+
+  private final BitSailConfiguration deserializationConfiguration;
+
+  private final transient DateTimeFormatter localDateTimeFormatter;
+  private final transient DateTimeFormatter localDateFormatter;
+  private final transient DateTimeFormatter localTimeFormatter;
+  private final int fieldSize;
+  private final TypeInfo<?>[] typeInfos;
+  private final String[] fieldNames;
+  private final List<DeserializationConverter> converters;
+
+  public MapredParquetInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration,
+                                                       TypeInfo<?>[] typeInfos,
+                                                       String[] fieldNames) {
+
+    this.deserializationConfiguration = deserializationConfiguration;
+    this.typeInfos = typeInfos;
+    this.fieldNames = fieldNames;
+    this.localDateTimeFormatter = DateTimeFormatter.ofPattern(
+        deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_TIME_PATTERN));
+    this.localDateFormatter = DateTimeFormatter
+        .ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_PATTERN));
+    this.localTimeFormatter = DateTimeFormatter
+        .ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.TIME_PATTERN));
+    this.fieldSize = typeInfos.length;
+    this.converters = Arrays.stream(typeInfos).map(this::createTypeInfoConverter).collect(Collectors.toList());
+  }
+
+  @Override
+  public Row deserialize(Writable message) {
+    int arity = fieldNames.length;
+    Row row = new Row(arity);
+    Writable[] writables = ((ArrayWritable) message).get();
+    for (int i = 0; i < fieldSize; ++i) {
+      row.setField(i, converters.get(i).convert(writables[i].toString()));
+    }
+    return row;
+  }
+
+  @Override
+  public boolean isEndOfStream(Row nextElement) {
+    return false;
+  }
+
+  private interface DeserializationConverter extends Serializable {
+    Object convert(String input);
+  }
+
+  private DeserializationConverter createTypeInfoConverter(TypeInfo<?> typeInfo) {
+    Class<?> typeClass = typeInfo.getTypeClass();
+
+    if (typeClass == TypeInfos.VOID_TYPE_INFO.getTypeClass()) {
+      return field -> null;
+    }
+    if (typeClass == TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass()) {
+      return this::convertToBoolean;
+    }
+    if (typeClass == TypeInfos.INT_TYPE_INFO.getTypeClass()) {
+      return this::convertToInt;
+    }
+    throw BitSailException.asBitSailException(CsvFormatErrorCode.CSV_FORMAT_COVERT_FAILED,
+        String.format("Csv format converter not support type info: %s.", typeInfo));
+  }
+
+  private boolean convertToBoolean(String field) {
+    return Boolean.parseBoolean(field.trim());
+  }
+
+  private int convertToInt(String field) {
+    return Integer.parseInt(field.trim());
+  }
+}
+```
+
+### snapshotState方法
+
+生成并保存State的快照信息,用于checkpoint。
+
+#### 示例
+
+```Java
+public List<RocketMQSplit> snapshotState(long checkpointId) {
+  LOG.info("Subtask {} start snapshotting for checkpoint id = {}.", context.getIndexOfSubtask(), checkpointId);
+  if (commitInCheckpoint) {
+    for (RocketMQSplit rocketMQSplit : assignedRocketMQSplits) {
+      try {
+        consumer.updateConsumeOffset(rocketMQSplit.getMessageQueue(), rocketMQSplit.getStartOffset());
+        LOG.debug("Subtask {} committed message queue = {} in checkpoint id = {}.", context.getIndexOfSubtask(),
+            rocketMQSplit.getMessageQueue(),
+            checkpointId);
+      } catch (MQClientException e) {
+        throw new RuntimeException(e);
+      }
+    }
+  }
+  return Lists.newArrayList(assignedRocketMQSplits);
+}
+```
+
+### hasMoreElements方法
+
+每次调用pollNext方法之前会做sourceReader.hasMoreElements()的判断,当且仅当判断通过,pollNext方法才会被调用。
+
+#### 示例
+
+```Java
+public boolean hasMoreElements() {
+  if (noMoreSplits) {
+    return CollectionUtils.size(assignedHadoopSplits) != 0;
+  }
+  return true;
+}
+```
+
+### notifyNoMoreSplits方法
+
+当Reader处理完所有切片之后,会调用此方法。
+
+#### 示例
+
+```Java
+public void notifyNoMoreSplits() {
+  LOG.info("Subtask {} received no more split signal.", context.getIndexOfSubtask());
+  noMoreSplits = true;
+}
+```
diff --git a/website/zh/community/team.md b/website/zh/community/team.md
index 47005a562..e55da1967 100644
--- a/website/zh/community/team.md
+++ b/website/zh/community/team.md
@@ -1,8 +1,13 @@
 ---
-order: 4
+order: 6
 ---
+
 # Team
 
+[English](../../en/community/team.md) | 简体中文
+
+-----
+
 ## 贡献者
diff --git a/website/zh/documents/start/README.md b/website/zh/documents/start/README.md
index 1503acd2d..96af051c0 100644
--- a/website/zh/documents/start/README.md
+++ b/website/zh/documents/start/README.md
@@ -8,4 +8,5 @@ dir:
 
   - [开发环境配置](env_setup.md)
   - [部署指南](deployment.md)
-  - [任务配置指南](config.md)
\ No newline at end of file
+  - [任务配置指南](config.md)
+  - [BitSail 实机演示](quick_guide.md)
\ No newline at end of file
diff --git a/website/zh/documents/start/config.md b/website/zh/documents/start/config.md
index b9f40458e..13fc426ac 100644
--- a/website/zh/documents/start/config.md
+++ b/website/zh/documents/start/config.md
@@ -1,6 +1,12 @@
+---
+order: 3
+---
+
 # 任务配置说明
 
-[English](../../../en/documents/start/config.md) | 简体中文
+[English](../../../en/documents/start/config.md) | 简体中文
+
+-----
 
 ***BitSail*** 完整配置脚本是由一个 JSON 组成的,完整示意结构如下所示:
@@ -169,4 +175,4 @@
 | class | TRUE | - | 标识使用的connector的 class 名称 | com.bytedance.bitsail.connector.legacy.hive.sink.HiveParquetOutputFormat |
 | writer_parallelism_num | FALSE | - | 指定该Writer的并行度,默认情况下数据引擎会按照其实现逻辑计算得到一个并行度。 | 2 |
 
-其他参数详情:参考具体 [connector](../connectors/README.md) 实现参数
\ No newline at end of file
+其他参数详情:参考具体 [connector](../connectors/README.md) 实现参数
diff --git a/website/zh/documents/start/deployment.md b/website/zh/documents/start/deployment.md
index 5b18da9e1..161157988 100644
--- a/website/zh/documents/start/deployment.md
+++ b/website/zh/documents/start/deployment.md
@@ -1,7 +1,13 @@
+---
+order: 1
+---
+
 # 部署指南
 
 [English](../../../en/documents/start/deployment.md) | 简体中文
 
+-----
+
 > 目前 BitSail 仅支持本地和Yarn上部署。
 > 其他平台的部署(例如原生kubernetes)将在不久后支持。
@@ -175,4 +181,3 @@ bash bin/bitsail run \
   --conf examples/Fake_Hive_Example.json \
   --jm-address <job-manager-address>
 ```
-
diff --git a/website/zh/documents/start/env_setup.md b/website/zh/documents/start/env_setup.md
index 2b813a7ac..18367ae1e 100644
--- a/website/zh/documents/start/env_setup.md
+++ b/website/zh/documents/start/env_setup.md
@@ -1,3 +1,7 @@
+---
+order: 2
+---
+
 # 开发环境配置
 
 [English](../../../en/documents/start/env_setup.md) | 简体中文
@@ -71,4 +75,4 @@ public class KafkaSourceITCase {
   // ...
 }
-```
\ No newline at end of file
+```
diff --git a/website/zh/documents/start/quick_guide.md b/website/zh/documents/start/quick_guide.md
new file mode 100644
index 000000000..7c81dbff9
--- /dev/null
+++ b/website/zh/documents/start/quick_guide.md
@@ -0,0 +1,105 @@
+---
+order: 4
+---
+
+# BitSail 实机演示
+
+[English](../../../en/documents/start/quick_guide.md) | 简体中文
+
+-----
+
+## BitSail演示视频
+
+[BitSail实机演示](https://zhuanlan.zhihu.com/p/595157599)
+
+## BitSail源码编译
+
+BitSail在项目中内置了编译脚本build.sh,存放在项目根目录中。新下载项目的用户可以直接运行该脚本进行编译,编译成功后可以在目录 bitsail-dist/target/bitsail-dist-${revision}-bin 中找到相应的产物。
+
+![](../../../images/documents/start/quick_guide/source_code_structure.png)
+
+## BitSail产物结构
+
+![](../../../images/documents/start/quick_guide/compile_product_structure.png)
+
+![](../../../images/documents/start/quick_guide/product_structure.png)
+
+## BitSail如何提交作业
+
+### Flink Session Job
+
+```Shell
+# 第一步:启动 Flink Session 集群
+# Session 运行要求本地环境存在 Hadoop 依赖,并且已经设置 HADOOP_CLASSPATH 环境变量。
+bash ./embedded/flink/bin/start-cluster.sh
+
+# 第二步:提交作业到 Flink Session 集群
+bash bin/bitsail run \
+  --engine flink \
+  --execution-mode run \
+  --deployment-mode local \
+  --conf examples/Fake_Print_Example.json \
+  --jm-address <job-manager-address>
+```
+
+### Yarn Cluster Job
+
+```Shell
+# 第一步:设置 HADOOP_HOME 环境变量
+export HADOOP_HOME=XXX
+
+# 第二步:设置好 HADOOP_HOME 后,提交客户端即可找到 Yarn 集群的配置路径,然后就可以提交作业到 Yarn 集群
+bash ./bin/bitsail run --engine flink \
+--conf ~/dts_example/examples/Hive_Print_Example.json \
+--execution-mode run \
+--deployment-mode yarn-per-job \
+--queue default
+```
+
+## BitSail 实机演示
+
+### Fake->MySQL
+
+```Sql
+-- 创建 MySQL 表
+CREATE TABLE `bitsail_fake_source` (
+  `id` bigint(20) NOT NULL AUTO_INCREMENT,
+  `name` varchar(255) DEFAULT NULL,
+  `price` double DEFAULT NULL,
+  `image` blob,
+  `start_time` datetime DEFAULT NULL,
+  `end_time` datetime DEFAULT NULL,
+  `order_id` bigint(20) DEFAULT NULL,
+  `enabled` tinyint(4) DEFAULT NULL,
+  `datetime` int(11) DEFAULT NULL,
+  PRIMARY KEY (`id`)
+) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
+```
+
+### MySQL->Hive
+
+```Sql
+-- 创建 Hive 表
+CREATE TABLE `bitsail`.`bitsail_mysql_hive`(
+  `id` bigint,
+  `name` string,
+  `price` double,
+  `image` binary,
+  `start_time` timestamp,
+  `end_time` timestamp,
+  `order_id` bigint,
+  `enabled` int,
+  `datetime` int
+) PARTITIONED BY (`date` string)
+ROW FORMAT SERDE
+  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
+STORED AS INPUTFORMAT
+  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
+OUTPUTFORMAT
+  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
+```
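+
+### 提交演示任务(示意)
+
+建完上面的表之后,可以参照前文 “BitSail如何提交作业” 的方式,把对应的数据同步任务提交到集群。下面给出一个在本地 Flink Session 集群上提交的示意命令,其中的任务配置文件路径仅为占位符,实际请替换为你为 Fake->MySQL 或 MySQL->Hive 任务编写的 JSON 配置文件(字段映射等参数可参考[任务配置指南](config.md)):
+
+```Shell
+# 示意命令:<path-to-job-conf.json> 为占位符,请替换为实际的任务配置文件路径
+bash bin/bitsail run \
+  --engine flink \
+  --execution-mode run \
+  --deployment-mode local \
+  --conf <path-to-job-conf.json> \
+  --jm-address <job-manager-address>
+```
\ No newline at end of file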