Use HTML splitter for JSX code? #4187
🤖 Hi @codinginflow! Good to see you back. Hope your coding journey is flowing smoothly as ever. Let's dig into your issue.
Yes, using the HTML recursive text splitter for JSX code in LangChain is a reasonable approach. JSX is a syntax extension for JavaScript whose markup closely resembles HTML, so the HTML text splitter should work for JSX as well, even after you remove import statements and class names.
Here's the relevant splitter configuration:
const splitter = RecursiveCharacterTextSplitter.fromLanguage("html", {
  chunkSize: 175,
  chunkOverlap: 20,
});
This creates a RecursiveCharacterTextSplitter for HTML with a chunk size of 175 characters and a chunk overlap of 20 characters. If you're not seeing any overlap in your output, check the splitter configuration and the content of your text. One likely cause is that overlap is only carried over when several small splits are merged into one chunk, so pieces that are each longer than the 20-character overlap tend to produce chunks without shared text.
The RecursiveCharacterTextSplitter class handles JSX code by splitting it along HTML tags, using a list of separators that includes various HTML tags. When splitText is called, it picks the first separator in the list that occurs in the text and splits on it. Splits shorter than the chunk size are collected and merged into chunks. A split that is too long is passed back into the method with the remaining separators, until the pieces fit within the chunk size or no separators are left. The separators are matched as literal strings, so a tag only acts as a split point when it appears exactly as listed, for example "<div>" without attributes.
Here is the list of HTML tags used as separators:
"<body>",
"<div>",
"<p>",
"<br>",
"<li>",
"<h1>",
"<h2>",
"<h3>",
"<h4>",
"<h5>",
"<h6>",
"<span>",
"<table>",
"<tr>",
"<td>",
"<th>",
"<ul>",
"<ol>",
"<header>",
"<footer>",
"<nav>",
"<head>",
"<style>",
"<script>",
"<meta>",
"<title>",
" ",
""
And here is the relevant code that handles the splitting:
private async _splitText(text: string, separators: string[]) {
  const finalChunks: string[] = [];
  // Get appropriate separator to use
  let separator: string = separators[separators.length - 1];
  let newSeparators;
  for (let i = 0; i < separators.length; i += 1) {
    const s = separators[i];
    if (s === "") {
      separator = s;
      break;
    }
    if (text.includes(s)) {
      separator = s;
      newSeparators = separators.slice(i + 1);
      break;
    }
  }
  // Now that we have the separator, split the text
  const splits = this.splitOnSeparator(text, separator);
  // Now go merging things, recursively splitting longer texts.
  let goodSplits: string[] = [];
  const _separator = this.keepSeparator ? "" : separator;
  for (const s of splits) {
    if ((await this.lengthFunction(s)) < this.chunkSize) {
      goodSplits.push(s);
    } else {
      if (goodSplits.length) {
        const mergedText = await this.mergeSplits(goodSplits, _separator);
        finalChunks.push(...mergedText);
        goodSplits = [];
      }
      if (!newSeparators) {
        finalChunks.push(s);
      } else {
        const otherInfo = await this._splitText(s, newSeparators);
        finalChunks.push(...otherInfo);
      }
    }
  }
  if (goodSplits.length) {
    const mergedText = await this.mergeSplits(goodSplits, _separator);
    finalChunks.push(...mergedText);
  }
  return finalChunks;
}
You can find more details in the langchain/src/text_splitter.ts file. I hope this helps! If you have any other questions, feel free to ask.
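To make the separator-selection step in _splitText easier to follow, here is a minimal standalone sketch of just that loop. It is an illustration, not the library code: `pickSeparator` is a hypothetical helper, and `htmlSeparators` is a shortened separator list chosen for the example. It shows that matching is a literal substring check, so a JSX tag that carries attributes does not match "<div>".

```typescript
// Hypothetical helper that mirrors only the separator-selection loop
// quoted above. It is not part of LangChain.
function pickSeparator(text: string, separators: string[]): string {
  // Default to the last separator, as the quoted loop does.
  let separator = separators[separators.length - 1];
  for (const s of separators) {
    // The empty string always "matches" and ends the search.
    if (s === "" || text.includes(s)) {
      separator = s;
      break;
    }
  }
  return separator;
}

// Shortened separator list, for illustration only.
const htmlSeparators = ["<div>", "<p>", " ", ""];

const plainJsx = "<div><p>Hello world</p></div>";
const jsxWithAttributes = '<div className="card"><span>Hello world</span></div>';

// Prints "<div>": the literal tag occurs in the text.
console.log(JSON.stringify(pickSeparator(plainJsx, htmlSeparators)));
// Prints " ": neither "<div>" nor "<p>" occurs literally, so the
// search falls through to the space separator.
console.log(JSON.stringify(pickSeparator(jsxWithAttributes, htmlSeparators)));
```

This is one reason stripping class names before splitting can help: without the attribute, `<div className="card">` becomes `<div>` and matches a separator again.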
-
I am creating a chatbot over the data of my static React website.
I fetch the page files from the file system using the DirectoryLoader. I could use a web loader but I want it to work even in local development.
The issue is the text splitter.
I couldn't find a proper text splitter for JSX (React) code. But I seem to get decent results with the HTML recursive text splitter, probably because JSX and HTML are so similar.
Before I send my documents to the HTML splitter, I remove all import statements and class names (to get rid of the unnecessary clutter). I keep everything else (which might include some JavaScript code).
Is my approach fine? Is the HTML splitter suited for this use case? Is it normal that there is no text overlap in the generated documents?
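For reference, the preprocessing described above could look roughly like the sketch below. It is only an assumption about how such a cleanup step might be written: `stripClutter` is a hypothetical helper, and its regular expressions handle only single-line import statements and `className` attributes with quoted string values (not multi-line imports or `className={...}` expressions).

```typescript
// Hypothetical preprocessing helper. The regexes are simplifying
// assumptions and do not cover every JSX form.
function stripClutter(source: string): string {
  return source
    // Drop single-line import statements such as `import X from "y";`.
    .replace(/^[ \t]*import\s.*$/gm, "")
    // Drop className attributes that have a quoted string value.
    .replace(/\s+className=(?:"[^"]*"|'[^']*')/g, "")
    // Collapse the blank lines left behind by the removed imports.
    .replace(/\n{2,}/g, "\n")
    .trim();
}

// Small sample page, made up for this example.
const page = [
  'import React from "react";',
  'import "./styles.css";',
  "",
  "export default function Page() {",
  '  return <div className="container"><p className=\'intro\'>Hello</p></div>;',
  "}",
].join("\n");

const cleaned = stripClutter(page);
console.log(cleaned);
```

The cleaned string can then be passed to the splitter in place of the raw file contents.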