
Feature: Repository size restrictions #12

Open · 2 tasks done
Anush008 opened this issue Aug 1, 2023 · 6 comments

Labels: enhancement (New feature or request)

@Anush008
Member

Anush008 commented Aug 1, 2023

Type of feature

🍕 Feature

Current behavior

Presently, the GitHub repositories we index are not subject to any size restrictions, which can lead to exploits and create performance bottlenecks.

Suggested solution

To address this, we'll need to implement a cap on the repository size and the total file count before indexing the repository.
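
A minimal sketch of what that guard could look like (the limit values and function shape here are placeholders, not decided yet):

    // Placeholder limits -- actual values to be decided.
    const MAX_REPO_SIZE_KB: u64 = 100 * 1024; // e.g. 100 MB
    const MAX_FILE_COUNT: usize = 1000;

    /// Rejects a repository before indexing if it exceeds either cap.
    fn check_limits(repo_size_kb: u64, file_count: usize) -> Result<(), String> {
        if repo_size_kb > MAX_REPO_SIZE_KB {
            return Err(format!("repository too large: {repo_size_kb} KB"));
        }
        if file_count > MAX_FILE_COUNT {
            return Err(format!("too many files: {file_count}"));
        }
        Ok(())
    }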

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Contributing Docs

  • I agree to follow this project's Contribution Docs
@Anush008 Anush008 added the enhancement New feature or request label Aug 1, 2023
@Anush008 Anush008 self-assigned this Aug 1, 2023
@bdougie
Member

bdougie commented Aug 1, 2023

Could we instead only take the first 100 files or similar?

@Anush008
Member Author

Anush008 commented Aug 1, 2023

Right, that should be simpler. We can start with 1000 files; I've tried that number before and it worked fine.

@Anush008
Member Author

Anush008 commented Aug 1, 2023

Also, skip super-large files.

@Anush008
Member Author

Anush008 commented Aug 1, 2023

An issue with this: since we're getting the repository as a zip, we'd still need to fetch the whole repo and then use only the first 1000 files.
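
For reference, a sketch of that flow with the zip crate (the crate choice and the cutoff are assumptions; note we still download the full archive and only read the first entries):

    use std::io::{Cursor, Read};
    use zip::ZipArchive;

    const MAX_FILE_COUNT: usize = 1000;

    // Reads at most MAX_FILE_COUNT entries out of an already-downloaded archive.
    fn first_files(zip_bytes: Vec<u8>) -> zip::result::ZipResult<Vec<(String, String)>> {
        let mut archive = ZipArchive::new(Cursor::new(zip_bytes))?;
        let mut files = Vec::new();

        for i in 0..archive.len().min(MAX_FILE_COUNT) {
            let mut entry = archive.by_index(i)?;
            if entry.is_dir() {
                continue;
            }
            let mut content = String::new();
            // Skips entries that aren't valid UTF-8, like we already do elsewhere.
            if entry.read_to_string(&mut content).is_ok() {
                files.push((entry.name().to_string(), content));
            }
        }
        Ok(files)
    }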

@jpmcb
Member

jpmcb commented Aug 1, 2023

> To address this, we'll need to implement a cap on the repository size and the total file count before indexing the repository.

This is similar to a performance bottleneck I encountered when designing the pizza service. Here are a few things we could explore to get us going and achieve some scale for the repo-query service without having to reduce functionality.

  1. In general, we should try our best to design any service to be horizontally scalable. I.e., we should prefer deploying multiple instances over increasing memory/CPU power on a single monolith. As it's designed now, I believe that holds as long as qdrant supports multiple atomic reads/writes. If not, then that's a separate problem. We should look into deploying multiple instances of the repo-query service behind a load balancer to avoid the problem of a single instance being pounded during heavy traffic.
  2. An individual server could have a configurable pool of workers that block new requests until a free worker is available. You can get pretty fancy with this method and advertise the number of free workers to some separate orchestrator service, but for most use cases, having multiple instances behind a load balancer that round-robins requests is pretty good and avoids the same instance getting many requests at the same time.
  3. Is there a way to query the size of the zip ahead of downloading it? Inspecting the size ahead of getting the entire thing would probably help us avoid malicious users trying to drop a huge zip file on our service that we can't expand. We could also add a configuration for the size that should be rejected outright if it's too big.
  4. Skipping large files would help during batch processing, but how would we define "super-large files"? Some big projects have rather large monolithic code files in their codebases and it'd be a bummer to skip those. I think it may be more practical to skip certain types of files (like jpegs, mp4s, pdfs, etc.).
  5. The number of files to be processed on the service doesn't sound like a performance bottleneck. The many round trips to the embedding model do. I.e., right now, each file is processed individually and sent to be embedded individually:
    let file_embeddings: Vec<FileEmbeddings> = files
        .into_par_iter()
        .filter_map(|file| {
            let embed_content = file.to_string();
            let embeddings = model.embed(&embed_content).unwrap();

            Some(FileEmbeddings {
                path: file.path,
                embeddings,
            })
        })
        .collect();

    I wonder if there's an optimization here where entire batches of files can be embedded instead of doing each one within a filter_map. But that would require some research into what the model is doing / what the API library offers. The same can be said for writes to the qdrant database. It looks like it supports batch uploads out of the box.
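
Something like this, for example (reusing the File / FileEmbeddings types above; embed_batch is a hypothetical API, and whether the model library offers an equivalent is exactly the research needed):

    // Hypothetical batched flow. `embed_batch` is an assumed method,
    // not something the library is known to expose.
    fn embed_all(model: &Model, files: Vec<File>) -> Vec<FileEmbeddings> {
        let contents: Vec<String> = files.iter().map(|file| file.to_string()).collect();

        // One round trip for the whole batch instead of one per file.
        let all_embeddings: Vec<Vec<f32>> = model.embed_batch(&contents).unwrap();

        files
            .into_iter()
            .zip(all_embeddings)
            .map(|(file, embeddings)| FileEmbeddings {
                path: file.path,
                embeddings,
            })
            .collect()
    }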

@Anush008
Member Author

Anush008 commented Aug 1, 2023

Hey. The https://api.github.com/repos/open-sauced/ai endpoint provides us with the size of the repository. We can introduce a check; 100 MB is what I was thinking. We can tune the number.
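
For instance (a sketch assuming reqwest with the blocking feature and serde; GitHub reports the size field in kilobytes):

    use serde::Deserialize;

    const MAX_REPO_SIZE_KB: u64 = 100 * 1024; // 100 MB cap, open to tuning

    #[derive(Deserialize)]
    struct RepoInfo {
        // GitHub reports repository size in kilobytes.
        size: u64,
    }

    fn repo_within_limit(owner: &str, name: &str) -> Result<bool, reqwest::Error> {
        let url = format!("https://api.github.com/repos/{owner}/{name}");
        let info: RepoInfo = reqwest::blocking::Client::new()
            .get(url)
            // GitHub's API rejects requests without a User-Agent header.
            .header("User-Agent", "repo-query")
            .send()?
            .json()?;
        Ok(info.size <= MAX_REPO_SIZE_KB)
    }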

We're ignoring files that can't be read as UTF-8, like multimedia files, and a few other extensions:

// Fails for non UTF-8 files
match file.read_to_string(&mut content) {
    Ok(_) => Some(File {
        path: file_path,
        content,
        length,
    }),
    Err(_) => None,
}

const IGNORED_EXTENSIONS: &[&str] = &[
    "bpg", "eps", "pcx", "ppm", "tga", "tiff", "wmf", "xpm", "svg", "ttf", "woff2", "fnt", "fon",
    "otf", "pdf", "ps", "dot", "docx", "dotx", "xls", "xlsx", "xlt", "lock", "odt", "ott", "ods",
    "ots", "dvi", "pcl", "mod", "jar", "pyc", "war", "ear", "bz2", "xz", "rpm", "coff", "obj",
    "dll", "class", "log",
];

A worker pool could be great. We'll have to monitor how the deployment performs once the feature is out to decide on further changes.
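
If we do go that route, a minimal sketch with a tokio Semaphore could look like this (the pool size is a placeholder):

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    const MAX_CONCURRENT_JOBS: usize = 4; // placeholder pool size

    // Shared once at startup:
    // let pool = Arc::new(Semaphore::new(MAX_CONCURRENT_JOBS));

    // Each indexing request must hold a permit; requests beyond the pool
    // size wait here until a worker frees up.
    async fn handle_request(pool: Arc<Semaphore>) {
        let _permit = pool.acquire().await.expect("semaphore closed");
        // ... run the indexing job while holding the permit ...
    }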
