
Feature: Repository size restrictions #12

Open · 2 tasks done
Anush008 opened this issue Aug 1, 2023 · 6 comments

Labels: enhancement (New feature or request)

@Anush008
Member

Anush008 commented Aug 1, 2023

Type of feature

🍕 Feature

Current behavior

Presently, the GitHub repositories we index are not subject to any size restrictions, which can lead to exploits and create performance bottlenecks.

Suggested solution

To address this, we'll need to implement a cap on the repository size and the total file count before indexing the repository.
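
A minimal sketch of what that guard could look like (the limit values and function shape here are placeholders, not decided yet):

    // Placeholder limits -- actual values to be decided.
    const MAX_REPO_SIZE_KB: u64 = 100 * 1024; // e.g. 100 MB
    const MAX_FILE_COUNT: usize = 1000;

    /// Rejects a repository before indexing if it exceeds either cap.
    fn check_limits(repo_size_kb: u64, file_count: usize) -> Result<(), String> {
        if repo_size_kb > MAX_REPO_SIZE_KB {
            return Err(format!("repository too large: {repo_size_kb} KB"));
        }
        if file_count > MAX_FILE_COUNT {
            return Err(format!("too many files: {file_count}"));
        }
        Ok(())
    }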

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Contributing Docs

  • I agree to follow this project's Contribution Docs
@Anush008 Anush008 added the enhancement New feature or request label Aug 1, 2023
@Anush008 Anush008 self-assigned this Aug 1, 2023
@bdougie
Member

bdougie commented Aug 1, 2023

Could we instead only take the first 100 files or similar?

@Anush008
Member Author

Anush008 commented Aug 1, 2023

Right, that should be simpler. We can start with 1000 files; I've tried that number before and it worked fine.

@Anush008
Member Author

Anush008 commented Aug 1, 2023

Also, skip super-large files.

@Anush008
Member Author

Anush008 commented Aug 1, 2023

An issue with this: since we're getting the repository as a zip, we'd still need to fetch the whole repo and then use only the first 1000 files.
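
For reference, a sketch of that flow with the zip crate (the crate choice and the cutoff are assumptions; note we still download the full archive and only read the first entries):

    use std::io::{Cursor, Read};
    use zip::ZipArchive;

    const MAX_FILE_COUNT: usize = 1000;

    // Reads at most MAX_FILE_COUNT entries out of an already-downloaded archive.
    fn first_files(zip_bytes: Vec<u8>) -> zip::result::ZipResult<Vec<(String, String)>> {
        let mut archive = ZipArchive::new(Cursor::new(zip_bytes))?;
        let mut files = Vec::new();

        for i in 0..archive.len().min(MAX_FILE_COUNT) {
            let mut entry = archive.by_index(i)?;
            if entry.is_dir() {
                continue;
            }
            let mut content = String::new();
            // Skips entries that aren't valid UTF-8, like we already do elsewhere.
            if entry.read_to_string(&mut content).is_ok() {
                files.push((entry.name().to_string(), content));
            }
        }
        Ok(files)
    }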

@jpmcb
Member

jpmcb commented Aug 1, 2023

> To address this, we'll need to implement a cap on the repository size and the total file count before indexing the repository.

This is similar to a performance bottleneck I encountered when designing the pizza service. Here are a few things we could explore to get us going and achieve some scale for the repo-query service without having to reduce functionality.

  1. In general, we should try our best to design any service to be horizontally scalable. I.e., we should prefer deploying multiple instances over increasing memory/CPU power on a single monolith. As it's designed now, I believe that holds as long as qdrant supports multiple atomic reads/writes. If not, then that's a separate problem. We should look into deploying multiple instances of the repo-query service behind a load balancer to avoid the problem of a single instance being pounded during heavy traffic.
  2. An individual server could have a configurable pool of workers that block new requests until a free worker is available. You can get pretty fancy with this method and advertise the number of free workers to some separate orchestrator service, but for most use cases, having multiple instances behind a load balancer that round-robins requests is pretty good and avoids the same instance getting many requests at the same time.
  3. Is there a way to query the size of the zip ahead of downloading it? Inspecting the size ahead of getting the entire thing would probably help us avoid malicious users trying to drop a huge zip file on our service that we can't expand. We could also add a configuration for the size that should be rejected outright if it's too big.
  4. Skipping large files would help during batch processing, but how would we define "super-large files"? Some big projects have rather large monolithic code files in their codebases and it'd be a bummer to skip those. I think it may be more practical to skip certain types of files (like jpegs, mp4s, pdfs, etc.).
  5. The number of files to be processed on the service doesn't sound like a performance bottleneck. The many round trips to the embedding model do. I.e., right now, each file is processed individually and sent to be embedded individually:
    let file_embeddings: Vec<FileEmbeddings> = files
        .into_par_iter()
        .filter_map(|file| {
            let embed_content = file.to_string();
            let embeddings = model.embed(&embed_content).unwrap();

            Some(FileEmbeddings {
                path: file.path,
                embeddings,
            })
        })
        .collect();

    I wonder if there's an optimization here where entire batches of files can be embedded instead of doing each one within a filter_map. But that would require some research into what the model is doing / what the API library offers. The same can be said for writes to the qdrant database. It looks like it supports batch uploads out of the box.
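
Something like this, for example (reusing the File / FileEmbeddings types above; embed_batch is a hypothetical API, and whether the model library offers an equivalent is exactly the research needed):

    // Hypothetical batched flow. `embed_batch` is an assumed method,
    // not something the library is known to expose.
    fn embed_all(model: &Model, files: Vec<File>) -> Vec<FileEmbeddings> {
        let contents: Vec<String> = files.iter().map(|file| file.to_string()).collect();

        // One round trip for the whole batch instead of one per file.
        let all_embeddings: Vec<Vec<f32>> = model.embed_batch(&contents).unwrap();

        files
            .into_iter()
            .zip(all_embeddings)
            .map(|(file, embeddings)| FileEmbeddings {
                path: file.path,
                embeddings,
            })
            .collect()
    }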

@Anush008
Member Author

Anush008 commented Aug 1, 2023

Hey. The https://api.github.com/repos/open-sauced/ai endpoint provides us with the size of the repository. We can introduce a check; 100 MB is what I was thinking. We can tune the number.
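
For instance (a sketch assuming reqwest with the blocking feature and serde; GitHub reports the size field in kilobytes):

    use serde::Deserialize;

    const MAX_REPO_SIZE_KB: u64 = 100 * 1024; // 100 MB cap, open to tuning

    #[derive(Deserialize)]
    struct RepoInfo {
        // GitHub reports repository size in kilobytes.
        size: u64,
    }

    fn repo_within_limit(owner: &str, name: &str) -> Result<bool, reqwest::Error> {
        let url = format!("https://api.github.com/repos/{owner}/{name}");
        let info: RepoInfo = reqwest::blocking::Client::new()
            .get(url)
            // GitHub's API rejects requests without a User-Agent header.
            .header("User-Agent", "repo-query")
            .send()?
            .json()?;
        Ok(info.size <= MAX_REPO_SIZE_KB)
    }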

We're ignoring files that can't be read as UTF-8, like multimedia files, and a few other extensions:

// Fails for non UTF-8 files
match file.read_to_string(&mut content) {
    Ok(_) => Some(File {
        path: file_path,
        content,
        length,
    }),
    Err(_) => None,
}

const IGNORED_EXTENSIONS: &[&str] = &[
    "bpg", "eps", "pcx", "ppm", "tga", "tiff", "wmf", "xpm", "svg", "ttf", "woff2", "fnt", "fon",
    "otf", "pdf", "ps", "dot", "docx", "dotx", "xls", "xlsx", "xlt", "lock", "odt", "ott", "ods",
    "ots", "dvi", "pcl", "mod", "jar", "pyc", "war", "ear", "bz2", "xz", "rpm", "coff", "obj",
    "dll", "class", "log",
];

A worker pool could be great. We'll have to monitor how the deployment performs once the feature is out to decide on further changes.
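
If we do go that route, a minimal sketch with a tokio Semaphore could look like this (the pool size is a placeholder):

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    const MAX_CONCURRENT_JOBS: usize = 4; // placeholder pool size

    // Shared once at startup:
    // let pool = Arc::new(Semaphore::new(MAX_CONCURRENT_JOBS));

    // Each indexing request must hold a permit; requests beyond the pool
    // size wait here until a worker frees up.
    async fn handle_request(pool: Arc<Semaphore>) {
        let _permit = pool.acquire().await.expect("semaphore closed");
        // ... run the indexing job while holding the permit ...
    }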
