Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge Case for json requests #81

Open
Anwesh1 opened this issue Apr 13, 2021 · 16 comments
Open

Edge Case for json requests #81

Anwesh1 opened this issue Apr 13, 2021 · 16 comments

Comments

@Anwesh1
Copy link
Contributor

Anwesh1 commented Apr 13, 2021

Hey @JohnGiorgi, I've been bombarding semantic search with a lot of weird scenarios of inputs. I also sent a request for 30+ documents and it seems to work fine (which is great!!). I'm going to continue pushing weird requests to it but is there any specific thing that I should try instead that you think could break the code?
Let me know, thanks.

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 13, 2021

The goal is to try to group your requests to cover different scenarios. Here's a few I can think of:

  • top_k
    • greater than the possible number of document
    • less than the possible number of documents
    • equal to the possible number of documents
    • 0 & less than 0 (is the error message informative?)
    • Something that isn't an integer (float, str -- is the error message informative?)
  • query
    • Provide a uid that doesn't exist (is the error message informative?)
    • With text and uid
    • With uid only
    • Identical to one of the documents
  • documents
    • Provide a uid that doesn't exist (is the error message informative?)
    • With text and uid
    • With uid only
    • Identical to the query (it shouldn't appear in the results, is top k handled correctly in this case?)
    • Providing no documents at all (I think this is supported, if not we would have to add it first)

I encourage you to be creative and expand this list!

If you are having trouble finding edge cases, you could move on to writing unit tests.

@jvwong
Copy link
Member

jvwong commented Apr 14, 2021

Two questions based on trying a few examples:

  1. What does top k do?
  2. What if any, are the constraints on the POST request if I include text?

(should update README)

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 14, 2021

@jvwong Does the readme on develop answer question 1? (see here, "The return value is a JSON representation of the top_k most similar documents (defaults to 10)"

I updated the docstring example in #82 so that it is more informative.

@Anwesh1
Copy link
Contributor Author

Anwesh1 commented Apr 14, 2021

By the way, all the situations you suggested I check out, work. Except for when the uid provided doesn't exist.
This breaks the code producing an Internal Server error, trying to fix that.

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 14, 2021

By the way, all the situations you suggested I check out, work. Except for when the uid provided doesn't exist.
This breaks the code producing an Internal Server error, trying to fix that.

Great! Yeah ideally we would catch this and return a more informative status code (maybe a 400 or 422) with a message (the "uid" provided does not appear to be a valid PMID)

@jvwong
Copy link
Member

jvwong commented Apr 14, 2021

@jvwong Does the readme on develop answer question 1? (see here, "The return value is a JSON representation of the top_k most similar documents (defaults to 10)"

I updated the docstring example in #82 so that it is more informative.

I'm not clear on how to interpret the output based on the input:

  • e.g. I give you 1 query uid, 3 document uids, top_k = 20 --> I get 20 back, but where did the other 17 things come from?

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 14, 2021

I'm not clear on how to interpret the output based on the input:

* e.g. I give you 1 query uid, 3 document uids, top_k = 20 --> I get 20 back, but where did the  other 17 things come from?

I see.

The other 17 things come from the index. During every request, we check if the query or documents are in the index. If not, we add them. They will be included in the search during subsequent requests.

I can think about where to explain this, probably in the readme somewhere.

@jvwong
Copy link
Member

jvwong commented Apr 14, 2021

I'm not clear on how to interpret the output based on the input:

* e.g. I give you 1 query uid, 3 document uids, top_k = 20 --> I get 20 back, but where did the  other 17 things come from?

I see.

The other 17 things come from the index. During every request, we check if the query or documents are in the index. If not, we add them. They will be included in the search during subsequent requests.

I can think about where to explain this, probably in the readme somewhere.

I see. So when you provide a POST body with documents, you're basically saying add these to the index?
This sorta breaks from the original idea of: "Rank the provided query against the provided documents".

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 14, 2021

I see. So when you provide a POST body with documents, you're basically saying add these to the index?

Yes

This sorta breaks from the original idea of: "Rank the provided query against the provided documents".

I see. Yeah, this is why I was trying to be such a stickler about the API/Schemas/Use cases during all the meetings.

I think it would be easy enough to support the original idea. Probably just a single param like use_index which could be set to False, and that would mean to only compare the query against documents. Already being tracked in #61. Do you want me to add this?

@jvwong
Copy link
Member

jvwong commented Apr 14, 2021

This sorta breaks from the original idea of: "Rank the provided query against the provided documents".

I see. Yeah, this is why I was trying to be such a stickler about the API/Schemas/Use cases during all the meetings.

I think it would be easy enough to support the original idea. Probably just a single param like use_index which could be set to False, and that would mean to only compare the query against documents. Already being tracked in #61. Do you want me to add this?

No don't change anything code-wise, I was just trying to clarify the expected behaviour as it exists. The only ambiguity is what's in the index at the time that a client hits it.

@Anwesh1
Copy link
Contributor Author

Anwesh1 commented Apr 14, 2021

Hey @JohnGiorgi, had a question regarding displaying the message. I pinpointed the location to fix the problem where uid is not valid to this place in the ncbi.py file:

if "PMID" not in record:
logging.warn(f"No PMID for {json.dumps(record)}")
continue

Essentially this only triggers for us when the PMID is not valid so when the if statement were to trigger, we would stop our iteration and display a message. I was confused as to how you actually display the message for the user. I think technically that should work.
Let me know if I'm wrong and I'd appreciate insight on how you get the message displayed. (ex: when you have the message display that top_k must be greater than 0, or something like that.)

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 15, 2021

I'd appreciate insight on how you get the message displayed.

Check out the FastAPI documentation on handling errors: https://fastapi.tiangolo.com/tutorial/handling-errors/?h=erro. It should just be a matter of adding something like:

raise HTTPException(status_code=422, detail='the "uid" provided does not appear to be a valid PMID')

@jvwong, @maxkfranz, not sure which status_code is most appropriate here. I figure 400 of 422.

@Anwesh1 Is there a way to pinpoint which "uid" (provided by the user) actually triggered the error? I don't see that here

@Anwesh1
Copy link
Contributor Author

Anwesh1 commented Apr 15, 2021

Line 102 where it states logging.warn, when changed to logging.error shows this:
ERROR:root:No PMID for {"id:": ["8463928683 Error occurred: The following PMID is not available: 938463928683"]}
So, when the record variable is accessed, it actually displays the uid that is causing problems.

@JohnGiorgi
Copy link
Collaborator

JohnGiorgi commented Apr 15, 2021

Cool. Can we access this as a string and pass it to the detail keyword of HTTPException?

So something like:

 if "PMID" not in record: 
     raise HTTPException(status_code=422, detail=f"No PMID for {json.dumps(record)}")

The only concern is that continue keyword. Looks like these errors were just being ignored previously, and now they are going to cause the code to stop and raise an error. Is that okay? (I don't really know how the ncbi module is supposed to work).

@Anwesh1
Copy link
Contributor Author

Anwesh1 commented Apr 15, 2021

Essentially, that if block only triggers when we have a case of an invalid uid, so we don't actually need the continue here. We had the continue for the pubmed-dl library because we were bulk downloading and so we didn't mind if we skipped a few of them.

@Anwesh1
Copy link
Contributor Author

Anwesh1 commented Apr 15, 2021

The error message looks like this:
image

This was referenced Apr 15, 2021
@JohnGiorgi JohnGiorgi reopened this Jun 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants