Ncbi tests #86

Merged
merged 22 commits into from
Apr 26, 2021
Changes from 15 commits
5 changes: 2 additions & 3 deletions semantic_search/ncbi.py
@@ -3,13 +3,13 @@
 from pathlib import Path
 from typing import Any, Dict, List, Generator
 import logging
-import json
 import time

 import requests
 from Bio import Medline
 from dotenv import load_dotenv
 from pydantic import BaseSettings
+from fastapi import HTTPException

 log = logging.getLogger(__name__)

@@ -99,8 +99,7 @@ def _medline_to_docs(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
     docs = []
     for record in records:
         if "PMID" not in record:
-            logging.warn(f"No PMID for {json.dumps(record)}")
-            continue
+            raise HTTPException(status_code=422, detail=record["id:"][-1])
         pmid = record["PMID"]
         abstract = record["AB"] if "AB" in record else ""
         title = record["TI"] if "TI" in record else ""
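
One thing worth noting about the `raise HTTPException(...)` line in the hunk above: it indexes `record["id:"]`, which would itself raise `KeyError` for a record that has neither `PMID` nor `id:`. The following is only a minimal sketch of a more defensive lookup, not the project's code. `MissingPMIDError` and `require_pmid` are hypothetical names, the exception is a stand-in for the HTTP 422 error so the example runs without FastAPI, and the assumption that `id:` holds a list of identifiers is taken solely from the test added in this PR.

```python
from typing import Any, Dict, List


class MissingPMIDError(Exception):
    """Hypothetical stand-in for an HTTP 422 error."""

    def __init__(self, detail: str) -> None:
        super().__init__(detail)
        self.detail = detail


def require_pmid(record: Dict[str, Any]) -> str:
    """Return the record's PMID, or raise with the best available identifier."""
    if "PMID" in record:
        return record["PMID"]
    # Assumption: "id:" (when present) holds a list of identifiers.
    ids: List[Any] = record.get("id:") or ["<unknown>"]
    raise MissingPMIDError(detail=str(ids[-1]))
```

With this shape, a record lacking both keys still produces the intended error type instead of an unrelated `KeyError`.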
54 changes: 54 additions & 0 deletions tests/test_ncbi.py
@@ -0,0 +1,54 @@
import pytest
import types
from fastapi.exceptions import HTTPException

from semantic_search.ncbi import (
    _medline_to_docs,
    _safe_request,
    _parse_medline,
    _get_eutil_records,
    uids_to_docs,
    Settings,
)

settings = Settings()


def test_invalid_uid_test():
    with pytest.raises(HTTPException):
        uid = ["93846392868"]
        records = [{"id:": [uid]}]
        _medline_to_docs(records)


def test_safe_request():
    eutils_params = {
        "db": "pubmed",
        "id": "9887103",
        "retstart": 0,
        "retmode": "xml",
        "api_key": settings.ncbi_eutils_api_key,
    }
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    assert _safe_request(url, "POST", files=eutils_params).status_code == 200


def test_get_eutil_records():
    eutil = "efetch"
    id = "9887103"
    expected_response = '<generator object parse at 0x7f562cfe9900>'

Collaborator @JohnGiorgi commented on Apr 22, 2021:

This is not correct. You need to figure out what the output should be (after casting it to a list) and set expected_response to that. This is just the string representation of the object.
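
To illustrate the point in the comment above: `str()` on a generator returns only its repr, including a memory address that changes from run to run, whereas `list()` consumes the generator and yields values a test can compare. A minimal sketch, using a hypothetical stand-in generator `fake_records` rather than the real `_get_eutil_records`:

```python
import types


def fake_records():
    """Hypothetical stand-in for a generator-returning function."""
    yield {"PMID": "9887103", "TI": "Example title"}


gen = fake_records()
assert isinstance(gen, types.GeneratorType)

# str() gives the repr, e.g. '<generator object fake_records at 0x...>'.
# The address differs between runs, so it cannot serve as an expected value.
as_text = str(gen)

# list() consumes the generator and produces comparable data.
as_list = list(fake_records())
```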

Contributor Author @Anwesh1 commented on Apr 23, 2021:

The expected list is something like this: (I got it from printing the response of the function)

[{'PMID': '9887103', 'OWN': 'NLM', 'STAT': 'MEDLINE', 'DCOM': '19990225', 'LR': '20190516', 'IS': '0890-9369 (Print) 0890-9369 (Linking)', 'VI': '13', 'IP': '1', 'DP': '1999 Jan 1', 'TI': 'The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development.', 'PG': '98-111', 'AB': 'The TGF-beta superfamily of growth and differentiation factors, including TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in regulating the development of many organisms. These factors signal through a heteromeric complex of type I and II serine/threonine kinase receptors that phosphorylate members of the Smad family of transcription factors, thereby promoting their nuclear localization. Although components of TGF-beta/Activin signaling pathways are well defined in vertebrates, no such pathway has been clearly defined in invertebrates. In this study we describe the role of Baboon (Babo), a type I Activin receptor previously called Atr-I, in Drosophila development and characterize aspects of the Babo intracellular signal-transduction pathway. Genetic analysis of babo loss-of-function mutants and ectopic activation studies indicate that Babo signaling plays a role in regulating cell proliferation. In mammalian cells, activated Babo specifically stimulates Smad2-dependent pathways to induce TGF-beta/Activin-responsive promoters but not BMP-responsive elements. Furthermore, we identify a new Drosophila Smad, termed dSmad2, that is most closely related to vertebrate Smads 2 and 3. Activated Babo associates with dSmad2 but not Mad, phosphorylates the carboxy-terminal SSXS motif and induces heteromeric complex formation with Medea, the Drosophila Smad4 homolog. 
Our results define a novel Drosophila Activin/TGF-beta pathway that is analogous to its vertebrate counterpart and show that this pathway functions to promote cellular growth with minimal effects on patterning.', 'FAU': ['Brummel, T', 'Abdollah, S', 'Haerry, T E', 'Shimell, M J', 'Merriam, J', 'Raftery, L', 'Wrana, J L', "O'Connor, M B"], 'AU': ['Brummel T', 'Abdollah S', 'Haerry TE', 'Shimell MJ', 'Merriam J', 'Raftery L', 'Wrana JL', "O'Connor MB"], 'AD': ['Department of Molecular Biology and Biochemistry, University of California, Irvine, California 92697, USA.'], 'LA': ['eng'], 'SI': ['GENBANK/AF101386'], 'GR': ['GM47462/GM/NIGMS NIH HHS/United States'], 'PT': ['Journal Article', "Research Support, Non-U.S. Gov't", "Research Support, U.S. Gov't, P.H.S."], 'PL': 'United States', 'TA': 'Genes Dev', 'JT': 'Genes & development', 'JID': '8711660', 'RN': ['0 (Bone Morphogenetic Proteins)', '0 (DNA-Binding Proteins)', '0 (Drosophila Proteins)', '0 (RNA, Messenger)', '0 (Receptors, Growth Factor)', '0 (Smad2 Protein)', '0 (Trans-Activators)', 'EC 2.7.11.30 (Activin Receptors)', 'EC 2.7.11.30 (Activin Receptors, Type I)', 'EC 2.7.11.30 (Babo protein, Drosophila)'], 'SB': 'IM', 'MH': ['Activin Receptors', 'Activin Receptors, Type I', 'Amino Acid Sequence', 'Animals', 'Bone Morphogenetic Proteins/genetics', 'Cell Division', 'Cloning, Molecular', 'DNA-Binding Proteins/chemistry/*genetics', 'Drosophila/*embryology', 'Drosophila Proteins', 'Gene Expression Regulation, Developmental', 'In Situ Hybridization', 'Larva/genetics/*growth & development', 'Molecular Sequence Data', 'Phosphorylation', 'RNA, Messenger/genetics', 'Receptors, Growth Factor/*genetics/metabolism', 'Sequence Alignment', 'Sequence Analysis, DNA', 'Signal Transduction/*physiology', 'Smad2 Protein', 'Trans-Activators/chemistry/*genetics', 'Wings, Animal/growth & development'], 'PMC': 'PMC316373', 'EDAT': '1999/01/14 00:00', 'MHDA': '1999/01/14 00:01', 'CRDT': ['1999/01/14 00:00'], 'PHST': ['1999/01/14 00:00 
[pubmed]', '1999/01/14 00:01 [medline]', '1999/01/14 00:00 [entrez]'], 'AID': ['10.1101/gad.13.1.98 [doi]'], 'PST': 'ppublish', 'SO': 'Genes Dev. 1999 Jan 1;13(1):98-111. doi: 10.1101/gad.13.1.98.'}]

I set expected_response to that list and set actual_response to what you suggested, list(_get_eutil_records(eutil, id)). However, the pytest assertion still fails. The error looks like this:

>       assert actual_response == expected_response
E       assert [{'<?xm': ['v...icle>']}, ...] == [{'AB': 'The ...', ...], ...}]
E         At index 0 diff: {'<?xm': ['version="1.0" ?>'], '<!DO': ['YPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">'], '<Pub': ['dArticleSet>', 'dArticle>'], '': ['edlineCitation Status="MEDLINE" Owner="NLM">', '  <PMID Version="1">9</PMID>', '  <DateCompleted>', '      <Year>1976</Year>', '      <Month>01</Month>', '      <Day>23</Day>', '  </DateCompleted>', '  <DateRevised>', '      <Year>2019</Year>', '      <Month>06</Month>', '      <Day>23</Day>', '  </DateRevised>', '  <Article PubM...
E         
E         ...Full output truncated (3 lines hidden), use '-vv' to show

tests/test_ncbi.py:42: AssertionError

This is my updated function:

def test_get_eutil_records():
    eutil = "efetch"
    id = "9887103"
    expected_response = [{'PMID': '9887103', 'OWN': 'NLM', 'STAT': 'MEDLINE', 'DCOM': '19990225', 'LR': '20190516', 'IS': '0890-9369 (Print) 0890-9369 (Linking)', 'VI': '13', 'IP': '1', 'DP': '1999 Jan 1', 'TI': 'The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development.', 'PG': '98-111', 'AB': 'The TGF-beta superfamily of growth and differentiation factors, including TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in regulating the development of many organisms. These factors signal through a heteromeric complex of type I and II serine/threonine kinase receptors that phosphorylate members of the Smad family of transcription factors, thereby promoting their nuclear localization. Although components of TGF-beta/Activin signaling pathways are well defined in vertebrates, no such pathway has been clearly defined in invertebrates. In this study we describe the role of Baboon (Babo), a type I Activin receptor previously called Atr-I, in Drosophila development and characterize aspects of the Babo intracellular signal-transduction pathway. Genetic analysis of babo loss-of-function mutants and ectopic activation studies indicate that Babo signaling plays a role in regulating cell proliferation. In mammalian cells, activated Babo specifically stimulates Smad2-dependent pathways to induce TGF-beta/Activin-responsive promoters but not BMP-responsive elements. Furthermore, we identify a new Drosophila Smad, termed dSmad2, that is most closely related to vertebrate Smads 2 and 3. Activated Babo associates with dSmad2 but not Mad, phosphorylates the carboxy-terminal SSXS motif and induces heteromeric complex formation with Medea, the Drosophila Smad4 homolog. 
Our results define a novel Drosophila Activin/TGF-beta pathway that is analogous to its vertebrate counterpart and show that this pathway functions to promote cellular growth with minimal effects on patterning.', 'FAU': ['Brummel, T', 'Abdollah, S', 'Haerry, T E', 'Shimell, M J', 'Merriam, J', 'Raftery, L', 'Wrana, J L', "O'Connor, M B"], 'AU': ['Brummel T', 'Abdollah S', 'Haerry TE', 'Shimell MJ', 'Merriam J', 'Raftery L', 'Wrana JL', "O'Connor MB"], 'AD': ['Department of Molecular Biology and Biochemistry, University of California, Irvine, California 92697, USA.'], 'LA': ['eng'], 'SI': ['GENBANK/AF101386'], 'GR': ['GM47462/GM/NIGMS NIH HHS/United States'], 'PT': ['Journal Article', "Research Support, Non-U.S. Gov't", "Research Support, U.S. Gov't, P.H.S."], 'PL': 'United States', 'TA': 'Genes Dev', 'JT': 'Genes & development', 'JID': '8711660', 'RN': ['0 (Bone Morphogenetic Proteins)', '0 (DNA-Binding Proteins)', '0 (Drosophila Proteins)', '0 (RNA, Messenger)', '0 (Receptors, Growth Factor)', '0 (Smad2 Protein)', '0 (Trans-Activators)', 'EC 2.7.11.30 (Activin Receptors)', 'EC 2.7.11.30 (Activin Receptors, Type I)', 'EC 2.7.11.30 (Babo protein, Drosophila)'], 'SB': 'IM', 'MH': ['Activin Receptors', 'Activin Receptors, Type I', 'Amino Acid Sequence', 'Animals', 'Bone Morphogenetic Proteins/genetics', 'Cell Division', 'Cloning, Molecular', 'DNA-Binding Proteins/chemistry/*genetics', 'Drosophila/*embryology', 'Drosophila Proteins', 'Gene Expression Regulation, Developmental', 'In Situ Hybridization', 'Larva/genetics/*growth & development', 'Molecular Sequence Data', 'Phosphorylation', 'RNA, Messenger/genetics', 'Receptors, Growth Factor/*genetics/metabolism', 'Sequence Alignment', 'Sequence Analysis, DNA', 'Signal Transduction/*physiology', 'Smad2 Protein', 'Trans-Activators/chemistry/*genetics', 'Wings, Animal/growth & development'], 'PMC': 'PMC316373', 'EDAT': '1999/01/14 00:00', 'MHDA': '1999/01/14 00:01', 'CRDT': ['1999/01/14 00:00'], 'PHST': ['1999/01/14 00:00 
[pubmed]', '1999/01/14 00:01 [medline]', '1999/01/14 00:00 [entrez]'], 'AID': ['10.1101/gad.13.1.98 [doi]'], 'PST': 'ppublish', 'SO': 'Genes Dev. 1999 Jan 1;13(1):98-111. doi: 10.1101/gad.13.1.98.'}]
    actual_response = list(_get_eutil_records(eutil,id))
    assert isinstance(_get_eutil_records(eutil, id), types.GeneratorType)
    assert actual_response == expected_response
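
A possible reading of the failure output above (an interpretation, not something stated in the thread): keys such as `'<?xm'` and `'<!DO'` are what a tag-based MEDLINE-style parser would produce if it were handed XML text, since such a parser takes a fixed-width tag from the start of each line. The toy function below is a hypothetical illustration, not Biopython's implementation. It assumes a four-character tag field followed by two separator characters.

```python
from typing import Tuple


def parse_tagged_line(line: str) -> Tuple[str, str]:
    """Split a line into (tag, value), assuming a 4-character tag field
    followed by 2 separator characters, as in MEDLINE-style text."""
    return line[:4].rstrip(), line[6:]


# A MEDLINE-style line splits into a sensible tag and value.
medline_pair = parse_tagged_line("PMID- 9887103")
short_tag_pair = parse_tagged_line("TI  - Example title")

# An XML line is split at the same fixed offsets, giving a nonsense tag.
xml_pair = parse_tagged_line('<?xml version="1.0" ?>')
```

Under these assumptions the XML declaration splits into the tag `'<?xm'` and the value `'version="1.0" ?>'`, which matches the first entry in the diff shown by pytest.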

Collaborator @JohnGiorgi commented on Apr 23, 2021:

Yeah, this means actual does not equal expected: either expected is wrong or actual is wrong. Running pytest with the -vv argument gives more information on where they differ.
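
Besides `-vv`, a small helper can spell out where two lists of records diverge. The sketch below is hypothetical (`describe_differences` is not part of the project or of pytest) and assumes both values are lists of dictionaries, as in this thread.

```python
from typing import Any, Dict, List


def describe_differences(
    actual: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> List[str]:
    """Return human-readable notes on where two lists of records differ."""
    notes = []
    if len(actual) != len(expected):
        notes.append(f"length: actual={len(actual)} expected={len(expected)}")
    for index, (got, want) in enumerate(zip(actual, expected)):
        for key in sorted(set(got) | set(want)):
            if key not in got:
                notes.append(f"[{index}] missing key {key!r}")
            elif key not in want:
                notes.append(f"[{index}] unexpected key {key!r}")
            elif got[key] != want[key]:
                notes.append(f"[{index}] {key!r}: {got[key]!r} != {want[key]!r}")
    return notes


# Made-up records that loosely mirror the mismatch reported above.
actual = [{"PMID": "9", "<?xm": ['version="1.0" ?>']}]
expected = [{"PMID": "9887103", "TI": "Example title"}]
for note in describe_differences(actual, expected):
    print(note)
```

Listing keys that are missing or unexpected tends to reveal quickly whether the two sides hold the same kind of data at all.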

Contributor Author commented:

I found the actual value and replaced it and now it seems to work. I was getting the wrong list in the beginning. 👍

Collaborator commented:

Cool! Is it easy to update the other unit tests now?

Anwesh1 marked this conversation as resolved.

    actual_response = str(_get_eutil_records(eutil,id))
    assert isinstance(_get_eutil_records(eutil, id), types.GeneratorType)
    assert actual_response == expected_response


def test_parse_medline():
    text = "\nPMID- 9887103\nOWN - NLM\nSTAT- MEDLINE\nDCOM- 19990225\nLR - 20190516\nIS - 0890-9369 (Print)\nIS - 0890-9369 (Linking)\nVI - 13\nIP - 1\nDP - 1999 Jan 1\nTI - The Drosophila activin receptor baboon signals through dSmad2 and controls cell\n proliferation but not patterning during larval development.\nPG - 98-111\nAB - The TGF-beta superfamily of growth and differentiation factors, including\n TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in \n regulating the development of many organisms. These factors signal through a\n heteromeric complex of type I and II serine/threonine kinase receptors that\n phosphorylate members of the Smad family of transcription factors, thereby\n promoting their nuclear localization. Although components of TGF-beta/Activin\n signaling pathways are well defined in vertebrates, no such pathway has been\n clearly defined in invertebrates. In this study we describe the role of Baboon\n (Babo), a type I Activin receptor previously called Atr-I, in Drosophila\n development and characterize aspects of the Babo intracellular\n signal-transduction pathway. Genetic analysis of babo loss-of-function mutants\n and ectopic activation studies indicate that Babo signaling plays a role in\n regulating cell proliferation. In mammalian cells, activated Babo specifically\n stimulates Smad2-dependent pathways to induce TGF-beta/Activin-responsive\n promoters but not BMP-responsive elements. Furthermore, we identify a new\n Drosophila Smad, termed dSmad2, that is most closely related to vertebrate Smads \n 2 and 3. Activated Babo associates with dSmad2 but not Mad, phosphorylates the\n carboxy-terminal SSXS motif and induces heteromeric complex formation with Medea,\n the Drosophila Smad4 homolog. Our results define a novel Drosophila\n Activin/TGF-beta pathway that is analogous to its vertebrate counterpart and show\n that this pathway functions to promote cellular growth with minimal effects on\n patterning.\nFAU - Brummel, T\nAU - Brummel T\nAD - Department of Molecular Biology and Biochemistry, University of California,\n Irvine, California 92697, USA.\nFAU - Abdollah, S\nAU - Abdollah S\nFAU - Haerry, T E\nAU - Haerry TE\nFAU - Shimell, M J\nAU - Shimell MJ\nFAU - Merriam, J\nAU - Merriam J\nFAU - Raftery, L\nAU - Raftery L\nFAU - Wrana, J L\nAU - Wrana JL\nFAU - O'Connor, M B\nAU - O'Connor MB\nLA - eng\nSI - GENBANK/AF101386\nGR - GM47462/GM/NIGMS NIH HHS/United States\nPT - Journal Article\nPT - Research Support, Non-U.S. Gov't\nPT - Research Support, U.S. Gov't, P.H.S.\nPL - United States\nTA - Genes Dev\nJT - Genes & development\nJID - 8711660\nRN - 0 (Bone Morphogenetic Proteins)\nRN - 0 (DNA-Binding Proteins)\nRN - 0 (Drosophila Proteins)\nRN - 0 (RNA, Messenger)\nRN - 0 (Receptors, Growth Factor)\nRN - 0 (Smad2 Protein)\nRN - 0 (Trans-Activators)\nRN - EC 2.7.11.30 (Activin Receptors)\nRN - EC 2.7.11.30 (Activin Receptors, Type I)\nRN - EC 2.7.11.30 (Babo protein, Drosophila)\nSB - IM\nMH - Activin Receptors\nMH - Activin Receptors, Type I\nMH - Amino Acid Sequence\nMH - Animals\nMH - Bone Morphogenetic Proteins/genetics\nMH - Cell Division\nMH - Cloning, Molecular\nMH - DNA-Binding Proteins/chemistry/*genetics\nMH - Drosophila/*embryology\nMH - Drosophila Proteins\nMH - Gene Expression Regulation, Developmental\nMH - In Situ Hybridization\nMH - Larva/genetics/*growth & development\nMH - Molecular Sequence Data\nMH - Phosphorylation\nMH - RNA, Messenger/genetics\nMH - Receptors, Growth Factor/*genetics/metabolism\nMH - Sequence Alignment\nMH - Sequence Analysis, DNA\nMH - Signal Transduction/*physiology\nMH - Smad2 Protein\nMH - Trans-Activators/chemistry/*genetics\nMH - Wings, Animal/growth & development\nPMC - PMC316373\nEDAT- 1999/01/14 00:00\nMHDA- 1999/01/14 00:01\nCRDT- 1999/01/14 00:00\nPHST- 1999/01/14 00:00 [pubmed]\nPHST- 1999/01/14 00:01 [medline]\nPHST- 1999/01/14 00:00 [entrez]\nAID - 10.1101/gad.13.1.98 [doi]\nPST - ppublish\nSO - Genes Dev. 1999 Jan 1;13(1):98-111. doi: 10.1101/gad.13.1.98.\n"
    # checking if generator is returned, need to check integrity of the returned value
    assert isinstance(_parse_medline(text), types.GeneratorType)


def test_uids_to_docs():
    uids = ["9887103"]
    # checking if generator is returned, need to check integrity of the returned value
    assert isinstance(uids_to_docs(uids), types.GeneratorType)
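
The comments in the last two tests note that the yielded values still need checking. One hedged way to do that without contacting NCBI is to inject the fetch step and supply a stub in the test. Everything in the sketch below (`fetch_docs`, `fake_fetch`, the output keys) is hypothetical and not the project's API. It only illustrates the testing pattern.

```python
import types
from typing import Callable, Dict, Iterable, Iterator


def fetch_docs(
    uids: Iterable[str], fetch: Callable[[str], Dict[str, str]]
) -> Iterator[Dict[str, str]]:
    """Hypothetical generator: yield one document per UID via `fetch`."""
    for uid in uids:
        record = fetch(uid)
        yield {"uid": record["PMID"], "title": record.get("TI", "")}


def fake_fetch(uid: str) -> Dict[str, str]:
    """Stub that replaces the network call with canned data."""
    return {"PMID": uid, "TI": "Example title"}


def test_fetch_docs_yields_expected_documents() -> None:
    result = fetch_docs(["9887103"], fetch=fake_fetch)
    # Type check, as in the tests above.
    assert isinstance(result, types.GeneratorType)
    # Content check: consume the generator and compare the values.
    assert list(result) == [{"uid": "9887103", "title": "Example title"}]
```

Because the stub returns fixed data, the expected value stays stable even if the upstream record is later revised.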