- Introduction
- Documentation
- Usage
ProfileManager was designed to aggregate information about corporate entities in support of building business profiles. It uses the United States Securities and Exchange Commission (SEC) Central Index Key (CIK) as a universally unique identifier (UUID) for each entity. Because each profile is a dictionary, users can compile a variety of information on corporate entities in a format that is easy to use and query. To make this information more accessible, ProfileManager includes mappings from CIK codes to names and back, from names to aliases, and from industry codes (namely North American Industry Classification System (NAICS) and Standard Industrial Classification (SIC) codes) to their descriptions. The aim is to provide a flexible data solution for complex business-oriented applications.
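As a rough illustration of the dictionary-based design described above, the sketch below keys profiles by CIK and keeps name lookups in both directions. Every name in it (`profiles`, `cik_to_name`, `name_to_cik`, `add_profile`) and every sample value is a hypothetical placeholder, not part of ProfileManager's documented API.

```python
# Hypothetical sketch: profiles stored as plain dictionaries keyed by CIK.
profiles = {}
cik_to_name = {}
name_to_cik = {}


def add_profile(cik, name, **fields):
    """Register a profile under its CIK and keep both lookup maps in sync."""
    profile = {"cik": cik, "name": name}
    profile.update(fields)
    profiles[cik] = profile
    cik_to_name[cik] = name
    name_to_cik[name.lower()] = cik  # lower-cased for case-insensitive lookup
    return profile


# Placeholder values, not real SEC data.
add_profile("0000000001", "Example Corp", sic="0000", aliases=["ExCo"])

# Look a profile up by name, then read a field straight from the dictionary.
cik = name_to_cik["example corp"]
print(profiles[cik]["aliases"])  # ['ExCo']
```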
Search for a company name on EWG and get all products made by that company in the EWG database
Parameters
- company (str) : A company name to find products for
- driver (selenium.webdriver.Chrome) : Chrome driver after calling 'driver = setDriver()'
Returns
- dict
  - COMPANY (str) : key is the company name; value is a list of products made by the company
Extract the company name from the raw content of an HTML tag
Parameters
- text (str) : raw content of an HTML tag
Returns
- str : a clean company name with extraneous text filtered out
Build a dictionary that holds the parent company and subsidiary information for a given company
Parameters
- company (str) : company name to build dictionary for
- parent (str) : The parent company name for the company
- children_list (list of strings) : list of subsidiary names of the company
Returns
- dict :
- parent (str) : the parent company name
- child (list of str) : a list of subsidiary names of the company
Search "COMPANY_NAME+subsidiaries" RECURSIVELY on Google to get the subsidiaries of a company at every level, and build a master dictionary that contains all-level subsidiary information for the company
Parameters
- company (str) : company name to find subsidiary for
- driver (selenium.webdriver.Chrome) : Chrome driver after calling 'driver = setDriver()'
Returns
- dict
  - company (str) :
    - parent (str) : the parent company of the company, 'NA' if not found
    - child (list) : a list of subsidiary names
  - company (str) : one entry of the same form for each company found
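The recursive build described above can be sketched without a browser. This is a minimal illustration, not the project's implementation: `collect_subsidiaries` and the `lookup` callable are hypothetical names, and `lookup` stands in for the Google scraping step so the traversal can be run on made-up data.

```python
def collect_subsidiaries(company, lookup, parent="NA", master=None):
    """Recursively build {company: {"parent": ..., "child": [...]}}.

    `lookup` is any callable that returns a list of direct subsidiary
    names for a company; it stands in for the Google scraping step.
    """
    if master is None:
        master = {}
    if company in master:
        # Already visited: avoids infinite loops on circular results.
        return master
    children = lookup(company)
    master[company] = {"parent": parent, "child": list(children)}
    for child in children:
        collect_subsidiaries(child, lookup, parent=company, master=master)
    return master


# Made-up ownership tree used in place of live search results.
fake_results = {"A": ["B", "C"], "B": ["D"]}
master = collect_subsidiaries("A", lambda name: fake_results.get(name, []))
print(master["D"])  # {'parent': 'B', 'child': []}
```

Tracking visited companies in `master` is a design choice of this sketch; search results can be circular, and without the check the recursion might never terminate.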
Open facility report page and scrape facility information into a dictionary
Parameters
- tri_id (str) : TRI facility id used as a unique identifier for a facility on TRI Search
- driver (selenium.webdriver.Chrome) : Chrome driver after calling 'driver = setDriver()'
Returns
- fac_dict (dict)
- fac_name(str): Facility Name
- tri_id(str): TRI facility ID
- address(str): Facility Address
- frs_id(str): FRS ID
- mailing_name(str): Facility Mailing Name
- mailing_address(str): Facility Mailing Address
- duns_num(str): Facility Duns Number
- parent_company(str): Facility's Parent Company Name
- county(str): County
- pub_contact(str): Public Contact Name
- region(str): EPA Region Code
- phone(str): Contact Number
- latitude(str): Latitude
- tribe(str): Tribe
- longitude(str): Longitude
- bia_tribal_code(str): BIA Tribal Code
- naics(str): Naics Code
- sic(str): SIC Code
- last_form(str): Last Year of Report
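Because every value in the returned dictionary is documented as a string, numeric fields need converting before use. The sketch below shows one way to do that for the coordinates; `facility_coordinates` is a hypothetical helper and the record values are invented placeholders.

```python
def facility_coordinates(fac_dict):
    """Return (latitude, longitude) as floats, or None if either value
    is missing or not numeric."""
    try:
        return float(fac_dict["latitude"]), float(fac_dict["longitude"])
    except (KeyError, TypeError, ValueError):
        return None


# Placeholder record using a few of the documented keys; values are invented.
fac_dict = {
    "fac_name": "EXAMPLE PLANT",
    "tri_id": "00000EXMPL000EX",
    "latitude": "40.5",
    "longitude": "-86.25",
}
print(facility_coordinates(fac_dict))  # (40.5, -86.25)
```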
Search "COMPANY_NAME+subsidiaries" on Google and scrape the subsidiary names that Google returns in the knowledge graph panel at the top of the results
Parameters
- company (str) : A company name to find subsidiary for
- driver (selenium.webdriver.Chrome) : Chrome driver after calling 'driver = setDriver()'
Returns
- list : a list of subsidiary names(str)
Search NPIRS for a chemical name and get a list of companies whose products use that chemical, according to the NPIRS database
Parameters
- chemical (str) : a hazard name
- driver (selenium.webdriver.Chrome) : Chrome driver after calling 'driver = setDriver()'
Returns
- list of strings : a list of companies that use the hazard
Search for each product name on EWG, get all ingredients of the product from the EWG database, and build a master dictionary that maps companies to products to ingredients
Parameters
- comp_prod_dict (dict) : dictionary that contains company to products information after calling 'comp_prod_dict = company_to_product(company, driver)'
- driver (selenium.webdriver.Chrome) : Chrome driver after calling 'driver = setDriver()'
Returns
- dict
  - COMPANY (str) :
    - PRODUCT (str) : a list of ingredients in the company product
  - COMPANY (str) : one entry of the same form for each company
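To show how the nested company-products-ingredients dictionary might be consumed, the sketch below inverts it into an ingredient index. `products_by_ingredient` is a hypothetical helper and the sample data is invented; only the nesting follows the return structure documented above.

```python
from collections import defaultdict


def products_by_ingredient(comp_prod_ingr):
    """Invert {company: {product: [ingredients]}} into
    {ingredient: [(company, product), ...]}."""
    index = defaultdict(list)
    for company, products in comp_prod_ingr.items():
        for product, ingredients in products.items():
            for ingredient in ingredients:
                index[ingredient].append((company, product))
    return dict(index)


# Invented sample data in the documented shape.
sample = {
    "EXAMPLE CO": {
        "Example Shampoo": ["WATER", "GLYCERIN"],
        "Example Lotion": ["WATER"],
    }
}
index = products_by_ingredient(sample)
print(index["WATER"])
# [('EXAMPLE CO', 'Example Shampoo'), ('EXAMPLE CO', 'Example Lotion')]
```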
Remove null values in a company list
Parameters
- comp_list (list of strings) : a list of companies
Returns
- list of strings : a clean list of company names with no null values
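The documentation does not say what counts as a null value, so the following is only a sketch under an assumption: entries that are `None`, empty, or whitespace-only are dropped. The name `remove_null_sketch` is deliberately different from the real function to mark it as illustrative.

```python
def remove_null_sketch(comp_list):
    """Drop None, empty, and whitespace-only entries; trim the rest.

    Assumption: this definition of "null" is a guess, not the project's rule.
    """
    return [
        name.strip()
        for name in comp_list
        if isinstance(name, str) and name.strip()
    ]


print(remove_null_sketch(["Acme", None, "", "  ", " Beta Inc "]))
# ['Acme', 'Beta Inc']
```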
Set up a Selenium WebDriver object for running web crawlers on various systems. Note: requires a chromedriver binary for each supported platform in a chromedrivers directory
Parameters
- headless (bool) : if True, start a headless browser; if False (default), start a browser with a visible window
Returns
- selenium.webdriver.Chrome : driver with standard option settings
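One plausible way to pick a platform-specific chromedriver and apply the `headless` flag is sketched below. It is not the project's `setDriver`: the file names in `DRIVER_FILES`, the helper names `chromedriver_path` and `make_driver`, and the use of the Selenium 4 `Service`/`Options` API are all assumptions.

```python
import os
import platform

# Assumed layout: one chromedriver binary per platform under ./chromedrivers.
# These file names are illustrative, not the project's actual ones.
DRIVER_FILES = {
    "Windows": "chromedriver_win.exe",
    "Darwin": "chromedriver_mac",
    "Linux": "chromedriver_linux",
}


def chromedriver_path(system=None, base_dir="chromedrivers"):
    """Return the chromedriver path for a platform (default: the current one)."""
    system = system or platform.system()
    try:
        return os.path.join(base_dir, DRIVER_FILES[system])
    except KeyError:
        raise RuntimeError(f"No chromedriver bundled for platform: {system}") from None


def make_driver(headless=False):
    """Create a Chrome driver; assumes Selenium 4 is installed."""
    # Imported lazily so the path helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        options.add_argument("--headless")
    return webdriver.Chrome(service=Service(chromedriver_path()), options=options)
```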
Search for a company's Wikipedia page and get the Wikipedia infobox together with all other page content
Parameters
- company (str) : the company you would like to query Wikipedia for
Returns
- tuple
- dict : a dictionary of all other content on the Wikipedia page
- dict : a dictionary of the Wikipedia infobox
- str : page title
- str : page url
- beautifulsoup.table : Wikipedia infobox HTML
NPIRS Engine is a site crawler for NPIRS (http://npirspublic.ceris.purdue.edu/ppis/) that gets all companies whose products use the ingredients the user is looking for.
Function | Input | Processing | Output |
---|---|---|---|
hazard_to_company(chemical,driver) | ingredient name, chrome webdriver | search NPIRS by entering ingredient name and get company names | a list of companies that use the ingredient |
setDriver() | None | set chrome driver used to automatically crawl websites | driver |
get_comp_name(text) | an unfiltered string in html tags | extract relevant content | a string of the exact company name |
remove_null(comp_list) | a list of company names | remove null values | a clean list of company names |
Usage:
- call `driver = setDriver()` to set the chrome driver for crawling
- call `hazard_to_company(chemical, driver)` to get a list of companies
Google Engine is a Google crawler that finds the subsidiaries Google returns directly for the search query "COMPANY_NAME+subsidiaries".
Function | Input | Processing | Output |
---|---|---|---|
get_sub(company, driver) | company name, chrome webdriver | search "COMPANY_NAME+subsidiaries" on Google | a list of subsidiaries returned by Google at the top of the results |
get_recursive_sub(company, driver) | company name, chrome webdriver | search subsidiaries recursively on Google | a dictionary that maps each company name to its parent company and a list of subsidiaries |
Usage:
- call `driver = setDriver()` to set the chrome driver for crawling
- call `master_google_sub = get_recursive_sub(company,driver)` to get the subsidiaries of a company at every level
TRI Engine is a site crawler for TRI Facility (https://www.epa.gov/enviro/tri-search) that gets all facility information for a TRI ID the user provides.
Function | Input | Processing | Output |
---|---|---|---|
get_tri_dict(tri_id, driver) | tri facility id, chrome webdriver | open facility report page and scrape information into a dictionary | a dictionary of facility information |
Usage:
- call `driver = setDriver()` to set the chrome driver for crawling
- call `get_tri_dict(tri_id,driver)` to get a dictionary of facility information
EWG Engine is a site crawler for the EWG Skindeep Database (https://www.ewg.org/skindeep/#.W3H8HNJKiUk) that gets product and ingredient information for a company in that database.
Function | Input | Processing | Output |
---|---|---|---|
company_to_product(company,driver) | company name, chrome webdriver | search company name on EWG and get all products | a dictionary of a company to a list of products |
product_to_ingredient(comp_prod_dict,driver) | company-product dictionary, chrome webdriver | search product name and get all ingredients | a dictionary of company to products to ingredients |
Usage:
IMPORTANT NOTE: the driver must be set up in NON-HEADLESS mode. The user needs to manually close the pop-up ads that appear at the start so that the crawler can function.
- call `driver = setDriver()` to set the chrome driver for crawling
- call `comp_prod_dict = company_to_product(company,driver)` to get a dictionary of company to products
- call `product_to_ingredient(comp_prod_dict,driver)` to get a dictionary of company to products to ingredients
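The EWG steps above could be wrapped so the browser is always closed, even when a crawl fails. The sketch below is one possible arrangement, not part of the package: `run_ewg_lookup` and its `pause` parameter are hypothetical, and the three crawler functions are passed in as arguments because their import path is not documented here.

```python
def run_ewg_lookup(company, set_driver, company_to_product,
                   product_to_ingredient, pause=input):
    """Run both EWG steps in order and always close the browser."""
    # EWG needs a visible browser so pop-up ads can be closed by hand.
    driver = set_driver(headless=False)
    try:
        pause("Close any pop-up ads in the browser, then press Enter: ")
        comp_prod_dict = company_to_product(company, driver)
        return product_to_ingredient(comp_prod_dict, driver)
    finally:
        # quit() shuts down the browser and the chromedriver process.
        driver.quit()
```

With the real functions in scope, a call would look like `run_ewg_lookup(company, setDriver, company_to_product, product_to_ingredient)`.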