potatohead-mg/amfi-pipeline

Ingestion pipeline to fetch and store mutual fund data from AMFI website

Problem statement

Design an ingestion pipeline that fetches Mutual Fund data from the AMFI website, models it, and stores it in a database for further processing.

Submission Requirements

  • The pipeline should be able to do a full load to begin with and then fetch incremental data every day.
  • Modelling the data and the choice of database are left to you.
  • The pipeline should be able to deal with failures and should be designed to fetch and store data in a performance-efficient manner.
  • (Extra) Try to get the initial load (fetch + DB insertion) to complete in less than 3 hours.
  • (Extra) Model the data and tune it such that read queries for a particular fund/multiple funds over a certain time period are performant.

Data analysis

Available AMFI data

The AMFI NAV history page has two main sections:

  • Latest NAV, providing latest NAV reports in different formats (and different APIs)
  • NAV history, providing historical NAV data through a unified API

Latest NAV

  • Limiting my scope to the Complete NAV report
  • Available data
    • NAV data format: Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Date
    • Scheme type: open-ended scheme variations, close-ended scheme variations
    • Mutual fund
  • Each NAV row contains the latest available data for a given scheme
  • NAV data is calculated and updated at the end of every market day
  • API currently gives around 13000 records (excluding blank lines)
  • (Optimization) Since data from inactive schemes will be processed during the historical load, incremental loads can focus only on recent data
    • If old NAV data is ever edited for any reason, the changes will not be reflected in this approach
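
As a sketch of how the semicolon-delimited report described above could be turned into records: the parser below assumes that rows with exactly six fields are NAV records and that anything else (scheme-type or mutual-fund lines) is a section header. The field names, the "N.A." placeholder for missing NAVs, and the dd-Mon-yyyy date format are my assumptions, not confirmed details of the report.

```python
from datetime import datetime

def parse_nav_lines(lines):
    """Parse lines of the Complete NAV report into record dicts.

    Assumption: NAV rows have exactly 6 semicolon-separated fields;
    any other non-blank line is a section header and is skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the report contains many blank lines
        fields = line.split(";")
        if len(fields) != 6 or fields[0] == "Scheme Code":
            continue  # section header or the column-header line
        scheme_code, isin_growth, isin_reinv, name, nav, date = fields
        records.append({
            "scheme_code": scheme_code,
            "isin_growth": isin_growth,
            "isin_reinvestment": isin_reinv,
            "scheme_name": name,
            # "N.A." as the missing-NAV marker is an assumption
            "nav": None if nav in ("", "N.A.") else float(nav),
            # dd-Mon-yyyy (e.g. 26-Apr-2024) is an assumption
            "date": datetime.strptime(date, "%d-%b-%Y").date(),
        })
    return records
```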

NAV history

Query scenarios

Since data-usage scenarios will influence how the data should be modeled, I looked for some common queries using NAV data. (The problem statement did not specify how the data might be used.)

  • Filter NAV data based on scheme, category, start and end dates
  • Compare NAV between dates
  • Calculate and compare averages over month/year
  • 52-week highest/lowest NAVs

Other observations

  • NAV data usually cannot be compared across schemes. (Potential for data partitioning)
  • Any operation/aggregation on NAV will ALWAYS require some level of filtering, at least on scheme
  • Data loads happen rarely (daily at most), and loaded data will not be updated in this design
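
Given the observations above (every read filters on scheme, loads are append-only), a table keyed on (scheme_code, nav_date) keeps a per-scheme date-range read as a single index range scan. A minimal sketch, using the stdlib sqlite3 purely as a stand-in for whatever database is chosen; the sample rows and values are illustrative only:

```python
import sqlite3

# Composite primary key on (scheme_code, nav_date): filtering one scheme
# over a date range resolves to an index range scan, ordered by date.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nav_data (
        scheme_code TEXT NOT NULL,
        nav_date    TEXT NOT NULL,  -- ISO-8601 text sorts chronologically
        nav         REAL,
        PRIMARY KEY (scheme_code, nav_date)
    )
""")
conn.executemany(
    "INSERT INTO nav_data VALUES (?, ?, ?)",
    [("119551", "2024-04-24", 28.41),
     ("119551", "2024-04-25", 28.45),
     ("119551", "2024-04-26", 28.49),
     ("100033", "2024-04-26", 401.12)],
)

# The common read path: one scheme over a period, ordered by date
rows = conn.execute(
    "SELECT nav_date, nav FROM nav_data"
    " WHERE scheme_code = ? AND nav_date BETWEEN ? AND ?"
    " ORDER BY nav_date",
    ("119551", "2024-04-25", "2024-04-26"),
).fetchall()
```

Partitioning by scheme, noted above as a possibility, would follow the same key order.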

Pipeline design and implementation

  1. Fetch data from API (latest or historical)
  2. Create a temporary csv file to enable COPY
  3. COPY data from csv to staging table
  4. Move data from staging table to nav_data based on conditions
  5. Use a scheduling tool (cron, Airflow, etc.) to schedule daily incremental loads
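
Steps 2–4 above can be sketched end to end. This is an illustration only: sqlite3 stands in for the target database (with PostgreSQL, step 3 would be a COPY FROM the csv file rather than executemany), and the step-4 "conditions" are assumed here to mean skipping rows already present, which also makes a rerun after a failure safe:

```python
import csv
import os
import sqlite3
import tempfile

def run_load(raw_records, conn):
    """Steps 2-4: temp csv -> staging table -> conditional move to nav_data."""
    # 2. Write the fetched records to a temporary csv file
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(raw_records)
    # 3. Bulk-load the csv into the staging table
    #    (executemany stands in for PostgreSQL's COPY here)
    conn.execute("CREATE TABLE IF NOT EXISTS staging "
                 "(scheme_code TEXT, nav_date TEXT, nav REAL)")
    conn.execute("DELETE FROM staging")
    with open(path, newline="") as f:
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", csv.reader(f))
    os.remove(path)
    # 4. Move into nav_data; OR IGNORE skips rows already loaded, so a
    #    failed run can simply be retried without creating duplicates
    conn.execute("INSERT OR IGNORE INTO nav_data SELECT * FROM staging")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nav_data (scheme_code TEXT, nav_date TEXT, nav REAL, "
             "PRIMARY KEY (scheme_code, nav_date))")
run_load([("119551", "2024-04-26", 28.49)], conn)
run_load([("119551", "2024-04-26", 28.49)], conn)  # rerun: no duplicates
```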
