
update lars partition function #12

Merged: 1 commit merged into main on Jan 28, 2025
Conversation

aharjatiRaft (Contributor) commented on Jan 27, 2025:

Update the lars partition function, removing the lar_row_counts_by_lei parameter.

Originally, that parameter was used to calculate the size of the lar data chunks returned by the SQL query, which meant running an extra SQL query.

It turns out the chunk list is returned as an iterable type, so we can use a regular loop over each chunk (and skip the lar_row_counts_by_lei SQL query).

Tested by running the 2023 annual and Q2 lars data locally (lar_raw_parquets_2023 and lar_raw_parquets_2023_q2 outputs) and verified that the parquet output file count and contents are the same.

closes #11
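The change described above can be sketched as follows. This is a minimal illustration assuming pandas, with `iter_chunks` standing in for the chunked SQL read (`pandas.read_sql(..., chunksize=N)` yields DataFrames the same way); all names except the loop itself are hypothetical:

```python
import pandas as pd

def iter_chunks(df, chunksize):
    # Stand-in for pandas.read_sql(..., chunksize=N), which yields
    # DataFrames of at most `chunksize` rows each.
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]

def process_lars_partitions(pg_lar_data):
    # Iterate the chunk iterable directly: no separate
    # lar_row_counts_by_lei query is needed to size the partitions.
    return {
        f"partition_{partition_index}": chunk
        for partition_index, chunk in enumerate(pg_lar_data)
    }

lar = pd.DataFrame({"lei": ["BANK1"] * 5 + ["BANK2"] * 2})
partitions = process_lars_partitions(iter_chunks(lar, chunksize=3))
# 7 rows with chunksize 3 -> partitions of sizes 3, 3, 1
```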


for partition_index, chunk in enumerate(pg_lar_data):
A reviewer (Contributor) commented:

It looks like we depend on the # of chunks to be equal to the # of partitions. What determines the TextFileReader chunk size, and where?

The author replied:

The f"pg_lar_{year}" and f"pg_lar_{year}_{quarter}" SQL queries in pipeline.py return chunks based on the size of the SQL query result. For example, pg_lar_2023 returned 47 chunks and `pg_lar_2023_q2` returned 9 chunks.

*These SQL queries originate from raw_typed.yaml
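For context on why the chunk counts differ: `pandas.read_sql` with `chunksize=N` yields ceil(total_rows / N) chunks, so the count follows directly from the result-set size. A small sketch of that arithmetic, assuming the chunksize of 250000 mentioned elsewhere in this thread (the row ranges are derived, not measured):

```python
import math

def expected_chunks(total_rows, chunksize=250_000):
    # pandas.read_sql(..., chunksize=N) yields one DataFrame per chunk,
    # so the number of chunks is ceil(total_rows / N).
    return math.ceil(total_rows / chunksize)

def implied_row_range(num_chunks, chunksize=250_000):
    # Inverting the relationship: observing `num_chunks` chunks means
    # the row count lies in ((num_chunks - 1) * N, num_chunks * N].
    return (num_chunks - 1) * chunksize + 1, num_chunks * chunksize

annual_range = implied_row_range(47)   # pg_lar_2023: 47 chunks
quarter_range = implied_row_range(9)   # pg_lar_2023_q2: 9 chunks
```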

The reviewer replied:

I now see the chunksize: 250000 values for those targets in dev_postgres.yaml. If we can confirm each of those parquet files stored in S3 has 250K or fewer data rows, I think we're good.

The author replied:

I confirmed locally that the output files contain 250k rows (I configured the pipeline to write the files into local folders). I also compared the output from the previous code and this new code, and confirmed that the parquet file count, file sizes, and contents are the same.
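The comparison described above can be sketched as a small helper, assuming both runs' outputs have been loaded into dicts of DataFrames keyed by partition name (the helper name and the loading step are hypothetical):

```python
import pandas as pd

def outputs_match(old_parts, new_parts):
    # Same partition names, and identical contents partition by partition.
    if set(old_parts) != set(new_parts):
        return False
    return all(
        old_parts[name].reset_index(drop=True)
        .equals(new_parts[name].reset_index(drop=True))
        for name in old_parts
    )

a = {"partition_0": pd.DataFrame({"lei": ["BANK1", "BANK2"]})}
b = {"partition_0": pd.DataFrame({"lei": ["BANK1", "BANK2"]})}
same = outputs_match(a, b)
```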

@tptignor (Contributor) left a review:

LGTM pending confirmation of the chunk size functionality

@aharjatiRaft aharjatiRaft merged commit beb1830 into main Jan 28, 2025
@aharjatiRaft aharjatiRaft deleted the 11_lars_partition branch January 28, 2025 21:57
Linked issue #11: Optimize kedro's process lars partitions function