
update lars partition function #12

Merged: 1 commit merged into main on Jan 28, 2025
Conversation

aharjatiRaft (Contributor) commented on Jan 27, 2025:

Update the lars partition function, removing the lar_row_counts_by_lei parameter.

Originally, that parameter was used to calculate the size of the lar data chunks returned by the SQL query, which meant running an extra SQL query.

It turns out the chunk list is returned as an iterable type, so we can use a regular loop over each chunk (and skip the lar_row_counts_by_lei SQL query).

Tested by running the 2023 annual and Q2 lars data locally (lar_raw_parquets_2023 and lar_raw_parquets_2023_q2 outputs) and verified that the parquet output file count and contents are the same.

closes #11
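The change described above can be sketched as follows. This is a minimal illustration assuming pandas, with `iter_chunks` standing in for the chunked SQL read (`pandas.read_sql(..., chunksize=N)` yields DataFrames the same way); all names except the loop itself are hypothetical:

```python
import pandas as pd

def iter_chunks(df, chunksize):
    # Stand-in for pandas.read_sql(..., chunksize=N), which yields
    # DataFrames of at most `chunksize` rows each.
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]

def process_lars_partitions(pg_lar_data):
    # Iterate the chunk iterable directly: no separate
    # lar_row_counts_by_lei query is needed to size the partitions.
    return {
        f"partition_{partition_index}": chunk
        for partition_index, chunk in enumerate(pg_lar_data)
    }

lar = pd.DataFrame({"lei": ["BANK1"] * 5 + ["BANK2"] * 2})
partitions = process_lars_partitions(iter_chunks(lar, chunksize=3))
# 7 rows with chunksize 3 -> partitions of sizes 3, 3, 1
```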


for partition_index, chunk in enumerate(pg_lar_data):
A reviewer (Contributor) commented:

It looks like we depend on the # of chunks to be equal to the # of partitions. What determines the TextFileReader chunk size, and where?

The author replied:

The f"pg_lar_{year}" and f"pg_lar_{year}_{quarter}" SQL queries in pipeline.py return chunks based on the size of the SQL query result. For example, pg_lar_2023 returned 47 chunks and `pg_lar_2023_q2` returned 9 chunks.

*These SQL queries originate from raw_typed.yaml
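For context on why the chunk counts differ: `pandas.read_sql` with `chunksize=N` yields ceil(total_rows / N) chunks, so the count follows directly from the result-set size. A small sketch of that arithmetic, assuming the chunksize of 250000 mentioned elsewhere in this thread (the row ranges are derived, not measured):

```python
import math

def expected_chunks(total_rows, chunksize=250_000):
    # pandas.read_sql(..., chunksize=N) yields one DataFrame per chunk,
    # so the number of chunks is ceil(total_rows / N).
    return math.ceil(total_rows / chunksize)

def implied_row_range(num_chunks, chunksize=250_000):
    # Inverting the relationship: observing `num_chunks` chunks means
    # the row count lies in ((num_chunks - 1) * N, num_chunks * N].
    return (num_chunks - 1) * chunksize + 1, num_chunks * chunksize

annual_range = implied_row_range(47)   # pg_lar_2023: 47 chunks
quarter_range = implied_row_range(9)   # pg_lar_2023_q2: 9 chunks
```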

The reviewer replied:

I now see the chunksize: 250000 values for those targets in dev_postgres.yaml. If we can confirm each of those parquet files stored in S3 has 250K or fewer data rows, I think we're good.

The author replied:

I confirmed locally that the output files contain 250k rows (I configured the pipeline to write the files into local folders). I also compared the output from the previous code and this new code, and confirmed that the parquet file count, file sizes, and contents are the same.
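The comparison described above can be sketched as a small helper, assuming both runs' outputs have been loaded into dicts of DataFrames keyed by partition name (the helper name and the loading step are hypothetical):

```python
import pandas as pd

def outputs_match(old_parts, new_parts):
    # Same partition names, and identical contents partition by partition.
    if set(old_parts) != set(new_parts):
        return False
    return all(
        old_parts[name].reset_index(drop=True)
        .equals(new_parts[name].reset_index(drop=True))
        for name in old_parts
    )

a = {"partition_0": pd.DataFrame({"lei": ["BANK1", "BANK2"]})}
b = {"partition_0": pd.DataFrame({"lei": ["BANK1", "BANK2"]})}
same = outputs_match(a, b)
```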

@tptignor (Contributor) left a review:

LGTM pending confirmation of the chunk size functionality

@aharjatiRaft aharjatiRaft merged commit beb1830 into main Jan 28, 2025
@aharjatiRaft aharjatiRaft deleted the 11_lars_partition branch January 28, 2025 21:57
Linked issue #11: Optimize kedro's process lars partitions function