Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uniq Command #93

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

uniq Command #93

wants to merge 17 commits into from

Conversation

n-t-eastham
Copy link

@n-t-eastham n-t-eastham commented Oct 1, 2024

Issue #72

uniq Command

Purpose

Reports or deletes repeated lines in a file.

Syntax

uniq [-c | -t | -f]

Description

The uniq command removes consecutive duplicate lines from a file. It reads input
from either standard input or a specified file and compares adjacent lines to eliminate
repeated occurrences. Because only adjacent duplicate lines are removed it is typically
used in conjunction with the sort command to ensure all duplicates are adjacent. The
unique lines are then written to standard output.

Flags

  • -c: prints the count of the unique lines
  • -t: output the total number of unique lines or words at the end
  • -f: allows the user to ignore case (case-insensitive comparison for uniqueness)

Example

To remove repeated adjacent lines, enter:
uniq README.md

To remove nonadjacent repeated lines, enter:
sort README.md | uniq

To remove nonadjacent repeated words, enter:
split README.md | sort | uniq

Limitations

  • The -f currently does not produce intended output on nonadjacent words.
  • If the file being processed includes any empty lines the uniq command will break. The split command will fix this if pipped into sort then uniq but forces the user to specifically search for unique words rather than lines.
  • The split and sort commands currently do not support any flags.
  • The uniq command cannot chain multiple flags together.

@AbelA72
Copy link

AbelA72 commented Oct 1, 2024

I will code review this

@Unbanneduser
Copy link

I will code review this.

@alach2
Copy link

alach2 commented Oct 2, 2024

I will code review this!

@Unbanneduser
Copy link

Unbanneduser commented Oct 4, 2024

uniq.md:
Some spelling errors. //not important
The description mentions two bugs:
-f flag: The case-insensitive comparison doesn't work as expected with nonadjacent words.
Empty Lines: If the file has empty lines, the uniq command will break.

sort.c:
The file is well structured, but the comment regarding omitted flag handling needs to be updated as flags could be useful for future enhancements.

split.c:
Proper indentation and structure are maintained.
The code lacks proper memory cleanup in the event of failure (e.g., during reallocation).

ulib.c:
fgets: Adjustments were made to handle leading spaces or tabs before the first character is recorded.
getline: The change adds tabs ('\t') as end-of-line characters.

uniq.c:
writewords Function: The function works well but lacks robust error handling if write() fails.
free_lines() is used to free allocated memory, but the code lacks extensive error checks for memory allocations (e.g., malloc failures).

There are minor formatting inconsistencies (e.g., missing comments for new code and typos in function prototypes). Consistency in formatting helps with code readability and maintainability.

High-Level Checks:
[✔️] PR fully resolves the task/issue.
[✔️] Does the PR have tests? (No specific test files added, but functionality has been improved incrementally.)
[ ] Does the PR follow project code formatting standards? (Minor formatting inconsistencies noted.)
[✔️] Does the OS still compile and run? (No compilation issues were noted.)
[✔️] Is documentation included? (Some documentation, though there are typos and redundant lines.)
Code Checks:
[✔️] Does the code achieve its intended purpose?
[✔️] Were all edge cases considered? (Some missing, especially around invalid flags and error handling.)
[✔️] Is the code organized and maintainable?
[✔️] Does the code maintain the same level of performance? (No performance regressions, though potential inefficiencies exist in sorting algorithms.)

@alach2
Copy link

alach2 commented Oct 7, 2024

High-Level Checks:
[✔️] PR fully resolves the task/issue.
[✔️] Does the PR have tests?
[✔️] Does the PR follow project code formatting standards?
[✔️] Does the OS still compile and run?
[✔️] Is documentation included?
Code Checks:
[✔️] Does the code achieve its intended purpose?
[✔️] Were all edge cases considered?
[✔️] Is the code organized and maintainable?
[✔️] Does the code maintain the same level of performance? (No performance regressions)

This is a really good implementation of the uniq command. After testing, it seems like all of your intended implementation works besides the -f flag like you mentioned in your bugs. There are a few minor commenting/documentation issues in split.c and uniq.c. There is some commented out/unused code in your split.c code, so I think it would be good to just remove it. All your files have Javadoc style commenting, but I think it would be helpful to expand on most of those comments, adding more to how the function works. In uniq.c, a few of your functions do not have comments, so I think it would be helpful to keep consistent documentation for all of your functions. Besides those few documentation issues, your code looks great! All of your files and tests fulfill their intended purpose and are easy to follow along with/understand. You also account for any potential errors like malloc failure, file reading errors, and sort failure. Great job on this project!

@AbelA72
Copy link

AbelA72 commented Oct 16, 2024

High-Level Checks:
[✔️] PR fully resolves the task/issue.
[✔️] Does the PR have tests?
[✔️] Does the PR follow project code formatting standards?
[✔️] Does the OS still compile and run?
[✔️] Is documentation included?
Code Checks:
[✔️] Does the code achieve its intended purpose?
[✔️] Were all edge cases considered?
[✔️] Is the code organized and maintainable?
[✔️] Does the code maintain the same level of performance? (No performance regressions)

This implementation looks great to me! I also tested this after you added the -f flag, and that seems to work as well. This was a great project idea, and you did very well at achieving your goal.

Sort.c
I like that you made the actual code for this into separate functions like print_lines etc, made your code way more readable and easier to understand. You still have the unused flags part for sort, which I don't think should be there but I'm guessing you know that and will fix that.

Uniq.c

I also like the formatting again, looks very clean. I don't have any other issues here.

Split.c

You have a bit of extra code at the bottom that you commented out here and I suggest getting rid of that or finishing it up if you were going to use it.

@malensek
Copy link
Contributor

This is a nice refactoring of the original. I like how you now have a separate implementation of split, uniq assumes the input is sorted so you don't need a built-in sort function, etc.

I think the reviewers made some good points here.

My main feedback revolves around efficiency of the approach. As it is, we currently read all the lines into memory, but that isn't necessary since we are assuming the input is sorted. So implementing uniq -c would look like:

void
uniq(int fd)
{
  char *last_line = NULL;
  char *line;
  uint sz = 0;
  int dupe_count = 1;
  while (getline(&line, &sz, fd) > 0) {
    if (last_line == NULL) {
      goto done;
    }

    if (strcmp(line, last_line) == 0) {
      dupe_count += 1;
    } else {
      printf("%d\t%s", dupe_count, last_line);
      dupe_count = 1;
    }

  done:
    if (last_line != NULL) {
      free(last_line);
    }
    last_line = strdup(line);
  }
}

This isn't as efficient as it could be because the string gets dup'd and freed each cycle, and that could be done with a single buffer that is resized as necessary... but that's a lot of complexity to add. This also expects the standard fgets, getline, and strdup -- no changes as were done in the original source you started with (I'm not a fan of adding checks for tabs etc like they did, ideally those changes wouldn't be in the standard library)

If you want to refactor around a memory-efficient version as shown above that'd be +1.5 to your project score.

@n-t-eastham
Copy link
Author

n-t-eastham commented Nov 11, 2024

Modified Files

  • README.md: Updated the test case.
  • docs/uniq.md: Fixed a few typos.
  • user/sort.c: Added documentation.
  • user/split.c: Added documentation.
  • user/ulib.c: Reimplemented the standard fgets and getline.
  • user/uniq.c: Refactored my implementation for a more memory efficient approach.
  • user/user.h: Updated function declaration for sort.

Summary

This update includes improvements to the uniq command's memory management, documentation updates, and bug fixes for getline and fgets functions.

@geoweb999
Copy link

checked uniq and sort for memory leaks, none found.
sort:
user/umalloc.c:454:free(): Returning merged block at heap end to system: 0x0000000000006000 size: 4096
-- Leak Check --

-- Summary --
Used blocks: 0 (0 bytes)
Free blocks: 0 (0 bytes)
Total memory: 0 bytes

uniq
user/umalloc.c:454:free(): Returning merged block at heap end to system: 0x0000000000006000 size: 4096
-- Leak Check --

-- Summary --
Used blocks: 0 (0 bytes)
Free blocks: 0 (0 bytes)
Total memory: 0 bytes
:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants