
How to Identify Outdated Files in a Git Repository


Scenario

When maintaining a Git repository, particularly one for documentation, it is common to have files that have not been updated in a long time. One way to address this is to link a section or sentence in the document to the corresponding code or an existing issue, so that the document can be updated when the code changes or the issue is resolved. This works well when starting a new project, but such links are difficult to maintain in a large existing project.

To identify potentially outdated files, you can use the git log command to retrieve the last commit of each file and find files that have been inactive for a long time. The following sections describe how to generate a last commit report for a repository and how to write the script step by step.

Usage

Note: If you want to use the script on macOS, first install gnu-sed and findutils with Homebrew:

brew install gnu-sed findutils
  1. Get the generate_last_commit_report.sh script:

    git clone https://gist.github.com/862f24cec9a5915c71019dea2795c423.git scripts
    chmod +x scripts/generate_last_commit_report.sh
  2. Clone a documentation repository, check out the branch you want to generate the report for, and return to the parent directory so that the scripts directory is reachable in the next step:

    git clone https://github.com/{OWNER}/{REPO}.git {DOC_REPO}
    cd {DOC_REPO}
    git checkout {BRANCH}
    cd ..
  3. Run the generate_last_commit_report.sh script to generate a report in Markdown format. The first argument {DOC_REPO} is the path to the documentation repository, and the second argument {OWNER}/{REPO} is the repository name on GitHub. The report is written to docs_commit_log.md in the current directory:

    ./scripts/generate_last_commit_report.sh {DOC_REPO} {OWNER}/{REPO}
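
Because the report is sorted by last commit time in ascending order, the files that have gone longest without a change appear at the top of the table. As a minimal sketch, assuming the report was written to docs_commit_log.md in the current directory, a small helper can print the table header plus the first few rows. The name show_oldest is hypothetical and is not part of the script:

```shell
#!/bin/bash
# show_oldest is a hypothetical helper, not part of generate_last_commit_report.sh.
# It prints the two table header lines plus the first COUNT data rows of a report.
show_oldest() {
    local report=$1
    local count=${2:-10}
    head -n "$((count + 2))" "$report"
}

# Example: show the 10 least recently updated files, if the report exists.
if [ -f docs_commit_log.md ]; then
    show_oldest docs_commit_log.md 10
fi
```

The offset of 2 accounts for the header row and the separator row of the Markdown table.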

Code reading

The following uses pingcap/docs as an example to show how to write the script step by step.

git clone https://github.com/pingcap/docs.git docs
cd docs

Step 1: Get the last commit of a file

  1. To get the commit logs of a file, use the git log command:

    Wiki: How to show the commit logs of a specific file?

    git log _index.md
  2. Get the last commit log of a file:

    To limit the output to the last commit only, use the -<number>, -n <number>, or --max-count=<number> option.

    Wiki: How to show the latest commit log of a specific file?

    git log -1 _index.md
  3. Customize the last commit information of a file:

    To customize the commit logs, you can use the --format=format:<string> option. For more details, refer to the PRETTY FORMATS section of the git log documentation.

    The following example uses a custom format to show "last commit time (UNIX), commit hash, author name, last commit date in YYYY-MM-DD format, last commit relative time":

    Wiki: How to show commit logs in a custom format?

    git log -1 --format=format:"%at,%H,%an,%as,%ar%n" _index.md
  4. Print the last commit log of a file without a pager by passing the -P or --no-pager option to git:

    Wiki: How to show the commit logs without paging?

    git --no-pager log -1 --format=format:"%at,%H,%an,%as,%ar%n" _index.md
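
The comma-separated line produced above can be split back into fields when a later step needs only one of them. The following is a minimal sketch, not part of the original script: parse_commit_line is a hypothetical helper, the sample line is made up, and the sketch assumes that the author name contains no comma. The relative date can contain commas (for example "1 year, 2 months ago"), so it is read last and receives the rest of the line:

```shell
#!/bin/bash
# parse_commit_line is a hypothetical helper that splits one line of the
# "%at,%H,%an,%as,%ar" output into named fields.
# Assumption: the author name contains no comma. The relative date may
# contain commas, so it is read last and receives the rest of the line.
parse_commit_line() {
    local timestamp hash author date relative
    IFS=, read -r timestamp hash author date relative <<<"$1"
    printf 'timestamp=%s\nhash=%s\nauthor=%s\ndate=%s\nrelative=%s\n' \
        "$timestamp" "$hash" "$author" "$date" "$relative"
}

# Made-up sample line, not taken from a real repository.
parse_commit_line '1700000000,0123abc,Jane Doe,2023-11-14,1 year, 2 months ago'
```

If author names with commas are a concern, a delimiter that cannot appear in any field would be a safer choice than a comma.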

Step 2: Get the last commit of a repository and sort by commit date

  1. Get all Markdown files in a repository:

    for FILE in $(gfind . -name "*.md"); do
        echo "$FILE"
    done

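    The preceding command uses a for loop to iterate over all Markdown files. However, the unquoted command substitution is split on whitespace, so a file name that contains spaces is broken into several words. For example, the following prints ./a, b, and c.abc as three separate items: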

    touch "a b c.abc"
    for FILE in $(gfind . -name "*.abc"); do
    echo "$FILE"
    done

    "For loops over find output are fragile. Use find -exec or a while read loop."
    (SC2044, ShellCheck Wiki)

    To avoid this issue, use a while read -r loop to iterate over all .md files:

    gfind . -name '*.md' | while read -r FILE; do
    echo "$FILE"
    done
  2. Get the last commit log of all Markdown files in a repository:

    gfind . -name '*.md' | while read -r FILE; do
    git --no-pager log -1 --format=format:"%at,$FILE,%H,%an,%as,%ar%n" "$FILE"
    done
  3. Remove the ./ prefix from the file path using sed 's~^./~~' and then sort the results by the Unix timestamp:

    Wiki: sed 's/regexp/replacement/'

    (
    gfind . -name '*.md' | sed 's~^./~~' | while read -r FILE; do
    git --no-pager log -1 --format=format:"%at,$FILE,%H,%an,%as,%ar%n" "$FILE"
    done
    ) | sort --numeric-sort
  4. Remove the Unix timestamp, which is no longer needed after sorting, using sed -E 's~^[0-9]+,~~':

    Wiki: sed -E 's/regexp/replacement/'

    (
    gfind . -name '*.md' | sed 's~^./~~' | while read -r FILE; do
    git --no-pager log -1 --format=format:"%at,$FILE,%H,%an,%as,%ar%n" "$FILE"
    done
    ) | sort --numeric-sort | sed -E 's~^[0-9]+,~~'
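
Before the timestamp is stripped in the last step, it can also be used to keep only the files whose last commit is older than a cutoff. The following is a minimal sketch under that assumption. The helper name filter_older_than, the sample lines, and the cutoff value are all hypothetical:

```shell
#!/bin/bash
# filter_older_than is a hypothetical helper. It reads lines that start with a
# Unix timestamp followed by a comma (the "%at,..." output before the
# timestamp is stripped) and prints only those older than CUTOFF.
filter_older_than() {
    local cutoff=$1
    awk -F, -v cutoff="$cutoff" '$1 < cutoff'
}

# Made-up sample input; the cutoff 1600000000 is an arbitrary example value.
printf '%s\n' \
    '1500000000,old.md,aaa,Ann,2017-07-14,7 years ago' \
    '1700000000,new.md,bbb,Bob,2023-11-14,1 year ago' |
    filter_older_than 1600000000
```

On systems where date supports the +%s format, a cutoff of roughly one year ago can be computed as $(( $(date +%s) - 365 * 24 * 60 * 60 )) and passed as the argument.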

Step 3: Generate the last commit report

To make the report more readable, the following steps generate a Markdown table from the output of the previous step.

  1. Change the delimiter in --format from , to |, and change the sed expression to 's~^[0-9]+ ~~', so that the timestamp and the space after it are removed and each row starts with |:

    (
    gfind . -name '*.md' | sed 's~^./~~' | while read -r FILE; do
    git --no-pager log -1 --format=format:"%at | $FILE | %H | %an | %as | %ar |%n" "$FILE"
    done
    ) | sort --numeric-sort | sed -E 's~^[0-9]+ ~~'
  2. Turn the file name into a Markdown link built from the GitHub repository address, the commit hash, and the file path:

    (
    gfind . -name '*.md' | sed 's~^./~~' | while read -r FILE; do
    git --no-pager log -1 --format=format:'%at | '"[$FILE](https://github.com/pingcap/docs/blob/%H/$FILE)"' | %an | %as | %ar |%n' "$FILE"
    done
    ) | sort --numeric-sort | sed -E 's~^[0-9]+ ~~'
  3. Add table headers and output it to the docs_commit_log.md file:

    (
    echo '| File | Last Commit Author | Last Commit Date | Relative Date |'
    echo '| ---- | ------------------ | ---------------- | ------------- |'
    (
    gfind . -name '*.md' | sed 's~^./~~' | while read -r FILE; do
    git --no-pager log -1 --format=format:'%at | '"[$FILE](https://github.com/pingcap/docs/blob/%H/$FILE)"' | %an | %as | %ar |%n' "$FILE"
    done
    ) | sort --numeric-sort | sed -E 's~^[0-9]+ ~~'
    )>"$PWD"/docs_commit_log.md
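  4. Replace the hard-coded directory and repository name with the DIR and REPO variables, which are read from the first and second arguments, to make the script more generic: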

    #!/bin/bash
    set -e

    DIR=$1
    REPO=$2

    (
    echo '| File | Last Commit Author | Last Commit Date | Relative Date |'
    echo '| ---- | ------------------ | ---------------- | ------------- |'
    (
    cd "$DIR"
    gfind . -name '*.md' | sed 's~^./~~' | while read -r FILE; do
    git --no-pager log -1 --format=format:'%at | '"[$FILE](https://github.com/$REPO/blob/%H/$FILE)"' | %an | %as | %ar |%n' "$FILE"
    done
    ) | sort --numeric-sort | sed -E 's~^[0-9]+ ~~'
    )>"$PWD"/docs_commit_log.md
  5. Prefer gfind and gsed when they are installed and fall back to find and sed otherwise, so that the script also runs on macOS:

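    #!/bin/bash
    set -e

    DIR=$1  # path to the local clone of the repository
    REPO=$2 # {OWNER}/{REPO} on GitHub, used to build the file links

    # Prefer the GNU tools installed by Homebrew and fall back to the defaults.
    FIND=$(which gfind || which find)
    SED=$(which gsed || which sed)

    (
        echo '| File | Last Commit Author | Last Commit Date | Relative Date |'
        echo '| ---- | ------------------ | ---------------- | ------------- |'
        (
            cd "$DIR"
            # Emit one "<timestamp> | [file](link) | author | date | relative date |" row per file.
            $FIND . -name '*.md' | $SED 's~^./~~' | while read -r FILE; do
                git --no-pager log -1 --format=format:'%at | '"[$FILE](https://github.com/$REPO/blob/%H/$FILE)"' | %an | %as | %ar |%n' "$FILE"
            done
        ) | sort --numeric-sort | $SED -E 's~^[0-9]+ ~~' # oldest first, then drop the timestamp
    ) >"$PWD"/docs_commit_log.md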
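
The tool lookup in the last step relies on which, which is an external command that is not available on every system. As an alternative sketch, not the method used by the script above, the shell's own command -v can do the same lookup. The helper name first_available is hypothetical:

```shell
#!/bin/bash
# first_available is a hypothetical helper: it prints how the shell resolves
# the first command in its argument list that exists (usually its path),
# and fails if none of the commands is found.
first_available() {
    local name
    for name in "$@"; do
        if command -v "$name" >/dev/null 2>&1; then
            command -v "$name"
            return 0
        fi
    done
    echo "none of these commands was found: $*" >&2
    return 1
}

# Same preference order as the script: GNU tools first, then the defaults.
FIND=$(first_available gfind find)
SED=$(first_available gsed sed)
echo "Using $FIND and $SED"
```

When combined with set -e, a failed lookup stops the script at the assignment with a message instead of continuing with an empty variable.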