Merge commits without file-change data are not shown in the dataframe

Issue

I pulled the git log with the command git log --all --numstat --pretty=format:'--%h--%ad--%aN--%s' > ../react_git.logs. I also want a column that marks merge commits, so that I can analyze things such as which author has merged the most commits. This is the code I use to build a dataframe from the generated log:

import os
import pandas as pd

COMMIT_LOG = os.path.join(os.path.abspath(''), 'react_git.logs')
raw_df = pd.read_csv(COMMIT_LOG, sep="\u0012", header=None, names=["raw"])

commit_marker = raw_df[raw_df["raw"].str.startswith("--", na=False)]

commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*?)$", expand=True)
commit_info.insert(loc=2, column='date', value=pd.to_datetime(commit_info['timestamp'], utc=True))
commit_info_copy = commit_info.loc[:]
commit_info_copy['today'] = pd.to_datetime('today', utc=True)
commit_info['age'] = commit_info_copy['date'] - commit_info_copy['today']

file_stats_marker = raw_df[~raw_df.index.isin(commit_info.index)]

file_stats = file_stats_marker['raw'].str.split("\t", expand=True)

file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats['insertion'] = pd.to_numeric(file_stats['insertion'], errors="coerce")
file_stats['deletion'] = pd.to_numeric(file_stats['deletion'], errors="coerce")
file_stats['churn'] = file_stats['insertion'] - file_stats['deletion']

# Forward-fill the commit header fields onto the file-stat rows that follow them
commit_data = commit_info.reindex(raw_df.index).ffill()

commit_data = commit_data[~commit_data.index.isin(commit_info.index)]


df = commit_data.join(file_stats)
print(df.size)
print(df.head())

With a log like this:

--cae635054--Sat Jun 26 14:51:23 2021 -0400--Andrew Clark--`act`: Resolve to return value of scope function (#21759)
31      0       packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js
1       1       packages/react-test-renderer/src/ReactTestRenderer.js
24      14      packages/react/src/ReactAct.js

--e2453e200--Fri Jun 25 15:39:46 2021 -0400--Andrew Clark--act: Add test for bypassing queueMicrotask (#21743)
50      0       packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js

--8f03109cd--Wed Sep 11 09:51:32 2019 -0700--Brian Vaughn--Moved backend injection to the content script (#16752)
--efa780d0a--Wed Sep 11 09:51:24 2019 -0700--Brian Vaughn--Removed DT inject() script since it's no longer being used
0       24      packages/react-devtools-extensions/src/inject.js

--4290967d4--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat' of https://github.com/onionymous/react into onionymous-tt-compat
--f09854a9e--Wed Sep 11 09:30:57 2019 -0700--Brian Vaughn--Moved inline comment.
3       5       packages/react-devtools-extensions/src/injectGlobalHook.js

commits 8f03109cd and 4290967d4 do not appear in the dataframe, because no file-change lines follow them. These commits matter for the analysis, for example to count merge commits as described above.

How can I include these commits in the dataframe with insertion: 0, deletion: 0 and filepath: 0, plus a column or some other marker that distinguishes them from the rest, so that merge commits are easy to identify?

The code and the log file are also on Replit:
https://replit.com/@milanregmi/metricsLogs#main.py

Solution

I read your problem as two questions:

  1. How to get information about merge commits from git log
  2. How to count merges per user and write it to the dataframe

First, you can add the %b placeholder to the --pretty format of git log. It gives you the commit body, which in my repository starts with a line indicating that the commit was a merge commit. A format string like this

git log --all --numstat --pretty=format:'--%h--%ad--%aN--%s--%b'

will then result in something like this

--364d23ac--Wed Jul 14 13:44:55 2021 +0200--Doe, John--Pull request #114: Bugfix/foo-branch--Merge in <repository> from bugfix/foo-branch to dev

* commit '64ee15b12345670d1ec214cb83468cf0a55a341':
  Bugfix: lorem ipsum
  Fix: dolor sit amet
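
As a rough illustration of how such a header line splits into fields, the sketch below applies an extended pattern with a body group to the sample line above. The names HEADER, fields and is_merge are hypothetical, and testing the body for the word "Merge" is an assumption that holds only if your tooling writes such a line.

```python
import re

# Header line copied from the sample output above.
line = ("--364d23ac--Wed Jul 14 13:44:55 2021 +0200--Doe, John"
        "--Pull request #114: Bugfix/foo-branch"
        "--Merge in <repository> from bugfix/foo-branch to dev")

# Same idea as the question's pattern, with one extra group for the body (%b).
HEADER = re.compile(
    r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)"
    r"--(?P<message>.*?)--(?P<body>.*)$"
)

match = HEADER.match(line)
fields = match.groupdict()

# Assumption: a body that mentions "Merge" marks a merge commit.
is_merge = "Merge" in fields["body"]

for name, value in fields.items():
    print(f"{name}: {value}")
print("is_merge:", is_merge)
```

Because the message group is non-greedy, a subject that itself contains "--" would end early and shift the rest into the body, so a rarer separator may be safer for real logs.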

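You can look for the string Merge in the body column extracted from your commit_marker dataframe to identify merge commits. Second, you can count the merges per author by numbering each author's merge commits in descending order (a reversed running count), so that the newest merge of each author carries that author's total.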
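
The two steps can be sketched on a small, made-up frame before applying them to the real log. The data below is hypothetical, and the column names is_merge and merges are assumptions chosen for illustration. The sketch uses groupby with cumcount instead of an explicit loop, which is a swapped-in technique with the same intent.

```python
import pandas as pd

# Hypothetical commit headers, newest first, as git log prints them.
commits = pd.DataFrame({
    "sha": ["4fc0c34", "8188751", "6fa89c3", "364d23a"],
    "author": ["Cow, Jane", "Doe, John", "Cow, Jane", "Doe, John"],
    "body": [
        "Merge in repo from dev to master",
        "Merge in repo from feature/foo-bar to dev",
        "",
        "Merge in repo from bugfix/foo-branch to dev",
    ],
})

# Step 1: flag merge commits (na=False treats a missing body as "not a merge").
commits["is_merge"] = commits["body"].str.contains("Merge", na=False)

# Step 2: number each author's merges in descending order, so the newest
# merge of an author carries that author's total. Non-merge rows stay NaN.
merges = commits[commits["is_merge"]]
commits["merges"] = merges.groupby("author").cumcount(ascending=False) + 1

# Plain total per author, if only the totals are needed.
merges_per_author = merges.groupby("author").size()

print(commits)
print(merges_per_author)
```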

Here it is all together:

import os
import pandas as pd

COMMIT_LOG = os.path.join(os.path.abspath(''), 'react_git.logs')
raw_df = pd.read_csv(COMMIT_LOG, sep="\u0012", header=None, names=["raw"])

commit_marker = raw_df[raw_df["raw"].str.startswith("--", na=False)]

# Add extraction for the new body part
commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*?)--(?P<body>.*?)$", expand=True)
commit_info.insert(loc=2, column='date', value=pd.to_datetime(commit_info['timestamp'], utc=True))
commit_info_copy = commit_info.loc[:]
commit_info_copy['today'] = pd.to_datetime('today', utc=True)
commit_info['age'] = commit_info_copy['date'] - commit_info_copy['today']

file_stats_marker = raw_df[~raw_df.index.isin(commit_info.index)]

file_stats = file_stats_marker['raw'].str.split("\t", expand=True)

file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats['insertion'] = pd.to_numeric(file_stats['insertion'], errors="coerce")
file_stats['deletion'] = pd.to_numeric(file_stats['deletion'], errors="coerce")
file_stats['churn'] = file_stats['insertion'] - file_stats['deletion']

# Forward-fill the commit header fields onto the rows that follow them
commit_data = commit_info.reindex(raw_df.index).ffill()

commit_data = commit_data[~commit_data.index.isin(commit_info.index)]

df = commit_data.join(file_stats)

# Remove additional lines coming from the git commit body.
df.drop_duplicates(inplace=True)

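# Count merges per author
for author in df['author'].unique():
    # Rows by this author whose body mentions a merge (na=False treats a missing body as no match)
    is_merge = (df['author'] == author) & df['body'].str.contains('Merge', na=False)
    idx = df[is_merge].index
    # Number the merges in descending order, so the newest merge carries the author's total
    df.loc[idx, 'merges'] = list(range(len(idx), 0, -1))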

print(df.size)
print(df.head())

What you end up with is the dataframe you had before plus a merges column. It numbers each author's merge commits in descending order, so the newest merge of an author shows that author's total:

sha      timestamp                       date                      author       message                            body                                      age                            insertion   deletion    filepath           churn  merges
b3e5eb7  Fri Jul 16 11:36:43 2021 +0200  2021-07-16 09:36:43+00:00 Doe, John    test deploy                                                                  -4 days +00:51:00.878903000    31.0        3.0         file/path/file1.py 28.0 
4fc0c34  Thu Jul 15 11:12:10 2021 +0200  2021-07-15 09:12:10+00:00 Cow, Jane    Pull request #116: Dev             Merge in repo from dev to master          -5 days +00:26:27.878903000                                                      14.0
8188751  Thu Jul 15 07:42:40 2021 +0200  2021-07-15 05:42:40+00:00 Doe, John    Pull request #115: Feature/foo-bar Merge in repo from feature/foo-bar to dev -6 days +20:56:57.878903000                                                      7.0
6fa89c3  Wed Jul 14 16:02:38 2021 +0200  2021-07-14 14:02:38+00:00 Cow, Jane    Added: foo bar                                                               -6 days +05:16:55.878903000    4056.0      0.0         file/path/file2.py 4056.0   
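
The approach above depends on body lines following each merge commit. For a log like the one in the question, where some commits are followed by no lines at all, one possible sketch is to walk the log line by line and emit a zero-filled row for such commits. The names parse_log, HEADER and is_merge are hypothetical. Treating a subject that starts with "Merge" as a merge commit is an assumption based on git's default merge messages, so a custom merge message would be missed. git's %p placeholder prints the abbreviated parent hashes, and a commit with more than one parent is a merge, which may be a more reliable signal if you can regenerate the log.

```python
import re
import pandas as pd

# Excerpt of the question's log (format '--%h--%ad--%aN--%s'); numstat fields are tab-separated.
SAMPLE_LOG = """\
--8f03109cd--Wed Sep 11 09:51:32 2019 -0700--Brian Vaughn--Moved backend injection to the content script (#16752)
--efa780d0a--Wed Sep 11 09:51:24 2019 -0700--Brian Vaughn--Removed DT inject() script since it's no longer being used
0\t24\tpackages/react-devtools-extensions/src/inject.js

--4290967d4--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat' of https://github.com/onionymous/react into onionymous-tt-compat
--f09854a9e--Wed Sep 11 09:30:57 2019 -0700--Brian Vaughn--Moved inline comment.
3\t5\tpackages/react-devtools-extensions/src/injectGlobalHook.js
"""

HEADER = re.compile(
    r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*)$"
)


def parse_log(text):
    """Return one row per changed file, plus one zero-filled row per commit without file stats."""
    rows = []
    current = None      # header fields of the commit being read
    has_stats = False   # whether the current commit has produced a row yet

    def flush_empty():
        # Emit a placeholder row for a commit that had no numstat lines.
        if current is not None and not has_stats:
            rows.append({**current, "insertion": "0", "deletion": "0", "filepath": None})

    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            flush_empty()
            current = header.groupdict()
            has_stats = False
        elif current is not None and line.count("\t") == 2:
            insertion, deletion, filepath = line.split("\t")
            rows.append({**current, "insertion": insertion,
                         "deletion": deletion, "filepath": filepath})
            has_stats = True
    flush_empty()

    frame = pd.DataFrame(rows)
    # Binary files are reported as "-" by numstat; coerce those to NaN.
    frame["insertion"] = pd.to_numeric(frame["insertion"], errors="coerce")
    frame["deletion"] = pd.to_numeric(frame["deletion"], errors="coerce")
    return frame


df = parse_log(SAMPLE_LOG)
# Assumption: default merge subjects start with "Merge".
df["is_merge"] = df["message"].str.startswith("Merge")
merges_per_author = df[df["is_merge"]].drop_duplicates("sha").groupby("author").size()

print(df[["sha", "author", "insertion", "deletion", "filepath", "is_merge"]])
print(merges_per_author)
```

Commits without file stats keep filepath empty rather than 0 here, which makes them easy to filter with df["filepath"].isna().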

Answered By – gehbiszumeis

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
