Issue
I have a .csv file with many entries that looks like this:
observation1, observation2, tag
observation1, observation2, tag
...
b r e a k
observation1, observation2, tag
...
b r e a k
whereas the observations are some numbers and the tag the ground truth true/false.
the break
part comes with the data and symbolizes the end of a file and the end of an observation chain. Datapoints within two break
entries belong together. (All those datapoints are merged from multiple files into one huge csv).
With this data I am supposed to do some machine learning using the tensorflow TimeSeriesGenerator.
I found out however, that TSG uses a fixed time series chain length, which means I have to do some cutting/filtering of my data given.
Condition one, is that if a true
appears in the chain, it has to be the last value. Condition two, that all chains consist of the same amount of entries.
This means, if say my chain length would be 3, then the following chains are allowed:
b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true
b r e a k
b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
b r e a k
but not
b r e a k
observation1, observation2, false
observation1, observation2, true
observation1, observation2, false
b r e a k
A chain like this would also be allowed
observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true
as I could simply throw the first line away to get a length of 3.
But not a chain like this:
observation1, observation2, false
b r e a k
observation1, observation2, false
observation1, observation2, true
b r e a k
This means I need some way (my guess would be pandas) to filter the .csv file and find all occurences, where between to b r e a k
lines there are at least x amount of false
datapoints followed by a true
or another false
.
What would be a good way of achieving this filtering?
Solution
I’ve got a solution for this, I think, and this Reverse part of a .csv file using Python, but it turned into a mini-project.
Here’s what my main looks like:
import csv
from series import series_iter, fix
BREAK_ROW = ["b r e a k", "", ""]
MIN_SERIES_LEN = 3
FIXED_SIZE = MIN_SERIES_LEN # add a leading minus sign to trim the bottom rows of each series
PAD_ROW = ["0", "0", "0"]
TRUE_TAG = "true"
DEBUG_SEP_ROW = ["--"] * 3
wrote_first_series = False
with (
open("input.csv", newline="") as f_in, # the csv module docs recommends newline="" for all CSV reads & writes
open("output.csv", "w", newline="") as f_out, # I changed "a" to "w" for my dev/testing
):
reader = csv.reader(f_in)
writer = csv.writer(f_out)
writer.writerow(next(reader)) # header
for series in series_iter(reader, BREAK_ROW):
# Condition 2: chain length
if len(series) < MIN_SERIES_LEN:
continue
fixed_len_series = fix(series, PAD_ROW, FIXED_SIZE)
# Condition 1: last true
tags = [row[-1] for row in fixed_len_series]
if TRUE_TAG in tags and tags[-1] != TRUE_TAG:
continue
if wrote_first_series:
writer.writerow(DEBUG_SEP_ROW) # for debuggin/visualizing
writer.writerows(fixed_len_series)
wrote_first_series = True # for debuggin/visualizing
When I run that on this input.csv:
| Observation1 | Observation2 | Tag |
|----------------|----------------|-------|
| b r e a k | | |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | true |
| b r e a k | | |
| b r e a k | | |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| b r e a k | | |
| b r e a k | | |
| 3_observation1 | 3_observation2 | false |
| 3_observation1 | 3_observation2 | true |
| 3_observation1 | 3_observation2 | false |
| b r e a k | | |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | true |
| b r e a k | | |
| 5_observation1 | 5_observation2 | false |
| b r e a k | | |
| 6_observation1 | 6_observation2 | false |
| 6_observation1 | 6_observation2 | true |
| b r e a k | | |
I get this output.csv (the debug/visualization separator is enabled)
| Observation1 | Observation2 | Tag |
|----------------|----------------|-------|
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | true |
| -- | -- | -- |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| -- | -- | -- |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | true |
If I just comment out the two lines under # Condition 2: chain length
, like the other question hints at, then fix() will pad smaller series:
| Observation1 | Observation2 | Tag |
|----------------|----------------|-------|
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | true |
| -- | -- | -- |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| -- | -- | -- |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | true |
| -- | -- | -- |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 5_observation1 | 5_observation2 | false |
| -- | -- | -- |
| 0 | 0 | 0 |
| 6_observation1 | 6_observation2 | false |
| 6_observation1 | 6_observation2 | true |
Series 3 is still dropped because of the true-must-be-last rule.
Here’s the GitHub for all of this: https://github.com/zacharysyoung/StackOverflow/tree/SO-73006138
Answered By – Zach Young
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0