Python read from csv with condition for TimeSeriesGenerator

Issue

I have a .csv file with many entries that looks like this:

observation1, observation2, tag
observation1, observation2, tag
...
b r e a k
observation1, observation2, tag
...
b r e a k

whereas the observations are some numbers and the tag the ground truth true/false.

the break part comes with the data and symbolizes the end of a file and the end of an observation chain. Datapoints within two break entries belong together. (All those datapoints are merged from multiple files into one huge csv).

With this data I am supposed to do some machine learning using the tensorflow TimeSeriesGenerator.

I found out however, that TSG uses a fixed time series chain length, which means I have to do some cutting/filtering of my data given.

Condition one, is that if a true appears in the chain, it has to be the last value. Condition two, that all chains consist of the same amount of entries.

This means, if say my chain length would be 3, then the following chains are allowed:

b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true
b r e a k
b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
b r e a k

but not

b r e a k
observation1, observation2, false
observation1, observation2, true
observation1, observation2, false
b r e a k

A chain like this would also be allowed

observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true

as I could simply throw the first line away to get a length of 3.

But not a chain like this:

observation1, observation2, false
b r e a k
observation1, observation2, false
observation1, observation2, true
b r e a k

This means I need some way (my guess would be pandas) to filter the .csv file and find all occurences, where between to b r e a k lines there are at least x amount of false datapoints followed by a true or another false.

What would be a good way of achieving this filtering?

Solution

I’ve got a solution for this, I think, and this Reverse part of a .csv file using Python, but it turned into a mini-project.

Here’s what my main looks like:

import csv

from series import series_iter, fix


BREAK_ROW = ["b r e a k", "", ""]
MIN_SERIES_LEN = 3
FIXED_SIZE = MIN_SERIES_LEN  # add a leading minus sign to trim the bottom rows of each series
PAD_ROW = ["0", "0", "0"]
TRUE_TAG = "true"

DEBUG_SEP_ROW = ["--"] * 3
wrote_first_series = False

with (
    open("input.csv", newline="") as f_in,  # the csv module docs recommends newline="" for all CSV reads & writes
    open("output.csv", "w", newline="") as f_out,  # I changed "a" to "w" for my dev/testing
):
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # header

    for series in series_iter(reader, BREAK_ROW):
        # Condition 2: chain length
        if len(series) < MIN_SERIES_LEN:
            continue

        fixed_len_series = fix(series, PAD_ROW, FIXED_SIZE)

        # Condition 1: last true
        tags = [row[-1] for row in fixed_len_series]
        if TRUE_TAG in tags and tags[-1] != TRUE_TAG:
            continue

        if wrote_first_series:
            writer.writerow(DEBUG_SEP_ROW)  # for debuggin/visualizing

        writer.writerows(fixed_len_series)

        wrote_first_series = True  # for debuggin/visualizing

When I run that on this input.csv:

| Observation1   | Observation2   | Tag   |
|----------------|----------------|-------|
| b r e a k      |                |       |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | true  |
| b r e a k      |                |       |
| b r e a k      |                |       |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| b r e a k      |                |       |
| b r e a k      |                |       |
| 3_observation1 | 3_observation2 | false |
| 3_observation1 | 3_observation2 | true  |
| 3_observation1 | 3_observation2 | false |
| b r e a k      |                |       |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | true  |
| b r e a k      |                |       |
| 5_observation1 | 5_observation2 | false |
| b r e a k      |                |       |
| 6_observation1 | 6_observation2 | false |
| 6_observation1 | 6_observation2 | true  |
| b r e a k      |                |       |

I get this output.csv (the debug/visualization separator is enabled)

| Observation1   | Observation2   | Tag   |
|----------------|----------------|-------|
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | true  |
| --             | --             | --    |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| --             | --             | --    |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | true  |

If I just comment out the two lines under # Condition 2: chain length, like the other question hints at, then fix() will pad smaller series:

| Observation1   | Observation2   | Tag   |
|----------------|----------------|-------|
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | false |
| 1_observation1 | 1_observation2 | true  |
| --             | --             | --    |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| 2_observation1 | 2_observation2 | false |
| --             | --             | --    |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | false |
| 4_observation1 | 4_observation2 | true  |
| --             | --             | --    |
| 0              | 0              | 0     |
| 0              | 0              | 0     |
| 5_observation1 | 5_observation2 | false |
| --             | --             | --    |
| 0              | 0              | 0     |
| 6_observation1 | 6_observation2 | false |
| 6_observation1 | 6_observation2 | true  |

Series 3 is still dropped because of the true-must-be-last rule.

Here’s the GitHub for all of this: https://github.com/zacharysyoung/StackOverflow/tree/SO-73006138

Answered By – Zach Young

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published