MVDataProcessing package

Submodules

MVDataProcessing.Clean module

MVDataProcessing.Clean.RemoveOutliersHardThreshold(x_in: ~pandas.DataFrame, hard_max: float, hard_min: float, remove_from_process: list = [], df_avoid_periods=pandas.DataFrame()) → DataFrame[source]

Removes outliers from the time series in each column using a hard maximum and a hard minimum threshold.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
hard_max (float) – Max value for the threshold limit
hard_min (float) – Min value for the threshold limit
remove_from_process (list,optional) – Columns to be kept off the process;
df_avoid_periods (pandas.core.frame.DataFrame) – The first column with the start and the second column with the end date.

Returns:

Y: A pandas.core.frame.DataFrame without the outliers

Return type:

Y: pandas.core.frame.DataFrame
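
The thresholding idea can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: the helper name `hard_threshold_to_nan` is hypothetical, the limits are treated as inclusive, and a list of floats stands in for one DataFrame column.

```python
import math

def hard_threshold_to_nan(values, hard_max, hard_min):
    """Return a copy of `values` with samples outside [hard_min, hard_max] set to NaN."""
    # A NaN input fails the comparison and therefore stays NaN.
    return [v if hard_min <= v <= hard_max else math.nan for v in values]

current = [10.0, 12.5, 950.0, 11.0, -3.0]
cleaned = hard_threshold_to_nan(current, hard_max=500.0, hard_min=0.0)
print(cleaned)  # [10.0, 12.5, nan, 11.0, nan]
```

Marking outliers as NaN instead of dropping them keeps the timestamps aligned, so a later imputation step can fill the gaps.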

MVDataProcessing.Clean.RemoveOutliersHistogram(x_in: ~pandas.DataFrame, df_avoid_periods: ~pandas.DataFrame = pandas.DataFrame(), remove_from_process: list = [], sample_freq: int = 5, min_number_of_samples_limit: int = 12) → DataFrame[source]

Removes outliers from the time series in each column using the histogram. The parameter ‘min_number_of_samples_limit’ specifies the minimum number of hours (if the integrate flag is True) or samples that a value must have to be considered not an outlier.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
remove_from_process (list,optional) – Columns to be kept off the process;
df_avoid_periods (pandas.core.frame.DataFrame) – The first column with the start and the second column with the end date.
integrate_hour (bool,optional) – Makes the analysis on the data integrated to an hour
sample_freq (int,optional) – The sample frequency of the time series. Defaults to 5.
min_number_of_samples_limit (int,optional) – The number of samples to be considered valid

Returns:

Y: A pandas.core.frame.DataFrame without the outliers

Return type:

Y: pandas.core.frame.DataFrame

MVDataProcessing.Clean.RemoveOutliersMMADMM(x_in: ~pandas.DataFrame, df_avoid_periods: ~pandas.DataFrame = pandas.DataFrame(), len_mov_avg: int = 48, std_def: float = 2, min_var_def: float = 0.5, allow_negatives: bool = False, plot: bool = False, remove_from_process: list = []) → DataFrame[source]

Removes outliers from the time series in each column using the (M)oving (M)edian (A)bsolute (D)eviation around the (M)oving (M)edian.

A statistical method is used for removing the remaining outliers. In LEYS et al. (2019), the authors state that it is common practice to use plus and minus the standard deviation (±σ) around the mean (µ); however, this measurement is particularly sensitive to outliers. Furthermore, the authors propose the use of the absolute deviation around the median. Therefore, in this work the limit was set by the median absolute deviation (MADj) around the moving median (Mj), where j denotes the number of samples of the moving window. Typically, an MV feeder has a seasonality where the load is higher in the summer than in the winter, or vice versa. Hence, it is vital to use the moving median instead of the median of the whole time series.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
df_avoid_periods (pandas.core.frame.DataFrame) – The first column with the start and the second column with the end date.
len_mov_avg (int,optional) – Size of the windows of the moving average.
std_def (float,optional) – Absolute standard deviation to be computed around the moving average.
min_var_def (float,optional) – For low variance data this parameter will set a minimum distance from the upper and lower boundaries.
allow_negatives (bool,optional) – Allow for the lower level to be below zero.
plot (bool,optional) – A plot of the boundaries and result to debug parameters.
remove_from_process (list,optional) – Columns to be kept off the process.

Raises:

Exception – if x_in has no DatetimeIndex.

Returns:

Y: A pandas.core.frame.DataFrame without the outliers

Return type:

Y: pandas.core.frame.DataFrame
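
A dependency-free sketch of the moving median and MAD limits may help when tuning the parameters. It is written under assumptions and is not the package's code: the function names are hypothetical, the window is centered, `std_def` multiplies the MAD, and `min_var_def` is read as a floor on the half-width of the band.

```python
import math
from statistics import median

def mmadmm_limits(values, window, std_def=2.0, min_var_def=0.5):
    """Return one (lower, upper) pair per sample from a centered moving window."""
    half = window // 2
    limits = []
    for i in range(len(values)):
        chunk = [v for v in values[max(0, i - half):i + half + 1] if not math.isnan(v)]
        if not chunk:
            limits.append((math.nan, math.nan))
            continue
        mov_median = median(chunk)
        mad = median([abs(v - mov_median) for v in chunk])
        # Assumption: min_var_def is the smallest allowed half-width of the band.
        spread = max(std_def * mad, min_var_def)
        limits.append((mov_median - spread, mov_median + spread))
    return limits

def remove_outliers_mmadmm(values, window=5, std_def=2.0, min_var_def=0.5):
    """Set samples that fall outside their local limits to NaN."""
    limits = mmadmm_limits(values, window, std_def, min_var_def)
    return [v if low <= v <= high else math.nan
            for v, (low, high) in zip(values, limits)]

load = [10.0, 11.0, 10.0, 12.0, 100.0, 11.0, 10.0, 12.0, 11.0]
cleaned = remove_outliers_mmadmm(load)
```

In this toy series only the spike of 100.0 falls outside its local band. Because the band follows the moving median, a slow seasonal drift in the load would not be flagged.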

MVDataProcessing.Clean.RemoveOutliersQuantile(x_in: ~pandas.DataFrame, remove_from_process: list = [], df_avoid_periods=pandas.DataFrame()) → DataFrame[source]

Removes outliers from the timeseries on each column using the top and bottom quantile metric as an outlier marker.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
remove_from_process (list,optional) – Columns to be kept off the process;
df_avoid_periods (pandas.core.frame.DataFrame) – The first column with the start and the second column with the end date.

Returns:

Y: A pandas.core.frame.DataFrame without the outliers

Return type:

Y: pandas.core.frame.DataFrame

MVDataProcessing.Example module

MVDataProcessing.Example.ShowExampleNSSCProcess(plot: bool = True)[source]

Demonstrates the normalized scaled standard weekday curve (NSSC) imputation method.

This function goes through various steps of data processing including synchronization, outlier removal, data ng, and NSSC application. Each step is demonstrated with optional plotting for visual analysis. The data loss and outliers are exaggerated for demonstration purposes. The process is applied between a predefined start and end date, with multiple methods applied to handle missing data, outliers, and to predict and replace data in the final output.

Parameters

plot (bool, optional) – If True, the function will plot the data at various stages of processing for visualization. Defaults to True.

Returns

None: The function does not return any value but optionally displays plots and prints information about the processing steps if plot is True.

Notes

The function is meant for demonstration and educational purposes, showing various stages in data processing.
The process is specifically tailored for current data and may need adjustments for other types of data.
The example dates and parameters are hardcoded for demonstration and should be adapted for practical use.

MVDataProcessing.Example.ShowExampleSimpleProcess(plot: bool = True)[source]

Demonstrates a simple data processing workflow using various functions to handle, analyze, and visualize data.

The function executes a sequence of operations on dummy data, including data synchronization, outlier removal, and data processing. It uses matplotlib to plot the results at each step. Additionally, it tracks the execution time for each step using the TimeProfile function and the number of missing data samples, outputting this information to the console along with some explanation.

Steps involved:

- Close all existing matplotlib plots.
- Generate dummy data and plot it.
- Synchronize data with specified start and end dates.
- Remove outliers using various methods (Hard Threshold, MMADMM, Quantile, Histogram).
- Execute a simple data processing operation.
- Plot the final output.
- Display a time profile of the entire process.

Parameters:

plot (bool,optional) – Plot data for each step of the process. Disables the time profile.

Returns:

None: This function does not return any value.

MVDataProcessing.Fill module

MVDataProcessing.Fill.GetNSSCPredictedSamples(max_vet: ~pandas.DataFrame, min_vet: ~pandas.DataFrame, weekday_curve: ~pandas.DataFrame, start_date_dt: datetime, end_date_dt: datetime, sample_freq: int = 5, sample_time_base: str = 'm') → DataFrame[source]

Generate predicted samples for NSSC using maximum and minimum vectors, and a curve based on weekdays.

Parameters:

max_vet (pandas.core.frame.DataFrame) – The maximum vector DataFrame.
min_vet (pandas.core.frame.DataFrame) – The minimum vector DataFrame.
weekday_curve (pandas.core.frame.DataFrame) – DataFrame representing the curve based on weekdays.
sample_freq (int) – The frequency of sampling. Defaults to 5.
sample_time_base (str) – The base unit of time for sampling, can be ‘s’, ‘m’, or ‘h’. Defaults to ‘m’.

Raises:

Exception – If the sample_time_base is not ‘m’.

Returns:

A DataFrame with predicted values.

Return type:

pandas.core.frame.DataFrame

MVDataProcessing.Fill.NSSCInput(x_in: ~pandas.DataFrame, start_date_dt: datetime, end_date_dt: datetime, sample_freq: int = 5, sample_time_base: str = 'm', threshold_accept_min_max: float = 1.0, threshold_accept_curve: float = 1.0, num_samples_day: int = 288, num_samples_patamar: int = 72, day_threshold: float = 0.5, patamar_threshold: float = 0.5, min_sample_per_day: int = 3, min_sample_per_workday: int = 9) → DataFrame[source]

Implement the NSSC method.

Parameters:

x_in (pandas.core.frame.DataFrame) – Input data frame.
start_date_dt (datetime) – Start date for the processing.
end_date_dt (datetime) – End date for the processing.
sample_freq (int) – Sampling frequency, default is 5.
threshold_accept_min_max (float) – Threshold for accepting minimum and maximum values, default is 1.0.
threshold_accept_curve (float) – Threshold for accepting curve values, default is 1.0.
min_sample_per_day (int) – Minimum number of samples per day, default is 3.
num_samples_day (int) – Number of samples per day, default is 288 (12*24).
day_threshold (float) – Day threshold value, default is 0.5.
patamar_threshold (float) – Patamar threshold value, default is 0.5.
num_samples_patamar (int) – Number of samples for patamar, default is 72 (12*6).
sample_time_base (str) – Base unit for sample time, default is ‘m’ for minutes.
min_sample_per_workday (int) – Minimum number of samples per workday, default is 9.

Returns:

Processed data frame.

Return type:

pandas.core.frame.DataFrame
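
The defaults quoted above, 288 (12*24) and 72 (12*6), follow from the 5-minute sampling interval: 12 samples per hour over 24 hours for a day and over 6 hours for a patamar. The helper below is a hypothetical sketch for recomputing both values when a different `sample_freq` is used; it is not part of the package.

```python
def samples_per_period(period_hours, sample_freq=5, sample_time_base="m"):
    """Number of samples in `period_hours` for a given sampling interval."""
    seconds_per_unit = {"s": 1, "m": 60, "h": 3600}[sample_time_base]
    return (period_hours * 3600) // (sample_freq * seconds_per_unit)

num_samples_day = samples_per_period(24)     # 288 for 5-minute data
num_samples_patamar = samples_per_period(6)  # 72 for 5-minute data
print(num_samples_day, num_samples_patamar)  # 288 72
```

For 15-minute data the same call gives 96 samples per day and 24 per patamar, which are the values one would presumably pass as `num_samples_day` and `num_samples_patamar`.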

MVDataProcessing.Fill.PhaseProportionInput(x_in: DataFrame, threshold_accept: float = 0.75, plot: bool = False, apply_filter: bool = True, time_frame_apply: list = ['h', 'pd', 'D', 'M', 'S', 'Y', 'A'], remove_from_process: list = []) → DataFrame[source]

Processes input DataFrame to compute phase proportion based on various time frames and criteria.

Imputes missing data samples based on the ratio between columns (time series).

Theoretical background:

Correlation between phases (φa, φb, φv) of the same quantity (V, I or pf) is used to infer a missing sample value based on adjacent samples. Adjacent samples are those of the same timestamp i but from different phases than the one which is missing. The main idea is to use a period where all three phases (φa, φb, φv) exist and calculate the proportion between them. Having the relationship between phases, if one or two are missing at a given timestamp i, it is possible to use the remaining phase and the previously calculated ratio to fill the missing ones. The number of samples used to calculate the ratio around the missing sample is an important parameter. For instance, if a sample is missing in the afternoon, it is best to use samples from that same day and afternoon to calculate the ratio and fill the missing sample. Unfortunately, there might not be enough samples in that period to calculate the ratio. Therefore, in this step, different periods T of analysis around the missing sample are considered: hour, period of the day (dawn, morning, afternoon and night), day, month, season (humid/dry), and year.

The correlation between the feeder energy demand and the period of the day or the season is very high. The increase in consumption in the morning and afternoon in industrial areas is expected, as those are the periods where most factories are fully functioning. In residential areas, the consumption is expected to be higher in the evening and lower during the day’s early hours. Furthermore, in the summer, a portion of the network (a vacation destination) can be in higher demand, while in another period of the year (winter) the same area could have a lower energy demand. Therefore, if there is not enough information on that particular day to compute the ratio between phases, a good alternative is to use data from the month. Finally, given the amount of missing data for a particular feeder, the only option could be the use of the whole year to calculate the ratio between phases. Regarding the minimum amount of data that a period should have to be valid, a default of 50% is assumed for all phases.

Parameters:

x_in (pandas.core.frame.DataFrame) – Input DataFrame with a DatetimeIndex.
threshold_accept (float, optional) – Threshold for accepting data based on null value proportion, defaults to 0.75.
plot (bool, optional) – Flag to indicate if plots should be generated, defaults to False.
apply_filter (bool, optional) – Flag to indicate if outlier filter should be applied, defaults to True.
time_frame_apply (list, optional) – List of time frames to apply phase proportion analysis, defaults to ['h', 'pd', 'D', 'M', 'S', 'Y', 'A'].
remove_from_process (list, optional) – List of columns to exclude from processing, defaults to an empty list.

Returns:

DataFrame with phase proportions computed and applied.

Return type:

pandas.core.frame.DataFrame

Raises:

Exception – If input DataFrame does not have a DatetimeIndex.
Exception – If input DataFrame has less than two columns.
Exception – If no time frames are provided in time_frame_apply.

The function applies various transformations and calculations based on specified time frames, handling missing data, computing correlations, and applying filters if required. It optionally generates plots for the analysis. The final DataFrame includes computed phase proportions and, if specified, the columns that were excluded from processing.
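
The core idea of the theory section, filling a missing phase from another phase through their ratio, can be sketched for two phases and a single analysis period. This is an assumption-laden illustration rather than the package's algorithm: the function name is hypothetical, the ratio is taken as the quotient of the sums over timestamps where both phases exist, and the search over several time frames is omitted.

```python
import math

def fill_by_phase_ratio(target, reference):
    """Fill NaN samples of `target` using `reference` scaled by their observed ratio."""
    pairs = [(t, r) for t, r in zip(target, reference)
             if not math.isnan(t) and not math.isnan(r)]
    reference_sum = sum(r for _, r in pairs)
    if not pairs or reference_sum == 0:
        return list(target)  # no usable period, leave the gaps untouched
    ratio = sum(t for t, _ in pairs) / reference_sum
    return [r * ratio if math.isnan(t) and not math.isnan(r) else t
            for t, r in zip(target, reference)]

phase_a = [100.0, math.nan, 110.0, 120.0]
phase_b = [50.0, 52.0, 55.0, 60.0]
filled_a = fill_by_phase_ratio(phase_a, phase_b)
print(filled_a)  # [100.0, 104.0, 110.0, 120.0]
```

Here phase A is consistently twice phase B, so the missing sample is estimated as 2 * 52 = 104. When both phases are missing at a timestamp the gap remains, which is why the documented process can be followed by an interpolation step.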

MVDataProcessing.Fill.ReplaceData(x_in: ~pandas.DataFrame, x_replace: ~pandas.DataFrame, start_date_dt: datetime, end_date_dt: datetime, num_samples_day: int = 288, day_threshold: float = 0.5, patamar_threshold: float = 0.5, num_samples_patamar: int = 72, sample_freq: int = 5, sample_time_base: str = 'm') → DataFrame[source]

Replaces data in a DataFrame based on specified conditions and thresholds.

Parameters:

x_in (pandas.core.frame.DataFrame) – The input DataFrame containing the data to be analyzed and replaced.
x_replace (pandas.core.frame.DataFrame) – The DataFrame containing replacement data.
start_date_dt (datetime) – The start date for the data replacement process.
end_date_dt (datetime) – The end date for the data replacement process.
num_samples_day (int) – The number of samples per day, default is 288 (12 * 24).
day_threshold (float) – The threshold for day-based null value analysis, default is 0.5.
patamar_threshold (float) – The threshold for patamar-based null value analysis, default is 0.5.
num_samples_patamar (int) – The number of samples per patamar, default is 72 (12 * 6).
sample_freq (int) – The frequency of samples, default is 5.
sample_time_base (str) – The time base unit for sampling, default is ‘m’ (minutes).

Returns:

A DataFrame with data replaced based on the specified conditions.

Return type:

pandas.core.frame.DataFrame

Note: x_in and x_replace must have the same structure and index type.

MVDataProcessing.Fill.SimpleProcess(x_in: ~pandas.DataFrame, start_date_dt: datetime, end_date_dt: datetime, remove_from_process: list = [], sample_freq: int = 5, sample_time_base: str = 'm', pre_interpol: int = False, pos_interpol: int = False, prop_phases: bool = False, integrate: bool = False, interpol_integrate: int = False) → DataFrame[source]

Simple pre-made imputation process.

ORGANIZE->INTERPOLATE->PHASE_PROPORTION->INTERPOLATE->INTEGRATE->INTERPOLATE

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
start_date_dt (datetime) – The start date where the synchronization should start.
end_date_dt (datetime) – The end date where the synchronization will consider samples.
remove_from_process (list,optional) – Columns to be kept off the process. Applies only to the PhaseProportionInput step.
sample_freq (int,optional) – The sample frequency of the time series. Defaults to 5.
sample_time_base (str,optional) – The base time of the sample frequency. Specify if the sample frequency is in (D)ay, (M)onth, (Y)ear, (h)ours, (m)inutes, or (s)econds. Defaults to (m)inutes.
pre_interpol (int,optional) – Number of samples to limit the first interpolation after organizing the data. Defaults to False.
pos_interpol (int,optional) – Number of samples to limit the second interpolation after applying PhaseProportionInput to the data. Defaults to False.
prop_phases (bool,optional) – Apply the PhaseProportionInput method
integrate (bool,optional) – Integrates to 1 hour time stamps. Defaults to False.
interpol_integrate (int,optional) – Number of samples to limit the third interpolation after applying IntegrateHour to the data. Defaults to False.

Returns:

Y: The x_in pandas.core.frame.DataFrame with no missing data. Treated with a simple step process.

Return type:

Y: pandas.core.frame.DataFrame

MVDataProcessing.Util module

MVDataProcessing.Util.CalcUnbalance(x_in: DataFrame, remove_from_process: list = []) → DataFrame[source]

Calculates the unbalance between phases for every timestamp.

Equation: Y = (MAX-MEAN)/MEAN

Ref.: Derating of induction motors operating with a combination of unbalanced voltages and over or under-voltages

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
remove_from_process (list,optional) – Columns to be kept off the process.

Returns:

Y: A pandas.core.frame.DataFrame with the % of unbalance between columns (phases).

Return type:

Y: pandas.core.frame.DataFrame
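
The equation Y = (MAX-MEAN)/MEAN can be checked by hand for one timestamp. The sketch below uses a hypothetical helper name and returns a fraction; whether the package scales the result to a percentage is not stated here, so that is left as an assumption.

```python
def unbalance(phases):
    """Unbalance of one timestamp: (max - mean) / mean over the phase values."""
    mean = sum(phases) / len(phases)
    return (max(phases) - mean) / mean

# Three phase currents at a single timestamp.
y = unbalance([100.0, 105.0, 95.0])
print(round(y, 4))  # 0.05
```

A perfectly balanced timestamp gives 0. A mean of zero raises ZeroDivisionError in this sketch, a case a real implementation would need to handle.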

MVDataProcessing.Util.Correlation(x_in: DataFrame) → float[source]

Calculates the correlation between each column of the DataFrame and outputs the average of all.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.

Returns:

corr_value: Value of the correlation

Return type:

corr_value: float
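
One plausible reading of "the average of all" is the mean of the Pearson correlation over every pair of columns. The sketch below implements that reading in plain Python; the function names are hypothetical, and the package may average differently (for example over the full correlation matrix).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_correlation(columns):
    """Mean Pearson correlation over all distinct pairs of columns."""
    pairs = list(combinations(columns.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)

columns = {
    "IA": [1.0, 2.0, 3.0, 4.0],
    "IB": [2.0, 4.0, 6.0, 8.0],  # moves with IA
    "IV": [4.0, 3.0, 2.0, 1.0],  # moves against IA and IB
}
corr_value = average_correlation(columns)
```

With one perfectly correlated pair and two perfectly anti-correlated pairs, the average is (1 - 1 - 1) / 3.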

MVDataProcessing.Util.CountMissingData(x_in: DataFrame, remove_from_process: list = [], show=False) → float[source]

Calculates the number of vacancies (missing samples) in the dataset.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
remove_from_process (list,optional) – Columns to be kept off the process.
show (bool,optional) – Specify if the function should print or not the value that is also returned.

Returns:

Y: Returns the amount of vacancies.

Return type:

Y: float

MVDataProcessing.Util.CurrentDummyData(qty_weeks: int = 48, start_date_dt: datetime = datetime.datetime(2023, 1, 1, 0, 0))[source]

Generates a DataFrame containing dummy time series data.

This function creates a pandas DataFrame representing time series data over a specified number of weeks, starting from a given date. The data includes artificial variations to simulate different patterns, including seasonal variations and random noise. The DataFrame includes columns ‘IA’, ‘IB’, ‘IV’, and ‘IN’, each containing modified time series data. The index of the DataFrame is set to timestamps at 5-minute intervals, starting from the specified start date.

Parameters

qty_weeks (int, optional) – The number of weeks to generate data for, by default 48 weeks (12*4).
start_date_dt (datetime, optional) – The start date for the time series data, by default datetime(2023,1,1).

Returns

pandas.DataFrame: A DataFrame containing the generated time series data with columns ‘IA’, ‘IB’, ‘IV’, and ‘IN’, and a timestamp index.

Examples

>>> dummy_data = CurrentDummyData(24, datetime(2023,1,1))
>>> dummy_data.head()

MVDataProcessing.Util.DataSynchronization(x_in: DataFrame, start_date_dt: datetime, end_date_dt: datetime, sample_freq: int = 5, sample_time_base: str = 'm') → DataFrame[source]

Makes the Data Synchronization between the columns (time series) of the data provided.

Theoretical background:

The time series synchronization is the first step in processing the dataset. The synchronization is vital since the alignment between phases (φa, φb, φv) of the same quantity, between quantities (V, I, pf) of the same feeder, and between feeders, provides many advantages. The first is the ability to combine all nine time series (the three-phase voltage, current, and power factor of each feeder) to calculate the secondary quantities (Pactive/Preactive, Eactive/Ereactive).

Furthermore, the synchronization between feeders provides the capability to analyze the interaction between them, for instance, in load transfers for scheduled maintenance, and to estimate substation transformer quantities by the sum of all feeders.

Most of the functions in this module assume that the time series are “Clean” to a certain sample_freq. Therefore, this function must be executed first on the dataset.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
start_date_dt (datetime) – The start date where the synchronization should start.
end_date_dt (datetime) – The end date where the synchronization will consider samples.
sample_freq (int,optional) – The sample frequency of the time series. Defaults to 5.
sample_time_base (str,optional) – The base time of the sample frequency. Specify if the sample frequency is in (D)ay, (M)onth, (Y)ear, (h)ours, (m)inutes, or (s)econds. Defaults to (m)inutes.

Raises:

Exception – if x_in has no DatetimeIndex.
Exception – if start_date_dt not in datetime format.
Exception – if end_date_dt not in datetime format.
Exception – if sample_time_base is not in (D)ay, (M)onth, (Y)ear, (h)ours, (m)inutes, or (s)econds.

Returns:

Y: The synchronized pandas.core.frame.DataFrame

Return type:

Y: pandas.core.frame.DataFrame
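
What synchronization to a regular grid means can be shown with the standard library alone. This is a simplified sketch, not the package's implementation: the function name is hypothetical, the grid covers [start, end) with the end excluded, and samples that do not fall exactly on a grid point are dropped instead of being rounded.

```python
import math
from datetime import datetime, timedelta

def synchronize(samples, start, end, sample_freq=5):
    """Place samples on a regular grid of `sample_freq` minutes in [start, end).

    `samples` maps timestamps to values. Grid points without a sample become NaN.
    """
    step = timedelta(minutes=sample_freq)
    grid = {}
    t = start
    while t < end:
        grid[t] = samples.get(t, math.nan)
        t += step
    return grid

start = datetime(2023, 1, 1, 0, 0)
raw = {
    start: 1.0,
    start + timedelta(minutes=10): 3.0,
    start + timedelta(minutes=12): 9.9,  # off-grid sample, dropped in this sketch
}
synced = synchronize(raw, start, start + timedelta(minutes=20))
```

The result has four grid points (00:00, 00:05, 00:10, 00:15), two of which hold NaN. Every column built this way shares the same index, which is what allows phases and feeders to be combined sample by sample.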

MVDataProcessing.Util.DayPeriodMapper(hour: int) → int[source]

Maps a given hour to one of four periods of a day.

For 0 to 5 (hour) -> 0 night
For 6 to 11 (hour) -> 1 morning
For 12 to 17 (hour) -> 2 afternoon
For 18 to 23 (hour) -> 3 evening

Parameters:

hour (int) – an hour of the day between 0 and 23.

Returns:

mapped: Period of the day

Return type:

mapped: int
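
Since the four periods are each six hours long, the mapping reduces to an integer division. The sketch below uses a hypothetical function name and adds a range check that the documented function may or may not perform.

```python
def day_period(hour):
    """Map an hour (0-23) to 0 night, 1 morning, 2 afternoon, 3 evening."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return hour // 6

print([day_period(h) for h in (0, 5, 6, 11, 12, 17, 18, 23)])
# [0, 0, 1, 1, 2, 2, 3, 3]
```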

MVDataProcessing.Util.DayPeriodMapperVet(hour: Series) → Series[source]

Maps a given hour to one of four periods of a day.

For 0 to 5 (hour) -> 0 night
For 6 to 11 (hour) -> 1 morning
For 12 to 17 (hour) -> 2 afternoon
For 18 to 23 (hour) -> 3 evening

Parameters:

hour (pandas.core.series.Series) – A pandas.core.series.Series with values between 0 and 23 to map each hour in the series to a period of the day. This is a “vector” format of the DayPeriodMapper function.

Returns:

period_day: The hour pandas.core.series.Series mapped to periods of the day

Return type:

period_day: pandas.core.series.Series

MVDataProcessing.Util.DefaultWeekDayCurve()[source]

MVDataProcessing.Util.EnergyDummyData(qty_weeks: int = 48, start_date_dt: datetime = datetime.datetime(2023, 1, 1, 0, 0))[source]

Generate a dummy pandas DataFrame containing cumulative energy data.

This function creates a DataFrame with two columns: ‘Eactive’ and ‘Ereactive’. ‘Eactive’ is the cumulative sum of the ‘P’ column from the PowerDummyData function, and ‘Ereactive’ is the absolute cumulative sum of the ‘Q’ column from the same function.

Parameters

qty_weeks (int, optional) – The quantity of weeks for which to generate the data, default is 48 weeks (12*4).
start_date_dt (datetime, optional) – The starting date for the data generation, default is January 1, 2023.

Returns

pandas.DataFrame: A DataFrame with two columns ‘Eactive’ and ‘Ereactive’ representing the cumulative active and reactive energy data respectively.

Examples

>>> EnergyDummyData(4, datetime(2023, 1, 1))
DataFrame with the cumulative energy data for 4 weeks starting from January 1, 2023.

Notes

The function relies on PowerDummyData function to generate initial power data which is then cumulatively summed to generate energy data.

MVDataProcessing.Util.GetDayMaxMin(x_in: DataFrame, start_date_dt: datetime, end_date_dt: datetime, sample_freq: int = 5, threshold_accept: float = 1.0, exe_param: str = 'max')[source]

Returns a tuple of pandas.core.frame.DataFrame containing the maximum or minimum value of each day and the timestamp of each occurrence. For each weekday that is not a valid day, the maximum or minimum is filled by interpolation, then forward fill (ffill), then backward fill (bfill). The interpolation is performed separately for each weekday.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
start_date_dt
end_date_dt
sample_freq (int,optional) – The sample frequency of the time series. Defaults to 5.
threshold_accept (float,optional) – The amount of samples that is required to consider a valid day. Defaults to 1 (100%).
exe_param (str,optional) – ‘max’ returns the maximum and ‘min’ returns the minimum value of each valid day. (Default value = ‘max’)

Returns:

Y: The first element is a pandas.core.frame.DataFrame with the maximum (or minimum) value for each day and the second element is a pandas.core.frame.DataFrame with the timestamps.

Return type:

Y: tuple

MVDataProcessing.Util.GetWeekDayCurve(x_in: DataFrame, sample_freq: int = 5, threshold_accept: float = 1.0, min_sample_per_day: int = 3, min_sample_per_workday: int = 9)[source]

Analyzes and normalizes time series data in a DataFrame to compute average curves for each weekday, considering various sampling and validity thresholds.

Parameters:

x_in (pandas.core.frame.DataFrame) – Input DataFrame with a DatetimeIndex.
sample_freq (int) – Sampling frequency in minutes, default is 5.
threshold_accept (float) – Threshold for accepting valid data, default is 1.0.
min_sample_per_day (int) – Minimum samples required per day to consider the data valid, default is 3.
min_sample_per_workday (int) – Minimum samples required per workday (Monday to Friday) to consider the data valid, default is 9.

Raises:

Exception – If the DataFrame does not have a DatetimeIndex.

Returns:

A DataFrame containing the normalized data for each weekday.

Return type:

pandas.core.frame.DataFrame

MVDataProcessing.Util.IntegrateHour(x_in: DataFrame, sample_freq: int = 5, sample_time_base: str = 'm') → DataFrame[source]

Integrates the input pandas.core.frame.DataFrame to hourly samples.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetimes.DatetimeIndex” and each column contains an electrical quantity time series.
sample_freq (int,optional) – The sample frequency of the time series. Defaults to 5.
sample_time_base (str,optional) – The base time of the sample frequency. Specify if the sample frequency is in (m)inutes or (s)econds. Defaults to (m)inutes.

Raises:

Exception – if x_in has no DatetimeIndex.

Returns:

df_y: The pandas.core.frame.DataFrame integrated by hour.

Return type:

df_y: pandas.core.frame.DataFrame
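
As a rough illustration of hourly integration, the sketch below groups samples by the hour they belong to and averages each group. Treating "integrate" as a mean is an assumption, the function name is hypothetical, and handling of missing samples is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

def integrate_hour(samples):
    """Average all samples that share the same hour. `samples` maps timestamp -> value."""
    buckets = defaultdict(list)
    for ts, value in samples.items():
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: fmean(values) for hour, values in sorted(buckets.items())}

start = datetime(2023, 1, 1, 0, 0)
# Two hours of 5-minute data: 10.0 during the first hour, 20.0 during the second.
samples = {start + timedelta(minutes=5 * i): (10.0 if i < 12 else 20.0)
           for i in range(24)}
hourly = integrate_hour(samples)
```

With 5-minute data each hourly value is built from 12 samples, so `hourly` holds 10.0 at 00:00 and 20.0 at 01:00.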

MVDataProcessing.Util.MarkNanPeriod(x_in: DataFrame, df_remove: DataFrame, remove_from_process: list = []) → DataFrame[source]

Marks all specified timestamps as NaN.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
df_remove (pandas.core.frame.DataFrame) – List of periods to mark as nan. The first column with the start and the second column with the end date, all in datetime.
remove_from_process (list,optional) – Columns to be kept off the process.

Returns:

Y: The input pandas.core.frame.DataFrame with the samples inside the specified periods marked as NaN.

Return type:

Y: pandas.core.frame.DataFrame
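
Marking periods as NaN can be sketched with a list of (start, end) pairs standing in for the two columns of `df_remove`. The function name is hypothetical and both ends of each period are treated as inclusive, which is an assumption.

```python
import math
from datetime import datetime, timedelta

def mark_nan_periods(samples, periods):
    """Return a copy of `samples` with every timestamp inside a period set to NaN."""
    def inside(ts):
        return any(begin <= ts <= end for begin, end in periods)
    return {ts: (math.nan if inside(ts) else value) for ts, value in samples.items()}

start = datetime(2023, 1, 1, 0, 0)
samples = {start + timedelta(minutes=5 * i): float(i) for i in range(6)}
periods = [(start + timedelta(minutes=10), start + timedelta(minutes=15))]
marked = mark_nan_periods(samples, periods)
```

Only the samples at 00:10 and 00:15 become NaN; all other timestamps keep their values.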

MVDataProcessing.Util.PowerDummyData(qty_weeks: int = 48, start_date_dt: datetime = datetime.datetime(2023, 1, 1, 0, 0))[source]

Generates dummy power data for a specified number of weeks from a start date.

This function calculates the apparent power (S), active power (P), and reactive power (Q) for a given number of weeks starting from a specified date. It uses the CurrentDummyData, VoltageDummyData, and PowerFactorDummyData functions to generate current (I), voltage (V), and power factor (pf) data, respectively. The final DataFrame includes columns for S, P, and Q.

Parameters

qty_weeks (int) – The quantity of weeks for which to generate data. Default is 48 weeks.
start_date_dt (datetime) – The start date for data generation. Default is January 1, 2023.

Returns

pandas.DataFrame: A DataFrame containing the columns ‘S’ (apparent power), ‘P’ (active power), and ‘Q’ (reactive power).

Examples

>>> PowerDummyData(4, datetime(2023, 1, 1))
DataFrame with the calculated power data for 4 weeks starting from January 1, 2023.

MVDataProcessing.Util.PowerFactorDummyData(qty_weeks: int = 48, start_date_dt: datetime = datetime.datetime(2023, 1, 1, 0, 0))[source]

Generates dummy power factor data for a specified number of weeks starting from a given date.

This function creates a pandas DataFrame containing simulated power factor data across three columns: ‘FPA’, ‘FPB’, and ‘FPV’. Each row represents a 5-minute interval within the specified time frame. The data includes base values with added random load transfer and noise effects to simulate real-world fluctuations in power factor measurements.

Parameters

qty_weeks (int, optional) – The quantity of weeks to generate data for, defaults to 48 weeks (approximately one year).
start_date_dt (datetime, optional) – The start date for the data generation, defaults to January 1, 2023.

Returns

pandas.DataFrame: A DataFrame with a datetime index representing 5-minute intervals and columns ‘FPA’, ‘FPB’, and ‘FPV’ for power factor values. The data includes random variations to simulate realistic power factor changes over time.

Notes

The function internally generates a dummy week of data and replicates it for the number of weeks specified.
Random load transfers and noise are added to the base values to create variability in the data.
The DataFrame’s index is set to the timestamp of each record, making it suitable for time series analysis.

Examples

>>> import pandas
>>> from datetime import datetime
>>> dummy_data = PowerFactorDummyData(qty_weeks=12, start_date_dt=datetime(2023, 1, 1))
>>> dummy_data.head()
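The replication idea described in the notes can be sketched in plain Python, without pandas. The sketch below is an assumption-laden illustration, not the package's implementation: the base value 0.92, the noise amplitude, and the name `dummy_power_factor` are all hypothetical, and it produces a single series rather than the three columns.

```python
import random
from datetime import datetime, timedelta

SAMPLES_PER_WEEK = 7 * 24 * 12  # one sample every 5 minutes


def dummy_power_factor(qty_weeks=2, start=datetime(2023, 1, 1), base=0.92, seed=0):
    """Build (timestamp, value) pairs by repeating one noisy dummy week."""
    rng = random.Random(seed)
    # One week of base values with small uniform noise, clipped to [0, 1].
    week = [min(1.0, max(0.0, base + rng.uniform(-0.02, 0.02)))
            for _ in range(SAMPLES_PER_WEEK)]
    step = timedelta(minutes=5)
    # Repeat the same week for the requested number of weeks.
    return [(start + i * step, week[i % SAMPLES_PER_WEEK])
            for i in range(qty_weeks * SAMPLES_PER_WEEK)]


series = dummy_power_factor(qty_weeks=2)
```

Each week holds 2016 samples, so the second week starts at position 2016 and repeats the values of the first.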

MVDataProcessing.Util.ReturnOnlyValidDays(x_in: DataFrame, sample_freq: int = 5, threshold_accept: float = 1.0, sample_time_base: str = 'm', remove_from_process=[]) → tuple[source]

Returns all valid days. A valid day is one in which none of the column time series has missing values.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
sample_freq (int,optional) – The sample frequency of the time series. Defaults to 5.
threshold_accept (float,optional) – The fraction of samples required to consider a day valid. Defaults to 1 (100%).
sample_time_base (str,optional) – The time base of the sample frequency. Specifies whether the sample frequency is in (h)ours, (m)inutes, or (s)econds. Defaults to (m)inutes.
remove_from_process (list,optional) – Columns to be kept out of the process.

Raises:

Exception – if x_in has no DatetimeIndex.
Exception – if sample_time_base is not in seconds, minutes or hours.

Returns:

Y: A tuple with the pandas.core.frame.DataFrame, with samples filled based on the proportion between time series, and the number of valid days

Return type:

Y: tuple
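The valid-day criterion can be illustrated for a single series in plain Python. This is a minimal sketch under stated assumptions: samples are (timestamp, value) pairs, `None` marks a missing value, the frequency is given in minutes, and the name `valid_days` is hypothetical. The package itself works on every column of a DataFrame.

```python
from collections import Counter
from datetime import datetime, timedelta


def valid_days(samples, sample_freq_minutes=5, threshold_accept=1.0):
    """Return the dates whose share of non-missing samples meets the threshold."""
    expected_per_day = 24 * 60 // sample_freq_minutes
    # Count the non-missing samples that fall on each calendar day.
    present = Counter(ts.date() for ts, value in samples if value is not None)
    return sorted(day for day, count in present.items()
                  if count / expected_per_day >= threshold_accept)


start = datetime(2023, 1, 1)
samples = [(start + timedelta(minutes=5 * i), 1.0) for i in range(2 * 288)]
samples[300] = (samples[300][0], None)  # one gap on the second day
days = valid_days(samples)
```

With the default threshold of 1.0 only the first day qualifies; lowering `threshold_accept` to 0.99 accepts the second day as well, since 287 of its 288 samples are present.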

MVDataProcessing.Util.SavePeriod(x_in: DataFrame, df_save: DataFrame) → tuple[source]

For a given set of periods (Start->End) returns the data. It also returns the indexes.

Parameters:

x_in (pandas.core.frame.DataFrame) – A pandas.core.frame.DataFrame where the index is of type “pandas.core.indexes.datetime.DatetimeIndex” and each column contains an electrical quantity time series.
df_save (pandas.core.frame.DataFrame) – The first column with the start and the second column with the end date.

Returns:

df_values,index_return: The input pandas.core.frame.DataFrame sliced by the df_save periods. It also returns the indexes

Return type:

df_values,index_return: tuple
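The period slicing can be sketched on a plain list of samples. The sketch assumes that both bounds of each period are inclusive and that the returned indexes are positions in the input; neither detail is confirmed by the documentation above, and `slice_periods` is a hypothetical name.

```python
from datetime import datetime, timedelta


def slice_periods(samples, periods):
    """Return the samples inside any (start, end) period, plus their positions.

    Assumption: both period bounds are inclusive.
    """
    index_return = [i for i, (ts, _) in enumerate(samples)
                    if any(start <= ts <= end for start, end in periods)]
    values = [samples[i] for i in index_return]
    return values, index_return


t0 = datetime(2023, 1, 1)
samples = [(t0 + timedelta(hours=h), float(h)) for h in range(24)]
periods = [(t0 + timedelta(hours=2), t0 + timedelta(hours=4)),
           (t0 + timedelta(hours=10), t0 + timedelta(hours=11))]
values, index_return = slice_periods(samples, periods)
```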

MVDataProcessing.Util.TimeProfile(time_stopper: list, name: str = '', show: bool = False, estimate_for: int = 0)[source]

Simple code profiler.

How to use:

Create a list -> time_stopper = []

Put time_stopper.append(['time_init', time.perf_counter()]) at the beginning.

Put time_stopper.append(['Func_01', time.perf_counter()]) after each code block, with the first element being a name and the second being the time.

Call this function at the end.

Example:

import time

time_stopper = []
time_stopper.append(['time_init', time.perf_counter()])

func1()
time_stopper.append(['func1', time.perf_counter()])
func2()
time_stopper.append(['func2', time.perf_counter()])
func3()
time_stopper.append(['func3', time.perf_counter()])
func4()
time_stopper.append(['func4', time.perf_counter()])

TimeProfile(time_stopper, 'My Profiler', show=True, estimate_for=500)

The estimate_for parameter scales the result as if the analyzed code were run that many times.

Parameters:

time_stopper (list) – A List that will hold all the stop times.
name (str, optional) – A name for this instance of time profile. Defaults to empty.
show (bool, optional) – If True shows the data on the console. Defaults to False.
estimate_for (int) – A multiplier to be applied at the end. Takes the whole time analyzed and multiplies it by “estimate_for”.

Returns:

None

Return type:

None
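The bookkeeping behind the time_stopper list can be sketched without the package. The helper below is hypothetical: it only mirrors the pattern described above (consecutive differences between stop times, plus a total scaled by estimate_for) and says nothing about what TimeProfile actually prints.

```python
import time


def summarize(time_stopper, estimate_for=0):
    """Return (name, elapsed_seconds) per block and the total elapsed time.

    Hypothetical helper; not the library's output format.
    """
    # Each entry is measured against the stop time recorded just before it.
    rows = [(name, stop - time_stopper[i][1])
            for i, (name, stop) in enumerate(time_stopper[1:])]
    total = time_stopper[-1][1] - time_stopper[0][1]
    if estimate_for:
        total *= estimate_for  # scale as if the code ran estimate_for times
    return rows, total


time_stopper = [['time_init', time.perf_counter()]]
sum(range(10_000))
time_stopper.append(['sum', time.perf_counter()])
sorted(range(10_000), reverse=True)
time_stopper.append(['sort', time.perf_counter()])

rows, total = summarize(time_stopper, estimate_for=500)
```

Because the first entry only marks the starting time, it never appears as a row of its own.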

MVDataProcessing.Util.VoltageDummyData(qty_weeks: int = 48, start_date_dt: datetime = datetime.datetime(2023, 1, 1, 0, 0))[source]

Generate a DataFrame containing dummy voltage data over a specified number of weeks.

This function creates a time series of voltage data, simulating variations in voltage values over a given time period. The data includes random noise and step changes to mimic real-world fluctuations in voltage readings.

Parameters

qty_weeks (int, optional): The number of weeks over which to generate the data (default is 48 weeks).
start_date_dt (datetime, optional): The start date for the data generation (default is January 1, 2023).

Returns

pandas.DataFrame: A DataFrame with timestamps as index and columns ‘VA’, ‘VB’, and ‘VV’ representing simulated voltage readings for three different phases or measurements. Each column contains voltage values that are affected by random noise and step changes.

Notes

The voltage values are simulated around a base value of 13.8, adjusted by a random noise factor and step changes.
The step changes in voltage are randomly introduced at various points in the time series.
The timestamps are spaced 5 minutes apart.

Examples

>>> dummy_data = VoltageDummyData()
>>> dummy_data.head()
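The combination of random noise and step changes mentioned in the notes can be sketched as follows. Only the base value of 13.8 comes from the documentation; the noise amplitude, the step size, the number of steps, and the name `dummy_voltage` are assumptions chosen for illustration.

```python
import random


def dummy_voltage(n_samples=288, base=13.8, noise=0.01, n_steps=2, seed=1):
    """Return simulated voltage values and the positions of the step changes."""
    rng = random.Random(seed)
    # Pick sample positions where the level shifts by a small random offset.
    step_at = sorted(rng.sample(range(1, n_samples), n_steps))
    level, values = base, []
    for i in range(n_samples):
        if i in step_at:
            level += rng.uniform(-0.2, 0.2)  # the new level persists afterwards
        # Multiplicative noise around the current level.
        values.append(level * (1 + rng.uniform(-noise, noise)))
    return values, step_at


values, step_at = dummy_voltage()
```

The default of 288 samples corresponds to one day at the 5-minute spacing stated in the notes.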

MVDataProcessing.Util.YearPeriodMapperVet(month: Series) → Series[source]

Maps a given month to one of two periods of the year: dry or humid.

October to March -> 0 (humid)
April to September -> 1 (dry)

Parameters:

month – A pandas.core.series.Series with values between 0 and 12 to map each month in the series to dry or humid.

Returns:

season: The months pandas.core.series.Series mapped to dry or humid.

Return type:

season: pandas.core.series.Series
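The mapping rule itself is small enough to sketch for a single month number. The sketch assumes calendar months numbered 1 to 12 and rejects anything else; the scalar helper `year_period` is hypothetical, whereas the package function operates on a whole Series.

```python
def year_period(month):
    """Map a calendar month (1-12) to 0 (humid) or 1 (dry)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # April (4) to September (9) is dry; October to March is humid.
    return 1 if 4 <= month <= 9 else 0


seasons = [year_period(m) for m in range(1, 13)]
```

With pandas, the same rule could be applied element-wise to a Series of month numbers through `Series.map`.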
