GossipCat API

Data Science Project Basics

gossipcat.dev.FileSys(project_name=True)[source]

Establish a data science project file system.

Parameters:: project_name (bool) – If a project name is needed, default True.
Yields:: A data science project file system.

gossipcat.dev.get_logger(logName, logFile=False)[source]

Get a logger in one step.

Logging is one of the most underrated features in Python. Two things to take away (5 and 3): 1. FIVE levels of importance that a log record can carry (debug, info, warning, error, critical); 2. THREE components needed to configure a logger (a logger, a formatter, and at least one handler).

Parameters:

logName (str) – A logger name to display in log messages.
logFile (bool) – A target file to save logs, default False.

Returns:

A well organized logger.

Return type:

logger (logging.Logger)
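The "5 and 3" summary above can be illustrated with the standard library alone. The following is a minimal sketch, not the implementation of get_logger: the helper name build_logger, its arguments, and the message format are assumptions made for this example.

```python
import io
import logging


def build_logger(name, stream=None, log_file=None):
    """Hypothetical helper: wire a logger, a formatter, and handler(s)."""
    logger = logging.getLogger(name)      # component 1: the logger
    logger.setLevel(logging.DEBUG)        # let all five levels through
    logger.propagate = False              # avoid duplicates via the root logger
    logger.handlers.clear()               # make repeated calls idempotent

    formatter = logging.Formatter(        # component 2: the formatter
        "%(name)s | %(levelname)s | %(message)s"
    )

    handlers = [logging.StreamHandler(stream)]  # component 3: at least one handler
    if log_file:
        handlers.append(logging.FileHandler(log_file))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("demo", stream=buffer)

# The five standard levels, from least to most severe.
log.debug("loading data")
log.info("data loaded")
log.warning("missing values found")
log.error("training failed")
log.critical("out of memory")

lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Passing a file path as log_file in this sketch adds a second handler, so the same records go to both the stream and the file.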

Data Science Experiment

class gossipcat.lab.Comparison(df, target, features, metric=None, cv_n=10, log_path='algorithms_comparison.log')[source]

Machine Learning Algorithms Comparison

Parameters:

df (pandas.DataFrame) – A training set.
target (str) – The target for supervised machine learning.
features (list) – The feature list for the model.
metric (str) – The metric used for cross-validation, sklearn.metrics. For classification, consider ‘roc_auc’, ‘average_precision’; for regression, consider ‘neg_root_mean_squared_error’.
cv_n (int) – The number of splits of cross-validation.
log_path (str) – The logging file.
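To make the role of cv_n concrete, here is a minimal sketch of how a dataset of n rows can be divided into cv_n train/test splits. It assumes contiguous, unshuffled folds; the splitting strategy actually used by Comparison is not stated here, and kfold_indices is a hypothetical helper written for this example.

```python
def kfold_indices(n_samples, cv_n=10):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // cv_n + (1 if i < n_samples % cv_n else 0)
        for i in range(cv_n)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = kfold_indices(25, cv_n=10)
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), "train rows,", len(test_idx), "test rows")
```

Each row lands in exactly one test fold, so every algorithm under comparison is scored on the whole training set once across the cv_n splits.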

classifers()[source]: Compare classification algorithms.

compare(classification=True)[source]

Compare supervised machine learning algorithms.

Parameters:: classification (bool) – If the task is classification, default True.
Returns:: The comparison results.
Return type:: Results

regressors()[source]: Compare regression algorithms.

visualize(time=False, figsize=(8, 8))[source]

Visualize the comparison results.

Parameters:

time (bool) – Whether to include computing time in the chart, default False.
figsize (tuple) – The figure size for output, default (8, 8).

Returns:

None

class gossipcat.lab.GridSearch(df=None, target=None, features=None, regression=False, if_visualize=False, log_path='grid_search.log')[source]

Perform a grid search for XGBoost hyper-parameter tuning, focusing on max_depth, subsample, and colsample_bytree.

Parameters:

df (pandas.DataFrame) – A training set.
target (str) – The target for supervised machine learning.
features (list) – The feature list for the model.
regression (bool) – Whether the machine learning task is regression.
if_visualize (bool) – Whether the task is to visualize, default False.
log_path (str) – The logging file.

search(range_max_depth=range(1, 10), range_subsample=range(50, 91, 5), range_colsample_bytree=range(50, 91, 5))[source]

Search the hyper-parameter space.

Parameters:

range_max_depth (list) – The search space of max_depth, default range(1, 10).
range_subsample (list) – The search space of subsample, default range(50, 91, 5).
range_colsample_bytree (list) – The search space of colsample_bytree, default range(50, 91, 5).
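The default search space can be enumerated as a Cartesian product of the three ranges. The sketch below is illustrative only. It assumes that the integer ranges for subsample and colsample_bytree are percentages that get divided by 100, since XGBoost expects fractions in (0, 1]; how GridSearch converts them internally is not documented here.

```python
from itertools import product

# Defaults copied from the search() signature.
range_max_depth = range(1, 10)
range_subsample = range(50, 91, 5)
range_colsample_bytree = range(50, 91, 5)

# ASSUMPTION: percentages are scaled to fractions before reaching XGBoost.
grid = [
    {"max_depth": d, "subsample": s / 100, "colsample_bytree": c / 100}
    for d, s, c in product(range_max_depth, range_subsample, range_colsample_bytree)
]

print(len(grid), "candidate settings")
print(grid[0])
print(grid[-1])
```

With the defaults, each range holds 9 values, so the full grid has 9 × 9 × 9 = 729 candidates; narrowing any one range shrinks the search proportionally.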

visualize(max_depth=1, top=1)[source]

Visualize the grid search results in 3D. The x-axis is subsample, the y-axis is colsample_bytree, and the z-axis is the mean cross-validation test score.

Parameters:

max_depth (int) – The max_depth for the 3D visualization.
top (int) – The top results to print out.

Returns:

The top results of grid search.

class gossipcat.lab.Explain(model, X, y, target, features, regression=False)[source]

Explain tree-based models with dtreeviz and SHAP.

Parameters:

model – A tree-based model, like XGBoost.
X (np.ndarray) – X values.
y (np.ndarray) – y values.
target (str) – The target name.
features (list) – The list of features.
regression (bool) – Whether it is a regression model, default False.

feature_importance(max_display=20)[source]: Plot the feature importance and the SHAP variable importance.

tree(tree_index=0, class_names=None, show_node_labels=True, title='Decision Tree', orientation='TD', scale=1.5)[source]

Plot the tree with dtreeviz.

Parameters:

tree_index (int) – The tree index of the model, default 0.
class_names (dict or list) – [For classifiers] A dictionary or list of strings mapping class value to class name.
show_node_labels (bool) – Add “Node id” to top of each node in graph for educational purposes.
title (str) – The plot title.
orientation (str) – Is the tree top down, “TD”, or left to right, “LR”?
scale (float) – Scale the width, height of the overall SVG preserving aspect ratio, default 1.5.

Returns:

A dtreeviz instance.

Return type:

viz (dtreeviz)

Model Development

class gossipcat.dev.XGB(df, indcol, target, features, regression=False, predicting=False, balanced=False, multi=False, gpu=False, seed=0)[source]

Develop an XGBoost model with best-practice parameters.

Parameters:

df (pandas.DataFrame) – A DataFrame for modeling.
indcol (str) – The indicator column name for the dataset.
target (str) – The target column name.
features (list) – The feature list.
regression (bool) – Whether the task is regression, default False.
predicting (bool) – Whether it is a predicting task, default False.
balanced (bool) – Whether the sample is balanced for binary classification task, default False.
multi (bool) – Whether a multi-category task, default False.
gpu (bool) – Whether to use GPU, default False.
seed (int) – The seed for randomness.

algorithm(learning_rate=0.01, nfold=5, n_rounds=3000, early_stopping=50, verbose=100)[source]

Perform cross-validation on the training set.

Parameters:

learning_rate (float) – Boosting learning rate (xgb’s “eta”).
nfold (int) – Number of folds in CV.
n_rounds (int) – Number of boosting iterations.
early_stopping (int) – Activates early stopping. Cross-Validation metric (average of validation metric computed over CV folds) needs to improve at least once in every early_stopping_rounds round(s) to continue training. The last entry in the evaluation history will represent the best iteration. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.
verbose (bool, int, or None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
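The early-stopping rule described for early_stopping can be sketched in plain Python. This is an illustration of the rule only, not the code that runs inside XGBoost; best_iteration and its arguments are hypothetical names, and the scores are made-up numbers for a metric where lower is better.

```python
def best_iteration(cv_scores, early_stopping, lower_is_better=True):
    """Return (best_index, history), with history ending at the best round.

    Stops scanning once the score has failed to improve for
    `early_stopping` consecutive rounds.
    """
    best_idx = 0
    for i in range(1, len(cv_scores)):
        if lower_is_better:
            improved = cv_scores[i] < cv_scores[best_idx]
        else:
            improved = cv_scores[i] > cv_scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= early_stopping:
            break  # patience exhausted: no improvement in the window
    return best_idx, cv_scores[: best_idx + 1]


# Illustrative per-round CV averages (not real results).
scores = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
best_idx, history = best_iteration(scores, early_stopping=3)
print(best_idx, history)
```

Note that the better score in the final round is never seen: training stops after three rounds without improvement, which is the trade-off a small early_stopping value makes.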

evaluate(path_model='model_xgb.pkl')[source]

Evaluate a model loaded from the path.

Parameters:: path_model (str) – Path of the model.
Returns:: Model evaluation.

learning_curve(figsize=(10, 5))[source]

Draw a learning curve of the cross-validation.

Parameters:: figsize (tuple) – Figure size of the chart.

load_model(path_model='model_xgb.pkl')[source]

Load a pretrained model.

Parameters:: path_model (str) – Path of the model.

predict(path_model='model_xgb.pkl', path_result='prediction.csv')[source]

Predict with model loaded from the path and save it as a CSV file.

Parameters:

path_model (str) – Path of the model.
path_result (str) – Path of the prediction.

report()[source]: Report for the binary classification task.

retrain(path_model, path_model_update=None)[source]

Retrain the model loaded from path_model and save it to a new path.

Parameters:

path_model (str) – Path of the existing model to retrain.
path_model_update (str) – New path for the updated model.

save_model(path_model='model_xgb.pkl')[source]

Save the trained model.

Parameters:: path_model (str) – Path of the model.

train(path_model='model_xgb.pkl')[source]

Train a model with the best iteration rounds obtained from algorithm.

Parameters:: path_model (str) – Path to save the model.

class gossipcat.dev.CAT(df, indcol, target, features, features_cat, regression=False, predicting=False, multi=0, balanced=0, gpu=0, seed=0)[source]

Quickly develop a CatBoost model with best-practice parameters.

Parameters:

df (pandas.DataFrame) – A DataFrame for modeling.
indcol (str) – The indicator column name for the dataset.
target (str) – The target column name.
features (list) – The feature list.
features_cat (list) – Categorical feature list.
regression (bool) – Whether the task is regression, default False.
predicting (bool) – Whether it is a predicting task, default False.
balanced (bool) – Whether the sample is balanced for binary classification task, default False.
multi (bool) – Whether a multi-category task, default False.
gpu (bool) – Whether to use GPU, default False.
seed (int) – The seed for randomness.

algorithm(learning_rate=0.01, iterations=100, early_stopping_rounds=20, nfold=10, verbose=100, plot=False)[source]

Perform cross-validation on the training set.

Parameters:

learning_rate (float) – Boosting learning rate (xgb’s “eta”).
iterations (int) – Number of boosting iterations.
early_stopping_rounds (int) – Activates early stopping. Cross-Validation metric (average of validation metric computed over CV folds) needs to improve at least once in every early_stopping_rounds round(s) to continue training. The last entry in the evaluation history will represent the best iteration. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.
nfold (int) – Number of folds in CV.
verbose (bool, int, or None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
plot (bool) – Whether to plot the output, default False.

learning_curve(figsize=(10, 5))[source]

Draw a learning curve of the cross-validation.

Parameters:: figsize (tuple) – Figure size of the chart.

load_model(path_model='model_cb.json', format='json')[source]

Load a pretrained model.

Parameters:

path_model (str) – Path of the model.
format (str) – Model format, default json.

predict(path_model='model_cb.json', path_result='prediction.csv', model_format='json')[source]

Predict with model loaded from the path and save it as a CSV file.

Parameters:

path_model (str) – Path of the model.
path_result (str) – Path of the prediction.
model_format (str) – Model format, default json.

report()[source]: Report for the binary classification task.

save_model(path_model='model_cb.json', format='json')[source]

Save the trained model.

Parameters:

path_model (str) – Path of the model.
format (str) – Model format, default json.

train(path_model='model_cb.json')[source]

Train a model with the best iteration rounds obtained from algorithm.

Parameters:: path_model (str) – Path to save the model.

Graph Data Science

class gossipcat.graph.Attribute(graph)[source]

Generate all node-based, edge-based, and graph-based attributes of all connected components in a whole graph.

Initializes the class by extracting connected components and creating dataframes for graph, node, edge, and pair attributes.

Parameters:: graph (networkx.Graph) – The input graph.

mulTabular()[source]

Combine all node-based, edge-based, and graph-based attributes of all connected components in the whole graph into a single DataFrame.

Returns:: Combined DataFrame of node, edge, and graph-based attributes for all connected components in the graph.
Return type:: pd.DataFrame

sigTabular()[source]

Combine all node-based, edge-based, and graph-based attributes of a single connected component by merging the node and edge attributes into one DataFrame.

Returns:: Combined DataFrame of node, edge, and graph-based attributes for a single connected component.
Return type:: pd.DataFrame

class gossipcat.graph.GFeature(df, source, target)[source]

Feature engineering to add all node-based and graph-based attributes of all connected components in a whole graph.

Initializes the class variables and extracts the connected components of the provided graph.

Parameters:

df (pd.DataFrame) – DataFrame with source and target nodes.
source (str) – Source node name.
target (str) – Target node name.
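Since the features are computed per connected component, it helps to see how components fall out of a list of source and target pairs. The sketch below uses a plain breadth-first search with the standard library; it is not how GFeature is implemented (the library works with networkx graphs), and connected_components and the sample edges are hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group the nodes of an undirected edge list into connected components."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from every node not yet assigned.
        queue = deque([start])
        seen.add(start)
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(component)
    return components


# Each tuple plays the role of one row's (source, target) values.
edges = [("a", "b"), ("b", "c"), ("x", "y")]
components = connected_components(edges)
print(components)
```

Here the five nodes split into two components, so node-based and graph-based attributes would be computed separately for each of them.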

generate()[source]

Iterate through all connected components in the graph and generate combined graph features.

Returns:: DataFrame with graph features for all connected components.
Return type:: pd.DataFrame

graphFeaturesUpdate(graph, df, d_r)[source]

Update graph-based attributes in a given DataFrame. Combines existing graph features with the features of the provided graph and updates the DataFrame.

Parameters:

graph (nx.Graph) – The input graph representing a single connected component.
df (pd.DataFrame) – DataFrame with source and target nodes.
d_r (pd.DataFrame) – DataFrame with existing graph-based attributes.

Returns:

Updated DataFrame with combined graph features.

Return type:

pd.DataFrame

signleGraphFeatures(graph, df)[source]

Combine all node-based, edge-based, and graph-based attributes of a single connected component. Merges node and graph attributes to create a combined DataFrame for a single connected component.

Parameters:

graph (nx.Graph) – The input graph representing a single connected component.
df (pd.DataFrame) – DataFrame with source and target nodes.

Returns:

Combined DataFrame of node, edge, and graph-based attributes for a single connected component.

Return type:

pd.DataFrame

gossipcat.graph.beautiful_nx(g)[source]

A function to draw network graphs beautifully.

Parameters:: g (networkx.Graph) – A networkx graph object.

Draws a network graph, enhancing its appearance by creating a shadow effect for nodes and rendering the graph using matplotlib.

Source:: https://gist.github.com/jg-you/144a35013acba010054a2cc4a93b07c7.js

gossipcat.graph.graph_with_label(G, df_node, metric, shreshold, figsize=(8, 6))[source]

Draw a graph, labeling the nodes that meet a threshold on a specified topology attribute.

Parameters:

G (networkx.Graph) – A networkx graph.
df_node (gossipcat.GraphFE) – A node attribute dataframe.
metric (str) – The topology attribute used for labeling nodes.
shreshold (int or float) – The threshold to select nodes to be labeled.
figsize (tuple, optional) – The size of the figure (width, height). Default is (8, 6).

Returns:

None

gossipcat.graph.graph_with_scale(G, weight='wt', node_scalar=40000, edge_scalar=0.002, seed=2021, figsize=(20, 20))[source]

Draw a weighted graph, adjusting node sizes based on centrality and edge widths based on edge weight.

Parameters:

G (networkx.Graph) – A networkx graph.
weight (str) – The edge attribute representing weight.
node_scalar (int or float) – Scalar for adjusting node sizes based on centrality.
edge_scalar (float) – Scalar for adjusting edge widths based on edge weight.
seed (int, optional) – Seed for the random number generator. Default is 2021.
figsize (tuple, optional) – The size of the figure (width, height). Default is (20, 20).

Returns:

None
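The effect of node_scalar and edge_scalar can be sketched without any plotting. The example below assumes degree centrality, which may differ from the centrality measure graph_with_scale actually uses; scale_sizes and the sample edges are hypothetical and exist only for this illustration.

```python
from collections import Counter


def scale_sizes(weighted_edges, node_scalar=40000, edge_scalar=0.002):
    """Map centrality to node sizes and edge weight to edge widths."""
    degree = Counter()
    for u, v, _ in weighted_edges:
        degree[u] += 1
        degree[v] += 1

    n = len(degree)
    # ASSUMPTION: degree centrality = degree / (n - 1).
    centrality = {
        node: (d / (n - 1) if n > 1 else 0.0) for node, d in degree.items()
    }
    node_sizes = {node: c * node_scalar for node, c in centrality.items()}
    edge_widths = {(u, v): w * edge_scalar for u, v, w in weighted_edges}
    return node_sizes, edge_widths


# (source, target, weight) triples; the weights are made-up values.
edges = [("a", "b", 1000), ("b", "c", 500)]
node_sizes, edge_widths = scale_sizes(edges)
print(node_sizes)
print(edge_widths)
```

The large default for node_scalar and the small default for edge_scalar suggest that centrality values are expected to be small fractions and edge weights large numbers; both scalars likely need tuning for graphs on a different scale.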