GitHub Issue Resolution Time Prediction System Analysis


Contents

  1. System Overview & Model Development
  2. Classification Performance
  3. Resolution Time Prediction
  4. Feature Analysis
  5. System Architecture
  6. Cloud Migration Plan

System Overview & Model Development

The GitHub Issue Resolution Time Prediction System uses a two-layer approach to predict how long it will take to resolve GitHub issues, the items used to track tasks, enhancements, bugs, and other work within a GitHub repository.

Two-Layer Prediction System

  1. Classification Layer: Determines whether an issue is valid (it will be fixed or addressed by developers, such as an actual bug, feature request, or enhancement that fits the project's roadmap) or won't-fix (developers decide not to address it, such as a duplicate, an invalid bug report, or a request outside the project's scope or design philosophy).
  2. Regression Layer: For valid issues, predicts resolution time, measured in hours between when an issue is created and when it is closed.
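The routing between the two layers can be sketched as follows. This is a minimal illustration, not the system's actual code: `predict_resolution_hours`, `toy_classify`, and `toy_regress` are hypothetical names, and the toy functions merely stand in for the trained models.

```python
from typing import Callable, Optional


def predict_resolution_hours(
    issue: dict,
    classify: Callable[[dict], str],
    regress: Callable[[dict], float],
) -> Optional[float]:
    """Layer 1 decides valid vs. won't-fix; layer 2 runs only for valid issues."""
    if classify(issue) == "wontfix":
        return None  # no resolution time is predicted for won't-fix issues
    return regress(issue)


# Toy stand-ins for the trained layers (assumptions, for illustration only).
def toy_classify(issue: dict) -> str:
    return "wontfix" if "wontfix" in issue["labels"] else "valid"


def toy_regress(issue: dict) -> float:
    return 100.0  # a real regressor would use the engineered features


valid_result = predict_resolution_hours({"labels": ["bug"]}, toy_classify, toy_regress)
wontfix_result = predict_resolution_hours({"labels": ["wontfix"]}, toy_classify, toy_regress)
```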

Classification Methods

The system uses two approaches to classify issues:

Deterministic Classification

Uses regular expressions to find keywords in issue labels (for example "wontfix", "invalid", "bug", or "enhancement") that clearly indicate whether an issue is valid or won't-fix. Labels are tags such as "bug" or "duplicate", often applied manually by project maintainers, and they give strong signals about an issue's status.
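A rule of this kind might look like the sketch below. The exact patterns used by the system are not given, so both regular expressions and the policy of returning `None` for conflicting or missing labels are assumptions made for illustration.

```python
import re
from typing import Iterable, Optional

# Hypothetical keyword patterns; the system's real patterns are not documented here.
WONTFIX_RE = re.compile(r"won'?t[\s_-]?fix|invalid|duplicate", re.IGNORECASE)
VALID_RE = re.compile(r"bug|enhancement|feature", re.IGNORECASE)


def classify_by_labels(labels: Iterable[str]) -> Optional[str]:
    """Return 'valid' or 'wontfix' when labels are unambiguous, else None."""
    labels = list(labels)
    is_wontfix = any(WONTFIX_RE.search(label) for label in labels)
    is_valid = any(VALID_RE.search(label) for label in labels)
    if is_wontfix and not is_valid:
        return "wontfix"
    if is_valid and not is_wontfix:
        return "valid"
    return None  # ambiguous: hand over to the machine learning classifier
```

Issues for which this returns `None` are the ambiguous ones that the machine learning classifiers would handle.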

Machine Learning Classification

For ambiguous issues, meaning those that lack clear labels or carry conflicting signals, the system uses Random Forest, XGBoost, or other classifiers to predict whether the issue is valid or won't-fix. These models learn from patterns in previously labeled issues.

Resolution Time Prediction

For issues classified as valid, the system trains multiple machine learning models to predict resolution time; the results reported later cover Random Forest, Gradient Boosting, XGBoost, and GPU Forest.

Finally, an ensemble model combines the predictions from all models, with the aim of reducing the errors made by any single model. In the reported results the ensemble has the highest R² in both settings, but not the lowest MAE.
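How the ensemble combines its members is not specified. The sketch below assumes the simplest option, an unweighted mean of the per-model predictions; `ensemble_predict` is a hypothetical helper.

```python
from statistics import fmean
from typing import Dict, List


def ensemble_predict(per_model_predictions: Dict[str, List[float]]) -> List[float]:
    """Average the predictions of all models, issue by issue (unweighted mean)."""
    columns = zip(*per_model_predictions.values())
    return [fmean(column) for column in columns]


# Two hypothetical models predicting hours for two issues.
combined = ensemble_predict({
    "random_forest": [1.0, 3.0],
    "xgboost": [3.0, 5.0],
})
```

A weighted mean or a stacked meta-model would fit the same interface; the unweighted mean is used here only because it needs no extra assumptions.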

Model Development Process

The development of this prediction system involved several key stages in data handling and feature creation:

Data Acquisition and Storage:

  • Data Collection: Approximately 2 million GitHub issues were downloaded from the top 100 repositories across popular programming languages, including Python, Go, and Java, to obtain a broad and representative dataset.
  • Data Management: The collected issue data was upserted into a PostgreSQL database, chosen for its robustness and efficiency with large volumes of structured and semi-structured data and for the complex queries needed during model training.
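An upsert inserts a row if its key is new and updates it otherwise, so re-downloading an issue does not create duplicates. The sketch below uses Python's built-in `sqlite3` as a stand-in so that it runs without a server; the table layout is hypothetical. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form, with a driver-specific placeholder style.

```python
import sqlite3

# Hypothetical schema; the real table layout is not described in this report.
UPSERT_SQL = """
INSERT INTO issues (id, repo, title, state)
VALUES (?, ?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    repo = excluded.repo,
    title = excluded.title,
    state = excluded.state
"""


def upsert_issue(conn: sqlite3.Connection, issue: tuple) -> None:
    """Insert the issue, or overwrite the stored row that has the same id."""
    conn.execute(UPSERT_SQL, issue)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (id INTEGER PRIMARY KEY, repo TEXT, title TEXT, state TEXT)"
)
upsert_issue(conn, (1, "example/repo", "Crash on start", "open"))
upsert_issue(conn, (1, "example/repo", "Crash on start", "closed"))  # same id: update
rows = conn.execute("SELECT id, state FROM issues").fetchall()
```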

Feature Engineering:

A rich set of features was engineered from the raw issue data to extract signals that could predict resolution times. Key feature categories include:

  • Text Embeddings: Embeddings were generated for the title and body of each issue. They convert text into dense vector representations that capture semantic meaning and context, which is vital for understanding the issue's content.
  • Numerical Features: A variety of numerical features were derived, such as:
    • The number of issues concurrently open within the same repository at the time of an issue's creation (reflecting repository load).
    • The total number of unique contributors to the repository (indicating community size and activity).
  • Additional features, as detailed in the Feature Analysis section (e.g., time-based features, content metadata), were also engineered to provide a comprehensive view of each issue.
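As an illustration of the repository-load feature, the sketch below counts, for each issue in one repository, how many other issues were open at the moment it was created. The function name and the tuple layout are assumptions, and the quadratic loop favors clarity over speed; at the scale of millions of issues a sorted sweep or a SQL window query would be more practical.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (created_at, closed_at); closed_at is None while the issue is still open.
Issue = Tuple[datetime, Optional[datetime]]


def open_issues_at_creation(issues: List[Issue]) -> List[int]:
    """For each issue, count earlier issues in the same repo that were still open."""
    counts = []
    for created, _ in issues:
        still_open = sum(
            1
            for other_created, other_closed in issues
            if other_created < created
            and (other_closed is None or other_closed > created)
        )
        counts.append(still_open)
    return counts


repo_issues = [
    (datetime(2024, 1, 1), datetime(2024, 1, 10)),
    (datetime(2024, 1, 5), None),
    (datetime(2024, 1, 12), None),
]
load = open_issues_at_creation(repo_issues)
```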

This robust data foundation and meticulously engineered feature set form the basis for the two-layer prediction approach and the models subsequently trained and evaluated in this system.


Classification Performance

Dataset Distribution

[Charts: Issue Distribution; Classification Method]

Class Imbalance Challenge

The dataset exhibits severe class imbalance:

  • Valid Issues: 98.1% (550,487 issues)
  • Won't-fix Issues: 1.9% (10,562 issues)

This extreme imbalance makes it difficult for the model to correctly identify won't-fix issues, resulting in low recall despite high accuracy.
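A quick calculation on the counts above shows why accuracy alone says little here: a trivial baseline that labels every issue as valid already scores about 98% accuracy on this distribution while finding no won't-fix issues at all.

```python
# Class counts taken from the distribution above.
valid_count = 550_487
wontfix_count = 10_562
total = valid_count + wontfix_count

# Baseline: always predict "valid".
baseline_accuracy = valid_count / total   # share of issues labeled correctly
baseline_wontfix_recall = 0 / wontfix_count  # no won't-fix issue is ever flagged

print(f"total issues:      {total}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_wontfix_recall:.4f}")
```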

Classification Metrics

  • Accuracy: 0.9653 (96.5%). The proportion of all predictions, positive and negative, that were correct. It can mislead on imbalanced datasets, because always predicting the majority class (valid) would also yield high accuracy.
  • Precision: 0.9552 (95.5%). The proportion of won't-fix predictions that were actually correct. When the model predicts won't-fix it is usually right, but this says nothing about how many won't-fix issues are missed.
  • Recall: 0.1041 (10.4%). The proportion of true won't-fix issues that were correctly identified. The model misses about 90% of them, likely because of the severe class imbalance in the training data.
  • F1 Score: 0.1878. The harmonic mean of precision and recall. The low value indicates poor overall performance despite high accuracy, because most won't-fix issues go unidentified.
  • AUC: 0.8542. Area under the Receiver Operating Characteristic curve, which measures the model's ability to distinguish between classes across threshold settings. The relatively high value suggests good discriminative power that is not being used effectively because of the imbalanced data.
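The F1 score follows directly from the reported precision and recall. The short check below recomputes it; the small difference from the reported 0.1878 comes from using inputs rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.9552, recall=0.1041)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, the very low recall keeps F1 low no matter how high precision is.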

Performance Analysis

The high accuracy (96.5%) is misleading due to the class imbalance. The low recall (10.4%) indicates that the model is missing about 90% of the actual won't-fix issues. This is a critical limitation of the current approach.

Implications of Low Recall

The poor recall for won't-fix issues means that:

  • Many won't-fix issues are incorrectly classified as valid
  • The system will waste time predicting resolution times for issues that won't be fixed
  • The resolution time predictions may be less accurate because the dataset includes mislabeled won't-fix issues

Resolution Time Prediction Performance

Model               MAE (hrs)   RMSE (hrs)   Median AE (hrs)   R²
Without Embeddings
Random Forest       3906.01     10431.53     341.10            0.1303
Gradient Boosting   3938.32     10490.13     328.57            0.1379
XGBoost             3933.83     10483.27     329.33            0.1402
GPU Forest          3920.53     10488.01     320.06            0.1380
Ensemble            3925.11     10486.19     321.20            0.1459
With Embeddings
Random Forest       3878.43     10390.64     322.83            0.1825
Gradient Boosting   3881.12     10339.36     350.04            0.1821
XGBoost             3875.79     10321.36     352.11            0.1870
GPU Forest          3891.70     10416.41     323.42            0.1737
Ensemble            3879.43     10379.16     327.31            0.1903

MAE = Mean Absolute Error, RMSE = Root Mean Square Error, Median AE = Median Absolute Error, R² = R-squared coefficient of determination
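For reference, the four metrics in the table can be computed as in the sketch below. It uses only the standard library and the textbook definitions; `regression_metrics` is a hypothetical helper, not part of the system.

```python
from math import sqrt
from statistics import fmean, median
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, median absolute error, and R² for paired values."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]
    mean_true = fmean(y_true)
    ss_res = sum(squared_errors)                         # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "mae": fmean(abs_errors),
        "rmse": sqrt(fmean(squared_errors)),
        "median_ae": median(abs_errors),
        "r2": 1 - ss_res / ss_tot,
    }


# Toy data: three exact predictions and one large miss.
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 8])
```

In the toy data a single large miss raises MAE and RMSE while the median absolute error stays at zero, which is the same pattern that separates the mean and median errors in the table.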

Analysis: The high importance of classification uncertainty and text embeddings suggests that the semantic content of issues (the meaning and context of the issue text, such as problem domain, technical specificity, and framing, beyond keywords or surface statistics) is valuable for predicting resolution time. The two-layer approach makes this signal available to the regression models.

Resolution Time Distribution Challenge

Resolution Time Statistics:

  • Mean: 3,404.38 hours (~142 days)
  • Median: 105.51 hours (~4.4 days)
  • Maximum: 148,492.47 hours (~17 years)
  • 25th Percentile: 1.64 hours
  • 75th Percentile: 1,752.48 hours (~73 days)
  • 95th Percentile: 20,572.32 hours (~857 days)

The extreme disparity between mean (3,404 hours) and median (105.5 hours) resolution times reveals a highly skewed distribution with significant outliers. With some issues taking nearly 17 years to resolve while others are addressed within hours, the current model's modest R² value of 0.19 is unsurprising.
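The unit conversions and the size of the mean-to-median gap can be checked with a few lines. The figures are the statistics listed above; the 365-day year used for the last conversion is an assumption of this sketch.

```python
# Resolution time statistics in hours, as listed above.
stats_hours = {
    "mean": 3404.38,
    "median": 105.51,
    "p75": 1752.48,
    "p95": 20572.32,
    "max": 148492.47,
}

stats_days = {name: hours / 24 for name, hours in stats_hours.items()}
max_years = stats_days["max"] / 365  # assumes a 365-day year

# A mean far above the median is the signature of a long right tail.
mean_to_median = stats_hours["mean"] / stats_hours["median"]

for name, days in stats_days.items():
    print(f"{name:>6}: {days:10.1f} days")
print(f"mean / median = {mean_to_median:.1f}")
```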

This extreme variability makes the current model appear ineffective for practical predictions. However, with additional data and features related to repository maintainers, issue authors, and codebase characteristics (such as complexity metrics, test coverage, and component dependencies), a significantly more powerful predictive model could be developed. Incorporating such contextual features would likely capture the organizational and technical factors that drive the wide variance in resolution times.

Impact of Text Embeddings

Adding text embeddings improved prediction performance. For the ensemble, R² rose from 0.1459 to 0.1903 and MAE fell from 3925.11 to 3879.43 hours, while median AE rose slightly from 321.20 to 327.31 hours.

This suggests that the semantic content of issues contains valuable information for predicting resolution time.

Error Analysis by Category

By Author Type

Issues from core contributors have:

  • Lower mean error (2873.02 vs. 3905.38 hours)
  • Higher median error (415.46 vs. 324.82 hours)

This suggests core contributor issues have fewer extreme outliers but may have moderate delays.

By Creation Time

Issues created on weekends have:

  • Higher mean error (3970.00 vs. 3863.46 hours)
  • Lower median error (310.12 vs. 330.43 hours)

Weekend issues may be more unpredictable, with more extreme resolution times.

By Issue Content

Issues containing code have:

  • Lower mean error (3839.15 vs. 3962.73 hours)
  • Lower median error (319.51 vs. 343.41 hours)

Code inclusion appears to make resolution time more predictable.

Interpretation of Results

While an R² of 0.19 may seem low, predicting GitHub issue resolution times is inherently challenging due to many factors outside the scope of the data. The improvement in R² when adding embeddings (from about 0.15 to 0.19 for the ensemble) indicates that the text content of issues provides valuable signals for prediction.


Feature Importance Analysis

Detailed Analysis of Top Predictive Features

Understanding which features drive prediction performance provides valuable insights into what factors affect GitHub issue resolution times. This section expands on the visualizations above with detailed comparisons between models and feature sets.

Top Features Without Embeddings: Ensemble Model

Rank  Feature                     Importance  Description
1     classification_uncertainty  25.58%      (How unsure the classification model is about whether an issue will be fixed.) The single most important predictor, suggesting that issues that are difficult to classify tend to have less predictable resolution times.
2     created_year                7.19%       (Year the issue was created.) Strong temporal signal indicating that resolution patterns change significantly over a project's lifetime.
3     body_length                 5.18%       (Character count of the issue description.) Issues with longer descriptions may be more complex but also provide more context for resolution.
4     body_word_count             4.47%       (Word count of the issue description.) Complements body_length; the two together suggest that issue verbosity is a strong signal.
5     created_day_of_month        4.19%       (Day of the month, 1-31.) Suggests monthly cycles in development activity affect resolution time.
6     wontfix_probability         3.65%       (The model's estimate of how likely an issue is to be won't-fix.) Issues with higher probability of being won't-fix (even if classified as valid) tend to have distinctive resolution patterns.
7     code_to_text_ratio          3.62%       (Code characters divided by total characters in the description.) Issues with more code relative to explanatory text may be more precisely defined.
8     link_count                  3.57%       (Number of hyperlinks in the description.) Issues that reference other resources may have dependencies or broader context affecting resolution time.
9     created_hour                3.41%       (Hour of the day, 0-23.) The time of day when issues are created correlates with resolution patterns, possibly reflecting timezone differences or work schedules.
10    created_month               2.38%       (Month, 1-12.) Seasonal patterns affect resolution time, possibly due to release cycles, holidays, or contributor availability.

Top Features With Embeddings: Ensemble Model

Rank  Feature                     Importance  Shift
1     classification_uncertainty  11.86%      ↓ 13.72%
2     created_year                5.06%       ↓ 2.13%
3     wontfix_probability         2.46%       ↓ 1.19%
4     embed_0                     2.33%       New
5     embed_5                     1.89%       New
6     embed_2                     1.81%       New
7     link_count                  1.80%       ↓ 1.77%
8     list_item_count             1.60%       ↓ 0.77%
9     embed_3                     1.47%       New
10    embed_4                     1.31%       New

Key Shifts When Adding Embeddings

  • Classification uncertainty drops from 25.58% to 11.86% importance but remains the top individual feature
  • Embedding dimensions collectively account for 63.2% of feature importance
  • Content-based features like body_length and body_word_count drop dramatically in importance as embeddings capture their semantic signal more effectively
  • Temporal features (created_year, created_day_of_month) decline in importance but remain influential
  • Overall distribution becomes more even, with no single feature dominating

XGBoost vs. Ensemble: Feature Importance Comparison

XGBoost Feature Importance (No Embeddings)

  • classification_uncertainty: 85.20%
  • created_year: 1.97%
  • has_error_message: 1.40%
  • author_is_core_contributor: 1.26%
  • link_count: 1.22%

Ensemble Feature Importance (No Embeddings)

  • classification_uncertainty: 25.58%
  • created_year: 7.19%
  • body_length: 5.18%
  • body_word_count: 4.47%
  • created_day_of_month: 4.19%

XGBoost Feature Importance (With Embeddings)

  • classification_uncertainty: 38.18%
  • wontfix_probability: 4.88%
  • created_year: 3.54%
  • list_item_count: 2.60%
  • link_count: 2.57%

Ensemble Feature Importance (With Embeddings)

  • classification_uncertainty: 11.86%
  • created_year: 5.06%
  • wontfix_probability: 2.46%
  • embed_0: 2.33%
  • embed_5: 1.89%

Model Comparison Insights

The striking differences between XGBoost and the ensemble model's feature importance reveal fundamental differences in how these models learn:

  • XGBoost's Focus: XGBoost places extraordinary importance (85.20%) on classification_uncertainty when embeddings are not used, suggesting it relies heavily on this single feature for predictions.
  • Ensemble's Balanced Approach: The ensemble model distributes importance more evenly across features, suggesting it leverages multiple signals more effectively.
  • Embedding Effect: Adding embeddings reduces XGBoost's reliance on classification_uncertainty from 85.20% to 38.18% (a drop of 47.02 points), while the ensemble model falls from 25.58% to 11.86% (13.72 points). In relative terms the two drops are similar, at roughly 55% and 54%.
  • Feature Selection: XGBoost identifies different secondary features as important compared to the ensemble; for example, has_error_message and author_is_core_contributor rank higher in XGBoost.
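The size of the embedding effect on classification_uncertainty can be compared in absolute and relative terms with a short calculation on the reported importances; `importance_drop` is a hypothetical helper.

```python
from typing import Tuple


def importance_drop(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before - after
    return absolute, absolute / before


# classification_uncertainty importance without vs. with embeddings.
xgb_abs, xgb_rel = importance_drop(85.20, 38.18)
ens_abs, ens_rel = importance_drop(25.58, 11.86)

print(f"XGBoost:  -{xgb_abs:.2f} points ({xgb_rel:.0%} relative)")
print(f"Ensemble: -{ens_abs:.2f} points ({ens_rel:.0%} relative)")
```

XGBoost loses far more percentage points, but both models lose a little over half of the importance they originally assigned to this feature.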

Individual Model Analysis

Random Forest Importance Pattern

Random Forest shows a more balanced feature distribution:

  • Top Features: body_length (12.81%), body_word_count (10.96%), created_day_of_month (10.19%)
  • Classification features rank lower: uncertainty (6.15%), wontfix_probability (6.39%)
  • With embeddings: No single embedding dimension exceeds 2.76% importance
  • Shows higher appreciation for content-based features than other models

Gradient Boosting Importance Pattern

Gradient Boosting shows a hybrid approach:

  • Top Features: created_year (19.77%), classification_uncertainty (10.98%), link_count (8.19%)
  • Has the strongest emphasis on temporal features among all models
  • With embeddings: Still prioritizes created_year (12.65%) over classification features
  • Shows unique appreciation for link_count compared to other models

Feature Category Analysis

[Charts: Feature Categories (Without Embeddings); Feature Categories (With Embeddings). Total importance in each chart: 100%]

Classification Layer Features

Classification-related features provide critical signals:

  • uncertainty - Issues with ambiguous classification signals tend to have unpredictable resolution times
  • wontfix_probability - Higher probabilities correlate with longer resolution times, even for issues ultimately classified as valid
  • The decreasing importance of these features when embeddings are added suggests that embedding dimensions capture some of the same semantic signals
  • XGBoost's extreme reliance on classification_uncertainty suggests it identifies subtle patterns in this feature that other models miss
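The report does not give the formula behind classification_uncertainty. One common way to turn a class probability into an uncertainty score is binary entropy, sketched below purely as an assumption: it is highest when wontfix_probability is 0.5 and zero when the classifier is certain.

```python
from math import log2


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no prediction with probability p (assumed formula)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain prediction carries no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))


# Hypothetical wontfix_probability values for three issues.
for probability in (0.02, 0.5, 0.9):
    print(f"p = {probability:.2f} -> uncertainty = {binary_entropy(probability):.3f}")
```

Other definitions, such as the distance of the probability from 0.5 or the disagreement between trees in a forest, would serve the same purpose.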

Temporal Features

Time-related features reveal important patterns:

  • created_year - Projects evolve significantly over time, affecting resolution patterns
  • created_day_of_month - Monthly cycles affect resolution (possibly due to sprint patterns or release schedules)
  • created_hour - Issues created at different times of day show different resolution patterns
  • Gradient Boosting places the highest importance on temporal features, particularly created_year
  • These features remain important even with embeddings, suggesting they capture signals separate from semantic content
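The temporal features above can all be derived from an issue's creation timestamp, as in this sketch. The feature names created_year, created_month, created_day_of_month, and created_hour follow the report; `created_on_weekend` is a hypothetical name for the weekend flag used in the error analysis.

```python
from datetime import datetime
from typing import Dict, Union


def temporal_features(created_at: datetime) -> Dict[str, Union[int, bool]]:
    """Derive calendar features from the issue creation timestamp."""
    return {
        "created_year": created_at.year,
        "created_month": created_at.month,        # 1-12
        "created_day_of_month": created_at.day,   # 1-31
        "created_hour": created_at.hour,          # 0-23
        # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
        "created_on_weekend": created_at.weekday() >= 5,
    }


features = temporal_features(datetime(2024, 3, 9, 14, 30))  # a Saturday afternoon
```

Timestamps should be normalized to one timezone before extraction; otherwise created_hour mixes local conventions across repositories.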

Content Features

Issue content characteristics provide strong signals:

  • body_length/word_count - Longer issues likely involve more complex problems
  • code_to_text_ratio - More code relative to explanation may indicate better reproducibility
  • link_count - References to other resources may indicate dependencies
  • list_item_count - More structured content may be clearer and easier to resolve
  • Random Forest places the highest emphasis on these features
  • These features decrease dramatically in importance when embeddings are added as embeddings capture their semantic information more effectively
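The content features can be approximated from the raw Markdown body as sketched below. The regular expressions are assumptions made for illustration (only fenced code blocks count as code, and only http(s) URLs count as links); the system's actual extraction rules are not documented here.

```python
import re
from typing import Dict, Union

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable
CODE_BLOCK_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
LINK_RE = re.compile(r"https?://\S+")
LIST_ITEM_RE = re.compile(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", re.MULTILINE)


def content_features(body: str) -> Dict[str, Union[int, float, bool]]:
    """Compute simple size, code, link, and list statistics for an issue body."""
    code_chars = sum(len(block) for block in CODE_BLOCK_RE.findall(body))
    return {
        "body_length": len(body),
        "body_word_count": len(body.split()),
        # Code characters divided by total characters.
        "code_to_text_ratio": code_chars / len(body) if body else 0.0,
        "link_count": len(LINK_RE.findall(body)),
        "list_item_count": len(LIST_ITEM_RE.findall(body)),
        "has_code": code_chars > 0,
    }


sample_body = (
    "Steps:\n- run app\n- see http://example.com/log\n"
    + FENCE + "\nprint(1)\n" + FENCE
)
sample = content_features(sample_body)
```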

Embedding Features

Text embeddings dominate prediction when available:

  • Collectively account for 63.2% of feature importance in the ensemble model
  • embed_0, embed_5, and embed_2 are consistently the most important embedding dimensions across models
  • These embedding dimensions likely capture semantic aspects of issues like:
    • Problem domain (UI, backend, database, etc.)
    • Issue type (bug, feature request, documentation)
    • Technical complexity signals
    • Writing style and clarity
  • The embedding model effectively condenses semantic information that would otherwise require many engineered features

Surprising Feature Findings

Model Disagreement on Author Features

Models disagree significantly on the importance of author-related features:

  • XGBoost: Ranks author_is_core_contributor as its 4th most important feature (1.26% importance)
  • Ensemble: Ranks author_is_core_contributor as 17th (0.67% importance)
  • Random Forest: Ranks it last at 22nd (0.31% importance)
  • This disagreement suggests the relationship between author status and resolution time is complex and possibly non-linear
  • XGBoost may be better at capturing specific resolution patterns for core contributor issues
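
Rank comparisons of this kind can be reproduced from each model's importance vector. The following sketch assumes the importances are available as one name-to-value mapping per model; the helper names are hypothetical.

```python
def feature_rank(importances, feature):
    """Return the 1-based rank of `feature` when sorted by descending importance."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return ordered.index(feature) + 1


def rank_across_models(models, feature):
    """Map each model name to the rank it gives `feature`."""
    return {name: feature_rank(imp, feature) for name, imp in models.items()}
```

Comparing ranks rather than raw percentages is useful here because importance values are normalized separately within each model.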

Bug Identification Features

Features related to bug identification show surprising patterns:

  • has_error_message: Ranked 3rd (1.40%) by XGBoost but only 14th (1.05%) by the ensemble
  • has_reproduction_steps: Generally low importance across all models despite being considered valuable in bug reports
  • has_code: Very low importance (typically below 0.6%) compared to code_to_text_ratio (3.62% in ensemble)
  • This suggests that how much code an issue contains relative to its text matters more than whether any code is present
  • XGBoost appears more sensitive to specific technical indicators in the issue content

Strategic Feature Engineering Opportunities

Based on the comprehensive feature analysis, several opportunities for improved feature engineering emerge:

Repository Context Features

  • Repository age and maturity
  • Number of active contributors
  • Contributor-to-issue ratio
  • Release cadence and proximity to releases
  • Codebase size and complexity metrics
  • Test coverage and CI/CD pipeline metrics

Issue Relationship Features

  • Number of related or linked issues
  • Dependency graph metrics
  • Issue priority relative to other open issues
  • Component popularity/activity level
  • Historical resolution times for similar issues
  • Issue "heat" (comments, reactions, subscriptions)
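
As an illustration of one candidate from this list, historical resolution times for similar issues, the sketch below computes a median over past issues that share a label. The tuple layout, the use of a label as the similarity criterion, and the function name are all assumptions. Only issues closed before the new issue was created are used, so the feature does not leak future information into training.

```python
from datetime import datetime
from statistics import median


def historical_median_hours(history, label, as_of):
    """Median resolution time in hours for `label` issues closed before `as_of`.

    `history` is an iterable of (label, created_at, closed_at) tuples.
    Returns None when no comparable issue has been closed yet.
    """
    hours = [
        (closed - created).total_seconds() / 3600
        for issue_label, created, closed in history
        if issue_label == label and closed < as_of
    ]
    return median(hours) if hours else None
```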

Developer Context Features

  • Author contribution history and expertise areas
  • Maintainer availability patterns
  • Reviewer response times
  • Team workload indicators
  • Geographic distribution of contributors
  • Timezone alignment between author and maintainers

Advanced Semantic Features

  • Topic modeling of issue content
  • Named entity recognition for technologies mentioned
  • Code-specific embeddings for technical content
  • Sentiment analysis of issue descriptions
  • Complexity metrics for code snippets
  • Technical jargon density analysis

Incorporating these additional feature categories could substantially improve the model's predictive power beyond the current R² of 0.19, potentially making the system more useful for practical project planning and resource allocation.


System Architecture Analysis

Python Code Analysis

The system consists of two main Python scripts:

1. predict_resolution_time.py

This script implements the two-layer prediction system:

  • Data Loading and Preprocessing:
    • Loads repository data with issue details
    • Applies regex-based deterministic classification
    • Extracts features from creation time
    • Loads and integrates text embeddings if available
  • Classification Layer:
    • Trains a classifier (Random Forest, XGBoost, or GPU Forest)
    • Applies classifier only to ambiguous issues
    • Calculates classification uncertainty as a feature
  • Regression Layer:
    • Trains multiple regression models (Random Forest, Gradient Boosting, XGBoost, GPU Forest)
    • Evaluates and compares model performance
    • Visualizes feature importance and errors
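
The routing between the deterministic rules and the trained classifier can be sketched as follows. The regex patterns, the function names, and the choice of binary entropy as the uncertainty measure are assumptions made for illustration; predict_resolution_time.py may implement each of these differently.

```python
import math
import re

# Hypothetical patterns: the regexes used by the real script are not shown here.
WONT_FIX = re.compile(r"\b(?:duplicate|wontfix|won't fix|invalid)\b", re.IGNORECASE)
VALID = re.compile(r"\b(?:fixed in|resolved by)\b", re.IGNORECASE)


def deterministic_label(text):
    """Return 'wont_fix' or 'valid' when exactly one pattern matches, else None."""
    wont_fix = bool(WONT_FIX.search(text))
    valid = bool(VALID.search(text))
    if wont_fix != valid:
        return "wont_fix" if wont_fix else "valid"
    return None  # ambiguous: neither or both patterns matched


def binary_entropy(p):
    """Uncertainty in bits: 0 for a confident prediction, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def classify(text, predict_proba_valid):
    """Apply the regex rules first; use the model only for ambiguous issues."""
    label = deterministic_label(text)
    if label is not None:
        return label, 0.0  # rule-based labels carry no model uncertainty
    p_valid = predict_proba_valid(text)
    return ("valid" if p_valid >= 0.5 else "wont_fix"), binary_entropy(p_valid)
```

Here predict_proba_valid stands in for any trained classifier that returns the probability of the valid class. The second element of the returned pair is the value that would be passed to the regression layer as the uncertainty feature.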

2. ensemble.py

This script creates and evaluates ensemble models:

  • Ensemble Creation:
    • Loads trained models from the previous script
    • Implements various ensemble methods (weighted, average, voting)
    • Optionally optimizes weights for weighted ensemble
  • Performance Analysis:
    • Compares ensemble to individual models
    • Calculates improvement percentages
    • Analyzes errors by issue category
  • Feature Importance:
    • Calculates weighted feature importance across models
    • Visualizes importance by feature and category
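
The weighted ensemble method can be illustrated with a short sketch. It assumes each model's predictions are available as an equal-length list of resolution times in hours; the function name and the model keys in the usage note are hypothetical.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions into one weighted-average prediction per issue.

    `predictions` maps a model name to a list of predicted values;
    `weights` maps the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in predictions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    n_issues = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total_weight
        for i in range(n_issues)
    ]
```

With predictions {"rf": [10.0, 20.0], "xgb": [30.0, 40.0]} and weights {"rf": 1.0, "xgb": 3.0}, the result is [25.0, 35.0]. Setting all weights equal reduces this to the plain average, and the same weighting can be applied to per-model feature importances.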

Technical Implementation Details

GPU Acceleration

The system uses GPU acceleration through:

  • cuml for GPU-based Random Forest
  • cupy for GPU array operations
  • XGBoost with tree_method='gpu_hist'

This provides significant speedup for large datasets.

Error Handling

The system implements robust error handling:

  • Graceful fallback to CPU if GPU fails
  • Handling missing features and values
  • Data validation and cleaning

This ensures the pipeline can process diverse GitHub data reliably.
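
A minimal sketch of the GPU-to-CPU fallback pattern is shown below. It checks whether the GPU libraries are importable and retries on the CPU if GPU training raises. The helper names are hypothetical, and the actual scripts may detect and handle GPU failures differently.

```python
import importlib.util


def select_backend():
    """Return 'gpu' when cuml and cupy are importable, otherwise 'cpu'."""
    gpu_libs = ("cuml", "cupy")
    available = all(importlib.util.find_spec(name) is not None for name in gpu_libs)
    return "gpu" if available else "cpu"


def xgboost_params(backend):
    """Pick the XGBoost tree method for the chosen backend."""
    # 'gpu_hist' is the setting named in this report; 'hist' is the CPU histogram method.
    return {"tree_method": "gpu_hist" if backend == "gpu" else "hist"}


def fit_with_fallback(fit_gpu, fit_cpu):
    """Try GPU training first and retry on the CPU if it raises."""
    try:
        return fit_gpu()
    except Exception as exc:  # broad on purpose: driver, memory, and import errors
        print(f"GPU training failed ({exc!r}); falling back to CPU")
        return fit_cpu()
```

Note that XGBoost 2.0 and later deprecate tree_method='gpu_hist' in favor of tree_method='hist' combined with device='cuda', so the parameter choice depends on the installed version.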

Critical Path Analysis

Our runtime summary shows:

The significant increase in runtime with embeddings (14.4x slower) highlights the computational cost of processing the additional 50 embedding dimensions for each issue.


Cloud Migration Plan

Current Limitations

The current system has several limitations that motivate a cloud migration:

Cloud Migration Benefits

  • Scalability: Easily scale up compute resources for larger datasets
  • Data Access: Process the comprehensive GitHub Archive data
  • Parallel Processing: Distribute workloads across multiple nodes
  • Cost Efficiency: Pay only for resources used during processing

GitHub Archive Integration

The GitHub Archive provides a comprehensive record of GitHub events:

Text Embedding Model Analysis

HuggingFace Multilingual E5-Large-Instruct

The system uses intfloat/multilingual-e5-large-instruct for generating embeddings.

Advantages:

  • Speed: Relatively fast inference compared to larger models
  • Multilingual: Handles issues in different languages
  • Instruction-tuned: Better at capturing task-specific semantics
  • Reasonable Size: 560M parameters (vs. billions in larger models)

Limitations:

  • Quality Tradeoff: Not as high-quality as larger embedding models
  • Technical Content: May not fully capture programming-specific semantics
  • Context Length: Limited context window compared to newer models

Cloud Implementation Roadmap

  1. Data Pipeline: Set up pipeline to ingest and process GitHub Archive data
  2. Distributed Processing: Refactor code for distributed execution (e.g., using Spark)
  3. Embedding Service: Deploy embedding model as a scalable service
  4. GPU Instances: Use cloud GPU instances for model training
  5. Storage Optimization: Implement efficient storage of embeddings and features
  6. Class Imbalance Handling: Implement techniques like oversampling or cost-sensitive learning
  7. Monitoring: Set up performance monitoring and alerting
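
For step 6, one common form of cost-sensitive learning weights each class inversely to its frequency. The sketch below uses the n_samples / (n_classes * class_count) heuristic, the same formula scikit-learn applies for class_weight='balanced'. Whether the system will adopt this particular scheme is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

With 8 valid and 2 won't-fix issues, the weights are 0.625 and 2.5 respectively, so errors on the minority class cost four times as much during training.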

Trade-offs and Considerations

When migrating to the cloud, several trade-offs need to be considered:

  • Cost vs. Performance: More powerful instances cost more but reduce processing time
  • Embedding Quality vs. Speed: Consider testing larger embedding models despite increased cost
  • Data Volume vs. Detail: Balance between processing more repositories or deeper analysis
  • Real-time vs. Batch Processing: Determine if predictions need to be real-time or batched