GitHub Issue Resolution Time Prediction System Analysis

System Overview & Model Development

The GitHub Issue Resolution Time Prediction System uses a two-layer approach to analyze and predict how long it will take to resolve GitHub issues. A GitHub issue is a way to track tasks, enhancements, bugs, or other items of work within a GitHub repository; it serves as a discussion forum and progress tracker for a specific topic.

Two-Layer Prediction System

Classification Layer: Determines whether issues are valid or won't-fix.
- Valid: Issues that will be fixed or addressed by developers. These typically include actual bugs, feature requests, or enhancements that align with the project's roadmap.
- Won't-fix: Issues that developers decide not to address. These may include duplicate reports, invalid bug reports, requests outside the project scope, or features that conflict with the project's design philosophy.

Regression Layer: For valid issues, predicts resolution time, which is the duration (measured in hours) between when an issue is created and when it is finally closed. This metric is crucial for project planning and for setting expectations with users about when their reported issues might be addressed.

Classification Methods

The system uses two approaches to classify issues:

Deterministic Classification

Uses regular expressions to find keywords in issue labels that clearly indicate whether an issue is valid or won't-fix (e.g., "wontfix", "invalid", "bug", "enhancement"). Labels are tags such as "bug", "enhancement", "wontfix", or "duplicate" that categorize issues; they are often applied manually by project maintainers and provide valuable signals about an issue's status and nature.
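The label-matching step can be sketched as follows. This is a minimal illustration, not the system's actual rules: the two patterns and the function name `classify_by_labels` are assumptions built only from the example keywords mentioned above.

```python
import re
from typing import Iterable, Optional

# Hypothetical keyword patterns; the real expressions are not given in this report.
WONTFIX_PATTERN = re.compile(r"won'?t[\s_-]?fix|invalid|duplicate", re.IGNORECASE)
VALID_PATTERN = re.compile(r"bug|enhancement|feature", re.IGNORECASE)


def classify_by_labels(labels: Iterable[str]) -> Optional[str]:
    """Return 'wont-fix', 'valid', or None when the labels are not decisive."""
    labels = list(labels)
    is_wontfix = any(WONTFIX_PATTERN.search(label) for label in labels)
    is_valid = any(VALID_PATTERN.search(label) for label in labels)
    if is_wontfix and not is_valid:
        return "wont-fix"
    if is_valid and not is_wontfix:
        return "valid"
    # No labels, unknown labels, or conflicting signals: leave the issue undecided.
    return None
```

Returning `None` for conflicting or missing labels keeps the deterministic step conservative, so only unambiguous issues are decided by rules.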

Machine Learning Classification

For ambiguous issues (those that lack clear labels or contain conflicting signals, so that simple rules cannot determine their status), the system uses Random Forest, XGBoost, or other classifiers to predict whether the issue is valid or won't-fix, based on patterns learned from previously labeled issues. Random Forest builds multiple decision trees during training and outputs the mode of the individual trees' predictions, which helps avoid overfitting and handles complex relationships between features. XGBoost is an optimized gradient boosting library that builds an ensemble of weak prediction models (typically decision trees) to produce a strong classifier.
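How the two classification methods might fit together is sketched below. The function name, the callable parameters, and the 0.5 threshold are all assumptions for illustration; the report does not state how the deterministic and machine learning steps are wired up or which threshold is used.

```python
from typing import Callable, Optional, Sequence


def classify_issue(
    labels: Sequence[str],
    label_rule: Callable[[Sequence[str]], Optional[str]],
    wontfix_probability: Callable[[], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the report
) -> str:
    """Apply the rule-based step first; fall back to a model for ambiguous issues."""
    decision = label_rule(labels)
    if decision is not None:
        return decision
    # The model is only consulted when the labels were not decisive.
    return "wont-fix" if wontfix_probability() >= threshold else "valid"
```

Passing the model as a callable keeps the sketch independent of any particular library; in practice it would wrap a trained classifier's probability output for the won't-fix class.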

Resolution Time Prediction

For issues classified as valid, the system trains multiple machine learning models to predict resolution time:

Random Forest: An ensemble learning method that builds multiple decision trees during training and merges their predictions. For regression tasks like predicting resolution time, it averages the predictions of individual trees to produce a more accurate and stable result.
Gradient Boosting: A machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds models sequentially, with each new model correcting errors made by previously trained ones, making it powerful for both classification and regression tasks.
XGBoost: An optimized implementation of gradient boosting that uses a more regularized model formalization to control overfitting. Its efficiency, accuracy, and ability to handle sparse data make it well suited to predicting complex patterns in GitHub issue resolution times.
GPU Forest: A GPU-accelerated implementation of Random Forest that uses parallel computing to speed up training and prediction. This allows processing of larger datasets and more complex models than would be feasible with CPU-only implementations.

Finally, an ensemble model combines the predictions from all models. By aggregating the individual predictions, the ensemble reduces errors made by any single model and is intended to perform better than its constituents.
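One simple way to combine the models is an unweighted average of their per-issue predictions, sketched below. The averaging rule and the name `ensemble_predict` are assumptions; the report does not say how the ensemble weights its members.

```python
from statistics import fmean
from typing import List, Mapping, Sequence


def ensemble_predict(predictions: Mapping[str, Sequence[float]]) -> List[float]:
    """Average per-issue predictions (in hours) across models, with equal weights."""
    per_model = list(predictions.values())
    if not per_model:
        raise ValueError("at least one model is required")
    n_issues = len(per_model[0])
    if any(len(p) != n_issues for p in per_model):
        raise ValueError("all models must predict the same number of issues")
    # zip(*per_model) groups the predictions for each issue across models.
    return [fmean(values) for values in zip(*per_model)]
```

For example, `ensemble_predict({"rf": [10.0, 20.0], "xgb": [30.0, 40.0]})` averages the two models issue by issue.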

Model Development Process

The development of this prediction system involved several key stages in data handling and feature creation:

Data Acquisition and Storage:

Data Collection: Approximately 2 million GitHub issues were downloaded. These were sourced from the top 100 repositories across diverse and popular programming languages, including Python, Go, and Java, to ensure a broad and representative dataset.
Data Management: The collected issue data was then upserted into a PostgreSQL database. This relational database was chosen for its robustness and efficiency in handling large volumes of structured and semi-structured data, facilitating complex queries and data retrieval for model training.
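An upsert inserts a row if its key is new and updates it otherwise, so re-downloading an issue does not create duplicates. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a runnable stand-in; it is not the system's PostgreSQL code. The table and column names are hypothetical. PostgreSQL accepts the same `ON CONFLICT ... DO UPDATE` clause, though the placeholder style depends on the driver.

```python
import sqlite3

# Hypothetical schema; the report does not describe the real table layout.
UPSERT_SQL = """
INSERT INTO issues (id, repo, title, state)
VALUES (?, ?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    repo = excluded.repo,
    title = excluded.title,
    state = excluded.state
"""


def upsert_issue(conn: sqlite3.Connection, issue: dict) -> None:
    """Insert the issue, or overwrite the stored row that has the same id."""
    conn.execute(
        UPSERT_SQL,
        (issue["id"], issue["repo"], issue["title"], issue["state"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (id INTEGER PRIMARY KEY, repo TEXT, title TEXT, state TEXT)"
)
upsert_issue(conn, {"id": 1, "repo": "org/project", "title": "Crash on start", "state": "open"})
# The same issue is fetched again later, now closed: the row is updated in place.
upsert_issue(conn, {"id": 1, "repo": "org/project", "title": "Crash on start", "state": "closed"})
rows = conn.execute("SELECT id, state FROM issues").fetchall()
```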

Feature Engineering:

A crucial step was the engineering of a rich set of features from the raw issue data. This process aimed to extract meaningful signals that could predict resolution times. Key feature categories include:

Text Embeddings: Text embeddings were generated for the title and body of each issue. These embeddings convert textual information into dense vector representations, capturing the semantic meaning and context, which are vital for understanding the issue's content.
Numerical Features: A variety of numerical features were derived, such as:
- The number of issues concurrently open within the same repository at the time of an issue's creation (reflecting repository load).
- The total number of unique contributors to the repository (indicating community size and activity).
Additional features, as detailed in the Feature Analysis section (e.g., time-based features, content metadata), were also engineered to provide a comprehensive view of each issue.
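The repository-load feature can be illustrated with a small sketch. It assumes each issue is available as a `(created_at, closed_at)` pair, with `closed_at` set to `None` while the issue is open; the function name and data layout are hypothetical.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

# (created_at, closed_at); closed_at is None for issues that are still open.
Issue = Tuple[datetime, Optional[datetime]]


def open_issues_at(moment: datetime, repo_issues: Sequence[Issue]) -> int:
    """Count issues in one repository that were open at `moment`."""
    return sum(
        1
        for created_at, closed_at in repo_issues
        if created_at < moment and (closed_at is None or closed_at > moment)
    )
```

This linear scan is easy to verify but costs one pass per query; for millions of issues, a sorted sweep over creation and close events would likely be preferable.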

This data foundation and engineered feature set form the basis for the two-layer prediction approach and for the models trained and evaluated in this system.

Classification Performance

Dataset Distribution

[Charts: Issue Distribution; Classification Method]

Class Imbalance Challenge

The dataset exhibits severe class imbalance:

Valid Issues: 98.1% (550,487 issues)
Won't-fix Issues: 1.9% (10,562 issues)

This extreme imbalance makes it difficult for the model to correctly identify won't-fix issues, resulting in low recall despite high accuracy.

Classification Metrics

Accuracy: 0.9653 (96.5%). The proportion of all predictions (both positive and negative) that were correct. This figure can be misleading with imbalanced datasets: always predicting the majority class (valid) would score about 98.1% on data with the class distribution reported above.
Precision: 0.9552 (95.5%). The proportion of won't-fix predictions that were actually correct. When the model predicts won't-fix, it is usually right, but precision does not show how many won't-fix issues are missed.
Recall: 0.1041 (10.4%). The proportion of actual won't-fix issues that were correctly identified. The model misses about 90% of the issues that will not be fixed, likely because of the severe class imbalance in the training data.
F1 Score: 0.1878. The harmonic mean of precision and recall, a single metric that balances both. The low value indicates poor overall performance despite high accuracy, because the model fails to identify most won't-fix issues.
AUC: 0.8542. Area under the receiver operating characteristic (ROC) curve, which measures the model's ability to distinguish between classes across threshold settings. The relatively high value suggests good discriminative power that is not being used effectively because of the imbalanced data.
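The relationship between these metrics can be checked with a short sketch. The helper names are illustrative. The only figures taken from the report are the precision and recall values, from which the F1 score follows up to rounding.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (won't-fix) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above; the result agrees with the
# reported F1 of 0.1878 to within rounding of the inputs.
reported_f1 = f1_from(0.9552, 0.1041)
```

Because the harmonic mean is dominated by the smaller value, a recall near 0.10 keeps F1 below 0.19 even with precision above 0.95.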

Performance Analysis

The high accuracy (96.5%) is misleading due to the class imbalance. The low recall (10.4%) indicates that the model is missing about 90% of the actual won't-fix issues. This is a critical limitation of the current approach.

Implications of Low Recall

The poor recall for won't-fix issues means that:

Many won't-fix issues are incorrectly classified as valid
The system will waste time predicting resolution times for issues that won't be fixed
The resolution time predictions may be less accurate because the dataset includes mislabeled won't-fix issues

Resolution Time Prediction Performance

Model	MAE (hrs)	RMSE (hrs)	Median AE (hrs)	R²
Without Embeddings
Random Forest	3906.01	10431.53	341.10	0.1303
Gradient Boosting	3938.32	10490.13	328.57	0.1379
XGBoost	3933.83	10483.27	329.33	0.1402
GPU Forest	3920.53	10488.01	320.06	0.1380
Ensemble	3925.11	10486.19	321.20	0.1459
With Embeddings
Random Forest	3878.43	10390.64	322.83	0.1825
Gradient Boosting	3881.12	10339.36	350.04	0.1821
XGBoost	3875.79	10321.36	352.11	0.1870
GPU Forest	3891.70	10416.41	323.42	0.1737
Ensemble	3879.43	10379.16	327.31	0.1903

MAE = Mean Absolute Error, RMSE = Root Mean Square Error, Median AE = Median Absolute Error, R² = R-squared coefficient of determination
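For reference, the four metrics in the table can be computed as in the sketch below. It uses the standard definitions; the function name and the toy numbers in the usage note are illustrative and are not taken from the system's data.

```python
from math import sqrt
from statistics import fmean, median
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Return MAE, RMSE, median absolute error, and R² for paired values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    mean_actual = fmean(actual)
    ss_res = sum(e * e for e in errors)                      # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)     # total sum of squares
    return {
        "mae": fmean(abs_errors),
        "rmse": sqrt(ss_res / len(errors)),
        "median_ae": median(abs_errors),
        "r2": 1 - ss_res / ss_tot,
    }
```

With `actual=[1, 2, 3, 4]` and `predicted=[1, 2, 3, 8]`, a single large miss gives an MAE of 1.0 and an RMSE of 2.0 while the median absolute error stays at 0. The same effect explains why the table's MAE and RMSE are an order of magnitude larger than its median errors.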

Analysis: The significant importance of classification uncertainty and text embeddings suggests that the semantic content of issues is valuable for predicting resolution time. Semantic content here means the underlying meaning and context expressed in the issue text, beyond keywords or surface-level statistics: the problem domain, the complexity of the description, its technical specificity, and how the issue is framed, all of which can influence how quickly it is resolved. The system leverages this through the two-layer approach.

Resolution Time Distribution Challenge

Resolution Time Statistics:

Mean: 3,404.38 hours (~142 days)
Median: 105.51 hours (~4.4 days)
Maximum: 148,492.47 hours (~17 years)
25th Percentile: 1.64 hours
75th Percentile: 1,752.48 hours (~73 days)
95th Percentile: 20,572.32 hours (~857 days)

The extreme disparity between mean (3,404 hours) and median (105.5 hours) resolution times reveals a highly skewed distribution with significant outliers. With some issues taking nearly 17 years to resolve while others are addressed within hours, the current model's modest R² value of 0.19 is unsurprising.

This extreme variability makes the current model appear ineffective for practical predictions. However, with additional data and features related to repository maintainers, issue authors, and codebase characteristics (such as complexity metrics, test coverage, and component dependencies), a significantly more powerful predictive model could be developed. Incorporating such contextual features would likely capture the organizational and technical factors that drive the wide variance in resolution times.

Impact of Text Embeddings

In natural language processing, a text embedding represents a sentence as a vector of numbers that encodes semantic information; state-of-the-art embeddings are based on the learned hidden-layer representations of dedicated sentence transformer models. These vectors let machine learning models use the meaning behind issue descriptions, capturing nuances that simple word counts or character statistics miss. Adding text embeddings improved the ensemble's prediction performance, most clearly in R²:

R² improved from 0.1459 to 0.1903 (a 30.4% relative increase)
MAE decreased by 45.68 hours
RMSE decreased by 107.03 hours
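These deltas follow directly from the two Ensemble rows of the performance table, as the short check below shows. The variable names are illustrative.

```python
# Ensemble rows from the performance table (hours for MAE and RMSE).
without_embeddings = {"mae": 3925.11, "rmse": 10486.19, "r2": 0.1459}
with_embeddings = {"mae": 3879.43, "rmse": 10379.16, "r2": 0.1903}

# Relative gain in R²: about 30.4%.
r2_relative_gain = (with_embeddings["r2"] - without_embeddings["r2"]) / without_embeddings["r2"]

# Absolute drops in hours.
mae_drop = without_embeddings["mae"] - with_embeddings["mae"]
rmse_drop = without_embeddings["rmse"] - with_embeddings["rmse"]

# The same drops relative to the baseline error: roughly 1.2% and 1.0%.
mae_relative_drop = mae_drop / without_embeddings["mae"]
rmse_relative_drop = rmse_drop / without_embeddings["rmse"]
```

The relative view is a useful complement: the R² gain is large in relative terms, while the error metrics in hours move by only about one percent.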

This suggests that the semantic content of issuesA GitHub Issue is a way to track tasks, enhancements, bugs, or other items of work within a GitHub repository. It serves as a discussion forum and progress tracker for specific topics. contains valuable information for predicting resolution time.

Error Analysis by Category

By Author Type

Issues from core contributors have:

Lower mean error (2873.02 vs. 3905.38 hours)
Higher median error (415.46 vs. 324.82 hours)

This suggests core contributor issues have fewer extreme outliers but may have moderate delays.

By Creation Time

Issues created on weekends have:

Higher mean error (3970.00 vs. 3863.46 hours)
Lower median error (310.12 vs. 330.43 hours)

Weekend issues may be less predictable, with more extreme resolution times.

By Issue Content

Issues that include code have:

Lower mean error (3839.15 vs. 3962.73 hours)
Lower median error (319.51 vs. 343.41 hours)

Code inclusion appears to make resolution time more predictable.

Interpretation of Results

While an R² of 0.19 may seem low, predicting GitHub issue resolution times is inherently challenging due to many factors outside the scope of the data. The improvement when adding embeddings indicates that the text content of issues provides valuable signals for prediction.

Feature Importance Analysis

Detailed Analysis of Top Predictive Features

Understanding which features drive prediction performance provides valuable insights into what factors affect GitHub issue resolution times. This section expands on the visualizations above with detailed comparisons between models and feature sets.

Top Features Without Embeddings: Ensemble Model

Rank	Feature	Importance	Definition	Description
1	classification_uncertainty	25.58%	A measure of how confident the classification model is about whether an issue will be fixed. Higher values indicate that the model found the classification decision difficult, which often correlates with more complex or ambiguous issues.	The single most important predictor, suggesting that issues that are difficult to classify tend to have less predictable resolution times.
2	created_year	7.19%	The year in which the issue was created. This captures the project's maturity and the evolution of development practices over time.	Strong temporal signal indicating that resolution patterns change significantly over a project's lifetime.
3	body_length	5.18%	The total character count in the issue description. Longer descriptions often contain more detail, which can either help resolution (by providing more context) or indicate complexity.	Issues with longer descriptions may be more complex but also provide more context for resolution.
4	body_word_count	4.47%	The number of words in the issue description. Similar to body_length but normalized for language patterns (e.g., some languages use more characters per word than others).	Complements body_length; the two together suggest that issue verbosity is a strong signal.
5	created_day_of_month	4.19%	The day of the month (1-31) when the issue was created. This can capture cyclical patterns in development activity and issue creation.	Suggests monthly cycles in development activity affect resolution time.
6	wontfix_probability	3.65%	The model's estimate of how likely an issue is to be classified as won't-fix. Higher probabilities suggest the issue might be borderline, potentially affecting handling time even if it is ultimately addressed.	Issues with higher probability of being won't-fix (even if classified as valid) tend to have distinctive resolution patterns.
7	code_to_text_ratio	3.62%	The ratio of code characters to total characters in the issue description. A higher ratio indicates more code examples or snippets relative to explanatory text.	Issues with more code relative to explanatory text may be more precisely defined.
8	link_count	3.57%	The number of hyperlinks in the issue description. Links often point to related issues, documentation, or external resources that provide context or prerequisites.	Issues that reference other resources may have dependencies or broader context affecting resolution time.
9	created_hour	3.41%	The hour of the day (0-23) when the issue was created. This can indicate whether the issue was created during typical working hours or off-hours.	The time of day when issues are created correlates with resolution patterns, possibly reflecting timezone differences or work schedules.
10	created_month	2.38%	The month (1-12) when the issue was created. This can capture seasonal patterns in development activity.	Seasonal patterns affect resolution time, possibly due to release cycles, holidays, or contributor availability.

Top Features With Embeddings: Ensemble Model

Rank	Feature	Importance	Shift (percentage points)
1	classification_uncertainty	11.86%	↓ 13.72%
2	created_year	5.06%	↓ 2.13%
3	wontfix_probability	2.46%	↓ 1.19%
4	embed_0	2.33%	New
5	embed_5	1.89%	New
6	embed_2	1.81%	New
7	link_count	1.80%	↓ 1.77%
8	list_item_count	1.60%	↓ 0.77%
9	embed_3	1.47%	New
10	embed_4	1.31%	New

Key Shifts When Adding Embeddings

Classification uncertainty drops from 25.58% to 11.86% importance but remains the top individual feature
Embedding dimensions collectively account for 63.2% of predictive power
Content-based features like body_length and body_word_count drop dramatically in importance as embeddings capture their semantic signal more effectively
Temporal features (created_year, created_day_of_month) decline in importance but remain influential
Overall distribution becomes more even, with no single feature dominating

XGBoost vs. Ensemble: Feature Importance Comparison

XGBoost Feature Importance (No Embeddings)

Feature	Importance
classification_uncertainty	85.20%
created_year	1.97%
has_error_message	1.40%
author_is_core_contributor	1.26%
link_count	1.22%

Ensemble Feature Importance (No Embeddings)

Feature	Importance
classification_uncertainty	25.58%
created_year	7.19%
body_length	5.18%
body_word_count	4.47%
created_day_of_month	4.19%

XGBoost Feature Importance (With Embeddings)

Feature	Importance
classification_uncertainty	38.18%
wontfix_probability	4.88%
created_year	3.54%
list_item_count	2.60%
link_count	2.57%

Ensemble Feature Importance (With Embeddings)

Feature	Importance
classification_uncertainty	11.86%
created_year	5.06%
wontfix_probability	2.46%
embed_0	2.33%
embed_5	1.89%

Model Comparison Insights

The striking differences between XGBoost and the ensemble model's feature importance reveal fundamental differences in how these models learn:

XGBoost's Focus: XGBoost places extraordinary importance (85.20%) on classification_uncertainty when embeddings are not used, suggesting it relies heavily on this single feature for predictions.
Ensemble's Balanced Approach: The ensemble model distributes importance more evenly across features, suggesting it leverages multiple signals more effectively.
Embedding Effect: Adding embeddings reduces XGBoost's reliance on classification_uncertainty from 85.20% to 38.18%, while the ensemble model's drops from 25.58% to 11.86%. Both fall by a little over half in relative terms, but XGBoost's absolute drop (about 47 percentage points) is far larger than the ensemble's (about 14).
Feature Selection: XGBoost identifies different secondary features as important compared to the ensemble; for example, has_error_message and author_is_core_contributor rank higher in XGBoost.

Individual Model Analysis

Random Forest Importance Pattern

Random Forest shows a more balanced feature distribution:

Top Features: body_length (12.81%), body_word_count (10.96%), created_day_of_month (10.19%)
Classification features rank lower: uncertainty (6.15%), wontfix_probability (6.39%)
With embeddings: No single embedding dimension exceeds 2.76% importance
Weights content-based features more heavily than the other models do

Gradient Boosting Importance Pattern

Gradient Boosting shows a hybrid approach:

Top Features: created_year (19.77%), classification_uncertainty (10.98%), link_count (8.19%)
Has the strongest emphasis on temporal features among all models
With embeddings: Still prioritizes created_year (12.65%) over classification features
Weights link_count more heavily than the other models do

Feature Category Analysis

[Charts: Feature Categories (Without Embeddings); Feature Categories (With Embeddings). Each chart totals 100% importance.]

Classification Layer Features

Classification-related features provide critical signals:

classification_uncertainty - Issues with ambiguous classification signals tend to have unpredictable resolution times
wontfix_probability - Higher probabilities correlate with longer resolution times, even for issues ultimately classified as valid
The decreasing importance of these features when embeddings are added suggests that embedding dimensions capture some of the same semantic signals
XGBoost's extreme reliance on classification_uncertainty suggests it identifies subtle patterns in this feature that other models miss
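The report does not state how classification_uncertainty is computed. One plausible definition, shown here strictly as an assumption, derives it from wontfix_probability so that it peaks when the classifier is undecided (probability 0.5) and vanishes when the classifier is certain.

```python
def classification_uncertainty(wontfix_probability: float) -> float:
    """Assumed definition: 1.0 at p = 0.5, falling linearly to 0.0 at p = 0 or p = 1."""
    if not 0.0 <= wontfix_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - abs(2.0 * wontfix_probability - 1.0)
```

Under this definition the two classification features carry related but distinct information: wontfix_probability indicates the direction of the decision, while the uncertainty measures only how close the call was.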

Temporal Features

Time-related features reveal important patterns:

created_year - Projects evolve significantly over time, affecting resolution patterns
created_day_of_month - Monthly cycles affect resolution (possibly due to sprint patterns or release schedules)
created_hour - Issues created at different times of day show different resolution patterns
Gradient Boosting places the highest importance on temporal features, particularly created_year
These features remain important even with embeddings, suggesting they capture signals separate from semantic content
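The temporal features above can all be read off an issue's creation timestamp, as in this minimal sketch. The function name is illustrative, and `created_on_weekend` is a hypothetical extra field that mirrors the weekend comparison in the error analysis; it is not a feature listed in the tables.

```python
from datetime import datetime
from typing import Dict, Union


def temporal_features(created_at: datetime) -> Dict[str, Union[int, bool]]:
    """Derive time-based features from an issue's creation timestamp."""
    return {
        "created_year": created_at.year,
        "created_month": created_at.month,          # 1-12
        "created_day_of_month": created_at.day,     # 1-31
        "created_hour": created_at.hour,            # 0-23
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "created_on_weekend": created_at.weekday() >= 5,
    }
```

Which timezone the timestamp is expressed in affects `created_hour` and the weekend flag, so a consistent convention such as UTC would need to be chosen.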

Content Features

Issue content characteristics provide strong signals:

body_length/word_count - Longer issues likely involve more complex problems
code_to_text_ratio - More code relative to explanation may indicate better reproducibility
link_count - References to other resources may indicate dependencies
list_item_count - More structured content may be clearer and easier to resolve
Random Forest places the highest emphasis on these features
These features decrease dramatically in importance when embeddings are added as embeddings capture their semantic information more effectively
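A rough sketch of how such content features might be extracted from a Markdown issue body follows. The regular expressions are simplifying assumptions: only fenced code blocks count as code, only http(s) URLs count as links, and list items are lines that start with a bullet or a number. The system's actual extraction rules are not described in this report.

```python
import re
from typing import Dict, Union

FENCE = "`" * 3  # Markdown code-fence marker
CODE_BLOCK = re.compile(FENCE + r"(.*?)" + FENCE, re.DOTALL)
LINK = re.compile(r"https?://\S+")
LIST_ITEM = re.compile(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", re.MULTILINE)


def content_features(body: str) -> Dict[str, Union[int, float, bool]]:
    """Compute simple content statistics for one issue body."""
    # Characters between fences, including any language tag and newlines.
    code_chars = sum(len(block) for block in CODE_BLOCK.findall(body))
    return {
        "body_length": len(body),
        "body_word_count": len(body.split()),
        "has_code": code_chars > 0,
        "code_to_text_ratio": code_chars / len(body) if body else 0.0,
        "link_count": len(LINK.findall(body)),
        "list_item_count": len(LIST_ITEM.findall(body)),
    }
```

Note that `has_code` is a coarse flag while `code_to_text_ratio` is continuous, which is one reason the two can receive very different importance scores.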

Embedding Features

Text embeddings dominate prediction when available:

Collectively account for 63.2% of feature importance in the ensemble model
Embed_0, embed_5, and embed_2 are consistently the most important embedding dimensions across models
These embedding dimensions likely capture semantic aspects of issuesA GitHub Issue is a way to track tasks, enhancements, bugs, or other items of work within a GitHub repository. It serves as a discussion forum and progress tracker for specific topics. like:

Problem domain (UI, backend, database, etc.)
Issue type (bug, feature request, documentation)
Technical complexity signals
Writing style and clarity

The embedding model effectively condenses semantic information that would otherwise require many engineered features

Surprising Feature Findings

Model Disagreement on Author Features

Models disagree significantly on the importance of author-related features:

XGBoost: Ranks author_is_core_contributor as its 4th most important feature (1.26% importance)
Ensemble: Ranks author_is_core_contributor as 17th (0.67% importance)
Random Forest: Ranks it last at 22nd (0.31% importance)
This disagreement suggests the relationship between author status and resolution time is complex and possibly non-linear
XGBoost may be better at capturing specific resolution patterns for core contributor issues

Bug Identification Features

Features related to bug identification show surprising patterns:

has_error_message: Ranked 3rd (1.40%) by XGBoost but only 14th (1.05%) by the ensemble
has_reproduction_steps: Generally low importance across all models despite being considered valuable in bug reports
has_code: Very low importance (typically below 0.6%) compared to code_to_text_ratio (3.62% in ensemble)
This suggests that the quality and proportion of code in an issue is more important than its mere presence
XGBoost appears more sensitive to specific technical indicators in the issue content

Strategic Feature Engineering Opportunities

Based on the comprehensive feature analysis, several opportunities for improved feature engineering emerge:

Repository Context Features
Repository age and maturity
Number of active contributors
Contributor-to-issue ratio
Release cadence and proximity to releases
Codebase size and complexity metrics
Test coverage and CI/CD pipeline metrics

                

Issue Relationship Features
Number of related or linked issues
Dependency graph metrics
Issue priority relative to other open issues
Component popularity/activity level
Historical resolution times for similar issues
Issue "heat" (comments, reactions, subscriptions)

                

Developer Context Features
Author contribution history and expertise areas
Maintainer availability patterns
Reviewer response times
Team workload indicators
Geographic distribution of contributors
Timezone alignment between author and maintainers

                

Advanced Semantic Features
Topic modeling of issue content
Named entity recognition for technologies mentioned
Code-specific embeddings for technical content
Sentiment analysis of issue descriptions
Complexity metrics for code snippets
Technical jargon density analysis

                

Incorporating these additional feature categories could substantially improve the model's predictive power beyond the current R² of 0.19, potentially making the system more useful for practical project planning and resource allocation.

System Architecture Analysis

Python Code Analysis

The system consists of two main Python scripts:

1. `predict_resolution_time.py`

This script implements the two-layer prediction system:

Data Loading and Preprocessing:
- Loads repository data with issue details
- Applies regex-based deterministic classification
- Extracts features from creation time
- Loads and integrates text embeddings if available
Classification Layer:
- Trains a classifier (Random Forest, XGBoost, or GPU Forest)
- Applies classifier only to ambiguous issues
- Calculates classification uncertainty as a feature
Regression Layer:
- Trains multiple regression models (Random Forest, Gradient Boosting, XGBoost, GPU Forest)
- Evaluates and compares model performance
- Visualizes feature importance and errors
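
The routing between the regex-based rules and the trained classifier can be sketched as follows. This is an illustrative outline only: the label patterns, the function names, and the use of binary entropy as the uncertainty measure are all assumptions, since the script's actual rules and uncertainty formula are not shown in this report.

```python
import math
import re

# Hypothetical patterns; the real deterministic rules are not given in the source.
WONT_FIX_RE = re.compile(r"won'?t[- ]?fix|duplicate|invalid", re.IGNORECASE)
VALID_RE = re.compile(r"bug|enhancement|feature", re.IGNORECASE)


def deterministic_class(labels):
    """Return 'wont-fix' or 'valid' when a rule matches, else None (ambiguous)."""
    text = " ".join(labels)
    if WONT_FIX_RE.search(text):
        return "wont-fix"
    if VALID_RE.search(text):
        return "valid"
    return None


def uncertainty(p_valid):
    """Binary entropy in bits: 0 when certain, 1 when p_valid is 0.5."""
    if p_valid in (0.0, 1.0):
        return 0.0
    q = 1.0 - p_valid
    return -(p_valid * math.log2(p_valid) + q * math.log2(q))


def classify(labels, predict_proba):
    """Apply rules first; call the classifier only for ambiguous issues."""
    rule = deterministic_class(labels)
    if rule is not None:
        return rule, 0.0  # rule-based decisions carry no model uncertainty
    p = predict_proba(labels)  # stand-in for the trained classifier
    return ("valid" if p >= 0.5 else "wont-fix"), uncertainty(p)


print(classify(["bug"], lambda _: 0.1))       # decided by rule
print(classify(["question"], lambda _: 0.8))  # decided by classifier
```

The second return value is what would be passed to the regression layer as the classification-uncertainty feature.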

2. `ensemble.py`

This script creates and evaluates ensemble models:

Ensemble Creation:
- Loads trained models from the previous script
- Implements various ensemble methods (weighted, average, voting)
- Optionally optimizes weights for weighted ensemble
Performance Analysis:
- Compares ensemble to individual models
- Calculates improvement percentages
- Analyzes errors by issue category
Feature Importance:
- Calculates weighted feature importance across models
- Visualizes importance by feature and category
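
A weighted ensemble and the corresponding weighted feature importance can be sketched as below. The model names, weights, and importance values are made-up placeholders for illustration, and the helper names are hypothetical; only the weighted-average idea comes from the description above.

```python
def weighted_prediction(preds, weights):
    """Weighted average of per-model predictions (e.g. hours to resolve)."""
    total = sum(weights[m] for m in preds)
    return sum(weights[m] * preds[m] for m in preds) / total


def weighted_importance(importances, weights):
    """Combine per-model feature importances using the ensemble weights."""
    total = sum(weights[m] for m in importances)
    combined = {}
    for model, per_feature in importances.items():
        for feature, value in per_feature.items():
            combined[feature] = combined.get(feature, 0.0) + weights[model] * value / total
    return combined


def importance_by_category(combined, category_of):
    """Sum combined importances within each feature category."""
    by_category = {}
    for feature, value in combined.items():
        cat = category_of(feature)
        by_category[cat] = by_category.get(cat, 0.0) + value
    return by_category


# Placeholder numbers, not results from the report.
weights = {"random_forest": 1.0, "xgboost": 3.0}
preds = {"random_forest": 10.0, "xgboost": 20.0}
importances = {
    "random_forest": {"embed_0": 0.4, "body_length": 0.6},
    "xgboost": {"embed_0": 0.8, "body_length": 0.2},
}

combined = weighted_importance(importances, weights)
print(weighted_prediction(preds, weights))
print(importance_by_category(
    combined, lambda f: "embedding" if f.startswith("embed_") else "content"
))
```

Summing by category in this way is how a collective figure such as the embedding share of total importance would be obtained.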

Technical Implementation Details

GPU Acceleration

The system uses GPU acceleration through:

cuml for GPU-based Random Forest
cupy for GPU array operations
XGBoost with tree_method='gpu_hist'

This provides significant speedup for large datasets.

Error Handling

The system implements robust error handling:

Graceful fallback to CPU if GPU fails
Handling missing features and values
Data validation and cleaning

This ensures the pipeline can process diverse GitHub data reliably.
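
One way to structure the GPU-to-CPU fallback is sketched below. It is not the system's actual code: the function names are hypothetical, and the training callables are stand-ins for whatever builds the cuml or CPU models.

```python
import importlib.util


def pick_backend():
    """Return 'gpu' when both cuml and cupy are importable, otherwise 'cpu'."""
    have_gpu_libs = all(
        importlib.util.find_spec(name) is not None for name in ("cuml", "cupy")
    )
    return "gpu" if have_gpu_libs else "cpu"


def xgboost_params(backend):
    """Tree method for XGBoost: 'gpu_hist' as named in the text, 'hist' on CPU."""
    return {"tree_method": "gpu_hist" if backend == "gpu" else "hist"}


def train_with_fallback(backend, train_gpu, train_cpu):
    """Try the GPU path first; fall back to CPU if it raises."""
    if backend == "gpu":
        try:
            return train_gpu()
        except Exception as exc:  # e.g. out-of-memory or driver errors
            print(f"GPU training failed ({exc!r}); falling back to CPU")
    return train_cpu()


def failing_gpu_training():
    raise RuntimeError("simulated GPU failure")


print(pick_backend())
print(train_with_fallback("gpu", failing_gpu_training, lambda: "cpu-model"))
```

Recent XGBoost releases (2.0 and later) deprecate tree_method='gpu_hist' in favor of tree_method='hist' combined with device='cuda', so the parameter choice may need updating depending on the installed version.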

Critical Path Analysis

Our runtime summary shows:

Without embeddings: 318.53 seconds (5.31 minutes)
With embeddings: 4598.77 seconds (76.65 minutes)

The roughly 14.4x increase in runtime with embeddings highlights the computational cost of processing the additional 50 embedding dimensions for each issue.
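
The slowdown factor follows directly from the two measured runtimes; the short calculation below reproduces it and also gives the absolute overhead in minutes.

```python
without_embeddings_s = 318.53
with_embeddings_s = 4598.77

slowdown = with_embeddings_s / without_embeddings_s
extra_minutes = (with_embeddings_s - without_embeddings_s) / 60

print(f"{slowdown:.1f}x slower, {extra_minutes:.2f} extra minutes")
```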

Cloud Migration Plan

Current Limitations

The current system has several limitations that motivate a cloud migration:

Processing time increases significantly with embeddings (14.4x slower)
Limited to the GitHub data currently available
Hardware constraints for large-scale processing
Class imbalance handling requires more sophisticated approaches

Cloud Migration Benefits
Scalability: Easily scale up compute resources for larger datasets
Data Access: Process the comprehensive GitHub Archive data
Parallel Processing: Distribute workloads across multiple nodes
Cost Efficiency: Pay only for resources used during processing

            

GitHub Archive Integration

The GitHub Archive provides a comprehensive record of GitHub events:

Contains data since 2011 with hourly archives
Includes issue creation, updates, and closures
Provides rich metadata beyond what is in the current dataset
Enables analysis of temporal patterns across the entire GitHub ecosystem
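
To make the integration concrete, the sketch below builds the URL of one hourly archive file and extracts resolution times from closed-issue events. It assumes the hourly file naming used by GH Archive (data.gharchive.org, with an unpadded hour) and the IssuesEvent payload layout of the GitHub events API; the helper names are hypothetical, and the example parses in-memory lines rather than downloading anything.

```python
import json
from datetime import datetime


def archive_url(hour: datetime) -> str:
    """URL of the gzipped JSON-lines file for one hour (hour is not zero-padded)."""
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def parse_ts(value: str) -> datetime:
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def closed_issue_hours(lines):
    """Return (issue number, resolution time in hours) for each closed-issue event."""
    results = []
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "IssuesEvent":
            continue
        payload = event.get("payload", {})
        issue = payload.get("issue") or {}
        if payload.get("action") != "closed" or not issue.get("closed_at"):
            continue
        delta = parse_ts(issue["closed_at"]) - parse_ts(issue["created_at"])
        results.append((issue.get("number"), delta.total_seconds() / 3600))
    return results


# Hand-written sample events for illustration, not real archive data.
sample_lines = [
    json.dumps({"type": "PushEvent", "payload": {}}),
    json.dumps({"type": "IssuesEvent", "payload": {"action": "opened", "issue": {
        "number": 41, "created_at": "2023-01-01T00:00:00Z", "closed_at": None}}}),
    json.dumps({"type": "IssuesEvent", "payload": {"action": "closed", "issue": {
        "number": 42, "created_at": "2023-01-01T00:00:00Z",
        "closed_at": "2023-01-02T12:00:00Z"}}}),
]

print(archive_url(datetime(2023, 1, 1, 5)))
print(closed_issue_hours(sample_lines))
```

Older archive files use a different event schema, so a production pipeline would need per-period handling that this sketch does not attempt.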

Text Embedding Model Analysis

HuggingFace Multilingual E5-Large-Instruct

The system uses intfloat/multilingual-e5-large-instruct for generating embeddings. This is a text embedding model from HuggingFace that converts text into high-dimensional vector representations. It is based on the E5 architecture (short for "EmbEddings from bidirEctional Encoder rEpresentations") and has been fine-tuned with instructions to better capture task-specific semantics. With 560M parameters, it offers a balance between quality and computational efficiency.

Advantages:

Speed: Relatively fast inference compared to larger models
Multilingual: Handles issues in different languages
Instruction-tuned: Better at capturing task-specific semantics
Reasonable Size: 560M parameters (vs. billions in larger models)

Limitations:

Quality Tradeoff: Not as high-quality as larger embedding models
Technical Content: May not fully capture programming-specific semantics
Context Length: Limited context window compared to newer models
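
For orientation, a sketch of how this model is typically called through the sentence-transformers library is shown below. The "Instruct: ... Query: ..." prompt format follows the model card; the task description string and the function names are hypothetical, and this report does not state whether the system uses this prompt format or how it arrives at its 50 embedding dimensions.

```python
def build_instruct_query(task_description: str, text: str) -> str:
    """Format one input the way the instruct variant of E5 expects."""
    return f"Instruct: {task_description}\nQuery: {text}"


def embed_issues(texts, task_description):
    """Encode issue texts; requires the optional sentence-transformers package."""
    # Imported lazily so the formatter above works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
    prompts = [build_instruct_query(task_description, t) for t in texts]
    return model.encode(prompts, normalize_embeddings=True)


# Hypothetical task description, chosen only for illustration.
task = "Represent this GitHub issue for predicting its resolution time"
print(build_instruct_query(task, "App crashes when saving a file"))
```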

Cloud Implementation Roadmap

Data Pipeline: Set up pipeline to ingest and process GitHub Archive data
Distributed Processing: Refactor code for distributed execution (e.g., using Spark)
Embedding Service: Deploy embedding model as a scalable service
GPU Instances: Use cloud GPU instances for model training
Storage Optimization: Implement efficient storage of embeddings and features
Class Imbalance Handling: Implement techniques like oversampling or cost-sensitive learning
Monitoring: Set up performance monitoring and alerting
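
As one example of the cost-sensitive learning mentioned under Class Imbalance Handling, class weights can be set inversely proportional to class frequency. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count); the label counts are placeholders rather than figures from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Placeholder label distribution for illustration.
labels = ["valid"] * 8 + ["wont-fix"] * 2
print(balanced_class_weights(labels))
```

The resulting weights can be passed to classifiers that accept per-class or per-sample weights, so that errors on the minority class cost more during training.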

Trade-offs and Considerations

When migrating to the cloud, several trade-offs need to be considered:

Cost vs. Performance: More powerful instances cost more but reduce processing time
Embedding Quality vs. Speed: Consider testing larger embedding models despite increased cost
Data Volume vs. Detail: Balance between processing more repositories or deeper analysis
Real-time vs. Batch Processing: Determine if predictions need to be real-time or batched
