System Overview & Model Development
The GitHub Issue Resolution Time Prediction System uses a two-layer approach to analyze and
predict how long it will take to resolve GitHub issues. A GitHub issue is a way to track tasks, enhancements, bugs, or other items of work within a GitHub repository. It serves as a discussion forum and progress tracker for specific topics.
Two-Layer Prediction System
- Classification Layer: Determines whether issues are valid or won't-fix. Valid issues are those that will be fixed or addressed by developers; they typically include actual bugs, feature requests, or enhancements that align with the project's roadmap. Won't-fix issues are those that developers decide not to address; they may include duplicate reports, invalid bug reports, requests outside the project scope, or features that conflict with the project's design philosophy.
- Regression Layer: For valid issues, predicts resolution time, the duration (measured in hours) between when an issue is initially created and when it is finally closed. This metric is crucial for project planning and setting expectations with users about when their reported issues might be addressed.
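As a rough illustration of how the two layers fit together, the sketch below routes an issue through a classifier first and only calls the regressor for valid issues. The function names (`predict_issue`, `toy_classifier`, `toy_regressor`) and the dictionary layout are assumptions made for this example, not the system's actual interfaces.

```python
from typing import Callable, Optional


def predict_issue(
    issue: dict,
    classify: Callable[[dict], str],
    estimate_hours: Callable[[dict], float],
) -> Optional[float]:
    """Return a predicted resolution time in hours, or None for won't-fix issues."""
    label = classify(issue)  # layer 1: "valid" or "wontfix"
    if label != "valid":
        return None  # no resolution time is predicted for won't-fix issues
    return estimate_hours(issue)  # layer 2: regression on valid issues


# Stand-in models for illustration only (hypothetical, not the trained models).
def toy_classifier(issue: dict) -> str:
    return "wontfix" if "duplicate" in issue.get("labels", []) else "valid"


def toy_regressor(issue: dict) -> float:
    return 100.0  # constant placeholder prediction, in hours
```

Passing the two layers in as callables keeps the routing logic independent of whichever classifier or regressor is actually plugged in.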
Classification Methods
The system uses two approaches to classify issues:
Deterministic Classification
Uses regular expressions to find keywords in issue labels that clearly indicate whether an issue is valid or won't-fix.
- Regular expressions: Pattern-matching expressions that search for specific text patterns using specialized syntax. In this context, they are used to identify keywords and phrases in issue labels that strongly indicate whether an issue will be fixed or not (e.g., "wontfix", "invalid", "bug", "enhancement").
- Issue labels: Tags assigned to GitHub issues to categorize them, such as "bug", "enhancement", "wontfix", "duplicate", etc. These labels are often manually applied by project maintainers and provide valuable signals about an issue's status and nature.
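A minimal sketch of label-based classification is shown below. The two patterns are assumptions built from the example keywords mentioned in this section; the system's actual expressions are not given here. Issues with no matching label, or with conflicting labels, are returned as `None` so they can be passed to the machine learning classifier.

```python
import re
from typing import Iterable, Optional

# Hypothetical keyword patterns; the real patterns are not specified in the text.
WONTFIX_RE = re.compile(r"won'?t[\s_-]?fix|invalid|duplicate", re.IGNORECASE)
VALID_RE = re.compile(r"bug|enhancement", re.IGNORECASE)


def classify_by_labels(labels: Iterable[str]) -> Optional[str]:
    """Return "wontfix", "valid", or None when the labels are ambiguous."""
    labels = list(labels)
    is_wontfix = any(WONTFIX_RE.search(label) for label in labels)
    is_valid = any(VALID_RE.search(label) for label in labels)
    if is_wontfix and not is_valid:
        return "wontfix"
    if is_valid and not is_wontfix:
        return "valid"
    return None  # no signal or conflicting signals: defer to the ML classifier
```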
Machine Learning Classification
For ambiguous issues, uses Random Forest, XGBoost, or other classifiers to predict whether the issue is valid or won't-fix.
- Ambiguous issues: Issues that lack clear labels or contain conflicting signals, making it difficult to determine their status through simple rule-based methods. These often require more sophisticated analysis of their content and context.
- Random Forest: A machine learning algorithm that creates multiple decision trees during training and outputs the class that is the mode of the individual trees' predictions. It helps avoid overfitting and handles complex relationships between features well.
- XGBoost: An optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework, creating an ensemble of weak prediction models (typically decision trees) to produce a strong classifier.
- Classifiers: Machine learning models designed to assign items into categories (classes). In this context, they decide whether an unlabeled or ambiguously labeled issue is likely to be fixed or not based on patterns learned from previously labeled issues.
Resolution Time Prediction
For issues classified as valid, the system trains multiple machine learning models to predict resolution
time:
- Random Forest: An ensemble learning method that builds multiple decision trees during training and merges their predictions. For regression tasks like predicting resolution time, it averages the predictions of individual trees to produce a more accurate and stable result.
- Gradient Boosting: A machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds models sequentially, with each new model correcting errors made by previously trained ones, making it powerful for both classification and regression tasks.
- XGBoost: An optimized implementation of gradient boosting that uses a more regularized model formalization to control overfitting. Its efficiency, accuracy, and ability to handle sparse data make it particularly well-suited for predicting complex patterns in GitHub issue resolution times.
- GPU Forest: A GPU-accelerated implementation of Random Forest that leverages parallel computing to dramatically speed up training and prediction times. This allows processing of larger datasets and more complex models than would be feasible with CPU-only implementations.
Finally, an ensemble model combines the predictions from all models, with the aim of achieving better accuracy than any individual model. An ensemble is a machine learning approach that aggregates the individual model predictions, reducing errors that might be made by any single model and generally achieving better performance than any constituent model alone.
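The text does not say how the ensemble aggregates its members, so the sketch below assumes the simplest option, an equal-weight average of the per-model predictions. The function name `ensemble_predict` and the sample values are made up for illustration.

```python
from statistics import fmean
from typing import Mapping


def ensemble_predict(predictions: Mapping[str, float]) -> float:
    """Average per-model predictions (in hours) with equal weights."""
    if not predictions:
        raise ValueError("need at least one model prediction")
    return fmean(predictions.values())


# Made-up per-model predictions for a single issue, in hours.
per_model = {
    "random_forest": 120.0,
    "gradient_boosting": 90.0,
    "xgboost": 100.0,
    "gpu_forest": 110.0,
}
combined = ensemble_predict(per_model)
```

A weighted average or a stacked meta-model would be drop-in alternatives if equal weighting turns out not to match the actual system.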
Model Development Process
The development of this prediction system involved several key stages in data handling and feature creation:
Data Acquisition and Storage:
- Data Collection: Approximately 2 million GitHub issues were downloaded. These were sourced from the top 100 repositories across diverse and popular programming languages, including Python, Go, and Java, to ensure a broad and representative dataset.
- Data Management: The collected issue data was then systematically upserted into a PostgreSQL database. This relational database was chosen for its robustness and efficiency in handling large volumes of structured and semi-structured data, facilitating complex queries and data retrieval for model training.
Feature Engineering:
A crucial step was the engineering of a rich set of features from the raw issue data. This process aimed to extract meaningful signals that could predict resolution times. Key feature categories include:
- Text Embeddings: Sophisticated text embeddings were generated for the title and body of each issue. These embeddings convert textual information into dense vector representations, capturing the semantic meaning and context, which are vital for understanding the issue's content.
- Numerical Features: A variety of numerical features were derived, such as:
  - The number of issues concurrently open within the same repository at the time of an issue's creation (reflecting repository load).
  - The total number of unique contributors to the repository (indicating community size and activity).
- Additional features, as detailed in the Feature Analysis section (e.g., time-based features, content metadata), were also engineered to provide a comprehensive view of each issue.
This robust data foundation and meticulously engineered feature set form the basis for the two-layer prediction approach and the models subsequently trained and evaluated in this system.
Classification Performance
Dataset Distribution
Issue Distribution
Class Imbalance Challenge
The dataset exhibits severe class imbalance:
- Valid Issues: 98.1% (550,487 issues)
- Won't-fix Issues: 1.9% (10,562 issues)
This extreme imbalance makes it difficult for the model to correctly identify won't-fix issues,
resulting in low recall despite high accuracy.
Classification Metrics
- Accuracy: 0.9653 (96.5%). The proportion of all predictions (both positive and negative) that were correct. While our model achieves high accuracy, this metric can be misleading with imbalanced datasets because always predicting the majority class (valid) would still yield high accuracy.
- Precision: 0.9552 (95.5%). The proportion of positive identifications (won't-fix predictions) that were actually correct. Our high precision means that when the model predicts an issue as won't-fix, it's usually right. However, this doesn't tell us how many won't-fix issues we're missing.
- Recall: 0.1041 (10.4%). The proportion of actual positives (true won't-fix issues) that were correctly identified. Our low recall reveals a critical weakness: the model is missing about 90% of the issues that actually won't be fixed, likely due to the severe class imbalance in the training data.
- F1 Score: 0.1878. The harmonic mean of precision and recall, providing a single metric that balances both concerns. Our low F1 score indicates poor overall performance despite high accuracy, because the model is failing to identify most won't-fix issues.
- AUC: 0.8542. Area Under the Receiver Operating Characteristic curve, which measures the model's ability to distinguish between classes across various threshold settings. Our relatively high AUC suggests that the model has good discriminative power, but we're not leveraging it effectively due to the imbalanced data problem.
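The relationship between these metrics can be checked with a few lines of arithmetic. The sketch below recomputes F1 from the reported precision and recall, and computes the accuracy a trivial "always predict valid" model would reach on the class counts listed under Dataset Distribution. The helper names are illustrative, and the baseline figure assumes the evaluation data follows that same distribution.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def majority_baseline_accuracy(n_valid: int, n_wontfix: int) -> float:
    """Accuracy of a model that always predicts the larger class."""
    return max(n_valid, n_wontfix) / (n_valid + n_wontfix)


# Values reported in this section.
f1 = f1_score(precision=0.9552, recall=0.1041)
baseline = majority_baseline_accuracy(n_valid=550_487, n_wontfix=10_562)
print(f"F1 = {f1:.3f}, majority-class baseline accuracy = {baseline:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, the low recall pulls F1 down to roughly 0.19 even though precision is above 0.95.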
Performance Analysis
The high accuracy (96.5%) is misleading due to the class imbalance. The low recall (10.4%) indicates that
the model is missing about 90% of the actual won't-fix issues. This is a critical limitation of the
current approach.
Implications of Low Recall
The poor recall for won't-fix issues means that:
- Many won't-fix issues are incorrectly classified as valid
- The system will waste time predicting resolution times for issues that won't be fixed
- The resolution time predictions may be less accurate because the dataset includes misclassified
won't-fix issues
Resolution Time Prediction Performance
Model | MAE (hrs) | RMSE (hrs) | Median AE (hrs) | R²
Without Embeddings
Random Forest | 3906.01 | 10431.53 | 341.10 | 0.1303
Gradient Boosting | 3938.32 | 10490.13 | 328.57 | 0.1379
XGBoost | 3933.83 | 10483.27 | 329.33 | 0.1402
GPU Forest | 3920.53 | 10488.01 | 320.06 | 0.1380
Ensemble | 3925.11 | 10486.19 | 321.20 | 0.1459
With Embeddings
Random Forest | 3878.43 | 10390.64 | 322.83 | 0.1825
Gradient Boosting | 3881.12 | 10339.36 | 350.04 | 0.1821
XGBoost | 3875.79 | 10321.36 | 352.11 | 0.1870
GPU Forest | 3891.70 | 10416.41 | 323.42 | 0.1737
Ensemble | 3879.43 | 10379.16 | 327.31 | 0.1903
MAE = Mean Absolute Error, RMSE = Root Mean Square Error, Median AE = Median Absolute Error,
R² = R-squared coefficient of determination
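For readers less familiar with these four metrics, the following sketch computes them from paired actual and predicted values using only the standard library. The function name and the toy numbers are illustrative and are not drawn from the evaluation data.

```python
import math
from statistics import fmean, median
from typing import Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> dict:
    """Compute MAE, RMSE, median absolute error, and R² for paired values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = fmean(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": fmean(abs_errors),
        "rmse": math.sqrt(ss_res / len(errors)),
        "median_ae": median(abs_errors),
        "r2": 1.0 - ss_res / ss_tot,  # undefined when all true values are equal
    }


# Toy resolution times in hours (made up for illustration).
actual = [2.0, 100.0, 1500.0, 20000.0]
predicted = [50.0, 120.0, 1000.0, 9000.0]
metrics = regression_metrics(actual, predicted)
```

In this toy data a single large miss pushes MAE and RMSE far above the median absolute error, the same pattern visible in the table, where the median error is roughly a tenth of the MAE.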
Analysis: The significant importance of classification uncertainty and text embeddings suggests that the semantic content of issues is valuable for predicting resolution time. Semantic content here means the underlying meaning and context expressed in the issue text, beyond just keywords or surface-level statistics. This includes concepts like the problem domain, the complexity of the issue description, technical specificity, and how the issue is framed, all of which can influence how quickly it might be resolved. The system leverages this through the two-layer approach.
Resolution Time Distribution Challenge
Resolution Time Statistics:
- Mean: 3,404.38 hours (~142 days)
- Median: 105.51 hours (~4.4 days)
- Maximum: 148,492.47 hours (~17 years)
- 25th Percentile: 1.64 hours
- 75th Percentile: 1,752.48 hours (~73 days)
- 95th Percentile: 20,572.32 hours (~857 days)
The extreme disparity between mean (3,404 hours) and median (105.5 hours) resolution times reveals a
highly skewed distribution with significant outliers. With some issues taking nearly 17 years to
resolve while others are addressed within hours, the current model's modest R² value of 0.19 is
unsurprising.
This extreme variability makes the current model appear ineffective for practical predictions.
However, with additional data and features related to repository maintainers, issue authors, and
codebase characteristics (such as complexity metrics, test coverage, and component dependencies), a
significantly more powerful predictive model could be developed. Incorporating such contextual
features would likely capture the organizational and technical factors that drive the wide variance
in resolution times.
Impact of Text Embeddings
Adding text embeddings improved the ensemble model's prediction performance. In natural language processing, a text embedding is a representation of a sentence as a vector of numbers which encodes meaningful semantic information. State-of-the-art embeddings are based on the learned hidden-layer representation of dedicated sentence transformer models. These vector representations allow machine learning models to understand the meaning behind issue descriptions, capturing nuances that simple word counts or character statistics miss. Comparing the ensemble without and with embeddings:
- R² improved from 0.1459 to 0.1903 (30.4% relative increase)
- MAE decreased by 45.68 hours (about 1.2%)
- RMSE decreased by 107.03 hours (about 1.0%)
This suggests that the semantic content of issues contains valuable information for predicting
resolution time.
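The improvement figures can be reproduced directly from the two Ensemble rows of the results table. The short calculation below is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Ensemble rows from the resolution time results table.
without_emb = {"mae": 3925.11, "rmse": 10486.19, "r2": 0.1459}
with_emb = {"mae": 3879.43, "rmse": 10379.16, "r2": 0.1903}

mae_drop = without_emb["mae"] - with_emb["mae"]
rmse_drop = without_emb["rmse"] - with_emb["rmse"]
r2_relative_gain = (with_emb["r2"] - without_emb["r2"]) / without_emb["r2"]

print(f"MAE decreased by {mae_drop:.2f} hours")
print(f"RMSE decreased by {rmse_drop:.2f} hours")
print(f"R² relative increase: {r2_relative_gain:.1%}")
```

Note that the R² gain is relative (0.0444 on a base of 0.1459), while the MAE and RMSE reductions are each around one percent of their starting values.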
Error Analysis by Category
By Author Type
Issues from core contributors have:
- Lower mean error (2873.02 vs. 3905.38 hours)
- Higher median error (415.46 vs. 324.82 hours)
This suggests core contributor issues have fewer extreme outliers but larger typical prediction errors, possibly reflecting moderate
delays.
By Creation Time
Issues created on weekends have:
- Higher mean error (3970.00 vs. 3863.46 hours)
- Lower median error (310.12 vs. 330.43 hours)
Weekend issues may be more unpredictable, with more extreme resolution times.
By Issue Content
Issues containing code have:
- Lower mean error (3839.15 vs. 3962.73 hours)
- Lower median error (319.51 vs. 343.41 hours)
Code inclusion appears to make resolution time more predictable.
Interpretation of Results
While an R² of 0.19 may seem low, predicting GitHub issue resolution times is inherently challenging due
to many factors outside the scope of the data. The significant improvement when adding embeddings
indicates that the text content of issues provides valuable signals for prediction.
Feature Importance Analysis
Detailed Analysis of Top Predictive Features
Understanding which features drive prediction performance provides valuable insights into what
factors affect GitHub issue resolution times. This section expands on the visualizations above with
detailed comparisons between models and feature sets.
Top Features Without Embeddings: Ensemble Model
Rank | Feature | Importance | Definition | Description
1 | classification_uncertainty | 25.58% | A measure of how confident the classification model is about whether an issue will be fixed or not. Higher uncertainty values indicate that the model found the classification decision difficult, which often correlates with more complex or ambiguous issues. | The single most important predictor, suggesting that issues that are difficult to classify tend to have less predictable resolution times.
2 | created_year | 7.19% | The year in which the issue was created. This captures the project's maturity and the evolution of development practices over time. | Strong temporal signal indicating that resolution patterns change significantly over a project's lifetime.
3 | body_length | 5.18% | The total character count in the issue description. Longer descriptions often contain more detail about the issue, which can either help resolution (by providing more context) or indicate complexity. | Issues with longer descriptions may be more complex but also provide more context for resolution.
4 | body_word_count | 4.47% | The number of words in the issue description. Similar to body_length but normalized for language patterns (e.g., some languages use more characters per word than others). | Complements body_length; the two together suggest that issue verbosity is a strong signal.
5 | created_day_of_month | 4.19% | The day of the month (1-31) when the issue was created. This can capture cyclical patterns in development activity and issue creation. | Suggests monthly cycles in development activity affect resolution time.
6 | wontfix_probability | 3.65% | The model's estimate of how likely an issue is to be classified as "won't fix." Higher probabilities suggest the issue might be on the borderline, potentially affecting handling time even if it's ultimately addressed. | Issues with higher probability of being won't-fix (even if classified as valid) tend to have distinctive resolution patterns.
7 | code_to_text_ratio | 3.62% | The ratio of code characters to total characters in the issue description. A higher ratio indicates more code examples or snippets relative to explanatory text. | Issues with more code relative to explanatory text may be more precisely defined.
8 | link_count | 3.57% | The number of hyperlinks in the issue description. Links often point to related issues, documentation, or external resources that provide context or prerequisites. | Issues that reference other resources may have dependencies or broader context affecting resolution time.
9 | created_hour | 3.41% | The hour of the day (0-23) when the issue was created. This can indicate whether the issue was created during typical working hours or off-hours. | The time of day when issues are created correlates with resolution patterns, possibly reflecting timezone differences or work schedules.
10 | created_month | 2.38% | The month (1-12) when the issue was created. This can capture seasonal patterns in development activity. | Seasonal patterns affect resolution time, possibly due to release cycles, holidays, or contributor availability.
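The content features in this table are simple to derive from an issue body. The sketch below follows the definitions given in the table, under two assumptions that the text does not confirm: "code characters" are counted as text inside Markdown fenced code blocks, and links are detected as bare http(s) URLs. The function name `content_features` is hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown code fence delimiter (three backticks)
CODE_BLOCK_RE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
URL_RE = re.compile(r"https?://\S+")


def content_features(body: str) -> dict:
    """Derive surface-level content features from an issue body."""
    # Assumption: code is whatever sits inside fenced blocks, fences included.
    code_chars = sum(len(m.group(0)) for m in CODE_BLOCK_RE.finditer(body))
    return {
        "body_length": len(body),              # total characters
        "body_word_count": len(body.split()),  # whitespace-separated tokens
        "code_to_text_ratio": code_chars / len(body) if body else 0.0,
        "link_count": len(URL_RE.findall(body)),
    }


sample_body = (
    "Crash on start. See https://example.com/log\n"
    + FENCE + "\nprint(1)\n" + FENCE
)
sample_features = content_features(sample_body)
```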
Top Features With Embeddings: Ensemble Model
Rank | Feature | Importance | Shift
1 | classification_uncertainty | 11.86% | ↓ 13.72%
2 | created_year | 5.06% | ↓ 2.13%
3 | wontfix_probability | 2.46% | ↓ 1.19%
4 | embed_0 | 2.33% | New
5 | embed_5 | 1.89% | New
6 | embed_2 | 1.81% | New
7 | link_count | 1.80% | ↓ 1.77%
8 | list_item_count | 1.60% | ↓ 0.77%
9 | embed_3 | 1.47% | New
10 | embed_4 | 1.31% | New
Key Shifts When Adding Embeddings
- Classification uncertainty drops from 25.58% to 11.86% importance but remains
the top individual feature
- Embedding dimensions collectively account for 63.2% of predictive power
- Content-based features like body_length and body_word_count drop dramatically
in importance as embeddings capture their semantic signal more effectively
- Temporal features (created_year, created_day_of_month) decline in importance
but remain influential
- Overall distribution becomes more even, with no single feature dominating
XGBoost vs. Ensemble: Feature Importance Comparison
[Figure: four feature importance charts. Panels: XGBoost Feature Importance (No Embeddings); Ensemble Feature Importance (No Embeddings); XGBoost Feature Importance (With Embeddings); Ensemble Feature Importance (With Embeddings). classification_uncertainty is labeled in every panel.]
Model Comparison Insights
The striking differences between XGBoost and the ensemble model's feature importance reveal
fundamental differences in how these models learn:
- XGBoost's Focus: XGBoost places extraordinary importance (85.20%) on
classification_uncertainty when embeddings are not used, suggesting it relies heavily on this
single feature for predictions.
- Ensemble's Balanced Approach: The ensemble model distributes importance more
evenly across features, suggesting it leverages multiple signals more effectively.
- Embedding Effect: Adding embeddings reduces XGBoost's reliance on
classification_uncertainty from 85.20% to 38.18% (a drop of 47.02 percentage points), while the ensemble model sees a
smaller absolute shift from 25.58% to 11.86% (13.72 points). In relative terms, the feature's importance roughly halves in both models.
- Feature Selection: XGBoost identifies different secondary features as important
compared to the ensemble; for example, has_error_message and author_is_core_contributor rank
higher in XGBoost.
Individual Model Analysis
Random Forest Importance Pattern
Random Forest shows a more balanced feature distribution:
- Top Features: body_length (12.81%), body_word_count (10.96%),
created_day_of_month (10.19%)
- Classification features rank lower: uncertainty (6.15%),
wontfix_probability (6.39%)
- With embeddings: No single embedding dimension exceeds 2.76% importance
- Shows higher appreciation for content-based features than other models
Gradient Boosting Importance Pattern
Gradient Boosting shows a hybrid approach:
- Top Features: created_year (19.77%), classification_uncertainty (10.98%),
link_count (8.19%)
- Has the strongest emphasis on temporal features among all models
- With embeddings: Still prioritizes created_year (12.65%) over
classification features
- Shows unique appreciation for link_count compared to other models
Feature Category Analysis
[Figure: feature category importance charts, one without embeddings and one with embeddings; each totals 100%.]
Classification Layer Features
Classification-related features provide critical signals:
- classification_uncertainty - Issues with ambiguous classification signals tend to have
unpredictable resolution times
- wontfix_probability - Higher probabilities correlate with longer resolution
times, even for issues ultimately classified as valid
- The decreasing importance of these features when embeddings are added suggests that
embedding dimensions capture some of the same semantic signals
- XGBoost's extreme reliance on classification_uncertainty suggests it identifies subtle
patterns in this feature that other models miss
Temporal Features
Time-related features reveal important patterns:
- created_year - Projects evolve significantly over time, affecting
resolution patterns
- created_day_of_month - Monthly cycles affect resolution (possibly due to
sprint patterns or release schedules)
- created_hour - Issues created at different times of day show different
resolution patterns
- Gradient Boosting places the highest importance on temporal features, particularly
created_year
- These features remain important even with embeddings, suggesting they capture signals
separate from semantic content
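The temporal features listed above can all be derived from an issue's creation timestamp. The sketch below assumes an ISO 8601 UTC string in the style GitHub's API uses for `created_at`, and assumes features are computed in UTC rather than the author's local time; the text does not state which is used. The `created_on_weekend` flag is a hypothetical name added to mirror the weekend error analysis.

```python
from datetime import datetime, timezone


def temporal_features(created_at: str) -> dict:
    """Derive calendar features from an ISO 8601 creation timestamp."""
    # Timestamps look like "2023-03-18T14:05:00Z"; normalize the trailing Z.
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    created = created.astimezone(timezone.utc)  # assumption: features use UTC
    return {
        "created_year": created.year,
        "created_month": created.month,                # 1-12
        "created_day_of_month": created.day,           # 1-31
        "created_hour": created.hour,                  # 0-23
        "created_on_weekend": created.weekday() >= 5,  # Saturday or Sunday
    }


features = temporal_features("2023-03-18T14:05:00Z")
```

Using UTC means created_hour mixes contributors from many time zones, which fits the interpretation that this feature may reflect timezone differences as much as work schedules.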
Content Features
Issue content characteristics provide strong signals:
- body_length/body_word_count - Longer issues likely involve more complex problems
- code_to_text_ratio - More code relative to explanation may indicate better
reproducibility
- link_count - References to other resources may indicate dependencies
- list_item_count - More structured content may be clearer and easier to
resolve
- Random Forest places the highest emphasis on these features
- These features decrease dramatically in importance when embeddings are added as embeddings
capture their semantic information more effectively
Embedding Features
Text embeddings dominate prediction when available:
- Collectively account for 63.2% of feature importance in the ensemble model
- embed_0, embed_5, and embed_2 are consistently the most important embedding dimensions
across models
- These embedding dimensions likely capture semantic aspects of issues like:
  - Problem domain (UI, backend, database, etc.)
  - Issue type (bug, feature request, documentation)
  - Technical complexity signals
  - Writing style and clarity
- The embedding model effectively condenses semantic information that would otherwise require
many engineered features
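A collective share such as the one reported for embeddings can be obtained by summing per-feature importances within each category. The sketch below assumes embedding features are identified by an embed_ prefix (as in embed_0) and that other features are mapped to categories by a lookup table; importance_by_category and the category names are hypothetical.

```python
from collections import defaultdict


def importance_by_category(importances: dict, categories: dict) -> dict:
    """Sum feature importances per category and normalise the totals to shares."""
    totals = defaultdict(float)
    for feature, value in importances.items():
        if feature.startswith("embed_"):
            category = "embedding"          # assumed naming convention
        else:
            category = categories.get(feature, "other")
        totals[category] += value
    grand_total = sum(totals.values()) or 1.0
    return {name: value / grand_total for name, value in totals.items()}
```

The same function works for a single model or for the ensemble, as long as the importances passed in come from one consistent source.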
Surprising Feature Findings
Model Disagreement on Author Features
Models disagree significantly on the importance of author-related features:
- XGBoost: Ranks author_is_core_contributor as its 4th most important feature
(1.26% importance)
- Ensemble: Ranks author_is_core_contributor as 17th (0.67% importance)
- Random Forest: Ranks it last at 22nd (0.31% importance)
- This disagreement suggests the relationship between author status and resolution time is complex
and possibly non-linear
- XGBoost may be better at capturing specific resolution patterns for core contributor issues
Bug Identification Features
Features related to bug identification show surprising patterns:
- has_error_message: Ranked 3rd (1.40%) by XGBoost but only 14th (1.05%) by the
ensemble
- has_reproduction_steps: Generally low importance across all models despite
being considered valuable in bug reports
- has_code: Very low importance (typically below 0.6%) compared to
code_to_text_ratio (3.62% in ensemble)
- This suggests that the quality and proportion of code in an issue is more
important than its mere presence
- XGBoost appears more sensitive to specific technical indicators in the issue content
Strategic Feature Engineering Opportunities
Based on the comprehensive feature analysis, several opportunities for improved feature engineering
emerge:
Repository Context Features
- Repository age and maturity
- Number of active contributors
- Contributor-to-issue ratio
- Release cadence and proximity to releases
- Codebase size and complexity metrics
- Test coverage and CI/CD pipeline metrics
Issue Relationship Features
- Number of related or linked issues
- Dependency graph metrics
- Issue priority relative to other open issues
- Component popularity/activity level
- Historical resolution times for similar issues
- Issue "heat" (comments, reactions, subscriptions)
Developer Context Features
- Author contribution history and expertise areas
- Maintainer availability patterns
- Reviewer response times
- Team workload indicators
- Geographic distribution of contributors
- Timezone alignment between author and maintainers
Advanced Semantic Features
- Topic modeling of issue content
- Named entity recognition for technologies mentioned
- Code-specific embeddings for technical content
- Sentiment analysis of issue descriptions
- Complexity metrics for code snippets
- Technical jargon density analysis
Incorporating these additional feature categories could substantially improve the model's predictive
power beyond the current R² of 0.19, potentially making the system more useful for practical project
planning and resource allocation.
System Architecture Analysis
Python Code Analysis
The system consists of two main Python scripts:
1. predict_resolution_time.py
This script implements the two-layer prediction system:
- Data Loading and Preprocessing:
- Loads repository data with issue details
- Applies regex-based deterministic classification
- Extracts features from creation time
- Loads and integrates text embeddings if available
- Classification Layer:
- Trains a classifier (Random Forest, XGBoost, or GPU Forest)
- Applies classifier only to ambiguous issues
- Calculates classification uncertainty as a feature
- Regression Layer:
- Trains multiple regression models (Random Forest, Gradient Boosting, XGBoost, GPU
Forest)
- Evaluates and compares model performance
- Visualizes feature importance and errors
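One way the classification uncertainty feature could be derived is from the classifier's predicted probability that an issue is valid. The source does not say which measure the script uses, so binary entropy is an assumption here, and binary_entropy and classification_uncertainty are hypothetical names.

```python
import math


def binary_entropy(p_valid: float) -> float:
    """Uncertainty in bits: 0 for a confident prediction, 1 at p = 0.5."""
    if p_valid <= 0.0 or p_valid >= 1.0:
        return 0.0
    return -(p_valid * math.log2(p_valid)
             + (1.0 - p_valid) * math.log2(1.0 - p_valid))


def classification_uncertainty(deterministic_label, p_valid: float) -> float:
    """Uncertainty feature passed on to the regression layer.

    Issues already settled by the regex rules (deterministic_label is not None)
    are treated as certain; only ambiguous issues use the classifier output.
    """
    if deterministic_label is not None:
        return 0.0
    return binary_entropy(p_valid)
```

For example, an ambiguous issue with a predicted probability of 0.5 gets the maximum uncertainty of 1.0, while an issue labeled by the deterministic rules always gets 0.0.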
2. ensemble.py
This script creates and evaluates ensemble models:
- Ensemble Creation:
- Loads trained models from the previous script
- Implements various ensemble methods (weighted, average, voting)
- Optionally optimizes weights for weighted ensemble
- Performance Analysis:
- Compares ensemble to individual models
- Calculates improvement percentages
- Analyzes errors by issue category
- Feature Importance:
- Calculates weighted feature importance across models
- Visualizes importance by feature and category
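The weighted ensemble and the improvement percentage can be sketched as follows. This is an illustration under assumptions, not the code in ensemble.py: predictions are plain lists keyed by model name, the weights are supplied by the caller, and weighted_ensemble and improvement_pct are hypothetical helpers.

```python
def weighted_ensemble(predictions: dict, weights: dict) -> list:
    """Combine per-model predictions into one weighted-average prediction.

    predictions maps a model name to its list of predicted resolution times;
    weights maps the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in predictions)
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions.items())
        / total_weight
        for i in range(n_samples)
    ]


def improvement_pct(baseline_error: float, ensemble_error: float) -> float:
    """Relative error reduction of the ensemble over a single model, in percent."""
    return 100.0 * (baseline_error - ensemble_error) / baseline_error
```

Setting all weights to the same value gives the simple average ensemble; weight optimization then amounts to searching for the weights that minimize validation error.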
Technical Implementation Details
GPU Acceleration
The system uses GPU acceleration through:
- cuml for GPU-based Random Forest
- cupy for GPU array operations
- XGBoost with tree_method='gpu_hist'
This provides significant speedup for large datasets.
Error Handling
The system implements robust error handling:
- Graceful fallback to CPU if GPU fails
- Handling missing features and values
- Data validation and cleaning
This ensures the pipeline can process diverse GitHub data reliably.
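The graceful CPU fallback could look roughly like the sketch below. It only checks whether the GPU libraries are importable and wraps a GPU training call so that any failure reruns the CPU path; gpu_stack_installed, xgboost_params, and run_with_cpu_fallback are hypothetical names, and the actual scripts may structure this differently. Note also that recent XGBoost releases (2.0 and later) prefer device='cuda' with tree_method='hist' over 'gpu_hist'.

```python
import importlib.util
import logging

logger = logging.getLogger("gpu_fallback")


def gpu_stack_installed(modules=("cuml", "cupy")) -> bool:
    """Return True only if every listed GPU library can be found."""
    return all(importlib.util.find_spec(name) is not None for name in modules)


def xgboost_params(use_gpu: bool) -> dict:
    """Tree-method setting for XGBoost; 'gpu_hist' is the value named above."""
    return {"tree_method": "gpu_hist" if use_gpu else "hist"}


def run_with_cpu_fallback(gpu_fn, cpu_fn, *args, **kwargs):
    """Try the GPU implementation first and rerun on CPU if it raises."""
    try:
        return gpu_fn(*args, **kwargs)
    except Exception as exc:  # broad on purpose: driver, memory, import errors
        logger.warning("GPU path failed (%s); falling back to CPU", exc)
        return cpu_fn(*args, **kwargs)
```

A caller would pass two training functions with the same signature, for example one built on cuml and one on scikit-learn, and receive whichever model could be trained.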
Critical Path Analysis
Our runtime summary shows:
- Without embeddings: 318.53 seconds (5.31 minutes)
- With embeddings: 4598.77 seconds (76.65 minutes)
The significant increase in runtime with embeddings (14.4x slower) highlights the computational cost of
processing the additional 50 embedding dimensions for each issue.
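The slowdown factor and the minute conversions follow directly from the two measured runtimes, as this short check shows (the variable names are illustrative only):

```python
# Measured runtimes from the summary above, in seconds.
runtime_without_embeddings = 318.53
runtime_with_embeddings = 4598.77

slowdown = runtime_with_embeddings / runtime_without_embeddings
minutes_without = runtime_without_embeddings / 60
minutes_with = runtime_with_embeddings / 60

print(f"slowdown: {slowdown:.1f}x")
print(f"without embeddings: {minutes_without:.2f} min")
print(f"with embeddings: {minutes_with:.2f} min")
```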
Cloud Migration Plan
Current Limitations
The current system has several limitations that motivate a cloud migration:
- Processing time increases significantly with embeddings (14.4x slower)
- Limited to the GitHub data currently available
- Hardware constraints for large-scale processing
- Class imbalance handling requires more sophisticated approaches
Cloud Migration Benefits
- Scalability: Easily scale up compute resources for larger datasets
- Data Access: Process the comprehensive GitHub Archive data
- Parallel Processing: Distribute workloads across multiple nodes
- Cost Efficiency: Pay only for resources used during processing
GitHub Archive Integration
The GitHub Archive provides a comprehensive
record of GitHub events:
- Contains data since 2011 with hourly archives
- Includes issue creation, updates, and closures
- Provides rich metadata beyond what's in current dataset
- Enables analysis of temporal patterns across the entire GitHub ecosystem
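A first ingestion step might enumerate the hourly archive files and keep only issue events. The sketch below assumes the publicly documented GH Archive layout (one gzip-compressed, newline-delimited JSON file per hour, with an unpadded hour in the file name) and the IssuesEvent type from the GitHub events format; hourly_archive_urls and issue_events are hypothetical helpers, and downloading and decompression are left out.

```python
import json
from datetime import datetime, timedelta


def hourly_archive_urls(start: datetime, hours: int) -> list:
    """Build archive URLs for `hours` consecutive hours (assumed URL pattern)."""
    urls = []
    for offset in range(hours):
        ts = start + timedelta(hours=offset)
        urls.append(f"https://data.gharchive.org/{ts:%Y-%m-%d}-{ts.hour}.json.gz")
    return urls


def issue_events(lines):
    """Yield a compact record for each issue event in newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "IssuesEvent":
            continue
        payload = event.get("payload", {})
        yield {
            "repo": event.get("repo", {}).get("name"),
            "action": payload.get("action"),      # e.g. "opened" or "closed"
            "created_at": event.get("created_at"),
        }
```

Pairing the "opened" and "closed" events for the same issue would then give the resolution time used as the regression target.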
Text Embedding Model Analysis
HuggingFace Multilingual E5-Large-Instruct
The system uses intfloat/multilingual-e5-large-instruct for generating embeddings. This is a
text embedding model from HuggingFace that converts text into high-dimensional vector
representations. It is based on the E5 architecture (short for "EmbEddings from bidirEctional
Encoder rEpresentations") and has been fine-tuned with instructions to better capture
task-specific semantics. With 560M parameters, it offers a balance between quality and
computational efficiency.
Advantages:
- Speed: Relatively fast inference compared to larger models
- Multilingual: Handles issues in different languages
- Instruction-tuned: Better at capturing task-specific semantics
- Reasonable Size: 560M parameters (vs. billions in larger models)
Limitations:
- Quality Tradeoff: Not as high-quality as larger embedding models
- Technical Content: May not fully capture programming-specific semantics
- Context Length: Limited context window compared to newer models
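Because the model is instruction-tuned, each input is normally prefixed with a task description. The sketch below builds such an input and embeds it with the sentence-transformers library. The "Instruct: ... Query: ..." template follows the model's published usage notes as far as we know; the task wording, the character-based truncation (a crude stand-in for the token limit), and the helper names build_e5_input and embed_issues are assumptions, and the source does not describe how the embeddings are reduced to the 50 dimensions used as features.

```python
def build_e5_input(task: str, title: str, body: str, max_chars: int = 2000) -> str:
    """Format one issue as an instructed query for the embedding model."""
    text = f"{title}\n\n{body}".strip()[:max_chars]
    return f"Instruct: {task}\nQuery: {text}"


def embed_issues(inputs: list):
    """Embed pre-formatted inputs; requires sentence-transformers and a model download."""
    # Imported lazily so the formatting helper works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
    return model.encode(inputs, normalize_embeddings=True)
```

A call such as embed_issues([build_e5_input("Represent this GitHub issue for predicting its resolution time", title, body)]) would return one vector per issue.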
Cloud Implementation Roadmap
- Data Pipeline: Set up pipeline to ingest and process GitHub Archive data
- Distributed Processing: Refactor code for distributed execution (e.g., using Spark)
- Embedding Service: Deploy embedding model as a scalable service
- GPU Instances: Use cloud GPU instances for model training
- Storage Optimization: Implement efficient storage of embeddings and features
- Class Imbalance Handling: Implement techniques like oversampling or cost-sensitive
learning
- Monitoring: Set up performance monitoring and alerting
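For the class imbalance item, cost-sensitive learning can start from inverse-frequency class weights. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count); the label strings and the helper name balanced_class_weights are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Weight each class inversely to its frequency so rare classes count more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
```

With 8 valid and 2 won't-fix issues, the minority class receives a weight four times larger than the majority class; such weights can typically be passed to a classifier as per-class or per-sample weights.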
Trade-offs and Considerations
When migrating to the cloud, several trade-offs need to be considered:
- Cost vs. Performance: More powerful instances cost more but reduce processing
time
- Embedding Quality vs. Speed: Consider testing larger embedding models despite
increased cost
- Data Volume vs. Detail: Balance between processing more repositories or deeper
analysis
- Real-time vs. Batch Processing: Determine if predictions need to be real-time
or batched