Compare commits
4 Commits
bdcba003fe
...
0a0281ab4e
Author | SHA1 | Date | |
---|---|---|---|
0a0281ab4e | |||
0f248e6b9a | |||
7d82e4c588 | |||
721a38435b |
101
README.md
Normal file
@@ -0,0 +1,101 @@
<!--Your Github repository must have the following contents:

A README.md file that communicates the libraries used, the motivation for the project, the files in the repository with a small description of each, a summary of the results of the analysis, and necessary acknowledgments.

Your code in a Jupyter notebook, with appropriate comments, analysis, and documentation.

You may also provide any other necessary documentation you find necessary.-->

# stacksurvey

**stacksurvey** is an exploration and analysis of data from StackOverflow's developer survey of 2024.

[https://survey.stackoverflow.co/2024/](https://survey.stackoverflow.co/2024/)

The motivation for this project was to satisfy a class assignment. Along the way, an interesting (enough) question emerged from the data set:

>What is the annual compensation (y) over years of experience (x) of developers in a specific country who use a given programming language?

## Requirements

numpy, pandas, scikit-learn (imported as `sklearn`), matplotlib, seaborn

## Summary of Analysis

The models generated by the notebook become less reliable for years of experience greater than 10 or annual incomes greater than $200,000.

Each chart comes with two regression lines. Red is the default regression line, fit without any tuning. The other is an attempt to better fit the data by transforming or shifting x.

The transformation is typically

y = m * log(x) + b

where the base of the logarithm is a parameter.

A different base was applied to the log function for each model.

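
The log form above can be fit with an ordinary straight-line regression by transforming x first. The following is a minimal sketch using only the standard library, not the notebook's code: the name `log_base_a` appears in the notebook, but its signature here is an assumption, `fit_line` is a hypothetical helper, and the data is made up so that the expected slope and intercept are known in advance.

```python
import math


def log_base_a(x, a):
    """Logarithm of x in base a, via the change-of-base identity."""
    return math.log(x) / math.log(a)


def fit_line(xs, ys):
    """Ordinary least squares for y = m * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up data that follows y = 12000 * log_1.3(x) + 55000 exactly.
base = 1.3
years = [1, 2, 4, 8, 16]
pay = [12000 * log_base_a(x, base) + 55000 for x in years]

# Transform x first, then fit a straight line to (log_a(x), y).
transformed = [log_base_a(x, base) for x in years]
m, b = fit_line(transformed, pay)
print("m = %0.2f, b = %0.2f" % (m, b))
```

Because log_a(x) = ln(x) / ln(a), choosing a different base in this unshifted form only rescales m by a constant factor; the fitted curve itself stays the same.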
### C



+----------------------+
red regression line for C
coefficient = 1427.58
intercept = 103659.82
rmse = 26971.44
r2 score = 0.06
sample predictions:
[[125073.46117519]
[107942.54574181]
[109370.12202793]]
+----------------------+
+----------------------+
magenta regression line for C
coefficient = 11973.47
intercept = 54776.27
rmse = 21198.61
r2 score = 0.57
sample predictions:
[[132396.26294684]
[119937.35465744]
[ 64985.1549115 ]]
+----------------------+

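For C programmers, the magenta model is a moderate fit with an r2 score of 0.57, while the untuned red line explains very little (r2 score of 0.06). The magenta intercept puts junior-level compensation at roughly $54,776. Its coefficient is $11,973, but the notebook fits this line with `x_transform=log_base_a`, so that increase applies per unit of the transformed x rather than per calendar year of experience.
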
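
As a quick check on how the coefficient and intercept relate to the sample predictions, the sketch below inverts the red line for C. It assumes the red line is a plain y = m * x + b on years of experience, as the summary describes, and it uses the rounded values printed above, so the recovered x values are only approximate.

```python
# Rounded values copied from the red regression output for C.
m = 1427.58      # coefficient
b = 103659.82    # intercept
sample_predictions = [125073.46117519, 107942.54574181, 109370.12202793]

# Invert y = m * x + b to recover the x behind each prediction.
recovered_years = [(y - b) / m for y in sample_predictions]
for y, x in zip(sample_predictions, recovered_years):
    print("prediction %0.2f -> about %0.2f years" % (y, x))
```

Each recovered value lands very close to a whole number of years, which is what one would expect if the test inputs were whole years of experience.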
### Python



+----------------------+
red regression line for Python
coefficient = 2573.62
intercept = 123479.15
rmse = 39759.45
r2 score = 0.34
sample predictions:
[[126052.77118246]
[174951.60602361]
[187819.7204555 ]]
+----------------------+
+----------------------+
cyan regression line for Python
coefficient = 10378.53
intercept = 82957.69
rmse = 42374.26
r2 score = 0.38
sample predictions:
[[139882.01866593]
[117229.55243376]
[137277.30441955]]
+----------------------+

For data scientists, analysts, or engineers, a linear model is a moderate fit at best, with r2 scores of 0.34 (red) and 0.38 (cyan). Compensation appears to diverge around the 10-year mark of their careers, which may reflect the field they work in (advertising, finance, bio/medical, and so on).

Depending on the model, entry- or junior-level professionals start at roughly $82,957 (cyan intercept) or $123,479 (red intercept). Their income then increases by about $10,378 (cyan) or $2,573 (red) per unit of x, where x is years of experience for the red line and the transformed or shifted experience for the cyan line.

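
The rmse and r2 score reported for each line come from scikit-learn in the notebook (`root_mean_squared_error` and `r2_score`). As a reminder of what the two numbers measure, here is a minimal standard-library sketch of the usual definitions; the helper names and the compensation values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up annual compensation values, in dollars.
actual = [90000, 120000, 150000, 200000]
predicted = [100000, 110000, 160000, 190000]

error = rmse(actual, predicted)
score = r2(actual, predicted)
print("rmse = %0.2f" % error)
print("r2 score = %0.2f" % score)
```

An rmse of about $40,000, as in the Python models above, means a typical prediction misses by roughly that amount, while an r2 score near 0.35 means about a third of the variance in compensation is explained by the line.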
## Acknowledgements

* "Udacity AI" (ChatGPT), for the idea of transforming the x values to turn a linear regression into a logarithmic regression.
BIN
images/programmers-C-United-States-of-America.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 82 KiB |
BIN
images/programmers-Python-United-States-of-America.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 92 KiB |
@@ -149,25 +149,15 @@
"            self.devs = devs.drop(indices)\n",
"        del devs, new_column\n",
"    \n",
"    def visualize(self, n_lowest=0, \n",
"             hue=\"Country\", palette=sb.color_palette() ): \n",
"    def visualize(self, hue=\"Country\", \n",
"                  palette=sb.color_palette() ): \n",
"        self.canvas = plt.figure()\n",
"        key_x = self.key_x\n",
"        key_y = self.key_y\n",
"\n",
"        if n_lowest > 0:\n",
"            # chatgpt draws my line\n",
"            # Calculate the lowest nth point (for example, the 5th lowest value)\n",
"            # iloc[-1] gets the last element from the n smallest\n",
"            lowest_nth = self.devs[key_y].nsmallest(n_lowest).iloc[-1] \n",
"            # Draw a horizontal line at the lowest nth point\n",
"            # label=f'Lowest {n_poorest}th Point: {lowest_nth_value:.2f}'\n",
"            plt.axhline(y=lowest_nth, color='purple', linestyle='--', \n",
"                label=\"y=%0.2f\" % lowest_nth, zorder=-1 )\n",
"\n",
"        sb.scatterplot(data=self.devs, x=key_x, y=key_y, hue=hue, palette=palette)\n",
"        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
"        title = \"Annual Salary of %s Developers Over Years of Experience\" % self.language\\\n",
"        title = \"Annual Compensation of %s Programmers Over Years of Experience\" % self.language\\\n",
"            + \"\\nsample size=%i\" % len (self.devs)\\\n",
"            + \"\\ncountry=%s\" % self.country\n",
"        plt.title(title)\n",
@@ -188,22 +178,25 @@
"\n",
"        model.fit(X_train, y_train)\n",
"        y_pred = model.predict(X_test)\n",
"\n",
"        \n",
"        m = model.coef_[0][0]\n",
"        b = model.intercept_[0]\n",
"        print(\"+----------------------+\")\n",
"        print(\"%s regression line for %s\" % (line_color, self.language))\n",
"        print(\"coefficient =\", model.coef_)\n",
"        print('intercept=', model.intercept_)\n",
"        print(\"coefficient = %0.2f\" % m)\n",
"        print('intercept = %0.2f' % b)\n",
"        rmse = root_mean_squared_error(y_test, y_pred)\n",
"        print(\"rmse = \", rmse)\n",
"        print(\"rmse = %0.2f\" % rmse)\n",
"        r2 = r2_score(y_test, y_pred)\n",
"        print(\"r2 score = \", r2)\n",
"        print(\"r2 score = %0.2f\" % r2)\n",
"        print(\"sample predictions:\")\n",
"        print(y_pred[3:6])\n",
"        print(\"+----------------------+\")\n",
"        \n",
"        plt.figure(self.canvas)\n",
"\n",
"        plt.figure(self.canvas)\n",
"        plt.plot(X_test, y_pred, color=line_color, label='Regression Line')\n",
"        plt.axhline(y=b, color=\"purple\", linestyle='--', \n",
"            label=\"b=%0.2f\" % b, zorder=-1 )\n",
"        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
"        del y_pred, model\n",
"\n",
@@ -255,7 +248,7 @@
"\n",
"* The income of a data professional can either increase by 2,000 per year (red) or 10,000 per year (cyan).\n",
"\n",
"* For both models, the r2 score ranges from poor to moderate = 0.20 - 0.37 depending on the random number. The variability not explained by the model could be the result of the fields such as advertising, finance, or bio/medical technology.\n",
"* For both models, the r2 score ranges from poor to moderate = 0.20 - 0.37 depending on the random number. The variability not explained by the model could be the result of different fields that employ data scientists/analysts/engineers such as finance, bio/med, or advertising.\n",
"\n",
"* For any given point in the career, the model is off by 39,000 or 42,000 dollars.\n",
"\n",
@@ -280,7 +273,7 @@
"    # \"Product manager\"\n",
"]\n",
"c = Foo(so_df, \"C\", jobs=cjobs, n_rich_outliers=30, n_poor_outliers=2)\n",
"c.visualize(n_lowest=7, hue=\"DevType\", palette=[\"#57e6da\",\"#d9e352\",\"#cc622d\"] ) \n",
"c.visualize(hue=\"DevType\", palette=[\"#57e6da\",\"#d9e352\",\"#cc622d\"] ) \n",
"c.run_regression()\n",
"c.run_regression(x_transform=log_base_a, change_base=1.3, \n",
"                 x_shift=2, y_shift=-5000, line_color=\"magenta\", random=555)\n",