Compare commits
20 Commits
e4ba004fec
...
write-read
Author | SHA1 | Date | |
---|---|---|---|
af45d1a9c8 | |||
1570c4559f | |||
5e56263469 | |||
4b46a735cf | |||
05dc8b41e1 | |||
02067139e6 | |||
259e19f2ac | |||
800712c63c | |||
721a38435b | |||
bdcba003fe | |||
320fcb343b | |||
a6bc83fa8b | |||
f073019538 | |||
08eb095bf6 | |||
2d91e205a2 | |||
71d1efa292 | |||
c6096cfe6c | |||
ea4ee3f493 | |||
0c5cc2259d | |||
cbd575697f |
101
README.md
Normal file
@@ -0,0 +1,101 @@
|
||||
|
||||
<!--Your Github repository must have the following contents:
|
||||
|
||||
A README.md file that communicates the libraries used, the motivation for the project, the files in the repository with a small description of each, a summary of the results of the analysis, and necessary acknowledgments.
|
||||
|
||||
Your code in a Jupyter notebook, with appropriate comments, analysis, and documentation.
|
||||
|
||||
You may also provide any other necessary documentation you find necessary.-->
|
||||
|
||||
# stacksurvey
|
||||
|
||||
**stacksurvey** is an exploration and analysis of data from Stack Overflow's 2024 Developer Survey.
|
||||
|
||||
[https://survey.stackoverflow.co/2024/](https://survey.stackoverflow.co/2024/)
|
||||
|
||||
The motivation for this project is to satisfy a class assignment. Eventually, an interesting (enough) topic was discovered in the data set:
|
||||
|
||||
>What is the annual compensation (y) over years of experience (x) of developers in a specific country who use a given programming language?
|
||||
|
||||
## Requirements
|
||||
|
||||
numpy pandas scikit-learn matplotlib seaborn
|
||||
|
||||
## Summary of Analysis
|
||||
|
||||
The models generated by the notebook become less reliable with years of experience greater than 10 or annual incomes greater than $200,000.
|
||||
|
||||
Each chart comes with two regression lines. Red is the default regression line that has not been tuned. The other is an attempt to better fit the data by either transforming or shifting x.
|
||||
|
||||
The transformation is typically
|
||||
|
||||
y = m * log(x) + b
|
||||
|
||||
where the base is a parameter.
|
||||
|
||||
Each model uses a different base for the log function.
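
As a rough illustration of the transformation described above, the sketch below fits `y = m * log_a(x) + b` by transforming x first and then running ordinary least squares. It uses only the standard library, and the data points are hypothetical (generated from a known curve, not taken from the survey). Note that it uses the standard change-of-base formula; the notebook's `log_base_a` helper divides `log10(x)` by the natural log of the base, which differs by a constant factor and so should only rescale the coefficient.

```python
import math

def log_base_a(x, a):
    """Logarithm of x in base a, via the change-of-base formula."""
    return math.log(x) / math.log(a)

def fit_line(xs, ys):
    """Ordinary least squares for y = m * x + b on two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data generated from y = 10000 * log_2(x) + 50000.
years = [1, 2, 4, 8, 16]
income = [50000 + 10000 * log_base_a(x, 2) for x in years]

# Transform x, then fit a straight line to (log_a(x), y).
m, b = fit_line([log_base_a(x, 2) for x in years], income)
print("coefficient =", round(m, 2))
print("intercept =", round(b, 2))
```

With this approach the coefficient is the change in y per unit of `log_a(x)`, not per unit of x, so choosing a different base changes the coefficient but not the shape of the fitted curve.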
|
||||
|
||||
### C
|
||||
|
||||

|
||||
|
||||
+----------------------+
|
||||
red regression line for C
|
||||
coefficient = 1427.58
|
||||
intercept = 103659.82
|
||||
rmse = 26971.44
|
||||
r2 score = 0.06
|
||||
sample predictions:
|
||||
[[125073.46117519]
|
||||
[107942.54574181]
|
||||
[109370.12202793]]
|
||||
+----------------------+
|
||||
+----------------------+
|
||||
magenta regression line for C
|
||||
coefficient = 11973.47
|
||||
intercept = 54776.27
|
||||
rmse = 21198.61
|
||||
r2 score = 0.57
|
||||
sample predictions:
|
||||
[[132396.26294684]
|
||||
[119937.35465744]
|
||||
[ 64985.1549115 ]]
|
||||
+----------------------+
|
||||
|
||||
For C programmers, a linear model fits moderately well, with an r2 score of 0.57. Junior-level positions earn roughly $54,776, and income rises by about $11,973 with each year of experience.
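
Because the magenta line is fit on transformed x values, one way to read its numbers is to pass a year count through the same transform before applying the coefficient and intercept. The sketch below is a reconstruction under stated assumptions, not the notebook's own code: it copies the rounded coefficient and intercept from the magenta output block and assumes the parameters of the notebook call (`change_base=1.3`, `x_shift=2`, `y_shift=-5000`). Under this reading, the three printed sample predictions appear consistent with inputs of 15, 8, and 0.5 years.

```python
import math

# Rounded values copied from the magenta output block.
COEFFICIENT = 11973.47
INTERCEPT = 54776.27

# Assumed from the notebook call for the magenta line.
BASE = 1.3
X_SHIFT = 2
Y_SHIFT = -5000

def transform_years(years):
    """Mirror the notebook helper (log10 over natural log of the base), then shift x."""
    return math.log10(years) / math.log(BASE) + X_SHIFT

def predict_shifted_income(years):
    """Model output on the shifted income scale (income + Y_SHIFT)."""
    return COEFFICIENT * transform_years(years) + INTERCEPT

def predict_income(years):
    """Undo the y shift to estimate income on the original scale."""
    return predict_shifted_income(years) - Y_SHIFT

for years in (0.5, 8, 15):
    shifted = predict_shifted_income(years)
    print(years, round(shifted, 2), round(predict_income(years), 2))
```

Read this way, the coefficient is the change per unit of transformed experience rather than per calendar year, so the raise implied by one extra year shrinks as experience grows.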
|
||||
|
||||
### Python
|
||||
|
||||

|
||||
|
||||
+----------------------+
|
||||
red regression line for Python
|
||||
coefficient = 2573.62
|
||||
intercept = 123479.15
|
||||
rmse = 39759.45
|
||||
r2 score = 0.34
|
||||
sample predictions:
|
||||
[[126052.77118246]
|
||||
[174951.60602361]
|
||||
[187819.7204555 ]]
|
||||
+----------------------+
|
||||
+----------------------+
|
||||
cyan regression line for Python
|
||||
coefficient = 10378.53
|
||||
intercept = 82957.69
|
||||
rmse = 42374.26
|
||||
r2 score = 0.38
|
||||
sample predictions:
|
||||
[[139882.01866593]
|
||||
[117229.55243376]
|
||||
[137277.30441955]]
|
||||
+----------------------+
|
||||
|
||||
For data scientists, analysts, or engineers, a linear model is a moderate fit at best, with r2 scores of 0.34 and 0.38. There appears to be a divergence at the 10-year mark in their careers. This may be the result of their field (advertising, finance, bio/medical, and so on).
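
For readers unfamiliar with the two metrics quoted in the output blocks, here is a minimal standard-library sketch of how RMSE and the r2 score are defined for a single target. The incomes and predictions are hypothetical values chosen for easy arithmetic, not survey data, and the function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in units of y."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variation over total variation."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical incomes and model predictions.
y_true = [100000, 120000, 140000, 160000]
y_pred = [110000, 110000, 150000, 150000]

error = rmse(y_true, y_pred)
score = r2(y_true, y_pred)
print("rmse =", error)
print("r2 score =", round(score, 2))
```

An r2 score near 0.3 therefore means roughly 70% of the variation in income is left unexplained by years of experience alone.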
|
||||
|
||||
Entry- or junior-level professionals generally have an income of $82,957 (cyan model) or $123,479 (red model). Their annual income increases by $10,378 (cyan) or $2,573 (red) each year.
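
To compare the two Python lines at specific points in a career, the sketch below evaluates both models side by side. It is an approximation under stated assumptions: the coefficients and intercepts are the rounded values printed in the output blocks, and the cyan parameters (`change_base=1.20`, `x_shift=0`, `y_shift=-1.5e4`) are taken from the notebook call. The function names are illustrative only.

```python
import math

def red_model(years):
    """Untuned line: income as a straight-line function of raw years."""
    return 2573.62 * years + 123479.15

def cyan_model_shifted(years):
    """Tuned line on the shifted income scale, using the notebook-style log transform."""
    x = math.log10(years) / math.log(1.20)  # x_shift is assumed to be 0
    return 10378.53 * x + 82957.69

def cyan_model(years):
    """Undo the assumed y shift of -15000 to return to the original income scale."""
    return cyan_model_shifted(years) + 15000

for years in (1, 10, 25):
    print(years, round(red_model(years), 2), round(cyan_model(years), 2))
```

Since the cyan line grows with the log of experience while the red line grows linearly, the two estimates can differ by tens of thousands of dollars depending on the year chosen, which is worth keeping in mind when quoting a single starting salary or raise.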
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
* "Udacity AI" (ChatGPT), for the idea of transforming x values to adapt a linear regression into a logarithmic regression.
|
||||
|
BIN
images/programmers-C-United-States-of-America.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 81 KiB |
BIN
images/programmers-Python-United-States-of-America.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 91 KiB |
@@ -30,14 +30,332 @@
|
||||
"print(so_df.keys())\n",
|
||||
"so_df.describe()\n",
|
||||
"\n",
|
||||
"# print(so_df[:3])"
|
||||
"# check for people who aren't paying attention\n",
|
||||
"count_not_apple = (so_df[\"Check\"] != \"Apples\").sum()\n",
|
||||
"print(count_not_apple)\n",
|
||||
"print(so_df.shape)\n",
|
||||
"assert(count_not_apple == 0)\n",
|
||||
"# print(so_df[:3])\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0e9b0c49-eac6-45e1-83f1-92813e734ef5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# draw count plot of developers based on age\n",
|
||||
"\n",
|
||||
"def visualize_devs(df, lang, key=\"Age\",):\n",
|
||||
" plt.figure()\n",
|
||||
" plt.xticks(rotation=45)\n",
|
||||
" # from:\n",
|
||||
" # print(df[key].unique())\n",
|
||||
" order = ['Under 18 years old', '18-24 years old', \\\n",
|
||||
" '25-34 years old','35-44 years old',\\\n",
|
||||
" '45-54 years old', '55-64 years old', \\\n",
|
||||
" '65 years or older', 'Prefer not to say']\n",
|
||||
" sb.countplot(x=key, data=df, order=order)\n",
|
||||
" title=\"Ages of %s Programmers\" % lang\n",
|
||||
" plt.title(title)\n",
|
||||
" filename= \"images/%s-of-%s-programmers.png\" % (key, lang)\n",
|
||||
" plt.savefig(filename, bbox_inches=\"tight\")\n",
|
||||
"\n",
|
||||
"def get_lang_devs(df, lang):\n",
|
||||
" col = \"LanguageHaveWorkedWith\"\n",
|
||||
" # will not work for single character languages (C, R)\n",
|
||||
" # will mangle Java and JavaScript, Python and MicroPython\n",
|
||||
" return df[ df[col].str.contains(lang, na=False) ] \n",
|
||||
"\n",
|
||||
"def get_c_devs(df, lang=\"C\"):\n",
|
||||
" key = \"LanguageHaveWorkedWith\"\n",
|
||||
" cdevs = []\n",
|
||||
" for index, dev in df.iterrows():\n",
|
||||
" try:\n",
|
||||
" # split string into list\n",
|
||||
" langs_used = dev[key].split(';')\n",
|
||||
" if lang in langs_used:\n",
|
||||
" cdevs.append(dev)\n",
|
||||
" except AttributeError:\n",
|
||||
"# print(dev[key])\n",
|
||||
" pass\n",
|
||||
" return pd.DataFrame(cdevs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b8212c27-6c76-4c8f-ba66-bbf1b5835c99",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"from sklearn.linear_model import LinearRegression\n",
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"from sklearn.metrics import root_mean_squared_error, r2_score\n",
|
||||
"import traceback\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"# still haven't come up with a name\n",
|
||||
"class Foo:\n",
|
||||
" def __init__(self, dataset, language, jobs=None, \n",
|
||||
" n_rich_outliers=0, n_poor_outliers=0, \n",
|
||||
" country=\"United States of America\"):\n",
|
||||
" self.devs = None\n",
|
||||
" self.canvas = None\n",
|
||||
" self.language = language\n",
|
||||
" self.country = country\n",
|
||||
" # focus on people who have given ...\n",
|
||||
" key_x = \"YearsCodePro\"\n",
|
||||
" key_y = \"ConvertedCompYearly\"\n",
|
||||
" df = dataset.dropna(subset=[key_x, key_y])\n",
|
||||
" self.key_x = key_x\n",
|
||||
" self.key_y = key_y\n",
|
||||
" \n",
|
||||
" qualifiers = {\n",
|
||||
" \"MainBranch\":\"I am a developer by profession\",\n",
|
||||
" }\n",
|
||||
" if country:\n",
|
||||
" qualifiers[\"Country\"] = country\n",
|
||||
" for k in qualifiers:\n",
|
||||
" df = df[df[k] == qualifiers[k] ] \n",
|
||||
"\n",
|
||||
" # chatgpt tells me about filtering with multiple strings\n",
|
||||
" if jobs:\n",
|
||||
" df = df[df.isin(jobs).any(axis=1)]\n",
|
||||
"\n",
|
||||
" devs = None\n",
|
||||
" if len(language) == 1 or language in [\"Python\", \"Java\"]:\n",
|
||||
" devs = get_c_devs(df, lang=language)\n",
|
||||
" else:\n",
|
||||
" devs = get_lang_devs(df, language)\n",
|
||||
" \n",
|
||||
" replacement_dict = {\n",
|
||||
" 'Less than 1 year': '0.5',\n",
|
||||
" 'More than 50 years': '51',\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # https://stackoverflow.com/questions/47443134/update-column-in-pandas-dataframe-without-warning\n",
|
||||
" pd.options.mode.chained_assignment = None # default='warn'\n",
|
||||
" new_column = devs[key_x].replace(replacement_dict)\n",
|
||||
" devs[key_x] = pd.to_numeric(new_column, errors='coerce')\n",
|
||||
" pd.options.mode.chained_assignment = 'warn' # default='warn'\n",
|
||||
" # print( devs[key_x].unique() )\n",
|
||||
" \n",
|
||||
" indices = devs[key_y].nlargest(n_rich_outliers).index\n",
|
||||
" devs = devs.drop(indices)\n",
|
||||
" indices = devs[key_y].nsmallest(n_poor_outliers).index\n",
|
||||
" self.devs = devs.drop(indices)\n",
|
||||
" del devs, new_column\n",
|
||||
" \n",
|
||||
" def visualize(self, hue=\"Country\", \n",
|
||||
" palette=sb.color_palette() ): \n",
|
||||
" self.canvas = plt.figure()\n",
|
||||
" key_x = self.key_x\n",
|
||||
" key_y = self.key_y\n",
|
||||
"\n",
|
||||
" sb.scatterplot(data=self.devs, x=key_x, y=key_y, hue=hue, palette=palette)\n",
|
||||
" plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
|
||||
" title = \"Annual Salary of %s Developers Over Years of Experience\" % self.language\\\n",
|
||||
" + \"\\nsample size=%i\" % len (self.devs)\\\n",
|
||||
" + \"\\ncountry=%s\" % self.country\n",
|
||||
" plt.title(title)\n",
|
||||
"\n",
|
||||
" def run_regression(self, model=LinearRegression(), split=train_test_split, \n",
|
||||
" x_transform=None, change_base=None, x_shift=0, y_shift=0,\n",
|
||||
" line_color='red', random=333):\n",
|
||||
" df = self.devs # .sort_values(by = self.key2)\n",
|
||||
" X = df[self.key_x].to_frame()\n",
|
||||
" if x_transform is not None and change_base is not None:\n",
|
||||
" X = x_transform (X, a=change_base ) \n",
|
||||
" elif x_transform is not None:\n",
|
||||
" X = x_transform (X) \n",
|
||||
" X = X + x_shift\n",
|
||||
" y = df[self.key_y].to_frame() + y_shift\n",
|
||||
" \n",
|
||||
" X_train, X_test, y_train, y_test = split(X, y, test_size=0.2, random_state=random)\n",
|
||||
"\n",
|
||||
" model.fit(X_train, y_train)\n",
|
||||
" y_pred = model.predict(X_test)\n",
|
||||
"\n",
|
||||
" print(\"+----------------------+\")\n",
|
||||
" print(\"%s regression line for %s\" % (line_color, self.language))\n",
|
||||
" print(\"coefficient =\", model.coef_)\n",
|
||||
" print('intercept=', model.intercept_)\n",
|
||||
" rmse = root_mean_squared_error(y_test, y_pred)\n",
|
||||
" print(\"rmse = \", rmse)\n",
|
||||
" r2 = r2_score(y_test, y_pred)\n",
|
||||
" print(\"r2 score = \", r2)\n",
|
||||
" print(\"sample predictions:\")\n",
|
||||
" print(y_pred[3:6])\n",
|
||||
" print(\"+----------------------+\")\n",
|
||||
" b = model.intercept_[0]\n",
|
||||
"\n",
|
||||
" plt.figure(self.canvas)\n",
|
||||
" plt.plot(X_test, y_pred, color=line_color, label='Regression Line')\n",
|
||||
" plt.axhline(y=b, color=\"purple\", linestyle='--', \n",
|
||||
" label=\"b=%0.2f\" % b, zorder=-1 )\n",
|
||||
" plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
|
||||
" del y_pred, model\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" def export_image(self, base_filename = \"images/programmers-%s-%s.png\"):\n",
|
||||
" plt.figure(self.canvas)\n",
|
||||
" filename = base_filename % (self.language, self.country)\n",
|
||||
" plt.savefig(filename.replace(' ', '-'), bbox_inches='tight')\n",
|
||||
"\n",
|
||||
"# the higher a is, the steeper the line gets; note log10(x)/ln(a) is log_a(x) scaled by 1/ln(10), not a true base-a log\n",
|
||||
"def log_base_a(x, a=1.07):\n",
|
||||
" return np.log10(x)/np.log(a)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "ba81c59c-0610-4f71-96fb-9eddd7736329",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# expected python jobs\n",
|
||||
"pyjobs = [\"Data scientist or machine learning specialist\",\n",
|
||||
" \"Data or business analyst\",\n",
|
||||
" \"Data engineer\",\n",
|
||||
"# \"DevOps specialist\",\n",
|
||||
"# \"Developer, QA or test\"\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"python = Foo(so_df, \"Python\", jobs=pyjobs, n_rich_outliers=12, n_poor_outliers=2)\n",
|
||||
"python.visualize(hue=\"DevType\", palette=[\"#dbdb32\", \"#34bf65\", \"#ac70e0\"])\n",
|
||||
"python.run_regression()\n",
|
||||
"python.run_regression( x_transform=log_base_a, change_base=1.20, \n",
|
||||
" x_shift=0, y_shift=-1.5e4, line_color='cyan', random=888)\n",
|
||||
"python.export_image()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f4e3516e-ffe3-4768-ae92-e5cb0be503f8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Business Understanding\n",
|
||||
"\n",
|
||||
"* For Python programmers specialized in data (e.g., data scientists, engineers, or analysts), a linear model moderately fits the relationship between income and years of experience. For data engineers and scientists, there is a possible divergence within the career path at 10 years of experience. \n",
|
||||
"\n",
|
||||
"* The typical starting salary within the field of data science is around 100,000 to 120,000 dollars.\n",
|
||||
"\n",
|
||||
"* The income of a data professional can either increase by 2,000 per year (red) or 10,000 per year (cyan).\n",
|
||||
"\n",
|
||||
"* For both models, the r2 score ranges from poor to moderate = 0.20 - 0.37 depending on the random number. The variability not explained by the model could be the result of the fields such as advertising, finance, or bio/medical technology.\n",
|
||||
"\n",
|
||||
"* For any given point in the career, the model is off by 39,000 or 42,000 dollars.\n",
|
||||
"\n",
|
||||
"Generally, for low incomes poorly explained by the model, the cause could be getting a new job after a year of unemployment, internships, or part-time positions. For high incomes poorly explained by the model, the cause could be professionals at large companies who had recently added a programming language to their skill set. Other causes could be company size or working hours.\n",
|
||||
"\n",
|
||||
"(Business understanding was done for C first. Questions are there.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0e27f76c-8f87-4c39-ac2f-5a9b2434466f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# expected C jobs\n",
|
||||
"cjobs = [\n",
|
||||
" \"Developer, embedded applications or devices\", \n",
|
||||
" \"Developer, game or graphics\",\n",
|
||||
" \"Hardware Engineer\" ,\n",
|
||||
" # \"Project manager\", \n",
|
||||
" # \"Product manager\"\n",
|
||||
"]\n",
|
||||
"c = Foo(so_df, \"C\", jobs=cjobs, n_rich_outliers=30, n_poor_outliers=2)\n",
|
||||
"c.visualize(hue=\"DevType\", palette=[\"#57e6da\",\"#d9e352\",\"#cc622d\"] ) \n",
|
||||
"c.run_regression()\n",
|
||||
"c.run_regression(x_transform=log_base_a, change_base=1.3, \n",
|
||||
" x_shift=2, y_shift=-5000, line_color=\"magenta\", random=555)\n",
|
||||
"c.export_image()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89d86a1e-dc65-48e4-adcf-bb10188fd0b7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Business Understanding\n",
|
||||
"\n",
|
||||
"1. For C programmers, specifically embedded systems, graphics, and hardware engineers, a linear model fits the relationship of income and years of experience.\n",
|
||||
"\n",
|
||||
"2. A coefficient = 11973.469 indicates that for each year of experience, a C programmer typically earns an additional $12,000 per year.\n",
|
||||
"\n",
|
||||
"3. Because the graph looks like a spray of water, after 10 years of experience, the salaries for C programming professionals strongly vary.\n",
|
||||
"\n",
|
||||
"4. A junior C programmer, at 2 years of experience, typically earns $54,776.266 per year.\n",
|
||||
"\n",
|
||||
"5. An r2 score = 0.571 indicates a bit over half of the variability in data is explained by the independent variable. This however is only for incomes below 200,000 dollars. Some participants with 5 years of professional experience were reporting incomes at or around $200,000. These were considered unusual outliers. Among the game developers, they may have independently released a game.\n",
|
||||
"\n",
|
||||
"rmse = 21198.612 indicates the model is off by around 21,000 dollars for a given point in a career.\n",
|
||||
"\n",
|
||||
"### Questions that can be answered\n",
|
||||
"\n",
|
||||
"* Is there a linear relationship between income and years of experience?\n",
|
||||
"* Is there a point in a career where raises stop occurring?\n",
|
||||
"* What is the typical salary of an entry-level or junior C programmer?\n",
|
||||
"* How much more do C programmers earn for each year of experience?\n",
|
||||
"* How much of the variability is explained by the model and what factors are not considered?\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "928f421c-1f2b-4be1-9ce2-8f3593a9a823",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Below cells generate extra or unused graphs."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "8357f841-23a0-4bfa-bf09-860bd3e014b8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"jsjobs = [\"Developer, full-stack\",\n",
|
||||
" \"Developer, front-end\",\n",
|
||||
" \"Developer, mobile\"\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"js = Foo(so_df, \"JavaScript\", jobs=jsjobs, n_rich_outliers=6, country=\"Ukraine\")\n",
|
||||
"js.visualize(hue=\"DevType\")\n",
|
||||
"js.export_image()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "11a1b9fb-db48-4749-8d77-4241a99d7bad",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"visualize_devs( get_c_devs(so_df) , \"C\")\n",
|
||||
"\n",
|
||||
"for lang in [\"Cobol\", \"Prolog\", \"Ada\", \"Python\"]:\n",
|
||||
" foo = get_lang_devs(so_df, lang)\n",
|
||||
" visualize_devs(foo, lang)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "35b9727a-176c-4193-a1f9-a508aecd2d1c",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"source_hidden": true
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# get popularity of different programming languages\n",
|
||||
@@ -89,7 +407,7 @@
|
||||
" plt.grid(axis='x', linestyle='--', alpha=0.75) \n",
|
||||
" plt.title(\"%s vs %s\" % (label1, label2))\n",
|
||||
" if saveto is not None:\n",
|
||||
" plt.savefig(saveto)\n",
|
||||
" plt.savefig(saveto, bbox_inches='tight')\n",
|
||||
" del df, df2\n",
|
||||
"\n",
|
||||
"l1 = get_langs( so_df )\n",
|
||||
@@ -110,18 +428,25 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d0bfdb92-378a-4452-91cc-4d21afd2d6cc",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"source_hidden": true
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# draw horizontal bar plot\n",
|
||||
"# https://seaborn.pydata.org/examples/part_whole_bars.html\n",
|
||||
"\n",
|
||||
"# investigate extrinsic vs intrinsic motivation\n",
|
||||
"def get_difference(dict1, dict2):\n",
|
||||
"def get_difference(dict1, dict2, proportion=False):\n",
|
||||
" keys = dict1.keys()\n",
|
||||
" result = dict()\n",
|
||||
" for key in keys:\n",
|
||||
" result[key] = dict1[key] - dict2[key]\n",
|
||||
" if proportion:\n",
|
||||
" result[key] = round((dict1[key] - dict2[key])/dict2[key],2)\n",
|
||||
" else:\n",
|
||||
" result[key] = dict1[key] - dict2[key]\n",
|
||||
" return result\n",
|
||||
"\n",
|
||||
"def visualize_diff(diff_dict, color=\"lightblue\", saveto=None):\n",
|
||||
@@ -132,7 +457,6 @@
|
||||
" df = pd.DataFrame(diff_sorted.items(), columns=['Languages', 'Value'])\n",
|
||||
" plt.figure(figsize=(15,20)) \n",
|
||||
" sb.barplot(x=KEY, y='Languages', data=df, color=color)\n",
|
||||
" \n",
|
||||
" DELTA = '\\u0394'\n",
|
||||
" for index, value in enumerate(df[KEY]):\n",
|
||||
" # chatgpt annotates my chart\n",
|
||||
@@ -144,47 +468,178 @@
|
||||
" # Adjust the x position for negative values\n",
|
||||
" plt.text(value, index, DELTA+str(value), va='center', ha='right') \n",
|
||||
" lowest = 0\n",
|
||||
" offset = 0.5\n",
|
||||
" offset = 0\n",
|
||||
" positive_values = df[df[KEY] > 0][KEY]\n",
|
||||
" if not positive_values.empty:\n",
|
||||
" lowest = positive_values.min()\n",
|
||||
" offset = list(positive_values).count(lowest) \n",
|
||||
" if len(positive_values) < len(df):\n",
|
||||
" # don't draw the line if every value is greater than 0\n",
|
||||
" plt.axhline(y=df[KEY].tolist().index(lowest) + offset, color='red', linestyle='--')\n",
|
||||
" # don't draw the line if every value is greater than 0_\n",
|
||||
" plt.axhline(y=df[KEY].tolist().index(lowest) + (offset-0.5), \n",
|
||||
" color='red', linestyle='--', zorder=-1)\n",
|
||||
" if saveto is not None:\n",
|
||||
" plt.savefig(saveto)\n",
|
||||
" plt.savefig(saveto, bbox_inches='tight')\n",
|
||||
" \n",
|
||||
"motiv_diff = get_difference(l2, l1)\n",
|
||||
"motiv_diff = get_difference(l2, l1, proportion=True)\n",
|
||||
"# print(motiv_diff)\n",
|
||||
"visualize_diff(motiv_diff, saveto=\"images/delta.png\")\n",
|
||||
"motiv_diff = get_difference(l2, l1)\n",
|
||||
"visualize_diff(motiv_diff, saveto=\"images/delta-b.png\")\n",
|
||||
"\n",
|
||||
"# no clear description of what \"admired\" is\n",
|
||||
"# in the schema\n",
|
||||
"# but generally people want to use the languages\n",
|
||||
"# they admire\n",
|
||||
"\n",
|
||||
"# determine level of hype\n",
|
||||
"hype = get_difference(l4, l3)\n",
|
||||
"# hype = get_difference(l4, l3)\n",
|
||||
"# print(hype)\n",
|
||||
"visualize_diff(hype, color=\"red\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e90cf119-c50d-468a-bc87-72dac41176ce",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# print survey ans\n",
|
||||
"employment_status = Counter(so_df[\"MainBranch\"])\n",
|
||||
"print(employment_status)\n",
|
||||
"\n",
|
||||
"print(so_df[\"ConvertedCompYearly\"][])"
|
||||
"# visualize_diff(hype, color=\"red\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f6b1a935-eeda-416f-8adf-5e854d3aa066",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"source_hidden": true
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
"source": [
|
||||
"# do people fall out of love with langs\n",
|
||||
"# the more they are used professionally?\n",
|
||||
"\n",
|
||||
"def visualize_favor(df, key_x, key_y, MAGIC_X=0, MAGIC_Y=0, title=str(), saveto=None):\n",
|
||||
" plt.figure()\n",
|
||||
" OFFSET = 1 # push text away from point slightly\n",
|
||||
" for i in range(df.shape[0]):\n",
|
||||
" # label points that aren't in a cluster\n",
|
||||
" if df[key_x].iloc[i] > MAGIC_X or df[key_y].iloc[i] > MAGIC_Y:\n",
|
||||
" plt.text(df[key_x].iloc[i]+OFFSET, \n",
|
||||
" df[key_y].iloc[i]+OFFSET, \n",
|
||||
" df[\"Language\"].iloc[i], \n",
|
||||
" ha=\"left\",\n",
|
||||
" size='medium')\n",
|
||||
"\n",
|
||||
" sb.scatterplot(data=df, x=key_x, y=key_y, hue=\"Language\")\n",
|
||||
" plt.legend(loc='lower left', bbox_to_anchor=(0, -1.25), ncol=3) \n",
|
||||
" plt.title(title)\n",
|
||||
" if saveto is not None:\n",
|
||||
" plt.savefig(saveto, bbox_inches='tight')\n",
|
||||
" pass\n",
|
||||
"key_x = \"Users\"\n",
|
||||
"key_y = \"Potential '\\u0394'Users\"\n",
|
||||
"df1 = pd.DataFrame(l1.items(), columns=['Language', key_x])\n",
|
||||
"df2 = pd.DataFrame(motiv_diff.items(), columns=['Language', key_y])\n",
|
||||
"# chatgpt tells me how to combine df\n",
|
||||
"merged = pd.merge(df1, df2[[\"Language\", key_y]], on='Language', how='left')\n",
|
||||
"visualize_favor(merged, key_x, key_y, \n",
|
||||
" MAGIC_X=5000, MAGIC_Y=2000, \n",
|
||||
" saveto=\"images/favor.png\")\n",
|
||||
"del df1, df2, merged"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e90cf119-c50d-468a-bc87-72dac41176ce",
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"source_hidden": true
|
||||
},
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# see how much money are people making\n",
|
||||
"\n",
|
||||
"def get_mean_by_category(df, category, key=\"ConvertedCompYearly\"):\n",
|
||||
" unique = df[category].unique()\n",
|
||||
" result = dict()\n",
|
||||
" for u in unique:\n",
|
||||
" mean = df[df[category] == u][key].mean()\n",
|
||||
" result[u] = mean\n",
|
||||
" return result\n",
|
||||
"\n",
|
||||
"def show_me_the_money(df, saveto=None):\n",
|
||||
" key_x = \"ConvertedCompYearly\"\n",
|
||||
" key_y = \"DevType\"\n",
|
||||
" \n",
|
||||
" means = get_mean_by_category(df, key_y) \n",
|
||||
" mean_df = pd.DataFrame(means.items(), columns=[key_y, key_x])\n",
|
||||
"\n",
|
||||
" plt.figure(figsize=(14,18)) \n",
|
||||
" plt.axvline(x=1e5, color='red', linestyle='--', label=\"x = $100,000\")\n",
|
||||
" plt.axvline(x=1e6, color='lightgreen', linestyle='--', label=\"x = millionaire\")\n",
|
||||
" sb.barplot(x=key_x, y=key_y, data=mean_df.sort_values(by=key_x), \\\n",
|
||||
" color='lavender', alpha=0.7, label=\"average compensation\")\n",
|
||||
" sb.stripplot(x=key_x, y=key_y, data=df, \\\n",
|
||||
" size=3, jitter=True)\n",
|
||||
" if saveto is not None:\n",
|
||||
" plt.savefig(saveto, bbox_inches='tight')\n",
|
||||
" \n",
|
||||
"# print survey ans\n",
|
||||
"#employment_status = Counter(so_df[\"MainBranch\"])\n",
|
||||
"#print(employment_status)\n",
|
||||
"\n",
|
||||
"#employment_type = Counter(so_df[\"DevType\"])\n",
|
||||
"#print(employment_type)\n",
|
||||
"\n",
|
||||
"key = \"ConvertedCompYearly\"\n",
|
||||
"# answers = so_df[:-1][key].count()\n",
|
||||
"# print(answers, \"people answered re: \", key)\n",
|
||||
"df_no_na = so_df.dropna(subset=[key])\n",
|
||||
"indices = df_no_na[key].nlargest(15).index\n",
|
||||
"\n",
|
||||
"show_me_the_money( df_no_na.drop(indices), saveto=\"images/compensation-by-profession.png\" )\n",
|
||||
"# could also ask myself what portion of developers \n",
|
||||
"# earn less than the mean compensation\n",
|
||||
"# (what titles have high standard deviations in earnings)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cdf21b1c-1316-422f-ad14-48150f80366c",
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"source_hidden": true
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# key = \"DevType\"\n",
|
||||
"# prof = \"Developer, full-stack\"\n",
|
||||
"\n",
|
||||
"key = \"MainBranch\"\n",
|
||||
"prof = \"I am a developer by profession\"\n",
|
||||
"col = \"ConvertedCompYearly\"\n",
|
||||
"\n",
|
||||
"devs = df_no_na[df_no_na[key] == prof ] \n",
|
||||
"pd.set_option('display.float_format', '{:.2f}'.format)\n",
|
||||
"devs.describe()[col]\n",
|
||||
"\n",
|
||||
"# who the hell is making $1/yr \n",
|
||||
"# devs[devs[col] == 1.0]\n",
|
||||
"\n",
|
||||
"# who are the millionaires\n",
|
||||
"# devs[devs[col] > 1e6]\n",
|
||||
"\n",
|
||||
"# who make more than the mean\n",
|
||||
"# devs[devs[col] > 76230.84]\n",
|
||||
"\n",
|
||||
"# who make more than the median\n",
|
||||
"# devs[devs[col] > 63316.00]\n",
|
||||
"\n",
|
||||
"# the ancient ones\n",
|
||||
"so_df[so_df[\"YearsCodePro\"] == 'More than 50 years']\n",
|
||||
"# should drop the 18-24 year old who is either bullshitting or recalls a past life\n",
|
||||
"# 55-64 years old\n",
|
||||
"# 65 years or older"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|