Comply with PEP 8.

For consistent quotes and number of line breaks. At least for the code that is being used.
Added cells for CRISP_DM.
2025-04-25 01:31:45 -07:00 · 2025-04-25 01:13:45 -07:00
1 changed file with 140 additions and 87 deletions
--- a/stackoverflow-survey.ipynb
+++ b/stackoverflow-survey.ipynb
@@ -1,5 +1,22 @@
 {
 "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "14623ab1-dc15-4aa7-96c2-074ea4d0e33a",
+   "metadata": {},
+   "source": [
+    "# Project: Write a data science blog post\n",
+    "\n",
+    "## Business Understanding\n",
+    "\n",
+    "Salaries and wages are a common talking point in business, personal finance, and economics.\n",
+    "But what's the bigger picture beyond mean and median?\n",
+    "\n",
+    "1. How much can entry or junior level developers expect to be paid?\n",
+    "2. How much more do they earn with each year of experience?\n",
+    "3. At what point in a career do salaries or wages start to stagnate?"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -14,7 +31,17 @@
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# avoid burning my eyes @ night\n",
-    "plt.style.use(\"dark_background\")"
+    "plt.style.use('dark_background')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "56f4093b-450f-4529-831e-1f791e3e2c6a",
+   "metadata": {},
+   "source": [
+    "## Data Understanding and Exploration\n",
+    "\n",
+    "The survey asks participants to answer \"Apples\" to one question to check that they are paying attention. The published data set has already purged rows that failed the check."
   ]
  },
  {
@@ -24,14 +51,14 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "FILE = \"data/survey_results_public.csv\"\n",
+    "FILE = 'data/survey_results_public.csv'\n",
    "so_df = pd.read_csv(FILE)\n",
    "\n",
    "print(so_df.keys())\n",
    "so_df.describe()\n",
    "\n",
    "# check for people who aren't paying attention\n",
-    "count_not_apple =  (so_df[\"Check\"] != \"Apples\").sum()\n",
+    "count_not_apple =  (so_df['Check'] != 'Apples').sum()\n",
    "print(count_not_apple)\n",
    "print(so_df.shape)\n",
    "assert(count_not_apple == 0)\n",
@@ -47,7 +74,7 @@
   "source": [
    "# draw count plot of developers based on age\n",
    "\n",
-    "def visualize_devs(df, lang, key=\"Age\",):\n",
+    "def visualize_devs(df, lang, key='Age'):\n",
    "    plt.figure()\n",
    "    plt.xticks(rotation=45)\n",
    "    # from:\n",
@@ -57,19 +84,21 @@
    "              '45-54 years old', '55-64 years old',  \\\n",
    "              '65 years or older', 'Prefer not to say']\n",
    "    sb.countplot(x=key, data=df, order=order)\n",
-    "    title=\"Ages of %s Programmers\" % lang\n",
+    "    title='Ages of %s Programmers' % lang\n",
    "    plt.title(title)\n",
-    "    filename= \"images/%s-of-%s-programmers.png\" % (key, lang)\n",
+    "    filename= 'images/%s-of-%s-programmers.png' % (key, lang)\n",
    "    plt.savefig(filename, bbox_inches=\"tight\")\n",
    "\n",
+    "\n",
    "def get_lang_devs(df, lang):\n",
-    "    col = \"LanguageHaveWorkedWith\"\n",
+    "    col = 'LanguageHaveWorkedWith'\n",
    "    # will not work for single character languages (C, R)\n",
    "    # will mangle Java and JavaScript, Python and MicroPython\n",
    "    return df[ df[col].str.contains(lang, na=False) ] \n",
    "\n",
-    "def get_c_devs(df, lang=\"C\"):\n",
-    "    key = \"LanguageHaveWorkedWith\"\n",
+    "\n",
+    "def get_c_devs(df, lang='C'):\n",
+    "    key = 'LanguageHaveWorkedWith'\n",
    "    cdevs = []\n",
    "    for index, dev in df.iterrows():\n",
    "        try:\n",
@@ -83,6 +112,30 @@
    "    return pd.DataFrame(cdevs)"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11a1b9fb-db48-4749-8d77-4241a99d7bad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "visualize_devs( get_c_devs(so_df) , 'C')\n",
+    "\n",
+    "for lang in ['Cobol', 'Prolog', 'Ada', 'Python']:\n",
+    "    foo = get_lang_devs(so_df, lang)\n",
+    "    visualize_devs(foo, lang)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ab9ce039-8ed4-46d7-8eea-426c460d0a7b",
+   "metadata": {},
+   "source": [
+    "## Preparing the Data\n",
+    "\n",
+    "`__init__()` specifies which rows to omit and which to use, so the data for modeling doesn't look like a shotgun blast of rainbow colors."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -101,23 +154,23 @@
    "class Foo:\n",
    "    def __init__(self, dataset, language, jobs=None, \n",
    "                 n_rich_outliers=0, n_poor_outliers=0, \n",
-    "                 country=\"United States of America\"):\n",
+    "                 country='United States of America'):\n",
    "        self.devs   = None\n",
    "        self.canvas = None\n",
    "        self.language = language\n",
    "        self.country = country\n",
    "        # focus on people who have given ...\n",
-    "        key_x  = \"YearsCodePro\"\n",
-    "        key_y  = \"ConvertedCompYearly\"\n",
+    "        key_x  = 'YearsCodePro'\n",
+    "        key_y  = 'ConvertedCompYearly'\n",
    "        df   = dataset.dropna(subset=[key_x, key_y])\n",
    "        self.key_x = key_x\n",
    "        self.key_y = key_y\n",
    "    \n",
    "        qualifiers = {\n",
-    "            \"MainBranch\":\"I am a developer by profession\",\n",
+    "            'MainBranch': 'I am a developer by profession',\n",
    "       }\n",
    "        if country:\n",
-    "            qualifiers[\"Country\"] = country\n",
+    "            qualifiers['Country'] = country\n",
    "        for k in qualifiers:\n",
    "            df = df[df[k] == qualifiers[k] ] \n",
    "\n",
@@ -126,7 +179,7 @@
    "            df = df[df.isin(jobs).any(axis=1)]\n",
    "\n",
    "        devs = None\n",
-    "        if len(language) == 1 or language in [\"Python\", \"Java\"]:\n",
+    "        if len(language) == 1 or language in ['Python', 'Java']:\n",
    "            devs = get_c_devs(df, lang=language)\n",
    "        else:\n",
    "            devs = get_lang_devs(df, language)\n",
@@ -149,7 +202,7 @@
    "        self.devs = devs.drop(indices)\n",
    "        del devs, new_column\n",
    "    \n",
-    "    def visualize(self,  hue=\"Country\", \n",
+    "    def visualize(self,  hue='Country', \n",
    "                  palette=sb.color_palette() ):    \n",
    "        self.canvas = plt.figure()\n",
    "        key_x = self.key_x\n",
@@ -157,9 +210,9 @@
    "\n",
    "        sb.scatterplot(data=self.devs, x=key_x, y=key_y, hue=hue, palette=palette)\n",
    "        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
-    "        title = \"Annual Compensation of %s Programmers Over Years of Experience\" % self.language\\\n",
-    "                + \"\\nsample size=%i\" %  len (self.devs)\\\n",
-    "                + \"\\ncountry=%s\" % self.country\n",
+    "        title = 'Annual Compensation of %s Programmers Over Years of Experience' % self.language\\\n",
+    "                + '\\nsample size=%i' %  len (self.devs)\\\n",
+    "                + '\\ncountry=%s' % self.country\n",
    "        plt.title(title)\n",
    "\n",
    "    def run_regression(self, model=LinearRegression(), split=train_test_split, \n",
@@ -181,36 +234,47 @@
    "    \n",
    "        m = model.coef_[0][0]\n",
    "        b = model.intercept_[0]\n",
-    "        print(\"+----------------------+\")\n",
-    "        print(\"%s regression line for %s\" % (line_color, self.language))\n",
-    "        print(\"coefficient = %0.2f\" % m)\n",
+    "        print('+----------------------+')\n",
+    "        print('%s regression line for %s' % (line_color, self.language))\n",
+    "        print('coefficient = %0.2f' % m)\n",
    "        print('intercept = %0.2f' % b)\n",
    "        rmse = root_mean_squared_error(y_test, y_pred)\n",
-    "        print(\"rmse = %0.2f\" % rmse)\n",
+    "        print('rmse = %0.2f' % rmse)\n",
    "        r2   = r2_score(y_test, y_pred)\n",
-    "        print(\"r2 score = %0.2f\" % r2)\n",
-    "        print(\"sample predictions:\")\n",
+    "        print('r2 score = %0.2f' % r2)\n",
+    "        print('sample predictions:')\n",
    "        print(y_pred[3:6])\n",
-    "        print(\"+----------------------+\")\n",
+    "        print('+----------------------+')\n",
    "\n",
    "        plt.figure(self.canvas)\n",
    "        plt.plot(X_test, y_pred, color=line_color, label='Regression Line')\n",
-    "        plt.axhline(y=b, color=\"purple\", linestyle='--', \n",
-    "                    label=\"b=%0.2f\" % b, zorder=-1 )\n",
+    "        plt.axhline(y=b, color='purple', linestyle='--', \n",
+    "                    label='b=%0.2f' % b, zorder=-1 )\n",
    "        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
    "        del y_pred, model\n",
    "\n",
    "\n",
-    "    def export_image(self, base_filename = \"images/programmers-%s-%s.png\"):\n",
+    "    def export_image(self, base_filename = 'images/programmers-%s-%s.png'):\n",
    "        plt.figure(self.canvas)\n",
    "        filename = base_filename % (self.language, self.country)\n",
    "        plt.savefig(filename.replace(' ', '-'), bbox_inches='tight')\n",
    "\n",
+    "\n",
    "# the higher a is, the steeper the line gets\n",
    "def log_base_a(x, a=1.07):\n",
    "    return np.log10(x)/np.log(a)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "9a1df75a-4bcf-4072-9bab-d15b4a88c691",
+   "metadata": {},
+   "source": [
+    "## Data Modeling\n",
+    "\n",
+    "Generate models for American Python programmers working as data scientists/analysts/engineers."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -220,15 +284,15 @@
   "source": [
    "\n",
    "# expected python jobs\n",
-    "pyjobs = [\"Data scientist or machine learning specialist\",\n",
-    "          \"Data or business analyst\",\n",
-    "          \"Data engineer\",\n",
+    "pyjobs = ['Data scientist or machine learning specialist',\n",
+    "          'Data or business analyst',\n",
+    "          'Data engineer',\n",
    "#        \"DevOps specialist\",\n",
    "#        \"Developer, QA or test\"\n",
    "]\n",
    "\n",
-    "python = Foo(so_df, \"Python\", jobs=pyjobs, n_rich_outliers=12, n_poor_outliers=2)\n",
-    "python.visualize(hue=\"DevType\", palette=[\"#dbdb32\", \"#34bf65\", \"#ac70e0\"])\n",
+    "python = Foo(so_df, 'Python', jobs=pyjobs, n_rich_outliers=12, n_poor_outliers=2)\n",
+    "python.visualize(hue='DevType', palette=['#dbdb32', '#34bf65', '#ac70e0'])\n",
    "python.run_regression()\n",
    "python.run_regression( x_transform=log_base_a, change_base=1.20, \n",
    "                       x_shift=0, y_shift=-1.5e4, line_color='cyan', random=888)\n",
@@ -237,24 +301,34 @@
  },
  {
   "cell_type": "markdown",
-   "id": "f4e3516e-ffe3-4768-ae92-e5cb0be503f8",
+   "id": "b6e42288-cc7b-4d1c-827f-137d4817dd50",
   "metadata": {},
   "source": [
-    "## Business Understanding\n",
+    "## Evaluation (Python)\n",
    "\n",
-    "* For Python programmers specialized in data e.g data scientists, engineers, or analysts, a linear model moderately fits the relationship between income and years of experience. For data engineers and scientists, there is a possible divergence within the career path at 10 years of experience. \n",
+    "The two models tell two different stories for data scientists, analysts, and engineers. For either model, roughly 30% of the variability of the data is explained by years of experience. The \"cyan\" model performs slightly better than the default \"red\" model. The two models have roughly the same RMSE of around $40,000, meaning a prediction is typically off by about that amount for any given x.\n",
    "\n",
-    "* The typical starting salary within the field of data science is around 100,000 to 120,000 dollars.\n",
+    "### \"red\" / default model\n",
    "\n",
-    "* The income of a data professional can either increase by 2,000 per year (red) or 10,000 per year (cyan).\n",
+    "1. Entry level data scientists/analysts/engineers earn $123,479.15 USD/year.\n",
+    "2. They get a raise of $2,573.62 for each year of experience.\n",
+    "3. This rate of increase in income is steady for multiple decades (>20 years of experience).\n",
    "\n",
-    "* For both models, the r2 score ranges from poor to moderate = 0.20 - 0.37 depending on the random number. The variability not explained by the model could be the result of different fields that employ dats scientists/analysts/engineers such as finance, bio/med, or advertising.\n",
+    "### \"cyan\" model\n",
    "\n",
-    "* For any given point in the career, the model is off by 39,000 or 42,000 dollars.\n",
+    "1. Entry level positions yield $82,957.69.\n",
+    "2. There is a raise of $10,378.53 for each year of experience until 10.\n",
+    "3. At 10 years, a cohort (x < 10, y > $200,000) has experienced an unchanged rate of increase while the other experiences a reduced rate of increase similar to the slope (coefficient) from the \"red\" model.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fd9fa93d-55e5-4588-af14-69252bd69447",
+   "metadata": {},
+   "source": [
+    "## Data Modeling and Evaluation (for C)\n",
    "\n",
-    "Generally, for low uncomes poorly explained by the model, the cause could be getting a new job after a year of unemployment, internships, or part-time positions. For high incomes poorly explained by the model, the cause could be professionals at large companies who had recently added a programming language to their skill set. Other causes could be company size or working hours.\n",
-    "\n",
-    "(Business understanding was done for C first. Questions are there.)"
+    "Generate models for American C programmers working as embedded systems developers, hardware engineers, or graphics/game programmers."
   ]
  },
  {
@@ -266,61 +340,54 @@
   "source": [
    "# expected C jobs\n",
    "cjobs = [\n",
-    "    \"Developer, embedded applications or devices\", \n",
-    "    \"Developer, game or graphics\",\n",
-    "    \"Hardware Engineer\" ,\n",
+    "    'Developer, embedded applications or devices', \n",
+    "    'Developer, game or graphics',\n",
+    "    'Hardware Engineer',\n",
    " #        \"Project manager\", \n",
    " #        \"Product manager\"\n",
    "]\n",
-    "c = Foo(so_df, \"C\", jobs=cjobs, n_rich_outliers=30, n_poor_outliers=2)\n",
-    "c.visualize(hue=\"DevType\", palette=[\"#57e6da\",\"#d9e352\",\"#cc622d\"] ) \n",
+    "c = Foo(so_df, 'C', jobs=cjobs, n_rich_outliers=30, n_poor_outliers=2)\n",
+    "c.visualize(hue='DevType', palette=['#57e6da','#d9e352','#cc622d'] ) \n",
    "c.run_regression()\n",
    "c.run_regression(x_transform=log_base_a, change_base=1.3, \n",
-    "                 x_shift=2, y_shift=-5000, line_color=\"magenta\", random=555)\n",
+    "                 x_shift=2, y_shift=-5000, line_color='magenta', random=555)\n",
    "c.export_image()"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "89d86a1e-dc65-48e4-adcf-bb10188fd0b7",
+   "id": "b7026c56-3049-4e60-bbc6-ee548ff58297",
   "metadata": {},
   "source": [
-    "## Business Understanding\n",
+    "## Evaluation for C\n",
    "\n",
-    "1.  For C programmers, specifically embedded systems, graphics, and hardware engineers, a linear model fits the relationship of income and years of experience.\n",
+    "The magenta model for C is good but not great, with an r2 score of 0.57. `rmse = 21198.61`, meaning a prediction is typically off by about $21,198.61 for a given x value.\n",
    "\n",
-    "2. A coefficient = 11973.469 indicates that for each year of experience, a C programmer typically earns an additional $12,000 per year.\n",
-    "\n",
-    "3. Because the graph looks like a spray of water, after 10 years of experience, the salaries for C programming professionals strongly vary.\n",
-    "\n",
-    "4. A junior C programmer, at 2 years of experience, typically earns $54,776.266 per year.\n",
-    "\n",
-    "5. An r2 score =  0.571 indicates a bit over half of the variability in data is explained by the independent variable. This however is only for incomes below 200,000 dollars. Some participants with 5 years of professional experience were reporting incomes at or around $200,000. These were considered unusual outliers. Among the game developers, they may have independently released a game.\n",
-    "\n",
-    "rmse =  21198.612 indicates the model is off by around 21,000 dollars for a given point in a career.\n",
-    "\n",
-    "### Questions that can be answered\n",
-    "\n",
-    "* Is there a linear relationship between income and years of experience.\n",
-    "* Is there a point in a career where raises stop ocurring?\n",
-    "* What is the typical salary of a entry-level or junior C programmer?\n",
-    "* How much more do C programmers earn for each year of experience?\n",
-    "* How much of the variability is explained by the model and what factors are not considered?\n"
+    "1. Early career C programmers earn about $54,776.27 per year.\n",
+    "2. They get a raise of $11,973.47 per year of experience.\n",
+    "3. After 10 years, the rate of increase is lower (possibly $1,427.58 as depicted in the red regression line).\n"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "928f421c-1f2b-4be1-9ce2-8f3593a9a823",
+   "id": "0f21b9fa-7de0-4c39-86ca-d8c4b03cc3c9",
   "metadata": {},
   "source": [
-    "Below cells generate extra or unused graphs."
+    "## (More) Data Understanding and Exploration\n",
+    "\n",
+    "The cells below generate extra or unused graphs.\n",
+    "I put this here because I want to restart the kernel and re-run cells up to this point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8357f841-23a0-4bfa-bf09-860bd3e014b8",
-   "metadata": {},
+   "metadata": {
+    "jupyter": {
+     "source_hidden": true
+    }
+   },
   "outputs": [],
   "source": [
    "\n",
@@ -334,20 +401,6 @@
    "js.export_image()"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "11a1b9fb-db48-4749-8d77-4241a99d7bad",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "visualize_devs( get_c_devs(so_df) , \"C\")\n",
-    "\n",
-    "for lang in [\"Cobol\", \"Prolog\", \"Ada\", \"Python\"]:\n",
-    "    foo = get_lang_devs(so_df, lang)\n",
-    "    visualize_devs(foo, lang)"
-   ]
-  },
  {
   "cell_type": "code",
   "execution_count": null,
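
The comments in `get_lang_devs` note that substring matching with `str.contains` fails for single-character languages (C, R) and confuses Java with JavaScript. A minimal sketch of an exact-match alternative follows. It assumes that `LanguageHaveWorkedWith` stores a semicolon-separated list of language names, which should be checked against the CSV. The helper `has_language` is hypothetical and is not part of the notebook; the body of `get_c_devs` is elided in this diff, so it may already do something similar.

```python
def has_language(cell, lang, sep=';'):
    """Return True if `lang` is exactly one of the entries in a delimited cell."""
    # Missing answers arrive as NaN (a float) when read with pandas,
    # so any non-string cell is treated as "no match".
    if not isinstance(cell, str):
        return False
    return lang in (entry.strip() for entry in cell.split(sep))


# Hypothetical sample cells, not taken from the survey data.
samples = ['C;Python', 'C++;C#;JavaScript', 'Java;Kotlin', float('nan')]

c_matches = [has_language(s, 'C') for s in samples]
java_matches = [has_language(s, 'Java') for s in samples]

print(c_matches)
print(java_matches)
```

With pandas, the same helper could filter rows via `df[df[col].apply(lambda cell: has_language(cell, 'C'))]`, which would avoid both the `iterrows()` loop and the special case for one-letter language names.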
Author	SHA1	Message	Date
scuti	5c3225d7ea	Comply with PEP 8. For consistent quotes and number of line breaks. At least for the code that is being used.	2025-04-25 01:31:45 -07:00
scuti	fbdd5f3f18	Added cells for CRISP_DM. Instructor said it was a nonlinear process, but the grader wants it to be linear.	2025-04-25 01:13:45 -07:00
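
A note on `log_base_a` in the diff above: it divides `np.log10(x)` by `np.log(a)`, which mixes a base-10 logarithm with a natural logarithm. The change-of-base identity is log_a(x) = ln(x) / ln(a), so the mixed form returns log_a(x) multiplied by the constant log10(e), roughly 0.434. The sketch below uses only the standard `math` module to show the difference; `log_base` and `log_mixed` are hypothetical names for illustration, not functions from the notebook.

```python
import math


def log_base(x, a):
    """Logarithm of x in base a, using the change-of-base identity."""
    return math.log(x) / math.log(a)


def log_mixed(x, a):
    """Mirrors the form in the diff: base-10 numerator, natural-log denominator."""
    return math.log10(x) / math.log(a)


x, a = 8.0, 2.0
exact = log_base(x, a)    # log base 2 of 8, approximately 3
mixed = log_mixed(x, a)   # the same value scaled by log10(e)
ratio = mixed / exact     # constant factor, independent of x and a

print(exact, mixed, ratio)
```

Because the factor is constant, a linear regression on the transformed x should give the same predictions and r2 score either way, but the printed coefficient would be scaled by that constant. If the coefficient is meant to be read in base-`a` units, `np.log(x) / np.log(a)` would be the consistent NumPy form.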