Compare commits: 4b46a735cf...master (15 commits)

Commits: 2fd699497f, 08ab9f126c, d5443bd1fb, 2af2414219, f49283a7cc, 1f7fe33915, 3c3e804251, 67d1441303, b18a5cb42a, 311db886a4, 7b34548a2d, fbdd5f3f18, 0a0281ab4e, 0f248e6b9a, 7d82e4c588

README.md (new file, 101 lines)
@@ -0,0 +1,101 @@
<!--Your Github repository must have the following contents:

A README.md file that communicates the libraries used, the motivation for the project, the files in the repository with a small description of each, a summary of the results of the analysis, and necessary acknowledgments.

Your code in a Jupyter notebook, with appropriate comments, analysis, and documentation.

You may also provide any other documentation you find necessary.-->

# stacksurvey

**stacksurvey** is an exploration and analysis of data from StackOverflow's developer survey of 2024.

[https://survey.stackoverflow.co/2024/](https://survey.stackoverflow.co/2024/)

The motivation for the project is satisfying a class assignment. Eventually, an interesting (enough) topic was discovered in the data set:

>What is the annual compensation (y) over years of experience (x) for developers in a specific country who use a given programming language?

## Requirements

numpy, pandas, scikit-learn (imported as `sklearn`), matplotlib, seaborn

## Summary of Analysis

The models generated by the notebook become less reliable for years of experience greater than 10 or annual incomes greater than $200,000.

Each chart comes with two regression lines. Red is the default regression line that has not been tuned. The other is an attempt to better fit the data by either transforming or shifting x.

The transformation is typically

y = m * log(x) + b

where the base of the log is a parameter. Each model had a different change of base applied to the log function.
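The transform-then-fit idea above can be sketched in a few lines. This is a minimal illustration on toy data, not the notebook's actual pipeline (which uses sklearn's `LinearRegression` and `train_test_split`); `np.polyfit` and the toy values are assumptions chosen for brevity:

```python
import numpy as np

def fit_log_linear(x, y, base=1.07):
    # transform x with log of the given base, then fit y = m * log_base(x) + b
    x_log = np.log(x) / np.log(base)    # change of base: log_a(x) = ln(x) / ln(a)
    m, b = np.polyfit(x_log, y, deg=1)  # ordinary least squares, degree 1
    return m, b

# toy data following y = 3 * log_2(x) + 10 exactly
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 3 * np.log2(x) + 10
m, b = fit_log_linear(x, y, base=2)
# m ≈ 3, b ≈ 10
```

The fitted slope and intercept recover the generating parameters exactly here because the toy data is noise-free; on the survey data the fit quality is what the r2 score and rmse report.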
### C



+----------------------+
red regression line for C
coefficient = 1427.58
intercept = 103659.82
rmse = 26971.44
r2 score = 0.06
sample predictions:
[[125073.46117519]
[107942.54574181]
[109370.12202793]]
+----------------------+

+----------------------+
magenta regression line for C
coefficient = 11973.47
intercept = 54776.27
rmse = 21198.61
r2 score = 0.57
sample predictions:
[[132396.26294684]
[119937.35465744]
[ 64985.1549115 ]]
+----------------------+

For C programmers, a linear model fits well but not great, with an r2 score of 0.57. Junior-level positions earn roughly $54,776, and income increases by about $11,973 with each year of experience.
### Python



+----------------------+
red regression line for Python
coefficient = 2573.62
intercept = 123479.15
rmse = 39759.45
r2 score = 0.34
sample predictions:
[[126052.77118246]
[174951.60602361]
[187819.7204555 ]]
+----------------------+

+----------------------+
cyan regression line for Python
coefficient = 10378.53
intercept = 82957.69
rmse = 42374.26
r2 score = 0.38
sample predictions:
[[139882.01866593]
[117229.55243376]
[137277.30441955]]
+----------------------+

For data scientists, analysts, or engineers, a linear model is a moderate fit at best, with r2 scores around 0.34-0.38. There appears to be a divergence at the 10-year mark in their careers. This may be the result of their field (advertising, finance, bio/medical, and so on).

Entry or junior level professionals generally have an income of $82,957 (cyan) or $123,479 (red). Their annual income increases by $10,378 or $2,573 each year, respectively.
## Acknowledgements

"Udacity AI" (ChatGPT), for the idea of transforming the x values to adapt a linear regression into a logarithmic regression.
BIN images/programmers-C-United-States-of-America.png (new file, 82 KiB; binary file not shown)
BIN images/programmers-Python-United-States-of-America.png (new file, 92 KiB; binary file not shown)
@@ -1,5 +1,22 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "14623ab1-dc15-4aa7-96c2-074ea4d0e33a",
"metadata": {},
"source": [
"# Project: Write a data science blog post\n",
"\n",
"## Business Understanding\n",
"\n",
"Salary or wages are a common talking point from business, personal finance, and economics.\n",
"But what's the bigger picture beyond mean and median?\n",
"\n",
"1. How much can entry or junior level developers expect to be paid?\n",
"2. How much more do they earn with each year of experience?\n",
"3. At what point in a career do salaries or wages start to stagnate?"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -14,7 +31,17 @@
"import matplotlib.pyplot as plt\n",
"\n",
"# avoid burning my eyes @ night\n",
"plt.style.use(\"dark_background\")"
"plt.style.use('dark_background')"
]
},
{
"cell_type": "markdown",
"id": "56f4093b-450f-4529-831e-1f791e3e2c6a",
"metadata": {},
"source": [
"## Data Understanding and Exploration\n",
"\n",
"The survey asks participants to answer \"Apples\" to one question in order to check whether they're paying attention. The published data set has already purged rows that failed the check."
]
},
{
@@ -24,14 +51,14 @@
"metadata": {},
"outputs": [],
"source": [
"FILE = \"data/survey_results_public.csv\"\n",
"FILE = 'data/survey_results_public.csv'\n",
"so_df = pd.read_csv(FILE)\n",
"\n",
"print(so_df.keys())\n",
"so_df.describe()\n",
"\n",
"# check for people who aren't paying attention\n",
"count_not_apple = (so_df[\"Check\"] != \"Apples\").sum()\n",
"count_not_apple = (so_df['Check'] != 'Apples').sum()\n",
"print(count_not_apple)\n",
"print(so_df.shape)\n",
"assert(count_not_apple == 0)\n",
@@ -47,7 +74,17 @@
"source": [
"# draw count plot of developers based on age\n",
"\n",
"def visualize_devs(df, lang, key=\"Age\",):\n",
"def visualize_devs(df, title, key='Age'):\n",
"    '''\n",
"    Draws count plot of developers based on attributes.\n",
"\n",
"    inputs:\n",
"        df: a DataFrame, the subset of the data set.\n",
"        title: string, title of the chart.\n",
"        key: string, the attribute to count (age).\n",
"    outputs:\n",
"        no return values, will draw and save a graphic.\n",
"    '''\n",
"    plt.figure()\n",
"    plt.xticks(rotation=45)\n",
"    # from:\n",
@@ -57,19 +94,44 @@
"             '45-54 years old', '55-64 years old', \\\n",
"             '65 years or older', 'Prefer not to say']\n",
"    sb.countplot(x=key, data=df, order=order)\n",
"    title=\"Ages of %s Programmers\" % lang\n",
"    plt.title(title)\n",
"    filename= \"images/%s-of-%s-programmers.png\" % (key, lang)\n",
"    plt.savefig(filename, bbox_inches=\"tight\")\n",
"    filename= 'images/%s.png' % title.replace(\" \", \"-\")\n",
"    plt.savefig(filename, bbox_inches='tight')\n",
"\n",
"\n",
"def get_lang_devs(df, lang):\n",
"    col = \"LanguageHaveWorkedWith\"\n",
"    '''\n",
"    Returns a DataFrame, subset of the data set, of developers that have\n",
"    worked with a specified programming language.\n",
"\n",
"    inputs:\n",
"        df: a DataFrame, can be the entire published data set.\n",
"        lang: a string, the programming language.\n",
"    outputs:\n",
"        a DataFrame of developers that have worked with `lang` programming \n",
"        language.\n",
"    '''\n",
"    col = 'LanguageHaveWorkedWith'\n",
"    # will not work for single character languages (C, R)\n",
"    # will mangle Java and JavaScript, Python and MicroPython\n",
"    return df[ df[col].str.contains(lang, na=False) ] \n",
"\n",
"def get_c_devs(df, lang=\"C\"):\n",
"    key = \"LanguageHaveWorkedWith\"\n",
"\n",
"def get_c_devs(df, lang='C'):\n",
"    '''\n",
"    Returns a DataFrame, subset of the data set, of developers that have\n",
"    worked with a specified programming language.\n",
"    Similar to get_lang_devs() but adapted for languages named by a single\n",
"    letter, or names like 'Java' which is contained in 'JavaScript'.\n",
"\n",
"    inputs:\n",
"        df: a DataFrame, can be the entire published data set.\n",
"        lang: a string, the programming language.\n",
"    outputs:\n",
"        a DataFrame of developers that have worked with `lang` programming \n",
"        language.\n",
"    '''\n",
"    key = 'LanguageHaveWorkedWith'\n",
"    cdevs = []\n",
"    for index, dev in df.iterrows():\n",
"        try:\n",
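The comments in `get_lang_devs()` note that a plain `str.contains` check mangles names like Java/JavaScript and fails for single-letter languages. Since the survey column is `;`-separated, matching the exact token between separators avoids both problems. A sketch on a hypothetical mini-column (the `has_lang` helper and toy values are not from the notebook):

```python
import re
import pandas as pd

# toy column standing in for 'LanguageHaveWorkedWith' (hypothetical values)
langs = pd.Series(["C;Python", "JavaScript;HTML/CSS", "Java;Kotlin", None])

def has_lang(series, lang):
    # match the exact language between ';' separators, so 'Java' does not
    # match 'JavaScript' and 'C' does not match 'C++' or 'C#'
    pattern = r"(?:^|;)%s(?:;|$)" % re.escape(lang)
    return series.str.contains(pattern, na=False, regex=True)

mask = has_lang(langs, "Java")
# only the "Java;Kotlin" row matches
```

`na=False` mirrors the notebook's handling of missing values in that column.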
@@ -83,6 +145,47 @@
"    return pd.DataFrame(cdevs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11a1b9fb-db48-4749-8d77-4241a99d7bad",
"metadata": {},
"outputs": [],
"source": [
"visualize_devs( get_c_devs(so_df) , 'Ages of C Programmers')\n",
"visualize_devs( get_c_devs(so_df, lang='Python') , 'Ages of Python Programmers')\n",
"\n",
"for lang in ['Cobol', 'Prolog', 'Ada']:\n",
"    title = 'Ages of %s Programmers' % lang\n",
"    foo = get_lang_devs(so_df, lang)\n",
"    visualize_devs(foo, title)"
]
},
{
"cell_type": "markdown",
"id": "ab9ce039-8ed4-46d7-8eea-426c460d0a7b",
"metadata": {},
"source": [
"## Preparing the Data\n",
"\n",
"`__init__()` specifies which rows to omit and which to use, so the data for modeling doesn't look like a shotgun blast of rainbow colors.\n",
"\n",
"### NaNs are dropped\n",
"\n",
"No values are assumed in the place of NaN for keys 'YearsCodePro' and 'ConvertedCompYearly'.\n",
"\n",
"Rows with NaN are dropped for developers who:\n",
"* did not specify their years of professional experience\n",
"* did not disclose an annual compensation.\n",
"\n",
"More developers declined to specify their income than years of experience. Between total and included rows, the distributions of years of experience are similar. This suggests that the analysis is not significantly altered by missing data.\n",
"\n",
"See the charts:\n",
"\n",
"* Python Developers Total vs Included\n",
"* C Developers Total vs Included"
]
},
{
"cell_type": "code",
"execution_count": null,
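The NaN handling described in the "Preparing the Data" cell (keep the excluded rows around for later comparison, drop them from the modeling set) can be sketched with a hypothetical mini-frame; the toy values are assumptions:

```python
import numpy as np
import pandas as pd

# toy frame: one dev missing experience, one missing compensation (hypothetical)
devs = pd.DataFrame({
    "YearsCodePro":        [5.0, np.nan, 10.0, 3.0],
    "ConvertedCompYearly": [90e3, 120e3, np.nan, 75e3],
})

df_no_x = devs[devs["YearsCodePro"].isnull()]        # declined to give experience
df_no_y = devs[devs["ConvertedCompYearly"].isnull()]  # declined to give income
kept = devs.dropna(subset=["YearsCodePro", "ConvertedCompYearly"])
# kept has 2 rows; one row fell in each excluded bucket
```

Holding on to `df_no_x` and `df_no_y` is what later lets `probe_excluded_rows()` compare the total population against the rows included in the analysis.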
@@ -99,25 +202,37 @@
"\n",
"# still haven't come up with a name\n",
"class Foo:\n",
"    def __init__(self, dataset, language, jobs=None, \n",
"    def __init__(self, df, language, jobs=None, \n",
"                 n_rich_outliers=0, n_poor_outliers=0, \n",
"                 country=\"United States of America\"):\n",
"                 country='United States of America'):\n",
"        '''\n",
"        inputs:\n",
"            df: a DataFrame, can be the full data set.\n",
"            language: string, the programming language \n",
"                a developer has worked with.\n",
"            jobs: list of strings, job positions \n",
"                - typically domains where the language is dominant.\n",
"            n_rich_outliers: integer, removes samples from the \n",
"                upper limit of the y-axis.\n",
"            n_poor_outliers: integer, removes samples from the \n",
"                lower limit of the y-axis.\n",
"            country: string, specifies the country of origin.\n",
"        '''\n",
"        self.devs = None\n",
"        self.canvas = None\n",
"        self.language = language\n",
"        self.country = country\n",
"        # focus on people who have given ...\n",
"        key_x = \"YearsCodePro\"\n",
"        key_y = \"ConvertedCompYearly\"\n",
"        df = dataset.dropna(subset=[key_x, key_y])\n",
"        key_x = 'YearsCodePro'\n",
"        key_y = 'ConvertedCompYearly'\n",
"        self.key_x = key_x\n",
"        self.key_y = key_y\n",
"        \n",
"\n",
"        qualifiers = {\n",
"            \"MainBranch\":\"I am a developer by profession\",\n",
"            'MainBranch': 'I am a developer by profession',\n",
"        }\n",
"        if country:\n",
"            qualifiers[\"Country\"] = country\n",
"            qualifiers['Country'] = country\n",
"        for k in qualifiers:\n",
"            df = df[df[k] == qualifiers[k] ] \n",
"\n",
@@ -126,11 +241,15 @@
"            df = df[df.isin(jobs).any(axis=1)]\n",
"\n",
"        devs = None\n",
"        if len(language) == 1 or language in [\"Python\", \"Java\"]:\n",
"        if len(language) == 1 or language in ['Python', 'Java']:\n",
"            devs = get_c_devs(df, lang=language)\n",
"        else:\n",
"            devs = get_lang_devs(df, language)\n",
"        \n",
"\n",
"        self.df_no_x = devs[devs[key_x].isnull()]\n",
"        self.df_no_y = devs[devs[key_y].isnull()]\n",
"        devs = devs.dropna(subset=[key_x, key_y])\n",
"\n",
"        replacement_dict = {\n",
"            'Less than 1 year': '0.5',\n",
"            'More than 50 years': '51',\n",
@@ -138,8 +257,12 @@
"\n",
"        # https://stackoverflow.com/questions/47443134/update-column-in-pandas-dataframe-without-warning\n",
"        pd.options.mode.chained_assignment = None  # default='warn'\n",
"        \n",
"        new_column = devs[key_x].replace(replacement_dict)\n",
"        devs[key_x] = pd.to_numeric(new_column, errors='coerce')\n",
"        devs[key_x] = pd.to_numeric(new_column, errors='raise')\n",
"\n",
"        new_column = self.df_no_y[key_x].replace(replacement_dict)\n",
"        self.df_no_y[key_x] = pd.to_numeric(new_column, errors='raise')\n",
"        pd.options.mode.chained_assignment = 'warn'  # default='warn'\n",
"        # print( devs[key_x].unique() )\n",
"        \n",
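The `replacement_dict` step above maps the survey's two non-numeric answers for 'YearsCodePro' onto numbers before `pd.to_numeric`; with `errors='raise'`, any leftover non-numeric answer would surface immediately rather than silently becoming NaN. A standalone sketch with toy values:

```python
import pandas as pd

# map the survey's two non-numeric answers onto numbers (values from the notebook)
replacement_dict = {
    "Less than 1 year": "0.5",
    "More than 50 years": "51",
}

# toy column standing in for 'YearsCodePro' (hypothetical values)
col = pd.Series(["Less than 1 year", "7", "More than 50 years"])
numeric = pd.to_numeric(col.replace(replacement_dict), errors="raise")
# → 0.5, 7.0, 51.0
```

Switching from `errors='coerce'` to `errors='raise'` (as the diff does) trades silent data loss for an explicit failure when an unexpected answer appears.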
@@ -149,67 +272,234 @@
"        self.devs = devs.drop(indices)\n",
"        del devs, new_column\n",
"        \n",
"    def visualize(self, hue=\"Country\", \n",
"                  palette=sb.color_palette() ): \n",
"    def visualize(self, hue='Country', \n",
"                  palette=sb.color_palette() ):\n",
"        '''\n",
"        Draw scatter plot of samples included in self.devs.\n",
"\n",
"        inputs:\n",
"            hue: string, colorize dots by a given key.\n",
"            palette: list of strings (color codes)\n",
"                or string (matplotlib predefined palettes),\n",
"                specifies the colors to use when coloring dots.\n",
"        '''\n",
"        self.canvas = plt.figure()\n",
"        key_x = self.key_x\n",
"        key_y = self.key_y\n",
"\n",
"        sb.scatterplot(data=self.devs, x=key_x, y=key_y, hue=hue, palette=palette)\n",
"        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
"        title = \"Annual Salary of %s Developers Over Years of Experience\" % self.language\\\n",
"            + \"\\nsample size=%i\" % len (self.devs)\\\n",
"            + \"\\ncountry=%s\" % self.country\n",
"        title = 'Annual Compensation of %s Programmers Over Years of Experience' % self.language\\\n",
"            + '\\nsample size=%i' % len (self.devs)\\\n",
"            + '\\ncountry=%s' % self.country\n",
"        plt.title(title)\n",
"\n",
"    def run_regression(self, model=LinearRegression(), split=train_test_split, \n",
"                       x_transform=None, change_base=None, x_shift=0, y_shift=0,\n",
"                       line_color='red', random=333):\n",
"        df = self.devs # .sort_values(by = self.key2)\n",
"        X = df[self.key_x].to_frame()\n",
"        if x_transform is not None and change_base is not None:\n",
"            X = x_transform (X, a=change_base ) \n",
"        elif x_transform is not None:\n",
"            X = x_transform (X) \n",
"        X = X + x_shift\n",
"        y = df[self.key_y].to_frame() + y_shift\n",
"        \n",
"        X_train, X_test, y_train, y_test = split(X, y, test_size=0.2, random_state=random)\n",
"    def run_regression(self, x_transform=None, change_base=1.07, \n",
"                       x_shift=0, y_shift=0,\n",
"                       random=333, risky=0,\n",
"                       color='red', name='Regression Line' ):\n",
"        '''\n",
"        Runs linear regression and draws a straight line.\n",
"\n",
"        inputs:\n",
"            x_transform: function, function to tune the independent variable.\n",
"            change_base: float or integer, specifies base \n",
"                for logarithmic function, not used if x_transform is None.\n",
"            x_shift: integer, for tuning, shifts the position \n",
"                of the line on the x-axis.\n",
"            y_shift: integer, for tuning, shifts the position \n",
"                of the line on the y-axis.\n",
"            random: integer, random seed for train_test_split; \n",
"                change to test generalization.\n",
"            risky: integer ranging from 0 to 2,\n",
"                0 = does nothing (default),\n",
"                1 = sorts the independent variable,\n",
"                2 = sorts the dependent variable,\n",
"                performs unrecommended operation to sort data,\n",
"                risking the model training on the order of values.\n",
"                May draw nice lines that generalize across random states.\n",
"            color: string, color of the regression line.\n",
"            name: string, label of regression line on the legend.\n",
"        '''\n",
"        df = self.devs # .sort_values(by = self.key2)\n",
"        X = df[[self.key_x]]\n",
"        y = df[[self.key_y]]\n",
"\n",
"        # not recommended\n",
"        # carries risk of model training on sorted order\n",
"        # however it appears to be generalizing well\n",
"        # across random state and shuffle (=True, default)\n",
"        style = '-'\n",
"        if risky > 0:\n",
"            X = X.sort_values(by=self.key_x)\n",
"            style = '--'\n",
"        if risky > 1:\n",
"            y = y.sort_values(by=self.key_y)\n",
"        if x_transform is not None:\n",
"            X = x_transform (X, a=change_base ) \n",
"\n",
"        X = X + x_shift\n",
"        y = y + y_shift\n",
"        \n",
"        X_train, X_test, y_train, y_test = train_test_split(\n",
"            X, y, \n",
"            test_size=0.2, \n",
"            random_state=random)\n",
"\n",
"        model = LinearRegression()\n",
"        model.fit(X_train, y_train)\n",
"        y_pred = model.predict(X_test)\n",
"\n",
"        print(\"+----------------------+\")\n",
"        print(\"%s regression line for %s\" % (line_color, self.language))\n",
"        print(\"coefficient =\", model.coef_)\n",
"        print('intercept=', model.intercept_)\n",
"        rmse = root_mean_squared_error(y_test, y_pred)\n",
"        print(\"rmse = \", rmse)\n",
"        r2 = r2_score(y_test, y_pred)\n",
"        print(\"r2 score = \", r2)\n",
"        print(\"sample predictions:\")\n",
"        print(y_pred[3:6])\n",
"        print(\"+----------------------+\")\n",
"        \n",
"        m = model.coef_[0][0]\n",
"        b = model.intercept_[0]\n",
"        label = '%s regression line for %s' % (color, self.language)\n",
"        show_model_stats(m, b, y_test, y_pred, label)\n",
"\n",
"        plt.figure(self.canvas)\n",
"        plt.plot(X_test, y_pred, color=line_color, label='Regression Line')\n",
"        plt.axhline(y=b, color=\"purple\", linestyle='--', \n",
"                    label=\"b=%0.2f\" % b, zorder=-1 )\n",
"        plt.plot(X_test, y_pred, color=color, label=name, linestyle=style)\n",
"        plt.axhline(y=b, color='purple', linestyle='--', \n",
"                    label='b=%0.2f' % b, zorder=-1 )\n",
"        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
"        del y_pred, model\n",
"        del y_pred, model, X, y\n",
"\n",
"    def run_log_regression(self, color='pink', nodraw=True):\n",
"        '''\n",
"        Runs logarithmic regression and draws a line that contours \n",
"        at the point of diminishing returns.\n",
"\n",
"    def export_image(self, base_filename = \"images/programmers-%s-%s.png\"):\n",
"        Logarithmic regression provides a better fit for the data;\n",
"        however, it is not part of the course.\n",
"\n",
"        Can illustrate an interesting relationship between the\n",
"        \"default\" linear model and a tuned linear model.\n",
"\n",
"        inputs:\n",
"            color: color of the regression line.\n",
"            nodraw: whether or not to draw the line.\n",
"        '''\n",
"        df = self.devs\n",
"        X = df[[self.key_x]] #.sort_values(by=self.key_x)\n",
"        y = df[[self.key_y]] #.sort_values(by=self.key_y)\n",
"\n",
"        X_train, X_test, y_train, y_test = train_test_split(\n",
"            X, y, \n",
"            test_size=0.2, \n",
"            random_state=777)\n",
"        \n",
"        X_train_log = np.log(X_train)\n",
"        X_test_log = np.log(X_test)\n",
"        \n",
"        # X_train_log = X_train_log.sort_values(by=self.key_x)\n",
"        # y_train = y_train.sort_values(by=self.key_y)\n",
"        X_test_log = X_test_log.sort_values(by=self.key_x)\n",
"        X_test = X_test.sort_values(by=self.key_x)\n",
"        y_test = y_test.sort_values(by=self.key_y)\n",
"        \n",
"        model = LinearRegression()\n",
"        model.fit(X_train_log, y_train)\n",
"        y_pred = model.predict(X_test_log)\n",
"        y_pred.sort()\n",
"\n",
"        m = model.coef_[0][0]\n",
"        b = model.intercept_[0]\n",
"        label = '%s log regression line for %s' % (color, self.language)\n",
"        show_model_stats(m, b, y_test, y_pred, label)\n",
"\n",
"        if nodraw:\n",
"            return\n",
"        plt.plot(X_test, y_pred, color=color, label=\"Log regression\")\n",
"        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
"\n",
"    def export_image(self, base_filename = 'images/programmers-%s-%s.png'):\n",
"        '''\n",
"        Saves canvas to file.\n",
"\n",
"        inputs:\n",
"            base_filename: string with two format codes (two strings),\n",
"                this string will be interpolated by...\n",
"                1. the programming language\n",
"                2. the country of origin.\n",
"        '''\n",
"        plt.figure(self.canvas)\n",
"        filename = base_filename % (self.language, self.country)\n",
"        plt.savefig(filename.replace(' ', '-'), bbox_inches='tight')\n",
"\n",
"    def probe_excluded_rows(self):\n",
"        '''\n",
"        Display information about developers excluded from analysis.\n",
"        '''\n",
"        nan_x_count = self.df_no_x.shape[0]\n",
"        nan_y_count = self.df_no_y.shape[0]\n",
"        print(nan_x_count, 'did not specify', self.key_x)\n",
"        print(nan_y_count, 'did not specify', self.key_y)\n",
"        print('total developers:', self.devs.shape[0] \n",
"              + nan_x_count + nan_y_count)\n",
"        title = '%s Developers Total vs Included' % self.language\n",
"        total_devs = pd.concat([self.devs, self.df_no_y])\n",
"        \n",
"        plt.figure()\n",
"        plt.title(title)\n",
"        plt.xticks(rotation=45)\n",
"        key = self.key_x\n",
"\n",
"        bins = [0, 10, 20, 30, 40, 50]\n",
"        labels = ['0-10', '11-20', '21-30', '31-40', '41-50']\n",
"        total_binned = pd.cut(total_devs[key], bins=bins, labels=labels).to_frame()\n",
"        devs_binned = pd.cut(self.devs[key], bins=bins, labels=labels).to_frame()\n",
"\n",
"        sb.countplot(x=key, data=total_binned, label='total')\n",
"        sb.countplot(x=key, data=devs_binned,\n",
"                     color='red', label='included in analysis')\n",
"        plt.legend()\n",
"        plt.savefig('images/%s-total-vs-included.png' % self.language)\n",
"        \n",
"        \n",
"def show_model_stats(coef, intercept, y_test, y_pred, label):\n",
"    '''\n",
"    Displays model performance.\n",
"\n",
"    inputs:\n",
"        coef: the coefficient of the model.\n",
"        intercept: the y-intercept of the model.\n",
"        y_test: true values to compare against model predictions.\n",
"        y_pred: prediction values from the model.\n",
"        \n",
"        label: string, to help identify which line (e.g color).\n",
"    '''\n",
"    print('+----------------------+')\n",
"    print(label)\n",
"    print('coefficient = %0.2f' % coef)\n",
"    print('intercept = %0.2f' % intercept)\n",
"    rmse = root_mean_squared_error(y_test, y_pred)\n",
"    print('rmse = %0.2f' % rmse)\n",
"    r2 = r2_score(y_test, y_pred)\n",
"    print('r2 score = %0.2f' % r2)\n",
"    print('sample predictions:')\n",
"    print(y_pred[3:6])\n",
"    print('+----------------------+')\n",
"\n",
"# the higher a is, the steeper the line gets\n",
"def log_base_a(x, a=1.07):\n",
"    '''\n",
"    Performs logarithmic transformation of value 'x' with base 'a'.\n",
"\n",
"    inputs:\n",
"        x: numeric, the variable to be transformed.\n",
"        a: numeric, the new base.\n",
"    '''\n",
"    return np.log10(x)/np.log(a)"
]
},
{
"cell_type": "markdown",
"id": "9a1df75a-4bcf-4072-9bab-d15b4a88c691",
"metadata": {},
"source": [
"## Data Modeling\n",
"\n",
"Generate models for American python programmers working as data scientists/analysts/engineers."
]
},
{
"cell_type": "code",
"execution_count": null,
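`log_base_a()` leans on the change-of-base identity log_a(x) = ln(x)/ln(a); because a linear fit absorbs constant scale factors, the base acts as a rescaling knob on the transformed axis rather than changing the family of curves. A minimal standalone check of the identity (this `log_base_change` helper is an illustration, not the notebook's exact implementation):

```python
import numpy as np

def log_base_change(x, a=1.07):
    # change of base: log_a(x) = ln(x) / ln(a)
    return np.log(x) / np.log(a)

# log_2(8) is exactly 3
val = log_base_change(8.0, a=2.0)
```

Any other base gives the same values up to a constant factor, which is why tuning `change_base` mainly rescales the reported coefficient rather than reshaping the fitted curve.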
@@ -219,41 +509,57 @@
"source": [
"\n",
"# expected python jobs\n",
"pyjobs = [\"Data scientist or machine learning specialist\",\n",
"          \"Data or business analyst\",\n",
"          \"Data engineer\",\n",
"pyjobs = ['Data scientist or machine learning specialist',\n",
"          'Data or business analyst',\n",
"          'Data engineer',\n",
"#          \"DevOps specialist\",\n",
"#          \"Developer, QA or test\"\n",
"]\n",
"\n",
"python = Foo(so_df, \"Python\", jobs=pyjobs, n_rich_outliers=12, n_poor_outliers=2)\n",
"python.visualize(hue=\"DevType\", palette=[\"#dbdb32\", \"#34bf65\", \"#ac70e0\"])\n",
"python.run_regression()\n",
"python = Foo(so_df, 'Python', jobs=pyjobs, n_rich_outliers=12, n_poor_outliers=2)\n",
"python.visualize(hue='DevType', palette=['#dbdb32', '#34bf65', '#ac70e0'])\n",
"python.run_regression(name = 'Default regression line')\n",
"python.run_regression( x_transform=log_base_a, change_base=1.20, \n",
"                      x_shift=0, y_shift=-1.5e4, line_color='cyan', random=888)\n",
"python.export_image()"
"                      x_shift=0, y_shift=-1.5e4, random=888,\n",
"                      color='cyan', name='Tuned regression line')\n",
"\n",
"#python.run_regression(x_transform=log_base_a, change_base=1.20, risky=2, random=555, \n",
"#                      color='pink', name='Risky regression line')\n",
"python.run_log_regression(nodraw=False)\n",
"python.export_image()\n",
"python.probe_excluded_rows()"
]
},
{
"cell_type": "markdown",
"id": "f4e3516e-ffe3-4768-ae92-e5cb0be503f8",
"id": "b6e42288-cc7b-4d1c-827f-137d4817dd50",
"metadata": {},
"source": [
"## Business Understanding\n",
"## Evaluation (Python)\n",
"\n",
"* For Python programmers specialized in data e.g data scientists, engineers, or analysts, a linear model moderately fits the relationship between income and years of experience. For data engineers and scientists, there is a possible divergence within the career path at 10 years of experience. \n",
"Two models will tell two different stories for data scientists, analysts, and engineers. For either model, roughly 30% of the variability of the data is explainable by years of experience. The \"cyan\" model performs slightly better than the default \"red\" model. The two models have roughly the same RMSE of around $40,000, meaning they may be off by that amount for any given x.\n",
"\n",
"* The typical starting salary within the field of data science is around 100,000 to 120,000 dollars.\n",
"### \"red\" / default model\n",
"\n",
"* The income of a data professional can either increase by 2,000 per year (red) or 10,000 per year (cyan).\n",
"1. Entry level data scientists/analysts/engineers earn $123,479.15 USD/year.\n",
"2. They get a raise of $2,573.62 for each year of experience.\n",
"3. This rate of increase in income is steady for multiple decades (>20 years of experience).\n",
"\n",
"* For both models, the r2 score ranges from poor to moderate = 0.20 - 0.37 depending on the random number. The variability not explained by the model could be the result of the fields such as advertising, finance, or bio/medical technology.\n",
"### \"cyan\" model\n",
"\n",
"* For any given point in the career, the model is off by 39,000 or 42,000 dollars.\n",
"1. Entry level positions yield $82,957.69.\n",
"2. There is a raise of $10,378.53 for each year of experience until 10.\n",
"3. At 10 years, a cohort (x < 10, y > $200,000) has experienced an unchanged rate of increase while the other experiences a reduced rate of increase similar to the slope (coefficient) from the \"red\" model.\n"
]
},
{
"cell_type": "markdown",
"id": "fd9fa93d-55e5-4588-af14-69252bd69447",
"metadata": {},
"source": [
"## Data Modeling and Evaluation (for C)\n",
"\n",
"Generally, for low incomes poorly explained by the model, the cause could be getting a new job after a year of unemployment, internships, or part-time positions. For high incomes poorly explained by the model, the cause could be professionals at large companies who had recently added a programming language to their skill set. Other causes could be company size or working hours.\n",
"\n",
"(Business understanding was done for C first. Questions are there.)"
"Generate models for American C programmers working as embedded systems developers, hardware engineers, or graphics/game programmers."
]
},
{
@@ -265,61 +571,58 @@
"source": [
"# expected C jobs\n",
"cjobs = [\n",
" 'Developer, embedded applications or devices', \n",
" 'Developer, game or graphics',\n",
" 'Hardware Engineer',\n",
" # \"Project manager\", \n",
" # \"Product manager\"\n",
"]\n",
"c = Foo(so_df, 'C', jobs=cjobs, n_rich_outliers=30, n_poor_outliers=2)\n",
"c.visualize(hue='DevType', palette=['#57e6da','#d9e352','#cc622d'])\n",
"c.run_regression()\n",
"c.run_regression(x_transform=log_base_a, change_base=1.3, \n",
" x_shift=2, y_shift=-5000, color='magenta', random=555)\n",
"c.run_log_regression(nodraw=False)\n",
"c.export_image()\n",
"c.probe_excluded_rows()"
]
},
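The magenta fit above linearizes the data with a log transform before regressing. `Foo` and `log_base_a` are the notebook's own helpers, so the following is only a self-contained sketch of the idea, assuming `log_base_a` computes a shifted log in an arbitrary base via the change-of-base identity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

def log_base_a(x, base=1.3, shift=0.0):
    # log_base(x + shift), via the change-of-base identity ln(x) / ln(base)
    return np.log(x + shift) / np.log(base)

# Synthetic data with the shape described in the README:
# steep early raises that flatten out with experience.
rng = np.random.default_rng(42)
years = rng.uniform(0, 25, size=300)
income = 50_000 + 9_000 * log_base_a(years, base=1.3, shift=2)
income = income + rng.normal(0, 8_000, size=years.shape)

# Fit a straight line in the transformed space: y = m * log_1.3(x + 2) + b
X = log_base_a(years, base=1.3, shift=2).reshape(-1, 1)
model = LinearRegression().fit(X, income)
pred = model.predict(X)

print(f"r2   = {r2_score(income, pred):.3f}")
print(f"rmse = {mean_squared_error(income, pred) ** 0.5:,.2f}")
```

Transforming x (rather than y) keeps the predictions in dollars, so the rmse stays directly interpretable as a dollar error.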
{
"cell_type": "markdown",
"id": "b7026c56-3049-4e60-bbc6-ee548ff58297",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Evaluation for C\n",
"\n",
"The magenta model for C is good but not great, with an r2 score of 0.57. `rmse = 21198.61`, meaning the model is off by about $21,198.61 on average for a given x value.\n",
"\n",
"1. Early-career C programmers earn about $54,776.27 per year.\n",
"2. They get a raise of $11,973.47 per year of experience.\n",
"3. After 10 years, the rate of increase is lower (possibly $1,427.58 per year, as suggested by the red regression line).\n"
]
},
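The dollar figures above read directly off a fitted linear model: the coefficient is the per-year raise, and a prediction at x = 2 gives the junior salary. A minimal sketch of that readout; the slope and intercept here are back-derived from the quoted numbers, not taken from the notebook's actual fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical line matching the figures quoted above:
# raise of $11,973.47 per year, so intercept = 54,776.27 - 2 * 11,973.47.
years = np.array([[1.0], [2.0], [5.0], [10.0]])
pay = 11_973.47 * years.ravel() + 30_829.33
model = LinearRegression().fit(years, pay)

raise_per_year = model.coef_[0]          # the coefficient: yearly raise
junior_salary = model.predict([[2.0]])[0]  # expected pay at 2 years of experience
print(round(raise_per_year, 2), round(junior_salary, 2))
```

Since the toy points lie exactly on the line, the fit recovers the slope and the 2-year prediction exactly.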
{
"cell_type": "markdown",
"id": "0f21b9fa-7de0-4c39-86ca-d8c4b03cc3c9",
"metadata": {},
"source": [
"## (More) Data Understanding and Exploration\n",
"\n",
"The cells below generate extra or unused graphs.\n",
"I put this section last so the kernel can be restarted and all cells re-run up to this point."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8357f841-23a0-4bfa-bf09-860bd3e014b8",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"outputs": [],
"source": [
"\n",
@@ -333,20 +636,6 @@
"js.export_image()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11a1b9fb-db48-4749-8d77-4241a99d7bad",
"metadata": {},
"outputs": [],
"source": [
"visualize_devs(get_c_devs(so_df), \"C\")\n",
"\n",
"for lang in [\"Cobol\", \"Prolog\", \"Ada\", \"Python\"]:\n",
"    foo = get_lang_devs(so_df, lang)\n",
"    visualize_devs(foo, lang)"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -658,7 +947,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,