stacksurvey/stackoverflow-survey.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "14623ab1-dc15-4aa7-96c2-074ea4d0e33a",
   "metadata": {},
   "source": [
    "# Project: Write a data science blog post\n",
    "\n",
    "## Business Understanding\n",
    "\n",
    "Salary or wages are a common talking point from business, personal finance, and economics.\n",
    "But what's the bigger picture beyond mean and median?\n",
    "\n",
    "1. How much can entry or junior level developers expect to be paid?\n",
    "2. How much more do they earn with each year of experience?\n",
    "3. At what point in a career do salaries or wages start to stagnate?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74e4cf25-6649-4633-89ea-03ffc2e23caa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "import pandas as pd\n",
    "import seaborn as sb\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# avoid burning my eyes @ night\n",
    "plt.style.use('dark_background')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56f4093b-450f-4529-831e-1f791e3e2c6a",
   "metadata": {},
   "source": [
    "## Data Understanding and Exploration\n",
    "\n",
    "The survey will ask participants to answer \"Apples\" to a question in order to check if they're paying attention to the questions. The published data set already purged rows that failed the check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2b80545-2481-4ee8-8d43-ffd4a612a397",
   "metadata": {},
   "outputs": [],
   "source": [
    "FILE = 'data/survey_results_public.csv'\n",
    "so_df = pd.read_csv(FILE)\n",
    "\n",
    "print(so_df.keys())\n",
    "so_df.describe()\n",
    "\n",
    "# check for people who aren't paying attention\n",
    "count_not_apple =  (so_df['Check'] != 'Apples').sum()\n",
    "print(count_not_apple)\n",
    "print(so_df.shape)\n",
    "assert(count_not_apple == 0)\n",
    "# print(so_df[:3])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e9b0c49-eac6-45e1-83f1-92813e734ef5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# draw count plot of developers based on age\n",
    "\n",
    "def visualize_devs(df, title, key='Age'):\n",
    "    '''\n",
    "    Draws count plot of developers based on attributes.\n",
    "\n",
    "    inputs:\n",
    "        df:    a DataFrame, the subset of the data set.\n",
    "        title: string, title of the chart.\n",
    "        key:   string, the attribute to count (age).\n",
    "    outputs:\n",
    "        no return values, will draw and save a graphic.\n",
    "    '''\n",
    "    plt.figure()\n",
    "    plt.xticks(rotation=45)\n",
    "    # from:\n",
    "    # print(df[key].unique())\n",
    "    order =  ['Under 18 years old', '18-24 years old',  \\\n",
    "              '25-34 years old','35-44 years old',\\\n",
    "              '45-54 years old', '55-64 years old',  \\\n",
    "              '65 years or older', 'Prefer not to say']\n",
    "    sb.countplot(x=key, data=df, order=order)\n",
    "    plt.title(title)\n",
    "    filename= 'images/%s.png' % title.replace(\" \", \"-\")\n",
    "    plt.savefig(filename, bbox_inches='tight')\n",
    "\n",
    "\n",
    "def get_lang_devs(df, lang):\n",
    "    '''\n",
    "    Returns a DataFrame, subset of the data set, of developers that have\n",
    "    worked with a specified programming language.\n",
    "\n",
    "    inputs:\n",
    "        df:   a DataFrame, can be the entire published data set.\n",
    "        lang: a string, the programming language.\n",
    "    outputs:\n",
    "        a DataFrame of developers that have worked with `lang` programming \n",
    "        language.\n",
    "    '''\n",
    "    col = 'LanguageHaveWorkedWith'\n",
    "    # will not work for single character languages (C, R)\n",
    "    # will mangle Java and JavaScript, Python and MicroPython\n",
    "    return df[ df[col].str.contains(lang, na=False) ] \n",
    "\n",
    "\n",
    "def get_c_devs(df, lang='C'):\n",
    "    '''\n",
    "    Returns a DataFrame, subset of the data set, of developers that have\n",
    "    worked with a specified programming language.\n",
    "    Similar to get_lang_devs() but adapted for languages named by a single\n",
    "    letter, or names like 'Java' which is contained in 'JavaScript'.\n",
    "\n",
    "    inputs:\n",
    "        df:   a DataFrame, can be the entire published data set.\n",
    "        lang: a string, the programming language.\n",
    "    outputs:\n",
    "        a DataFrame of developers that have worked with `lang` programming \n",
    "        language.\n",
    "    '''\n",
    "    key = 'LanguageHaveWorkedWith'\n",
    "    cdevs = []\n",
    "    for index, dev in df.iterrows():\n",
    "        try:\n",
    "            # split string into list\n",
    "            langs_used = dev[key].split(';')\n",
    "            if lang in langs_used:\n",
    "                cdevs.append(dev)\n",
    "        except AttributeError:\n",
    "#            print(dev[key])\n",
    "            pass\n",
    "    return pd.DataFrame(cdevs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11a1b9fb-db48-4749-8d77-4241a99d7bad",
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_devs( get_c_devs(so_df) , 'Ages of C Programmers')\n",
    "visualize_devs( get_c_devs(so_df, lang='Python') , 'Ages of Python Programmers')\n",
    "\n",
    "for lang in ['Cobol', 'Prolog', 'Ada']:\n",
    "    title = 'Ages of %s Programmers' % lang\n",
    "    foo = get_lang_devs(so_df, lang)\n",
    "    visualize_devs(foo, title)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab9ce039-8ed4-46d7-8eea-426c460d0a7b",
   "metadata": {},
   "source": [
    "## Preparing the Data\n",
    "\n",
    "`__init__()` specifies which rows to omit and which to use, so the data for modeling doesn't look like a shotgun blast of rainbow colors.\n",
    "\n",
    "### NaNs are dropped\n",
    "\n",
    "No values are assumed in the place of NaN for keys 'YearsCodePro' and 'ConvertedCompYearly'.\n",
    "\n",
    "Rows with NaN are dropped for developers who:\n",
    "* did not specify their years of professional experience\n",
    "* did not disclose an annual compensation.\n",
    "\n",
    "More developers declined to specify their income than years of experience. Between total and included rows, the distributions of years of experience is similar. This supports that the analysis is not significantly altered by missing data.\n",
    "\n",
    "See charts\n",
    "\n",
    "* Python Developers Total vs Included\n",
    "* C Developers Total vs Included"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8212c27-6c76-4c8f-ba66-bbf1b5835c99",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import root_mean_squared_error, r2_score\n",
    "import traceback\n",
    "import numpy as np\n",
    "\n",
    "# still haven't come up with a name\n",
    "class Foo:\n",
    "    def __init__(self, df, language, jobs=None, \n",
    "                 n_rich_outliers=0, n_poor_outliers=0, \n",
    "                 country='United States of America'):\n",
    "        '''\n",
    "        inputs:\n",
    "            dataset:  A DataFrame, can be the full data set.\n",
    "            language: string, the programming language \n",
    "                a developer has worked with.\n",
    "            jobs:     list of strings, job positions \n",
    "            - typically domains where the language is dominant.\n",
    "            n_rich_outliers: integer, removes samples from the \n",
    "                upper limit of the y-axis.\n",
    "            n_poor_outliers: integer, removes samples from the \n",
    "                lower limit of the y-axis.\n",
    "            country: string, specifies the country of origin.\n",
    "        '''\n",
    "        self.devs   = None\n",
    "        self.canvas = None\n",
    "        self.language = language\n",
    "        self.country = country\n",
    "        # focus on people who have given ...\n",
    "        key_x  = 'YearsCodePro'\n",
    "        key_y  = 'ConvertedCompYearly'\n",
    "        self.key_x = key_x\n",
    "        self.key_y = key_y\n",
    "\n",
    "        qualifiers = {\n",
    "            'MainBranch': 'I am a developer by profession',\n",
    "       }\n",
    "        if country:\n",
    "            qualifiers['Country'] = country\n",
    "        for k in qualifiers:\n",
    "            df = df[df[k] == qualifiers[k] ] \n",
    "\n",
    "        # chatgpt tells me about filtering with multiple strings\n",
    "        if jobs:\n",
    "            df = df[df.isin(jobs).any(axis=1)]\n",
    "\n",
    "        devs = None\n",
    "        if len(language) == 1 or language in ['Python', 'Java']:\n",
    "            devs = get_c_devs(df, lang=language)\n",
    "        else:\n",
    "            devs = get_lang_devs(df, language)\n",
    "\n",
    "        self.df_no_x = devs[devs[key_x].isnull()]\n",
    "        self.df_no_y = devs[devs[key_y].isnull()]\n",
    "        devs  = devs.dropna(subset=[key_x, key_y])\n",
    "\n",
    "        replacement_dict = {\n",
    "            'Less than 1 year': '0.5',\n",
    "            'More than 50 years': '51',\n",
    "        }\n",
    "\n",
    "        # https://stackoverflow.com/questions/47443134/update-column-in-pandas-dataframe-without-warning\n",
    "        pd.options.mode.chained_assignment = None  # default='warn'\n",
    "    \n",
    "        new_column = devs[key_x].replace(replacement_dict)\n",
    "        devs[key_x] = pd.to_numeric(new_column, errors='raise')\n",
    "\n",
    "        new_column = self.df_no_y[key_x].replace(replacement_dict)\n",
    "        self.df_no_y[key_x] = pd.to_numeric(new_column, errors='raise')\n",
    "        pd.options.mode.chained_assignment = 'warn'  # default='warn'\n",
    "        # print( devs[key_x].unique() )\n",
    "        \n",
    "        indices  = devs[key_y].nlargest(n_rich_outliers).index\n",
    "        devs = devs.drop(indices)\n",
    "        indices  = devs[key_y].nsmallest(n_poor_outliers).index\n",
    "        self.devs = devs.drop(indices)\n",
    "        del devs, new_column\n",
    "    \n",
    "    def visualize(self,  hue='Country', \n",
    "                  palette=sb.color_palette() ):\n",
    "        '''\n",
    "        Draw scatter plot of samples included in self.devs.\n",
    "\n",
    "        inputs:\n",
    "            hue:     string, colorize dots by a given key.\n",
    "            palette: list of strings (color codes)\n",
    "                     or string (matplotlib predefined palettes),\n",
    "                     specifies the colors to use when coloring dots.\n",
    "        '''\n",
    "        self.canvas = plt.figure()\n",
    "        key_x = self.key_x\n",
    "        key_y = self.key_y\n",
    "\n",
    "        sb.scatterplot(data=self.devs, x=key_x, y=key_y, hue=hue, palette=palette)\n",
    "        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
    "        title = 'Annual Compensation of %s Programmers Over Years of Experience' % self.language\\\n",
    "                + '\\nsample size=%i' %  len (self.devs)\\\n",
    "                + '\\ncountry=%s' % self.country\n",
    "        plt.title(title)\n",
    "\n",
    "    def run_regression(self, x_transform=None, change_base=1.07, \n",
    "                       x_shift=0, y_shift=0,\n",
    "                       random=333, risky=0,\n",
    "                       color='red', name='Regression Line' ):\n",
    "        '''\n",
    "        Run linear regresssion and draws a straight line.\n",
    "\n",
    "        inputs:\n",
    "            x_transform: function, function to tune the independent variable.\n",
    "            change_base: float or integer, specifies base \n",
    "                for logarithmic function, not used if x_transform is None.\n",
    "            x_shift: integer, for tuning, shifts the position \n",
    "                of the line on the x-axis.\n",
    "            y_shift: integer, for tuning, shifts the position \n",
    "                of the line on the y-axis.\n",
    "            random:  integer, random seed for train_test_split; \n",
    "                change to test generalization.\n",
    "            risky    integer ranging from 0 to 2,\n",
    "                    0 = does nothing (default),\n",
    "                    1 = sorts the independent variable,\n",
    "                    2 = sorts the dependent variable,\n",
    "               performs unrecommended operation to sort data,\n",
    "               risking the model training on the order of values.\n",
    "               May draw nice lines that generalize across random states.\n",
    "           color: string, color of the regression line.\n",
    "           name:  string, label of regression line on the legend.\n",
    "        '''\n",
    "        df = self.devs # .sort_values(by = self.key2)\n",
    "        X = df[[self.key_x]]\n",
    "        y = df[[self.key_y]]\n",
    "\n",
    "        # not recommended\n",
    "        # carries risk of model training on sorted order\n",
    "        # however it appears to be generalizing well\n",
    "        # across random state and shuffle (=True, default)\n",
    "        style = '-'\n",
    "        if risky > 0:\n",
    "            X = X.sort_values(by=self.key_x)\n",
    "            style = '--'\n",
    "        if risky > 1:\n",
    "            y = y.sort_values(by=self.key_y)\n",
    "        if x_transform is not None:\n",
    "            X = x_transform (X, a=change_base ) \n",
    "\n",
    "        X = X + x_shift\n",
    "        y = y + y_shift\n",
    "    \n",
    "        X_train, X_test, y_train, y_test = train_test_split(\n",
    "                                                X, y, \n",
    "                                                test_size=0.2, \n",
    "                                                random_state=random)\n",
    "\n",
    "        model = LinearRegression()\n",
    "        model.fit(X_train, y_train)\n",
    "        y_pred = model.predict(X_test)\n",
    "    \n",
    "        m = model.coef_[0][0]\n",
    "        b = model.intercept_[0]\n",
    "        label = '%s regression line for %s' % (color, self.language)\n",
    "        show_model_stats(m, b, y_test, y_pred, label)\n",
    "\n",
    "        plt.figure(self.canvas)\n",
    "        plt.plot(X_test, y_pred, color=color, label=name, linestyle=style)\n",
    "        plt.axhline(y=b, color='purple', linestyle='--', \n",
    "                    label='b=%0.2f' % b, zorder=-1 )\n",
    "        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
    "        del y_pred, model, X, y\n",
    "\n",
    "    def run_log_regression(self, color='pink', nodraw=True):\n",
    "        '''\n",
    "        Runs logarithmic regression and draws a line that contours \n",
    "        at the point of diminishing returns.\n",
    "\n",
    "        Logarithmic regression provides a better fit for the data;\n",
    "        however, it is not part of the course.\n",
    "\n",
    "        Can illustrate an interesting relationship between the\n",
    "        \"default\" linear model and a tuned linear model.\n",
    "\n",
    "        inputs:\n",
    "            color:   color of the regression line.\n",
    "            nodraw:  whether or not to draw the line.\n",
    "        '''\n",
    "        df = self.devs\n",
    "        X = df[[self.key_x]] #.sort_values(by=self.key_x)\n",
    "        y = df[[self.key_y]] #.sort_values(by=self.key_y)\n",
    "\n",
    "        X_train, X_test, y_train, y_test = train_test_split(\n",
    "                                                X, y, \n",
    "                                                test_size=0.2, \n",
    "                                                random_state=777)\n",
    "    \n",
    "        X_train_log = np.log(X_train)\n",
    "        X_test_log = np.log(X_test)\n",
    "    \n",
    " #       X_train_log = X_train_log.sort_values(by=self.key_x)\n",
    " #       y_train = y_train.sort_values(by=self.key_y)\n",
    "        X_test_log = X_test_log.sort_values(by=self.key_x)\n",
    "        X_test = X_test.sort_values(by=self.key_x)\n",
    "        y_test = y_test.sort_values(by=self.key_y)\n",
    "        \n",
    "        model = LinearRegression()\n",
    "        model.fit(X_train_log, y_train)\n",
    "        y_pred = model.predict(X_test_log)\n",
    "        y_pred.sort()\n",
    "\n",
    "        m = model.coef_[0][0]\n",
    "        b = model.intercept_[0]\n",
    "        label = '%s log regression line for %s' % (color, self.language)\n",
    "        show_model_stats(m, b, y_test, y_pred, label)\n",
    "\n",
    "        if nodraw:\n",
    "            return\n",
    "        plt.plot(X_test, y_pred, color=color, label=\"Log regression\")\n",
    "        plt.legend(loc='lower center', bbox_to_anchor=(1.5,0)) \n",
    "\n",
    "    def export_image(self, base_filename = 'images/programmers-%s-%s.png'):\n",
    "        '''\n",
    "        Saves canvas to file.\n",
    "\n",
    "        inputs:\n",
    "            base_filename: string with two format codes (two strings),\n",
    "                this string will be interpolated by...\n",
    "                1. the programming language\n",
    "                2. the country of origin.\n",
    "        '''\n",
    "        plt.figure(self.canvas)\n",
    "        filename = base_filename % (self.language, self.country)\n",
    "        plt.savefig(filename.replace(' ', '-'), bbox_inches='tight')\n",
    "\n",
    "    def probe_excluded_rows(self):\n",
    "        '''\n",
    "        Display information about developers excluded from analysis.\n",
    "        '''\n",
    "        nan_x_count = self.df_no_x.shape[0]\n",
    "        nan_y_count = self.df_no_y.shape[0]\n",
    "        print(nan_x_count, 'did not specify', self.key_x)\n",
    "        print(nan_y_count, 'did not specify', self.key_y)\n",
    "        print('total developers:', self.devs.shape[0] \n",
    "              + nan_x_count + nan_y_count)\n",
    "        title = '%s Developers Total vs Included' % self.language\n",
    "        total_devs = pd.concat([self.devs, self.df_no_y])\n",
    "    \n",
    "        plt.figure()\n",
    "        plt.title(title)\n",
    "        plt.xticks(rotation=45)\n",
    "        key   = self.key_x\n",
    "\n",
    "        bins = [0, 10, 20, 30, 40, 50]\n",
    "        labels = ['0-10', '11-20', '21-30', '31-40', '41-50']\n",
    "        total_binned = pd.cut(total_devs[key], bins=bins, labels=labels).to_frame()\n",
    "        devs_binned  = pd.cut(self.devs[key], bins=bins, labels=labels).to_frame()\n",
    "\n",
    "        sb.countplot(x=key, data=total_binned, label='total')\n",
    "        sb.countplot(x=key, data=devs_binned,\n",
    "                     color='red', label='included in analysis')\n",
    "        plt.legend()\n",
    "        plt.savefig('images/%s-total-vs-included.png' % self.language)\n",
    "        \n",
    "    \n",
    "def show_model_stats(coef, intercept, y_test, y_pred, label):\n",
    "    '''\n",
    "    Displays model performance.\n",
    "\n",
    "    inputs:\n",
    "        coef:      the coefficient of the model.\n",
    "        intercept: the y-intercept of the model.\n",
    "        y_test:    true values to compare against model predictions.\n",
    "        y_pred:    prediction values from the model.\n",
    "    \n",
    "        label:     string, to help identify which line (e.g color).\n",
    "    '''\n",
    "    print('+----------------------+')\n",
    "    print(label)\n",
    "    print('coefficient = %0.2f' % coef)\n",
    "    print('intercept = %0.2f' % intercept)\n",
    "    rmse = root_mean_squared_error(y_test, y_pred)\n",
    "    print('rmse = %0.2f' % rmse)\n",
    "    r2   = r2_score(y_test, y_pred)\n",
    "    print('r2 score = %0.2f' % r2)\n",
    "    print('sample predictions:')\n",
    "    print(y_pred[3:6])\n",
    "    print('+----------------------+')\n",
    "\n",
    "# the higher a is, the steeper the line gets\n",
    "def log_base_a(x, a=1.07):\n",
    "    '''\n",
    "    Performs logarithmic transformation of value 'x' with base 'a'.\n",
    "\n",
    "    inputs:\n",
    "        x: numeric, the variable to be transformed.\n",
    "        a: numeric, the new base.\n",
    "    '''\n",
    "    return np.log10(x)/np.log(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a1df75a-4bcf-4072-9bab-d15b4a88c691",
   "metadata": {},
   "source": [
    "## Data Modeling\n",
    "\n",
    "Generate models for American python programmers working as data scientists/analysts/engineers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba81c59c-0610-4f71-96fb-9eddd7736329",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# expected python jobs\n",
    "pyjobs = ['Data scientist or machine learning specialist',\n",
    "          'Data or business analyst',\n",
    "          'Data engineer',\n",
    "#        \"DevOps specialist\",\n",
    "#        \"Developer, QA or test\"\n",
    "]\n",
    "\n",
    "python = Foo(so_df, 'Python', jobs=pyjobs, n_rich_outliers=12, n_poor_outliers=2)\n",
    "python.visualize(hue='DevType', palette=['#dbdb32', '#34bf65', '#ac70e0'])\n",
    "python.run_regression(name = 'Default regression line')\n",
    "python.run_regression( x_transform=log_base_a, change_base=1.20, \n",
    "                       x_shift=0, y_shift=-1.5e4, random=888,\n",
    "                       color='cyan', name='Tuned regression line')\n",
    "\n",
    "#python.run_regression(x_transform=log_base_a, change_base=1.20, risky=2, random=555, \n",
    "#                      color='pink', name='Risky regression line')\n",
    "python.run_log_regression(nodraw=False)\n",
    "python.export_image()\n",
    "python.probe_excluded_rows()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6e42288-cc7b-4d1c-827f-137d4817dd50",
   "metadata": {},
   "source": [
    "## Evaluation (Python)\n",
    "\n",
    "Two models will tell two different stories for data scientists, analysts, and engineers. For either model, roughly 30% of the variability of the data is explanable by years of experience. The \"cyan\" model performs slightly better than the default \"red\" model. The two models have roughly the same RMSE of around $40,000, meaning they may be off by that amount for any given x.\n",
    "\n",
    "### \"red\" / default model\n",
    "\n",
    "1. Entry level data scientists/analysts/engineers earn $123,479.15 USD/year.\n",
    "2. They get a raise of $2,573.62 for each year of experience.\n",
    "3. This rate of increase in income is steady for multiple decades (>20 years of experience).\n",
    "\n",
    "### \"cyan\" model\n",
    "\n",
    "1. Entry level positions yield $82,957.69.\n",
    "2. There is a raise of $10,378.53 for each year of experience until 10.\n",
    "3. At 10 years, a cohort (x < 10, y > $200,000) has experienced an unchanged rate of increase while the other experiences a reduced rate of increase similar to the slope (coefficient) from the \"red\" model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd9fa93d-55e5-4588-af14-69252bd69447",
   "metadata": {},
   "source": [
    "## Data Modeling and Evaluation (for C)\n",
    "\n",
    "Generate models for American C programmers working as embedded systems developers, hardware engineers, or graphics/game programmers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e27f76c-8f87-4c39-ac2f-5a9b2434466f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# expected C jobs\n",
    "cjobs = [\n",
    "    'Developer, embedded applications or devices', \n",
    "    'Developer, game or graphics',\n",
    "    'Hardware Engineer',\n",
    " #        \"Project manager\", \n",
    " #        \"Product manager\"\n",
    "]\n",
    "c = Foo(so_df, 'C', jobs=cjobs, n_rich_outliers=30, n_poor_outliers=2)\n",
    "c.visualize(hue='DevType', palette=['#57e6da','#d9e352','#cc622d'] ) \n",
    "c.run_regression()\n",
    "c.run_regression(x_transform=log_base_a, change_base=1.3, \n",
    "                 x_shift=2, y_shift=-5000, color='magenta', random=555)\n",
    "c.run_log_regression(nodraw=False)\n",
    "c.export_image()\n",
    "c.probe_excluded_rows()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7026c56-3049-4e60-bbc6-ee548ff58297",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Evaluation for C\n",
    "\n",
    "The magenta model for C is good but not great with an r2 score of 0.57. `rmse = 21198.61`, meaning the model is off by $21,198.61 for a given x value.\n",
    "\n",
    "1. Early career C programmers earn about $54,776.27 per year.\n",
    "2. They get a raise of $11,973.47 per year of experience.\n",
    "3. After 10 years, the rate of increase is lower (possibly $1,427.58 as depicted in the red regression line).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f21b9fa-7de0-4c39-86ca-d8c4b03cc3c9",
   "metadata": {},
   "source": [
    "## (More) Data Understanding and Exploration\n",
    "\n",
    "Below cells generate extra or unused graphs.\n",
    "I put this here because I want to restart the kernel and re-run cells until this point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8357f841-23a0-4bfa-bf09-860bd3e014b8",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "jsjobs = [\"Developer, full-stack\",\n",
    "          \"Developer, front-end\",\n",
    "          \"Developer, mobile\"\n",
    "]\n",
    "\n",
    "js = Foo(so_df, \"JavaScript\", jobs=jsjobs, n_rich_outliers=6, country=\"Ukraine\")\n",
    "js.visualize(hue=\"DevType\")\n",
    "js.export_image()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35b9727a-176c-4193-a1f9-a508aecd2d1c",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# get popularity of different programming languages\n",
    "\n",
    "#keys re: languages are:\n",
    "#LanguageHaveWorkedWith,LanguageWantToWorkWith,LanguageAdmired,LanguageDesired\n",
    "\n",
    "# draw as strip chart\n",
    "# https://seaborn.pydata.org/generated/seaborn.stripplot.html#seaborn.stripplot\n",
    "\n",
    "def get_langs(dataset, key=\"LanguageHaveWorkedWith\"):\n",
    "    lang_count = Counter()\n",
    "    assert(key in dataset.keys())\n",
    "    for response in dataset[key]:\n",
    "        if type(response) == str:\n",
    "            lang_count.update(response.split(';'))\n",
    "    langs_by_popularity = dict(\n",
    "        sorted(lang_count.items(), key=lambda item: item[1], reverse=True)\n",
    "    )\n",
    "    return langs_by_popularity\n",
    "\n",
    "def visualize_langs(langs, langs2, label1 = \"condition1\", label2 = \"condition2\", saveto=None):\n",
    "    DOT_COLOR1 = \"lightblue\"\n",
    "    DOT_COLOR2 = \"red\"\n",
    "    BG_COLOR   = \"black\" \n",
    "    df    = pd.DataFrame(langs.items(), columns=['Languages', 'Count'])\n",
    "    df2   = pd.DataFrame(langs2.items(), columns=['Languages', 'Count'])\n",
    "    \n",
    "    plt.figure(figsize=(10,15)) \n",
    "    \n",
    "    sb.stripplot(x='Count', y='Languages', data=df, \\\n",
    "                 size=5, color=DOT_COLOR1, label=\"have worked with\", jitter=True)\n",
    "    sb.stripplot(x='Count', y='Languages', data=df2, \\\n",
    "                 size=5, color=DOT_COLOR2, label=\"want to work with\", jitter=True)\n",
    "    \n",
    "    # chatgpt draws my legend\n",
    "    # Create custom legend handles to avoid duplicates\n",
    "    # color = 'w' means do not draw line bissecting point\n",
    "    blue_patch = plt.Line2D(\n",
    "        [0], [0], marker='o', color=BG_COLOR, \\\n",
    "        label=label1, markerfacecolor=DOT_COLOR1, markersize=10)\n",
    "    red_patch = plt.Line2D(\n",
    "        [0], [0], marker='o', color=BG_COLOR, \\\n",
    "        label=label2, markerfacecolor=DOT_COLOR2, markersize=10)\n",
    "    \n",
    "    # Show the legend with custom handles\n",
    "    plt.legend(handles=[blue_patch, red_patch], loc=\"center right\")\n",
    "    \n",
    "    plt.grid(axis='x', linestyle='--', alpha=0.75) \n",
    "    plt.title(\"%s vs %s\" % (label1, label2))\n",
    "    if saveto is not None:\n",
    "        plt.savefig(saveto, bbox_inches='tight')\n",
    "    del df, df2\n",
    "\n",
    "l1 = get_langs( so_df )\n",
    "l2 = get_langs( so_df, \"LanguageWantToWorkWith\" )\n",
    "visualize_langs(l1,l2, \n",
    "                label1=\"have worked with\", label2=\"want to work with\",\n",
    "                saveto=\"images/used-vs-want2use.png\")\n",
    "\n",
    "l3 = get_langs( so_df, \"LanguageAdmired\")\n",
    "l4 = get_langs( so_df, \"LanguageWantToWorkWith\")\n",
    "visualize_langs(l3, l4, \n",
    "                label1=\"admired\", label2=\"want to work with\",\n",
    "               saveto=\"images/admired-vs-want2use.png\")\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0bfdb92-378a-4452-91cc-4d21afd2d6cc",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# draw horizontal bar plot\n",
    "# https://seaborn.pydata.org/examples/part_whole_bars.html\n",
    "\n",
    "# investigate extrinsic vs intrinsic motivation\n",
    "def get_difference(dict1, dict2, proportion=False):\n",
    "    keys = dict1.keys()\n",
    "    result = dict()\n",
    "    for key in keys:\n",
    "        if proportion:\n",
    "            result[key] = round((dict1[key] - dict2[key])/dict2[key],2)\n",
    "        else:\n",
    "            result[key] = dict1[key] - dict2[key]\n",
    "    return result\n",
    "\n",
    "def visualize_diff(diff_dict, color=\"lightblue\", saveto=None):\n",
    "    diff_sorted = dict(\n",
    "        sorted(diff_dict.items(), key=lambda item: item[1], reverse=True)\n",
    "    )\n",
    "    KEY = \"Value\"\n",
    "    df    = pd.DataFrame(diff_sorted.items(), columns=['Languages', 'Value'])\n",
    "    plt.figure(figsize=(15,20)) \n",
    "    sb.barplot(x=KEY, y='Languages', data=df, color=color)\n",
    "    DELTA =  '\\u0394'\n",
    "    for index, value in enumerate(df[KEY]):\n",
    "    # chatgpt annotates my chart\n",
    "    # Position the text at the base of the bar\n",
    "        if value >= 0:\n",
    "            # Adjust the x position for positive values\n",
    "            plt.text(value, index, DELTA+str(value), va='center', ha=\"left\")  \n",
    "        else:\n",
    "             # Adjust the x position for negative values\n",
    "            plt.text(value, index,  DELTA+str(value), va='center',  ha='right') \n",
    "    lowest = 0\n",
    "    offset = 0\n",
    "    positive_values = df[df[KEY] > 0][KEY]\n",
    "    if not positive_values.empty:\n",
    "        lowest = positive_values.min()\n",
    "        offset = list(positive_values).count(lowest) \n",
    "    if len(positive_values) < len(df):\n",
    "        # don't draw the line if every value is greater than 0_\n",
    "        plt.axhline(y=df[KEY].tolist().index(lowest) + (offset-0.5), \n",
    "                    color='red', linestyle='--', zorder=-1)\n",
    "    if saveto is not None:\n",
    "        plt.savefig(saveto, bbox_inches='tight')\n",
    "    \n",
    "motiv_diff = get_difference(l2, l1, proportion=True)\n",
    "# print(motiv_diff)\n",
    "visualize_diff(motiv_diff, saveto=\"images/delta.png\")\n",
    "motiv_diff = get_difference(l2, l1)\n",
    "visualize_diff(motiv_diff, saveto=\"images/delta-b.png\")\n",
    "\n",
    "# no clear description of what \"admired\" is\n",
    "# in the schema\n",
    "# but generally people want to use the languages\n",
    "# they admire\n",
    "\n",
    "# determine level of hype\n",
    "# hype = get_difference(l4, l3)\n",
    "# print(hype)\n",
    "# visualize_diff(hype, color=\"red\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6b1a935-eeda-416f-8adf-5e854d3aa066",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# do people fall out of love with langs\n",
    "# the more they are used professionally?\n",
    "\n",
    "def visualize_favor(df, key_x, key_y, MAGIC_X=0, MAGIC_Y=0, title=str(), saveto=None):\n",
    "    plt.figure()\n",
    "    OFFSET = 1 # push text away from point slightly\n",
    "    for i in range(merged.shape[0]):\n",
    "        # label points that aren't un a cluster\n",
    "        if merged[key_x][i] > MAGIC_X or merged[key_y][i] > MAGIC_Y:\n",
    "            plt.text(merged[key_x].iloc[i]+OFFSET, \n",
    "                     merged[key_y].iloc[i]+OFFSET, \n",
    "                     merged[\"Language\"].iloc[i], \n",
    "                     ha=\"left\",\n",
    "                     size='medium')\n",
    "\n",
    "    sb.scatterplot(data=merged, x=key_x, y=key_y, hue=\"Language\")\n",
    "    plt.legend(loc='lower left', bbox_to_anchor=(0, -1.25), ncol=3) \n",
    "    plt.title(title)\n",
    "    if saveto is not None:\n",
    "        plt.savefig(saveto, bbox_inches='tight')\n",
    "    pass\n",
    "key_x  = \"Users\"\n",
    "key_y  = \"Potential '\\u0394'Users\"\n",
    "df1    = pd.DataFrame(l1.items(), columns=['Language', key_x])\n",
    "df2    = pd.DataFrame(motiv_diff.items(), columns=['Language', key_y])\n",
    "# chatgpt tells me how to combine df\n",
    "merged = pd.merge(df1, df2[[\"Language\", key_y]], on='Language', how='left')\n",
    "visualize_favor(merged, key_x, key_y, \n",
    "                MAGIC_X=5000, MAGIC_Y=2000, \n",
    "                saveto=\"images/favor.png\")\n",
    "del df1, df2, merged"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e90cf119-c50d-468a-bc87-72dac41176ce",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# see how much money are people making\n",
    "\n",
    "def get_mean_by_category(df, category, key=\"ConvertedCompYearly\"):\n",
    "    unique = df[category].unique()\n",
    "    result = dict()\n",
    "    for u in unique:\n",
    "        mean = df[df[category] == u][key].mean()\n",
    "        result[u] = mean\n",
    "    return result\n",
    "\n",
    "def show_me_the_money(df, saveto=None):\n",
    "    key_x = \"ConvertedCompYearly\"\n",
    "    key_y = \"DevType\"\n",
    "    \n",
    "    means   = get_mean_by_category(df, key_y) \n",
    "    mean_df = pd.DataFrame(means.items(), columns=[key_y, key_x])\n",
    "\n",
    "    plt.figure(figsize=(14,18)) \n",
    "    plt.axvline(x=1e5, color='red', linestyle='--', label=\"x = $100,000\")\n",
    "    plt.axvline(x=1e6, color='lightgreen', linestyle='--', label=\"x = millionaire\")\n",
    "    sb.barplot(x=key_x, y=key_y, data=mean_df.sort_values(by=key_x), \\\n",
    "               color='lavender', alpha=0.7, label=\"average compensation\")\n",
    "    sb.stripplot(x=key_x, y=key_y, data=df, \\\n",
    "                 size=3, jitter=True)\n",
    "    if saveto is not None:\n",
    "        plt.savefig(saveto, bbox_inches='tight')\n",
    "    \n",
    "# print survey ans\n",
    "#employment_status = Counter(so_df[\"MainBranch\"])\n",
    "#print(employment_status)\n",
    "\n",
    "#employment_type = Counter(so_df[\"DevType\"])\n",
    "#print(employment_type)\n",
    "\n",
    "key = \"ConvertedCompYearly\"\n",
    "#    answers = so_df[:-1][key].count()\n",
    "#    print(answers, \"people answered re: \", key)\n",
    "df_no_na = so_df.dropna(subset=[key])\n",
    "indices  = df_no_na[key].nlargest(15).index\n",
    "\n",
    "show_me_the_money( df_no_na.drop(indices), saveto=\"images/compensation-by-profession.png\" )\n",
    "# could also ask myself what portion of developers \n",
    "# earn less than the mean compensation\n",
    "# (what titles have high standard deviations in earnings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdf21b1c-1316-422f-ad14-48150f80366c",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "# key   = \"DevType\"\n",
    "# prof  = \"Developer, full-stack\"\n",
    "\n",
    "key   = \"MainBranch\"\n",
    "prof = \"I am a developer by profession\"\n",
    "col   = \"ConvertedCompYearly\"\n",
    "\n",
    "devs =  df_no_na[df_no_na[key] ==  prof ] \n",
    "pd.set_option('display.float_format', '{:.2f}'.format)\n",
    "devs.describe()[col]\n",
    "\n",
    "# who the hell is making $1/yr \n",
    "# devs[devs[col] == 1.0]\n",
    "\n",
    "# who are the millionaires\n",
    "# devs[devs[col] > 1e6]\n",
    "\n",
    "# who make more than the mean\n",
    "# devs[devs[col] > 76230.84]\n",
    "\n",
    "# who make more than the median\n",
    "# devs[devs[col] > 63316.00]\n",
    "\n",
    "# the ancient ones\n",
    "so_df[so_df[\"YearsCodePro\"] == 'More than 50 years']\n",
    "# should drop the 18-24 year old who is either bullshitting or recalls a past life\n",
    "# 55-64 years old\n",
    "# 65 years or older"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}