{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quantitative Structure-Property Relationships\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quantitative Structure-Property Relationships (QSPR) and Quantitative\n",
"Structure-Activity Relationships (QSAR) use statistical models to relate a set\n",
"of predictor values to a response variable. Molecules are described using a set\n",
"of *descriptors*, and then mathematical relationships can be developed to explain\n",
"observed properties. In QSPR and QSAR, physico-chemical properties of theoretical\n",
"descriptors of chemicals are used to predict either a physical property or a\n",
"biological outcome.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Molecular Descriptors\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A molecular descriptor is “final result of a logical and mathematical procedure,\n",
"which transforms chemical information encoded within a symbolic repre-sentation\n",
"of a molecule into a useful number or the result of some standardized\n",
"experiment” (Todeschini, R.; Consonni, V. *Molecular descriptors for\n",
"chemoinformatics* **2009** Wiley‑VCH, Weinheim). You are already familiar with\n",
"descriptors such as molecular weight or number of heavy atoms and we have\n",
"queried PubChem for data such as XLogP. We’ll examine just a few simple\n",
"descriptors, but thousands have been developed for applications in QSPR.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using rdkit and mordred to calculate descriptors\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly we have been using algorithms for calculating these indices. This is\n",
"time consuming for an individual, but programs can be used to complete this much\n",
"easier. We will use the rdkit and mordred python libraries to help us out.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rdkit import Chem # imports the Chem module from rdkit\n",
"from mordred import Calculator, descriptors # imports mordred descriptor library\n",
"calc = Calculator(descriptors, ignore_3D=True) # sets up a function reading descriptors\n",
"len(calc.descriptors) # tells us how many different types of descriptors are available in the library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wiener Index\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We already calculated the Wiener index for *n*-pentane and 2-methylpentane. Now\n",
"let’s have mordred do it for us.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mordred import WienerIndex\n",
"pentane = Chem.MolFromSmiles('CCCCC') # Use rdkit to create a mol file from the smiles string for n-pentane\n",
"methyl_pentane = Chem.MolFromSmiles('CCCC(C)C') # and for 2-methylpentane\n",
"wiener_index = WienerIndex.WienerIndex() # create descriptor instance for Wiener index\n",
"result1 = wiener_index(pentane) # calculate wiener index for n-pentane\n",
"result2 = wiener_index(methyl_pentane) # and for 2-methylpentane\n",
"print(\"The Wiener index for n-pentane is: \", result1) # display result\n",
"print(\"The Wiener index for 2-methylpentane is: \", result2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Zagreb Indices\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can do the same for the different Zagreb indices for *n*-pentane and\n",
"2-methylpentane.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mordred import ZagrebIndex\n",
"\n",
"zagreb_index1 = ZagrebIndex.ZagrebIndex(version = 1) # create descriptor instance for Zagreb index 1\n",
"zagreb_index2 = ZagrebIndex.ZagrebIndex(version = 2) # create descriptor instance for Zagreb index 1\n",
"\n",
"result_Z1 = zagreb_index1(pentane) # calculate Z1 descriptor value for n-pentane\n",
"result_Z2 = zagreb_index2(pentane) # calculate Z2 descriptor value for n-pentane\n",
"print(\"The Zagreb index 1 for n-pentane is:\", result_Z1)\n",
"print(\"The Zagreb index 2 for n-pentane is:\", result_Z2)\n",
"\n",
"result_Z1 = zagreb_index1(methyl_pentane) # and for 2-methylpentane as well\n",
"result_Z2 = zagreb_index2(methyl_pentane) \n",
"print(\"The Zagreb index 1 for 2-methylpentane is:\", result_Z1)\n",
"print(\"The Zagreb index 2 for 2-methylpentane is:\", result_Z2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see from the code above, each index will have different code that\n",
"needs to be followed for programming. Each descriptor and the resulting code\n",
"syntax can be found here\n",
"[http://mordred-descriptor.github.io/documentation/master/api/modules.html](http://mordred-descriptor.github.io/documentation/master/api/modules.html)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Looping through a list of molecules\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have an understanding on how rdkit and mordred work to get our\n",
"descriptors, let’s simplify the code using a looping structure:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"smiles = [\"CCC\", \"CCCC\", \"CCCCC\", \"CCCC(C)C\",\"CC(C)C(C)C\"] #store smiles strings in a list\n",
"\n",
"for smile in smiles:\n",
" mol = Chem.MolFromSmiles(smile) # convert smiles string to mol file\n",
" result_Z1 = zagreb_index1(mol) # calculate Z1 descriptor value\n",
" result_Z2 = zagreb_index2(mol) # calculate Z2 descriptor value\n",
" print(\"The Zagreb index 1 for\", smile, \"is:\", result_Z1)\n",
" print(\"The Zagreb index 2 for\", smile, \"is:\", result_Z2)\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using descriptors to predict molecular properties\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this exercise we will take a series of alkanes and create an equation that\n",
"will allow us to predict boiling points. We will start with a 30 molecule alkane\n",
"training set. We will obtain various descriptors and see how they can predict\n",
"the physical property boiling point.\n",
"\n",
"For this exercise we will be using the [pandas](https://pandas.pydata.org/) (Python Data Analysis) library to\n",
"help us read, write and manage data. We will also use matplotlib to generate\n",
"graphs.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boiling Point data\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s start by reading and graphing a set of boiling point data. First we read\n",
"our csv file into a pandas “dataframe”. Notice that we can generate a nicely\n",
"formatted table from our dataframe by just entering the name of the dataframe on\n",
"the last line.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd # import the Python Data Analysis Library with the shortened name pd\n",
"df = pd.read_csv(\"BP.csv\") # read in the file into a pandas dataframe\n",
"df # print the dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graphing the data\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can graph the data using matplotlib.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"plt.scatter(df.MW, df.BP_K) # plot of boiling point (in K) vs molecular weight\n",
"plt.xlabel('Molecular Weight')\n",
"plt.ylabel('Boiling Point in Kelvin')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly from the data we can see that we have multiple molecules with the same\n",
"molecular weight, but different boiling points. Molecular weight is therefore\n",
"not the best predictor of boiling point. We can see if there are other\n",
"descriptors that we can use such as Weiner or Zagreb. Let’s add various\n",
"descriptors to the dataframe.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding descriptors to the dataset\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now calculate the Wiener and Zagreb indices for each of our\n",
"hydrocarbons and add them to the dataframe.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create new lists to store results we calculate\n",
"result_Wiener= []\n",
"result_Z1= []\n",
"result_Z2= []\n",
"\n",
"for index, row in df.iterrows(): # iterate through each row of the CSV data\n",
" SMILE = row['SMILES'] # get SMILES string from row\n",
" mol = Chem.MolFromSmiles(SMILE) # convert smiles string to mol file\n",
" result_Wiener.append(wiener_index(mol)) # calculate Wiener index descripter value\n",
" result_Z1.append(zagreb_index1(mol)) # calculate zagreb (Z1) descriptor value\n",
" result_Z2.append(zagreb_index2(mol)) # calculate zagreb (Z2) descriptor value\n",
"\n",
"df['Wiener'] = result_Wiener # add the results for WienerIndex to dataframe\n",
"df['Z1'] = result_Z1 # add the results for Zagreb 1 to dataframe\n",
"df['Z2'] = result_Z2 # add the results for Zagreb 2 to dataframe\n",
"df # print the updated dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can see how each of these descriptors are related to the boiling points\n",
"of their respective compounds.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(df.Wiener, df.BP_K) # plot of BP versus Wiener index\n",
"plt.xlabel('Wiener Index')\n",
"plt.ylabel('Boiling Point in Kelvin')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(df.Z1, df.BP_K) # plot of BP versus Zagreb M1\n",
"plt.xlabel('Zagreb M1')\n",
"plt.ylabel('Boiling Point in Kelvin')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(df.Z2, df.BP_K) # plot of BP versus Zagreb M2\n",
"plt.xlabel('Zagreb M2')\n",
"plt.ylabel('Boiling Point in Kelvin')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly molecular weight was somewhat predictive, but problematic. It looks like\n",
"using the other indicators we have have some other ways to predict boiling\n",
"point.\n",
"\n",
"One option is write this data to a new CSV file and work in Microsoft Excel to\n",
"perform a regression analysis. Exporting the data is straightforward and your\n",
"instructor may provide instructions on how to analyze the data using Excel.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv('bp_descriptor_data.csv', encoding='utf-8', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mulitple regression analysis using statsmodels\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [statsmodels](https://www.statsmodels.org/stable/index.html) package provides numerous tools for performaing statistical\n",
"analysis using Python. In this case, we want to perform a *multiple linear\n",
"regression* using all of our descriptors (molecular weight, Wiener index, Zagreb\n",
"indices) to help predict our boiling point.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm # import the statsmodels library as sm\n",
"X = df[[\"MW\", \"Wiener\", \"Z1\", \"Z2\"]] # select our independent variables\n",
"X = sm.add_constant(X) # add an intercept to our model\n",
"y = df[[\"BP_K\"]] # select BP as our dependent variable\n",
"model = sm.OLS(y,X).fit() # set up our model\n",
"predictions = model.predict(X) # make the predictions\n",
"print(model.summary()) # print out statistical summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note above that we now have coeffiecients for an equation that can be used for prediction of boiling points for molecules not included in our dataset. The equation would be:\n",
"\n",
" Predicted BP = 4.4325 * MW - 0.6411 * Weiner - 4.3920 * Z1 + 0.2982 * Z2 + 55.5695\n",
"\n",
"We can use this equation to predict the boiling point of a new molecule. However, before we do, we need to explore the validity of the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model summary and analysis using partial regression plots\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at the results summary shows that the model has an excellent\n",
"R-squared value. Upon more careful examination, you may notice that one of our\n",
"descriptors has a very large P value. This would indicate that perhaps the Z2\n",
"descriptor is not working well in this case. We can generate a more graphical\n",
"interpretation that will make this more obvious.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = plt.figure(figsize=(10,8))\n",
"fig = sm.graphics.plot_partregress_grid(model, fig=fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part of the reason that Z2 may not be predictive in this model is that there is colinearity with the Z1 descriptor. Both descriptors have similar calculations (as outlined in the Libretexts page for this activity). Later on in this exercise we can explore dropping this descriptor."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How good is our model?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we look at a plot of actual versus predicted boiling points\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_bp = model.fittedvalues.copy() # use our model to create a set of predicted bp's\n",
"fig, ax = plt.subplots(figsize=(8, 5))\n",
"lmod = sm.OLS(pred_bp, df.BP_K) # linear regression of observed vs predicted bp's\n",
"res = lmod.fit() # run fitting\n",
"plt.scatter(pred_bp, df.BP_K) # plot of of observed vs predicted bp's\n",
"plt.ylabel('observed boiling point (K)')\n",
"plt.xlabel('predicted boiling point (K)')\n",
"plt.show()\n",
"print(res.summary()) # print linear regression stats summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model appears to have very good predictability (R-squared = 1.000) within the original 30 molecule data set. One way to test this model is to use a new molecule with its descriptors to see how well it is predicted. One molecule in the dataset is 2-methylheptane. It has the following data:\n",
"MW = 114.232\n",
"Wiener Index = 79\n",
"Z1 = 28\n",
"Z2 = 26\n",
"Boiling Point = 390.6 K\n",
"\n",
"Using the equation from above we can determine that the boiling point from the equation \n",
" Predicted BP = 4.4325 * MW - 0.6411 * Weiner - 4.3920 * Z1 + 0.2982 * Z2 + 55.5695\n",
"is 396.0 K. The model gives a 1.4% error for prediction of the boiling point outide the training set. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We had mentioned earlier that Z2 may not be very predictive in this model. We can remove the variable an rerun the analysis to see if we can improve the predictability of the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm # import the statsmodels library as sm\n",
"X = df[[\"MW\", \"Wiener\", \"Z1\"]] # select our independent variables, this time without Z2\n",
"X = sm.add_constant(X) # add an intercept to our model\n",
"y = df[[\"BP_K\"]] # select BP as our dependent variable\n",
"model = sm.OLS(y,X).fit() # set up our model\n",
"predictions = model.predict(X) # make the predictions\n",
"print(model.summary()) # print out statistical summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model appears to have very good predictability (R-squared = 0.994) within the original 30 molecule data set. Let's reexamine 2-methylheptane:\n",
"MW = 114.232\n",
"Wiener Index = 79\n",
"Z1 = 28\n",
"Boiling Point = 390.6 K\n",
"\n",
"Using the equation from above we can determine that the boiling point from the equation \n",
" Predicted BP = 4.4114 * MW - 0.6397 * Weiner - 4.0260 * Z1 + 55.4979\n",
"is 396.2 K. The model also gives a 1.4% error for prediction of the boiling point outide the training set. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see here that Z2 doesn't really change the result of the calculation, and can probably be best left out to simplify the model.\n",
"\n",
"Keep in mind that we have completed this analysis with only a training set of 30 molecules. If the training set had more molecules, you should be able to develop a better model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assignment\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You originally ran this analysis on a 30 molecule data set (BP.CSV). You also have available to you a 102 molecule data set (102BP.CSV). \n",
"\n",
"* Complete the above analysis using the expanded data set to determine if a better predictive model can be obtained with a larger training set. Note that 2-methylheptane is in this new dataset so you will need to choose a new test molecule.\n",
"\n",
"When you have completed the analysis, you will create a new analysis:\n",
"* Choose four new topological and other calculated descriptors found in Mordred http://mordred-descriptor.github.io/documentation/master/api/modules.html \n",
"\n",
"*\tComplete simple linear analysis for each of your new descriptors. \n",
"*\tComplete a multiple linear regression to create an equation that best represents the data boiling point data and your descriptors. \n",
"*\tCreate a separate sheet that has your regression data. \n",
"*\tMake a plot of Actual vs Predicted BP for your regression.\n",
"*\tChoose a new molecule not in the dataset (not 2-methylheptane, be creative and use chemical intuition).\n",
"*\tUse your multiple linear equation to predict this molecule’s BP and look of the literature value.\n",
"*\tWrite a short one-two page paper that includes:\n",
" *\tWhat your new chosen descriptors mean\n",
" *\tWhich new chosen descriptors correlate\n",
" *\tWhat is the overall equation calculated\n",
" *\tHow to choose the molecule to test\n",
" *\tHow close this multiple linear regression predicts your boiling point of your molecule\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "OLCC2019",
"language": "python",
"name": "olcc2019"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"org": null
},
"nbformat": 4,
"nbformat_minor": 4
}