Studer-Imwinkelried, Matthias. NeuroCarb : artificial neural networks for NMR structure elucidation of oligosaccharides. 2006, Doctoral Thesis, University of Basel, Faculty of Science.
Official URL: http://edoc.unibas.ch/diss/DissB_7681
Abstract
Recombinant proteins and monoclonal antibodies offer great promise as therapeutics for hundreds
of diseases. Today, there are almost 400 biotechnology drugs in development for over 200 different
conditions. Many of these drugs are glycoproteins for which the correct glycosylation patterns are
important for their structure and function. Achieving and maintaining proper glycosylation is a major
challenge in biotechnology manufacturing. Most recombinant therapeutic glycoproteins are
produced in living cells. This method is used in an attempt to correctly match the glycosylation
patterns found in the natural human form of the protein and achieve optimal in vivo functionality.
However, utilizing cell systems to produce glycoproteins requires balancing the cell's ability to
produce the protein with its ability to attach the appropriate carbohydrates. One limitation of this
approach is that the expression systems do not maintain complete glycosylation under high-volume
production conditions. This results in low yields of usable product and contributes to the cost and
complexity of producing these drugs. Incorrect glycosylation also affects the half-life of the drug.
Low production yields are a significant contributor to the critical worldwide shortage of
biotechnology manufacturing capacity.
To achieve higher production yields and to meet the quality standards set by health
authorities, fast, accurate and preferably inexpensive analytical methods are required. Nowadays,
the routine analysis of therapeutic glycoproteins is accomplished by analytical HPLC, MS or lectin
blotting, in conjunction with chemical derivatization, exoglycosidase treatment, and/or other
selective chemical cleavage reactions. The fact that different carbohydrates have very similar
molecular weights and physicochemical properties makes the analysis of glycosylation slow and
complex. Conventional glycoanalysis requires multiple steps to obtain the structure, sequence and
prevalence of all glycans in a glycoprotein sample. Complete analysis typically takes several days
and requires highly trained personnel. Therefore, the need for more efficient and rapid glycoanalysis
methodology is fundamental to the success of biotechnologically produced drugs.
With this demand in mind, a 13C-NMR spectra analysis system for
oligosaccharides based on multiple Back-propagation neural networks was developed during this
thesis. Before the idea could be realized, some fundamental questions had to be posed:
1. Are the monosaccharide moieties, the anomeric configuration and the substitution pattern
of an oligosaccharide reflected in an NMR (13C or 1H) spectrum?
2. What kind of NMR data provides this information better (1H or 13C-NMR)?
3. How can spectroscopic data be processed, compressed and transferred into a neural
network?
4. Which neural network architecture, learning algorithm and learning parameters lead to
optimal results?
Preliminary experiments showed that the six chemical shifts of a monosaccharide moiety (from
glucose, galactose and mannose) suffice to identify the monosaccharide itself, the anomeric
configuration (if the anomeric carbon atom is substituted) and the substitution position(s). The
experiments also revealed that these compounds could be almost completely separated with the help
of Counter-propagation neural networks.
The main goal of the neural network approach was to recognize every single monosaccharide
moiety in an oligosaccharide by training specialized, separate networks for each monosaccharide
moiety group. The neural networks are therefore trained with the 13C-NMR spectra of these
monosaccharide moieties. During the test phase, the whole spectrum of an oligosaccharide is
presented to the networks, and each specialized network should recognize only the
monosaccharide moieties it was trained for.
Initial attempts to train a Back-propagation neural network to identify six methyl pyranoside
compounds failed because the data set was too small and an uncompressed NMR spectrum
requires too many input neurons. Therefore, the data basis was
changed and enlarged with 535 monosaccharide moieties (mostly galactose, glucose and
mannose) from the literature, and a special data compression (JCAMP-DX for NMR files) and parsing
software tool called ANN Pattern File Generator was developed. The entire dataset was normalized
and stored in a FileMaker 13C-NMR database. Further experiments with this new dataset, different
Back-propagation network layouts and training parameters still did not achieve the targeted
recognition rate for unknown test compounds. The training performance of the neural networks
appeared to be insensitive to major changes in the training parameters. Tests with a new and
enlarged dataset (1000 oligosaccharides and approx. 2500 monosaccharide moieties) using
Kohonen networks showed that separate Kohonen networks for each monosaccharide type
yield higher recognition rates than networks that have to deal with all three monosaccharide
types at once.
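The abstract does not describe how the ANN Pattern File Generator compresses a spectrum, so the
following is only a minimal sketch of one common way to turn a 13C peak list into a fixed number of
input neurons: binning the chemical shifts. The function name `bin_peaks`, the 50 to 110 ppm window,
the bin count and the example shifts are all illustrative assumptions, not values taken from the thesis.

```python
def bin_peaks(shifts_ppm, low=50.0, high=110.0, n_bins=60):
    """Map a 13C peak list onto a fixed-length 0/1 input vector.

    Assumption: a simple equal-width binning; the thesis tool may work differently.
    """
    width = (high - low) / n_bins
    vector = [0.0] * n_bins
    for shift in shifts_ppm:
        # Peaks outside the assumed window are ignored.
        if low <= shift < high:
            vector[int((shift - low) / width)] = 1.0
    return vector


# Hypothetical placeholder shifts for one hexopyranose moiety (not measured data).
example_shifts = [100.0, 72.5, 74.0, 70.5, 73.2, 61.5]
inputs = bin_peaks(example_shifts)
```

With these assumed settings every moiety is presented to a network as 60 inputs, regardless of how
many data points the raw spectrum contains.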
The insight that separate networks perform better was transferred to Back-propagation networks, which now showed
recognition rates higher than 90% for unknown compounds. This separated approach worked
very well for disaccharides with two different monosaccharide moieties. Disaccharides with similar
or identical moieties, however, could not be identified because the designated neural network recognizes only
one monosaccharide at a time. To overcome this disadvantage, the so-called 'ensemble' or 'group of
experts' approach was developed. It exploits the fact that no two trained neural networks show
exactly the same recognition characteristics: different neural networks respond differently to the
same test inputs. Twenty trained neural networks at a time were grouped into ensembles, with all
networks in an ensemble trained to recognize the same monosaccharide moiety. After presenting a test input
(e.g. a disaccharide) to this group of experts, one obtains, in the most extreme case, twenty different
recognition results, which can then be statistically analyzed. In the case of a
disaccharide with two monosaccharide moieties of the same carbohydrate (e.g. α-D-Glcp-1-4-β-D-Glcp-OMe),
the analysis will deliver both monosaccharide moieties because some networks
recognize one part of the disaccharide and other networks the other.
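The abstract does not specify which statistical analysis is applied to the ensemble outputs. As a
hedged illustration only, the sketch below assumes a simple vote count with a minimum-share
threshold; the function name `aggregate_votes`, the 20% threshold and the vote labels are
hypothetical and chosen purely to show how two moieties of the same sugar could both surface.

```python
from collections import Counter


def aggregate_votes(predictions, min_share=0.2):
    """Return every label predicted by at least min_share of the ensemble members.

    Assumption: plain vote counting; the thesis may use a different analysis.
    """
    counts = Counter(predictions)
    total = len(predictions)
    kept = [label for label, n in counts.items() if n / total >= min_share]
    # Most frequently predicted moiety first; ties broken alphabetically.
    return sorted(kept, key=lambda label: (-counts[label], label))


# Hypothetical outputs of a 20-member glucose ensemble for one disaccharide.
votes = (
    ["alpha-D-Glcp (terminal)"] * 11
    + ["beta-D-Glcp-OMe (4-substituted)"] * 8
    + ["beta-D-Glcp (unsubstituted)"] * 1
)
moieties = aggregate_votes(votes)
```

In this made-up example the two moieties backed by 11 and 8 networks are kept, while the single
outlier vote falls below the assumed threshold and is discarded.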
The ensemble approach brought the final breakthrough of this thesis. Disaccharide recognition
rates in the range of 85 – 96% (depending on the monosaccharide moiety – glucose, galactose or
mannose) demonstrate the feasibility of the approach. The hit rates of the different ensembles can
certainly be improved by a more careful selection of the members of each ensemble. An ongoing
diploma thesis already shows improved recognition along these lines.
Advisors: | Ernst, Beat |
---|---|
Committee Members: | Gasteiger, Johann |
Faculties and Departments: | 05 Faculty of Science > Departement Pharmazeutische Wissenschaften > Ehemalige Einheiten Pharmazie > Molekulare Pharmazie (Ernst) |
UniBasel Contributors: | Ernst, Beat |
Item Type: | Thesis |
Thesis Subtype: | Doctoral Thesis |
Thesis no: | 7681 |
Thesis status: | Complete |
Number of Pages: | 217 |
Language: | English |
Last Modified: | 02 Aug 2021 15:05 |
Deposited On: | 13 Feb 2009 15:48 |