In 2000, the US National Institute of General Medical Sciences of the National Institutes of Health funded the Protein Structure Initiative (PSI), a ten-year project to uncover the three-dimensional shapes of a wide range of proteins. The Joint Center for Structural Genomics (JCSG), based at The Scripps Research Institute in La Jolla, California, USA, is one of four large-scale centers involved in the production phase of the PSI. Four centers focus on high-throughput protein-structure determination, six specialized centers deal with difficult-to-solve proteins, such as membrane proteins, and two others provide new approaches to molecular modeling.

Ian Wilson, Director of the JCSG, believes that the time is perfect for PSI centers to produce a large number of new protein structures for the research community: "With more and more DNA sequences available every day, the possibilities for future protein structure determination are tremendous."

A central goal of the PSI is to allow the prediction of three-dimensional structures for most proteins from knowledge of their corresponding DNA sequences. In principle, this can be done by deducing the structure of a protein from the known structures of representative members of its protein family. "Most of the large protein families have been mapped, but for 70% of the known families we do not have structural data," says Adam Godzik of the Burnham Institute for Medical Research in La Jolla, director of bioinformatics at the JCSG. Obtaining representatives from all the families therefore means working through a large number of target proteins, which raises difficult questions: "How do you choose which families to target and then from which proteins within those families to get structures?"

Targeting the universe of proteins

"We are dealing with an ever-expanding universe of proteins, so we had to have some rules about targeting," says Wilson. For the PSI, seventy percent of the target protein families are community-screened through the PSI Target Selection Committee. “We all sat down and ran a draft to decide which families would receive which center,” says Wilson. "By virtue of choosing particular families, we avoid overlap, but also with this selection process, each center can optimize specific goals within families for itself," Godzik says. Another 15% of the target proteins are decided by each center, and the final 15% are community targets proposed by external researchers.

Godzik says that it is more efficient for individual centers to decide which proteins to pursue within their assigned families because each center relies on different "reactive genomes", large sets of genomic DNAs used to isolate homologous sequences. At the JCSG, Godzik and his bioinformatics team are responsible for determining the specific proteins that the center will work on. By aligning a protein family against the 100 genomes available at the JCSG, they first identify all homologous proteins. Then, using their own software, they assign a crystallization score to each homologous gene identified within the family, a measure of the probability that the corresponding protein will succeed in the structure determination process. "We take the ones that this tool predicts will be most likely to succeed, and then we work our way down the list," he says.
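
The ranking step Godzik describes can be illustrated with a minimal Python sketch. It is not the JCSG software, whose scoring method is not described here: the `Homolog` record, the example identifiers, the score values, and the rule of stopping at the first success are all assumptions made for illustration. The only idea taken from the text is that candidates are ordered by crystallization score and tried from the top of the list down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Homolog:
    """One homologous protein found in a genome (all fields are hypothetical)."""
    protein_id: str
    genome: str
    crystallization_score: float  # assumed: higher means more likely to succeed


def rank_targets(homologs):
    """Return the homologs sorted from most to least promising."""
    return sorted(homologs, key=lambda h: h.crystallization_score, reverse=True)


def work_down_list(ranked, attempt_structure):
    """Try candidates in rank order; return the first that succeeds, else None.

    `attempt_structure` is a caller-supplied function that returns True when
    structure determination succeeds for a candidate (an assumed stand-in for
    the laboratory pipeline).
    """
    for candidate in ranked:
        if attempt_structure(candidate):
            return candidate
    return None


# Made-up members of a single protein family, with made-up scores.
family = [
    Homolog("protA", "genome_1", 0.42),
    Homolog("protB", "genome_2", 0.87),
    Homolog("protC", "genome_3", 0.65),
]

ranked = rank_targets(family)
print([h.protein_id for h in ranked])  # ['protB', 'protC', 'protA']
```

In this sketch a family with many homologs simply yields a longer ranked list, so a failed attempt on the top candidate moves work to the next one.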