Computing Cluster File Server

Proposal ID: 2009-018-1
Department: Linguistics
Non-core Access: Campus unit gets priority
First Application? Yes
Student Initiated? No

Abstract

The Department of Linguistics proposes to install a new 3-terabyte (3,000-gigabyte) file server for use by students doing work on the department's high-performance computing cluster. This cluster has become an important resource for the Computational Linguistics, Phonetics, and Sociolinguistics programs, and increased use has begun to push the limits of the disk space currently available. A new, STF-funded file server would greatly enhance the utility of the cluster, part of which was funded by a past STF grant.

Background

Linguistics encompasses many fields, including Computational Linguistics, Sociolinguistics, and Phonetics.

Computational Linguistics applies computer technology to the analysis and processing of natural human language. It is a multidisciplinary area with branches in Linguistics, Computer Science, Information Science, and Electrical Engineering. The field has already yielded many applications that have become part of daily life: anyone who runs an email spam filter, uses Google to translate a web page from a foreign language into English, or makes a credit card payment over the phone without ever talking to a human operator is enjoying the fruits of research in this rapidly growing field.

Sociolinguistics is the study of the interrelationships between language and social structure and is centrally concerned with how language varies and changes according to how people in society use it. Sociolinguists seek to identify patterns of language use by correlating linguistic variables with social variables like age, gender, and social class. In its broadest definition, sociolinguistics includes elements of the sociology of language, urban dialectology, social psychology, ethnography of communication, and linguistic anthropology.

Phonetics is the study of sounds used in natural speech. Phonetics focuses on three components of speech: the physiological processes involved in producing speech sounds, the neurophysiological perception of speech sounds, and the acoustic components of a speech signal. Phoneticians may be interested in how children learn the sounds of a language, what can be done in cases of speech and hearing defects, how speech sounds vary within individuals and among speech communities, and how computers can be used to manage and create speech signals. As a field, phonetics is related to the speech and hearing sciences, cognitive science, psychology, engineering, and language teaching; within linguistics, phonetics is related to phonology, language acquisition, sociolinguistics, natural language processing, and speech synthesis.

Doing meaningful work in any of these fields often means working with vast amounts of raw data, consisting of many gigabytes of text or speech. The storage and processing requirements for working with this data are large. To meet this need, the department has a 50-CPU parallel computing cluster running the Condor High Throughput Computing System. This cluster was purchased largely with STF funds (proposal 2006-058-2). It is heavily used by students in the Computational Linguistics program, and increasingly by the Phonetics and Sociolinguistics programs as well. As a result, the 1 terabyte file server that holds most of the data for the cluster, originally purchased in 2005, is near the limits of its storage capacity; only a few hundred gigabytes of free space remain, less than some people have in their desktop computers. Additional space is needed to support new student research projects and the expanded use of the cluster by students in the Phonetics and Sociolinguistics programs. Some projects have already been disrupted by the need to move data around to manage the small amount of storage space remaining.

Benefits

The additional 3 terabytes of space would support larger student projects and allow more raw data to be stored on the server where students can easily access it. It would accommodate future growth in the Computational Linguistics program and provide more storage space to Sociolinguistics and Phonetics Lab projects, which also rely on the Computational Linguistics file server for storage.

The large scope of computationally intensive projects requires students to work together in groups. Additional space for the cluster is ideal for collaborative research, because it provides a central location for students to bring together work on the same cluster that they are using to gather and analyze data.

Student Access

The file server and its associated cluster are available on a registration basis to anyone with a UW NetID. The cluster is primarily accessed via SSH remote login, so it can be used from anywhere with an Internet connection. The three computer labs associated with the file server (the Linguistics Treehouse, the Sociolinguistics Lab, and the Phonetics Lab) are also open on a registration basis to any students on campus.

Available Resources

The department has sufficient resources and infrastructure to support the new hardware:

Rack space is available, and sufficient A/C and power capacity exists. Network equipment sufficient to support the new server is already in place.

An existing disk-based backup server will be used to back up data stored on the new file server; the proposal includes additional hard disks for this machine to ensure sufficient backup storage capacity.

The department has a full-time system administrator to manage the cluster.

Installation Timeline

* Vendor procurement - approximately 3 weeks

* Burn-in testing, OS installation, and benchmarking - 3 weeks

* Physical installation and data migration - 4 days

* Documentation updates - 2 days

Departmental Endorsement

* Julia Herschensohn, Professor and Chair, Department of Linguistics:
"The Computational Linguistics, Sociolinguistics and Phonetics Labs are research units serving undergraduates, graduate students, faculty and post-docs from several departments. As such, they are an invaluable training ground for students, both graduate and undergraduate. The resources of these Labs directly benefit students who use them for their personal research, both undergraduate and graduate, in projects such as MA theses, PhD dissertations, publications and conference papers. I'm sure there are well over a dozen conference presentations this year emanating from the Labs. Furthermore, use of the lab for weekly meetings with presentations of ongoing projects permits student researchers to model academic dialogue, to exchange ideas and to work on new solutions thanks to collaborative advice. The students are, however, aware of the limitations of our current configuration and are requesting STF funding to expand our available disk space by adding a new file server. The labs desperately need the additional computational power to meet the needs of the students' use of huge corpora (necessary in these disciplines). These expanded resources for students would not be directly related to classwork, faculty research, or other departmental uses.

"The Linguistics Department has only FWI allotments to purchase all the equipment needed for faculty, staff, students and general operation; this year our allotment is zero, due to the elimination of FWI; furthermore, the Chair's discretionary fund, which offers a bit of extra support to worthy projects, has been totally eliminated. Research labs normally only have the funds to support their attached professor's research and that of a few RAs. Although our Labs attempt to accommodate more of the student community, they currently lack the equipment to ensure ongoing access to students not covered by independent grants. These students desperately need this STF grant to provide them with their own equipment and ensure that they continue to get the opportunity to conduct their independent research projects in the Linguistics Labs.

"At this time, Student Tech Fee support would be a boon for students in their personal research both for their degrees and for work they might do beyond the degree."


* Emily M. Bender, Director, Computational Linguistics Laboratory and Faculty Director, Professional MA in Computational Linguistics:
"Research in various subfields of linguistics is being revolutionized by computational methods that allow linguists to work with datasets orders of magnitude larger than what was possible before. The Computational Linguistics Laboratory at UW provides a server cluster that supports such research, not only in computational linguistics, but also in phonetics, sociolinguistics, and increasingly in other subfields as well. Though we began with a small cluster designed to support coursework and some faculty research, an STF grant in 2006 allowed us to expand it to support independent student research. The server cluster has been enormously popular with students (we have over 100 current users) and instrumental in supporting cutting-edge student research. With this popularity, however, has come increased demand for data storage. The proposed expansion in file server space will be critical to our students fully realizing the potential of the computational methods they are applying in their research."

Student Endorsement

* Steven Moran, PhD Student, Department of Linguistics:
"I fully support this Computing Cluster File Server STF grant for several reasons. Foremost, I am a PhD student undertaking an ambitious effort to create a database of sound systems for all of the world's languages. This requires a tremendous amount of file storage that is not available on campus. For example, we have over 1500 grammars that have been scanned from rare resources on poorly documented languages. These grammars alone are over 100 GB. Ideally, they need to be in a central location for our collaborative effort. Also, we are scraping the Web for additional resources on languages to extract sound system information. The Web provides an endless supply of data (since linguists are continually publishing new data from languages). Secondly, increased storage resources provide the framework for collaborative work between several departments on campus, and I believe interdisciplinary research is an emerging standard in science in general. Computational linguistics, for example, is a mash-up of computer science, linguistics and electrical engineering (signal processing). And finally, increased storage allows work on the cluster to be safely backed up, which is currently a problem. Every researcher needs to back up important data and results, but the storage required is growing exponentially."


* Jeffry Scott, Graduate Student, Computational Linguistics:
"Acquisition of a new file server for the Treehouse (Linguistics) computing cluster is critical to research work with Wikipedia. Wikipedia is a rich resource for linguistic study on a number of levels: its diversity of topics and authors provides a choice data set for linguistic analysis; its manual curation provides a dependable dataset for use as a ground truth; and its implementation in multiple languages provides further opportunities for analysis, as well as research into cross-lingual translation. However, Wikipedia must be downloaded locally for use in linguistic research, and space requirements are estimated to run up to 500 GB. Once space is made available, the Wikipedia corpus will be made available to all students in the Linguistics department."


* Jeff Ridenour, Graduate Student, Linguistics:
"For my Master's thesis, I will be doing an extensive corpus search of the syntactic and morpho-syntactic properties of the English subjunctive. In order to do this, I will need a large amount of disk space to store my data. For this reason, I wholeheartedly support the STF proposal to increase disk space."


* Note: Students Steven Moran, Meghan Oxley, Michael Scanlon, and Jeffry Scott helped in putting together this proposal.

Items

Below are the items making up the current proposal. An asterisk (*) beside an item signifies that it was approved by the committee. This, however, was not implemented correctly in our database before 2005, so earlier years may not show it.

Title | Type | Price | Qty | Subtotal
*File server | server | $3,906.56 | 1 | $3,906.56

Location: Padelford Hall - A-212

Description: Silicon Mechanics Rackform iServ R267

(2) Intel Xeon Dual Core processors, 1.86 GHz

4 GB RAM

(2) 250 GB SATA hard disks (mirrored pair for operating system)

(5) 1 TB SATA hard disks (3 data + 1 parity + 1 hot spare)

Justification: This is the core of the proposal, a file server that will provide 3 TB of usable file storage space. Additional disks are included to hold the server's operating system and to provide redundancy for the stored data in the event of a disk failure.
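The 3 TB usable figure follows from the disk layout in the description. The proposal does not name a RAID level, so the sketch below assumes a single-parity array in which one disk's worth of capacity holds parity and the hot spare sits idle until a failure. The function name and its parameters are illustrative only and do not come from the proposal.

```python
def usable_capacity_tb(total_disks, disk_size_tb, parity_disks=1, hot_spares=1):
    """Return usable capacity in TB for a parity-protected array.

    Assumption: parity consumes the equivalent of `parity_disks` whole
    disks, and hot spares contribute no capacity until they replace a
    failed disk.
    """
    data_disks = total_disks - parity_disks - hot_spares
    if data_disks < 1:
        raise ValueError("layout leaves no disks for data")
    return data_disks * disk_size_tb


# Layout from the item description: (5) 1 TB disks,
# 3 data + 1 parity + 1 hot spare.
file_server_tb = usable_capacity_tb(total_disks=5, disk_size_tb=1)
print(f"Usable capacity: {file_server_tb} TB")
```

Under these assumptions the five 1 TB disks yield 3 TB of usable space, which matches the figure given in the justification.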

*Disks for backup server | Hardware | $119.67 | 5 | $598.35

Location: Padelford Hall - B-5-L

Description: 1 TB SATA hard disks (3 data + 1 parity + 1 spare)

Justification: These disks will be added to an existing backup server to allow the additional data stored on the new file server to be backed up. This will protect the data in the event of disk corruption, extensive hardware failure, or administrative mishaps. Additionally, the spare disk is compatible with the file server and can serve double duty as a "cold spare" for that machine.

Requested Total: $4,504.91
Approved Total: $4,504.91
Funding Status: Fully Funded
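The item subtotals and the requested total can be cross-checked with a few lines of arithmetic. This is only a sketch: the prices and quantities are taken from the item list above, while the variable and function names are illustrative. Decimal is used instead of floating point so that the cents add up exactly.

```python
from decimal import Decimal

# (title, unit price, quantity) as listed in the proposal's item table.
items = [
    ("File server", Decimal("3906.56"), 1),
    ("Disks for backup server", Decimal("119.67"), 5),
]


def subtotal(price, qty):
    """Line subtotal: unit price times quantity."""
    return price * qty


def requested_total(rows):
    """Sum of all line subtotals."""
    return sum((subtotal(price, qty) for _, price, qty in rows), Decimal("0"))


for title, price, qty in items:
    print(f"{title}: ${subtotal(price, qty):,.2f}")

total = requested_total(items)
print(f"Requested total: ${total:,.2f}")
```

The computed subtotals ($3,906.56 and $598.35) and their sum ($4,504.91) agree with the figures stated in the proposal.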
