New Publications of Linguistic Data Consortium

- LDC Incentives: Early Renewal Discounts for Membership Year (MY) 2010 -
- LDC at NWAV 38 -
New Publications:
- 2007 NIST Language Recognition Evaluation Supplemental Training Set -
- French Gigaword Second Edition -
- NXT Switchboard Annotations -
--------------------------------------------------------------------------------
LDC Incentives:  Early Renewal Discounts for Membership Year (MY) 2010

LDC appreciates the important contribution LDC members make through their continued support of the consortium.  We would like to invite all current and previous members of LDC to renew for Membership Year (MY) 2010.  For MY2010, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase.  Additionally, in last month's newsletter, we announced an LDC Incentives Package which will include a host of incentives to help lower the cost of LDC membership and data licensing fees.  As part of this package, LDC will extend discounts to members who keep their membership current and who join early in the year.

The details of our Early Renewal Discounts for MY2010 are as follows:

Organizations that joined for MY2009 will receive a 5% discount when renewing. This discount will apply throughout 2010, regardless of time of renewal. MY2009 members renewing before March 1, 2010 will receive an additional 5% discount, for a total 10% discount off the membership fee.
New members, as well as organizations that did not join for MY2009 but held membership in any previous membership year (1993-2008), will also be eligible for a 5% discount provided that they join or renew before March 1, 2010.
The Membership Fee Table provides exact pricing information.

                                 MY2010 Fee    MY2010 Fee with    MY2010 Fee with
                                               5% Discount *      10% Discount **

Not-for-Profit   Standard        US$2400       US$2280            US$2160
                 Subscription    US$3850       US$3657.50         US$3465

For-Profit       Standard        US$24000      US$22800           US$21600
                 Subscription    US$27500      US$26125           US$24750

*  For MY2009 Members renewing for MY2010 and any previous year Member who renews before March 1, 2010
** For MY2009 Members renewing before March 1, 2010
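
The discounted figures in the fee table follow directly from the base fees. As a minimal sketch (the names BASE_FEES and discounted_fee are illustrative and not part of any LDC tool), the arithmetic can be checked in Python:

```python
from decimal import Decimal

# Base MY2010 fees in US dollars, copied from the Membership Fee Table.
BASE_FEES = {
    ("Not-for-Profit", "Standard"): Decimal("2400"),
    ("Not-for-Profit", "Subscription"): Decimal("3850"),
    ("For-Profit", "Standard"): Decimal("24000"),
    ("For-Profit", "Subscription"): Decimal("27500"),
}


def discounted_fee(base, percent):
    """Return the fee after taking `percent` off `base`, rounded to cents."""
    rate = Decimal(percent) / Decimal(100)
    return (base * (1 - rate)).quantize(Decimal("0.01"))


# Print each membership type with its 5% and 10% discounted fee.
for (org_type, level), base in BASE_FEES.items():
    print(org_type, level, discounted_fee(base, 5), discounted_fee(base, 10))
```

Decimal is used here instead of float so that amounts such as US$3657.50 are represented exactly.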

Publications for MY2010 are still being planned but it will be another productive year with a broad selection of publications.  The working titles of data sets we intend to provide include:

Arabic Treebank: Part 2 v 4.0
Chinese Treebank 7.0
Chinese Web N-gram Version 1.0
Fisher Spanish
LCTL Bengali
NPS Chat Corpus

In addition to receiving new publications, current year members of the LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.

This past year, nearly 100 organizations who renewed membership or joined early received a discount on membership fees for MY2009.  Taken together, these members saved over US$50,000!  Be sure to keep an eye on your mail: all LDC members have been sent an invitation-to-join letter and a renewal invoice for MY2010.  Renew early for MY2010 and save today!

LDC at NWAV 38

LDC exhibited at NWAV for the third straight year. We were delighted to interact with so many talented sociolinguistic researchers and to introduce numerous attendees to LDC and our data catalog. LDC distributed free copies of both the SLX Corpus of Classic Sociolinguistic Interviews, as per the terms of the TalkBank grant, and the 2008 LDC Spoken Language Sampler, which is also available for download. We also distributed many of our newly minted data sheets, including one featuring the speech annotation tool XTrans. This tool is also freely available from our website in Linux and Windows formats.

LDC’s Executive Director Chris Cieri and Senior Associate Director Stephanie Strassel presented papers on the following topics:

· Models of Phonological Variation for Multi-dialectal Communities: the case of L’Aquila

· Closer Still to a Robust, All Digital, Empirical, Reproducible Sociolinguistic Methodology

Thanks again to everyone who stopped by our display and we look forward to seeing you again next year!

New Publications

(1) 2007 NIST Language Recognition Evaluation Supplemental Training Set consists of 118 hours of conversational telephone speech segments in the following languages and dialects: Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu and Tamil.

The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. Thus, in 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment.

The supplemental training material in this release consists of the following:

Approximately 53 hours of conversational telephone speech segments in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan Chinese, Wu Chinese, Russian, Thai and Urdu. This material is taken from LDC's CALLHOME, CALLFRIEND and Mixer collections.
Approximately 65 hours of full telephone conversations in Mandarin Chinese (Taiwan), Spanish (Mexican) and Tamil. This material was collected by Oregon Health and Science University (OHSU), Beaverton, Oregon. The test segments used in the 2005 NIST Language Recognition Evaluation were derived from these full conversations.

In addition to the supplemental material contained in this release, the training data for the 2007 NIST Language Recognition Evaluation consisted of data from previous LRE evaluation test sets, namely, 2003 NIST Language Recognition Evaluation and 2005 NIST Language Recognition Evaluation.

2007 NIST Language Recognition Evaluation Supplemental Training Set is distributed on one DVD-ROM.

2009 Subscription Members will automatically receive two copies of this corpus.  2009 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.

(2) French Gigaword Second Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates French Gigaword First Edition (LDC2006T17) and adds material collected from August 1, 2006 through December 31, 2008.

The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:

Agence France-Presse (afp_fre) May 1994 - Dec 2008
Associated Press Worldstream, French (apw_fre) Nov 1994 - Dec 2008

The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code ("fre"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify each news story.
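
As an illustration of this naming convention, the short Python sketch below splits a code into its source and language parts and recovers the same pair from the upper-case prefix of a DOC id. The function names are hypothetical, and the portion of the sample id after the seven-character prefix is invented for the example; the newsletter does not specify the rest of the id format.

```python
def parse_source_code(code):
    """Split a seven-character code such as 'afp_fre' into (source, language)."""
    if len(code) != 7 or code[3] != "_":
        raise ValueError(f"not a seven-character source code: {code!r}")
    source, language = code[:3], code[4:]
    return source.lower(), language.lower()


def source_of_doc_id(doc_id):
    """Return (source, language) from the upper-case prefix of a DOC id."""
    return parse_source_code(doc_id[:7])


print(parse_source_code("afp_fre"))
# Everything after the first seven characters of this id is a made-up placeholder.
print(source_of_doc_id("APW_FRE_EXAMPLE-0001"))
```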

The overall totals for each source are summarized below. The "Totl-MB" numbers show the amount of data obtained when the files are uncompressed (i.e., approximately 5.8 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; and the "K-wrds" numbers are the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source     #Files   Gzip-MB   Totl-MB   K-wrds    #DOCs
AFP_FRE       172      2408      4079   560000  2060803
APW_FRE       171      2280      1719   241324   872573
TOTAL         343      4688      5798   801324  2933376
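
The "K-wrds" definition (whitespace-separated tokens after SGML tags are removed) can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the regular expression treats anything between angle brackets as a tag, the sample markup is hypothetical, and LDC's own counting script may differ in detail.

```python
import re

# Anything between "<" and ">" is treated as an SGML tag (a simplifying assumption).
TAG_PATTERN = re.compile(r"<[^>]+>")


def count_tokens(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so text on either side is not fused into one token.
    text_only = TAG_PATTERN.sub(" ", sgml_text)
    return len(text_only.split())


# Hypothetical sample document; the element names and id are placeholders.
sample = '<DOC id="AFP_FRE_EXAMPLE"><TEXT><P>Le texte est ici.</P></TEXT></DOC>'
print(count_tokens(sample))
```

Dividing such a count by 1,000 gives a figure on the same scale as the "K-wrds" column.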

The data has undergone a consistent level of quality control to eliminate out-of-band content and other obvious forms of corruption. Since the source data is generated manually on a daily basis, there will be a small percentage of human errors common to all sources: missing whitespace, incorrect or variant spellings, badly formed sentences, and so on, as are normally seen in newspapers. No attempt has been made to address this property of the data.

French Gigaword Second Edition is distributed on one DVD-ROM.

2009 Subscription Members will automatically receive two copies of this corpus.  2009 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4000.

(3) NXT Switchboard Annotations brings together in NITE XML, a single XML format, the multiple layers of annotation performed on a transcript subset from Switchboard-1 Release 2 (LDC97S62). NXT Switchboard Annotations was developed in a collaboration among researchers from Edinburgh University, Stanford University and the University of Washington.

The original Switchboard corpus is a collection of spontaneous telephone conversations between previously unacquainted speakers of American English on a variety of topics chosen from a pre-determined list. A subset of one million words from those conversations was annotated for syntactic structure and disfluencies as part of the Penn Treebank project. Phonetic transcripts were generated by the International Computer Science Institute, University of California, Berkeley and later corrected by the Institute for Signal Information Processing, Mississippi State University. The Penn Treebank transcripts provided the basis for the NXT Switchboard corpus, and the noun phrases from that subset were annotated for animacy. The Treebank transcript was then aligned with the corresponding subset from the corrected Mississippi State (MS-State) transcript in order to provide word timing information. Focus/contrast and prosodic annotations, as well as phone/syllable alignment, were added next. The previous annotations of dialog acts and prosody were converted to NITE XML. Lastly, hand annotations for markables were added to provide information about their animacy and information structure, including coreferential links.

NXT is an open source toolkit that enables multiple linguistic annotations to be assembled into a unified database. It uses a stand-off XML data format that consists of several XML files that point to each other. The NXT format provides a data model that describes how the various annotations for a corpus relate to one another. For that reason, it does not impose any particular linguistic theory or any particular markup structure. Instead, users define their annotations in a "metadata" file that expresses their contents and how they relate to each other in terms of the graph structure for the corpus annotations overall. The relationships that can be defined in the data model draw annotations together into a set of intersecting trees, but also allow arbitrary links between annotations over the top of this structure, giving a representation that is highly expressive, easier to process than arbitrary graphs and structured in a way that helps data users. NXT's other core component is a query language designed specifically for working with data conforming to this data model. Together, the data model and query language allow annotations to be treated as one coherent set containing both structural and timing information.
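
To give a rough feel for stand-off annotation, the Python sketch below keeps words (with timings) in one XML layer and phrases in another, where each phrase only points at word ids. The element names, attributes and link scheme are invented for illustration and are much simpler than the actual NITE XML files, which define their own metadata and link syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical word layer: the only place where text and timings are stored.
WORDS_XML = """
<words>
  <word id="w1" start="0.00" end="0.21">the</word>
  <word id="w2" start="0.21" end="0.55">dog</word>
  <word id="w3" start="0.55" end="0.90">barked</word>
</words>
"""

# Hypothetical phrase layer: it contains no text, only pointers into the word layer.
PHRASES_XML = """
<phrases>
  <phrase id="np1" cat="NP" first="w1" last="w2"/>
  <phrase id="vp1" cat="VP" first="w3" last="w3"/>
</phrases>
"""


def resolve_phrases(words_xml, phrases_xml):
    """Follow the phrase-to-word pointers and return text plus timing per phrase."""
    words = list(ET.fromstring(words_xml.strip()))
    position = {w.get("id"): i for i, w in enumerate(words)}
    resolved = []
    for phrase in ET.fromstring(phrases_xml.strip()):
        first = position[phrase.get("first")]
        last = position[phrase.get("last")]
        span = words[first:last + 1]
        resolved.append({
            "cat": phrase.get("cat"),
            "text": " ".join(w.text for w in span),
            "start": float(span[0].get("start")),
            "end": float(span[-1].get("end")),
        })
    return resolved


for item in resolve_phrases(WORDS_XML, PHRASES_XML):
    print(item)
```

Because the phrase layer stores only references, further layers (for example dialog acts or prosody) could point at the same words without duplicating or altering them, which is the property that lets structural and timing information be combined.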

NXT Switchboard Annotations is distributed via web download.

2009 Subscription Members will automatically receive two copies of this corpus.  2009 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$25.  NXT Switchboard Annotations is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 license. NXT Switchboard Annotations is available to LDC's for-profit members under the terms of their For-Profit Membership Agreements.
--------------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
-------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc@ldc.upenn.edu
Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
